Social IQa

Progress Over Time

Interactive timeline showing model performance evolution on Social IQa

Social IQa Leaderboard

9 models (table columns: Context, Cost, License)

About this benchmark

What is Social IQa?

Social IQa is the first large-scale benchmark for commonsense reasoning about social situations. It contains 38,000 multiple-choice questions that probe emotional and social intelligence in everyday situations, testing commonsense understanding of social interactions and theory-of-mind reasoning about the implied emotions and behavior of others.

Social IQa is a text benchmark evaluating models on psychology, reasoning, and creativity tasks. LLM Stats tracks 9 models on this benchmark, scored on a 0–1 scale. The current average is 0.589, with the leader at 0.780.
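As a rough illustration of how a score on the 0–1 scale could arise, the sketch below assumes the score is plain accuracy: the fraction of multiple-choice questions answered correctly. The page does not state the metric, so that is an assumption. The `Item` fields, the second example item, and all distractor answers are hypothetical; only the first context, question, and correct answer come from the paper's abstract.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One multiple-choice question (field names are illustrative, not an official schema)."""
    context: str
    question: str
    answers: list[str]
    label: int  # index of the correct answer in `answers`


def accuracy(items: list[Item], predictions: list[int]) -> float:
    """Return the fraction of items whose predicted index matches the label (0-1 scale)."""
    if len(items) != len(predictions):
        raise ValueError("need exactly one prediction per item")
    if not items:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(pred == item.label for item, pred in zip(items, predictions))
    return correct / len(items)


items = [
    Item(
        context="Jordan wanted to tell Tracy a secret, so Jordan leaned towards Tracy.",
        question="Why did Jordan do this?",
        # First answer is from the paper's abstract; the other two are invented distractors.
        answers=["Make sure no one else could hear", "Stretch after sitting", "Check the time"],
        label=0,
    ),
    Item(
        # Entirely hypothetical item, included only to make the example non-trivial.
        context="Alex spilled coffee on a coworker's notes.",
        question="What will Alex want to do next?",
        answers=["Leave early", "Order lunch", "Apologize and help clean up"],
        label=2,
    ),
]

predictions = [0, 1]  # hypothetical model outputs: first right, second wrong
score = accuracy(items, predictions)
print(f"{score:.3f}")   # 0-1 scale, the form used in the text
print(f"{score:.1%}")   # percentage form, as shown in the ranking
```

Under this assumption, a leaderboard score of 0.780 and the ranking's 78.0% are the same number in two formats.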

Current leaders

Phi-3.5-MoE-instruct from Microsoft currently leads the 9 models evaluated on Social IQa with a score of 0.780.

1. Phi-3.5-MoE-instruct, Microsoft, 78.0%
2. Phi-3.5-mini-instruct, Microsoft, 74.7%
3. Phi 4 Mini, Microsoft, 72.5%

Source paper

Title
SocialIQA: Commonsense Reasoning about Social Interactions
Authors
Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and 1 other
Published
Abstract

We introduce Social IQa, the first large-scale benchmark for commonsense reasoning about social situations. Social IQa contains 38,000 multiple choice questions for probing emotional and social intelligence in a variety of everyday situations (e.g., Q: "Jordan wanted to tell Tracy a secret, so Jordan leaned towards Tracy. Why did Jordan do this?" A: "Make sure no one else could hear"). Through crowdsourcing, we collect commonsense questions along with correct and incorrect answers about social interactions, using a new framework that mitigates stylistic artifacts in incorrect answers by asking workers to provide the right answer to a different but related question. Empirical results show that our benchmark is challenging for existing question-answering models based on pretrained language models, compared to human performance (>20% gap). Notably, we further establish Social IQa as a resource for transfer learning of commonsense knowledge, achieving state-of-the-art performance on multiple commonsense reasoning tasks (Winograd Schemas, COPA).

FAQ

Common questions about the Social IQa benchmark and leaderboard.

What is the Social IQa benchmark?

Social IQa is the first large-scale benchmark for commonsense reasoning about social situations. It contains 38,000 multiple-choice questions that probe emotional and social intelligence in everyday situations, testing commonsense understanding of social interactions and theory-of-mind reasoning about the implied emotions and behavior of others.

What is the Social IQa leaderboard?

The Social IQa leaderboard ranks 9 AI models based on their performance on this benchmark. Currently, Phi-3.5-MoE-instruct by Microsoft leads with a score of 0.780. The average score across all models is 0.589.

What is the highest Social IQa score?

The highest Social IQa score is 0.780, achieved by Phi-3.5-MoE-instruct from Microsoft.

How many models are evaluated on Social IQa?

9 models have been evaluated on the Social IQa benchmark, with 0 verified results and 9 self-reported results.

Where can I find the Social IQa paper?

The Social IQa paper is available at https://arxiv.org/abs/1904.09728. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does Social IQa cover?

Social IQa is categorized under psychology, reasoning, and creativity. The benchmark evaluates text models.

What is the best open-source model on Social IQa?

Phi-3.5-MoE-instruct by Microsoft is the top-ranked open-source model on Social IQa, with a score of 0.780 (rank #1).

How recent are the Social IQa leaderboard results?

The Social IQa leaderboard was last updated in July 2026 and currently includes 9 evaluated models.