Social IQa
Social IQa Leaderboard
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | Microsoft | 60B | — | — | |
| 2 | Microsoft | 4B | — | — | |
| 3 | Microsoft | 4B | — | — | |
| 4 | Google | 27B | — | — | |
| 5 | Google | 9B | — | — | |
| 6 | Google | 8B | — | — | |
| 6 | | 2B | — | — | |
| 8 | | 2B | — | — | |
| 8 | Google | 8B | — | — | |
What is Social IQa?
Introduced as the first large-scale benchmark for commonsense reasoning about social situations, Social IQa contains 38,000 multiple-choice questions that probe emotional and social intelligence in everyday situations. The questions test commonsense understanding of social interactions and theory-of-mind reasoning about the implied emotions and behavior of others.
Social IQa is a text benchmark evaluating models on psychology, reasoning, and creativity tasks. LLM Stats tracks 9 models on this benchmark, scored on a 0–1 scale. The current average is about 0.6, and the leading model scores 0.780.
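As a rough illustration of what a 0–1 score on a multiple-choice benchmark can mean, the sketch below computes plain accuracy: the fraction of questions where the chosen answer matches the labeled one. This page does not state how LLM Stats computes its scores, so treating the score as accuracy is an assumption. The `Item` class, the `accuracy` function, and the `always_first` predictor are hypothetical names introduced here. The context, question, and correct answer come from the example in the source paper's abstract; the two distractor answers are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Item:
    """One multiple-choice question (hypothetical structure for this sketch)."""
    context: str
    question: str
    choices: Tuple[str, ...]  # candidate answers
    label: int                # index of the correct choice


def accuracy(items: Sequence[Item], predict: Callable[[Item], int]) -> float:
    """Return the fraction of items where predict(item) equals item.label."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(1 for item in items if predict(item) == item.label)
    return correct / len(items)


# Context, question, and correct answer are quoted from the paper's abstract.
# The two distractors are invented for illustration only.
items = [
    Item(
        context="Jordan wanted to tell Tracy a secret, so Jordan leaned towards Tracy.",
        question="Why did Jordan do this?",
        choices=(
            "Make sure no one else could hear",
            "Get a better view of the room",   # hypothetical distractor
            "Stretch after sitting too long",  # hypothetical distractor
        ),
        label=0,
    ),
]


def always_first(item: Item) -> int:
    """Trivial stand-in for a model: always picks the first choice."""
    return 0


score = accuracy(items, always_first)
print(f"accuracy = {score:.3f}")
```

In practice `predict` would wrap a model call that maps the context, question, and choices to a chosen index; the scoring function itself stays the same.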
Current leaders
Phi-3.5-MoE-instruct from Microsoft currently leads the Social IQa leaderboard with a score of 0.780, the highest among the 9 evaluated AI models.
Source paper
- Title: SocialIQA: Commonsense Reasoning about Social Interactions
- Authors: Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and 1 other
- Published: arXiv:1904.09728
Abstract
We introduce Social IQa, the first large-scale benchmark for commonsense reasoning about social situations. Social IQa contains 38,000 multiple choice questions for probing emotional and social intelligence in a variety of everyday situations (e.g., Q: "Jordan wanted to tell Tracy a secret, so Jordan leaned towards Tracy. Why did Jordan do this?" A: "Make sure no one else could hear"). Through crowdsourcing, we collect commonsense questions along with correct and incorrect answers about social interactions, using a new framework that mitigates stylistic artifacts in incorrect answers by asking workers to provide the right answer to a different but related question. Empirical results show that our benchmark is challenging for existing question-answering models based on pretrained language models, compared to human performance (>20% gap). Notably, we further establish Social IQa as a resource for transfer learning of commonsense knowledge, achieving state-of-the-art performance on multiple commonsense reasoning tasks (Winograd Schemas, COPA).