Winogrande
WinoGrande: An Adversarial Winograd Schema Challenge at Scale. A large-scale dataset of 44,000 pronoun resolution problems designed to test machine commonsense reasoning. It uses adversarial filtering to reduce spurious biases, providing a more robust test of whether AI systems genuinely understand commonsense or merely exploit statistical shortcuts. At the time of the benchmark's release, the best methods achieved 59.4–79.1% accuracy, significantly below human performance of 94.0%.
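To make the task format concrete, here is a minimal sketch of a WinoGrande-style item and the accuracy metric. The example item and the `score_option` heuristic are illustrative assumptions, not the official evaluation harness; a real evaluator would score each filled-in sentence with a language model (e.g. by log-likelihood).

```python
# Minimal sketch of the WinoGrande task format and accuracy metric.
# The item and the scoring stub below are illustrative only.

from dataclasses import dataclass

@dataclass
class WinograndeItem:
    sentence: str   # contains a single "_" placeholder for the referent
    option1: str
    option2: str
    answer: str     # "1" or "2"

items = [
    WinograndeItem(
        sentence="The trophy doesn't fit into the brown suitcase because _ is too large.",
        option1="the trophy",
        option2="the suitcase",
        answer="1",
    ),
]

def score_option(sentence: str, option: str) -> float:
    """Stand-in for a model's plausibility score of the filled-in sentence.

    A real evaluator would substitute the option for "_" and score the
    result with a language model.
    """
    filled = sentence.replace("_", option)
    return -len(filled)  # dummy heuristic: prefer the shorter completion

def predict(item: WinograndeItem) -> str:
    s1 = score_option(item.sentence, item.option1)
    s2 = score_option(item.sentence, item.option2)
    return "1" if s1 >= s2 else "2"

correct = sum(predict(it) == it.answer for it in items)
print(f"accuracy = {correct / len(items):.3f}")
```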
Progress Over Time
[Interactive timeline of model performance on Winogrande over time, showing the state-of-the-art frontier with open and proprietary models marked.]
Winogrande Leaderboard
21 models • 0 verified
| Rank | Organization | Score | Parameters | Context | Cost (input / output) |
|---|---|---|---|---|---|
| 1 | OpenAI | 0.875 | — | 33K | $30.00 / $60.00 |
| 2 | Cohere | 0.854 | 104B | 128K | $0.25 / $1.00 |
| 3 | Alibaba Cloud / Qwen Team | 0.851 | 72B | — | — |
| 4 | — | 0.845 | 70B | — | — |
| 5 | Google | 0.837 | 27B | — | — |
| 6 | Nous Research | 0.832 | 70B | — | — |
| 7 | Alibaba Cloud / Qwen Team | 0.820 | 33B | — | — |
| 8 | Microsoft | 0.813 | 60B | — | — |
| 9 | Alibaba Cloud / Qwen Team | 0.808 | 32B | 128K | $0.09 / $0.09 |
| 10 | Google | 0.806 | 9B | — | — |
| 11 | Mistral AI | 0.768 | 12B | 128K | $0.15 / $0.15 |
| 12 | Mistral AI | 0.753 | 8B | 128K | $0.10 / $0.10 |
| 13 | — | 0.744 | 8B | — | — |
| 14 | Alibaba Cloud / Qwen Team | 0.729 | 7B | — | — |
| 15 | — | 0.717 | 2B | — | — |
| 15 | Google | 0.717 | 8B | — | — |
| 17 | Microsoft | 0.685 | 4B | 128K | $0.10 / $0.10 |
| 18 | Microsoft | 0.670 | 4B | — | — |
| 19 | Google | 0.668 | 8B | — | — |
| 19 | — | 0.668 | 2B | — | — |
| 21 | Baidu | 0.513 | 21B | 128K | $0.40 / $4.00 |
FAQ
Common questions about Winogrande
What is Winogrande?
WinoGrande: An Adversarial Winograd Schema Challenge at Scale. A large-scale dataset of 44,000 pronoun resolution problems designed to test machine commonsense reasoning. It uses adversarial filtering to reduce spurious biases, providing a more robust test of whether AI systems genuinely understand commonsense or merely exploit statistical shortcuts. At the time of the benchmark's release, the best methods achieved 59.4–79.1% accuracy, significantly below human performance of 94.0%.
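The adversarial filtering mentioned above (AFLITE in the paper) repeatedly trains lightweight classifiers on precomputed instance embeddings and discards instances they predict too easily, so the surviving items are harder to solve through surface statistics. Below is a heavily simplified sketch of that loop; the synthetic data, threshold, and partition settings are illustrative assumptions, not the paper's configuration.

```python
# Simplified AFLITE-style adversarial filtering loop.
# Assumes each instance has a precomputed embedding; the synthetic data,
# threshold tau, and partition settings are illustrative, not the
# configuration used to build WinoGrande.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def aflite(X, y, n_partitions=16, train_frac=0.8, tau=0.75,
           cut_per_round=50, min_size=400):
    """Iteratively drop instances that linear probes predict too easily."""
    keep = np.arange(len(y))
    while len(keep) > min_size:
        hits = np.zeros(len(keep))   # correct predictions while held out
        seen = np.zeros(len(keep))   # times held out
        for _ in range(n_partitions):
            perm = rng.permutation(len(keep))
            n_train = int(train_frac * len(keep))
            tr, va = perm[:n_train], perm[n_train:]
            clf = LogisticRegression(max_iter=1000)
            clf.fit(X[keep[tr]], y[keep[tr]])
            hits[va] += clf.predict(X[keep[va]]) == y[keep[va]]
            seen[va] += 1
        predictability = np.divide(hits, seen,
                                   out=np.zeros_like(hits), where=seen > 0)
        easy = np.argsort(-predictability)[:cut_per_round]
        easy = easy[predictability[easy] > tau]
        if len(easy) == 0:
            break  # nothing left that the probes find easy
        keep = np.delete(keep, easy)
    return keep

# Demo on synthetic embeddings: the first half of the instances carries a
# spurious, linearly separable feature that the filter should weed out.
X = rng.normal(size=(1000, 32))
y = rng.integers(0, 2, size=1000)
X[:500, 0] = y[:500] * 4.0 - 2.0  # shortcut feature encodes the label
kept = aflite(X, y)
print(f"kept {len(kept)} of 1000 instances")
```

Linear probes over embeddings are used here as a cheap proxy for shortcut learnability: if a simple model can reliably guess the answer from surface features, a large neural model certainly can, so the instance adds little adversarial value.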
Where can I read the Winogrande paper?
The Winogrande paper is available at https://arxiv.org/abs/1907.10641. It details the benchmark methodology, dataset creation, and evaluation criteria.
How are models ranked on Winogrande?
The Winogrande leaderboard ranks 21 AI models by their accuracy on the benchmark. Currently, GPT-4 by OpenAI leads with a score of 0.875. The average score across all models is 0.761.
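As a quick sanity check, the reported average can be recomputed directly from the scores in the leaderboard table above:

```python
# Recompute the leaderboard average from the 21 scores in the table above.
scores = [0.875, 0.854, 0.851, 0.845, 0.837, 0.832, 0.820,
          0.813, 0.808, 0.806, 0.768, 0.753, 0.744, 0.729,
          0.717, 0.717, 0.685, 0.670, 0.668, 0.668, 0.513]
print(f"mean = {sum(scores) / len(scores):.3f}")  # -> 0.761
```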
What is the highest Winogrande score?
The highest Winogrande score is 0.875, achieved by GPT-4 from OpenAI.
How many models have been evaluated on Winogrande?
21 models have been evaluated on the Winogrande benchmark; all 21 results are self-reported, and none are independently verified.
What type of benchmark is Winogrande?
Winogrande is categorized under language and reasoning, and it evaluates text models.