
Winogrande

WinoGrande: An Adversarial Winograd Schema Challenge at Scale is a large-scale dataset of 44,000 pronoun resolution problems designed to test machine commonsense reasoning. It uses adversarial filtering to reduce spurious biases, providing a more robust test of whether AI systems truly understand commonsense or merely exploit statistical shortcuts. At the time of the paper's release, the best AI methods achieved 59.4–79.1% accuracy, significantly below human performance of 94.0%.
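Each WinoGrande item is a sentence with a blank ("_") and two candidate referents, and the leaderboard metric is plain accuracy. A minimal sketch of that scoring, using illustrative items in the dataset's published field layout (`sentence`, `option1`, `option2`, `answer`):

```python
# Illustrative WinoGrande-style items (not drawn from the actual dataset).
# A model picks option "1" or "2" to fill the blank; the metric is accuracy.
ITEMS = [
    {"sentence": "The trophy didn't fit in the suitcase because the _ was too big.",
     "option1": "trophy", "option2": "suitcase", "answer": "1"},
    {"sentence": "The trophy didn't fit in the suitcase because the _ was too small.",
     "option1": "trophy", "option2": "suitcase", "answer": "2"},
]

def accuracy(predictions, items):
    """Fraction of items where the predicted option ("1" or "2") matches the gold answer."""
    correct = sum(p == item["answer"] for p, item in zip(predictions, items))
    return correct / len(items)

print(accuracy(["1", "2"], ITEMS))  # perfect predictions -> 1.0
print(accuracy(["1", "1"], ITEMS))  # one of two wrong   -> 0.5
```

Note how the two sample sentences differ only in "big" vs. "small" yet flip the answer; these "twin" pairs are what makes the task resistant to surface-level cues.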

Paper: https://arxiv.org/abs/1907.10641
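The adversarial filtering mentioned above (AFLite in the paper) repeatedly trains lightweight probes and discards instances the probes find too predictable. A hedged, heavily simplified sketch of the idea, using least-squares linear probes over synthetic features rather than the learned embeddings and classifier ensembles of the real algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def predictability_scores(features, labels, n_partitions=64, train_frac=0.8):
    """For each instance, the fraction of held-out evaluations a linear probe
    got right across random train/test partitions. High scores flag instances
    that are solvable from surface features alone."""
    n = len(labels)
    correct = np.zeros(n)
    counts = np.zeros(n)
    y = np.where(labels == 1, 1.0, -1.0)  # {-1, +1} targets for least squares
    for _ in range(n_partitions):
        idx = rng.permutation(n)
        cut = int(train_frac * n)
        tr, te = idx[:cut], idx[cut:]
        w, *_ = np.linalg.lstsq(features[tr], y[tr], rcond=None)  # linear probe
        pred = (features[te] @ w) > 0
        correct[te] += (pred == (labels[te] == 1))
        counts[te] += 1
    return np.divide(correct, counts, out=np.zeros(n), where=counts > 0)

def aflite_filter(features, labels, threshold=0.75):
    """Keep only instances whose predictability stays below the threshold."""
    scores = predictability_scores(features, labels)
    return np.flatnonzero(scores < threshold)

# Demo with synthetic data: "easy" items carry a giveaway feature, "hard" items don't.
n_easy, n_hard = 20, 20
labels = rng.integers(0, 2, n_easy + n_hard)
features = rng.normal(size=(n_easy + n_hard, 4))
features[:n_easy, 0] = np.where(labels[:n_easy] == 1, 3.0, -3.0)  # giveaway signal
kept = aflite_filter(features, labels)
print(len(kept))  # mostly the hard items survive filtering
```

All names and parameters here are illustrative assumptions; the point is only the mechanism of filtering out instances that shallow probes can already solve.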

Progress Over Time

Interactive timeline showing model performance evolution on Winogrande


Winogrande Leaderboard

21 models • 0 verified
| Rank | Model | Organization | Score | Params | Context | Cost (in / out) |
|------|-------|--------------|-------|--------|---------|-----------------|
| 1 | GPT-4 | OpenAI | 0.875 | — | 33K | $30.00 / $60.00 |
| 2 | — | — | 0.854 | 104B | 128K | $0.25 / $1.00 |
| 3 | — | Alibaba Cloud / Qwen Team | 0.851 | 72B | — | — |
| 4 | — | — | 0.845 | 70B | — | — |
| 5 | — | — | 0.837 | 27B | — | — |
| 6 | — | Nous Research | 0.832 | 70B | — | — |
| 7 | — | Alibaba Cloud / Qwen Team | 0.820 | 33B | — | — |
| 8 | — | — | 0.813 | 60B | — | — |
| 9 | — | Alibaba Cloud / Qwen Team | 0.808 | 32B | 128K | $0.09 / $0.09 |
| 10 | — | — | 0.806 | 9B | — | — |
| 11 | — | — | 0.768 | 12B | 128K | $0.15 / $0.15 |
| 12 | — | — | 0.753 | 8B | 128K | $0.10 / $0.10 |
| 13 | — | — | 0.744 | 8B | — | — |
| 14 | — | Alibaba Cloud / Qwen Team | 0.729 | 7B | — | — |
| 15 | — | — | 0.717 | 2B | — | — |
| 15 | — | — | 0.717 | 8B | — | — |
| 17 | — | — | 0.685 | 4B | 128K | $0.10 / $0.10 |
| 18 | — | Microsoft | 0.670 | 4B | — | — |
| 19 | — | — | 0.668 | 8B | — | — |
| 19 | — | — | 0.668 | 2B | — | — |
| 21 | — | — | 0.513 | 21B | 128K | $0.40 / $4.00 |

FAQ

Common questions about Winogrande

What is Winogrande?

WinoGrande: An Adversarial Winograd Schema Challenge at Scale is a large-scale dataset of 44,000 pronoun resolution problems designed to test machine commonsense reasoning. It uses adversarial filtering to reduce spurious biases, providing a more robust test of whether AI systems truly understand commonsense or merely exploit statistical shortcuts. At the time of the paper's release, the best AI methods achieved 59.4–79.1% accuracy, significantly below human performance of 94.0%.

Where can I find the Winogrande paper?

The Winogrande paper is available at https://arxiv.org/abs/1907.10641. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How are models ranked on the Winogrande leaderboard?

The Winogrande leaderboard ranks 21 AI models by their performance on this benchmark. Currently, GPT-4 by OpenAI leads with a score of 0.875. The average score across all models is 0.761.

What is the highest Winogrande score?

The highest Winogrande score is 0.875, achieved by GPT-4 from OpenAI.

How many models have been evaluated on Winogrande?

21 models have been evaluated on the Winogrande benchmark, with 0 verified results and 21 self-reported results.

What categories does Winogrande fall under?

Winogrande is categorized under language and reasoning. The benchmark evaluates text models.
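The leaderboard average quoted above can be reproduced directly from the 21 listed scores:

```python
# Sanity check: the reported leaderboard average (0.761) is the mean of the
# 21 Winogrande scores listed in the table above.
scores = [0.875, 0.854, 0.851, 0.845, 0.837, 0.832, 0.820, 0.813, 0.808, 0.806,
          0.768, 0.753, 0.744, 0.729, 0.717, 0.717, 0.685, 0.670, 0.668, 0.668,
          0.513]
avg = sum(scores) / len(scores)
print(round(avg, 3))  # -> 0.761
```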