HellaSwag

A challenging commonsense natural language inference dataset that uses Adversarial Filtering to create questions that are trivial for humans (>95% accuracy) but difficult for state-of-the-art models. Each item asks the model to choose the most plausible ending for a sentence describing a physical situation or everyday activity.
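For intuition, here is a minimal sketch of how such items are typically scored: each of the four candidate endings is appended to the context and scored by a causal language model's log-likelihood, and the highest-scoring ending is taken as the prediction. The model choice, the example item, and the length normalization below are illustrative assumptions, not the official evaluation harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any causal LM can be scored the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def ending_score(context: str, ending: str) -> float:
    """Average log-probability of the ending tokens given the context."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + " " + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # logits at position i predict the token at position i + 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    scores = [
        log_probs[i, full_ids[0, i + 1]].item()
        for i in range(ctx_len - 1, full_ids.shape[1] - 1)
    ]
    return sum(scores) / len(scores)  # length-normalized log-likelihood

# An illustrative HellaSwag-style item: one context, four candidate endings.
context = "A man is sitting on a roof. He"
endings = [
    "is using wrap to wrap a pair of skis.",
    "is ripping level tiles off.",
    "is holding a rubik's cube.",
    "starts pulling up roofing on a roof.",
]
best = max(range(len(endings)), key=lambda i: ending_score(context, endings[i]))
print("predicted ending:", endings[best])
```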

Paper: https://arxiv.org/abs/1905.07830

Progress Over Time

Interactive timeline showing model performance evolution on HellaSwag


HellaSwag Leaderboard

26 models
Rank  Organization               Parameters  Context  Cost (input / output per 1M tokens)
1     Anthropic                  -           200K     $15.00 / $75.00
2     OpenAI                     -           33K      $30.00 / $60.00
3     -                          -           2.1M     $2.50 / $10.00
4     -                          -           200K     $3.00 / $15.00
5     -                          104B        128K     $0.25 / $1.00
6     Nous Research              70B         -        -
7     Alibaba Cloud / Qwen Team  72B         -        -
8     -                          -           1.0M     $0.15 / $0.60
9     -                          27B         -        -
10    -                          -           200K     $0.25 / $1.25
11    -                          70B         -        -
12    Alibaba Cloud / Qwen Team  33B         -        -
13    -                          60B         -        -
14    -                          12B         128K     $0.15 / $0.15
15    Alibaba Cloud / Qwen Team  32B         128K     $0.09 / $0.09
16    -                          9B          -        -
17    -                          8B          -        -
18    -                          2B          -        -
18    -                          8B          -        -
20    Alibaba Cloud / Qwen Team  7B          -        -
21    -                          8B          -        -
21    -                          2B          -        -
23    -                          3B          128K     $0.01 / $0.02
24    -                          4B          128K     $0.10 / $0.10
25    Microsoft                  4B          -        -
26    -                          21B         128K     $0.40 / $4.00

FAQ

Common questions about HellaSwag

What is HellaSwag?
HellaSwag is a commonsense natural language inference dataset built with Adversarial Filtering: its questions are trivial for humans (>95% accuracy) but difficult for state-of-the-art models, requiring them to choose the most plausible ending for descriptions of physical situations and everyday activities.

Where can I find the HellaSwag paper?
The HellaSwag paper is available at https://arxiv.org/abs/1905.07830. It describes the benchmark methodology, dataset creation, and evaluation criteria in detail.

How do models rank on the HellaSwag leaderboard?
The HellaSwag leaderboard ranks 26 AI models by their performance on this benchmark. Claude 3 Opus by Anthropic currently leads with a score of 0.954, and the average score across all models is 0.807.

What is the highest HellaSwag score?
The highest HellaSwag score is 0.954, achieved by Claude 3 Opus from Anthropic.

How many models have been evaluated on HellaSwag?
26 models have been evaluated on the HellaSwag benchmark, with 0 verified results and 26 self-reported results.

What kind of benchmark is HellaSwag?
HellaSwag is categorized under reasoning and evaluates text models.
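For completeness, a minimal sketch of computing HellaSwag accuracy over the validation split. It assumes the Hugging Face datasets library, the community "Rowan/hellaswag" dataset mirror, and a hypothetical predict(context, endings) function such as the log-likelihood scorer sketched above.

```python
from datasets import load_dataset

def hellaswag_accuracy(predict) -> float:
    """predict(context, endings) -> index (0-3) of the chosen ending."""
    data = load_dataset("Rowan/hellaswag", split="validation")
    correct = 0
    for ex in data:
        # "ctx" is the context, "endings" the four candidate endings,
        # and "label" the index of the correct ending (stored as a string).
        correct += int(predict(ex["ctx"], ex["endings"]) == int(ex["label"]))
    return correct / len(data)
```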