HellaSwag
A challenging commonsense natural language inference benchmark built with Adversarial Filtering: given a description of a physical situation or everyday activity, a model must choose the correct sentence ending from four candidates. The resulting questions are trivial for humans (>95% accuracy) yet remain difficult for state-of-the-art models.
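For concreteness, each HellaSwag item pairs a context with four candidate endings, exactly one of which is the original continuation. A minimal sketch of what an item looks like, assuming the Hugging Face `datasets` package and the public `hellaswag` dataset id (field names may vary across mirrors):

```python
# Minimal sketch: load HellaSwag and inspect one item.
# Assumes the Hugging Face `datasets` package and the public "hellaswag"
# dataset id; the field names ("ctx", "endings", "label") may differ by mirror.
from datasets import load_dataset

ds = load_dataset("hellaswag", split="validation")
item = ds[0]

print(item["ctx"])                      # context describing an everyday activity
for i, ending in enumerate(item["endings"]):
    print(f"  ({i}) {ending}")          # four candidate endings
print("gold ending:", item["label"])    # index (as a string) of the correct ending
```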
Progress Over Time
[Interactive timeline omitted: model performance evolution on HellaSwag, showing the state-of-the-art frontier and distinguishing open from proprietary models.]
HellaSwag Leaderboard
26 models
| Rank | Organization | Parameters | Context | Cost (input / output, per 1M tokens) |
|---|---|---|---|---|
| 1 | Anthropic | — | 200K | $15.00 / $75.00 |
| 2 | OpenAI | — | 33K | $30.00 / $60.00 |
| 3 | Google | — | 2.1M | $2.50 / $10.00 |
| 4 | Anthropic | — | 200K | $3.00 / $15.00 |
| 5 | Cohere | 104B | 128K | $0.25 / $1.00 |
| 6 | Nous Research | 70B | — | — |
| 7 | Alibaba Cloud / Qwen Team | 72B | — | — |
| 8 | Google | — | 1.0M | $0.15 / $0.60 |
| 9 | Google | 27B | — | — |
| 10 | Anthropic | — | 200K | $0.25 / $1.25 |
| 11 | — | 70B | — | — |
| 12 | Alibaba Cloud / Qwen Team | 33B | — | — |
| 13 | Microsoft | 60B | — | — |
| 14 | Mistral AI | 12B | 128K | $0.15 / $0.15 |
| 15 | Alibaba Cloud / Qwen Team | 32B | 128K | $0.09 / $0.09 |
| 16 | Google | 9B | — | — |
| 17 | — | 8B | — | — |
| 18 | — | 2B | — | — |
| 18 | Google | 8B | — | — |
| 20 | Alibaba Cloud / Qwen Team | 7B | — | — |
| 21 | Google | 8B | — | — |
| 21 | — | 2B | — | — |
| 23 | — | 3B | 128K | $0.01 / $0.02 |
| 24 | Microsoft | 4B | 128K | $0.10 / $0.10 |
| 25 | Microsoft | 4B | — | — |
| 26 | Baidu | 21B | 128K | $0.40 / $4.00 |
FAQ
Common questions about HellaSwag
**What is HellaSwag?**
HellaSwag is a commonsense natural language inference benchmark built with Adversarial Filtering: given a description of a physical situation or everyday activity, a model must choose the correct sentence ending. Its questions are trivial for humans (>95% accuracy) yet remain difficult for state-of-the-art models.
**Where can I find the HellaSwag paper?**
The HellaSwag paper is available at https://arxiv.org/abs/1905.07830. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
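As a rough illustration of the evaluation criteria, open-model results are commonly produced by scoring each of the four endings with the model's length-normalized log-likelihood and picking the argmax. A hedged sketch follows, assuming a Hugging Face causal LM (`gpt2` is only a stand-in) and mean per-token log-probability as the normalization; real harnesses differ in details, e.g. some normalize by byte length instead of token count:

```python
# Hedged sketch of multiple-choice scoring for HellaSwag: rank the four
# endings by length-normalized log-likelihood under a causal LM.
# "gpt2" is a stand-in model, not what any leaderboard entry uses.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def ending_logprob(ctx: str, ending: str) -> float:
    """Mean per-token log-probability of `ending` given `ctx`."""
    # Simplification: assumes tokenizing ctx alone yields a prefix of
    # tokenizing ctx + ending, which holds for most BPE inputs.
    n_ctx = tok(ctx, return_tensors="pt").input_ids.shape[1]
    full = tok(ctx + " " + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
    targets = full[0, 1:]
    token_lp = logprobs[torch.arange(targets.shape[0]), targets]
    return token_lp[n_ctx - 1:].mean().item()              # ending tokens only

def predict(ctx: str, endings: list[str]) -> int:
    """Index of the highest-scoring candidate ending."""
    return max(range(len(endings)), key=lambda i: ending_logprob(ctx, endings[i]))
```

Accuracy is then the fraction of items where `predict` returns the gold label.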
**How are models ranked on the HellaSwag leaderboard?**
The HellaSwag leaderboard ranks 26 AI models by their performance on this benchmark. Claude 3 Opus by Anthropic currently leads with a score of 0.954; the average score across all models is 0.807.
**What is the highest HellaSwag score?**
The highest HellaSwag score is 0.954, achieved by Claude 3 Opus from Anthropic.
**How many models have been evaluated on HellaSwag?**
26 models have been evaluated on the HellaSwag benchmark. All 26 results are self-reported; none have been independently verified.
**What category does HellaSwag fall under?**
HellaSwag is categorized under reasoning and evaluates text models.