BIG-Bench Extra Hard
Progress Over Time
Interactive timeline showing model performance evolution on BIG-Bench Extra Hard
BIG-Bench Extra Hard Leaderboard
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | Google | 31B | 262K | $0.13 / $0.38 | |
| 2 | Google | 25B | 262K | $0.13 / $0.40 | |
| 3 | Google | 12B | — | — | |
| 4 | Google | 25B | — | — | |
| 5 | Google | 8B | — | — | |
| 6 | Google | 5B | — | — | |
| 7 | Google | 27B | — | — | |
| 8 | Google | 12B | — | — | |
| 9 | Google | — | — | — | |
| 10 | Google | 4B | — | — | |
| 11 | Google | 1B | — | — | |
What is BIG-Bench Extra Hard?
BIG-Bench Extra Hard (BBEH) is a challenging benchmark that replaces each task in BIG-Bench Hard (BBH) with a novel task that probes a similar reasoning capability but is significantly harder. The benchmark contains 23 tasks testing diverse reasoning skills, including many-hop reasoning, causal understanding, spatial reasoning, temporal arithmetic, geometric reasoning, linguistic reasoning, logic puzzles, and humor understanding. It was designed to address saturation on existing benchmarks, where state-of-the-art models achieve near-perfect scores. In the source paper's evaluation, BBEH left substantial room for improvement: the best general-purpose model reached a harmonic-mean accuracy of only 9.8%, and the best reasoning-specialized model reached 44.8%.
BIG-Bench Extra Hard is a text benchmark evaluating models on language, reasoning, and general tasks. LLM Stats tracks 11 models on this benchmark, scored on a 0–1 scale. The current average is 0.3, with the leader at 0.7.
Current leaders
Gemma 4 31B from Google currently leads the BIG-Bench Extra Hard leaderboard with a score of 0.744, the highest among the 11 evaluated AI models.
Source paper
- Title: BIG-Bench Extra Hard
- Authors: Mehran Kazemi, Bahare Fatemi, Hritik Bansal, John Palowitch, and 16 others
- arXiv: 2502.19187
Abstract
Large language models (LLMs) are increasingly deployed in everyday applications, demanding robust general reasoning capabilities and diverse reasoning skillset. However, current LLM reasoning benchmarks predominantly focus on mathematical and coding abilities, leaving a gap in evaluating broader reasoning proficiencies. One particular exception is the BIG-Bench dataset, which has served as a crucial benchmark for evaluating the general reasoning capabilities of LLMs, thanks to its diverse set of challenging tasks that allowed for a comprehensive assessment of general reasoning across various skills within a unified framework. However, recent advances in LLMs have led to saturation on BIG-Bench, and its harder version BIG-Bench Hard (BBH). State-of-the-art models achieve near-perfect scores on many tasks in BBH, thus diminishing its utility. To address this limitation, we introduce BIG-Bench Extra Hard (BBEH), a new benchmark designed to push the boundaries of LLM reasoning evaluation. BBEH replaces each task in BBH with a novel task that probes a similar reasoning capability but exhibits significantly increased difficulty. We evaluate various models on BBEH and observe a (harmonic) average accuracy of 9.8% for the best general-purpose model and 44.8% for the best reasoning-specialized model, indicating substantial room for improvement and highlighting the ongoing challenge of achieving robust general reasoning in LLMs. We release BBEH publicly at: https://github.com/google-deepmind/bbeh.
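The abstract reports a harmonic rather than an arithmetic average across tasks. As a rough illustration of why that choice matters, the sketch below compares the two on a handful of per-task accuracies. The task names and numbers are hypothetical and are not taken from the paper, and the paper's exact aggregation procedure (for example, how it treats tasks with zero accuracy) may differ from this plain harmonic mean.

```python
from statistics import fmean, harmonic_mean

# Hypothetical per-task accuracies in percent; illustrative only,
# not results from the BBEH paper or this leaderboard.
task_accuracy = {
    "task_a": 80.0,
    "task_b": 40.0,
    "task_c": 10.0,
}


def summarize(accuracies):
    """Return the arithmetic and harmonic means of per-task accuracies."""
    values = list(accuracies.values())
    return {
        "arithmetic": fmean(values),
        # The harmonic mean is pulled toward the lowest values, so a model
        # cannot hide one weak task behind several strong ones.
        "harmonic": harmonic_mean(values),
    }


summary = summarize(task_accuracy)
print(summary)
```

With these made-up numbers, the arithmetic mean is about 43.3 while the harmonic mean is about 21.8, which shows how a single weak task drags the harmonic figure down.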
FAQ
Common questions about the BIG-Bench Extra Hard benchmark and leaderboard.