
BIG-Bench Extra Hard

BIG-Bench Extra Hard (BBEH) is a challenging benchmark that replaces each task in BIG-Bench Hard with a novel task probing similar reasoning capabilities at significantly higher difficulty. The benchmark contains 23 tasks testing diverse reasoning skills, including many-hop reasoning, causal understanding, spatial reasoning, temporal arithmetic, geometric reasoning, linguistic reasoning, logic puzzles, and humor understanding. Designed to address saturation on existing benchmarks, where state-of-the-art models achieve near-perfect scores, BBEH leaves substantial room for improvement: the best models reach only 9.8-44.8% average accuracy.
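A minimal sketch of what an evaluation loop over BBEH could look like, assuming each of the 23 tasks is distributed as a JSON file containing an "examples" list of "input"/"target" pairs and scored by exact match. The file layout, field names, and the ask_model callable are illustrative assumptions, not the official harness; consult the paper and its released data for the actual format.

    import json
    from pathlib import Path
    from statistics import mean
    from typing import Callable

    def evaluate_task(task_file: Path, ask_model: Callable[[str], str]) -> float:
        # Exact-match accuracy for one task.
        # Assumed schema: {"examples": [{"input": ..., "target": ...}, ...]}
        examples = json.loads(task_file.read_text())["examples"]
        hits = sum(
            ask_model(ex["input"]).strip().lower() == str(ex["target"]).strip().lower()
            for ex in examples
        )
        return hits / len(examples)

    def evaluate_benchmark(task_dir: Path, ask_model: Callable[[str], str]) -> float:
        # Unweighted mean of per-task accuracies over all task files in a directory.
        scores = [evaluate_task(f, ask_model) for f in sorted(task_dir.glob("*.json"))]
        return mean(scores)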

Paper: https://arxiv.org/abs/2502.19187

Progress Over Time

Timeline of model performance on BIG-Bench Extra Hard, showing the state-of-the-art frontier for open and proprietary models.

BIG-Bench Extra Hard Leaderboard

9 models

Rank  Organization  Parameters  Context  Cost (input / output)
1     Google        31B         -        -
2     -             25B         -        -
3     Google        8B          -        -
4     Google        5B          -        -
5     -             27B         131K     $0.10 / $0.20
6     -             12B         131K     $0.05 / $0.10
7     -             -           -        -
8     -             4B          131K     $0.02 / $0.04
9     -             1B          -        -

FAQ

Common questions about BIG-Bench Extra Hard

What is BIG-Bench Extra Hard?
BIG-Bench Extra Hard (BBEH) replaces each task in BIG-Bench Hard with a novel task that probes similar reasoning capabilities at significantly higher difficulty. Its 23 tasks cover many-hop reasoning, causal understanding, spatial reasoning, temporal arithmetic, geometric reasoning, linguistic reasoning, logic puzzles, and humor understanding, and the benchmark was designed to address saturation on existing benchmarks where state-of-the-art models achieve near-perfect scores.

Where can I find the BIG-Bench Extra Hard paper?
The paper is available at https://arxiv.org/abs/2502.19187. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How does the BIG-Bench Extra Hard leaderboard rank models?
The leaderboard ranks 9 AI models based on their performance on this benchmark. Currently, Gemma 4 31B by Google leads with a score of 0.744. The average score across all models is 0.292.

What is the highest BIG-Bench Extra Hard score?
The highest score is 0.744, achieved by Gemma 4 31B from Google.

How many models have been evaluated on BIG-Bench Extra Hard?
9 models have been evaluated, with 0 verified results and 9 self-reported results.

What categories does BIG-Bench Extra Hard fall under?
BIG-Bench Extra Hard is categorized under general, language, and reasoning. The benchmark evaluates text models.
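As a point of reference for the figures quoted above, the sketch below shows one straightforward way such summary numbers can be computed: a per-model score as the arithmetic mean of its per-task accuracies, and the leaderboard average as the mean of per-model scores. This is illustrative arithmetic only; the paper and the leaderboard may weight or aggregate tasks differently.

    from statistics import mean

    def model_score(per_task_accuracy: dict[str, float]) -> float:
        # One model's benchmark score as an unweighted mean over its 23 task accuracies.
        return mean(per_task_accuracy.values())

    def leaderboard_average(per_model_score: dict[str, float]) -> float:
        # The "average score across all models" figure quoted in the FAQ.
        return mean(per_model_score.values())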