BIG-Bench Extra Hard
BIG-Bench Extra Hard (BBEH) is a benchmark that replaces each task in BIG-Bench Hard with a novel task that probes similar reasoning capabilities at significantly higher difficulty. It contains 23 tasks covering diverse reasoning skills, including many-hop reasoning, causal understanding, spatial reasoning, temporal arithmetic, geometric reasoning, linguistic reasoning, logic puzzles, and humor understanding. BBEH was designed to address saturation on existing benchmarks, where state-of-the-art models achieve near-perfect scores, and it leaves substantial room for improvement: as reported by its authors, the best general-purpose model reaches a harmonic mean accuracy of 9.8% and the best reasoning-specialized model reaches 44.8%.
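A single average over 23 tasks can hide very uneven per-task results. The sketch below contrasts an arithmetic mean with a harmonic mean, which is pulled toward the weakest tasks. The task names, the scores, the function name `aggregate_accuracy`, and the `smoothing` constant are all assumptions made for illustration; none of them are taken from this page or from an official BBEH scoring script.

```python
from statistics import harmonic_mean


def aggregate_accuracy(per_task_accuracy, smoothing=1.0):
    """Return (arithmetic mean, smoothed harmonic mean) of per-task accuracies.

    Accuracies are percentages in [0, 100]. `smoothing` is a hypothetical
    constant added to every score so that a single 0% task does not force
    the harmonic mean to exactly 0; it is subtracted again afterwards.
    """
    scores = list(per_task_accuracy.values())
    if not scores:
        raise ValueError("need at least one task score")
    arithmetic = sum(scores) / len(scores)
    harmonic = harmonic_mean([s + smoothing for s in scores]) - smoothing
    return arithmetic, harmonic


# Made-up scores for three illustrative tasks (not real leaderboard results).
example_scores = {"task_a": 60.0, "task_b": 30.0, "task_c": 2.0}
arithmetic, harmonic = aggregate_accuracy(example_scores)
print(f"arithmetic mean: {arithmetic:.1f}%, harmonic mean: {harmonic:.1f}%")
```

With uneven scores like these, the harmonic mean comes out well below the arithmetic mean, so a model cannot offset a near-zero task with one strong task.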
BIG-Bench Extra Hard Leaderboard
| Rank | Model | Organization | Size | Context | Cost | License |
|---|---|---|---|---|---|---|
| 1 | Gemma 4 31B | Google | 31B | — | — | |
| 2 | | Google | 25B | — | — | |
| 3 | Gemma 4 E4B | Google | 8B | — | — | |
| 4 | Gemma 4 E2B | Google | 5B | — | — | |
| 5 | | Google | 27B | 131K | $0.10 / $0.20 | |
| 6 | | Google | 12B | 131K | $0.05 / $0.10 | |
| 7 | | Google | — | — | — | |
| 8 | | Google | 4B | 131K | $0.02 / $0.04 | |
| 9 | | Google | 1B | — | — | |