BIG-Bench Extra Hard

Progress Over Time

[Interactive timeline: model performance on BIG-Bench Extra Hard over time, showing the state-of-the-art frontier for open and proprietary models]

BIG-Bench Extra Hard Leaderboard

11 models

Rank  Parameters  Context  Cost            License
1     31B         262K     $0.13 / $0.38   -
2     25B         262K     $0.13 / $0.40   -
3     12B         -        -               -
4     25B         -        -               -
5     8B          -        -               -
6     5B          -        -               -
7     27B         -        -               -
8     12B         -        -               -
9     -           -        -               -
10    4B          -        -               -
11    1B          -        -               -
About this benchmark

What is BIG-Bench Extra Hard?

BIG-Bench Extra Hard (BBEH) is a benchmark that replaces each task in BIG-Bench Hard (BBH) with a novel task that probes a similar reasoning capability at significantly higher difficulty. It contains 23 tasks covering diverse reasoning skills, including many-hop reasoning, causal understanding, spatial reasoning, temporal arithmetic, geometric reasoning, linguistic reasoning, logic puzzles, and humor understanding. BBEH was designed to address saturation on existing benchmarks, where state-of-the-art models achieve near-perfect scores. In the source paper, the best general-purpose model reached a harmonic-mean accuracy of 9.8% and the best reasoning-specialized model reached 44.8%, leaving substantial room for improvement.

BIG-Bench Extra Hard is a text benchmark evaluating models on language, reasoning, and general tasks. LLM Stats tracks 11 models on this benchmark, scored on a 0–1 scale. The current average is 0.331, with the leader at 0.744.

Current leaders

Gemma 4 31B from Google currently leads the BIG-Bench Extra Hard leaderboard with a score of 0.744 across 11 evaluated AI models.

1. Gemma 4 31B (Google): 74.4%
2. Gemma 4 26B-A4B (Google): 64.8%
3. Gemma 4 12B (Google): 53.0%

Source paper

Title: BIG-Bench Extra Hard
Authors: Mehran Kazemi, Bahare Fatemi, Hritik Bansal, John Palowitch, and 16 others
Published: February 2025 (per arXiv identifier 2502.19187)

Abstract

Large language models (LLMs) are increasingly deployed in everyday applications, demanding robust general reasoning capabilities and diverse reasoning skillset. However, current LLM reasoning benchmarks predominantly focus on mathematical and coding abilities, leaving a gap in evaluating broader reasoning proficiencies. One particular exception is the BIG-Bench dataset, which has served as a crucial benchmark for evaluating the general reasoning capabilities of LLMs, thanks to its diverse set of challenging tasks that allowed for a comprehensive assessment of general reasoning across various skills within a unified framework. However, recent advances in LLMs have led to saturation on BIG-Bench, and its harder version BIG-Bench Hard (BBH). State-of-the-art models achieve near-perfect scores on many tasks in BBH, thus diminishing its utility. To address this limitation, we introduce BIG-Bench Extra Hard (BBEH), a new benchmark designed to push the boundaries of LLM reasoning evaluation. BBEH replaces each task in BBH with a novel task that probes a similar reasoning capability but exhibits significantly increased difficulty. We evaluate various models on BBEH and observe a (harmonic) average accuracy of 9.8% for the best general-purpose model and 44.8% for the best reasoning-specialized model, indicating substantial room for improvement and highlighting the ongoing challenge of achieving robust general reasoning in LLMs. We release BBEH publicly at: https://github.com/google-deepmind/bbeh.
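
The abstract reports a harmonic rather than an arithmetic average across tasks. The sketch below illustrates why that choice matters, using Python's standard `statistics` module. The per-task accuracies are hypothetical values chosen for illustration, not figures from the paper, and the paper's exact aggregation procedure may differ in detail.

```python
from statistics import harmonic_mean, mean

# Hypothetical per-task accuracies in percent (illustrative only,
# not taken from the BBEH paper or the leaderboard).
task_accuracies = [60.0, 45.0, 30.0, 5.0]

arithmetic = mean(task_accuracies)
harmonic = harmonic_mean(task_accuracies)

# The harmonic mean is pulled strongly toward the weakest task,
# so one very low score drags the aggregate down.
print(f"arithmetic mean: {arithmetic:.1f}")  # 35.0
print(f"harmonic mean:   {harmonic:.1f}")    # 14.7
```

Under this aggregation, a model cannot offset a near-zero score on one task with high scores on the others, which is consistent with the benchmark's stated goal of measuring general reasoning across all skills.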

FAQ

Common questions about the BIG-Bench Extra Hard benchmark and leaderboard.

What is the BIG-Bench Extra Hard leaderboard?

The BIG-Bench Extra Hard leaderboard ranks 11 AI models based on their performance on this benchmark. Currently, Gemma 4 31B by Google leads with a score of 0.744. The average score across all models is 0.331.

What is the highest BIG-Bench Extra Hard score?

The highest BIG-Bench Extra Hard score is 0.744, achieved by Gemma 4 31B from Google.

How many models are evaluated on BIG-Bench Extra Hard?

11 models have been evaluated on the BIG-Bench Extra Hard benchmark. All 11 results are self-reported; none are verified.

Where can I find the BIG-Bench Extra Hard paper?

The BIG-Bench Extra Hard paper is available at https://arxiv.org/abs/2502.19187. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does BIG-Bench Extra Hard cover?

BIG-Bench Extra Hard is categorized under language, reasoning, and general tasks. It evaluates text models.

What is the best open-source model on BIG-Bench Extra Hard?

Gemma 4 31B by Google is the top-ranked open-source model on BIG-Bench Extra Hard, with a score of 0.744 (rank #1).

Which model offers the best value on BIG-Bench Extra Hard?

Among models scoring within 10% of the leader, Gemma 4 31B from Google is the cheapest, at $0.13 per million input tokens with a score of 0.744.
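
The selection rule described above can be sketched as follows. This is an illustration under stated assumptions, not the site's actual method: "within 10% of the leader" is read as a relative threshold (score at least 90% of the top score), the `Entry` type and `best_value` function are hypothetical names, and the input costs are assumed to correspond to the first two leaderboard rows. Models with no listed cost are excluded.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    model: str
    score: float                 # 0-1 scale
    input_cost: Optional[float]  # USD per million input tokens; None if not listed


# Scores come from the "Current leaders" list. Costs are ASSUMED to map
# from leaderboard rows 1 and 2; the third model has no listed cost.
entries = [
    Entry("Gemma 4 31B", 0.744, 0.13),
    Entry("Gemma 4 26B-A4B", 0.648, 0.13),
    Entry("Gemma 4 12B", 0.530, None),
]


def best_value(entries, tolerance=0.10):
    """Return the cheapest entry scoring within `tolerance` (relative) of the leader."""
    leader_score = max(e.score for e in entries)
    threshold = leader_score * (1 - tolerance)
    candidates = [
        e for e in entries
        if e.score >= threshold and e.input_cost is not None
    ]
    # Cheapest input cost first; ties are broken in favor of the higher score.
    return min(candidates, key=lambda e: (e.input_cost, -e.score), default=None)


winner = best_value(entries)
print(winner.model)  # Gemma 4 31B
```

With the relative reading, the threshold is 0.744 × 0.9 = 0.6696, so only the leader qualifies. An absolute reading (within 10 percentage points) would also admit the second model at 0.648, and the tie on input cost would then have to be broken some other way, such as by score as in this sketch.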

How recent are the BIG-Bench Extra Hard leaderboard results?

The BIG-Bench Extra Hard leaderboard was last updated in July 2026 and currently includes 11 evaluated models.