BIG-Bench Hard

Progress Over Time

Interactive timeline showing model performance evolution on BIG-Bench Hard

BIG-Bench Hard Leaderboard

21 models
About this benchmark

What is BIG-Bench Hard?

BIG-Bench Hard (BBH) is a subset of 23 challenging BIG-Bench tasks selected because prior language model evaluations did not outperform average human-rater performance. The benchmark contains 6,511 evaluation examples testing various forms of multi-step reasoning including arithmetic, logical reasoning (Boolean expressions, logical deduction), geometric reasoning, temporal reasoning, and language understanding. Tasks require capabilities such as causal judgment, object counting, navigation, pattern recognition, and complex problem solving.

BIG-Bench Hard is a text benchmark evaluating models on math, reasoning, and language tasks. LLM Stats tracks 21 models on this benchmark, scored on a 0–1 scale. The current average is 0.712, with the leader at 0.931.

Current leaders

Claude 3.5 Sonnet from Anthropic currently leads the BIG-Bench Hard leaderboard with a score of 0.931 across 21 evaluated AI models.

1. Claude 3.5 Sonnet (Anthropic): 93.1%
1. Claude 3.5 Sonnet (Anthropic): 93.1%
3. Gemini 1.5 Pro (Google): 89.2%
OSS: Gemma 3 27B (#4, open-weight): 87.6%
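
The rank sequence above (1, 1, 3, then #4) is consistent with standard competition ranking, where tied scores share a rank and the next rank is skipped. The page does not state its ranking rule, so the following is only a minimal sketch under that assumption. The function name `competition_ranks` is hypothetical, and the scores are the four listed above, converted from the 0–1 scale to percentages for display.

```python
def competition_ranks(scores):
    """Return 1-based ranks; tied scores share a rank and the next rank is skipped.

    Assumption: the leaderboard uses standard competition ranking.
    The source page does not confirm this.
    """
    # A score's rank is 1 plus the number of strictly higher scores.
    return [1 + sum(1 for other in scores if other > s) for s in scores]


# The four entries named in the "Current leaders" list (0-1 scale).
leaders = [
    ("Claude 3.5 Sonnet", 0.931),
    ("Claude 3.5 Sonnet", 0.931),
    ("Gemini 1.5 Pro", 0.892),
    ("Gemma 3 27B", 0.876),
]

ranks = competition_ranks([score for _, score in leaders])

for rank, (name, score) in zip(ranks, leaders):
    # ".1%" multiplies by 100 and keeps one decimal place.
    print(f"{rank}  {name}  {score:.1%}")
```

Under this rule the two tied entries both receive rank 1, Gemini 1.5 Pro receives rank 3, and Gemma 3 27B receives rank 4, which matches the ranks shown on the page.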

Source paper

Title
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
Authors
Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, and 7 others
Abstract

BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that focuses on tasks believed to be beyond the capabilities of current language models. Language models have already made good progress on this benchmark, with the best model in the BIG-Bench paper outperforming average reported human-rater results on 65% of the BIG-Bench tasks via few-shot prompting. But on what tasks do language models fall short of average human-rater performance, and are those tasks actually unsolvable by current language models? In this work, we focus on a suite of 23 challenging BIG-Bench tasks which we call BIG-Bench Hard (BBH). These are the tasks for which prior language model evaluations did not outperform the average human-rater. We find that applying chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the average human-rater performance on 10 of the 23 tasks, and Codex (code-davinci-002) to surpass the average human-rater performance on 17 of the 23 tasks. Since many tasks in BBH require multi-step reasoning, few-shot prompting without CoT, as done in the BIG-Bench evaluations (Srivastava et al., 2022), substantially underestimates the best performance and capabilities of language models, which is better captured via CoT prompting. As further analysis, we explore the interaction between CoT and model scale on BBH, finding that CoT enables emergent task performance on several BBH tasks with otherwise flat scaling curves.

FAQ

Common questions about the BIG-Bench Hard benchmark and leaderboard.

What is the BIG-Bench Hard benchmark?

BIG-Bench Hard (BBH) is a subset of 23 challenging BIG-Bench tasks selected because prior language model evaluations did not outperform average human-rater performance. The benchmark contains 6,511 evaluation examples testing various forms of multi-step reasoning including arithmetic, logical reasoning (Boolean expressions, logical deduction), geometric reasoning, temporal reasoning, and language understanding. Tasks require capabilities such as causal judgment, object counting, navigation, pattern recognition, and complex problem solving.

What is the BIG-Bench Hard leaderboard?

The BIG-Bench Hard leaderboard ranks 21 AI models based on their performance on this benchmark. Currently, Claude 3.5 Sonnet by Anthropic leads with a score of 0.931. The average score across all models is 0.712.

What is the highest BIG-Bench Hard score?

The highest BIG-Bench Hard score is 0.931, achieved by Claude 3.5 Sonnet from Anthropic.

How many models are evaluated on BIG-Bench Hard?

21 models have been evaluated on the BIG-Bench Hard benchmark, with 0 verified results and 21 self-reported results.

Where can I find the BIG-Bench Hard paper?

The BIG-Bench Hard paper is available at https://arxiv.org/abs/2210.09261. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does BIG-Bench Hard cover?

BIG-Bench Hard is categorized under math, reasoning, and language. The benchmark evaluates text models.

What is the best open-source model on BIG-Bench Hard?

Gemma 3 27B by Google is the top-ranked open-source model on BIG-Bench Hard, with a score of 0.876 (rank #4).

How recent are the BIG-Bench Hard leaderboard results?

The BIG-Bench Hard leaderboard was last updated in June 2026 and currently includes 21 evaluated models.