
BIG-Bench Hard

BIG-Bench Hard (BBH) is a subset of 23 challenging BIG-Bench tasks, selected as tasks on which earlier language model evaluations did not outperform the average human rater. The benchmark contains 6,511 evaluation examples covering multiple forms of multi-step reasoning, including arithmetic, logical reasoning (Boolean expressions, logical deduction), geometric reasoning, temporal reasoning, and language understanding. Tasks require capabilities such as causal judgment, object counting, navigation, pattern recognition, and complex problem solving.
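
For readers who want to see what evaluating on BBH looks like in practice, here is a minimal sketch that fetches one task file and scores a placeholder model with exact match. It assumes the task JSON layout used in the public BIG-Bench Hard GitHub repository (a bbh/<task>.json file with an "examples" list of input/target pairs); the predict function is a stand-in for a real model call and is not part of this page.

    # Minimal BBH scoring sketch (assumes the bbh/<task>.json layout of the
    # public BIG-Bench Hard repository; predict() is a placeholder model).
    import json
    import urllib.request

    TASK = "boolean_expressions"  # one of the 23 BBH tasks
    URL = ("https://raw.githubusercontent.com/suzgunmirac/BIG-Bench-Hard/"
           f"main/bbh/{TASK}.json")

    def predict(prompt: str) -> str:
        # Placeholder for a real model call; always answers "True".
        return "True"

    def main() -> None:
        with urllib.request.urlopen(URL) as resp:
            task = json.load(resp)
        examples = task["examples"]  # each entry has "input" and "target"
        correct = sum(predict(ex["input"]).strip() == ex["target"].strip()
                      for ex in examples)
        print(f"{TASK}: {correct}/{len(examples)} exact match "
              f"({correct / len(examples):.3f})")

    if __name__ == "__main__":
        main()

Benchmark-level accuracy is then the aggregate of this per-task exact-match score over all 23 task files.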

Paper: https://arxiv.org/abs/2210.09261

Progress Over Time

[Interactive timeline showing model performance on BIG-Bench Hard over time, with the state-of-the-art frontier marked and open models distinguished from proprietary ones.]

BIG-Bench Hard Leaderboard

21 models • 0 verified
Rank  Organization  Params  Context  Cost (input)  Cost (output)
1     -             -       200K     $3.00         $15.00
1     -             -       200K     $3.00         $15.00
3     -             -       2.1M     $2.50         $10.00
4     -             27B     131K     $0.10         $0.20
5     Anthropic     -       200K     $15.00        $75.00
6     -             12B     131K     $0.05         $0.10
7     -             -       1.0M     $0.15         $0.60
8     -             -       200K     $3.00         $15.00
9     -             60B     -        -             -
10    -             -       200K     $0.25         $1.25
11    -             4B      131K     $0.02         $0.04
12    Microsoft     4B      -        -             -
13    -             8B      128K     $0.50         $0.50
13    -             8B      -        -             -
15    -             4B      128K     $0.10         $0.10
16    -             7B      -        -             -
17    -             2B      -        -             -
17    -             8B      -        -             -
19    -             8B      -        -             -
19    -             2B      -        -             -
21    -             1B      -        -             -
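
The Cost columns above can be turned into a rough per-run estimate. The sketch below assumes the listed prices are USD per million tokens and uses placeholder average prompt and response lengths; neither assumption comes from this leaderboard.

    # Rough cost estimate for one pass over all 6,511 BBH examples.
    # Assumptions: prices are USD per 1M tokens; token counts are placeholders.
    NUM_EXAMPLES = 6_511
    AVG_PROMPT_TOKENS = 400       # assumed average prompt length
    AVG_RESPONSE_TOKENS = 150     # assumed average response length

    def run_cost(input_price: float, output_price: float) -> float:
        """Estimated USD cost of one pass over all BBH examples."""
        input_tokens = NUM_EXAMPLES * AVG_PROMPT_TOKENS
        output_tokens = NUM_EXAMPLES * AVG_RESPONSE_TOKENS
        return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

    # Example: the rank-1 entry lists $3.00 input / $15.00 output.
    print(f"${run_cost(3.00, 15.00):.2f}")   # roughly $22 under these assumptions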

FAQ

Common questions about BIG-Bench Hard

What is BIG-Bench Hard?
BIG-Bench Hard (BBH) is a subset of 23 challenging BIG-Bench tasks, selected as tasks on which earlier language model evaluations did not outperform the average human rater. The benchmark contains 6,511 evaluation examples covering multiple forms of multi-step reasoning, including arithmetic, logical reasoning (Boolean expressions, logical deduction), geometric reasoning, temporal reasoning, and language understanding. Tasks require capabilities such as causal judgment, object counting, navigation, pattern recognition, and complex problem solving.

Where can I find the BIG-Bench Hard paper?
The BIG-Bench Hard paper is available at https://arxiv.org/abs/2210.09261. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How do models rank on the BIG-Bench Hard leaderboard?
The BIG-Bench Hard leaderboard ranks 21 AI models by their performance on this benchmark. Currently, Claude 3.5 Sonnet by Anthropic leads with a score of 0.931. The average score across all models is 0.712.

What is the highest BIG-Bench Hard score?
The highest BIG-Bench Hard score is 0.931, achieved by Claude 3.5 Sonnet from Anthropic.

How many models have been evaluated on BIG-Bench Hard?
21 models have been evaluated on the BIG-Bench Hard benchmark, with 0 verified results and 21 self-reported results.

What categories does BIG-Bench Hard cover?
BIG-Bench Hard is categorized under language, math, and reasoning. The benchmark evaluates text models.