BIG-Bench Hard
BIG-Bench Hard (BBH) is a subset of 23 challenging BIG-Bench tasks, chosen because earlier language models did not outperform the average human rater on them. The benchmark contains 6,511 evaluation examples covering multiple forms of multi-step reasoning: arithmetic, logical reasoning (Boolean expressions, logical deduction), geometric reasoning, temporal reasoning, and language understanding. Tasks require capabilities such as causal judgment, object counting, navigation, pattern recognition, and complex problem solving.
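BBH tasks are typically scored by exact match of the model's final answer against a single gold label. A minimal sketch of that scoring loop, in the style of the boolean_expressions task (the two items and the `eval`-based stand-in "model" are illustrative, not drawn from the real dataset):

```python
# BBH-style exact-match scoring. Items below are illustrative examples,
# not actual BBH dataset entries.
def exact_match(prediction: str, target: str) -> bool:
    """Normalized exact match against the gold answer."""
    return prediction.strip().lower() == target.strip().lower()

def score(predictions, targets):
    """Fraction of items whose prediction matches the target exactly."""
    correct = sum(exact_match(p, t) for p, t in zip(predictions, targets))
    return correct / len(targets)

# Hypothetical boolean_expressions items: evaluate each expression to True/False.
items = [
    ("not ( True ) and ( True ) is", "False"),
    ("True and not not ( not False ) is", "True"),
]
# Stand-in "model": Python's own eval happens to solve this toy task.
preds = [str(eval(expr.rsplit(" is", 1)[0])) for expr, _ in items]
golds = [gold for _, gold in items]
print(score(preds, golds))  # → 1.0
```

Aggregate BBH scores like those on the leaderboard below are averages of this per-task accuracy across the 23 tasks.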
Progress Over Time
Interactive timeline showing model performance evolution on BIG-Bench Hard
BIG-Bench Hard Leaderboard
21 models • 0 verified
| Rank | Organization | Params | Context | Cost (input / output) | License |
|---|---|---|---|---|---|
| 1 | Anthropic | — | 200K | $3.00 / $15.00 | — |
| 1 | Anthropic | — | 200K | $3.00 / $15.00 | — |
| 3 | Google | — | 2.1M | $2.50 / $10.00 | — |
| 4 | Google | 27B | 131K | $0.10 / $0.20 | — |
| 5 | Anthropic | — | 200K | $15.00 / $75.00 | — |
| 6 | Google | 12B | 131K | $0.05 / $0.10 | — |
| 7 | Google | — | 1.0M | $0.15 / $0.60 | — |
| 8 | Anthropic | — | 200K | $3.00 / $15.00 | — |
| 9 | Microsoft | 60B | — | — | — |
| 10 | Anthropic | — | 200K | $0.25 / $1.25 | — |
| 11 | Google | 4B | 131K | $0.02 / $0.04 | — |
| 12 | Microsoft | 4B | — | — | — |
| 13 | — | 8B | 128K | $0.50 / $0.50 | — |
| 13 | — | 8B | — | — | — |
| 15 | Microsoft | 4B | 128K | $0.10 / $0.10 | — |
| 16 | — | 7B | — | — | — |
| 17 | — | 2B | — | — | — |
| 17 | Google | 8B | — | — | — |
| 19 | Google | 8B | — | — | — |
| 19 | — | 2B | — | — | — |
| 21 | Google | 1B | — | — | — |
FAQ
Common questions about BIG-Bench Hard
**What is BIG-Bench Hard?**
BIG-Bench Hard (BBH) is a subset of 23 challenging BIG-Bench tasks on which earlier language models did not outperform the average human rater. Its 6,511 evaluation examples test multi-step reasoning across arithmetic, logic, geometry, temporal reasoning, and language understanding.
**Where can I read the BIG-Bench Hard paper?**
The BIG-Bench Hard paper is available at https://arxiv.org/abs/2210.09261. It details the benchmark's methodology, dataset creation, and evaluation criteria.
**Which model leads the BIG-Bench Hard leaderboard?**
The BIG-Bench Hard leaderboard ranks 21 AI models by their performance on this benchmark. Currently, Claude 3.5 Sonnet by Anthropic leads with a score of 0.931. The average score across all models is 0.712.
**What is the highest BIG-Bench Hard score?**
The highest BIG-Bench Hard score is 0.931, achieved by Claude 3.5 Sonnet from Anthropic.
**How many models have been evaluated on BIG-Bench Hard?**
21 models have been evaluated on the BIG-Bench Hard benchmark, with 0 verified results and 21 self-reported results.
**What categories does BIG-Bench Hard cover?**
BIG-Bench Hard is categorized under language, math, and reasoning, and evaluates text models.