BIG-Bench
Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark consisting of 204+ tasks designed to probe large language models and extrapolate their future capabilities. It covers diverse domains including linguistics, mathematics, common-sense reasoning, biology, physics, social bias, software development, and more. The benchmark focuses on tasks believed to be beyond current language model capabilities and includes both English and non-English tasks across multiple languages.
Progress Over Time
(Interactive timeline omitted: model performance evolution on BIG-Bench, showing the state-of-the-art frontier across open and proprietary models.)
BIG-Bench Leaderboard
3 models
| Rank | Model | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|---|
| 1 | Gemini 1.0 Pro | Google | — | 33K | $0.50 / $1.50 | — |
| 2 | — | Google | 27B | — | — | — |
| 3 | — | Google | 9B | — | — | — |
FAQ
Common questions about BIG-Bench
What is BIG-Bench?
BIG-Bench (Beyond the Imitation Game Benchmark) is a collaborative benchmark of 204+ tasks that probe large language models across linguistics, mathematics, common-sense reasoning, biology, physics, social bias, software development, and more, in English and many other languages.
Where can I read the BIG-Bench paper?
The BIG-Bench paper is available at https://arxiv.org/abs/2206.04615 and details the benchmark's methodology, dataset creation, and evaluation criteria.
How are models ranked on the BIG-Bench leaderboard?
The leaderboard ranks 3 AI models by their BIG-Bench score. Gemini 1.0 Pro by Google currently leads with a score of 0.750; the average score across all models is 0.727.
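As a small illustration of how the leaderboard average is derived, the sketch below recomputes a mean over per-model scores. Only the 0.750 top score comes from this page; the other two values are hypothetical placeholders chosen solely to be consistent with the stated 0.727 average.

```python
# Hypothetical leaderboard scores: only Gemini 1.0 Pro's 0.750 is reported here;
# the remaining two entries are placeholder values consistent with the 0.727 mean.
scores = {
    "Gemini 1.0 Pro": 0.750,   # reported top score
    "model_b": 0.716,          # hypothetical
    "model_c": 0.715,          # hypothetical
}

average = sum(scores.values()) / len(scores)
print(f"top: {max(scores.values()):.3f}, average: {average:.3f}")
# → top: 0.750, average: 0.727
```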
What is the highest BIG-Bench score?
The highest BIG-Bench score is 0.750, achieved by Google's Gemini 1.0 Pro.
How many models have been evaluated on BIG-Bench?
3 models have been evaluated on the BIG-Bench benchmark, with 0 verified results and 2 self-reported results.
What categories does BIG-Bench cover?
BIG-Bench is categorized under language, math, and reasoning; it evaluates text models and includes multilingual tasks.