BIG-Bench

Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark consisting of 204+ tasks designed to probe large language models and extrapolate their future capabilities. It covers diverse domains including linguistics, mathematics, common-sense reasoning, biology, physics, social bias, software development, and more. The benchmark focuses on tasks believed to be beyond current language model capabilities and includes both English and non-English tasks across multiple languages.
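For readers who want to inspect the tasks directly, here is a minimal sketch of loading a single JSON task from a local clone of the google/BIG-bench repository and scoring a constant-answer baseline with exact match. The task name, file path, and baseline are illustrative assumptions rather than official tooling; the sketch relies only on JSON tasks keeping their examples in a task.json file under an "examples" list, each with an "input" and either a "target" or "target_scores" field.

import json
from pathlib import Path

# Assumes a local clone of https://github.com/google/BIG-bench, where JSON
# tasks live at bigbench/benchmark_tasks/<task_name>/task.json.
TASK_PATH = Path("BIG-bench/bigbench/benchmark_tasks/implicatures/task.json")

def exact_match_accuracy(examples, predict):
    # predict is any callable mapping an input string to an answer string.
    correct = 0
    for ex in examples:
        if "target_scores" in ex:  # multiple choice: gold is the highest-scored option
            gold = max(ex["target_scores"], key=ex["target_scores"].get)
        else:  # generative: "target" is a string or a list of acceptable strings
            target = ex["target"]
            gold = target[0] if isinstance(target, list) else target
        correct += predict(ex["input"]) == gold
    return correct / len(examples)

examples = json.loads(TASK_PATH.read_text())["examples"]
# Constant-answer baseline standing in for a real model call (illustrative only).
baseline = lambda _: "yes"
print(f"{len(examples)} examples; baseline accuracy {exact_match_accuracy(examples, baseline):.3f}")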

Paper: https://arxiv.org/abs/2206.04615

Progress Over Time

[Interactive timeline of model performance on BIG-Bench over time, tracing the state-of-the-art frontier and distinguishing open from proprietary models]

BIG-Bench Leaderboard

3 models

[Leaderboard table listing each model's Context, Cost, and License; visible entries include a 133K context window, $0.50 / $1.50 pricing, and 227B and 39B figures]

FAQ

Common questions about BIG-Bench

What is BIG-Bench?
Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark of 204+ tasks designed to probe large language models and extrapolate their future capabilities, spanning domains from linguistics and mathematics to social bias and software development, in both English and other languages.

Where can I find the BIG-Bench paper?
The BIG-Bench paper is available at https://arxiv.org/abs/2206.04615; it details the benchmark's methodology, dataset creation, and evaluation criteria.

How are models ranked on the BIG-Bench leaderboard?
The BIG-Bench leaderboard ranks 3 AI models by their performance on this benchmark. Gemini 1.0 Pro by Google currently leads with a score of 0.750, and the average score across all models is 0.727.

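As a quick sanity check on those numbers, the mean pins down what the two unnamed models must average between them. A throwaway Python calculation, assuming the reported figures are exact:

# Reported leaderboard figures, assumed exact.
n_models, top_score, mean_score = 3, 0.750, 0.727

total = n_models * mean_score   # 2.181
others = total - top_score      # 1.431 shared by the remaining two models
print(f"the other two models average {others / 2:.4f}")  # -> 0.7155
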
What is the highest BIG-Bench score?
The highest BIG-Bench score is 0.750, achieved by Google's Gemini 1.0 Pro.

How many models have been evaluated on BIG-Bench?
3 models have been evaluated on the BIG-Bench benchmark, with 0 verified results and 2 self-reported results.

Which categories does BIG-Bench cover?
BIG-Bench is categorized under language, math, and reasoning. It evaluates text models and includes multilingual support.