
BigCodeBench

A benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries across 7 domains in 1,140 fine-grained programming tasks. It evaluates code generation with diverse function calls and complex instructions, and features two variants: Complete (code completion from a comprehensive docstring) and Instruct (code generation from a natural-language instruction).
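To make the two variants concrete, here is a hypothetical, BigCodeBench-style task (not an actual benchmark item): the Complete variant would present the signature and docstring below and ask the model to fill in the body, while the Instruct variant would give only a natural-language instruction such as "count how often each value appears in the first column of a CSV string". The solution chains calls across several standard libraries, which is the kind of compositional tool use the benchmark targets.

```python
# Hypothetical illustration of a BigCodeBench-style task; the function name
# and spec are invented for this sketch, not taken from the dataset.
import csv
import io
from collections import Counter

def task_func(csv_text):
    """Count how often each value appears in the first column of a CSV string.

    Parameters:
    csv_text (str): CSV data as a string, with no header row.

    Returns:
    Counter: mapping from first-column value to its occurrence count.
    """
    reader = csv.reader(io.StringIO(csv_text))
    return Counter(row[0] for row in reader if row)

print(task_func("a,1\nb,2\na,3"))  # Counter({'a': 2, 'b': 1})
```

Solutions are checked against unit tests, so any implementation with the same observable behavior would pass.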

Paper: https://arxiv.org/abs/2406.15877

Progress Over Time

[Interactive timeline showing model performance evolution on BigCodeBench; the legend distinguishes the state-of-the-art frontier and open vs. proprietary models]

BigCodeBench Leaderboard

[Interactive leaderboard table: 2 models; visible entries reference Alibaba Cloud / Qwen Team and a 7B model. Full table data is not recoverable here.]

FAQ

Common questions about BigCodeBench

What is BigCodeBench?
BigCodeBench is a benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries across 7 domains in 1,140 fine-grained programming tasks. It evaluates code generation with diverse function calls and complex instructions, and features two variants: Complete (code completion from a comprehensive docstring) and Instruct (code generation from a natural-language instruction).
Where can I find the BigCodeBench paper?
The BigCodeBench paper is available at https://arxiv.org/abs/2406.15877. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
How are models ranked on the BigCodeBench leaderboard?
The BigCodeBench leaderboard ranks 2 AI models by their performance on the benchmark. Currently, Gemini Diffusion by Google leads with a score of 0.454, and the average score across all models is 0.432.
What is the highest BigCodeBench score?
The highest BigCodeBench score is 0.454, achieved by Gemini Diffusion from Google.
How many models have been evaluated on BigCodeBench?
2 models have been evaluated on the BigCodeBench benchmark, with 0 verified results and 2 self-reported results.
How is BigCodeBench categorized?
BigCodeBench is categorized under general and reasoning. The benchmark evaluates text models.
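The leaderboard figures quoted above pin down the second model's score: assuming the reported 0.432 is an unweighted mean over the 2 models, the remaining score follows by simple arithmetic.

```python
# Sanity check of the quoted leaderboard numbers, assuming 0.432 is an
# unweighted mean of the two models' scores.
n_models, top_score, mean_score = 2, 0.454, 0.432
other_score = n_models * mean_score - top_score  # implied score of the second model
print(round(other_score, 3))  # 0.41
```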