BigCodeBench
A benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries across 7 domains in 1,140 fine-grained programming tasks. It evaluates code generation with diverse function calls and complex instructions, and comes in two variants: Complete (code completion from comprehensive docstrings) and Instruct (code generation from natural-language instructions).
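To make the task format concrete, here is a hypothetical task in the style of the Complete variant. This is an illustrative sketch, not an actual benchmark item: the model receives the imports, signature, and docstring, and must produce a function body that passes hidden unit tests.

```python
import re
from collections import Counter

def task_func(text, top_n=3):
    """
    Extract all words from `text` (case-insensitive), count their
    frequencies, and return the `top_n` most common words.

    Parameters:
    text (str): The input text to analyze.
    top_n (int): Number of most frequent words to return.

    Returns:
    list of (str, int): The top_n (word, count) pairs, most frequent first.

    Requirements:
    - re
    - collections

    Example:
    >>> task_func("The cat and the dog and the bird")
    [('the', 3), ('and', 2), ('cat', 1)]
    """
    # A reference solution combines two library calls:
    # re.findall to tokenize, Counter.most_common to rank.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(top_n)
```

In the Instruct variant, the same requirement would instead be phrased as a natural-language instruction, without the docstring scaffold.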
Progress Over Time
[Interactive timeline showing model performance evolution on BigCodeBench, plotting the state-of-the-art frontier and distinguishing open from proprietary models; not reproduced here.]
BigCodeBench Leaderboard
2 models
| # | Model | Organization | Params | Score | Context | Cost | License |
|---|---|---|---|---|---|---|---|
| 1 | Gemini Diffusion | Google | — | 0.454 | — | — | — |
| 2 | — | Alibaba Cloud / Qwen Team | 7B | — | — | — | — |
FAQ
Common questions about BigCodeBench
What is BigCodeBench?
BigCodeBench challenges LLMs to invoke multiple function calls as tools from 139 libraries across 7 domains in 1,140 fine-grained programming tasks. It comes in two variants: Complete (code completion from comprehensive docstrings) and Instruct (code generation from natural-language instructions).
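For readers who want to inspect the tasks directly, below is a minimal sketch of loading the benchmark from the Hugging Face Hub. The dataset ID, the version-named split, and the field names are assumptions based on the public BigCodeBench release and may differ; check the project repository for the current layout.

```python
# Minimal sketch: loading BigCodeBench tasks from the Hugging Face Hub.
# The dataset ID ("bigcode/bigcodebench"), the version split name, and the
# field names below are assumptions; verify them against the official repo.
from datasets import load_dataset

ds = load_dataset("bigcode/bigcodebench", split="v0.1.2")

task = ds[0]
print(task["task_id"])                 # task identifier
print(task["complete_prompt"][:300])   # Complete variant: docstring-based prompt
print(task["instruct_prompt"][:300])   # Instruct variant: natural-language instruction
```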
Where can I find the BigCodeBench paper?
The BigCodeBench paper is available at https://arxiv.org/abs/2406.15877. It details the benchmark methodology, dataset construction, and evaluation criteria.
Which model leads the BigCodeBench leaderboard?
The leaderboard ranks 2 AI models by their performance on this benchmark. Gemini Diffusion by Google currently leads with a score of 0.454, and the average score across both models is 0.432 (which puts the second model at roughly 0.410).
What is the highest BigCodeBench score?
The highest BigCodeBench score is 0.454, achieved by Gemini Diffusion from Google.
How many models have been evaluated on BigCodeBench?
2 models have been evaluated on the BigCodeBench benchmark: 0 with verified results and 2 with self-reported results.
What kind of benchmark is BigCodeBench?
On this leaderboard, BigCodeBench is categorized under general and reasoning, and it evaluates text models.