BigCodeBench-Full
A comprehensive benchmark that evaluates large language models' ability to solve complex, practical programming tasks via code generation. Contains 1,140 fine-grained tasks across 7 domains using function calls from 139 libraries. Challenges LLMs to invoke multiple function calls as tools and handle complex instructions for realistic software engineering and general-purpose reasoning tasks.
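To make the task format concrete, here is a hypothetical task sketched in the style of BigCodeBench: a natural-language instruction that the model must satisfy by composing several library calls. The function name `task_func`, the instruction, and the choice of pandas are illustrative assumptions, not an actual benchmark item.

```python
# Hypothetical BigCodeBench-style task (illustrative, not from the dataset):
# the model is given the docstring below and must implement the function by
# chaining multiple pandas calls correctly.
import pandas as pd


def task_func(data: dict) -> pd.DataFrame:
    """Build a DataFrame from `data`, drop rows with missing values,
    and return it sorted by the 'score' column in descending order."""
    df = pd.DataFrame(data)          # construct from a column dict
    df = df.dropna()                 # remove rows with any missing value
    return df.sort_values("score", ascending=False).reset_index(drop=True)
```

Each real task is paired with unit tests, so a generation counts as solved only if the produced code runs and passes them.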
Progress Over Time
[Interactive timeline showing model performance evolution on BigCodeBench-Full, with a state-of-the-art frontier line and separate markers for open and proprietary models.]
BigCodeBench-Full Leaderboard
1 model
| Rank | Model | Organization | Params | Context | Cost (in / out) | Score |
|---|---|---|---|---|---|---|
| 1 | Qwen2.5-Coder 32B Instruct | Alibaba Cloud / Qwen Team | 32B | 128K | $0.09 / $0.09 | 0.496 |
FAQ
Common questions about BigCodeBench-Full
**What is BigCodeBench-Full?**
A comprehensive benchmark that evaluates large language models' ability to solve complex, practical programming tasks via code generation. It contains 1,140 fine-grained tasks across 7 domains using function calls from 139 libraries, challenging LLMs to invoke multiple function calls as tools and to follow complex instructions for realistic software engineering and general-purpose reasoning tasks.

**Is there a paper for BigCodeBench-Full?**
The BigCodeBench-Full paper is available at https://arxiv.org/abs/2406.15877. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

**Which model leads the BigCodeBench-Full leaderboard?**
The leaderboard currently ranks 1 AI model. Qwen2.5-Coder 32B Instruct by Alibaba Cloud / Qwen Team leads with a score of 0.496, which is therefore also the average score across all listed models.

**What is the highest BigCodeBench-Full score?**
The highest BigCodeBench-Full score is 0.496, achieved by Qwen2.5-Coder 32B Instruct from Alibaba Cloud / Qwen Team.

**How many models have been evaluated?**
1 model has been evaluated on the BigCodeBench-Full benchmark, with 0 verified and 1 self-reported result.

**How is BigCodeBench-Full categorized?**
BigCodeBench-Full is categorized under general and reasoning, and the benchmark evaluates text models.
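Leaderboard scores like the 0.496 above are typically pass@k-style metrics: the fraction of tasks for which at least one of k sampled generations passes the task's unit tests. Assuming BigCodeBench scoring follows the standard unbiased pass@k estimator (introduced in the Codex paper and widely used for code benchmarks; this is an assumption about this leaderboard, not a documented fact), it can be sketched as:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: with n generations per task, of which
    c pass the tests, estimate P(at least one of k random samples passes)
    as 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k failures exist, so any k-sample must include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark-level score is then the mean of `pass_at_k` across all 1,140 tasks; with greedy decoding (n = k = 1), it reduces to the plain fraction of tasks solved.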