BigCodeBench-Full

BigCodeBench-Full is a comprehensive benchmark that evaluates large language models' ability to solve complex, practical programming tasks via code generation. It contains 1,140 fine-grained tasks across 7 domains, built on function calls from 139 libraries, and challenges LLMs to invoke multiple function calls as tools while following complex instructions that mirror realistic software engineering and general-purpose reasoning work.
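
The tasks themselves are published as a Hugging Face dataset. As a rough sketch of how to inspect them, the snippet below loads the task set with the datasets library; the split name ("v0.1.2") and field names are assumptions taken from the public dataset card and may differ between releases.

    from datasets import load_dataset

    # Load the BigCodeBench task set from the Hugging Face Hub.
    # Splits are versioned (e.g. "v0.1.2"); check the dataset card
    # for the current release.
    tasks = load_dataset("bigcode/bigcodebench", split="v0.1.2")

    task = tasks[0]
    print(task["task_id"])          # task identifier, e.g. "BigCodeBench/0"
    print(task["libs"])             # libraries the reference solution imports
    print(task["instruct_prompt"])  # natural-language instruction variant
    print(task["complete_prompt"])  # code-completion variant of the same task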

Paper: https://arxiv.org/abs/2406.15877

Progress Over Time

[Interactive timeline showing model performance evolution on BigCodeBench-Full, distinguishing open and proprietary models and tracing the state-of-the-art frontier.]

BigCodeBench-Full Leaderboard

1 model

Rank  Model                       Organization               Params  Context  Cost           Score
1     Qwen2.5-Coder 32B Instruct  Alibaba Cloud / Qwen Team  32B     128K     $0.09 / $0.09  0.496

FAQ

Common questions about BigCodeBench-Full

Q: What is BigCodeBench-Full?
A: BigCodeBench-Full evaluates large language models on complex, practical programming tasks via code generation: 1,140 fine-grained tasks across 7 domains that require invoking function calls from 139 libraries while following complex instructions.

Q: Where can I find the BigCodeBench-Full paper?
A: The paper is available at https://arxiv.org/abs/2406.15877 and provides detailed information about the benchmark's methodology, dataset creation, and evaluation criteria.

Q: Which model leads the BigCodeBench-Full leaderboard?
A: The leaderboard currently ranks one model: Qwen2.5-Coder 32B Instruct by Alibaba Cloud / Qwen Team leads with a score of 0.496. With a single entry, the average score across all models is also 0.496.

Q: What is the highest BigCodeBench-Full score?
A: The highest score is 0.496, achieved by Qwen2.5-Coder 32B Instruct from Alibaba Cloud / Qwen Team.
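
For context on what a score represents: a task counts as solved only when the model's generated code passes that task's unit tests. The sketch below shows the shape of that pass/fail check; it is a simplified, in-process stand-in for the official harness, which executes candidates in a sandbox, and the TestCases class name is an assumption about how the dataset's test field is structured.

    import unittest

    def passes_tests(candidate_code: str, test_code: str) -> bool:
        """Return True only if the candidate passes every unit test.

        WARNING: exec() runs untrusted code; the official harness uses a
        sandboxed runner rather than executing candidates in-process.
        """
        try:
            env: dict = {}
            exec(candidate_code, env)  # define the candidate's function(s)
            exec(test_code, env)       # define the task's TestCases class
            suite = unittest.TestLoader().loadTestsFromTestCase(env["TestCases"])
            result = unittest.TextTestRunner(verbosity=0).run(suite)
            return result.wasSuccessful()
        except Exception:
            return False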

Q: How many models have been evaluated on BigCodeBench-Full?
A: One model has been evaluated on the benchmark, with 0 verified results and 1 self-reported result.

Q: What categories does BigCodeBench-Full fall under?
A: BigCodeBench-Full is categorized under general and reasoning, and it evaluates text models.