MBPP+

MBPP+ is an enhanced version of MBPP (Mostly Basic Python Problems) that supplies about 35x more test cases, so that solutions which pass only the original handful of tests are caught. MBPP is a benchmark of 974 crowd-sourced Python programming problems designed to be solvable by entry-level programmers, covering programming fundamentals and standard library functionality.
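
To illustrate why extra test cases make the evaluation stricter, here is a minimal sketch. The task, the candidate solution, and the test lists are all hypothetical and are not taken from the MBPP or MBPP+ datasets; the helper `passes` is likewise an illustrative name, not part of any evaluation harness.

```python
from typing import Any, Callable, List, Tuple

# Each test is (positional arguments, expected return value).
Test = Tuple[Tuple[Any, ...], Any]


def passes(candidate: Callable[..., Any], tests: List[Test]) -> bool:
    """Return True only if the candidate is correct on every test."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # A crash counts as a failure, as it would in a test harness.
            return False
    return True


# Hypothetical task: return the second-smallest DISTINCT value in a list.
def second_smallest(values: List[int]) -> int:
    # Buggy candidate: it does not skip duplicates of the minimum.
    return sorted(values)[1]


# A small base suite, in the spirit of the few tests per original problem.
base_tests: List[Test] = [
    (([1, 2, 3],), 2),
    (([5, 4, 9],), 5),
]

# An extended suite that adds edge cases (repeated minimum values).
extended_tests: List[Test] = base_tests + [
    (([1, 1, 2],), 2),
    (([3, 3, 3, 7],), 7),
]

base_ok = passes(second_smallest, base_tests)
extended_ok = passes(second_smallest, extended_tests)
print("passes base tests:", base_ok)
print("passes extended tests:", extended_ok)
```

The candidate is accepted by the base suite but rejected once the edge cases are included, which is the kind of false positive a larger test suite is meant to expose.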

Progress Over Time

[Interactive chart: model performance on MBPP+ over time, with the state-of-the-art frontier shown for open and proprietary models.]

MBPP+ Leaderboard

3 models • 0 verified

Rank  Organization               Size  Context  Cost           License
1     Alibaba Cloud / Qwen Team  33B   -        -              -
2     Alibaba Cloud / Qwen Team  15B   -        -              -
3     -                          21B   128K     $0.40 / $4.00  -

FAQ

Common questions about MBPP+

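The paper linked for this benchmark is available at https://arxiv.org/abs/2108.07732. It is the paper that introduced the original MBPP dataset, and it describes that dataset's creation, methodology, and evaluation criteria; the extended MBPP+ test suites are distributed by the EvalPlus project.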
The MBPP+ leaderboard ranks 3 AI models based on their performance on this benchmark. Currently, Qwen2.5 32B Instruct by Alibaba Cloud / Qwen Team leads with a score of 0.672. The average score across all models is 0.569.
The highest MBPP+ score is 0.672, achieved by Qwen2.5 32B Instruct from Alibaba Cloud / Qwen Team.
Three models have been evaluated on the MBPP+ benchmark. All three results are self-reported; none has been verified.
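
As a quick consistency check on the figures quoted above (3 models, a top score of 0.672, and an average of 0.569), the following sketch works out what the two remaining models must average. It assumes the average is a plain arithmetic mean over all three models and that the published numbers are rounded to three decimals, so the result is approximate.

```python
# Figures quoted in the FAQ above.
n_models = 3
top_score = 0.672
mean_score = 0.569

# Assumption: the reported average is an unweighted mean of all scores.
total_score = mean_score * n_models          # sum of all three scores
rest_total = total_score - top_score         # sum of the two lower scores
rest_mean = rest_total / (n_models - 1)      # their average

print(f"sum of all scores: {total_score:.3f}")
print(f"average of the two remaining models: {rest_mean:.4f}")
```

Under those assumptions the two lower-ranked models average roughly 0.52, about 0.15 below the leader.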
MBPP+ is categorized under general and reasoning. The benchmark evaluates text models.