MBPP+
MBPP+ is an enhanced version of MBPP (Mostly Basic Python Problems) with significantly more test cases (35x) for more rigorous evaluation. MBPP is a benchmark of 974 crowd-sourced Python programming problems designed to be solvable by entry-level programmers, covering programming fundamentals and standard library functionality.
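To illustrate why additional test cases make evaluation more rigorous, here is a minimal sketch in Python. The task, the `candidate_max` solution, the `passes` helper, and both test lists are hypothetical and are not taken from the MBPP or MBPP+ datasets; they only show how a solution can pass a small test set and still fail a larger one.

```python
# Hypothetical MBPP-style task: "return the largest element of a list".
def candidate_max(values):
    # Deliberately flawed: starting from 0 breaks on all-negative input.
    best = 0
    for v in values:
        if v > best:
            best = v
    return best


def passes(func, tests):
    """Return True only if func returns the expected value for every test."""
    for args, expected in tests:
        try:
            if func(*args) != expected:
                return False
        except Exception:
            # A crash counts as a failed test.
            return False
    return True


# A small base test set (illustrative, not from the real benchmark).
base_tests = [(([1, 2, 3],), 3), (([5, 4],), 5)]

# An extended test set that adds edge cases (also illustrative).
extended_tests = base_tests + [(([-3, -1, -2],), -1), (([0],), 0)]

base_ok = passes(candidate_max, base_tests)
extended_ok = passes(candidate_max, extended_tests)
print(base_ok, extended_ok)  # → True False
```

The flawed solution is accepted by the base tests and rejected once the all-negative input is added, which is the kind of false positive a larger test suite is meant to catch.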
MBPP+ Leaderboard
3 models • 0 verified
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | Alibaba Cloud / Qwen Team | 33B | — | — | |
| 2 | Alibaba Cloud / Qwen Team | 15B | — | — | |
| 3 | Baidu | 21B | 128K | $0.40 $4.00 | |
FAQ
Common questions about MBPP+
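The paper linked for this benchmark is https://arxiv.org/abs/2108.07732, which is the original MBPP paper ("Program Synthesis with Large Language Models"). It describes the methodology, dataset creation, and evaluation criteria of the base benchmark; the extended MBPP+ test cases come from the separate EvalPlus project.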
The MBPP+ leaderboard ranks 3 AI models based on their performance on this benchmark. Currently, Qwen2.5 32B Instruct by Alibaba Cloud / Qwen Team leads with a score of 0.672. The average score across all models is 0.569.
The highest MBPP+ score is 0.672, achieved by Qwen2.5 32B Instruct from Alibaba Cloud / Qwen Team.
3 models have been evaluated on the MBPP+ benchmark, with 0 verified results and 3 self-reported results.
MBPP+ is categorized under general and reasoning. The benchmark evaluates text models.