MBPP EvalPlus
MBPP (Mostly Basic Python Problems) is a benchmark of 974 crowd-sourced Python programming problems designed to be solvable by entry-level programmers. EvalPlus extends MBPP with 35x more test cases per problem, enabling a more rigorous and more precise evaluation of LLM-synthesized code.
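To make the evaluation concrete, here is a minimal sketch of the difference between a base MBPP-style check and the denser EvalPlus-style suite. The problem, solution, and every test below are illustrative inventions, not actual benchmark entries:

```python
# Illustrative MBPP-style task (hypothetical, not an actual benchmark entry):
# "Write a function to find the minimum element in a list."

def min_element(nums):
    """Candidate solution an LLM might generate."""
    smallest = nums[0]
    for n in nums[1:]:
        if n < smallest:
            smallest = n
    return smallest

# MBPP-style base tests: original problems ship with only a few asserts.
assert min_element([3, 1, 2]) == 1
assert min_element([10, 20, 30]) == 10
assert min_element([-5, 0, 5]) == -5

# EvalPlus-style extended tests: many automatically generated inputs
# that stress edge cases the base tests never reach.
assert min_element([7]) == 7                          # single element
assert min_element([2, 2, 2]) == 2                    # all duplicates
assert min_element(list(range(10_000, 0, -1))) == 1   # large input
```

A solution that passes the few base asserts can still fail the extended suite, which is why EvalPlus scores are typically lower than plain MBPP scores.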
Progress Over Time
[Interactive timeline showing model performance evolution on MBPP EvalPlus, with the state-of-the-art frontier highlighted and models marked as Open or Proprietary.]
MBPP EvalPlus Leaderboard
The leaderboard currently lists 2 models.
| Rank | Model | Params | Context | Cost (input / output) | License |
|---|---|---|---|---|---|
| 1 | Llama 3.1 405B Instruct | 405B | 128K | $0.89 / $0.89 | |
| 2 | | 70B | 128K | $0.20 / $0.20 | |
FAQ
Common questions about MBPP EvalPlus
What is MBPP EvalPlus?
MBPP (Mostly Basic Python Problems) is a benchmark of 974 crowd-sourced Python programming problems designed to be solvable by entry-level programmers. EvalPlus extends it with 35x more test cases per problem for a more rigorous, more precise evaluation of LLM-synthesized code.
Where can I read more about MBPP EvalPlus?
The original MBPP paper is available at https://arxiv.org/abs/2108.07732 and details the benchmark's methodology, dataset creation, and evaluation criteria. The EvalPlus extension is described at https://arxiv.org/abs/2305.01210.
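For readers who want to run the evaluation themselves, the EvalPlus toolkit is distributed as a pip package. The sketch below follows the general pattern of its documented workflow, but treat the details as assumptions: `generate_solution` is a hypothetical stand-in for the model under test, and the exact API may vary across evalplus versions.

```python
# Sketch of producing an MBPP EvalPlus submission file.
# Assumes `pip install evalplus`; generate_solution() is a hypothetical
# placeholder for a call to the model being benchmarked.
from evalplus.data import get_mbpp_plus, write_jsonl

def generate_solution(prompt: str) -> str:
    # Hypothetical: query the model with the problem prompt and
    # return the generated Python code as a string.
    raise NotImplementedError

samples = [
    {"task_id": task_id, "solution": generate_solution(problem["prompt"])}
    for task_id, problem in get_mbpp_plus().items()
]
write_jsonl("samples.jsonl", samples)

# The samples are then scored with the toolkit's evaluator, e.g.
#   evalplus.evaluate --dataset mbpp --samples samples.jsonl
```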
Which model leads the MBPP EvalPlus leaderboard?
The MBPP EvalPlus leaderboard ranks 2 AI models by their performance on this benchmark. Llama 3.1 405B Instruct by Meta currently leads with a score of 0.886, and the average score across all models is 0.881.
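With only two models listed, the reported average pins down the runner-up's score. A quick check, assuming the average is an unweighted mean:

```python
# Derive the second model's score from the reported figures,
# assuming the average is an unweighted mean of the two scores.
leader = 0.886   # Llama 3.1 405B Instruct (reported)
average = 0.881  # reported mean across both models

runner_up = 2 * average - leader
print(round(runner_up, 3))  # -> 0.876
```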
What is the highest MBPP EvalPlus score?
The highest MBPP EvalPlus score is 0.886, achieved by Llama 3.1 405B Instruct from Meta.
How many models have been evaluated on MBPP EvalPlus?
2 models have been evaluated on the MBPP EvalPlus benchmark. Both results are self-reported; none have been independently verified.
How is MBPP EvalPlus categorized?
MBPP EvalPlus is categorized under general and reasoning, and it evaluates text models.