MBPP

MBPP (Mostly Basic Python Problems) is a benchmark of 974 crowd-sourced Python programming problems designed to be solvable by entry-level programmers. Each problem consists of a task description, a code solution, and three automated test cases. The problems cover programming fundamentals and standard library functionality.
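
To make the problem format concrete, here is a minimal sketch of how one MBPP-style problem might be represented and checked. The record below is a hypothetical example written for illustration, not an actual dataset entry, and the field names (`text`, `code`, `test_list`) and the helpers `passes_all_tests` and `pass_rate` are assumptions rather than the benchmark's official harness.

```python
# Hypothetical MBPP-style record: a task description, a code solution,
# and three assert-based test cases. Field names are assumed for illustration.
problem = {
    "text": "Write a function to return the largest of three numbers.",
    "code": "def max_of_three(a, b, c):\n    return max(a, b, c)\n",
    "test_list": [
        "assert max_of_three(1, 2, 3) == 3",
        "assert max_of_three(10, 5, 7) == 10",
        "assert max_of_three(-1, -4, -2) == -1",
    ],
}


def passes_all_tests(candidate_source, tests):
    """Return True if the candidate code runs and satisfies every test."""
    namespace = {}
    try:
        exec(candidate_source, namespace)  # define the candidate function
        for test in tests:
            exec(test, namespace)  # each test is a plain assert statement
    except Exception:
        # Any error (syntax error, runtime error, failed assert) counts as a failure.
        return False
    return True


def pass_rate(candidates, problems):
    """Fraction of problems whose candidate solution passes all of its tests."""
    solved = sum(
        passes_all_tests(source, prob["test_list"])
        for source, prob in zip(candidates, problems)
    )
    return solved / len(problems)


if __name__ == "__main__":
    print(passes_all_tests(problem["code"], problem["test_list"]))
```

This sketch runs candidate code with `exec` in the current process, which is only acceptable for trusted code. A harness for model-generated solutions would normally add sandboxing and timeouts.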

Progress Over Time

[Interactive timeline showing model performance evolution on MBPP. Legend: state-of-the-art frontier, open models, proprietary models.]

MBPP Leaderboard

33 models

Rank  Organization               Parameters  Context  Cost              License
1     Sarvam AI                  30B         -        -                 -
2     -                          50B         -        -                 -
3     Alibaba Cloud / Qwen Team  32B         128K     $0.09 / $0.09     -
4     -                          9B          -        -                 -
5     Alibaba Cloud / Qwen Team  73B         131K     $0.35 / $0.40     -
6     -                          8B          -        -                 -
7     Alibaba Cloud / Qwen Team  33B         -        -                 -
7     Alibaba Cloud / Qwen Team  34B         -        -                 -
9     Alibaba Cloud / Qwen Team  7B          -        -                 -
10    Alibaba Cloud / Qwen Team  15B         -        -                 -
11    Alibaba Cloud / Qwen Team  235B        128K     $0.10 / $0.10     -
12    -                          60B         -        -                 -
13    Alibaba Cloud / Qwen Team  72B         -        -                 -
14    Alibaba Cloud / Qwen Team  8B          131K     $0.30 / $0.30     -
15    Mistral AI                 22B         -        -                 -
16    -                          400B        1.0M     $0.17 / $0.60     -
17    -                          -           -        -                 -
18    -                          24B         -        -                 -
19    -                          27B         131K     $0.10 / $0.20     -
20    Alibaba Cloud / Qwen Team  7B          -        -                 -
21    -                          12B         131K     $0.05 / $0.10     -
22    -                          24B         -        -                 -
23    -                          4B          128K     $0.10 / $0.10     -
24    -                          109B        10.0M    $0.08 / $0.30     -
25    Alibaba Cloud / Qwen Team  8B          -        -                 -
26    -                          2B          -        -                 -
26    -                          8B          32K      $20.00 / $40.00   -
28    -                          4B          131K     $0.02 / $0.04     -
29    -                          27B         -        -                 -
30    -                          2B          -        -                 -
30    -                          8B          -        -                 -
32    -                          9B          -        -                 -
33    -                          1B          -        -                 -
FAQ

Common questions about MBPP

The MBPP paper is available at https://arxiv.org/abs/2108.07732. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The MBPP leaderboard ranks 33 AI models based on their performance on this benchmark. Currently, Sarvam-30B by Sarvam AI leads with a score of 0.927. The average score across all models is 0.741.
The highest MBPP score is 0.927, achieved by Sarvam-30B from Sarvam AI.
A total of 33 models have been evaluated on the MBPP benchmark. All 33 results are self-reported; none are verified.
MBPP is listed under the general and reasoning categories. The benchmark evaluates text models.
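
The summary figures quoted above (a leading model, a top score, an average across all models) and the tied ranks visible in the listing (for example, two entries sharing rank 7, followed by rank 9) can be derived from a plain mapping of model names to scores. The sketch below assumes standard competition ranking for ties, which matches the rank pattern in the listing but is not confirmed by the source. The function `summarize`, the model names, and the scores are hypothetical placeholders, not leaderboard data.

```python
from statistics import mean


def summarize(scores):
    """Rank models by score (higher is better) and report leader and average.

    scores: dict mapping a model name to a score in [0, 1].
    Returns (ranked, leader_name, average), where ranked is a list of
    (rank, name, score) tuples using competition ranking for ties.
    """
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ranked = []
    for position, (name, score) in enumerate(ordered, start=1):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # tie: reuse the previous entry's rank
        else:
            rank = position  # otherwise the rank is the 1-based position
        ranked.append((rank, name, score))
    leader_name = ranked[0][1]
    average = round(mean(scores.values()), 3)
    return ranked, leader_name, average


# Hypothetical scores for illustration only.
example_scores = {"model-a": 0.90, "model-b": 0.80, "model-c": 0.80, "model-d": 0.70}
ranked, leader, average = summarize(example_scores)

if __name__ == "__main__":
    for rank, name, score in ranked:
        print(rank, name, score)
    print("leader:", leader, "average:", average)
```

With the placeholder scores, the two tied models share rank 2 and the next model receives rank 4, mirroring how the listing skips a rank after each tie.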