MBPP

Name: MBPP Leaderboard — AI Model Scores
Creator: LLM Stats
License: https://llm-stats.com/legal/terms-of-service

MBPP Leaderboard

33 models

Rank	Model (Organization)	Parameters	Context	Cost
1	Sarvam-30B Sarvam AI	30B	—	—
2	Llama-3.3 Nemotron Super 49B v1 NVIDIA	50B	—	—
3	Qwen2.5-Coder 32B Instruct Alibaba Cloud / Qwen Team	32B	—	—
4	MiniCPM-SALA OpenBMB	9B	—	—
5	Qwen2.5 72B Instruct Alibaba Cloud / Qwen Team	73B	—	—
6	Llama 3.1 Nemotron Nano 8B V1 NVIDIA	8B	—	—
7	Qwen2.5 VL 32B Instruct Alibaba Cloud / Qwen Team	34B	—	—
7	Qwen2.5 32B Instruct Alibaba Cloud / Qwen Team	33B	—	—
9	Qwen2.5-Coder 7B Instruct Alibaba Cloud / Qwen Team	7B	—	—
10	Qwen2.5 14B Instruct Alibaba Cloud / Qwen Team	15B	—	—
11	Qwen3 235B A22B Alibaba Cloud / Qwen Team	235B	—	—
12	Phi-3.5-MoE-instruct Microsoft	60B	—	—
13	Qwen2 72B Instruct Alibaba Cloud / Qwen Team	72B	—	—
14	Qwen2.5 7B Instruct Alibaba Cloud / Qwen Team	8B	—	—
15	Codestral-22B Mistral AI	22B	—	—
16	Llama 4 Maverick Meta	400B	—	—
17	Gemini Diffusion Google	—	—	—
18	Mistral Small 3.1 24B Instruct Mistral AI	24B	—	—
19	Gemma 3 27B Google	27B	—	—
20	Qwen2.5-Omni-7B Alibaba Cloud / Qwen Team	7B	—	—
21	Gemma 3 12B Google	12B	—	—
22	Mistral Small 3 24B Base Mistral AI	24B	—	—
23	Phi-3.5-mini-instruct Microsoft	4B	—	—
24	Llama 4 Scout Meta	109B	—	—
25	Qwen2 7B Instruct Alibaba Cloud / Qwen Team	8B	—	—
26	Gemma 3n E4B Instructed Google	8B	—	—
26	Gemma 3n E4B Instructed LiteRT Preview Google	2B	—	—
28	Gemma 3 4B Google	4B	—	—
29	Gemma 2 27B Google	27B	—	—
30	Gemma 3n E2B Instructed LiteRT (Preview) Google	2B	—	—
30	Gemma 3n E2B Instructed Google	8B	—	—
32	Gemma 2 9B Google	9B	—	—
33	Gemma 3 1B Google	1B	—	—

About this benchmark

What is MBPP?

MBPP (Mostly Basic Python Problems; the source paper expands the name as Mostly Basic Programming Problems) is a benchmark of 974 crowd-sourced Python programming problems designed to be solvable by entry-level programmers. Each problem consists of a task description, a reference code solution, and 3 automated test cases. The problems cover programming fundamentals and standard library functionality.
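A minimal sketch of how a problem of this shape might be checked, assuming each record carries a task description, a reference solution, and three assert-style tests. The record below is hypothetical and written for illustration, not taken from the dataset; the field names (`text`, `code`, `test_list`) and the helpers `passes_all_tests` and `pass_rate` are likewise assumptions.

```python
# Hypothetical MBPP-style record: the task, solution, and tests are
# illustrative assumptions, not an actual dataset entry.
task = {
    "text": "Write a function to return the sum of squares of a list of integers.",
    "code": "def sum_of_squares(nums):\n    return sum(n * n for n in nums)\n",
    "test_list": [
        "assert sum_of_squares([1, 2, 3]) == 14",
        "assert sum_of_squares([]) == 0",
        "assert sum_of_squares([-2, 2]) == 8",
    ],
}


def passes_all_tests(candidate_source, tests):
    """Return True only if the candidate code runs and satisfies every assert."""
    namespace = {}
    try:
        exec(candidate_source, namespace)  # define the candidate's functions
        for test in tests:
            exec(test, namespace)          # a failing assert raises AssertionError
    except Exception:
        return False
    return True


def pass_rate(results):
    """Fraction of problems solved, given one boolean per problem."""
    return sum(results) / len(results)


reference_ok = passes_all_tests(task["code"], task["test_list"])
wrong_candidate = "def sum_of_squares(nums):\n    return sum(nums)\n"
wrong_ok = passes_all_tests(wrong_candidate, task["test_list"])
```

Here `exec` runs the candidate in the current process, which is only acceptable for trusted code; a real evaluation harness would need to isolate model-generated programs and enforce timeouts.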

MBPP is a text benchmark that LLM Stats lists under the reasoning and general categories. LLM Stats tracks 33 models on this benchmark. Scores are shown as fractions between 0 and 1 (for example, 0.927 corresponds to 92.7%). The current average is 0.741, with the leader at 0.927.

Current leaders

Sarvam-30B from Sarvam AI currently leads the MBPP leaderboard with a score of 0.927 across 33 evaluated AI models.

Sarvam-30B (Sarvam AI): 0.9

Llama-3.3 Nemotron Super 49B v1 (NVIDIA): 0.9

Qwen2.5-Coder 32B Instruct (Alibaba Cloud / Qwen Team): 0.9

Source paper

Title: Program Synthesis with Large Language Models
Authors: Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, and 7 others
Published: August 16, 2021
arXiv: 2108.07732

Abstract

This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914 problems that evaluate the ability of the models to synthesize code from more complex text. On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without finetuning on a code dataset, can synthesize solutions to 59.6 percent of the problems from MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves 83.8 percent accuracy. Going further, we study the model's ability to engage in dialog about code, incorporating human feedback to improve its solutions. We find that natural language feedback from a human halves the error rate compared to the model's initial prediction. Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution. We find that even our best models are generally unable to predict the output of a program given a specific input.
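One way to read the abstract's statement that performance "scales log-linearly with model size" is that the solve rate is roughly a straight line in the logarithm of the parameter count. The sketch below fits that form with ordinary least squares. The function `fit_log_linear`, the coefficients, and the data points are assumptions made for illustration: the points are synthetic, generated from a known line, and are not results from the paper or the leaderboard.

```python
import math


def fit_log_linear(param_counts, scores):
    """Least-squares fit of score = a + b * log10(param_count)."""
    xs = [math.log10(n) for n in param_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx            # change in score per tenfold increase in size
    a = mean_y - b * mean_x  # intercept at log10(param_count) == 0
    return a, b


# Synthetic points generated from a known line (NOT measured results),
# used only to show that the fit recovers the coefficients.
TRUE_A, TRUE_B = -100.0, 15.0
sizes = [1e9, 1e10, 1e11]
synthetic_scores = [TRUE_A + TRUE_B * math.log10(n) for n in sizes]

a, b = fit_log_linear(sizes, synthetic_scores)
```

Under this form, every tenfold increase in parameter count adds the same fixed amount `b` to the score, which is what distinguishes log-linear scaling from linear scaling in the raw parameter count.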

FAQ

Common questions about the MBPP benchmark and leaderboard.

What is the MBPP leaderboard?

The MBPP leaderboard ranks 33 AI models based on their performance on this benchmark. Currently, Sarvam-30B by Sarvam AI leads with a score of 0.927. The average score across all models is 0.741.

What is the highest MBPP score?

The highest MBPP score is 0.927, achieved by Sarvam-30B from Sarvam AI.

How many models are evaluated on MBPP?

33 models have been evaluated on the MBPP benchmark, with 0 verified results and 33 self-reported results.

Where can I find the MBPP paper?

The MBPP paper is available at https://arxiv.org/abs/2108.07732. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does MBPP cover?

MBPP is categorized under reasoning and general. The benchmark evaluates text models.

What is the best open-source model on MBPP?

Sarvam-30B by Sarvam AI is the top-ranked open-source model on MBPP, with a score of 0.927 (rank #1).

How recent are the MBPP leaderboard results?

The MBPP leaderboard was last updated in July 2026 and currently includes 33 evaluated models.