MATH

The MATH dataset contains 12,500 challenging competition mathematics problems drawn from AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem comes with a full step-by-step solution and is labeled with a difficulty level from 1 to 5 and one of seven subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.
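As a rough illustration of the structure described above, the following Python sketch models one problem record and filters a collection by subject and difficulty. The class name, field names, and helper functions are assumptions made for this example, not the dataset's official schema, and the toy problems were written for the sketch rather than taken from MATH.

```python
from collections import Counter
from dataclasses import dataclass

# The seven subjects named in the description above.
SUBJECTS = (
    "Prealgebra", "Algebra", "Number Theory", "Counting and Probability",
    "Geometry", "Intermediate Algebra", "Precalculus",
)


@dataclass(frozen=True)
class MathProblem:
    """One benchmark item. Field names are illustrative assumptions."""
    problem: str   # problem statement
    solution: str  # full step-by-step solution
    level: int     # difficulty, 1 (easiest) to 5 (hardest)
    subject: str   # one of SUBJECTS

    def __post_init__(self):
        # Reject records that fall outside the ranges the description gives.
        if not 1 <= self.level <= 5:
            raise ValueError(f"level must be in 1..5, got {self.level}")
        if self.subject not in SUBJECTS:
            raise ValueError(f"unknown subject: {self.subject!r}")


def select(problems, subject=None, min_level=1):
    """Return problems matching an optional subject and a minimum difficulty."""
    return [
        p for p in problems
        if (subject is None or p.subject == subject) and p.level >= min_level
    ]


def count_by_subject(problems):
    """Count how many problems fall under each subject."""
    return Counter(p.subject for p in problems)


# Toy items written for this sketch; they are not taken from the dataset.
toy = [
    MathProblem("What is 2 + 3?", "2 + 3 = 5.", 1, "Prealgebra"),
    MathProblem("Solve x + 1 = 4.", "Subtract 1 from both sides: x = 3.", 2, "Algebra"),
    MathProblem("Solve x^2 = 9 for x > 0.", "Take the positive root: x = 3.", 3, "Algebra"),
]
hard_algebra = select(toy, subject="Algebra", min_level=3)
```

Validating the level and subject at construction time keeps malformed records from silently skewing per-subject or per-level breakdowns later on.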

Progress Over Time

Interactive timeline of model performance on MATH, showing the state-of-the-art frontier and distinguishing open from proprietary models.

MATH Leaderboard

70 models

Rank	Model	Parameters	Context	Cost
1	o3-mini OpenAI	—	200K	$1.10 / $4.40
2	o1 OpenAI	—	200K	$15.00 / $60.00
3	Mistral Large 3 Mistral AI	675B	128K	$2.00 / $5.00
3	Ministral 3 (14B Instruct 2512) Mistral AI	14B	—	—
5	Gemini 2.0 Flash Google	—	1.0M	$0.10 / $0.40
6	Kimi K2 0905 Moonshot AI	1.0T	262K	$0.60 / $2.50
7	Gemma 3 27B Google	27B	131K	$0.10 / $0.20
8	Ministral 3 (8B Instruct 2512) Mistral AI	8B	—	—
9	Gemini 2.0 Flash-Lite Google	—	1.0M	$0.07 / $0.30
10	Gemini 1.5 Pro Google	—	2.1M	$2.50 / $10.00
11	o1-preview OpenAI	—	128K	$15.00 / $60.00
12	GPT-5 OpenAI	—	—	—
13	Gemma 3 12B Google	12B	131K	$0.05 / $0.10
14	Qwen2.5 72B Instruct Alibaba Cloud / Qwen Team	73B	131K	$0.35 / $0.40
14	Qwen2.5 32B Instruct Alibaba Cloud / Qwen Team	33B	—	—
16	Ministral 3 (3B Instruct 2512) Mistral AI	3B	—	—
17	Qwen2.5 VL 32B Instruct Alibaba Cloud / Qwen Team	34B	—	—
18	Phi 4 Microsoft	15B	16K	$0.07 / $0.14
19	Qwen2.5 14B Instruct Alibaba Cloud / Qwen Team	15B	—	—
20	Claude 3.5 Sonnet Anthropic	—	200K	$3.00 / $15.00
21	Gemini 1.5 Flash Google	—	1.0M	$0.15 / $0.60
22	Llama 3.3 70B Instruct Meta	70B	128K	$0.20 / $0.20
23	Nova Pro Amazon	—	300K	$0.80 / $3.20
23	GPT-4o OpenAI	—	128K	$2.50 / $10.00
25	Grok-2 xAI	—	128K	$2.00 / $10.00
26	Gemma 3 4B Google	4B	131K	$0.02 / $0.04
27	Qwen2.5 7B Instruct Alibaba Cloud / Qwen Team	8B	131K	$0.30 / $0.30
28	DeepSeek-V2.5 DeepSeek	236B	8K	$0.14 / $0.28
29	Llama 3.1 405B Instruct Meta	405B	128K	$0.89 / $0.89
30	Nova Lite Amazon	—	300K	$0.06 / $0.24
31	Grok-2 mini xAI	—	—	—
32	GPT-4 Turbo OpenAI	—	128K	$10.00 / $30.00
33	Qwen3 235B A22B Alibaba Cloud / Qwen Team	235B	128K	$0.10 / $0.10
34	Qwen2.5-Omni-7B Alibaba Cloud / Qwen Team	7B	—	—
35	Claude 3.5 Sonnet Anthropic	—	200K	$3.00 / $15.00
36	Mistral Small 3 24B Instruct Mistral AI	24B	32K	$0.07 / $0.14
37	GPT-4o mini OpenAI	—	128K	$0.15 / $0.60
37	Kimi K2 Base Moonshot AI	1.0T	—	—
39	Mistral Small 3.2 24B Instruct Mistral AI	24B	—	—
40	Claude 3.5 Haiku Anthropic	—	200K	$0.80 / $4.00
41	Nova Micro Amazon	—	128K	$0.03 / $0.14
41	Mistral Small 3.1 24B Instruct Mistral AI	24B	—	—
43	Llama 3.2 90B Instruct Meta	90B	128K	$0.35 / $0.40
44	Phi 4 Mini Microsoft	4B	—	—
45	Llama 4 Maverick Meta	400B	1.0M	$0.17 / $0.60
46	Claude 3 Opus Anthropic	—	200K	$15.00 / $75.00
47	Qwen2 72B Instruct Alibaba Cloud / Qwen Team	72B	—	—
48	Phi-3.5-MoE-instruct Microsoft	60B	—	—
49	Gemini 1.5 Flash 8B Google	8B	1.0M	$0.07 / $0.30
50	Qwen2.5-Coder 32B Instruct Alibaba Cloud / Qwen Team	32B	128K	$0.09 / $0.09

Showing models 1–50 of 70 (page 1 of 2).

FAQ

Common questions about MATH

What is the MATH benchmark?

MATH is a dataset of 12,500 challenging competition mathematics problems from AMC 10, AMC 12, AIME, and other mathematics competitions, each with a full step-by-step solution.

Where can I find the MATH paper?

The MATH paper is available at https://arxiv.org/abs/2103.03874. It describes the benchmark methodology, dataset creation, and evaluation criteria.

What is the MATH leaderboard?

The MATH leaderboard ranks 70 AI models by their performance on this benchmark. Currently, o3-mini by OpenAI leads with a score of 0.979. The average score across all models is 0.668.

What is the highest MATH score?

The highest MATH score is 0.979, achieved by o3-mini from OpenAI.

How many models are evaluated on MATH?

70 models have been evaluated on the MATH benchmark, with 0 verified results and 68 self-reported results.

What categories does MATH cover?

MATH is categorized under math and reasoning. The benchmark evaluates text models.
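The leaderboard summary figures quoted on this page (a leading model, its score, and an average across models) can be derived from per-model scores roughly as follows. This is a minimal sketch: the `summarize` helper and the model names are hypothetical, the scores are made up for illustration, and scores are assumed to lie between 0 and 1.

```python
from statistics import mean


def summarize(scores):
    """Return (leader, top_score, average) for a {model: score} mapping."""
    if not scores:
        raise ValueError("no scores to summarize")
    # The leader is the model with the highest score.
    leader = max(scores, key=scores.get)
    return leader, scores[leader], mean(scores.values())


# Hypothetical scores for illustration only; not actual leaderboard values.
example = {"model-a": 0.9, "model-b": 0.7, "model-c": 0.5}
leader, top, avg = summarize(example)
print(f"{leader} leads with {top:.3f}; average {avg:.3f}")
```

The average here is an unweighted mean over models, so a leaderboard with many weak entries reports a lower average without any change at the top.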