MATH

The MATH dataset contains 12,500 challenging competition mathematics problems drawn from the AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem includes a full step-by-step solution and is assigned one of five difficulty levels (1-5) and one of seven subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.
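The dataset's original release ships each problem as a standalone JSON file with "problem", "level", "type", and "solution" fields, organized into split and subject directories. A minimal loader, assuming that layout (the directory path and helper names below are illustrative, not part of any official API), might look like:

```python
import json
from collections import Counter
from pathlib import Path

def load_math_problems(root):
    """Load MATH problems from the original release layout:
    <root>/<split>/<subject>/<id>.json, where each JSON file
    holds the keys "problem", "level", "type", and "solution"."""
    records = []
    for path in sorted(Path(root).rglob("*.json")):
        with open(path, encoding="utf-8") as f:
            records.append(json.load(f))
    return records

def difficulty_histogram(records):
    """Count problems per difficulty level (e.g. "Level 1" .. "Level 5")."""
    return Counter(r["level"] for r in records)
```

The histogram is a quick sanity check that a downloaded copy spans all five difficulty levels before running an evaluation.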

Paper: https://arxiv.org/abs/2103.03874

Progress Over Time

[Interactive timeline showing model performance evolution on MATH, with the state-of-the-art frontier highlighted and models marked as open or proprietary.]

MATH Leaderboard

[Leaderboard table: 70 ranked models, with columns for rank, organization, parameter count, context window, license, and input/output price per 1M tokens; rows 1-50 of 70 shown (page 1 of 2). Organizations represented include OpenAI, Mistral AI, Moonshot AI, Alibaba Cloud / Qwen Team, Microsoft, Amazon, and Anthropic.]

FAQ

Common questions about MATH

What is the MATH dataset?
The MATH dataset contains 12,500 challenging competition mathematics problems drawn from the AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem includes a full step-by-step solution and is assigned one of five difficulty levels (1-5) and one of seven subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.

Where can I find the MATH paper?
The MATH paper is available at https://arxiv.org/abs/2103.03874. It details the benchmark methodology, dataset creation, and evaluation criteria.

How are models ranked on the MATH leaderboard?
The MATH leaderboard ranks 70 AI models by their performance on this benchmark. Currently, o3-mini by OpenAI leads with a score of 0.979. The average score across all models is 0.668.

What is the highest MATH score?
The highest MATH score is 0.979, achieved by o3-mini from OpenAI.

How many models have been evaluated on MATH?
70 models have been evaluated on the MATH benchmark, with 0 verified results and 68 self-reported results.

What categories does MATH fall under?
MATH is categorized under math and reasoning. The benchmark evaluates text models.