
MATH-500

MATH-500 is a subset of the MATH dataset containing 500 challenging competition mathematics problems from AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem includes a full step-by-step solution and spans multiple difficulty levels across seven mathematical subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.
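Solutions in the MATH dataset conventionally mark the final answer inside a LaTeX \boxed{...} command, and graders typically extract that span before comparing answers. A minimal sketch of such an extractor (the helper name is illustrative, not this leaderboard's actual grading code, and real graders add answer normalization on top of this):

```python
from typing import Optional


def extract_boxed(solution: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a LaTeX solution,
    tracking brace depth so nested braces (e.g. \\frac{1}{2}) survive."""
    start = solution.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    depth = 1
    out = []
    while i < len(solution) and depth:
        ch = solution[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                break
        out.append(ch)
        i += 1
    # An unclosed \boxed{ means the solution was malformed.
    return "".join(out) if depth == 0 else None
```

For example, `extract_boxed(r"so the answer is \boxed{\frac{1}{2}}.")` returns `\frac{1}{2}`; an exact string match is then the simplest (if strict) correctness check.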

Paper: https://arxiv.org/abs/2103.03874

Progress Over Time

[Interactive timeline showing model performance evolution on MATH-500, with a state-of-the-art frontier line and filters for open vs. proprietary models.]

MATH-500 Leaderboard

32 models • 0 verified
| Rank | Model | Organization | Params | Context | Cost (input) | Cost (output) |
|---|---|---|---|---|---|---|
| 1 | LongCat-Flash-Thinking | Meituan | 560B | 128K | $0.30 | $1.20 |
| 2 | — | Sarvam AI | 105B | — | — | — |
| 3 | — | Zhipu AI | 355B | 131K | $0.40 | $1.60 |
| 4 | — | Zhipu AI | 106B | — | — | — |
| 5 | — | — | 9B | — | — | — |
| 6 | — | — | 1.0T | — | — | — |
| 6 | — | Moonshot AI | 1.0T | 200K | $0.50 | $0.50 |
| 8 | — | Sarvam AI | 30B | — | — | — |
| 8 | — | — | 253B | — | — | — |
| 10 | — | — | 456B | 1.0M | $0.55 | $2.20 |
| 10 | — | — | 69B | 256K | $0.10 | $0.40 |
| 12 | — | — | 50B | — | — | — |
| 13 | — | — | 560B | 128K | $0.30 | $1.20 |
| 14 | — | — | — | 200K | $3.00 | $15.00 |
| 14 | — | Moonshot AI | — | — | — | — |
| 16 | — | — | 456B | — | — | — |
| 17 | — | — | 671B | — | — | — |
| 18 | — | — | 8B | — | — | — |
| 19 | — | — | 4B | — | — | — |
| 20 | — | — | 71B | 128K | $0.10 | $0.40 |
| 21 | — | — | 33B | 128K | $0.12 | $0.18 |
| 22 | — | — | 671B | 164K | $0.28 | $1.14 |
| 23 | — | — | 15B | — | — | — |
| 24 | — | — | 8B | — | — | — |
| 25 | — | Alibaba Cloud / Qwen Team | 33B | — | — | — |
| 25 | — | Alibaba Cloud / Qwen Team | 33B | 33K | $0.15 | $0.60 |
| 27 | — | DeepSeek | 671B | 131K | $0.27 | $1.10 |
| 28 | — | OpenAI | — | 128K | $3.00 | $12.00 |
| 29 | — | — | 8B | — | — | — |
| 30 | — | — | 2B | — | — | — |
| 31 | — | — | 8B | 128K | $0.50 | $0.50 |
| 31 | — | — | 8B | — | — | — |

(— = not listed.)
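Leaderboard cost columns like these are usually quoted in dollars per million tokens, with separate input and output rates (an assumption here; the page does not state its units). Under that assumption, a back-of-the-envelope cost for one full 500-problem run can be sketched as follows; the average token counts are purely illustrative:

```python
def run_cost(in_price: float, out_price: float,
             n_problems: int = 500,
             avg_in_tokens: int = 300,
             avg_out_tokens: int = 2000) -> float:
    """Estimate the USD cost of one benchmark run, assuming prices are
    quoted per million tokens (token averages are illustrative guesses)."""
    total_in = n_problems * avg_in_tokens    # total prompt tokens
    total_out = n_problems * avg_out_tokens  # total completion tokens
    return (total_in * in_price + total_out * out_price) / 1_000_000
```

For instance, at the rank-1 row's prices, `run_cost(0.30, 1.20)` returns 1.245, i.e. about $1.25 per run under these assumed token counts; reasoning-heavy models with long chains of thought would cost more via `avg_out_tokens`.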

FAQ

Common questions about MATH-500

What is MATH-500?
MATH-500 is a subset of the MATH dataset containing 500 challenging competition mathematics problems from AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem includes a full step-by-step solution and spans multiple difficulty levels across seven mathematical subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.

Where can I find the MATH-500 paper?
The MATH-500 paper is available at https://arxiv.org/abs/2103.03874. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How are models ranked on MATH-500?
The MATH-500 leaderboard ranks 32 AI models by their scores on this benchmark. Currently, LongCat-Flash-Thinking by Meituan leads with a score of 0.992; the average score across all models is 0.932.

What is the highest MATH-500 score?
The highest MATH-500 score is 0.992, achieved by LongCat-Flash-Thinking from Meituan.

How many models have been evaluated on MATH-500?
32 models have been evaluated on the MATH-500 benchmark, with 0 verified results and 32 self-reported results.

What categories does MATH-500 fall under?
MATH-500 is categorized under math and reasoning. The benchmark evaluates text models.