MATH-500

Progress Over Time

[Figure: interactive timeline of model performance on MATH-500, showing the state-of-the-art frontier, with open and proprietary models distinguished.]

MATH-500 Leaderboard

32 models
Rank  Organization               Parameters  Context  Cost
1     -                          560B        -        -
2     Sarvam AI                  105B        -        -
3     Zhipu AI                   355B        -        -
4     Zhipu AI                   106B        -        -
5     -                          9B          -        -
6     Moonshot AI                1.0T        -        -
6     -                          1.0T        -        -
8     Sarvam AI                  30B         -        -
8     -                          253B        -        -
10    -                          69B         256K     $0.10 / $0.40
10    -                          456B        -        -
12    -                          50B         -        -
13    -                          560B        -        -
14    Moonshot AI                -           -        -
14    -                          -           -        -
16    -                          456B        -        -
17    -                          671B        -        -
18    -                          8B          -        -
19    -                          4B          -        -
20    -                          71B         -        -
21    -                          33B         -        -
22    -                          671B        164K     $0.28 / $1.14
23    -                          15B         -        -
24    -                          8B          -        -
25    Alibaba Cloud / Qwen Team  33B         -        -
25    Alibaba Cloud / Qwen Team  33B         -        -
27    DeepSeek                   671B        -        -
28    OpenAI                     -           -        -
29    -                          8B          -        -
30    -                          2B          -        -
31    -                          8B          -        -
31    -                          8B          -        -

A dash marks a field with no value listed. Tied models share a rank.
About this benchmark

What is MATH-500?

MATH-500 is a 500-problem subset of the MATH dataset, made up of challenging competition mathematics problems from AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem comes with a full step-by-step solution. The set spans multiple difficulty levels across seven subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.

MATH-500 is a text benchmark that evaluates models on math and reasoning tasks. LLM Stats tracks 32 models on this benchmark, scored on a 0–1 scale. The current average is 0.932, with the leader at 0.992.


Current leaders

LongCat-Flash-Thinking from Meituan currently leads the MATH-500 leaderboard with a score of 0.992 among 32 evaluated AI models. The next two places are:

2. Sarvam-105B (Sarvam AI): 0.986
3. GLM-4.5 (Zhipu AI): 0.982

Source paper

Title: Measuring Mathematical Problem Solving With the MATH Dataset
Authors: Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, and 4 others
Abstract

Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.

FAQ

Common questions about the MATH-500 benchmark and leaderboard.

What is the MATH-500 benchmark?

MATH-500 is a 500-problem subset of the MATH dataset, made up of challenging competition mathematics problems from AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem comes with a full step-by-step solution. The set spans multiple difficulty levels across seven subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.

What is the MATH-500 leaderboard?

The MATH-500 leaderboard ranks 32 AI models based on their performance on this benchmark. Currently, LongCat-Flash-Thinking by Meituan leads with a score of 0.992. The average score across all models is 0.932.

What is the highest MATH-500 score?

The highest MATH-500 score is 0.992, achieved by LongCat-Flash-Thinking from Meituan.

How many models are evaluated on MATH-500?

32 models have been evaluated on the MATH-500 benchmark. All 32 results are self-reported; none are verified.

Where can I find the MATH-500 paper?

The source paper, which introduces the full MATH dataset that MATH-500 is drawn from, is available at https://arxiv.org/abs/2103.03874. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does MATH-500 cover?

MATH-500 is categorized under math and reasoning. The benchmark evaluates text models.

What is the best open-source model on MATH-500?

LongCat-Flash-Thinking by Meituan is the top-ranked open-source model on MATH-500, with a score of 0.992 (rank #1).

Which model offers the best value on MATH-500?

Among models scoring within 10% of the leader, LongCat-Flash-Lite from Meituan is the cheapest, at $0.10 per million input tokens with a score of 0.968.
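
The selection rule in this answer can be sketched in a few lines of Python. This is an illustrative sketch, not the site's actual implementation: it assumes "within 10% of the leader" is a relative threshold (score at least 90% of the top score) and that models without a listed price are skipped. The `Entry` class, the `best_value` function, and the two `hypothetical-model-*` rows are made up for the example; only the LongCat scores and the $0.10 price come from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    model: str
    score: float                 # accuracy on a 0-1 scale
    input_cost: Optional[float]  # USD per million input tokens; None if unlisted


def best_value(entries: list[Entry], tolerance: float = 0.10) -> Optional[Entry]:
    """Return the cheapest priced entry scoring within `tolerance` of the leader."""
    if not entries:
        return None
    leader_score = max(e.score for e in entries)
    # Assumption: "within 10%" is read as a relative margin below the top score.
    threshold = leader_score * (1 - tolerance)
    candidates = [
        e for e in entries
        if e.input_cost is not None and e.score >= threshold
    ]
    return min(candidates, key=lambda e: e.input_cost, default=None)


entries = [
    Entry("LongCat-Flash-Thinking", 0.992, None),  # from the page; no price shown
    Entry("LongCat-Flash-Lite", 0.968, 0.10),      # from the page
    Entry("hypothetical-model-a", 0.950, 0.28),    # made-up row for illustration
    Entry("hypothetical-model-b", 0.850, 0.05),    # made-up row: cheap, below threshold
]

winner = best_value(entries)
print(winner.model)  # LongCat-Flash-Lite
```

Under an absolute reading of the margin (score at least 0.992 - 0.10) the threshold would be 0.892 instead of about 0.893, which gives the same result for this sample data.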

How is MATH-500 scored?

MATH-500 is scored by accuracy, reported on a 0–1 scale. Higher scores indicate better performance.
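
As a rough illustration of what these scores mean, the sketch below converts between a count of correct answers and a 0–1 accuracy. It assumes a single graded answer per problem over all 500 problems; self-reported results may use other evaluation settings, and the function names are hypothetical.

```python
def accuracy(num_correct: int, num_problems: int = 500) -> float:
    """Fraction of problems answered correctly, on a 0-1 scale."""
    if not 0 <= num_correct <= num_problems:
        raise ValueError("num_correct must be between 0 and num_problems")
    return num_correct / num_problems


def problems_solved(score: float, num_problems: int = 500) -> int:
    """Convert a 0-1 accuracy back to the nearest count of correct answers."""
    return round(score * num_problems)


print(accuracy(496))           # 0.992
print(problems_solved(0.992))  # 496
```

Under that assumption, the leading score of 0.992 corresponds to 496 of 500 problems, and each additional problem solved moves the score by 0.002.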

How recent are the MATH-500 leaderboard results?

The MATH-500 leaderboard was last updated in July 2026 and currently includes 32 evaluated models.