MATH-500
MATH-500 Leaderboard
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | Meituan | 560B | — | — | |
| 2 | Sarvam AI | 105B | — | — | |
| 3 | Zhipu AI | 355B | — | — | |
| 4 | Zhipu AI | 106B | — | — | |
| 5 | NVIDIA | 9B | — | — | |
| 6 | Moonshot AI | 1.0T | — | — | |
| 6 | Moonshot AI | 1.0T | — | — | |
| 8 | Sarvam AI | 30B | — | — | |
| 8 | | 253B | — | — | |
| 10 | Meituan | 69B | 256K | $0.10 / $0.40 | |
| 10 | MiniMax | 456B | — | — | |
| 12 | | 50B | — | — | |
| 13 | Meituan | 560B | — | — | |
| 14 | Moonshot AI | — | — | — | |
| 14 | Anthropic | — | — | — | |
| 16 | MiniMax | 456B | — | — | |
| 17 | DeepSeek | 671B | — | — | |
| 18 | | 8B | — | — | |
| 19 | Microsoft | 4B | — | — | |
| 20 | DeepSeek | 71B | — | — | |
| 21 | DeepSeek | 33B | — | — | |
| 22 | DeepSeek | 671B | 164K | $0.28 / $1.14 | |
| 23 | DeepSeek | 15B | — | — | |
| 24 | DeepSeek | 8B | — | — | |
| 25 | Alibaba Cloud / Qwen Team | 33B | — | — | |
| 25 | Alibaba Cloud / Qwen Team | 33B | — | — | |
| 27 | DeepSeek | 671B | — | — | |
| 28 | OpenAI | — | — | — | |
| 29 | DeepSeek | 8B | — | — | |
| 30 | DeepSeek | 2B | — | — | |
| 31 | | 8B | — | — | |
| 31 | | 8B | — | — | |
What is MATH-500?
MATH-500 is a subset of the MATH dataset containing 500 challenging competition mathematics problems from AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem includes a full step-by-step solution. The problems span multiple difficulty levels across seven mathematical subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.
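To illustrate how a step-by-step solution can be reduced to a single gradable answer, here is a minimal Python sketch. It assumes the MATH convention of marking the final answer with `\boxed{...}` inside the solution text. The function name `extract_boxed` is hypothetical, and this is not necessarily how LLM Stats or any particular evaluation harness grades responses.

```python
from typing import Optional


def extract_boxed(solution: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a solution, or None.

    Assumption: the final answer is wrapped in \\boxed{...}, as in the
    reference solutions of the MATH dataset.
    """
    marker = "\\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None  # no boxed answer present

    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    for ch in solution[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)  # matching close brace found
        chars.append(ch)
    return None  # braces never balanced


example = r"Adding the fractions gives $\boxed{\frac{1}{2}}$."
print(extract_boxed(example))
```

Brace counting is used instead of a simple pattern match because answers such as `\frac{1}{2}` contain nested braces. Comparing the extracted string with a reference answer would still need normalization, which this sketch leaves out.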
MATH-500 is a text benchmark that evaluates models on math and reasoning tasks. LLM Stats tracks 32 models on this benchmark, scored on a 0–1 scale. The current average is about 0.9, and the leader scores 0.992 (1.0 when rounded to one decimal place).
Compare leaders on the best AI for math and best AI for reasoning leaderboards.
Current leaders
Among the 32 evaluated AI models, LongCat-Flash-Thinking from Meituan currently leads the MATH-500 leaderboard with a score of 0.992.
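To make the 0–1 scale concrete, the sketch below converts a score into a count of solved problems. It assumes the score is plain accuracy, that is, the fraction of the 500 problems answered correctly. The page does not state the exact scoring method, and the helper name `solved_count` is hypothetical.

```python
NUM_PROBLEMS = 500  # size of MATH-500


def solved_count(score: float, total: int = NUM_PROBLEMS) -> int:
    """Convert a 0-1 accuracy into a number of correctly answered problems.

    Assumption: the score is correct / total with no partial credit.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be on the 0-1 scale")
    return round(score * total)


leader_solved = solved_count(0.992)
print(leader_solved, NUM_PROBLEMS - leader_solved)  # prints "496 4"
```

Under that assumption, a score of 0.992 corresponds to 496 correct answers and 4 misses, so a single problem moves a score by 0.002.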
Source paper
- Title: Measuring Mathematical Problem Solving With the MATH Dataset
- Authors: Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, and 4 others
- Published: arXiv:2103.03874
Abstract
Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.
FAQ
Common questions about the MATH-500 benchmark and leaderboard.