MATH
The MATH dataset contains 12,500 challenging competition mathematics problems drawn from the AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem includes a full step-by-step solution and is rated on a difficulty scale of 1–5 across seven subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.
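For readers who want to inspect the data directly, the original release from Hendrycks et al. (https://github.com/hendrycks/math) ships each problem as a standalone JSON file with `problem`, `level`, `type` (subject), and `solution` fields. Below is a minimal sketch that tallies the corpus by subject and difficulty, assuming the archive has been unpacked to a local `MATH/` directory (a placeholder path):

```python
import json
from collections import Counter
from pathlib import Path

# Placeholder path: assumes the official MATH release has been
# unpacked locally, with subject subdirectories under train/.
DATA_DIR = Path("MATH/train")

subjects = Counter()
levels = Counter()

# Each problem is a standalone JSON file with "problem", "level"
# (e.g. "Level 3"), "type" (subject), and "solution" fields.
for path in DATA_DIR.glob("*/*.json"):
    record = json.loads(path.read_text())
    subjects[record["type"]] += 1
    levels[record["level"]] += 1

print("Problems per subject:", dict(subjects))
print("Problems per difficulty level:", dict(levels))
```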
Progress Over Time
[Interactive timeline showing model performance evolution on MATH, plotting the state-of-the-art frontier across open and proprietary models.]
MATH Leaderboard
70 models
| Rank | Organization | Parameters | Context | Cost (input / output) | License |
|---|---|---|---|---|---|
| 1 | OpenAI | — | 200K | $1.10 / $4.40 | — |
| 2 | OpenAI | — | 200K | $15.00 / $60.00 | — |
| 3 | Mistral AI | 675B | 128K | $2.00 / $5.00 | — |
| 3 | Mistral AI | 14B | — | — | — |
| 5 | Google | — | 1.0M | $0.10 / $0.40 | — |
| 6 | Moonshot AI | 1.0T | 262K | $0.60 / $2.50 | — |
| 7 | Google | 27B | 131K | $0.10 / $0.20 | — |
| 8 | Mistral AI | 8B | — | — | — |
| 9 | Google | — | 1.0M | $0.07 / $0.30 | — |
| 10 | Google | — | 2.1M | $2.50 / $10.00 | — |
| 11 | OpenAI | — | 128K | $15.00 / $60.00 | — |
| 12 | OpenAI | — | — | — | — |
| 13 | Google | 12B | 131K | $0.05 / $0.10 | — |
| 14 | Alibaba Cloud / Qwen Team | 73B | 131K | $0.35 / $0.40 | — |
| 14 | Alibaba Cloud / Qwen Team | 33B | — | — | — |
| 16 | Mistral AI | 3B | — | — | — |
| 17 | Alibaba Cloud / Qwen Team | 34B | — | — | — |
| 18 | Microsoft | 15B | 16K | $0.07 / $0.14 | — |
| 19 | Alibaba Cloud / Qwen Team | 15B | — | — | — |
| 20 | Anthropic | — | 200K | $3.00 / $15.00 | — |
| 21 | Google | — | 1.0M | $0.15 / $0.60 | — |
| 22 | — | 70B | 128K | $0.20 / $0.20 | — |
| 23 | Amazon | — | 300K | $0.80 / $3.20 | — |
| 23 | OpenAI | — | 128K | $2.50 / $10.00 | — |
| 25 | xAI | — | 128K | $2.00 / $10.00 | — |
| 26 | Google | 4B | 131K | $0.02 / $0.04 | — |
| 27 | Alibaba Cloud / Qwen Team | 8B | 131K | $0.30 / $0.30 | — |
| 28 | DeepSeek | 236B | 8K | $0.14 / $0.28 | — |
| 29 | — | 405B | 128K | $0.89 / $0.89 | — |
| 30 | Amazon | — | 300K | $0.06 / $0.24 | — |
| 31 | xAI | — | — | — | — |
| 32 | OpenAI | — | 128K | $10.00 / $30.00 | — |
| 33 | Alibaba Cloud / Qwen Team | 235B | 128K | $0.10 / $0.10 | — |
| 34 | Alibaba Cloud / Qwen Team | 7B | — | — | — |
| 35 | Anthropic | — | 200K | $3.00 / $15.00 | — |
| 36 | Mistral AI | 24B | 32K | $0.07 / $0.14 | — |
| 37 | OpenAI | — | 128K | $0.15 / $0.60 | — |
| 37 | Moonshot AI | 1.0T | — | — | — |
| 39 | Mistral AI | 24B | — | — | — |
| 40 | Anthropic | — | 200K | $0.80 / $4.00 | — |
| 41 | Amazon | — | 128K | $0.03 / $0.14 | — |
| 41 | Mistral AI | 24B | — | — | — |
| 43 | — | 90B | 128K | $0.35 / $0.40 | — |
| 44 | Microsoft | 4B | — | — | — |
| 45 | Meta | 400B | 1.0M | $0.17 / $0.60 | — |
| 46 | Anthropic | — | 200K | $15.00 / $75.00 | — |
| 47 | Alibaba Cloud / Qwen Team | 72B | — | — | — |
| 48 | Microsoft | 60B | — | — | — |
| 49 | Google | 8B | 1.0M | $0.07 / $0.30 | — |
| 50 | Alibaba Cloud / Qwen Team | 32B | 128K | $0.09 / $0.09 | — |
Showing 1–50 of 70 models.
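The Cost column lists input / output prices, which leaderboards like this conventionally quote per million tokens; under that assumption (the page does not state the unit), the price of a full evaluation run can be estimated with a short sketch:

```python
def run_cost(input_tokens: int, output_tokens: int,
             price_in: float, price_out: float) -> float:
    """Estimated cost in USD, assuming prices are quoted per 1M tokens."""
    return (input_tokens * price_in + output_tokens * price_out) / 1e6

# Example: MATH's 5,000-problem test split at a rough 500 input and
# 700 output tokens per problem, priced at $1.10 / $4.40 (rank 1 above).
print(f"~${run_cost(5_000 * 500, 5_000 * 700, 1.10, 4.40):.2f}")
```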
FAQ
Common questions about MATH
What is MATH?
The MATH dataset contains 12,500 challenging competition mathematics problems drawn from the AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem includes a full step-by-step solution and is rated on a difficulty scale of 1–5 across seven subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.
Where can I find the MATH paper?
The MATH paper is available at https://arxiv.org/abs/2103.03874. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
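Grading on MATH compares a model's final answer against the reference answer, which each solution marks with `\boxed{...}`. A minimal sketch of the extraction step, using manual brace matching because boxed answers often contain nested braces (the official grader applies additional normalization before comparing):

```python
import re

def extract_boxed(solution: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a LaTeX solution.

    Uses manual brace matching because \\boxed arguments can contain
    nested braces (e.g. \\boxed{\\frac{1}{2}}).
    """
    start = solution.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1
    out = []
    while i < len(solution) and depth:
        ch = solution[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                break
        out.append(ch)
        i += 1
    return "".join(out)

assert extract_boxed(r"Thus the answer is $\boxed{\frac{1}{2}}$.") == r"\frac{1}{2}"
```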
How does the MATH leaderboard work?
The MATH leaderboard ranks 70 AI models by their performance on this benchmark. Currently, o3-mini by OpenAI leads with a score of 0.979. The average score across all models is 0.668.
What is the highest MATH score?
The highest MATH score is 0.979, achieved by o3-mini from OpenAI.
How many models have been evaluated on MATH?
70 models have been evaluated on the MATH benchmark, with 0 verified results and 68 self-reported results.
What categories does MATH fall under?
MATH is categorized under math and reasoning. The benchmark evaluates text models.