AIME 2025
All 30 problems from the 2025 American Invitational Mathematics Examination (15 each from AIME I and AIME II), testing olympiad-level mathematical reasoning; every answer is an integer from 000 to 999. The set is used as an AI benchmark to evaluate large language models' ability to solve complex mathematical problems requiring multi-step logical deduction and structured symbolic reasoning.
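Because every AIME answer is a three-digit integer, scoring reduces to exact match on the extracted final answer. Below is a minimal sketch of such a scorer; the regex-based extractor is a hypothetical illustration, not the official evaluation harness, which this page does not specify.

```python
import re

def extract_answer(response: str) -> int | None:
    """Take the last standalone 1-3 digit integer in the model's response.
    A hypothetical heuristic; real harnesses often require a fixed answer format."""
    matches = re.findall(r"\b(\d{1,3})\b", response)
    return int(matches[-1]) if matches else None

def score(responses: list[str], gold: list[int]) -> float:
    """Fraction of problems answered exactly correctly (0.000-1.000)."""
    correct = sum(extract_answer(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)

# Example: 28 of 30 problems correct -> 0.933, the scale the leaderboard uses.
```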
Progress Over Time
[Interactive timeline showing model performance evolution on AIME 2025; the chart traces the state-of-the-art frontier and distinguishes open from proprietary models.]
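The "state-of-the-art frontier" in such timelines is typically the running maximum of scores over model release dates. A minimal sketch of computing that frontier, using made-up (release date, score) pairs rather than actual leaderboard data:

```python
from datetime import date

# Illustrative data points, not taken from the leaderboard.
results = [
    (date(2025, 1, 31), 0.74),
    (date(2025, 4, 16), 0.89),
    (date(2025, 8, 7), 0.95),
]

frontier = []
best = 0.0
for released, score in sorted(results):
    if score > best:      # a model joins the frontier only if it
        best = score      # beats every earlier model's score
        frontier.append((released, score))

print(frontier)
```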
AIME 2025 Leaderboard
105 models • 0 verified
| Rank | Organization | Score | Params | Context | Cost (in / out) |
|---|---|---|---|---|---|
| 1 | OpenAI | 1.000 | — | 400K | $1.75 / $14.00 |
| 1 | Google | 1.000 | — | — | — |
| 1 | Moonshot AI | 1.000 | 1.0T | — | — |
| 1 | xAI | 1.000 | — | — | — |
| 1 | OpenAI | 1.000 | — | 400K | $21.00 / $168.00 |
| 6 | Anthropic | 0.998 | — | 200K | $5.00 / $25.00 |
| 7 | Google | 0.997 | — | 1.0M | $0.50 / $3.00 |
| 8 | Meituan | 0.996 | 560B | 128K | $0.30 / $1.20 |
| 8 | OpenAI | 0.996 | — | 400K | $1.25 / $10.00 |
| 10 | — | 0.992 | 32B | 262K | $0.06 / $0.24 |
| 11 | OpenAI | 0.987 | 21B | — | — |
| 12 | OpenAI | 0.984 | — | 400K | $1.25 / $10.00 |
| 13 | ByteDance | 0.983 | — | — | — |
| 14 | StepFun | 0.973 | 196B | 66K | $0.10 / $0.40 |
| 15 | Sarvam AI | 0.967 | 30B | — | — |
| 15 | Sarvam AI | 0.967 | 105B | — | — |
| 15 | OpenAI | 0.967 | — | 400K | $1.25 / $10.00 |
| 18 | Moonshot AI | 0.961 | 1.0T | 262K | $0.60 / $2.50 |
| 19 | DeepSeek | 0.960 | 685B | — | — |
| 20 | Zhipu AI | 0.957 | 358B | 205K | $0.60 / $2.20 |
| 21 | OpenAI | 0.946 | — | 400K | $1.25 / $10.00 |
| 21 | OpenAI | 0.946 | — | 400K | $1.25 / $10.00 |
| 23 | Xiaomi | 0.941 | 309B | 256K | $0.10 / $0.30 |
| 24 | OpenAI | 0.940 | — | 400K | $1.25 / $10.00 |
| 24 | OpenAI | 0.940 | — | 400K | $1.25 / $10.00 |
| 24 | OpenAI | 0.940 | — | 400K | $1.25 / $10.00 |
| 27 | Zhipu AI | 0.939 | 357B | 131K | $0.55 / $2.19 |
| 28 | xAI | 0.933 | — | 128K | $3.00 / $15.00 |
| 29 | DeepSeek | 0.931 | 685B | — | — |
| 30 | ByteDance | 0.930 | — | — | — |
| 31 | LG AI Research | 0.928 | 236B | 33K | $0.60 / $1.00 |
| 32 | OpenAI | 0.927 | — | 200K | $1.10 / $4.40 |
| 33 | OpenAI | 0.925 | 117B | 131K | $0.10 / $0.50 |
| 34 | Alibaba Cloud / Qwen Team | 0.923 | 235B | 262K | $0.30 / $3.00 |
| 35 | xAI | 0.920 | — | 2.0M | $0.20 / $0.50 |
| 36 | xAI | 0.917 | — | — | — |
| 37 | Zhipu AI | 0.916 | 30B | 128K | $0.07 / $0.40 |
| 38 | OpenAI | 0.911 | — | 400K | $0.25 / $2.00 |
| 38 | Inception | 0.911 | — | 128K | $0.25 / $0.75 |
| 40 | xAI | 0.908 | — | 128K | $0.30 / $0.50 |
| 41 | Meituan | 0.906 | 560B | 128K | $0.30 / $1.20 |
| 42 | — | 0.902 | 120B | 262K | $0.10 / $0.50 |
| 43 | Alibaba Cloud / Qwen Team | 0.897 | 236B | 262K | $0.45 / $3.49 |
| 44 | DeepSeek | 0.893 | 685B | — | — |
| 45 | OpenAI | 0.889 | — | 400K | $1.25 / $10.00 |
| 46 | — | 0.880 | — | 1.0M | $1.25 / $10.00 |
| 47 | Alibaba Cloud / Qwen Team | 0.878 | 80B | 66K | $0.15 / $1.50 |
| 48 | StepFun | 0.877 | 10B | — | — |
| 49 | DeepSeek | 0.875 | 671B | 131K | $0.50 / $2.15 |
| 50 | Baidu | 0.870 | — | — | — |
Showing models 1-50 of 105 (page 1 of 3).
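The Cost column can be used to estimate what a full benchmark run costs. A rough sketch, assuming the two prices are input/output rates per 1M tokens (the page does not state the unit) and using illustrative token counts:

```python
def run_cost(in_price: float, out_price: float,
             in_tokens: int, out_tokens: int) -> float:
    """Dollar cost given per-1M-token prices and total token usage.
    The per-1M-token unit is an assumption, not stated on this page."""
    return (in_price * in_tokens + out_price * out_tokens) / 1_000_000

# e.g. the rank-8 row ($0.30 in / $1.20 out), with a hypothetical
# ~500 prompt tokens and ~20K reasoning tokens per problem, over 30 problems:
total = run_cost(0.30, 1.20, in_tokens=30 * 500, out_tokens=30 * 20_000)
print(f"${total:.2f}")  # ≈ $0.72
```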
FAQ
Common questions about AIME 2025
What is AIME 2025?
All 30 problems from the 2025 American Invitational Mathematics Examination (15 each from AIME I and AIME II), testing olympiad-level mathematical reasoning; every answer is an integer from 000 to 999. The set is used as an AI benchmark to evaluate large language models' ability to solve complex mathematical problems requiring multi-step logical deduction and structured symbolic reasoning.
Where can I find the AIME 2025 paper?
The AIME 2025 paper is available at https://arxiv.org/abs/2503.21380. It describes the benchmark methodology, dataset creation, and evaluation criteria.
Which model leads the AIME 2025 leaderboard?
The AIME 2025 leaderboard ranks 105 AI models by their performance on this benchmark. GPT-5.2 by OpenAI currently leads with a score of 1.000, and the average score across all models is 0.783.
What is the highest AIME 2025 score?
The highest AIME 2025 score is 1.000, achieved by GPT-5.2 from OpenAI.
How many models have been evaluated on AIME 2025?
105 models have been evaluated on the AIME 2025 benchmark: 0 verified results and 105 self-reported results.
What categories does AIME 2025 belong to?
AIME 2025 is categorized under math and reasoning. The benchmark evaluates text models.