AIME 2025

Progress Over Time

Interactive timeline showing model performance evolution on AIME 2025 (legend: state-of-the-art frontier, open models, proprietary models).

AIME 2025 Leaderboard

114 models
Rank  Organization               Parameters  Context  Cost
1     -                          1.0T        -        -
1     OpenAI                     -           400K     $1.75 / $14.00
1     -                          -           -        -
1     -                          -           -        -
1     -                          -           -        -
6     -                          -           1.0M     $5.00 / $25.00
7     -                          -           1.0M     $0.50 / $3.00
8     -                          -           -        -
8     -                          560B        -        -
10    -                          32B         262K     $0.06 / $0.24
11    -                          21B         -        -
12    -                          -           400K     $1.25 / $10.00
13    ByteDance                  -           256K     $0.50 / $3.00
14    -                          196B        66K      $0.10 / $0.40
15    -                          1.0T        -        -
16    Sarvam AI                  105B        -        -
16    Sarvam AI                  30B         -        -
16    -                          -           -        -
19    Moonshot AI                1.0T        -        -
20    -                          685B        -        -
21    Zhipu AI                   358B        -        -
22    OpenAI                     -           -        -
22    -                          -           -        -
24    -                          309B        -        -
25    -                          -           -        -
25    OpenAI                     -           400K     $1.25 / $10.00
25    -                          -           400K     $1.25 / $10.00
28    Zhipu AI                   357B        -        -
29    -                          -           128K     $3.00 / $15.00
30    -                          685B        -        -
30    -                          685B        -        -
32    ByteDance                  -           -        -
33    LG AI Research             236B        -        -
34    OpenAI                     -           -        -
35    -                          117B        131K     $0.10 / $0.50
36    -                          -           -        -
36    Alibaba Cloud / Qwen Team  235B        -        -
38    -                          -           -        -
39    -                          -           2.0M     $0.20 / $0.50
40    -                          -           -        -
41    -                          30B         -        -
42    Inception                  -           128K     $0.25 / $0.75
42    -                          -           400K     $0.25 / $2.00
44    -                          -           1.0M     $0.30 / $2.50
45    -                          -           -        -
46    -                          560B        -        -
47    -                          120B        -        -
48    -                          218B        -        -
49    Alibaba Cloud / Qwen Team  236B        -        -
50    -                          685B        -        -

Showing 1–50 of 114 (page 1 of 3)
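The rank column above (1, 1, 1, 1, 1, 6, 7, 8, 8, 10, ...) looks like standard competition ranking, where tied entries share a rank and the next rank skips ahead. The page does not state this, so the following is a minimal sketch under that assumption. The function name and the example scores are hypothetical and chosen only to reproduce the tie pattern.

```python
def competition_ranks(scores):
    """Assign 1-based ranks, highest score first; tied scores share a rank."""
    ordered = sorted(scores, reverse=True)
    ranks = []
    for position, score in enumerate(ordered, start=1):
        if ranks and score == ordered[position - 2]:
            ranks.append(ranks[-1])  # tie: reuse the previous entry's rank
        else:
            ranks.append(position)   # no tie: rank equals 1-based position
    return ranks


# Hypothetical scores (not taken from the leaderboard).
example_scores = [1.0, 1.0, 1.0, 1.0, 1.0, 0.99, 0.98, 0.97, 0.97, 0.96]
print(competition_ranks(example_scores))
# prints [1, 1, 1, 1, 1, 6, 7, 8, 8, 10]
```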

About this benchmark

What is AIME 2025?

AIME 2025 comprises all 30 problems from the 2025 American Invitational Mathematics Examination (AIME I and AIME II). The problems test olympiad-level mathematical reasoning, and each answer is an integer from 000 to 999. As an AI benchmark, it evaluates large language models' ability to solve complex mathematical problems that require multi-step logical deduction and structured symbolic reasoning.

AIME 2025 is a text benchmark evaluating models on math and reasoning tasks. LLM Stats tracks 114 models on this benchmark, scored on a 0–1 scale. The current average score is 0.791, and the top score is 1.000.


Current leaders

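Kimi K2-Thinking-0905 from Moonshot AI is listed first on the AIME 2025 leaderboard with a score of 1.000 across 114 evaluated AI models. It shares that score with other models, including GPT-5.2 from OpenAI and Grok-4 Heavy from xAI.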

#1 Kimi K2-Thinking-0905 (Moonshot AI): 100.0%
#1 GPT-5.2 (OpenAI): 100.0%
#1 Grok-4 Heavy (xAI): 100.0%

Source paper

Title
Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models
Authors
Haoxiang Sun, Yingqian Min, Zhipeng Chen, Wayne Xin Zhao, and 1 other
Published
Abstract

The rapid advancement of large reasoning models has saturated existing math benchmarks, underscoring the urgent need for more challenging evaluation frameworks. To address this, we introduce OlymMATH, a rigorously curated, Olympiad-level math benchmark comprising 350 problems, each with parallel English and Chinese versions. OlymMATH is the first benchmark to unify dual evaluation paradigms within a single suite: (1) natural language evaluation through OlymMATH-EASY and OlymMATH-HARD, comprising 200 computational problems with numerical answers for objective rule-based assessment, and (2) formal verification through OlymMATH-LEAN, offering 150 problems formalized in Lean 4 for rigorous process-level evaluation. All problems are manually sourced from printed publications to minimize data contamination, verified by experts, and span four core domains. Extensive experiments reveal the benchmark's significant challenge, and our analysis also uncovers consistent performance gaps between languages and identifies cases where models employ heuristic "guessing" rather than rigorous reasoning. To further support community research, we release 582k+ reasoning trajectories, a visualization tool, and expert solutions at https://github.com/RUCAIBox/OlymMATH.

FAQ

Common questions about the AIME 2025 benchmark and leaderboard.

What is the AIME 2025 benchmark?

AIME 2025 comprises all 30 problems from the 2025 American Invitational Mathematics Examination (AIME I and AIME II). The problems test olympiad-level mathematical reasoning, and each answer is an integer from 000 to 999. As an AI benchmark, it evaluates large language models' ability to solve complex mathematical problems that require multi-step logical deduction and structured symbolic reasoning.

What is the AIME 2025 leaderboard?

The AIME 2025 leaderboard ranks 114 AI models based on their performance on this benchmark. Currently, Kimi K2-Thinking-0905 by Moonshot AI leads with a score of 1.000. The average score across all models is 0.791.

What is the highest AIME 2025 score?

The highest AIME 2025 score is 1.000, achieved by Kimi K2-Thinking-0905 from Moonshot AI.

How many models are evaluated on AIME 2025?

114 models have been evaluated on the AIME 2025 benchmark, with 0 verified results and 114 self-reported results.

Where can I find the AIME 2025 paper?

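The paper listed as the source for AIME 2025 is available at https://arxiv.org/abs/2503.21380. According to its abstract, that paper introduces the OlymMATH benchmark and details that benchmark's methodology, dataset construction, and evaluation criteria.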

What categories does AIME 2025 cover?

AIME 2025 is categorized under math and reasoning. The benchmark evaluates text models.

Are there variants of AIME 2025?

Yes. AIME 2025 has 1 related variant: MT-AIME 2025.

What is the best open-source model on AIME 2025?

Kimi K2-Thinking-0905 by Moonshot AI is the top-ranked open-source model on AIME 2025, with a score of 1.000 (rank #1).

Which model offers the best value on AIME 2025?

Among models scoring within 10% of the leader, Nemotron 3 Nano (30B A3B) from NVIDIA is the cheapest, at $0.06 per million input tokens with a score of 0.992.
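The selection rule in this answer can be sketched as a filter followed by a minimum. This is an illustrative sketch, not the site's actual implementation: it assumes "within 10% of the leader" means a score of at least 90% of the top score, and it skips entries without a listed price. The `Entry` type, the `best_value` function, and the placeholder models `model-a`, `model-b`, and `model-c` are hypothetical; only the Nemotron 3 Nano figures come from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    name: str
    score: float                 # accuracy on a 0-1 scale
    input_cost: Optional[float]  # USD per million input tokens; None if unlisted


def best_value(entries, tolerance=0.10):
    """Return the cheapest priced entry scoring within `tolerance` of the leader."""
    if not entries:
        return None
    leader_score = max(e.score for e in entries)
    threshold = leader_score * (1 - tolerance)  # assumed reading of "within 10%"
    eligible = [
        e for e in entries
        if e.score >= threshold and e.input_cost is not None
    ]
    return min(eligible, key=lambda e: e.input_cost) if eligible else None


entries = [
    Entry("model-a", 1.000, 1.75),                   # hypothetical
    Entry("Nemotron 3 Nano (30B A3B)", 0.992, 0.06), # figures from the FAQ answer
    Entry("model-b", 0.800, 0.02),                   # hypothetical, below threshold
    Entry("model-c", 1.000, None),                   # hypothetical, no listed price
]
print(best_value(entries).name)
# prints Nemotron 3 Nano (30B A3B)
```

Ranking by input price alone ignores output-token cost, which can dominate for reasoning models that produce long answers, so a blended price may be a better key in practice.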

How is AIME 2025 scored?

AIME 2025 is scored by accuracy, reported on a 0–1 scale. Higher scores indicate better performance.
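As a rough illustration of accuracy scoring, the sketch below grades one run by exact match on integer answers in the range 000 to 999 and divides by the number of problems. The page does not describe how answers are extracted or how many runs are averaged, and all listed results are self-reported, so the helper names, the parsing rule, and the example data here are assumptions.

```python
def normalize_answer(text):
    """Parse a final answer as an integer in 0..999; return None if invalid."""
    stripped = text.strip()
    if not (stripped.isascii() and stripped.isdigit()):
        return None
    value = int(stripped)  # leading zeros are accepted, so "073" becomes 73
    return value if 0 <= value <= 999 else None


def accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference integers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("at least one problem is required")
    correct = sum(
        normalize_answer(p) == r for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Hypothetical data: 30 reference answers, with one unparseable prediction.
references = list(range(30))
predictions = [str(r) for r in references]
predictions[0] = "not a number"
print(round(accuracy(predictions, references), 3))
# prints 0.967
```

With 30 problems, a single run can only produce multiples of 1/30, so a reported score such as 0.992 suggests an average over several runs.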

How recent are the AIME 2025 leaderboard results?

The AIME 2025 leaderboard was last updated in June 2026 and currently includes 114 evaluated models.