USAMO25
USAMO25 Leaderboard
| Rank | Organization | Context | Cost | License |
|---|---|---|---|---|
| 1 | Anthropic | — | — | — |
| 2 | xAI | — | — | — |
| 3 | xAI | — | — | — |
What is USAMO25?
The 2025 United States of America Mathematical Olympiad (USAMO) benchmark consists of six challenging mathematical problems requiring rigorous proof-based reasoning. USAMO is the most prestigious high school mathematics competition in the United States, serving as the final round of the American Mathematics Competitions series. This benchmark evaluates models on mathematical problem-solving capabilities beyond simple numerical computation, focusing on formal mathematical reasoning and proof generation.
USAMO25 is a text benchmark that evaluates models on math and reasoning tasks. LLM Stats tracks 3 models on this benchmark, scored on a 0–1 scale. The current average is about 0.7, and the leader scores 0.976 (1.0 when rounded to one decimal place).
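To make the 0–1 scale concrete, here is a minimal sketch of one way a proof-graded result could be mapped onto it. It assumes the standard USAMO marking scheme (each of the six problems graded from 0 to 7 points, 42 in total); whether LLM Stats computes its score exactly this way is an assumption, and the function name `normalized_score` and the example grading sheet are hypothetical.

```python
from typing import Sequence

NUM_PROBLEMS = 6            # the benchmark uses the six USAMO 2025 problems
MAX_POINTS_PER_PROBLEM = 7  # standard USAMO marking: each proof earns 0-7 points


def normalized_score(points: Sequence[float]) -> float:
    """Map per-problem points (0-7 each) to a score on a 0-1 scale.

    Assumption: the score is total points divided by the maximum of 42.
    """
    if len(points) != NUM_PROBLEMS:
        raise ValueError(f"expected {NUM_PROBLEMS} per-problem scores")
    for p in points:
        if not 0 <= p <= MAX_POINTS_PER_PROBLEM:
            raise ValueError(f"points must lie in [0, {MAX_POINTS_PER_PROBLEM}]")
    return sum(points) / (NUM_PROBLEMS * MAX_POINTS_PER_PROBLEM)


# Hypothetical grading sheet for illustration only; not leaderboard data.
example_points = [7, 7, 3, 7, 0, 1]
print(round(normalized_score(example_points), 3))  # 25 / 42
```

Under this assumption, a perfect paper maps to 1.0 and partial credit on any proof lowers the score proportionally, which is why proof-based results can differ sharply from final-answer accuracy.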
Current leaders
Claude Mythos Preview from Anthropic currently leads the USAMO25 leaderboard with a score of 0.976, the highest of the 3 evaluated AI models.
Source paper
- Title: Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad
- Authors: Ivo Petrov, Jasper Dekoninck, Lyuben Baltadzhiev, Maria Drencheva, and 4 others
- Published: arXiv:2503.21934
Abstract
Recent math benchmarks for large language models (LLMs) such as MathArena indicate that state-of-the-art reasoning models achieve impressive performance on mathematical competitions like AIME, with the leading model, Gemini-2.5-Pro, achieving scores comparable to top human competitors. However, these benchmarks evaluate models solely based on final numerical answers, neglecting rigorous reasoning and proof generation which are essential for real-world mathematical tasks. To address this, we introduce a comprehensive evaluation of full-solution reasoning for challenging mathematical problems. Using expert human annotators, we evaluated several state-of-the-art reasoning models on the six problems from the 2025 USAMO within hours of their release. Our results reveal that all tested models struggled significantly: only Gemini-2.5-Pro achieves a non-trivial score of 25%, while all other models achieve less than 5%. Through detailed analysis of reasoning traces, we identify the most common failure modes and find several unwanted artifacts arising from the optimization strategies employed during model training. Overall, our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks, highlighting the need for substantial improvements in reasoning and proof generation capabilities.