USAMO25

Name: USAMO25 Leaderboard — AI Model Scores
Creator: LLM Stats
License: https://llm-stats.com/legal/terms-of-service

USAMO25 Leaderboard

3 models

Rank	Model (Organization)	Score	Context	Cost
1	Claude Mythos Preview (Anthropic)	0.976	n/a	n/a
2	Grok-4 Heavy (xAI)	0.619	n/a	n/a
3	Grok-4 (xAI)	0.375	n/a	n/a

About this benchmark

What is USAMO25?

The 2025 United States of America Mathematical Olympiad (USAMO) benchmark consists of six challenging mathematical problems requiring rigorous proof-based reasoning. USAMO is the most prestigious high school mathematics competition in the United States, serving as the final round of the American Mathematics Competitions series. This benchmark evaluates models on mathematical problem-solving capabilities beyond simple numerical computation, focusing on formal mathematical reasoning and proof generation.

USAMO25 is a text benchmark evaluating models on math and reasoning tasks. LLM Stats tracks 3 models on this benchmark, scored on a 0–1 scale. The current average is 0.657, with the leader at 0.976.
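As a rough illustration of the 0–1 scale, the sketch below converts a leaderboard score to a percentage and to an equivalent point total. It assumes the usual USAMO grading of 0 to 7 points per problem (42 points in total across the six problems); whether LLM Stats derives its scores exactly this way is an assumption, and the helper names `to_percent` and `to_points` are hypothetical.

```python
# Assumption: standard USAMO grading, 0-7 points on each of 6 problems.
MAX_POINTS_PER_PROBLEM = 7
NUM_PROBLEMS = 6
MAX_TOTAL = MAX_POINTS_PER_PROBLEM * NUM_PROBLEMS  # 42 points


def to_percent(score: float) -> str:
    """Format a 0-1 leaderboard score as a percentage string."""
    return f"{score * 100:.1f}%"


def to_points(score: float) -> float:
    """Convert a 0-1 score to an equivalent number of points out of 42."""
    return score * MAX_TOTAL


if __name__ == "__main__":
    for name, score in [("Claude Mythos Preview", 0.976),
                        ("Grok-4 Heavy", 0.619),
                        ("Grok-4", 0.375)]:
        print(f"{name}: {to_percent(score)} ~ {to_points(score):.2f} / {MAX_TOTAL}")
```

A fractional point total (for example 15.75 for a score of 0.375) would be consistent with a score averaged over several runs, but the page does not say how the reported figures were aggregated.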

Compare leaders on the best AI for math and best AI for reasoning leaderboards.

Current leaders

Claude Mythos Preview from Anthropic currently leads the USAMO25 leaderboard with a score of 0.976, the highest of the 3 evaluated AI models.

Claude Mythos Preview (Anthropic): 97.6%

Grok-4 Heavy (xAI): 61.9%

Grok-4 (xAI): 37.5%

Source paper

Title: Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad
Authors: Ivo Petrov, Jasper Dekoninck, Lyuben Baltadzhiev, Maria Drencheva, and 4 others
Published: March 27, 2025
arXiv: 2503.21934

Abstract

Recent math benchmarks for large language models (LLMs) such as MathArena indicate that state-of-the-art reasoning models achieve impressive performance on mathematical competitions like AIME, with the leading model, Gemini-2.5-Pro, achieving scores comparable to top human competitors. However, these benchmarks evaluate models solely based on final numerical answers, neglecting rigorous reasoning and proof generation which are essential for real-world mathematical tasks. To address this, we introduce a comprehensive evaluation of full-solution reasoning for challenging mathematical problems. Using expert human annotators, we evaluated several state-of-the-art reasoning models on the six problems from the 2025 USAMO within hours of their release. Our results reveal that all tested models struggled significantly: only Gemini-2.5-Pro achieves a non-trivial score of 25%, while all other models achieve less than 5%. Through detailed analysis of reasoning traces, we identify the most common failure modes and find several unwanted artifacts arising from the optimization strategies employed during model training. Overall, our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks, highlighting the need for substantial improvements in reasoning and proof generation capabilities.

FAQ

Common questions about the USAMO25 benchmark and leaderboard.

What is the USAMO25 leaderboard?

The USAMO25 leaderboard ranks 3 AI models based on their performance on this benchmark. Currently, Claude Mythos Preview by Anthropic leads with a score of 0.976. The average score across all models is 0.657.
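The average quoted above can be reproduced from the three scores listed on this page. The following is a minimal sketch; the `scores` dictionary is simply transcribed from the leaderboard, and the variable names are illustrative.

```python
from statistics import mean

# Scores transcribed from the leaderboard, on the 0-1 scale.
scores = {
    "Claude Mythos Preview": 0.976,
    "Grok-4 Heavy": 0.619,
    "Grok-4": 0.375,
}

# Unweighted mean across the three models.
average = mean(list(scores.values()))

# The leader is the model with the highest score.
leader = max(scores, key=scores.get)

print(f"average = {average:.3f}")  # average = 0.657
print(f"leader  = {leader} ({scores[leader]:.3f})")
```

The sum of the three scores is 1.970, so the mean is 1.970 / 3 ≈ 0.657, which matches the figure reported here.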

What is the highest USAMO25 score?

The highest USAMO25 score is 0.976, achieved by Claude Mythos Preview from Anthropic.

How many models are evaluated on USAMO25?

3 models have been evaluated on the USAMO25 benchmark, with 0 verified results and 3 self-reported results.

Where can I find the USAMO25 paper?

The USAMO25 paper is available at https://arxiv.org/abs/2503.21934. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does USAMO25 cover?

USAMO25 is categorized under math and reasoning. The benchmark evaluates text models.

How recent are the USAMO25 leaderboard results?

The USAMO25 leaderboard was last updated in July 2026 and currently includes 3 evaluated models.