ArXivMath

Progress Over Time

[Interactive timeline: model performance on ArXivMath over time, showing the state-of-the-art frontier and distinguishing open from proprietary models.]

ArXivMath Leaderboard

1 model

Rank  Organization  Context  Cost            License
1     Anthropic     1.0M     $3.00 / $15.00
About this benchmark

What is ArXivMath?

ArXivMath is a final-answer benchmark of research-level mathematics maintained by MathArena. Problems are extracted monthly from recent arXiv paper abstracts, then filtered through automated and manual checks to ensure they are self-contained, non-trivial, and verifiable. Because problems are drawn from active research, the benchmark is more realistic and more closely connected to mathematical research than contest or olympiad benchmarks.

ArXivMath is a text benchmark that evaluates models on math and reasoning tasks. LLM Stats tracks 1 model on this benchmark, scored on a 0–1 scale. With a single entry, the average and the leading score are both 0.722.

Compare leaders on the best AI for math and best AI for reasoning leaderboards.

Current leaders

Claude Sonnet 5 from Anthropic currently leads the ArXivMath leaderboard with a score of 0.722. It is the only AI model evaluated so far.

1. Claude Sonnet 5 (Anthropic): 72.2%

FAQ

Common questions about the ArXivMath benchmark and leaderboard.

What is the ArXivMath leaderboard?

The ArXivMath leaderboard ranks AI models by their performance on this benchmark; 1 model is currently listed. Claude Sonnet 5 by Anthropic leads with a score of 0.722. Because only one model is listed, the average score across all models is also 0.722.
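
The ranking and average described here can be reproduced with a few lines of Python. This is only an illustrative sketch: the `summarize` function and the dictionary layout are hypothetical and are not LLM Stats' actual implementation. The single score is the one reported on this page.

```python
from statistics import mean

def summarize(scores):
    """Rank models by score (highest first) and compute the average.

    `scores` maps a model name to its benchmark score on the 0-1 scale.
    This helper is hypothetical and exists only for illustration.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    leader, top_score = ranked[0]
    return {
        "leader": leader,
        "top_score": top_score,
        "average": mean(list(scores.values())),
        "count": len(ranked),
    }

# The only entry currently listed on the leaderboard.
summary = summarize({"Claude Sonnet 5": 0.722})
print(summary)
```

With a single entry, the leader's score and the average coincide, which is why both are reported as 0.722.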

What is the highest ArXivMath score?

The highest ArXivMath score is 0.722, achieved by Claude Sonnet 5 from Anthropic.

How many models are evaluated on ArXivMath?

1 model has been evaluated on the ArXivMath benchmark, with 0 verified results and 1 self-reported result.

What categories does ArXivMath cover?

ArXivMath is categorized under math and reasoning. The benchmark evaluates text models.

Which model offers the best value on ArXivMath?

Among models scoring within 10% of the leader, Claude Sonnet 5 from Anthropic is the cheapest, at $3.00 per million input tokens with a score of 0.722.
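
The selection rule in this answer can be sketched as follows. This is a minimal sketch under stated assumptions, not the site's actual logic: the `Entry` record and the `best_value` function are hypothetical names, and "within 10% of the leader" is assumed to mean a score of at least 90% of the leader's score. The page does not say whether the threshold is relative or absolute.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    model: str
    score: float        # benchmark score on the 0-1 scale
    input_price: float  # USD per million input tokens

def best_value(entries, tolerance=0.10):
    """Return the cheapest entry scoring within `tolerance` of the leader.

    Assumption: the threshold is relative, i.e. leader_score * (1 - tolerance).
    Returns None for an empty leaderboard.
    """
    if not entries:
        return None
    leader_score = max(e.score for e in entries)
    threshold = leader_score * (1.0 - tolerance)
    candidates = [e for e in entries if e.score >= threshold]
    # Cheapest first; break price ties in favour of the higher score.
    return min(candidates, key=lambda e: (e.input_price, -e.score))

# The only entry currently listed on the leaderboard.
leaderboard = [Entry("Claude Sonnet 5", 0.722, 3.00)]
winner = best_value(leaderboard)
print(winner.model, winner.input_price)
```

With one model on the leaderboard, the leader is trivially also the best-value pick; the rule only becomes informative once cheaper models land within the threshold.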

How recent are the ArXivMath leaderboard results?

The ArXivMath leaderboard was last updated in June 2026 and currently includes 1 evaluated model.