AIME

The American Invitational Mathematics Examination (AIME) benchmark evaluates the mathematical reasoning capabilities of large language models. It contains 30 challenging problems from the AIME 2024 competition that require multi-step reasoning and advanced mathematical insight. Each problem has an integer answer between 000 and 999.
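
The answer format described above (an integer between 000 and 999, typically scored by exact match) can be sketched as follows. This is a minimal illustration, not the benchmark's official evaluation harness; the `parse_answer` and `score` helpers and the zero-padding convention are assumptions for the example.

```python
import re

def parse_answer(model_output: str) -> str | None:
    """Extract the last integer in the model's output and zero-pad it to three digits.

    Returns None if no integer between 0 and 999 is found.
    """
    for token in reversed(re.findall(r"\d+", model_output)):
        value = int(token)
        if 0 <= value <= 999:
            return f"{value:03d}"
    return None

def score(predictions: list[str], gold_answers: list[str]) -> float:
    """Exact-match accuracy over the problem set (gold answers given as '000'-'999')."""
    assert len(predictions) == len(gold_answers)
    correct = sum(
        parse_answer(pred) == gold for pred, gold in zip(predictions, gold_answers)
    )
    return correct / len(gold_answers)

# Toy example with two problems, one answered correctly -> accuracy 0.5
print(score(["The answer is 73.", "So the result is 815."], ["073", "204"]))
```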

Paper

The AIME paper is available at https://arxiv.org/html/2503.21380v2.

Progress Over Time

Interactive timeline showing how model performance on AIME has evolved over time, tracing the state-of-the-art frontier and distinguishing open from proprietary models.

AIME Leaderboard

1 model • 0 verified
Rank | Model | Params | Context | Cost | License | Score
1 | Phi 4 Mini Reasoning (Microsoft) | 4B | n/a | n/a | n/a | 0.575 (self-reported)

FAQ

Common questions about AIME

What is the AIME benchmark?
The American Invitational Mathematics Examination (AIME) benchmark evaluates the mathematical reasoning capabilities of large language models. It contains 30 challenging problems from the AIME 2024 competition that require multi-step reasoning and advanced mathematical insight. Each problem has an integer answer between 000 and 999.

Where can I find the AIME paper?
The AIME paper is available at https://arxiv.org/html/2503.21380v2. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How are models ranked on the AIME leaderboard?
The AIME leaderboard ranks 1 AI model by its performance on this benchmark. Currently, Phi 4 Mini Reasoning by Microsoft leads with a score of 0.575, which is also the average score across all evaluated models.

What is the highest AIME score?
The highest AIME score is 0.575, achieved by Phi 4 Mini Reasoning from Microsoft.

How many models have been evaluated on AIME?
1 model has been evaluated on the AIME benchmark, with 0 verified results and 1 self-reported result.

What categories does AIME fall under?
AIME is categorized under math and reasoning. The benchmark evaluates text models.