AIME
The American Invitational Mathematics Examination (AIME) benchmark evaluates the mathematical reasoning capabilities of large language models. It contains 30 challenging problems from the AIME 2024 competition that require multi-step reasoning and advanced mathematical insight. Each problem has an integer answer between 000 and 999.
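Because every AIME answer is an integer in the 000–999 range, grading reduces to exact match after zero-padding. The sketch below is a minimal illustration of that scoring scheme, not the benchmark's official harness; the function names are hypothetical.

```python
def normalize(ans: str) -> str:
    """Zero-pad an answer to the canonical three-digit AIME form (e.g. "42" -> "042")."""
    return str(int(ans)).zfill(3)

def score(predictions: list[str], answers: list[str]) -> float:
    """Fraction of problems whose predicted integer exactly matches the reference."""
    correct = sum(normalize(p) == normalize(a) for p, a in zip(predictions, answers))
    return correct / len(answers)

# "73" and "073" are the same answer once normalized.
print(score(["73", "100"], ["073", "101"]))  # 0.5
```

A leaderboard score of 0.575 on 30 problems corresponds to roughly 17 of 30 correct under this exact-match rule.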
Progress Over Time
[Interactive timeline showing model performance evolution on AIME; legend: state-of-the-art frontier, open models, proprietary models]
AIME Leaderboard
1 model • 0 verified
| Rank | Model | Organization | Params | Context | Cost | License |
|---|---|---|---|---|---|---|
| 1 | Phi 4 Mini Reasoning | Microsoft | 4B | — | — | — |
FAQ
Common questions about AIME
What is the AIME benchmark?
The American Invitational Mathematics Examination (AIME) benchmark evaluates the mathematical reasoning capabilities of large language models. It contains 30 challenging problems from the AIME 2024 competition that require multi-step reasoning and advanced mathematical insight. Each problem has an integer answer between 000 and 999.
Is there a paper describing AIME?
The AIME paper is available at https://arxiv.org/html/2503.21380v2. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
How does the AIME leaderboard work?
The AIME leaderboard ranks 1 AI model based on its performance on this benchmark. Currently, Phi 4 Mini Reasoning by Microsoft leads with a score of 0.575; with a single entry, this is also the average score across all models.
What is the highest AIME score?
The highest AIME score is 0.575, achieved by Phi 4 Mini Reasoning from Microsoft.
How many models have been evaluated on AIME?
1 model has been evaluated on the AIME benchmark, with 0 verified results and 1 self-reported result.
What categories does AIME fall under?
AIME is categorized under math and reasoning. The benchmark evaluates text models.