FrontierMath
A benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians, covering most major branches of modern mathematics, from number theory and real analysis to algebraic geometry and category theory.
Progress Over Time
[Interactive timeline showing model performance evolution on FrontierMath, with the state-of-the-art frontier highlighted and models grouped as open or proprietary]
FrontierMath Leaderboard
11 models • 0 verified
| Rank | Organization | Score | License | Context | Cost (input / output) |
|---|---|---|---|---|---|
| 1 | OpenAI | 0.476 | — | 1.0M | $2.50 / $15.00 |
| 2 | OpenAI | 0.403 | — | 400K | $1.75 / $14.00 |
| 3 | OpenAI | 0.267 | — | 400K | $1.25 / $10.00 |
| 3 | OpenAI | 0.267 | — | 400K | $1.25 / $10.00 |
| 3 | OpenAI | 0.267 | — | 400K | $1.25 / $10.00 |
| 6 | OpenAI | 0.263 | — | 400K | $1.25 / $10.00 |
| 7 | OpenAI | 0.221 | — | 400K | $0.25 / $2.00 |
| 8 | OpenAI | 0.158 | — | 200K | $2.00 / $8.00 |
| 9 | OpenAI | 0.096 | — | 400K | $0.05 / $0.40 |
| 10 | OpenAI | 0.092 | — | 200K | $1.10 / $4.40 |
| 11 | OpenAI | 0.055 | — | 200K | $15.00 / $60.00 |
FAQ
Common questions about FrontierMath
The FrontierMath paper is available at https://arxiv.org/abs/2411.04872. It details the benchmark methodology, dataset creation, and evaluation criteria.
The FrontierMath leaderboard ranks 11 AI models based on their performance on this benchmark. Currently, GPT-5.4 by OpenAI leads with a score of 0.476. The average score across all models is 0.233.
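The top score and the 0.233 average can be reproduced directly from the leaderboard scores above; a quick sanity check in Python (scores copied from the table):

```python
# Scores from the FrontierMath leaderboard above (11 self-reported results).
scores = [0.476, 0.403, 0.267, 0.267, 0.267, 0.263,
          0.221, 0.158, 0.096, 0.092, 0.055]

best = max(scores)
average = sum(scores) / len(scores)

print(f"best: {best}")            # 0.476 (GPT-5.4)
print(f"average: {average:.3f}")  # 0.233
```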
The highest FrontierMath score is 0.476, achieved by GPT-5.4 from OpenAI.
11 models have been evaluated on the FrontierMath benchmark, with 0 verified results and 11 self-reported results.
FrontierMath is categorized under math and reasoning. The benchmark evaluates text models.