Rankings draw on 74 math benchmarks that test different skills: word problems, multi-step algebra, geometry, calculus, competition-level reasoning, and proof construction. Each benchmark contributes to a model's overall score, weighted by how well it separates models: competition math such as MATH and AIME differentiates frontier models more than saturated benchmarks like GSM8K, where the top 10 models all score above 95%.
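One way to picture decisiveness weighting is the minimal sketch below. It assumes that "decisive" can be approximated by the spread of scores across models, so a saturated benchmark where everyone scores alike gets little weight. The function names `benchmark_weights` and `overall_score`, the model names, and the numbers are all hypothetical illustrations, not the formula actually used for the rankings.

```python
from statistics import pstdev


def benchmark_weights(scores_by_benchmark):
    """Weight each benchmark by the spread of its scores across models.

    Assumption: score spread is a stand-in for how decisive a benchmark is.
    scores_by_benchmark maps benchmark name -> {model name: score in [0, 1]}.
    """
    spreads = {
        name: pstdev(scores.values())
        for name, scores in scores_by_benchmark.items()
    }
    total = sum(spreads.values())
    if total == 0:
        # No benchmark separates the models, so fall back to equal weights.
        return {name: 1 / len(spreads) for name in spreads}
    return {name: spread / total for name, spread in spreads.items()}


def overall_score(model, scores_by_benchmark, weights):
    """Weighted average over the benchmarks this model actually reports."""
    covered = [b for b, s in scores_by_benchmark.items() if model in s]
    weight_sum = sum(weights[b] for b in covered)
    if weight_sum == 0:
        return None  # nothing usable to score this model on
    weighted = sum(weights[b] * scores_by_benchmark[b][model] for b in covered)
    return weighted / weight_sum


# Made-up numbers for illustration only.
example_scores = {
    "GSM8K": {"model_a": 0.97, "model_b": 0.96, "model_c": 0.95},
    "AIME": {"model_a": 0.30, "model_b": 0.60, "model_c": 0.90},
}
example_weights = benchmark_weights(example_scores)
```

With these illustrative inputs, AIME receives most of the weight because the three models are nearly tied on GSM8K, so the model that leads on AIME ends up with the highest overall score.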
Scores are normalized so that benchmarks reported on different scales (percentage correct, problems solved, pass@1) can be compared directly. We take results from official model cards first and add independent reproductions when available. Independently reproduced scores get higher weight because self-reported numbers occasionally rely on favorable evaluation conditions.
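The sketch below illustrates the two ideas in this paragraph under stated assumptions: normalization is shown as simple rescaling to the range 0 to 1 using each benchmark's known minimum and maximum, and source weighting is shown as a weighted average. The names `normalize`, `combine_sources`, and `SOURCE_WEIGHT`, along with the 2-to-1 weight ratio, are hypothetical; the actual normalization and weights are not specified here.

```python
# Hypothetical weights: reproduced results count twice as much as self-reported ones.
SOURCE_WEIGHT = {"self_reported": 1.0, "reproduced": 2.0}


def normalize(raw, scale_min, scale_max):
    """Rescale a raw score to [0, 1] given the benchmark's scale bounds."""
    if scale_max <= scale_min:
        raise ValueError("scale_max must be greater than scale_min")
    return (raw - scale_min) / (scale_max - scale_min)


def combine_sources(results):
    """Weighted average of normalized scores for one model on one benchmark.

    results is a non-empty list of (normalized_score, source) pairs, where
    source is a key of SOURCE_WEIGHT.
    """
    if not results:
        raise ValueError("results must not be empty")
    total_weight = sum(SOURCE_WEIGHT[source] for _, source in results)
    weighted = sum(score * SOURCE_WEIGHT[source] for score, source in results)
    return weighted / total_weight


# Three scales mapped onto the same 0-to-1 range (illustrative values).
pct = normalize(95.0, 0, 100)   # percentage correct
solved = normalize(12, 0, 15)   # problems solved out of 15
pass1 = normalize(0.62, 0, 1)   # pass@1 is already a fraction
```

When a model has both a self-reported and a reproduced result on the same benchmark, `combine_sources` pulls the combined value toward the reproduced number.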
The conservative ranking penalizes models with sparse benchmark coverage. A model that reports only GSM8K can't be ranked reliably against one tested across the full math suite, so its uncertainty band stays wide until enough benchmark evidence accumulates.
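A common way to build this kind of conservative ranking is to sort by the lower edge of an uncertainty band rather than by the mean. The sketch below assumes a band whose half-width shrinks with the square root of the number of benchmarks covered. The function name `conservative_band` and the values of `sigma` and `z` are hypothetical choices for illustration, not the parameters used for the rankings.

```python
import math


def conservative_band(normalized_scores, sigma=0.15, z=1.96):
    """Return (lower, mean, upper) for a model's normalized scores.

    Assumptions: every benchmark contributes independent noise with standard
    deviation sigma, so the half-width of the band is z * sigma / sqrt(n).
    Ranking by the lower bound penalizes models with few reported benchmarks.
    """
    n = len(normalized_scores)
    if n == 0:
        return None  # no evidence at all, nothing to rank
    mean = sum(normalized_scores) / n
    half_width = z * sigma / math.sqrt(n)
    return mean - half_width, mean, mean + half_width


# Illustrative inputs: one near-perfect result versus broad, lower coverage.
sparse = conservative_band([0.97])
broad = conservative_band([0.80] * 25)
```

In this illustration the sparsely covered model has the higher mean but the wider band, so its lower bound falls below that of the broadly tested model and it ranks lower under the conservative ordering.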