Methodology

How we measure AI.

An open, reproducible methodology for ranking language models. We re-run 200+ benchmarks on our own hardware and collect blind preference votes from 200,000 users across 140 countries.

Updated: April 2026
Version: v3
Maintained at: llm-stats.com

1. Verified scoring.

We run every benchmark ourselves. Most leaderboards republish numbers from model cards; we do not. Each Verified score is produced on our hardware against the public benchmark, with prompt set, grader, and sampler frozen. Raw outputs and per-row traces are published with every run.

Self-reported scores still appear on the site, but they are labelled and never used as the headline. When a vendor's published claim disagrees with our verified result, we cite both and explain the gap.

Benchmark             Verified share of model entries
SWE-bench Verified    92%
GPQA Diamond          88%
AIME 2025             81%
HumanEval             76%
MMLU-Pro              71%
HealthBench           95%

Figure 1. Share of Verified entries on a sample of high-traffic benchmarks. The remainder are clearly labelled self-reported.

Versioning is treated as part of the methodology. Older runs stay reproducible: prompts, datasets, and graders are pinned to a version, and a model's historical scores never change.
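One way to make "pinned to a version" concrete is to fingerprint everything that defines a run, so that any change to the prompts, dataset, grader, or sampler yields a different identifier. The sketch below is illustrative only: `RunSpec`, its field names, and the version strings are hypothetical and are not the Hub's actual schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunSpec:
    """Hypothetical record of everything that must stay fixed for a run."""
    benchmark: str
    dataset_version: str
    prompt_set_version: str
    grader_version: str
    sampler: tuple  # (name, value) pairs, e.g. (("temperature", 0.0),)

    def fingerprint(self) -> str:
        # Canonical JSON (sorted keys, fixed separators) so that an
        # identical spec always hashes to the same value.
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


spec = RunSpec(
    benchmark="example-bench",      # placeholder name
    dataset_version="v1",
    prompt_set_version="v1",
    grader_version="v1",
    sampler=(("temperature", 0.0),),
)
run_id = spec.fingerprint()
```

Storing `run_id` next to each score would let an older result be traced back to the exact configuration that produced it.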

2. Arenas.

Where benchmarks measure capability, arenas measure preference. For each prompt, four randomly sampled models each generate a response. The user sees the outputs side by side, with names, providers, and logos hidden, and picks the one they would ship. Order is randomised on every match. Frontier and open-weight models are drawn from the same queue.
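The match setup described above can be sketched as follows. This is a minimal illustration, not the production matchmaker: the function names, the `responses_for` callback, and the "Response N" slot labels are all assumptions made for the example.

```python
import random


def build_blind_match(model_pool, responses_for, prompt, k=4, rng=None):
    """Sample k models and hide their identities behind neutral slot labels."""
    rng = rng or random.Random()
    contestants = rng.sample(model_pool, k)   # uniform, without replacement
    rng.shuffle(contestants)                  # display order randomised per match
    slots = [f"Response {i + 1}" for i in range(k)]
    # What the voter sees: slot label -> output text, with no model names.
    shown = {slot: responses_for(model, prompt)
             for slot, model in zip(slots, contestants)}
    # The slot -> model key stays hidden until the vote is logged.
    key = dict(zip(slots, contestants))
    return shown, key


def record_vote(key, chosen_slot):
    """Reveal identities only after the vote: returns (winner, losers)."""
    winner = key[chosen_slot]
    losers = [model for slot, model in key.items() if slot != chosen_slot]
    return winner, losers
```

Keeping `key` separate from `shown` is the point of the design: nothing passed to the voter carries a name, provider, or logo.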

[Figure: one blind comparison. Model A and Model B are shown with identities hidden; one response is marked as selected.]
Figure 2. A single match. The voter sees only the outputs; identities are revealed after the vote is logged.

Raters span 140 countries and include doctors, engineers, lawyers, and researchers. Ratings come from multi-turn sessions, not single-prompt snapshots, and the modality coverage spans text, code, image, video, and audio.

3. TrueSkill rating.

After every match we update each model's posterior with TrueSkill [1]. Each model carries a mean μ (estimated skill) and a standard deviation σ (the system's uncertainty about that estimate).
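For intuition, here is a minimal sketch of the standard TrueSkill update for the simplest case: two players and no draws. It is not the arena's implementation. How a pick among four responses is mapped onto updates is not specified here, and the constants (μ = 25, σ = 25/3, β = 25/6, τ = 25/300) are the commonly used TrueSkill defaults, assumed purely for illustration.

```python
import math
from dataclasses import dataclass

BETA = 25.0 / 6     # performance noise (assumed default)
TAU = 25.0 / 300    # per-match dynamics noise (assumed default)


@dataclass
class Rating:
    mu: float = 25.0           # estimated skill
    sigma: float = 25.0 / 3    # uncertainty about that estimate


def _pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def update(winner: Rating, loser: Rating) -> tuple[Rating, Rating]:
    """Two-player, no-draw TrueSkill update; returns new (winner, loser)."""
    # Inflate each variance slightly so ratings can keep moving over time.
    var_w = winner.sigma ** 2 + TAU ** 2
    var_l = loser.sigma ** 2 + TAU ** 2
    c2 = 2.0 * BETA ** 2 + var_w + var_l
    c = math.sqrt(c2)
    t = (winner.mu - loser.mu) / c
    v = _pdf(t) / _cdf(t)      # mean shift; larger when the result is surprising
    w = v * (v + t)            # variance shrink factor, between 0 and 1
    new_winner = Rating(winner.mu + var_w / c * v,
                        math.sqrt(var_w * (1.0 - var_w / c2 * w)))
    new_loser = Rating(loser.mu - var_l / c * v,
                       math.sqrt(var_l * (1.0 - var_l / c2 * w)))
    return new_winner, new_loser


a, b = update(Rating(), Rating())
```

After one match between two fresh ratings, the winner's mean rises, the loser's falls by the same amount, and both standard deviations shrink. An upset moves the means further than an expected result does.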

We never publish μ. The published score is the conservative rating r = μ − 3σ, three standard deviations below the mean; under a Gaussian posterior, about 99.7% of the probability mass lies within μ ± 3σ. A model on a single hot streak still has a large σ and therefore a low conservative rating, even if its mean looks competitive. The leaderboard rewards demonstrated skill, not optimistic estimates.
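A small worked example shows how r = μ − 3σ can reorder two models. The model names and numbers below are made up for illustration.

```python
def conservative_rating(mu: float, sigma: float, k: float = 3.0) -> float:
    """Lower-bound style score: k standard deviations below the mean."""
    return mu - k * sigma


# Hypothetical posteriors as (mu, sigma): "streaky" has the higher mean
# but few matches, so its uncertainty is still large.
models = {
    "steady": (27.0, 1.5),
    "streaky": (30.0, 5.0),
}

scores = {name: conservative_rating(mu, sigma) for name, (mu, sigma) in models.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
# scores == {"steady": 22.5, "streaky": 15.0}; ranked == ["steady", "streaky"]
```

The model with the higher mean ranks second here because its uncertainty has not yet narrowed.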

[Figure: posterior distribution over skill, with the mean μ and the published conservative rating r = μ − 3σ marked.]
Figure 3. The published score is the lower bound of the posterior, not the mean. A streak cannot fake a position; σ has to narrow first.

The same conservative rating powers the homepage Performance Index, every modality leaderboard, and the per-arena tables.

4. The Hub.

The Hub is the platform behind every Verified score. It is also the platform researchers use to author and run benchmarks of their own.

Datasets can be supplied as Parquet files, CSV files, or Hugging Face datasets. Graders can be binary, graded, or LLM-as-judge. We handle orchestration, retries, rate limits, cost accounting, and per-row traces. Results land on a versioned, public leaderboard you can share or embed.
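To make the three grader styles concrete, here is a hedged sketch in which each grader maps a model output and a reference answer to a score in [0, 1]. None of this is the Hub's API: the function names, the row format (`output` and `reference` keys), the token-overlap rule, and the 0 to 10 judge scale are all assumptions chosen for the example.

```python
from typing import Callable

Grader = Callable[[str, str], float]  # (output, reference) -> score in [0, 1]


def binary_exact_match(output: str, reference: str) -> float:
    """Binary: full credit only for a case-insensitive exact match."""
    return 1.0 if output.strip().lower() == reference.strip().lower() else 0.0


def graded_token_overlap(output: str, reference: str) -> float:
    """Graded: partial credit, the fraction of reference tokens found in the output."""
    ref_tokens = reference.lower().split()
    if not ref_tokens:
        return 0.0
    out_tokens = set(output.lower().split())
    return sum(tok in out_tokens for tok in ref_tokens) / len(ref_tokens)


def make_judge_grader(judge: Callable[[str], str]) -> Grader:
    """LLM-as-judge: `judge` stands in for any text-in, text-out model call."""
    def grade(output: str, reference: str) -> float:
        verdict = judge(
            f"Reference:\n{reference}\n\nAnswer:\n{output}\n\n"
            "Reply with a single score from 0 to 10."
        )
        try:
            return min(max(float(verdict.strip()) / 10.0, 0.0), 1.0)
        except ValueError:
            return 0.0  # an unparseable verdict earns no credit
    return grade


def score_rows(rows: list[dict], grader: Grader) -> tuple[list[float], float]:
    """Return per-row scores (the trace) and their mean."""
    per_row = [grader(row["output"], row["reference"]) for row in rows]
    mean = sum(per_row) / len(per_row) if per_row else 0.0
    return per_row, mean
```

Because all three share one signature, `score_rows` can keep a per-row trace regardless of which grader is plugged in.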

Submission pipeline (~14k rows / minute per worker):
Dataset: Parquet · CSV · HF
Run: 300+ models
Score: Binary · graded · judge
Publish: Versioned, public

Figure 4. Submit once. The Hub orchestrates the run, scores every row, and publishes a versioned leaderboard. Throughput sits at ~14,000 rows per minute per worker.

The Hub is open to research teams. There are no licence fees, no “featured” tiers, and no gatekeeping on which models can be evaluated.

5. Coverage.

The catalogue spans 200+ benchmarks across coding, reasoning, multimodal, math, and the high-stakes domains where mistakes carry weight.

Domain                Count
Coding & reasoning    45
Multimodal            35
Long context          22
Healthcare            18
Finance               14
Legal                 12

Table 1. Coverage by domain on the Hub. Counts are approximate and grow with each release.

6. References & acknowledgements.

References

[1] Herbrich, R., Minka, T., & Graepel, T. (2007). TrueSkill™: A Bayesian skill rating system. Microsoft Research. microsoft.com/research.

Supported by

Y Combinator (S25), with angel investment from Ivan Burazin (founder, Daytona), Thomas Wolf (Hugging Face), researchers at Harvard Medical, and employees and executives at Google and Datadog.


Cited in

OpenAI announcements, NBC News, TechCrunch, The Lancet, and Seeking Alpha.


Maintainers

Maintained by Jonathan Chávez and Sebastian Crossa, co-founders, llm-stats. Every change is logged. Every prior run stays reproducible.

Cite this methodology

Chávez, J., & Crossa, S. (2026). How we measure AI: Methodology, v3. llm-stats. https://llm-stats.com/research