Methodology

How we measure AI.

An open, reproducible methodology for ranking language models. We re-run 200+ benchmarks on our own hardware and collect blind preference votes from 200,000 users across 140 countries.

Authors: Jonathan Chávez, Sebastian Crossa
Updated: April 2026
Version: v3
Maintained at: llm-stats.com

1. Verified scoring.

We run every benchmark ourselves. Most leaderboards republish numbers from model cards; we do not. Each Verified score is produced on our hardware against the public benchmark, with prompt set, grader, and sampler frozen. Raw outputs and per-row traces are published with every run.

Self-reported scores still appear on the site, but they are labelled and never used as the headline. When a vendor's published claim disagrees with our verified result, we cite both and explain the gap.

Verified share, % of model entries (the remainder are self-reported):

SWE-bench Verified: 92%
GPQA Diamond: 88%
AIME 2025: 81%
HumanEval: 76%
MMLU-Pro: 71%
HealthBench: 95%

Figure 1. Share of Verified entries on a sample of high-traffic benchmarks. The remainder are clearly labelled self-reported.

Versioning is treated as part of the methodology. Older runs stay reproducible: prompts, datasets, and graders are pinned to a version, and a model's historical scores never change.
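As a minimal sketch of what pinning can look like, the snippet below freezes a run's dataset version, prompt set, grader, and sampler settings into one record and derives a content hash from it. The `RunSpec` class and its field names are illustrative assumptions, not the site's actual schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunSpec:
    """Hypothetical record of everything that must stay fixed for a run."""
    benchmark: str
    dataset_version: str
    prompt_set: str
    grader: str
    sampler: tuple  # (name, value) pairs, e.g. (("temperature", 0.0),)

    def fingerprint(self) -> str:
        # Serialise with sorted keys so an identical spec always hashes the same.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Illustrative values only; none of these identifiers come from the site.
spec_v1 = RunSpec("example-bench", "2026.04", "prompts-v3", "exact-match-v1",
                  (("temperature", 0.0), ("max_tokens", 1024)))
spec_v2 = RunSpec("example-bench", "2026.04", "prompts-v3", "exact-match-v2",
                  (("temperature", 0.0), ("max_tokens", 1024)))
```

Storing the fingerprint next to each score makes it cheap to check that two runs used the same frozen inputs. Any change to a pinned field yields a different fingerprint, so the old run stays addressable.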

2. Arenas.

Where benchmarks measure capability, arenas measure preference. For each prompt, four randomly sampled models each generate a response. The user sees the outputs side by side, with names, providers, and logos hidden, and picks the one they would ship. Order is randomised on every match. Frontier and open-weight models pull from the same queue.

[Figure: one blind comparison. Model A and Model B are shown with identities hidden; one response is marked as selected.]

Figure 2. A single match. The voter sees only the outputs; identities are revealed after the vote is logged.
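The matchmaking step can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the `Response N` slot labels, and the dictionary layout are hypothetical, and the real queue may weight or filter models in ways not described here.

```python
import random


def build_match(model_pool, prompt, k=4, rng=None):
    """Pick k distinct models uniformly at random and randomise display order."""
    rng = rng or random.Random()
    contestants = rng.sample(model_pool, k)  # without replacement
    rng.shuffle(contestants)                 # explicit display-order shuffle
    # The voter only ever sees the anonymous slot labels; the slot-to-model
    # mapping is kept aside until the vote has been logged.
    slots = {f"Response {i + 1}": model for i, model in enumerate(contestants)}
    return {"prompt": prompt, "slots": slots}


def reveal(match, chosen_slot):
    """Return (winner, losers) once a vote for chosen_slot has been logged."""
    winner = match["slots"][chosen_slot]
    losers = [m for slot, m in match["slots"].items() if slot != chosen_slot]
    return winner, losers
```

Passing a seeded `random.Random` as `rng` makes a match reproducible for testing, while the default draws fresh randomness for every match.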

Raters span 140 countries and include doctors, engineers, lawyers, and researchers. Ratings come from multi-turn sessions, not single-prompt snapshots, and cover text, code, image, video, and audio.

3. TrueSkill rating.

After every match we update each model's posterior with TrueSkill [1]. Each model carries a mean μ (the estimated skill) and a standard deviation σ (the system's uncertainty about that estimate).
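To make the update concrete, here is a minimal sketch of the standard two-player, no-draw TrueSkill update. It is a simplification, not the site's implementation: how a four-way match is turned into updates, the value of β, and the rating scale are all assumptions here. β = 25/6 is the default from the TrueSkill paper's scale (μ₀ = 25, σ₀ = 25/3), and the dynamics noise τ is omitted.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for its pdf and cdf


def update_win_loss(winner, loser, beta=25 / 6):
    """Two-player TrueSkill update without draws.

    winner and loser are (mu, sigma) tuples; returns the updated pair.
    Note: for extremely lopsided upsets cdf(t) can underflow to 0; a
    production implementation would guard that division.
    """
    mu_w, sigma_w = winner
    mu_l, sigma_l = loser
    c = sqrt(2 * beta**2 + sigma_w**2 + sigma_l**2)  # total performance spread
    t = (mu_w - mu_l) / c                            # normalised skill gap
    v = _STD_NORMAL.pdf(t) / _STD_NORMAL.cdf(t)      # mean correction factor
    w = v * (v + t)                                  # variance correction, in (0, 1)
    new_mu_w = mu_w + sigma_w**2 / c * v
    new_mu_l = mu_l - sigma_l**2 / c * v
    new_sigma_w = sigma_w * sqrt(1 - sigma_w**2 / c**2 * w)
    new_sigma_l = sigma_l * sqrt(1 - sigma_l**2 / c**2 * w)
    return (new_mu_w, new_sigma_w), (new_mu_l, new_sigma_l)
```

Two properties are visible in the formulas: an unexpected result (t strongly negative) moves the means more than an expected one, and every observed match shrinks σ for both players.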

We never publish μ. The published score is the conservative rating r = μ − 3σ: the lower end of the three-sigma interval, which holds about 99.7% of a Gaussian posterior. A model with one hot streak has a large σ and a low conservative rating, even if its mean looks competitive. The leaderboard rewards demonstrated skill, not optimistic estimates.
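A short worked example shows the effect. The two posteriors below are made-up numbers on the default TrueSkill scale, chosen only to illustrate how a wide posterior is penalised; they are not real leaderboard entries.

```python
def conservative_rating(mu, sigma, k=3.0):
    """Lower bound r = mu - k * sigma; k = 3 matches the text above."""
    return mu - k * sigma


# Hypothetical posteriors (illustrative numbers only).
streaky = {"mu": 31.0, "sigma": 6.0}  # few matches: high mean, wide posterior
steady = {"mu": 28.0, "sigma": 1.5}   # many matches: lower mean, narrow posterior

r_streaky = conservative_rating(**streaky)  # 31.0 - 18.0 = 13.0
r_steady = conservative_rating(**steady)    # 28.0 - 4.5 = 23.5
```

The streaky model has the higher mean, yet the steady model ranks above it until more matches narrow the streaky model's σ.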

[Figure: posterior distribution with the conservative rating r = μ − 3σ marked.]

Figure 3. The published score is the lower bound of the posterior, not the mean. A streak cannot fake a position; σ has to narrow first.

The same conservative rating powers the homepage Performance Index, every modality leaderboard, and the per-arena tables.

4. The Hub.

The Hub is the platform behind every Verified score. It is also the platform researchers use to author and run benchmarks of their own.

Datasets can be supplied as Parquet, CSV, or Hugging Face datasets. Graders can be binary, graded, or LLM-as-judge. We handle orchestration, retries, rate limits, cost accounting, and per-row traces. Results land on a versioned, public leaderboard you can share or embed.
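The difference between a binary and a graded grader can be sketched as below. This is not the Hub's API: the function names, the row fields (`id`, `output`, `target`), and the scoring rules are hypothetical, and an LLM-as-judge grader is left out because it needs a model call.

```python
from statistics import mean


def binary_grader(output: str, expected: str) -> float:
    """1.0 on an exact match after trimming whitespace, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def graded_grader(output: str, required_points: list[str]) -> float:
    """Partial credit: fraction of required points mentioned in the output."""
    if not required_points:
        return 0.0
    text = output.lower()
    hits = sum(1 for point in required_points if point.lower() in text)
    return hits / len(required_points)


def score_rows(rows, grader):
    """Score every row and keep a per-row trace next to the aggregate.

    rows must be non-empty; each row is a dict with id, output, and target.
    """
    traces = [
        {"id": row["id"], "output": row["output"],
         "score": grader(row["output"], row["target"])}
        for row in rows
    ]
    return {"score": mean(t["score"] for t in traces), "traces": traces}
```

Because both graders return a value in [0, 1], the same aggregation and trace format work regardless of which grader a benchmark uses.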

Submission pipeline (~14k rows / minute per worker):

Dataset (Parquet · CSV · HF) → Run (300+ models) → Score (binary · graded · judge) → Publish (versioned, public)

Figure 4. Submit once. The Hub orchestrates the run, scores every row, and publishes a versioned leaderboard. Throughput sits at ~14,000 rows per minute per worker.

The Hub is open to research teams. There are no licence fees, no “featured” tiers, and no gatekeeping on which models can be evaluated.

5. Coverage.

The catalogue spans 200+ benchmarks across coding, reasoning, multimodal, math, and the high-stakes domains where mistakes carry weight.

Domain | Count | Examples
Coding & reasoning | | SWE-bench, LiveCodeBench, GPQA, AIME, MMLU-Pro
Multimodal | | MMMU, ChartQA, image · video · audio
Long context | | RULER, NoLiMa, MRCR-v2
Healthcare | | HealthBench, clinical reasoning
Finance | | Audit, valuation, risk
Legal | | Diligence, contract review

Table 1. Coverage by domain on the Hub. Counts are approximate and grow with each release.

6. References & acknowledgements.

References

[1] Herbrich, R., Minka, T., & Graepel, T. (2007). TrueSkill™: A Bayesian skill rating system. Microsoft Research. microsoft.com/research.

Supported by

Y Combinator (S25), with angel investment from Ivan Burazin (founder, Daytona), Thomas Wolf (Hugging Face), researchers at Harvard Medical, and employees and executives at Google and Datadog.

Cited in

OpenAI announcements, NBC News, TechCrunch, The Lancet, and Seeking Alpha.

Maintainers

Maintained by Jonathan Chávez and Sebastian Crossa, co-founders, llm-stats. Every change is logged. Every prior run stays reproducible.

Cite this methodology

Chávez, J., & Crossa, S. (2026).
How we measure AI: Methodology, v3.
llm-stats. https://llm-stats.com/research