How we measure AI.
An open, reproducible methodology for ranking language models. We re-run 200+ benchmarks on our own hardware and collect blind preference votes from 200,000 users across 140 countries.
- Updated: April 2026
- Version: v3
- Maintained at: llm-stats.com
1. Verified scoring.
We run every benchmark ourselves. Most leaderboards republish numbers from model cards; we do not. Each Verified score is produced on our hardware against the public benchmark, with prompt set, grader, and sampler frozen. Raw outputs and per-row traces are published with every run.
Self-reported scores still appear on the site, but they are labelled and never used as the headline. When a vendor's published claim disagrees with our verified result, we cite both and explain the gap.
Versioning is treated as part of the methodology. Older runs stay reproducible: prompts, datasets, and graders are pinned to a version, and a model's historical scores never change.
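One way to picture the pinning described above is a run manifest whose fingerprint changes whenever any pinned component changes. The sketch below is illustrative only: `RunManifest`, its field names, and the example values are assumptions, not the site's actual schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record of what a run pins; not the site's real schema."""
    benchmark: str
    dataset_version: str
    prompt_set_version: str
    grader_version: str
    temperature: float
    seed: int

    def fingerprint(self) -> str:
        # Serialise with sorted keys so identical pins always hash identically.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


# Illustrative values only.
run_a = RunManifest("example-bench", "1.2.0", "3", "2", 0.0, 1234)
run_b = RunManifest("example-bench", "1.2.0", "3", "2", 0.0, 1234)
run_c = RunManifest("example-bench", "1.2.0", "3", "3", 0.0, 1234)  # grader bumped

print(run_a.fingerprint() == run_b.fingerprint())  # same pins, same fingerprint
print(run_a.fingerprint() == run_c.fingerprint())  # any bump yields a new one
```

Storing a score next to such a fingerprint means a later grader or prompt change produces a new record instead of overwriting a historical one.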
2. Arenas.
Where benchmarks measure capability, arenas measure preference. For each prompt, four randomly sampled models each generate a response. The user sees the outputs side by side, with names, providers, and logos hidden, and picks the one they would ship. Order is randomised on every match. Frontier and open-weight models pull from the same queue.
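The match setup can be sketched in a few lines. This is a minimal illustration that assumes uniform sampling without replacement; the function name `build_match`, the slot labels, and the model names are hypothetical, and the real sampler may weight models differently.

```python
import random


def build_match(models, prompt, rng, k=4):
    """Pick k distinct models and return anonymised slots in random order."""
    contestants = rng.sample(models, k)  # uniform, without replacement (assumed)
    rng.shuffle(contestants)             # display order randomised per match
    # The rater only sees slot labels; the label-to-model mapping stays hidden.
    mapping = {f"Response {chr(ord('A') + i)}": m for i, m in enumerate(contestants)}
    return {"prompt": prompt, "visible": list(mapping), "hidden_mapping": mapping}


pool = ["model-1", "model-2", "model-3", "model-4", "model-5", "model-6"]
match = build_match(pool, "Summarise this contract clause.", random.Random(0))
print(match["visible"])  # slot labels only, no model names
```

Keeping the mapping out of the rater-facing payload is what makes the vote blind: the identity is only rejoined after the pick is recorded.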
Raters span 140 countries and include doctors, engineers, lawyers, and researchers. Ratings come from multi-turn sessions, not single-prompt snapshots, and the modality coverage spans text, code, image, video, and audio.
3. TrueSkill rating.
After every match we update each model's posterior with TrueSkill [1]. Each model carries a mean μ (estimated skill) and a standard deviation σ (the system's uncertainty about that estimate).
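For intuition, here is a sketch of the TrueSkill update for the simplest case: two players, one winner, no draws. It is a simplification, not the production update. Four-way matches need the full factor-graph algorithm from [1], the dynamics term that re-inflates σ between matches is omitted, and the parameter values (μ₀ = 25, σ₀ = 25/3, β = 25/6) are the conventional defaults, assumed here because the methodology does not state its settings.

```python
import math

BETA = 25.0 / 6.0  # performance noise; conventional default, assumed here


def _pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def update_win(winner, loser, beta=BETA):
    """Two-player, no-draw TrueSkill update. Ratings are (mu, sigma) tuples."""
    mu_w, sigma_w = winner
    mu_l, sigma_l = loser
    c = math.sqrt(2.0 * beta ** 2 + sigma_w ** 2 + sigma_l ** 2)
    t = (mu_w - mu_l) / c
    v = _pdf(t) / _cdf(t)  # mean shift: larger when the result is surprising
    w = v * (v + t)        # variance shrink factor, between 0 and 1
    new_winner = (mu_w + (sigma_w ** 2 / c) * v,
                  sigma_w * math.sqrt(1.0 - (sigma_w ** 2 / c ** 2) * w))
    new_loser = (mu_l - (sigma_l ** 2 / c) * v,
                 sigma_l * math.sqrt(1.0 - (sigma_l ** 2 / c ** 2) * w))
    return new_winner, new_loser


prior = (25.0, 25.0 / 3.0)
winner_after, loser_after = update_win(prior, prior)
print(winner_after, loser_after)
```

After a single match between two fresh models, the winner's μ rises, the loser's μ falls by the same amount, and both σ values shrink because the system has learned something about each.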
We never publish μ. The published score is the conservative rating r = μ − 3σ, the lower end of the three-sigma interval that contains about 99.7% of the posterior. A model with one hot streak has a large σ and a low conservative rating, even if its mean looks competitive. The leaderboard rewards demonstrated skill, not optimistic estimates.
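A small worked example shows how the conservative rating reorders two models. The (μ, σ) values are made up for illustration, and the Gaussian tail calculation assumes the posterior is normal, as TrueSkill does.

```python
import math


def conservative_rating(mu, sigma, k=3.0):
    """Published score: k standard deviations below the mean."""
    return mu - k * sigma


def prob_skill_above_bound(k=3.0):
    """P(skill > mu - k*sigma) under a Gaussian posterior (one-sided)."""
    return 0.5 * (1.0 + math.erf(k / math.sqrt(2.0)))


# Illustrative values, not real leaderboard data.
hot_streak = (31.0, 6.0)    # higher mean, few matches, large uncertainty
established = (29.0, 1.5)   # lower mean, many matches, small uncertainty

print(conservative_rating(*hot_streak))    # 13.0
print(conservative_rating(*established))   # 24.5
print(round(prob_skill_above_bound(), 4))  # 0.9987
```

The hot-streak model leads on μ but trails by more than eleven points on r, so it ranks lower until further matches bring its σ down.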
The same conservative rating powers the homepage Performance Index, every modality leaderboard, and the per-arena tables.
4. The Hub.
The Hub is the platform behind every Verified score. It is also the platform researchers use to author and run benchmarks of their own.
Datasets can be supplied as Parquet, CSV, or Hugging Face datasets. Graders can be binary, graded, or LLM-as-judge. We handle orchestration, retries, rate limits, cost accounting, and per-row traces. Results land on a versioned, public leaderboard you can share or embed.
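The three grader styles can be pictured as functions that all return a score between 0 and 1. The sketch below is hypothetical throughout: the function names, the keyword-rubric scoring, and the PASS/FAIL judge protocol are assumptions for illustration, not the Hub's API.

```python
from typing import Callable, List


def binary_grader(output: str, reference: str) -> float:
    """Exact match after trimming whitespace: 1.0 or 0.0."""
    return 1.0 if output.strip() == reference.strip() else 0.0


def graded_grader(output: str, rubric: List[str]) -> float:
    """Partial credit: fraction of rubric items mentioned in the output."""
    if not rubric:
        return 0.0
    text = output.lower()
    hits = sum(1 for item in rubric if item.lower() in text)
    return hits / len(rubric)


def judge_grader(output: str, reference: str, judge: Callable[[str], str]) -> float:
    """LLM-as-judge: `judge` is any callable that maps a prompt to a verdict."""
    prompt = (
        "Reply PASS or FAIL.\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {output}"
    )
    return 1.0 if judge(prompt).strip().upper().startswith("PASS") else 0.0


# A stub judge stands in for a real model call.
def stub_judge(prompt: str) -> str:
    return "PASS"


print(binary_grader(" 42 ", "42"))
print(graded_grader("Uses a mutex and a retry loop", ["mutex", "retry", "timeout"]))
print(judge_grader("about 42", "42", stub_judge))
```

Passing the judge in as a callable keeps the grader testable offline and makes the judge model itself something that can be pinned to a version.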
The Hub is open to research teams. There are no licence fees, no “featured” tiers, and no gatekeeping on which models can be evaluated.
5. Coverage.
The catalogue spans 200+ benchmarks across coding, reasoning, multimodal, math, and the high-stakes domains where mistakes carry weight.
Table 1. Coverage by domain on the Hub. Counts are approximate and grow with each release.
6. References & acknowledgements.
References
- [1] Herbrich, R., Minka, T., & Graepel, T. (2007). TrueSkill™: A Bayesian skill rating system. Microsoft Research. microsoft.com/research.
Supported by
Y Combinator (S25), with angel investment from Ivan Burazin (founder, Daytona), Thomas Wolf (Hugging Face), researchers at Harvard Medical, and employees and executives at Google and Datadog.
Cited in
OpenAI announcements, NBC News, TechCrunch, The Lancet, and Seeking Alpha.
Maintainers
Maintained by Jonathan Chávez and Sebastian Crossa, co-founders, llm-stats. Every change is logged. Every prior run stays reproducible.
Cite this methodology
Chávez, J., & Crossa, S. (2026). How we measure AI: Methodology, v3. llm-stats. https://llm-stats.com/research