LLM Stats Score

Methodology v1.0 · published April 30, 2026

The LLM Stats Score is a single 0–100 number summarizing how capable a large language model is across the dimensions that matter for real workloads: reasoning, coding, knowledge, agentic tool use, long context, and vision. It exists because no single public benchmark captures frontier capability, and because composite scores published elsewhere don’t expose their weights or sources.

Components & weights

Axis                Weight   Component benchmarks
Reasoning           25%      GPQA Diamond, AIME 2025, FrontierMath
Coding              25%      LLM Stats Coding Arena, SWE-Bench Verified, Terminal-Bench
Knowledge           15%      HLE, MMMU-Pro, SimpleQA
Tool use & agents   20%      TAU-Bench Retail, Toolathlon, MCP Atlas
Long context        10%      MRCR-v2, AA-LCR
Vision              5%       MMMU, ScreenSpot-Pro, CharXiv-R

Within each axis, component benchmarks are equal-weighted after min-max normalization against the score range observed in the most recent frontier release window.
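As a minimal sketch of this aggregation in Python (the axis weights come from the table above; every benchmark score and window bound below is hypothetical, not real leaderboard data):

```python
# Axis weights from the methodology table above.
AXIS_WEIGHTS = {
    "reasoning": 0.25,
    "coding": 0.25,
    "knowledge": 0.15,
    "tool_use": 0.20,
    "long_context": 0.10,
    "vision": 0.05,
}

def min_max_normalize(score: float, window_min: float, window_max: float) -> float:
    """Rescale a raw benchmark score to [0, 1] against the range observed
    in the most recent frontier release window."""
    return (score - window_min) / (window_max - window_min)

def axis_score(normalized: list[float]) -> float:
    """Equal-weight the component benchmarks within one axis."""
    return sum(normalized) / len(normalized)

def llm_stats_score(axis_scores: dict[str, float]) -> float:
    """Weighted mean of the six axis scores, scaled to 0-100."""
    return 100 * sum(AXIS_WEIGHTS[axis] * s for axis, s in axis_scores.items())

# Hypothetical example: the reasoning axis, with each raw score normalized
# against its own benchmark's frontier window.
reasoning = axis_score([
    min_max_normalize(71.0, window_min=55.0, window_max=80.0),  # GPQA Diamond
    min_max_normalize(86.0, window_min=60.0, window_max=95.0),  # AIME 2025
    min_max_normalize(19.0, window_min=5.0, window_max=30.0),   # FrontierMath
])
```

Because each benchmark is normalized against the current frontier window, an axis score reflects a model's position among recent frontier releases rather than its raw percentage.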

What goes in (and what doesn’t)

A benchmark is eligible for the LLM Stats Score only if its results are reproducible: either lab-published with full methodology, or replicated by the LLM Stats team or an independent third party. Self-reported numbers without a reproducible artifact are tracked on the per-model page but excluded from the composite.

Benchmarks are excluded when they show signs of training-set contamination, when their score distribution saturates (e.g. every frontier model scores within 1 point), or when the benchmark’s license forbids commercial reuse of evaluation data.
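For concreteness, a sketch of the saturation check, using the 1-point spread from the example above (the scores are made up):

```python
def is_saturated(frontier_scores: list[float], spread: float = 1.0) -> bool:
    """True when every frontier model lands within `spread` points of
    every other, i.e. the benchmark no longer separates models."""
    return max(frontier_scores) - min(frontier_scores) < spread

print(is_saturated([99.2, 99.6, 98.9, 99.4]))  # True  -> retire from the composite
print(is_saturated([71.3, 84.8, 66.0]))        # False -> still discriminative
```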

Versioning

Methodology versions are frozen and dated. When axis weights or benchmark eligibility changes — for example when a new frontier benchmark is added or a saturated one is retired — we increment the version, publish a changelog, and keep the old version reachable for reproducibility.

Current version: 1.0 (April 30, 2026).

Citation

LLM Stats (2026). LLM Stats Score (v1.0). LLM Stats. https://llm-stats.com/methodology/llm-stats-score

FAQ

What is the LLM Stats Score?

The LLM Stats Score is a single 0–100 number that summarizes a model's overall capability across reasoning, coding, knowledge, agentic tool use, long context, and vision. It is the weighted aggregation of verified benchmark results, normalized to make scores comparable across model families.

How is it different from other composite scores?

Three things: (1) every input benchmark must be reproducible and either lab-verified or independently replicated; (2) weights are published, versioned, and explained on this page; (3) the underlying per-benchmark scores are linkable from each model page so anyone can audit the composite.

How are benchmarks weighted?

Reasoning and coding each contribute 25%; tool use 20%; knowledge 15%; long context 10%; vision 5%. Within each axis, component benchmarks are equal-weighted after min-max normalization against the score range observed in the most recent frontier release window.
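For illustration, take hypothetical axis scores (on the 0–100 scale) of reasoning 82, coding 75, knowledge 68, tool use 60, long context 71, and vision 77. The composite would be 0.25×82 + 0.25×75 + 0.15×68 + 0.20×60 + 0.10×71 + 0.05×77 = 72.4.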

How often is the score updated?

Continuously. Pricing and live performance metrics revalidate hourly. New benchmark scores are folded in within hours of a verified release. Methodology versions are dated and frozen — when weights change, we publish a new version and link the old one for reproducibility.

How should I cite the LLM Stats Score?

LLM Stats (2026). LLM Stats Score (v1.0). LLM Stats. https://llm-stats.com/methodology/llm-stats-score

Can I reproduce the score myself?

Yes. Each component benchmark is linked from the model page. Raw scores are exposed via the public API. The weighting is the simple weighted mean documented above. Everything we use is either public or available under a permissive license.
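As a sketch of that reproduction (the endpoint path and response fields here are assumptions for illustration, not the documented API shape; consult the public API reference for the real one):

```python
import requests

# Hypothetical endpoint and response shape -- treat the URL and field
# names as placeholders, not the documented API.
API_URL = "https://llm-stats.com/api/models/{model_id}/benchmarks"

AXIS_WEIGHTS = {
    "reasoning": 0.25, "coding": 0.25, "knowledge": 0.15,
    "tool_use": 0.20, "long_context": 0.10, "vision": 0.05,
}

def reproduce_score(model_id: str) -> float:
    """Recompute the composite as the documented weighted mean of
    equal-weighted, normalized per-axis benchmark scores."""
    resp = requests.get(API_URL.format(model_id=model_id), timeout=30)
    resp.raise_for_status()
    axes = resp.json()["axes"]  # assumed shape: {"reasoning": [0-1 scores], ...}
    composite = 0.0
    for axis, weight in AXIS_WEIGHTS.items():
        scores = axes[axis]                      # normalized component scores
        composite += weight * sum(scores) / len(scores)
    return 100 * composite
```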