Coding claims vs hard evidence

BenchmaxxedAI Models

A source-separated report for spotting models whose self-reported coding benchmarks run hotter than verified benchmarks, community coding results, and LLM Stats coding arena votes.

Evidence overlap: 49 models (193 claimed · 89 hard)
Pearson: 0.79 · Spearman: 0.81
Explained variance: 63% (claim signal vs hard signal)
Bucket coverage: 74% (standard coding buckets)
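For reference, the headline agreement numbers above are standard correlation statistics over the overlapping models. A minimal sketch, assuming each model has already been reduced to one aggregated claimed score and one hard score (function and variable names are illustrative):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def agreement_stats(claimed: np.ndarray, hard: np.ndarray) -> dict:
    """Correlation between aggregated claimed and hard coding signals."""
    r, _ = pearsonr(claimed, hard)      # linear agreement
    rho, _ = spearmanr(claimed, hard)   # rank agreement
    return {
        "pearson": r,
        "spearman": rho,
        # With a single predictor, explained variance is simply r squared:
        # 0.79 ** 2 is about 0.62, consistent with the ~63% shown above.
        "explained_variance": r ** 2,
    }
```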
Evidence gap

Claims vs hard evidence

Each model is plotted by its self-reported coding signal against independent hard evidence. Points above the dashed line are over-claiming.

[Scatter plot: self-reported coding signal vs. hard coding evidence; labeled points include LongCat-Flash-Thinking-2601, DeepSeek-R1-0528, Mercury 2, and LongCat-Flash-Thinking]

Compared 49 coding models with enough claimed and hard evidence.

Reported coding claims show a strong relationship with hard evidence.

LongCat-Flash-Thinking-2601 shows the largest positive residual in this slice.

LongCat-Flash-Thinking-2601 has the highest selective-disclosure risk once benchmark coverage and source choice are included.

Last 30 days

Models that got worse

Sigma-normalized vs. each model's baseline

How is this measured?

For each model we reconstruct daily TrueSkill conservative ratings per arena from match-level vote outcomes, then compute a baseline from the first 21 days of activity (after a 3-day warm-up). The Quality Index is the sigma-normalized deviation from that baseline, weighted across arenas. Change shown is the difference between today and 30 days ago. A swing of ±0.5σ is noticeable; ±1σ is significant.
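A minimal sketch of that calculation, assuming each arena has already been reduced to a one-dimensional array of daily conservative ratings (the mu - 3*sigma convention, the function names, and the arena weighting below are illustrative assumptions, not the exact pipeline):

```python
import numpy as np

def quality_index(daily_ratings: np.ndarray,
                  warmup_days: int = 3,
                  baseline_days: int = 21) -> np.ndarray:
    """Sigma-normalized deviation of a daily rating series from its own baseline.

    daily_ratings holds one arena's TrueSkill conservative rating per day
    (commonly mu - 3*sigma; the exact constant is an assumption here).
    """
    # Baseline window: the first `baseline_days` days of activity after warm-up.
    base = daily_ratings[warmup_days:warmup_days + baseline_days]
    mu, sigma = base.mean(), base.std(ddof=1)
    return (daily_ratings - mu) / sigma   # deviation measured in baseline sigmas

def thirty_day_change(per_arena_qi: list[np.ndarray],
                      weights: list[float]) -> float:
    """Arena-weighted difference between today's index and 30 days ago."""
    w = np.asarray(weights, dtype=float)
    w /= w.sum()
    today = np.array([qi[-1] for qi in per_arena_qi])
    prior = np.array([qi[-31] for qi in per_arena_qi])   # 30 days earlier
    return float(w @ (today - prior))
```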

Largest claim gaps

Models claiming the most relative to hard evidence

Models whose self-reported coding signal runs hottest relative to verified benchmarks, community results, and arena votes.

Hard data overperforms

Models stronger than their own claims

Independent evidence ranks these models higher than their self-reported coding numbers.

Selective disclosure

Risk ledger

Combines residual gap, missing standard coding buckets, house-benchmark dependence, and hard-evidence thickness.
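As an illustration only (the ledger's actual weights are not published here), combining those four signals into a single score might look like the sketch below; every weight and the thinness transform are assumptions:

```python
def risk_score(residual_gap: float,
               missing_buckets: int,
               house_share: float,
               hard_sources: int,
               weights: tuple = (1.0, 0.25, 1.0, 0.5)) -> float:
    """Illustrative ledger combination; weights are made up for the example.

    residual_gap    -- positive claim-vs-hard residual (robust-z units)
    missing_buckets -- standard coding buckets with no coverage (0-5)
    house_share     -- fraction of claims made on the provider's own benchmarks
    hard_sources    -- count of independent hard-evidence sources
    """
    w_gap, w_bucket, w_house, w_thin = weights
    thinness = 1.0 / (1.0 + hard_sources)   # less hard evidence -> higher risk
    return (w_gap * max(residual_gap, 0.0)
            + w_bucket * missing_buckets
            + w_house * house_share
            + w_thin * thinness)
```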

By provider

Provider disclosure profile

Average claim gap and number of positive residuals across each provider's overlapping coding models.

Inception (Unknown Organization): 1 model · 1 over-claiming · 100% hard/claimed · 2.71 · avg gap +1.00
Meituan: 4 models · 3 over-claiming · 100% hard/claimed · 1.71 · avg gap +0.32
StepFun: 1 model · 1 over-claiming · 100% hard/claimed · 1.57 · avg gap +0.59
DeepSeek: 4 models · 3 over-claiming · 28% hard/claimed · 1.51 · avg gap +0.45
Xiaomi: 1 model · 1 over-claiming · 33% hard/claimed · 1.27 · avg gap +0.39
Moonshot AI: 2 models · 1 over-claiming · 43% hard/claimed · 1.08 · avg gap +0.43
NVIDIA: 1 model · 1 over-claiming · 25% hard/claimed · 0.82 · avg gap -0.44
OpenAI: 9 models · 2 over-claiming · 89% hard/claimed · 0.81 · avg gap -0.10
Alibaba Cloud / Qwen Team: 4 models · 1 over-claiming · 35% hard/claimed · 0.68 · avg gap -0.21
Anthropic: 9 models · 3 over-claiming · 59% hard/claimed · 0.65 · avg gap +0.05
By benchmark

Benchmark-choice bias

Self-reported benchmarks that correlate weakly with hard evidence and skew high in claim gap; a sketch of the per-benchmark computation follows the list.

SWE-Lancer (IC-Diamond subset): SWE-style repair · 3 models · 0% house share · 1.16 · r -0.53 · gap +0.28
LiveCodeBench: Algorithmic coding · 11 models · 0% house share · 0.68 · r -0.04 · gap +0.29
SWE-Bench Pro: SWE-style repair · 9 models · 0% house share · 0.52 · r 0.33 · gap +0.50
CyberGym: Other · 4 models · 0% house share · 0.43 · r 0.99 · gap +0.43
Multi-SWE-Bench: SWE-style repair · 4 models · 0% house share · 0.39 · r 0.50 · gap +0.39
SWE-bench Multilingual: SWE-style repair · 12 models · 0% house share · 0.20 · r 0.81 · gap +0.20
Terminal-Bench 2.0: Terminal tasks · 23 models · 0% house share · 0.18 · r 0.69 · gap +0.18
MCP Atlas: Other · 11 models · 0% house share · 0.17 · r 0.27 · gap +0.09
SciCode: Other · 8 models · 0% house share · 0.16 · r 0.79 · gap +0.16
Aider-Polyglot: Other · 7 models · 0% house share · 0.01 · r 0.34 · gap -0.53
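A minimal sketch of that per-benchmark computation, assuming one row per (model, benchmark) claim with hypothetical columns `benchmark`, `claimed_z`, and `hard_z`:

```python
import pandas as pd

def benchmark_choice_bias(df: pd.DataFrame) -> pd.DataFrame:
    """Per-benchmark correlation with hard evidence and mean claim gap."""
    def per_bench(g: pd.DataFrame) -> pd.Series:
        return pd.Series({
            "models": len(g),
            "r": g["claimed_z"].corr(g["hard_z"]),         # Pearson by default
            "gap": (g["claimed_z"] - g["hard_z"]).mean(),  # positive = claims run hot
        })

    return (df.groupby("benchmark")
              .apply(per_bench)
              .sort_values("gap", ascending=False))
```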
Family inventory

Silent siblings

One model in a family is claimed loudly while related variants have only hard evidence, or no coding coverage at all.

No silent sibling pattern detected yet.

Coverage matrix

Standard coding bucket coverage

Whether each risk-flagged model has both claims and hard evidence, only claims, only hard evidence, or nothing for each standard coding bucket.

Model (provider) | SWE-style repair | Terminal tasks | Algorithmic coding | Web and visual coding | Agentic tool use
LongCat-Flash-Thinking-2601 (Meituan) | Claim only | Missing | Claim only | Hard only | Missing
Mercury 2 (Unknown Organization) | Missing | Missing | Claim only | Hard only | Missing
LongCat-Flash-Thinking (Meituan) | Claim only | Missing | Claim only | Hard only | Missing
DeepSeek-R1-0528 (DeepSeek) | Claim only | Claim only | Claim only | Hard only | Hard only
Kimi K2-Thinking-0905 (Moonshot AI) | Claim only | Claim only | Missing | Hard only | Hard only
DeepSeek-V3.2-Speciale (DeepSeek) | Claim only | Claim only | Missing | Hard only | Missing
DeepSeek-V3.2 (Thinking) (DeepSeek) | Claim only | Claim only | Claim only | Hard only | Hard only
GPT-5.3 Codex (OpenAI) | Claim only | Claim only | Missing | Hard only | Hard only
GPT-5.2 Codex (OpenAI) | Claim only | Claim only | Missing | Hard only | Hard only
Step-3.5-Flash (StepFun) | Claim only | Claim only | Missing | Hard only | Missing

Legend: Claim + hard evidence · Claim only · Hard only · Missing
How is this measured?

Claimed signal: self-reported catalog benchmarks in the code category.

Hard signal: verified, non-self-reported code benchmarks, public community code leaderboards, and LLM Stats coding arena ratings.

Both signals are aggregated separately, robust-z normalized over the overlapping models, and then compared with regression residuals; a positive residual means a model's claims run ahead of its hard evidence.
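A minimal sketch of that comparison, assuming one aggregated claimed score and one hard score per overlapping model (the MAD-based robust-z and the plain least-squares fit are assumptions about the exact estimators):

```python
import numpy as np

def robust_z(x: np.ndarray) -> np.ndarray:
    """Median/MAD z-score, less sensitive to outlier models than mean/std."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return (x - med) / (1.4826 * mad)   # 1.4826 makes MAD comparable to a std dev

def claim_residuals(claimed: np.ndarray, hard: np.ndarray) -> np.ndarray:
    """Residual of the claimed signal after regressing it on hard evidence.

    A positive residual means the model claims more than its hard evidence predicts.
    """
    c, h = robust_z(claimed), robust_z(hard)
    slope, intercept = np.polyfit(h, c, 1)    # ordinary least-squares line
    return c - (slope * h + intercept)
```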

The cherry-picking layer adds benchmark coverage, house-benchmark dependence, provider aggregation, silent family siblings, and benchmark-choice bias. These are risk signals, not accusations.

Minimum evidence: 2 reported sources and 2 hard sources, or 30+ arena votes.