BenchmaxxedAI Models
A source-separated report for spotting models whose self-reported coding benchmarks run hotter than verified benchmarks, community coding results, and LLM Stats coding arena votes.
Claims vs hard evidence
Each model is plotted by its self-reported coding signal against independent hard evidence. Points above the dashed line are over-claiming.
49 coding models had enough claimed and hard evidence to compare.
Across that slice, reported coding claims correlate strongly with hard evidence; the regression residuals isolate who over-claims.
LongCat-Flash-Thinking-2601 shows the largest positive residual in this slice.
It also carries the highest selective-disclosure risk once benchmark coverage and source choice are included.
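As a minimal sketch of how the over-claim ranking further down can be pulled from those residuals (the `residuals` mapping and the cutoff at the top eight are assumptions, not the report's exact pipeline):

```python
def top_overclaimers(residuals, k=8):
    """Return the k models with the largest positive residuals, i.e.
    the points sitting furthest above the dashed regression line."""
    positive = [m for m, r in residuals.items() if r > 0]
    return sorted(positive, key=residuals.get, reverse=True)[:k]
```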
Models that got worse
Sigma-normalized quality change vs. each model's baseline
How is this measured?
For each model we reconstruct daily TrueSkill conservative ratings per arena from match-level vote outcomes, then compute a baseline from the first 21 days of activity (after a 3-day warm-up). The Quality Index is the sigma-normalized deviation from that baseline, weighted across arenas. Change shown is the difference between today and 30 days ago. A swing of ±0.5σ is noticeable; ±1σ is significant.
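A minimal single-arena sketch of that reconstruction, using the open-source `trueskill` package; the input shape (`votes_by_day` as daily lists of winner/loser pairs) and the omission of cross-arena weighting are assumptions:

```python
import statistics
import trueskill

def conservative_series(votes_by_day, model):
    """Replay match-level votes day by day and record the model's
    conservative TrueSkill rating (mu - 3*sigma) after each day."""
    ratings, series = {}, []
    for day in votes_by_day:
        for winner, loser in day:
            rw = ratings.setdefault(winner, trueskill.Rating())
            rl = ratings.setdefault(loser, trueskill.Rating())
            ratings[winner], ratings[loser] = trueskill.rate_1vs1(rw, rl)
        series.append(trueskill.expose(ratings.get(model, trueskill.Rating())))
    return series

def quality_change(series, warmup=3, baseline_len=21, lookback=30):
    """Sigma-normalized deviation from the early baseline (the Quality
    Index for one arena) and its change over the last 30 days.
    Assumes at least lookback + 1 days of history."""
    base = series[warmup:warmup + baseline_len]
    mu, sd = statistics.mean(base), statistics.pstdev(base) or 1e-9
    z = [(x - mu) / sd for x in series]
    return z[-1], z[-1] - z[-1 - lookback]
```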
Models claiming the most relative to hard evidence
For these models, the self-reported coding signal runs hottest relative to verified benchmarks, community results, and arena votes.
LongCat-Flash-Thinking-2601
DeepSeek-R1-0528
Mercury 2
LongCat-Flash-Thinking
Kimi K2-Thinking-0905
GPT-5.2 Codex
Claude Haiku 4.5
DeepSeek-V3.2 (Thinking)
Models stronger than their own claims
Independent evidence ranks these models higher than their self-reported coding numbers.
GPT-4.1 mini
GPT-5.4 mini
Gemini 3 Pro
Gemini 3 Flash
Qwen3.5-35B-A3B
Claude Sonnet 4.6
GPT-4.1
Claude 3.7 Sonnet
Risk ledger
Combines the residual claim gap, missing standard coding buckets, house-benchmark dependence, and the thinness of the available hard evidence; a sketch of one possible combination follows the list below.
LongCat-Flash-Thinking-2601
Mercury 2
LongCat-Flash-Thinking
DeepSeek-R1-0528
Kimi K2-Thinking-0905
DeepSeek-V3.2-Speciale
DeepSeek-V3.2 (Thinking)
GPT-5.3 Codex
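How those four ingredients could be folded into one score; the weights and scalings here are illustrative assumptions, not the ledger's actual formula:

```python
def risk_score(residual_gap, missing_buckets, house_share, hard_sources,
               weights=(0.4, 0.2, 0.2, 0.2)):
    """Combine the four risk ingredients: positive claim residual,
    share of the 5 standard coding buckets left uncovered, dependence
    on house benchmarks (0..1), and thinness of hard evidence."""
    coverage_gap = missing_buckets / 5         # 5 standard buckets
    thinness = 1.0 / (1.0 + hard_sources)      # fewer hard sources -> riskier
    parts = (max(residual_gap, 0.0), coverage_gap, house_share, thinness)
    return sum(w * p for w, p in zip(weights, parts))
```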
Provider disclosure profile
Average claim gap and number of positive residuals across each provider's overlapping coding models.
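A sketch of that provider-level roll-up, assuming a per-model residual mapping and a model-to-provider lookup:

```python
from collections import defaultdict

def provider_profile(residuals, provider_of):
    """Per provider: mean claim gap and count of positive residuals
    across its overlapping coding models."""
    by_provider = defaultdict(list)
    for model, gap in residuals.items():
        by_provider[provider_of[model]].append(gap)
    return {p: (sum(gaps) / len(gaps), sum(g > 0 for g in gaps))
            for p, gaps in by_provider.items()}
```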
Benchmark-choice bias
Self-reported benchmarks that correlate weakly with hard evidence and skew high in claim gap.
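One way to flag such benchmarks; the input shapes, the 0.3 correlation cutoff, and the three-model minimum are illustrative assumptions:

```python
import numpy as np

def biased_benchmarks(bench_scores, hard_score, claim_gap,
                      max_corr=0.3, min_models=3):
    """Flag self-reported benchmarks whose scores correlate weakly with
    hard evidence while the models reporting them skew to positive
    claim gaps. bench_scores: {benchmark: {model: score}}."""
    flagged = {}
    for bench, scores in bench_scores.items():
        models = sorted(set(scores) & set(hard_score) & set(claim_gap))
        if len(models) < min_models:
            continue
        s = np.array([scores[m] for m in models])
        h = np.array([hard_score[m] for m in models])
        g = np.array([claim_gap[m] for m in models])
        corr = np.corrcoef(s, h)[0, 1]
        if corr < max_corr and g.mean() > 0:
            flagged[bench] = (float(corr), float(g.mean()))
    return flagged
```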
Silent siblings
One model is claimed loudly while related variants have only hard evidence or no coding coverage at all.
No silent sibling pattern detected yet.
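The detection itself is a simple coverage check over model families; the grouping of variants into a family is assumed to happen upstream:

```python
def silent_siblings(family, coverage):
    """coverage: {model: {'claims': bool, 'hard': bool}}. Returns the
    one loudly-claimed model and its silent siblings (hard-evidence-only
    or no coding coverage), or None if the pattern does not hold."""
    loud = [m for m in family if coverage[m]['claims']]
    silent = [m for m in family if not coverage[m]['claims']]
    return (loud[0], silent) if len(loud) == 1 and silent else None
```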
Standard coding bucket coverage
Whether each risk model has both signals, only claims, only hard evidence, or nothing across the standard coding buckets.
| Model | SWE-style repair | Terminal tasks | Algorithmic coding | Web and visual coding | Agentic tool use |
|---|---|---|---|---|---|
| | Claim only | Missing | Claim only | Hard only | Missing |
| | Missing | Missing | Claim only | Hard only | Missing |
| | Claim only | Missing | Claim only | Hard only | Missing |
| | Claim only | Claim only | Claim only | Hard only | Hard only |
| | Claim only | Claim only | Missing | Hard only | Hard only |
| | Claim only | Claim only | Missing | Hard only | Missing |
| | Claim only | Claim only | Claim only | Hard only | Hard only |
| | Claim only | Claim only | Missing | Hard only | Hard only |
| | Claim only | Claim only | Missing | Hard only | Hard only |
| | Claim only | Claim only | Missing | Hard only | Missing |
How is this measured?
Reported signal: self-reported catalog benchmarks in the code category.
Hard evidence: verified (non-self-reported) code benchmarks, public community code leaderboards, and LLM Stats coding arena ratings.
Both signals are aggregated separately, robust-z normalized over the overlapping models, then compared via regression residuals.
The cherry-picking layer adds benchmark coverage, house-benchmark dependence, provider aggregation, silent family siblings, and benchmark-choice bias. These are risk signals, not accusations.
Minimum evidence: 2 reported sources and 2 hard sources, or 30+ arena votes.
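A minimal sketch of that aggregation step, using median/MAD robust z-scores and an ordinary least-squares residual; the exact robust estimator behind the report's numbers is an assumption:

```python
import numpy as np

def robust_z(x):
    """Center on the median and scale by 1.4826 * MAD, so the score is
    comparable to a z-score but resistant to outliers."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med)) or 1e-9
    return (x - med) / (1.4826 * mad)

def claim_residuals(reported, hard):
    """Regress the normalized reported signal on normalized hard
    evidence over overlapping models; positive residuals mark models
    whose claims run ahead of their hard evidence."""
    r, h = robust_z(reported), robust_z(hard)
    slope, intercept = np.polyfit(h, r, 1)
    return r - (slope * h + intercept)
```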