BenchmaxxedAI Models
A source-separated report for spotting models whose self-reported coding benchmarks run hotter than verified benchmarks, community coding results, and LLM Stats coding arena votes.
Claims vs hard evidence
Each model is plotted by its self-reported coding signal against independent hard evidence. Points above the dashed line are over-claiming.
49 coding models had enough claimed and hard evidence to compare.
Across that slice, reported coding claims correlate strongly with hard evidence; the regression residuals isolate who over-claims.
LongCat-Flash-Thinking-2601 shows the largest positive residual in this slice.
It also carries the highest selective-disclosure risk once benchmark coverage and source choice are included.
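As a minimal sketch of how the over-claim ranking further down can be pulled from those residuals (the `residuals` mapping and the cutoff at the top eight are assumptions, not the report's exact pipeline):

```python
def top_overclaimers(residuals, k=8):
    """Return the k models with the largest positive residuals, i.e.
    the points sitting furthest above the dashed regression line."""
    positive = [m for m, r in residuals.items() if r > 0]
    return sorted(positive, key=residuals.get, reverse=True)[:k]
```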
Models that got worse
Sigma-normalized quality change vs. each model's baseline
How is this measured?
For each model we reconstruct daily TrueSkill conservative ratings per arena from match-level vote outcomes, then compute a baseline from the first 21 days of activity (after a 3-day warm-up). The Quality Index is the sigma-normalized deviation from that baseline, weighted across arenas. Change shown is the difference between today and 30 days ago. A swing of ±0.5σ is noticeable; ±1σ is significant.
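A minimal single-arena sketch of that reconstruction, using the open-source `trueskill` package; the input shape (`votes_by_day` as daily lists of winner/loser pairs) and the omission of cross-arena weighting are assumptions:

```python
import statistics
import trueskill

def conservative_series(votes_by_day, model):
    """Replay match-level votes day by day and record the model's
    conservative TrueSkill rating (mu - 3*sigma) after each day."""
    ratings, series = {}, []
    for day in votes_by_day:
        for winner, loser in day:
            rw = ratings.setdefault(winner, trueskill.Rating())
            rl = ratings.setdefault(loser, trueskill.Rating())
            ratings[winner], ratings[loser] = trueskill.rate_1vs1(rw, rl)
        series.append(trueskill.expose(ratings.get(model, trueskill.Rating())))
    return series

def quality_change(series, warmup=3, baseline_len=21, lookback=30):
    """Sigma-normalized deviation from the early baseline (the Quality
    Index for one arena) and its change over the last 30 days.
    Assumes at least lookback + 1 days of history."""
    base = series[warmup:warmup + baseline_len]
    mu, sd = statistics.mean(base), statistics.pstdev(base) or 1e-9
    z = [(x - mu) / sd for x in series]
    return z[-1], z[-1] - z[-1 - lookback]
```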
Models claiming the most relative to hard evidence
For these models, the self-reported coding signal runs hottest relative to verified benchmarks, community results, and arena votes.
LongCat-Flash-Thinking-2601
DeepSeek-R1-0528
Mercury 2
LongCat-Flash-Thinking
Kimi K2-Thinking-0905
GPT-5.2 Codex
Claude Haiku 4.5
DeepSeek-V3.2 (Thinking)
Models stronger than their own claims
Independent evidence ranks these models higher than their self-reported coding numbers.
GPT-4.1 mini
GPT-5.4 mini
Gemini 3 Pro
Gemini 3 Flash
Qwen3.5-35B-A3B
Claude Sonnet 4.6
GPT-4.1
Claude 3.7 Sonnet
Risk ledger
Combines the residual claim gap, missing standard coding buckets, house-benchmark dependence, and the thinness of the available hard evidence; a sketch of one possible combination follows the list below.
LongCat-Flash-Thinking-2601
Mercury 2
LongCat-Flash-Thinking
DeepSeek-R1-0528
Kimi K2-Thinking-0905
DeepSeek-V3.2-Speciale
DeepSeek-V3.2 (Thinking)
GPT-5.3 Codex
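How those four ingredients could be folded into one score; the weights and scalings here are illustrative assumptions, not the ledger's actual formula:

```python
def risk_score(residual_gap, missing_buckets, house_share, hard_sources,
               weights=(0.4, 0.2, 0.2, 0.2)):
    """Combine the four risk ingredients: positive claim residual,
    share of the 5 standard coding buckets left uncovered, dependence
    on house benchmarks (0..1), and thinness of hard evidence."""
    coverage_gap = missing_buckets / 5         # 5 standard buckets
    thinness = 1.0 / (1.0 + hard_sources)      # fewer hard sources -> riskier
    parts = (max(residual_gap, 0.0), coverage_gap, house_share, thinness)
    return sum(w * p for w, p in zip(weights, parts))
```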
Provider disclosure profile
Average claim gap and number of positive residuals across each provider's overlapping coding models.
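A sketch of that provider-level roll-up, assuming a per-model residual mapping and a model-to-provider lookup:

```python
from collections import defaultdict

def provider_profile(residuals, provider_of):
    """Per provider: mean claim gap and count of positive residuals
    across its overlapping coding models."""
    by_provider = defaultdict(list)
    for model, gap in residuals.items():
        by_provider[provider_of[model]].append(gap)
    return {p: (sum(gaps) / len(gaps), sum(g > 0 for g in gaps))
            for p, gaps in by_provider.items()}
```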
Benchmark-choice bias
Self-reported benchmarks that correlate weakly with hard evidence and skew high in claim gap.
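One way to flag such benchmarks; the input shapes, the 0.3 correlation cutoff, and the three-model minimum are illustrative assumptions:

```python
import numpy as np

def biased_benchmarks(bench_scores, hard_score, claim_gap,
                      max_corr=0.3, min_models=3):
    """Flag self-reported benchmarks whose scores correlate weakly with
    hard evidence while the models reporting them skew to positive
    claim gaps. bench_scores: {benchmark: {model: score}}."""
    flagged = {}
    for bench, scores in bench_scores.items():
        models = sorted(set(scores) & set(hard_score) & set(claim_gap))
        if len(models) < min_models:
            continue
        s = np.array([scores[m] for m in models])
        h = np.array([hard_score[m] for m in models])
        g = np.array([claim_gap[m] for m in models])
        corr = np.corrcoef(s, h)[0, 1]
        if corr < max_corr and g.mean() > 0:
            flagged[bench] = (float(corr), float(g.mean()))
    return flagged
```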
Silent siblings
One model is claimed loudly while related variants have only hard evidence or no coding coverage at all.
No silent sibling pattern detected yet.
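The detection itself is a simple coverage check over model families; the grouping of variants into a family is assumed to happen upstream:

```python
def silent_siblings(family, coverage):
    """coverage: {model: {'claims': bool, 'hard': bool}}. Returns the
    one loudly-claimed model and its silent siblings (hard-evidence-only
    or no coding coverage), or None if the pattern does not hold."""
    loud = [m for m in family if coverage[m]['claims']]
    silent = [m for m in family if not coverage[m]['claims']]
    return (loud[0], silent) if len(loud) == 1 and silent else None
```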
Standard coding bucket coverage
Whether each risk model has both signals, only claims, only hard evidence, or nothing across the standard coding buckets.
| Model | SWE-style repair | Terminal tasks | Algorithmic coding | Web and visual coding | Agentic tool use |
|---|---|---|---|---|---|
| | Claim only | Missing | Claim only | Hard only | Missing |
| | Missing | Missing | Claim only | Hard only | Missing |
| | Claim only | Missing | Claim only | Hard only | Missing |
| | Claim only | Claim only | Claim only | Hard only | Hard only |
| | Claim only | Claim only | Missing | Hard only | Hard only |
| | Claim only | Claim only | Missing | Hard only | Missing |
| | Claim only | Claim only | Claim only | Hard only | Hard only |
| | Claim only | Claim only | Missing | Hard only | Hard only |
| | Claim only | Claim only | Missing | Hard only | Hard only |
| | Claim only | Claim only | Missing | Hard only | Missing |
How is this measured?
Reported signal: self-reported catalog benchmarks in the code category.
Hard evidence: verified (non-self-reported) code benchmarks, public community code leaderboards, and LLM Stats coding arena ratings.
Both signals are aggregated separately, robust-z normalized over the overlapping models, then compared via regression residuals.
The cherry-picking layer adds benchmark coverage, house-benchmark dependence, provider aggregation, silent family siblings, and benchmark-choice bias. These are risk signals, not accusations.
Minimum evidence: 2 reported sources and 2 hard sources, or 30+ arena votes.
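A minimal sketch of that aggregation step, using median/MAD robust z-scores and an ordinary least-squares residual; the exact robust estimator behind the report's numbers is an assumption:

```python
import numpy as np

def robust_z(x):
    """Center on the median and scale by 1.4826 * MAD, so the score is
    comparable to a z-score but resistant to outliers."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med)) or 1e-9
    return (x - med) / (1.4826 * mad)

def claim_residuals(reported, hard):
    """Regress the normalized reported signal on normalized hard
    evidence over overlapping models; positive residuals mark models
    whose claims run ahead of their hard evidence."""
    r, h = robust_z(reported), robust_z(hard)
    slope, intercept = np.polyfit(h, r, 1)
    return r - (slope * h + intercept)
```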