SWE-Bench Pro

Progress Over Time

[Interactive timeline: model performance on SWE-Bench Pro over time. Legend: state-of-the-art frontier, open models, proprietary models.]

SWE-Bench Pro Leaderboard

32 models

Rank | Organization | Parameters | Context | Cost | License
1 | - | - | - | - | -
2 | - | - | - | - | -
3 | - | - | 1.0M | $5.00 / $25.00 | -
4 | - | - | 1.0M | $5.00 / $25.00 | -
5 | Zhipu AI | 753B | 1.0M | $1.40 / $4.40 | -
6 | Alibaba Cloud / Qwen Team | - | 1.0M | $1.25 / $3.75 | -
7 | MiniMax | - | 1.0M | $0.60 / $2.40 | -
8 | Moonshot AI | 1.0T | 262K | $0.95 / $4.00 | -
8 | OpenAI | - | 1.1M | $5.00 / $30.00 | -
10 | Zhipu AI | 754B | 200K | $1.40 / $4.40 | -
11 | OpenAI | - | 1.0M | $2.50 / $15.00 | -
12 | Alibaba Cloud / Qwen Team | - | - | - | -
13 | - | 1.0T | 1.0M | $0.43 / $0.87 | -
14 | - | - | 400K | $1.75 / $14.00 | -
15 | Alibaba Cloud / Qwen Team | - | 1.0M | $0.50 / $3.00 | -
16 | - | - | 400K | $1.75 / $14.00 | -
17 | - | - | 205K | $0.30 / $1.20 | -
18 | Xiaomi | 311B | 1.0M | $0.17 / $0.34 | -
19 | - | 230B | 1.0M | $0.30 / $1.20 | -
19 | - | 1.6T | 1.0M | $1.74 / $3.48 | -
21 | - | - | 1.0M | $1.50 / $9.00 | -
22 | - | - | 400K | $0.75 / $4.50 | -
23 | - | - | 1.0M | $2.50 / $15.00 | -
24 | Alibaba Cloud / Qwen Team | 28B | 262K | $0.60 / $3.60 | -
25 | - | 1.0T | - | - | -
26 | - | 284B | 1.0M | $0.14 / $0.28 | -
27 | - | - | - | - | -
27 | - | - | 400K | $0.20 / $1.25 | -
29 | - | - | - | - | -
30 | Moonshot AI | 1.0T | - | - | -
31 | Alibaba Cloud / Qwen Team | 35B | - | - | -
32 | - | 30B | - | - | -
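
The rank column repeats 8, 19, and 27 and skips 9, 20, and 28, which is consistent with standard competition ranking for tied scores. The page does not state its tie rule, so the following is only a minimal sketch under that assumption. The function name `competition_rank` and the scores are hypothetical and are not taken from the leaderboard.

```python
def competition_rank(scores):
    """Return 1-based ranks, highest score first.

    Tied scores share the same rank, and the ranks that would have
    followed are skipped (for example 1, 2, 2, 4).
    """
    ordered = sorted(scores, reverse=True)
    first_position = {}
    for position, score in enumerate(ordered, start=1):
        # Remember only the first (best) position at which each score appears.
        first_position.setdefault(score, position)
    return [first_position[score] for score in scores]


# Hypothetical scores, chosen only to show a tie.
hypothetical_scores = [0.50, 0.45, 0.45, 0.40]
ranks = competition_rank(hypothetical_scores)
print(ranks)
```

With this rule, two models tied at rank 8 are followed by rank 10, which matches the pattern in the rank column.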
About this benchmark

What is SWE-Bench Pro?

SWE-Bench Pro is an advanced version of SWE-Bench that evaluates language models on complex, real-world software engineering tasks requiring extended reasoning and multi-step problem solving.

SWE-Bench Pro is a text benchmark that evaluates models on reasoning, agent, and code tasks. LLM Stats tracks 32 models on this benchmark, scored on a 0–1 scale. The current average score is 0.573, and the leader scores 0.800.


Current leaders

Claude Fable 5 from Anthropic currently leads the SWE-Bench Pro leaderboard with a score of 0.800 among 32 evaluated AI models.

1. Claude Fable 5 (Anthropic): 80.0%
2. Claude Mythos Preview (Anthropic): 77.8%
3. Claude Opus 4.8 (Anthropic): 69.2%

Top open-weight model: GLM-5.2 (rank #5): 62.1%
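
The page reports the same results both on a 0–1 scale (0.800) and as percentages (80.0%). As a rough illustration of how the two relate, here is a small sketch that uses only the four scores named above. The helper `to_percent` and the variable names are illustrative assumptions. Because this is a partial subset, its mean is not the leaderboard-wide average.

```python
from statistics import mean

# Scores on the 0-1 scale for the four models named in this section.
# Partial subset only: the other 28 models' scores are not listed here.
scores = {
    "Claude Fable 5": 0.800,
    "Claude Mythos Preview": 0.778,
    "Claude Opus 4.8": 0.692,
    "GLM-5.2": 0.621,
}


def to_percent(score: float) -> str:
    """Format a 0-1 score as a percentage with one decimal place."""
    return f"{score * 100:.1f}%"


# The leader is the model with the highest score.
leader = max(scores, key=scores.get)

# Mean of this four-model subset, not of all 32 models.
subset_mean = mean(scores.values())

for name, score in scores.items():
    print(f"{name}: {score:.3f} = {to_percent(score)}")
print(f"Leader: {leader}")
```

The same conversion applies to any score on the page: multiply the 0–1 value by 100 to get the percentage.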

FAQ

Common questions about the SWE-Bench Pro benchmark and leaderboard.

What is the SWE-Bench Pro benchmark?

SWE-Bench Pro is an advanced version of SWE-Bench that evaluates language models on complex, real-world software engineering tasks requiring extended reasoning and multi-step problem solving.

What is the SWE-Bench Pro leaderboard?

The SWE-Bench Pro leaderboard ranks 32 AI models based on their performance on this benchmark. Currently, Claude Fable 5 by Anthropic leads with a score of 0.800. The average score across all models is 0.573.

What is the highest SWE-Bench Pro score?

The highest SWE-Bench Pro score is 0.800, achieved by Claude Fable 5 from Anthropic.

How many models are evaluated on SWE-Bench Pro?

32 models have been evaluated on the SWE-Bench Pro benchmark, with 0 verified results and 32 self-reported results.

What categories does SWE-Bench Pro cover?

SWE-Bench Pro is categorized under reasoning, agents, and code. The benchmark evaluates text models.

What's the difference between SWE-Bench Pro and SWE-Bench Verified?

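Both benchmarks belong to the SWE-Bench family. SWE-Bench Pro is the advanced version, focused on complex, real-world software engineering tasks that require extended reasoning and multi-step problem solving. See the SWE-Bench Verified leaderboard for that benchmark's results and per-model comparison.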

What is the best open-source model on SWE-Bench Pro?

GLM-5.2 by Zhipu AI is the top-ranked open-source model on SWE-Bench Pro, with a score of 0.621 (rank #5).

How recent are the SWE-Bench Pro leaderboard results?

The SWE-Bench Pro leaderboard was last updated in June 2026 and currently includes 32 evaluated models.