SWE-Bench Pro
Progress Over Time
Interactive timeline showing model performance evolution on SWE-Bench Pro
SWE-Bench Pro Leaderboard
| Rank | Organization | Parameters | Context | Cost | License | |
|---|---|---|---|---|---|---|
| 1 | Anthropic | — | — | — | ||
| 2 | Anthropic | — | — | — | ||
| 3 | Anthropic | — | 1.0M | $5.00 / $25.00 | ||
| 4 | Anthropic | — | 1.0M | $5.00 / $25.00 | ||
| 5 | Zhipu AI (GLM-5.2) | 753B | 1.0M | $1.40 / $4.40 | ||
| 6 | Alibaba Cloud / Qwen Team | — | 1.0M | $1.25 / $3.75 | ||
| 7 | MiniMax | — | 1.0M | $0.60 / $2.40 | ||
| 8 | Moonshot AI | 1.0T | 262K | $0.95 / $4.00 | ||
| 8 | OpenAI | — | 1.1M | $5.00 / $30.00 | ||
| 10 | Zhipu AI | 754B | 200K | $1.40 / $4.40 | ||
| 11 | OpenAI | — | 1.0M | $2.50 / $15.00 | ||
| 12 | Alibaba Cloud / Qwen Team | — | — | — | ||
| 13 | Xiaomi | 1.0T | 1.0M | $0.43 / $0.87 | ||
| 14 | OpenAI | — | 400K | $1.75 / $14.00 | ||
| 15 | Alibaba Cloud / Qwen Team | — | 1.0M | $0.50 / $3.00 | ||
| 16 | OpenAI | — | 400K | $1.75 / $14.00 | ||
| 17 | MiniMax | — | 205K | $0.30 / $1.20 | ||
| 18 | Xiaomi | 311B | 1.0M | $0.17 / $0.34 | ||
| 19 | MiniMax | 230B | 1.0M | $0.30 / $1.20 | ||
| 19 | DeepSeek | 1.6T | 1.0M | $1.74 / $3.48 | ||
| 21 | Google | — | 1.0M | $1.50 / $9.00 | ||
| 22 | OpenAI | — | 400K | $0.75 / $4.50 | ||
| 23 | Google | — | 1.0M | $2.50 / $15.00 | ||
| 24 | Alibaba Cloud / Qwen Team | 28B | 262K | $0.60 / $3.60 | ||
| 25 | Microsoft | 1.0T | — | — | ||
| 26 | DeepSeek | 284B | 1.0M | $0.14 / $0.28 | ||
| 27 | Meta | — | — | — | ||
| 27 | OpenAI | — | 400K | $0.20 / $1.25 | ||
| 29 | Microsoft | — | — | — | ||
| 30 | Moonshot AI | 1.0T | — | — | ||
| 31 | Alibaba Cloud / Qwen Team | 35B | — | — | ||
| 32 | Cohere | 30B | — | — |
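The size and cost cells in the table are display strings (for example `262K`, `1.0T`, `$0.95 / $4.00`) rather than numbers, so sorting or filtering the rows needs a small parsing step. The following is a minimal Python sketch of that step, not the site's own code: the function names `parse_size`, `parse_cost`, and `parse_row` and the dictionary keys are hypothetical, the column order is assumed from the rows as shown, and the two prices in a cost cell are returned in the order they appear without assuming what each one denotes.

```python
import re
from typing import Optional, Tuple

# Multipliers for the suffixes that appear in the table (262K, 1.0M, 753B, 1.0T).
_SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000, "T": 1_000_000_000_000}


def parse_size(cell: str) -> Optional[float]:
    """Convert a cell such as '262K' or '1.0T' to a number.

    Anything that does not match (including the dash placeholder used
    for missing values) yields None.
    """
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMBT])", cell.strip())
    if match is None:
        return None
    value, suffix = match.groups()
    return float(value) * _SUFFIXES[suffix]


def parse_cost(cell: str) -> Optional[Tuple[float, float]]:
    """Split a cell such as '$1.40 / $4.40' into its two prices, in order."""
    match = re.fullmatch(r"\$(\d+(?:\.\d+)?)\s*/\s*\$(\d+(?:\.\d+)?)", cell.strip())
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def parse_row(line: str) -> dict:
    """Parse one pipe-delimited row, assuming the first five cells are
    rank, organization, parameter count, context size, and cost."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    rank, organization, parameters, context, cost = cells[:5]
    return {
        "rank": int(rank),
        "organization": organization,
        "parameters": parse_size(parameters),
        "context": parse_size(context),
        "cost": parse_cost(cost),
    }


row = parse_row("| 8 | Moonshot AI | 1.0T | 262K | $0.95 / $4.00 | ||")
print(row)
```

Returning `None` for unparseable cells keeps missing values distinct from zero, which matters when averaging or sorting by context size or price.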
What is SWE-Bench Pro?
SWE-Bench Pro is an advanced version of SWE-Bench that evaluates language models on complex, real-world software engineering tasks requiring extended reasoning and multi-step problem solving.
The benchmark is text-based and covers reasoning, agentic, and coding tasks. LLM Stats tracks 32 models on it, each scored on a 0–1 scale; the current average score is 0.6 and the top score is 0.8.
Current leaders
Claude Fable 5 from Anthropic currently leads the SWE-Bench Pro leaderboard with a score of 0.800, the highest of the 32 models evaluated.
FAQ
Common questions about the SWE-Bench Pro benchmark and leaderboard.