SWE-Bench Verified
A verified subset of 500 software engineering problems from real GitHub issues, validated by human annotators for evaluating language models' ability to resolve real-world coding issues by generating patches for Python codebases.
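To explore the underlying tasks directly, the dataset can be loaded from Hugging Face. The snippet below is a minimal sketch: it assumes the public `princeton-nlp/SWE-bench_Verified` dataset ID and the `datasets` library, and the field names shown (`instance_id`, `repo`, `problem_statement`) follow the published SWE-bench schema.

```python
# Minimal sketch: load the 500-instance SWE-Bench Verified test split and
# inspect one task. Assumes the Hugging Face `datasets` library and the
# public `princeton-nlp/SWE-bench_Verified` dataset ID.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(f"{len(ds)} instances")  # expected: 500 human-validated problems

task = ds[0]
print(task["instance_id"])               # e.g. "<org>__<repo>-<issue>" identifier
print(task["repo"])                      # GitHub repository the issue comes from
print(task["problem_statement"][:300])   # issue text the model must resolve with a patch
```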
Progress Over Time
Interactive timeline showing model performance evolution on SWE-Bench Verified, with a state-of-the-art frontier and a breakdown of open vs. proprietary models.
SWE-Bench Verified Leaderboard
80 models
| Rank | Model | Organization | Params | Context | Cost (per 1M tokens, input / output) |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.5 | Anthropic | — | 200K | $5.00 / $25.00 |
| 2 | — | Anthropic | — | 1.0M | $5.00 / $25.00 |
| 3 | — | Google | — | 1.0M | $2.50 / $15.00 |
| 4 | — | MiniMax | 230B | 1.0M | $0.30 / $1.20 |
| 5 | — | OpenAI | — | 400K | $1.75 / $14.00 |
| 6 | — | Anthropic | — | 200K | $3.00 / $15.00 |
| 7 | Qwen3.6 Plus | Alibaba Cloud / Qwen Team | — | — | — |
| 8 | — | Google | — | 1.0M | $0.50 / $3.00 |
| 8 | — | Xiaomi | 1.0T | 1.0M | $1.00 / $3.00 |
| 10 | — | Zhipu AI | 744B | 200K | $1.00 / $3.20 |
| 11 | — | Moonshot AI | 1.0T | 262K | $0.60 / $2.50 |
| 12 | — | ByteDance | — | — | — |
| 13 | — | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 / $3.60 |
| 14 | — | OpenAI | — | 400K | $1.25 / $10.00 |
| 14 | — | OpenAI | — | 400K | $1.25 / $10.00 |
| 14 | — | OpenAI | — | 400K | $1.25 / $10.00 |
| 17 | — | Google | — | — | — |
| 18 | — | OpenAI | — | 400K | $1.25 / $10.00 |
| 19 | — | Xiaomi | — | 262K | $0.40 / $2.00 |
| 20 | — | OpenAI | — | — | — |
| 20 | — | Anthropic | — | 200K | $15.00 / $75.00 |
| 22 | — | StepFun | 196B | 66K | $0.10 / $0.40 |
| 23 | — | Zhipu AI | 358B | 205K | $0.60 / $2.20 |
| 24 | — | OpenAI | — | 400K | $1.25 / $10.00 |
| 25 | — | ByteDance | — | — | — |
| 26 | — | Xiaomi | 309B | 256K | $0.10 / $0.30 |
| 27 | — | Anthropic | — | 200K | $1.00 / $5.00 |
| 28 | — | DeepSeek | 685B | — | — |
| 28 | — | DeepSeek | 685B | — | — |
| 30 | — | Anthropic | — | 200K | $3.00 / $15.00 |
| 31 | — | Anthropic | — | 200K | $15.00 / $75.00 |
| 32 | — | Alibaba Cloud / Qwen Team | 27B | — | — |
| 33 | — | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 |
| 34 | — | Moonshot AI | 1.0T | — | — |
| 35 | — | — | — | 256K | $0.20 / $1.50 |
| 36 | — | Anthropic | — | 200K | $3.00 / $15.00 |
| 37 | — | Meituan | 560B | 128K | $0.30 / $1.20 |
| 38 | — | Alibaba Cloud / Qwen Team | 1.0T | 256K | $0.50 / $5.00 |
| 38 | — | Alibaba Cloud / Qwen Team | 480B | — | — |
| 40 | — | MiniMax | 230B | 1.0M | $0.30 / $1.20 |
| 41 | — | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 |
| 42 | — | OpenAI | — | 200K | $2.00 / $8.00 |
| 43 | — | OpenAI | — | 200K | $1.10 / $4.40 |
| 44 | — | Zhipu AI | 357B | 131K | $0.55 / $2.19 |
| 45 | — | DeepSeek | 685B | — | — |
| 46 | — | — | — | 1.0M | $1.25 / $10.00 |
| 47 | — | MiniMax | 230B | 1.0M | $0.30 / $1.20 |
| 48 | — | DeepSeek | 671B | 164K | $0.27 / $1.00 |
| 49 | — | Moonshot AI | 1.0T | — | — |
| 50 | — | Zhipu AI | 355B | 131K | $0.40 / $1.60 |
Showing 1–50 of 80 models.
FAQ
Common questions about SWE-Bench Verified
What is SWE-Bench Verified?
A verified subset of 500 software engineering problems from real GitHub issues, validated by human annotators for evaluating language models' ability to resolve real-world coding issues by generating patches for Python codebases.

Where can I find the SWE-Bench Verified paper?
SWE-Bench Verified builds on the original SWE-bench paper, available at https://arxiv.org/abs/2310.06770, which details the benchmark methodology, dataset construction, and evaluation criteria; the Verified subset was curated from that dataset by human annotators.

Who leads the SWE-Bench Verified leaderboard?
The SWE-Bench Verified leaderboard ranks 80 AI models by their performance on this benchmark. Claude Opus 4.5 by Anthropic currently leads with a score of 0.809; the average score across all models is 0.627.

What is the highest SWE-Bench Verified score?
The highest SWE-Bench Verified score is 0.809, achieved by Claude Opus 4.5 from Anthropic.

How many models have been evaluated?
80 models have been evaluated on the SWE-Bench Verified benchmark. All 80 results are self-reported; none have been independently verified.

What categories does SWE-Bench Verified cover?
SWE-Bench Verified is categorized under code, frontend development, and reasoning, and it evaluates text models.
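For context on how leaderboard submissions are scored: a model receives each issue and must produce a unified-diff patch, which the SWE-bench evaluation harness then applies to a checkout of the repository and validates by running the task's tests. The sketch below only illustrates the predictions-file shape described in the SWE-bench documentation (`instance_id`, `model_name_or_path`, `model_patch`); the model name and patch body are placeholders, and the exact harness invocation can vary between releases, so treat it as an assumption rather than an official recipe.

```python
# Illustrative sketch: assemble a JSON-lines predictions file in the shape
# the SWE-bench evaluation harness expects (one object per task with
# instance_id / model_name_or_path / model_patch). The patch below is a
# placeholder, not a real fix.
import json

predictions = [
    {
        "instance_id": "astropy__astropy-12907",   # SWE-bench-style instance ID
        "model_name_or_path": "my-coding-model",    # hypothetical model name
        "model_patch": (
            "diff --git a/astropy/modeling/separable.py b/astropy/modeling/separable.py\n"
            "--- a/astropy/modeling/separable.py\n"
            "+++ b/astropy/modeling/separable.py\n"
            "... placeholder unified diff ...\n"
        ),
    },
]

with open("predictions.jsonl", "w") as f:
    for pred in predictions:
        f.write(json.dumps(pred) + "\n")

# The file is then handed to the official harness (the `swebench` package),
# which applies each patch in an isolated environment and checks the task's
# FAIL_TO_PASS / PASS_TO_PASS tests to decide whether the issue is resolved.
```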