SWE-Bench Verified
A subset of 500 software engineering problems drawn from real GitHub issues and validated by human annotators. The benchmark evaluates a language model's ability to resolve real-world coding issues by generating patches for Python codebases.
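Each task pairs a real GitHub issue with the repository snapshot it was filed against, and a model succeeds when its generated patch makes the issue's failing tests pass. Below is a minimal sketch of inspecting the dataset, assuming the Hugging Face dataset ID `princeton-nlp/SWE-bench_Verified` and the field names published on its dataset card:

```python
# Minimal sketch: load and inspect SWE-Bench Verified task instances.
# Assumes the Hugging Face dataset ID "princeton-nlp/SWE-bench_Verified"
# and the field names from its dataset card; verify both before relying
# on them.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds))  # expected: 500 human-validated problems

task = ds[0]
print(task["repo"])                     # source GitHub repository
print(task["instance_id"])              # unique task identifier
print(task["problem_statement"][:200])  # issue text the model must resolve
print(task["FAIL_TO_PASS"])             # tests a correct patch must fix
```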
Progress Over Time
[Interactive timeline: model performance on SWE-Bench Verified over time, showing a state-of-the-art frontier line and distinguishing open from proprietary models.]
SWE-Bench Verified Leaderboard
77 models • 0 verified (all results self-reported)
| Rank | Organization | Score | Params | Context (tokens) | Cost ($/1M tokens, input / output) | License |
|---|---|---|---|---|---|---|
| 1 | Anthropic | 0.809 | — | 200K | $5.00 / $25.00 | — |
| 2 | Anthropic | 0.808 | — | 200K | $5.00 / $25.00 | — |
| 3 | Google | 0.806 | — | 1.0M | $2.50 / $15.00 | — |
| 4 | MiniMax | 0.802 | 230B | 1.0M | $0.30 / $1.20 | — |
| 5 | OpenAI | 0.800 | — | 400K | $1.75 / $14.00 | — |
| 6 | Anthropic | 0.796 | — | 200K | $3.00 / $15.00 | — |
| 7 | Google | 0.780 | — | 1.0M | $0.50 / $3.00 | — |
| 8 | Zhipu AI | 0.778 | 744B | 200K | $1.00 / $3.20 | — |
| 9 | Moonshot AI | 0.768 | 1.0T | 262K | $0.60 / $2.50 | — |
| 10 | ByteDance | 0.765 | — | — | — | — |
| 11 | Alibaba Cloud / Qwen Team | 0.764 | 397B | 262K | $0.60 / $3.60 | — |
| 12 | OpenAI | 0.763 | — | 400K | $1.25 / $10.00 | — |
| 12 | OpenAI | 0.763 | — | 400K | $1.25 / $10.00 | — |
| 12 | OpenAI | 0.763 | — | 400K | $1.25 / $10.00 | — |
| 15 | Google | 0.762 | — | — | — | — |
| 16 | OpenAI | 0.749 | — | 400K | $1.25 / $10.00 | — |
| 17 | Anthropic | 0.745 | — | 200K | $15.00 / $75.00 | — |
| 17 | OpenAI | 0.745 | — | — | — | — |
| 19 | StepFun | 0.744 | 196B | 66K | $0.10 / $0.40 | — |
| 20 | Zhipu AI | 0.738 | 358B | 205K | $0.60 / $2.20 | — |
| 21 | OpenAI | 0.737 | — | 400K | $1.25 / $10.00 | — |
| 22 | ByteDance | 0.735 | — | — | — | — |
| 23 | Xiaomi | 0.734 | 309B | 256K | $0.10 / $0.30 | — |
| 24 | Anthropic | 0.733 | — | 200K | $1.00 / $5.00 | — |
| 25 | DeepSeek | 0.731 | 685B | — | — | — |
| 25 | DeepSeek | 0.731 | 685B | — | — | — |
| 27 | Anthropic | 0.727 | — | 200K | $3.00 / $15.00 | — |
| 28 | Anthropic | 0.725 | — | 200K | $15.00 / $75.00 | — |
| 29 | Alibaba Cloud / Qwen Team | 0.724 | 27B | — | — | — |
| 30 | Alibaba Cloud / Qwen Team | 0.720 | 122B | 262K | $0.40 / $3.20 | — |
| 31 | Moonshot AI | 0.713 | 1.0T | — | — | — |
| 32 | — | 0.708 | — | 256K | $0.20 / $1.50 | — |
| 33 | Anthropic | 0.703 | — | 200K | $3.00 / $15.00 | — |
| 34 | Meituan | 0.700 | 560B | 128K | $0.30 / $1.20 | — |
| 35 | Alibaba Cloud / Qwen Team | 0.696 | 480B | — | — | — |
| 35 | Alibaba Cloud / Qwen Team | 0.696 | 1.0T | 256K | $0.50 / $5.00 | — |
| 37 | MiniMax | 0.694 | 230B | 1.0M | $0.30 / $1.20 | — |
| 38 | Alibaba Cloud / Qwen Team | 0.692 | 35B | 262K | $0.25 / $2.00 | — |
| 39 | OpenAI | 0.691 | — | 200K | $2.00 / $8.00 | — |
| 40 | OpenAI | 0.681 | — | 200K | $1.10 / $4.40 | — |
| 41 | Zhipu AI | 0.680 | 357B | 131K | $0.55 / $2.19 | — |
| 42 | DeepSeek | 0.678 | 685B | — | — | — |
| 43 | — | 0.672 | — | 1.0M | $1.25 / $10.00 | — |
| 44 | MiniMax | 0.670 | 230B | 1.0M | $0.30 / $1.20 | — |
| 45 | DeepSeek | 0.660 | 671B | 164K | $0.27 / $1.00 | — |
| 46 | Moonshot AI | 0.658 | 1.0T | — | — | — |
| 47 | Zhipu AI | 0.642 | 355B | 131K | $0.40 / $1.60 | — |
| 48 | Google | 0.632 | — | 1.0M | $1.25 / $10.00 | — |
| 49 | Mistral AI | 0.616 | — | 128K | $0.40 / $2.00 | — |
| 50 | Meituan | 0.604 | 560B | 128K | $0.30 / $1.20 | — |
Showing 1-50 of 77 models.
FAQ
Common questions about SWE-Bench Verified
**What is SWE-Bench Verified?**
SWE-Bench Verified is a subset of 500 software engineering problems drawn from real GitHub issues and validated by human annotators. It evaluates a language model's ability to resolve real-world coding issues by generating patches for Python codebases.
**Where can I read more about the benchmark?**
The original SWE-Bench paper is available at https://arxiv.org/abs/2310.06770 and details the benchmark methodology, dataset creation, and evaluation criteria; SWE-Bench Verified is the human-validated subset of that dataset.
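Results are produced by running model-generated patches through the SWE-bench evaluation harness. Below is a minimal sketch of the JSONL predictions file the harness consumes; the key names (`instance_id`, `model_name_or_path`, `model_patch`) follow the SWE-bench repository's documented format, but check them against the harness version you use:

```python
# Minimal sketch: write a predictions file for the SWE-bench harness.
# One JSON object per line; the key names are assumptions taken from
# the SWE-bench repo's documented format and may differ by version.
import json

predictions = [
    {
        "instance_id": "astropy__astropy-12907",        # hypothetical example ID
        "model_name_or_path": "my-model",                # label for this run
        "model_patch": "diff --git a/f.py b/f.py\n...",  # unified diff as a string
    },
]

with open("predictions.jsonl", "w") as f:
    for pred in predictions:
        f.write(json.dumps(pred) + "\n")
```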
**How are models ranked on the leaderboard?**
The SWE-Bench Verified leaderboard ranks 77 AI models by their benchmark score. Claude Opus 4.5 by Anthropic currently leads with a score of 0.809, and the average score across all listed models is 0.622.
**What is the highest SWE-Bench Verified score?**
The highest SWE-Bench Verified score is 0.809, achieved by Claude Opus 4.5 from Anthropic.
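For intuition, a score here is the fraction of the 500 tasks a model resolves, so 0.809 corresponds to roughly 404-405 resolved issues. A minimal sketch of that arithmetic, assuming score = resolved / 500:

```python
# Minimal sketch of the score arithmetic, assuming a SWE-Bench Verified
# score is (resolved issues) / (500 total tasks).
TOTAL_TASKS = 500

def resolution_rate(resolved: int, total: int = TOTAL_TASKS) -> float:
    """Fraction of tasks whose generated patch passed the tests."""
    return resolved / total

print(resolution_rate(404))  # 0.808
print(resolution_rate(405))  # 0.810
# The reported 0.809 falls between the two, consistent with a score
# averaged over multiple evaluation runs (an assumption, not confirmed).
```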
**How many models have been evaluated?**
77 models have been evaluated on the SWE-Bench Verified benchmark. All 77 results are self-reported; none have been independently verified.
**What does SWE-Bench Verified cover?**
The benchmark is categorized under code, frontend development, and reasoning, and it evaluates text models.