
SWE-Bench Verified

A human-validated subset of 500 software engineering problems drawn from real GitHub issues, used to evaluate language models' ability to resolve real-world coding issues by generating patches to Python codebases.
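Each task instance pairs a repository snapshot with a GitHub issue and the tests a correct fix must satisfy; a model is scored on whether its generated patch makes those tests pass. As a rough illustration, the sketch below loads the dataset with the Hugging Face `datasets` library and inspects one instance. The dataset ID and field names are assumptions based on the public SWE-bench release, not something this page specifies.

```python
# Sketch: inspect SWE-Bench Verified task instances with the Hugging Face
# `datasets` library. The dataset ID and field names follow the public
# SWE-bench release (princeton-nlp/SWE-bench_Verified) and are assumptions,
# not details guaranteed by this leaderboard page.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds))  # 500 human-validated task instances

task = ds[0]
print(task["repo"])               # repository the issue comes from, e.g. "astropy/astropy"
print(task["base_commit"])        # commit the model's patch is applied to
print(task["problem_statement"])  # the GitHub issue text given to the model
print(task["FAIL_TO_PASS"])       # tests that must pass after the fix
print(task["PASS_TO_PASS"])       # tests that must keep passing
```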

Paper: https://arxiv.org/abs/2310.06770

Progress Over Time

[Interactive timeline showing model performance evolution on SWE-Bench Verified, tracing the state-of-the-art frontier and distinguishing open from proprietary models.]

SWE-Bench Verified Leaderboard

77 models • 0 verified
| Rank | Organization | Score | Params | Context | Input $/1M tok | Output $/1M tok |
|------|--------------|-------|--------|---------|----------------|-----------------|
| 1 | Anthropic | 0.809 | | 200K | $5.00 | $25.00 |
| 2 | | 0.808 | | 200K | $5.00 | $25.00 |
| 3 | | 0.806 | | 1.0M | $2.50 | $15.00 |
| 4 | | 0.802 | 230B | 1.0M | $0.30 | $1.20 |
| 5 | OpenAI | 0.800 | | 400K | $1.75 | $14.00 |
| 6 | | 0.796 | | 200K | $3.00 | $15.00 |
| 7 | | 0.780 | | 1.0M | $0.50 | $3.00 |
| 8 | Zhipu AI | 0.778 | 744B | 200K | $1.00 | $3.20 |
| 9 | Moonshot AI | 0.768 | 1.0T | 262K | $0.60 | $2.50 |
| 10 | ByteDance | 0.765 | | | | |
| 11 | Alibaba Cloud / Qwen Team | 0.764 | 397B | 262K | $0.60 | $3.60 |
| 12 | OpenAI | 0.763 | | 400K | $1.25 | $10.00 |
| 12 | | 0.763 | | 400K | $1.25 | $10.00 |
| 12 | | 0.763 | | 400K | $1.25 | $10.00 |
| 15 | | 0.762 | | | | |
| 16 | OpenAI | 0.749 | | 400K | $1.25 | $10.00 |
| 17 | | 0.745 | | 200K | $15.00 | $75.00 |
| 17 | | 0.745 | | | | |
| 19 | | 0.744 | 196B | 66K | $0.10 | $0.40 |
| 20 | Zhipu AI | 0.738 | 358B | 205K | $0.60 | $2.20 |
| 21 | | 0.737 | | 400K | $1.25 | $10.00 |
| 22 | ByteDance | 0.735 | | | | |
| 23 | | 0.734 | 309B | 256K | $0.10 | $0.30 |
| 24 | | 0.733 | | 200K | $1.00 | $5.00 |
| 25 | | 0.731 | 685B | | | |
| 25 | | 0.731 | 685B | | | |
| 27 | | 0.727 | | 200K | $3.00 | $15.00 |
| 28 | Anthropic | 0.725 | | 200K | $15.00 | $75.00 |
| 29 | Alibaba Cloud / Qwen Team | 0.724 | 27B | | | |
| 30 | Alibaba Cloud / Qwen Team | 0.720 | 122B | 262K | $0.40 | $3.20 |
| 31 | | 0.713 | 1.0T | | | |
| 32 | | 0.708 | | 256K | $0.20 | $1.50 |
| 33 | | 0.703 | | 200K | $3.00 | $15.00 |
| 34 | | 0.700 | 560B | 128K | $0.30 | $1.20 |
| 35 | Alibaba Cloud / Qwen Team | 0.696 | 480B | | | |
| 35 | Alibaba Cloud / Qwen Team | 0.696 | 1.0T | 256K | $0.50 | $5.00 |
| 37 | MiniMax | 0.694 | 230B | 1.0M | $0.30 | $1.20 |
| 38 | Alibaba Cloud / Qwen Team | 0.692 | 35B | 262K | $0.25 | $2.00 |
| 39 | OpenAI | 0.691 | | 200K | $2.00 | $8.00 |
| 40 | OpenAI | 0.681 | | 200K | $1.10 | $4.40 |
| 41 | Zhipu AI | 0.680 | 357B | 131K | $0.55 | $2.19 |
| 42 | | 0.678 | 685B | | | |
| 43 | | 0.672 | | 1.0M | $1.25 | $10.00 |
| 44 | | 0.670 | 230B | 1.0M | $0.30 | $1.20 |
| 45 | | 0.660 | 671B | 164K | $0.27 | $1.00 |
| 46 | | 0.658 | 1.0T | | | |
| 47 | Zhipu AI | 0.642 | 355B | 131K | $0.40 | $1.60 |
| 48 | | 0.632 | | 1.0M | $1.25 | $10.00 |
| 49 | Mistral AI | 0.616 | | 128K | $0.40 | $2.00 |
| 50 | | 0.604 | 560B | 128K | $0.30 | $1.20 |
Showing 1-50 of 77

FAQ

Common questions about SWE-Bench Verified

The SWE-Bench Verified paper is available at https://arxiv.org/abs/2310.06770. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The SWE-Bench Verified leaderboard ranks 77 AI models based on their performance on this benchmark. Currently, Claude Opus 4.5 by Anthropic leads with a score of 0.809. The average score across all models is 0.622.
The highest SWE-Bench Verified score is 0.809, achieved by Claude Opus 4.5 from Anthropic.
77 models have been evaluated on the SWE-Bench Verified benchmark, with 0 verified results and 77 self-reported results.
SWE-Bench Verified is categorized under code, frontend development, and reasoning. The benchmark evaluates text models.
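The scores above are resolution rates: the fraction of the 500 task instances for which a model's patch makes the previously failing tests pass without breaking the tests that already passed. A minimal sketch of that bookkeeping, assuming a hypothetical per-instance `results` mapping like the one an evaluation harness report would provide:

```python
# Sketch: turn per-instance evaluation results into a leaderboard score.
# `results` is a hypothetical mapping from instance_id to whether the model's
# patch resolved the issue (FAIL_TO_PASS tests pass, PASS_TO_PASS still pass).
def swe_bench_score(results: dict[str, bool], total_instances: int = 500) -> float:
    """Resolution rate: resolved instances / all instances in the benchmark."""
    resolved = sum(1 for ok in results.values() if ok)
    return resolved / total_instances

# Example: resolving 404 of the 500 Verified instances gives a score of 0.808,
# in the range of the top entries on this leaderboard.
print(swe_bench_score({f"task-{i}": i < 404 for i in range(500)}))  # 0.808
```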

Sub-benchmarks