SWE-Bench Verified

A human-validated subset of 500 software engineering problems drawn from real GitHub issues, used to evaluate language models' ability to resolve real-world coding issues by generating patches for Python codebases.

Paper: https://arxiv.org/abs/2310.06770
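
To get a concrete feel for the task format, here is a minimal sketch of inspecting the dataset. It assumes the benchmark is published on the Hugging Face Hub as princeton-nlp/SWE-bench_Verified and that the field names follow the public SWE-bench releases; treat both as assumptions to verify.

```python
# Minimal sketch: load and inspect SWE-Bench Verified task instances.
# Assumes the Hugging Face dataset id "princeton-nlp/SWE-bench_Verified"
# and the field names used by the public SWE-bench releases.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds))  # 500 human-validated instances

ex = ds[0]
print(ex["repo"])               # source GitHub repository, e.g. "astropy/astropy"
print(ex["instance_id"])        # unique id tying the task to a repo + issue
print(ex["problem_statement"])  # the issue text the model must resolve
print(ex["FAIL_TO_PASS"])       # tests that must flip from failing to passing
print(ex["PASS_TO_PASS"])       # tests that must keep passing after the patch
```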

Progress Over Time

[Interactive timeline showing model performance evolution on SWE-Bench Verified, with a state-of-the-art frontier line and separate markers for open and proprietary models.]

SWE-Bench Verified Leaderboard

[Interactive leaderboard: 80 models ranked by score, 50 shown per page, with columns for organization, parameter count, context window, cost per 1M input/output tokens, and license. Individual rows are not recoverable from the page extraction; headline results appear in the FAQ below.]
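
The leaderboard's Cost column is quoted as dollars per million input and output tokens. As a hedged sketch of what those figures mean in practice (the token counts below are made up for illustration):

```python
def run_cost(input_tokens: int, output_tokens: int,
             price_in: float, price_out: float) -> float:
    """Dollar cost of a run, given per-1M-token input/output prices."""
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

# Using the table's rank-1 pricing of $5.00 / $25.00 per 1M tokens:
print(run_cost(2_000_000, 300_000, 5.00, 25.00))  # -> 17.5 (dollars)
```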

FAQ

Common questions about SWE-Bench Verified

What is SWE-Bench Verified?
A human-validated subset of 500 software engineering problems drawn from real GitHub issues, used to evaluate language models' ability to resolve real-world coding issues by generating patches for Python codebases.

Where can I read more about the methodology?
SWE-Bench Verified builds on the original SWE-bench paper, available at https://arxiv.org/abs/2310.06770, which details the benchmark methodology, dataset creation, and evaluation criteria.
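
Scoring is pass/fail per instance: a model's patch counts as resolved only if the instance's tests pass after it is applied. Below is a hedged sketch of scoring predictions with the official swebench evaluation harness (pip install swebench); the prediction keys and CLI flags follow the public package's documented interface, but verify them against the version you install.

```python
# Sketch: write predictions in the harness's JSONL format, then score them.
# Keys and flags follow the public swebench package; treat as assumptions.
import json

predictions = [
    {
        "instance_id": "astropy__astropy-12907",  # example instance id
        "model_name_or_path": "my-model",         # free-form label for the run
        "model_patch": "diff --git a/...",        # unified diff to apply
    },
]
with open("preds.jsonl", "w") as f:
    for p in predictions:
        f.write(json.dumps(p) + "\n")

# The harness applies each patch in an isolated container and re-runs the
# instance's FAIL_TO_PASS / PASS_TO_PASS tests:
#   python -m swebench.harness.run_evaluation \
#       --dataset_name princeton-nlp/SWE-bench_Verified \
#       --predictions_path preds.jsonl \
#       --max_workers 8 --run_id demo
```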

How do models rank on the leaderboard?
The SWE-Bench Verified leaderboard ranks 80 AI models by their score on this benchmark. Claude Opus 4.5 by Anthropic currently leads with a score of 0.809; the average score across all models is 0.627.

What is the highest SWE-Bench Verified score?
The highest score is 0.809, achieved by Claude Opus 4.5 from Anthropic; at 500 instances, that corresponds to just over 400 resolved issues.

How many models have been evaluated?
80 models have been evaluated on SWE-Bench Verified. All 80 results are self-reported; none have been independently verified.

What categories does SWE-Bench Verified fall under?
It is categorized under code, frontend development, and reasoning, and it evaluates text models.

Sub-benchmarks