
SWE-Bench Pro

SWE-Bench Pro is an advanced version of SWE-Bench that evaluates language models on complex, real-world software engineering tasks requiring extended reasoning and multi-step problem solving.
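Each SWE-Bench-style task gives a model a repository snapshot plus a natural-language issue report, and scores the patch the model produces by running the repository's tests. Below is a minimal sketch of the data involved, using field names from the original SWE-Bench harness (`instance_id`, `model_patch`, `FAIL_TO_PASS`, and so on); SWE-Bench Pro's exact schema may differ, and every concrete value in the sketch is a hypothetical placeholder:

```python
import json

# One task instance, in the style of the original SWE-Bench dataset.
# All concrete values here are hypothetical placeholders.
instance = {
    "instance_id": "example__repo-1234",
    "repo": "example/repo",
    "base_commit": "abc1234",  # repository state the model starts from
    "problem_statement": "Pagination returns an empty final page.",
    "FAIL_TO_PASS": ["tests/test_pagination.py::test_last_page"],   # must pass after the patch
    "PASS_TO_PASS": ["tests/test_pagination.py::test_first_page"],  # must keep passing
}

# The model's answer is a unified diff; the harness applies it and reruns the tests.
prediction = {
    "instance_id": instance["instance_id"],
    "model_name_or_path": "my-model",
    "model_patch": "diff --git a/pagination.py b/pagination.py\n...",
}

# Predictions are conventionally collected into a JSONL file for evaluation.
with open("predictions.jsonl", "w") as f:
    f.write(json.dumps(prediction) + "\n")
```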

Progress Over Time

[Interactive timeline: model performance on SWE-Bench Pro over time, tracing the state-of-the-art frontier across open and proprietary models.]

SWE-Bench Pro Leaderboard

8 models • 0 verified

| Rank | Model   | Organization | Score | Params | Context | Input cost | Output cost |
|------|---------|--------------|-------|--------|---------|------------|-------------|
| 1    | GPT-5.4 | OpenAI       | 0.577 |        | 1.0M    | $2.50      | $15.00      |
| 2    |         |              | 0.568 |        | 400K    | $1.75      | $14.00      |
| 3    |         |              | 0.564 |        | 400K    | $1.75      | $14.00      |
| 4    |         |              | 0.554 | 230B   | 1.0M    | $0.30      | $1.20       |
| 5    |         | OpenAI       | 0.544 |        | 400K    | $0.75      | $4.50       |
| 6    |         |              | 0.542 |        | 1.0M    | $2.50      | $15.00      |
| 7    |         | OpenAI       | 0.524 |        | 400K    | $0.20      | $1.25       |
| 8    |         | Moonshot AI  | 0.507 | 1.0T   | 262K    | $0.60      | $2.50       |

Cost columns give input and output prices per 1M tokens.
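The leaderboard's aggregate numbers can be reproduced from the table itself. The sketch below recomputes the average score quoted in the FAQ and shows how the per-token prices translate into a cost estimate for a single run; the token counts are illustrative assumptions, not benchmark data:

```python
# Scores from the leaderboard above.
scores = [0.577, 0.568, 0.564, 0.554, 0.544, 0.542, 0.524, 0.507]
print(f"average score: {sum(scores) / len(scores):.4f}")  # 0.5475, quoted as 0.547 in the FAQ

# Prices are input/output cost per 1M tokens, as listed in the table
# (top-ranked model's pricing). Token counts are hypothetical.
input_price, output_price = 2.50, 15.00
input_tokens, output_tokens = 120_000, 8_000
cost = input_tokens / 1e6 * input_price + output_tokens / 1e6 * output_price
print(f"estimated cost for one task: ${cost:.2f}")  # $0.42
```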

FAQ

Common questions about SWE-Bench Pro

What is SWE-Bench Pro?
SWE-Bench Pro is an advanced version of SWE-Bench that evaluates language models on complex, real-world software engineering tasks requiring extended reasoning and multi-step problem solving.

How are models ranked on SWE-Bench Pro?
The leaderboard ranks 8 AI models by their benchmark score. GPT-5.4 by OpenAI currently leads with a score of 0.577, and the average score across all models is 0.547.

What is the highest SWE-Bench Pro score?
The highest score is 0.577, achieved by GPT-5.4 from OpenAI.

How many models have been evaluated?
8 models have been evaluated on SWE-Bench Pro, with 0 verified results and 8 self-reported results.

What categories does SWE-Bench Pro cover?
SWE-Bench Pro is categorized under agents, code, and reasoning, and it evaluates text models.