SWE-Bench Pro
SWE-Bench Pro is an advanced version of SWE-Bench that evaluates language models on complex, real-world software engineering tasks requiring extended reasoning and multi-step problem solving.
Progress Over Time
[Interactive timeline of model performance on SWE-Bench Pro over time, showing the state-of-the-art frontier and distinguishing open from proprietary models.]
SWE-Bench Pro Leaderboard
8 models • 0 verified
| Rank | Model | Lab | Score | Params | Context | Cost (input / output) | License |
|---|---|---|---|---|---|---|---|
| 1 | GPT-5.4 | OpenAI | 0.577 | — | 1.0M | $2.50 / $15.00 | — |
| 2 | — | OpenAI | 0.568 | — | 400K | $1.75 / $14.00 | — |
| 3 | — | OpenAI | 0.564 | — | 400K | $1.75 / $14.00 | — |
| 4 | — | MiniMax | 0.554 | 230B | 1.0M | $0.30 / $1.20 | — |
| 5 | GPT-5.4 mini | OpenAI | 0.544 | — | 400K | $0.75 / $4.50 | — |
| 6 | — | Google | 0.542 | — | 1.0M | $2.50 / $15.00 | — |
| 7 | GPT-5.4 nano | OpenAI | 0.524 | — | 400K | $0.20 / $1.25 | — |
| 8 | — | Moonshot AI | 0.507 | 1.0T | 262K | $0.60 / $2.50 | — |
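The Cost column lists two prices per model, one for input tokens and one for output tokens, which leaderboards like this conventionally quote in USD per million tokens. Assuming that convention holds here (the per-million-token unit and the token counts below are assumptions, not data from this page), the cost of a single run can be estimated with a short Python sketch:

```python
def run_cost(input_tokens: int, output_tokens: int,
             in_price: float, out_price: float) -> float:
    """Estimated USD cost of one run, given per-1M-token prices."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Rank-1 pricing from the table ($2.50 input / $15.00 output) with
# hypothetical token counts of 120k input and 8k output:
print(f"${run_cost(120_000, 8_000, 2.50, 15.00):.2f}")  # -> $0.42
```

Since agentic software-engineering runs tend to be input-heavy (the model repeatedly reads repository context), the input price usually dominates this estimate.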
FAQ
Common questions about SWE-Bench Pro
**What is SWE-Bench Pro?**
SWE-Bench Pro is an advanced version of SWE-Bench that evaluates language models on complex, real-world software engineering tasks requiring extended reasoning and multi-step problem solving.

**Which model leads the leaderboard?**
The SWE-Bench Pro leaderboard ranks 8 AI models based on their performance on this benchmark. Currently, GPT-5.4 by OpenAI leads with a score of 0.577. The average score across all models is approximately 0.548.

**What is the highest SWE-Bench Pro score?**
The highest SWE-Bench Pro score is 0.577, achieved by GPT-5.4 from OpenAI.

**How many models have been evaluated?**
8 models have been evaluated on the SWE-Bench Pro benchmark, with 0 verified results and 8 self-reported results.

**What categories does SWE-Bench Pro cover?**
SWE-Bench Pro is categorized under agents, code, and reasoning. The benchmark evaluates text models.
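The summary figures quoted in this FAQ can be reproduced directly from the leaderboard's Score column; a minimal sketch (the score list is copied from the table above):

```python
# Scores from the SWE-Bench Pro leaderboard, in rank order.
scores = [0.577, 0.568, 0.564, 0.554, 0.544, 0.542, 0.524, 0.507]

best = max(scores)
mean = sum(scores) / len(scores)
print(f"{len(scores)} models | best {best} | mean {mean:.4f}")
# -> 8 models | best 0.577 | mean 0.5475
```

The exact mean of the displayed scores is 0.5475, which rounds to the 0.548 quoted above.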