SWE-bench Verified (Agentic Coding)
SWE-bench Verified (Agentic Coding) Leaderboard
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | Anthropic | — | 200K | $3.00 / $15.00 | |
| 2 | Moonshot AI | 1.0T | — | — | |
What is SWE-bench Verified (Agentic Coding)?
SWE-bench Verified is a human-filtered subset of 500 software engineering problems drawn from real GitHub issues across 12 popular Python repositories. Given a codebase and an issue description, language models are tasked with generating patches that resolve the described problems. This benchmark evaluates AI's real-world agentic coding skills by requiring models to navigate complex codebases, understand software engineering problems, and coordinate changes across multiple functions, classes, and files to fix well-defined issues with clear descriptions.
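The scoring idea described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the `Outcome` class, its field names, and the instance identifiers are hypothetical, and it assumes a problem counts as resolved when the model's patch makes the issue's tests pass.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    instance_id: str    # hypothetical identifier for one benchmark problem
    tests_passed: bool  # assumed criterion: the patch made the issue's tests pass


def resolve_rate(outcomes: list[Outcome]) -> float:
    """Return the fraction of problems resolved, on a 0-1 scale."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(o.tests_passed for o in outcomes) / len(outcomes)


# Four made-up outcomes, three of them resolved.
outcomes = [
    Outcome("repo-a__issue-1", True),
    Outcome("repo-a__issue-2", False),
    Outcome("repo-b__issue-3", True),
    Outcome("repo-b__issue-4", True),
]
print(resolve_rate(outcomes))  # 0.75
```

On the full benchmark the same calculation would run over all 500 problems rather than four.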
SWE-bench Verified (Agentic Coding) is a text benchmark evaluating models on reasoning and code tasks. LLM Stats tracks 2 models on this benchmark, scored on a 0–1 scale. The current average is about 0.7, with the leader at 0.772 (0.8 when rounded to one decimal place).
Current leaders
Claude Sonnet 4.5 from Anthropic currently leads the SWE-bench Verified (Agentic Coding) leaderboard with a score of 0.772, the higher of the 2 evaluated AI models.
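To make the 0–1 scale concrete, here is a small sketch that converts a leaderboard score into an approximate number of resolved problems. It assumes the score is the fraction of the 500 problems resolved, which this page does not state explicitly; the function name `resolved_count` is made up for illustration.

```python
TOTAL_PROBLEMS = 500  # size of SWE-bench Verified, as stated in the description


def resolved_count(score: float, total: int = TOTAL_PROBLEMS) -> int:
    """Convert a 0-1 score into an approximate count of resolved problems."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    return round(score * total)


print(resolved_count(0.772))  # 386
```

Under that assumption, the leading score of 0.772 corresponds to roughly 386 of the 500 problems.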
Source paper
- Title: SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
- Authors: Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, and 3 others
- Published: arXiv:2310.06770
Abstract
Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. To this end, we introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation tasks. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. The best-performing model, Claude 2, is able to solve a mere 1.96% of the issues. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.
FAQ
Common questions about the SWE-bench Verified (Agentic Coding) benchmark and leaderboard.