SWE-bench Verified (Agentic Coding)
SWE-bench Verified is a human-filtered subset of 500 software engineering problems drawn from real GitHub issues across 12 popular Python repositories. Given a codebase and an issue description, language models are tasked with generating patches that resolve the described problems. This benchmark evaluates AI's real-world agentic coding skills by requiring models to navigate complex codebases, understand software engineering problems, and coordinate changes across multiple functions, classes, and files to fix well-defined issues with clear descriptions.
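For readers who want to inspect the underlying tasks, here is a minimal sketch of loading one Verified instance; it assumes the Hugging Face `datasets` library and the public `princeton-nlp/SWE-bench_Verified` dataset, whose field names follow the SWE-bench schema:

```python
# Sketch: load one SWE-bench Verified task instance and look at its fields.
# Assumes the Hugging Face `datasets` library and the public dataset
# "princeton-nlp/SWE-bench_Verified" (field names per the SWE-bench schema).
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
example = ds[0]

print(example["repo"])               # source GitHub repository for the issue
print(example["instance_id"])        # unique task identifier
print(example["problem_statement"])  # the issue text the model is given
# "patch" holds the reference fix; "test_patch" adds the tests used to judge
# whether a model-generated patch actually resolves the issue.
```

In an actual evaluation, the model only sees the repository snapshot and the problem statement; the reference patch and tests are held out for scoring.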
Progress Over Time
[Interactive chart: timeline of model performance on SWE-bench Verified (Agentic Coding), tracing the state-of-the-art frontier and marking open versus proprietary models.]
SWE-bench Verified (Agentic Coding) Leaderboard
2 models
| Rank | Organization | Parameters | Context | Cost (input / output) | License |
|---|---|---|---|---|---|
| 1 | Anthropic | — | 200K | $3.00 / $15.00 | |
| 2 | Moonshot AI | 1.0T | 200K | $0.50 / $0.50 | |
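The scores reported on this leaderboard are resolution rates: the fraction of the 500 Verified instances whose held-out tests pass after the model's patch is applied. A small illustrative sketch follows (the helper function is hypothetical, not part of any evaluation harness):

```python
# Hypothetical helper illustrating how a score such as 0.772 arises:
# the share of benchmark instances marked "resolved" after patching.
def resolution_rate(resolved_flags: list[bool]) -> float:
    """Fraction of task instances whose fail-to-pass tests succeeded."""
    return sum(resolved_flags) / len(resolved_flags)

# 386 resolved out of 500 Verified instances -> 0.772
print(resolution_rate([True] * 386 + [False] * 114))
```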
FAQ
Common questions about SWE-bench Verified (Agentic Coding)
What is SWE-bench Verified (Agentic Coding)?
SWE-bench Verified is a human-filtered subset of 500 software engineering problems drawn from real GitHub issues across 12 popular Python repositories. Given a codebase and an issue description, language models are tasked with generating patches that resolve the described problems. This benchmark evaluates real-world agentic coding skills by requiring models to navigate complex codebases, understand software engineering problems, and coordinate changes across multiple functions, classes, and files to fix well-defined issues with clear descriptions.
Where can I read more about the benchmark?
The original SWE-bench paper, from which SWE-bench Verified is drawn, is available at https://arxiv.org/abs/2310.06770. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
How are models ranked on the leaderboard?
The SWE-bench Verified (Agentic Coding) leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, Claude Sonnet 4.5 by Anthropic leads with a score of 0.772. The average score across all models is 0.715.
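As a sanity check on the stated average: with only two models on the board, the second model's score follows from simple arithmetic.

```python
# With the leader at 0.772 and a mean of 0.715 across 2 models,
# the second-ranked model's score is implied to be 2 * 0.715 - 0.772 = 0.658.
leader_score, mean_score, n_models = 0.772, 0.715, 2
second_score = mean_score * n_models - leader_score
print(round(second_score, 3))  # 0.658
```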
What is the highest score?
The highest SWE-bench Verified (Agentic Coding) score is 0.772, achieved by Claude Sonnet 4.5 from Anthropic.
How many models have been evaluated?
2 models have been evaluated on the SWE-bench Verified (Agentic Coding) benchmark, with 0 verified results and 2 self-reported results.
What categories does this benchmark fall under?
SWE-bench Verified (Agentic Coding) is categorized under code and reasoning, and it evaluates text models.