
SWE-bench Verified (Agentic Coding)

SWE-bench Verified is a human-filtered subset of 500 software engineering problems drawn from real GitHub issues across 12 popular Python repositories. Given a codebase and an issue description, language models are tasked with generating patches that resolve the described problems. This benchmark evaluates AI's real-world agentic coding skills by requiring models to navigate complex codebases, understand software engineering problems, and coordinate changes across multiple functions, classes, and files to fix well-defined issues with clear descriptions.
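To make the task format concrete, here is a minimal Python sketch of a SWE-bench-style task instance and its resolution criterion. The field names (`instance_id`, `repo`, `problem_statement`, `FAIL_TO_PASS`, `PASS_TO_PASS`) follow the published dataset schema, but the instance contents and the `is_resolved` helper are illustrative assumptions, not the official evaluation harness.

```python
# Illustrative sketch of a SWE-bench-style task instance and resolution check.
# Field names mirror the published dataset schema; the values and the
# is_resolved() helper are hypothetical, not the official harness.

task = {
    "instance_id": "example__repo-1234",        # hypothetical instance
    "repo": "example/repo",
    "problem_statement": "Calling foo() with an empty list raises IndexError.",
    "FAIL_TO_PASS": ["tests/test_foo.py::test_empty_list"],  # must newly pass
    "PASS_TO_PASS": ["tests/test_foo.py::test_basic"],       # must keep passing
}

def is_resolved(passing_tests: set, task: dict) -> bool:
    """A patch resolves the issue iff every FAIL_TO_PASS test now passes
    and no PASS_TO_PASS test regresses."""
    return (
        all(t in passing_tests for t in task["FAIL_TO_PASS"])
        and all(t in passing_tests for t in task["PASS_TO_PASS"])
    )

# After applying a candidate patch and rerunning the test suite:
print(is_resolved({"tests/test_foo.py::test_empty_list",
                   "tests/test_foo.py::test_basic"}, task))  # True
print(is_resolved({"tests/test_foo.py::test_basic"}, task))  # False: issue test still fails
```

The key design point this illustrates: a patch is only scored as resolving the issue if the previously failing tests pass without breaking any previously passing tests.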

Paper: https://arxiv.org/abs/2310.06770

Progress Over Time

Interactive timeline showing model performance evolution on SWE-bench Verified (Agentic Coding)


SWE-bench Verified (Agentic Coding) Leaderboard

1 model evaluated • 0 verified

1. Claude Sonnet 4.5 (Anthropic): 0.772, self-reported

FAQ

Common questions about SWE-bench Verified (Agentic Coding)

What is SWE-bench Verified?
SWE-bench Verified is a human-filtered subset of 500 software engineering problems drawn from real GitHub issues across 12 popular Python repositories. Given a codebase and an issue description, a model must generate a patch that resolves the described problem, which requires navigating the codebase and coordinating changes across multiple functions, classes, and files.

Where can I find the paper?
The SWE-bench paper is available at https://arxiv.org/abs/2310.06770. It details the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the leaderboard?
The leaderboard currently lists a single model: Claude Sonnet 4.5 by Anthropic, with a score of 0.772.

What is the highest score?
The highest SWE-bench Verified (Agentic Coding) score is 0.772, achieved by Claude Sonnet 4.5 from Anthropic.

How many models have been evaluated?
One model has been evaluated on the benchmark, with 0 verified results and 1 self-reported result.

What categories does the benchmark cover?
SWE-bench Verified (Agentic Coding) is categorized under code and reasoning, and evaluates text models.