SWE-bench Verified (Agentic Coding)

Progress Over Time

[Interactive timeline: model performance on SWE-bench Verified (Agentic Coding) over time, with the state-of-the-art frontier and open vs. proprietary models marked.]

SWE-bench Verified (Agentic Coding) Leaderboard

2 models

1. Context: 200K. Cost: $3.00 / $15.00 (input / output).
2. Organization: Moonshot AI. Parameters: 1.0T.
About this benchmark

What is SWE-bench Verified (Agentic Coding)?

SWE-bench Verified is a human-filtered subset of 500 software engineering problems drawn from real GitHub issues across 12 popular Python repositories. Given a codebase and an issue description, language models are tasked with generating patches that resolve the described problems. This benchmark evaluates AI's real-world agentic coding skills by requiring models to navigate complex codebases, understand software engineering problems, and coordinate changes across multiple functions, classes, and files to fix well-defined issues with clear descriptions.

SWE-bench Verified (Agentic Coding) is a text benchmark evaluating models on reasoning and code tasks. LLM Stats tracks 2 models on this benchmark, scored on a 0–1 scale. The current average is 0.715, with the leader at 0.772.
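
The summary figures can be reproduced from the two scores listed on this page. The following is a minimal sketch, assuming the average is a plain arithmetic mean over the tracked models; the variable names are illustrative and not part of any LLM Stats API.

```python
from statistics import mean

# Scores as listed on this page (0-1 scale).
scores = {
    "Claude Sonnet 4.5": 0.772,
    "Kimi K2 Instruct": 0.658,
}

# Leader is the model with the highest score.
leader = max(scores, key=scores.get)

# Assumption: the page's "average" is an unweighted mean over all tracked models.
average = mean(list(scores.values()))

print(f"leader: {leader} ({scores[leader]:.3f})")
print(f"average: {average:.3f}")
```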

Current leaders

Claude Sonnet 4.5 from Anthropic currently leads the 2 evaluated AI models on the SWE-bench Verified (Agentic Coding) leaderboard, with a score of 0.772.

1. Claude Sonnet 4.5 (Anthropic): 77.2%
2. Kimi K2 Instruct (Moonshot AI): 65.8%
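
To make the percentages concrete, they can be converted into approximate problem counts. This is a rough sketch that assumes each score is the fraction of all 500 SWE-bench Verified problems resolved; the results on this page are self-reported, so the actual evaluation setup may differ. The helper `implied_resolved` is a hypothetical name used only for illustration.

```python
TOTAL_PROBLEMS = 500  # size of SWE-bench Verified, as stated on this page

# Reported leaderboard scores, as percentages.
reported = {
    "Claude Sonnet 4.5": 77.2,
    "Kimi K2 Instruct": 65.8,
}

def implied_resolved(percent: float, total: int = TOTAL_PROBLEMS) -> int:
    """Problems resolved, ASSUMING the score is the resolved fraction of `total`."""
    return round(percent / 100 * total)

for model, percent in reported.items():
    print(f"{model}: about {implied_resolved(percent)} of {TOTAL_PROBLEMS} problems")
```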

Source paper

Title
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Authors
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, and 3 others
Published
Abstract

Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. To this end, we introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation tasks. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. The best-performing model, Claude 2, is able to solve a mere 1.96% of the issues. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.

FAQ

Common questions about the SWE-bench Verified (Agentic Coding) benchmark and leaderboard.

What is the SWE-bench Verified (Agentic Coding) leaderboard?

The SWE-bench Verified (Agentic Coding) leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, Claude Sonnet 4.5 by Anthropic leads with a score of 0.772. The average score across all models is 0.715.

What is the highest SWE-bench Verified (Agentic Coding) score?

The highest SWE-bench Verified (Agentic Coding) score is 0.772, achieved by Claude Sonnet 4.5 from Anthropic.

How many models are evaluated on SWE-bench Verified (Agentic Coding)?

2 models have been evaluated on the SWE-bench Verified (Agentic Coding) benchmark, with 0 verified results and 2 self-reported results.

Where can I find the SWE-bench Verified (Agentic Coding) paper?

The source paper for SWE-bench Verified (Agentic Coding) is available at https://arxiv.org/abs/2310.06770. It details the methodology, dataset construction, and evaluation criteria of the original 2,294-problem SWE-bench, from which the 500-problem Verified subset is drawn.

What categories does SWE-bench Verified (Agentic Coding) cover?

SWE-bench Verified (Agentic Coding) is categorized under reasoning and code. The benchmark evaluates text models.

What is the best open-source model on SWE-bench Verified (Agentic Coding)?

Kimi K2 Instruct by Moonshot AI is the top-ranked open-source model on SWE-bench Verified (Agentic Coding), with a score of 0.658 (rank #2).

Which model offers the best value on SWE-bench Verified (Agentic Coding)?

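Only models scoring within 10% of the leader are compared. Kimi K2 Instruct, at 0.658, falls outside that band, so the leader itself, Claude Sonnet 4.5 from Anthropic, is the cheapest qualifying model, at $3.00 per million input tokens with a score of 0.772.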
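
The selection rule described above can be sketched as follows. This assumes "within 10% of the leader" means a relative band (score at least 90% of the top score); the page does not say whether the band is relative or absolute. The function `best_value` and the tuple layout are hypothetical, and the Kimi K2 Instruct price is left as `None` because it is not stated on this page.

```python
# (model, score, input price in USD per million tokens)
# None marks a price that is not stated on this page.
models = [
    ("Claude Sonnet 4.5", 0.772, 3.00),
    ("Kimi K2 Instruct", 0.658, None),
]

def best_value(entries, tolerance=0.10):
    """Cheapest model whose score is within `tolerance` of the top score."""
    top = max(score for _, score, _ in entries)
    cutoff = top * (1 - tolerance)  # ASSUMPTION: relative band, not absolute points
    eligible = [e for e in entries if e[1] >= cutoff and e[2] is not None]
    # Among eligible models, pick the lowest input price.
    return min(eligible, key=lambda e: e[2]) if eligible else None

winner = best_value(models)
print(winner[0] if winner else "no eligible model")
```

With only two models tracked, the filter leaves a single candidate, so the price comparison is trivial here; it becomes meaningful once more models fall inside the band.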

How recent are the SWE-bench Verified (Agentic Coding) leaderboard results?

The SWE-bench Verified (Agentic Coding) leaderboard was last updated in July 2026 and currently includes 2 evaluated models.