SWE-bench Verified (Multiple Attempts)


Progress Over Time

[Interactive timeline of model performance on SWE-bench Verified (Multiple Attempts). Legend: state-of-the-art frontier, open models, proprietary models.]

SWE-bench Verified (Multiple Attempts) Leaderboard

1 model

Columns: Context, Cost, License

1. Moonshot AI, 1.0T

About this benchmark

What is SWE-bench Verified (Multiple Attempts)?

SWE-bench Verified is a human-validated subset of 500 test samples from the original SWE-bench dataset that evaluates AI systems' ability to automatically resolve real GitHub issues in Python repositories. Given a codebase and issue description, models must edit the code to successfully resolve the problem, requiring understanding and coordination of changes across multiple functions, classes, and files. The Verified version provides more reliable evaluation through manual validation of test samples.
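
As a rough illustration of how a score on a benchmark like this could be tallied, the sketch below counts an issue as resolved when a candidate patch makes the tests pass. The page does not say how multiple attempts are aggregated, so the "resolved if any attempt passes" rule is an assumption made only for this example, and the `Attempt` record and instance IDs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One candidate patch for one issue (hypothetical record)."""
    instance_id: str
    tests_passed: bool  # True if the tests pass after the patch is applied


def resolved_rate(attempts: list[Attempt], total_instances: int) -> float:
    """Fraction of instances with at least one passing attempt (assumed rule)."""
    resolved = {a.instance_id for a in attempts if a.tests_passed}
    return len(resolved) / total_instances


attempts = [
    Attempt("repo_a-101", False),
    Attempt("repo_a-101", True),   # second attempt resolves the issue
    Attempt("repo_b-202", False),
    Attempt("repo_b-202", False),  # never resolved
]
print(resolved_rate(attempts, total_instances=2))  # 0.5
```

Collecting resolved IDs in a set keeps an issue from being counted twice when more than one attempt succeeds.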

SWE-bench Verified (Multiple Attempts) is a text benchmark that LLM Stats categorizes under reasoning. LLM Stats tracks 1 model on this benchmark, scored on a 0–1 scale. Because only one model is listed, the average and the leading score are the same: 0.716.
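
To relate the 0–1 scale to the percentages shown elsewhere on this page, the sketch below converts the leader's reported score of 0.716 into a percentage and an approximate count of resolved issues. It assumes the score is the fraction of the 500 Verified samples that were resolved; the resulting count is an estimate derived from that assumption, not a figure reported by the leaderboard. The helper names are illustrative.

```python
VERIFIED_SIZE = 500  # number of samples stated in the benchmark description


def as_percent(score: float) -> str:
    """Format a 0-1 score as a percentage with one decimal place."""
    return f"{score * 100:.1f}%"


def approx_resolved(score: float, total: int = VERIFIED_SIZE) -> int:
    """Estimate resolved instances, assuming score = resolved / total."""
    return round(score * total)


print(as_percent(0.716))       # 71.6%
print(approx_resolved(0.716))  # 358
```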


Current leaders

Kimi K2 Instruct from Moonshot AI currently leads the SWE-bench Verified (Multiple Attempts) leaderboard with a score of 0.716 (71.6%). It is the only model evaluated so far.

1. Kimi K2 Instruct (Moonshot AI): 71.6%

Source paper

Title: SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Authors: Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, and 3 others

Abstract

Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. To this end, we introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation tasks. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. The best-performing model, Claude 2, is able to solve a mere 1.96% of the issues. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.

FAQ

Common questions about the SWE-bench Verified (Multiple Attempts) benchmark and leaderboard.

What is the SWE-bench Verified (Multiple Attempts) benchmark?

SWE-bench Verified is a human-validated subset of 500 test samples from the original SWE-bench dataset that evaluates AI systems' ability to automatically resolve real GitHub issues in Python repositories. Given a codebase and issue description, models must edit the code to successfully resolve the problem, requiring understanding and coordination of changes across multiple functions, classes, and files. The Verified version provides more reliable evaluation through manual validation of test samples.

What is the SWE-bench Verified (Multiple Attempts) leaderboard?

The SWE-bench Verified (Multiple Attempts) leaderboard ranks AI models by their performance on this benchmark; it currently lists 1 model. Kimi K2 Instruct by Moonshot AI leads with a score of 0.716, which is therefore also the average score.

What is the highest SWE-bench Verified (Multiple Attempts) score?

The highest SWE-bench Verified (Multiple Attempts) score is 0.716, achieved by Kimi K2 Instruct from Moonshot AI.

How many models are evaluated on SWE-bench Verified (Multiple Attempts)?

1 model has been evaluated on the SWE-bench Verified (Multiple Attempts) benchmark, with 0 verified results and 1 self-reported result.
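
The figures quoted in this FAQ (leader, top score, average, and the split between verified and self-reported results) can all be derived from a list of leaderboard entries. The sketch below shows one way to do that; the `Entry` record and `summarize` helper are hypothetical and do not reflect how LLM Stats actually stores or computes its data. The single entry uses the values reported on this page.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Entry:
    model: str
    organization: str
    score: float    # 0-1 scale
    verified: bool  # False means the result is self-reported


def summarize(entries: list[Entry]) -> dict:
    """Compute leaderboard summary statistics from a non-empty entry list."""
    ranked = sorted(entries, key=lambda e: e.score, reverse=True)
    return {
        "leader": ranked[0].model,
        "top_score": ranked[0].score,
        "average": mean([e.score for e in entries]),
        "verified": sum(e.verified for e in entries),
        "self_reported": sum(not e.verified for e in entries),
    }


entries = [Entry("Kimi K2 Instruct", "Moonshot AI", 0.716, verified=False)]
summary = summarize(entries)
print(summary["leader"], summary["average"])  # Kimi K2 Instruct 0.716
```

With a single entry the average necessarily equals the top score, which is why both are 0.716 here.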

Where can I find the SWE-bench Verified (Multiple Attempts) paper?

The source paper, which introduces the original SWE-bench benchmark, is available at https://arxiv.org/abs/2310.06770. It details the methodology, dataset construction, and evaluation criteria.

What categories does SWE-bench Verified (Multiple Attempts) cover?

SWE-bench Verified (Multiple Attempts) is categorized under reasoning. The benchmark evaluates text models.

What is the best open-source model on SWE-bench Verified (Multiple Attempts)?

Kimi K2 Instruct by Moonshot AI is the top-ranked open-source model on SWE-bench Verified (Multiple Attempts), with a score of 0.716 (rank #1).

How recent are the SWE-bench Verified (Multiple Attempts) leaderboard results?

The SWE-bench Verified (Multiple Attempts) leaderboard was last updated in July 2026 and currently includes 1 evaluated model.