SWE-bench Verified (Agentless)

Progress Over Time

[Interactive timeline: model performance over time on SWE-bench Verified (Agentless). Legend: state-of-the-art frontier, open models, proprietary models.]

SWE-bench Verified (Agentless) Leaderboard

2 models
Columns: Context, Cost, License
1. Moonshot AI, 1.0T
2. 1.0T, 1.0M, $0.43 / $0.87
About this benchmark

What is SWE-bench Verified (Agentless)?

A human-validated subset of SWE-bench that evaluates language models' ability to resolve real-world GitHub issues using an agentless approach. The benchmark tests models on software engineering problems requiring understanding and coordinating changes across multiple functions, classes, and files simultaneously.

SWE-bench Verified (Agentless) is a text benchmark that LLM Stats categorizes under reasoning and general tasks. LLM Stats tracks 2 models on this benchmark, scored on a 0–1 scale. The current average is 0.438, with the leader at 0.518.

Current leaders

Kimi K2 Instruct from Moonshot AI currently leads the SWE-bench Verified (Agentless) leaderboard with a score of 0.518, ahead of the 1 other model evaluated.

1. Kimi K2 Instruct (Moonshot AI): 51.8%
2. MiMo-V2.5-Pro (Xiaomi): 35.7%
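
The summary figures quoted on this page can be checked against the two listed scores. The snippet below is a minimal sketch, assuming the percentages are the 0–1 scores multiplied by 100 and that the average is a plain unweighted mean; the page does not state how it aggregates, and the `summarize` helper is a hypothetical name used only for illustration.

```python
from statistics import mean

# Scores as listed on this page, on the 0-1 scale (51.8% and 35.7%).
scores = {
    "Kimi K2 Instruct": 0.518,
    "MiMo-V2.5-Pro": 0.357,
}


def summarize(table):
    """Return (leader name, leader score, unweighted mean) for a score table."""
    leader = max(table, key=table.get)
    return leader, table[leader], mean(table.values())


leader, top, average = summarize(scores)
print(f"leader: {leader}")
print(f"top score: {top}")
print(f"average: {average}")
```

Under these assumptions the mean is about 0.4375, which is consistent with the 0.438 average reported in the FAQ.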

Source paper

Title
Agentless: Demystifying LLM-based Software Engineering Agents
Authors
Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, Lingming Zhang
Published
Abstract

Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents are equipped with the ability to use tools, run commands, observe feedback from the environment, and plan for future actions. However, the complexity of these agent-based approaches, together with the limited abilities of current LLMs, raises the following question: Do we really have to employ complex autonomous software agents? To attempt to answer this question, we build Agentless -- an agentless approach to automatically solve software development problems. Compared to the verbose and complex setup of agent-based approaches, Agentless employs a simplistic three-phase process of localization, repair, and patch validation, without letting the LLM decide future actions or operate with complex tools. Our results on the popular SWE-bench Lite benchmark show that surprisingly the simplistic Agentless is able to achieve both the highest performance (32.00%, 96 correct fixes) and low cost ($0.70) compared with all existing open-source software agents! Furthermore, we manually classified the problems in SWE-bench Lite and found problems with exact ground truth patch or insufficient/misleading issue descriptions. As such, we construct SWE-bench Lite-S by excluding such problematic issues to perform more rigorous evaluation and comparison. Our work highlights the current overlooked potential of a simple, interpretable technique in autonomous software development. We hope Agentless will help reset the baseline, starting point, and horizon for autonomous software agents, and inspire future work along this crucial direction.
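
The fixed three-phase flow described in the abstract (localization, repair, patch validation) can be illustrated with a toy pipeline. This is a sketch under stated assumptions, not the paper's implementation: every name here (`Patch`, `localize`, `repair`, `validate`, `agentless`, `propose`, `run_tests`) is hypothetical, word overlap stands in for LLM-based localization, and a hard-coded candidate list stands in for sampled LLM patches. The only point it demonstrates is the control flow: the phases run in a fixed order, and the model never chooses the next action or calls tools.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Repo = Dict[str, str]  # hypothetical representation: file path -> file contents


@dataclass
class Patch:
    path: str
    new_content: str


def localize(issue: str, repo: Repo, top_k: int = 1) -> List[str]:
    """Phase 1: rank files by word overlap with the issue (stand-in for an LLM)."""
    words = set(issue.lower().split())
    ranked = sorted(repo, key=lambda p: -len(words & set(repo[p].lower().split())))
    return ranked[:top_k]


def repair(issue: str, repo: Repo, paths: List[str],
           propose: Callable[[str, str], List[str]]) -> List[Patch]:
    """Phase 2: collect candidate rewrites for each located file."""
    return [Patch(p, content) for p in paths for content in propose(issue, repo[p])]


def validate(repo: Repo, patches: List[Patch],
             run_tests: Callable[[Repo], bool]) -> Optional[Patch]:
    """Phase 3: return the first candidate whose patched repo passes the tests."""
    for patch in patches:
        candidate = {**repo, patch.path: patch.new_content}
        if run_tests(candidate):
            return patch
    return None


def agentless(issue: str, repo: Repo,
              propose: Callable[[str, str], List[str]],
              run_tests: Callable[[Repo], bool]) -> Optional[Patch]:
    """Run the three phases in a fixed order, with no model-chosen actions."""
    paths = localize(issue, repo)
    return validate(repo, repair(issue, repo, paths, propose), run_tests)


# Toy usage: a two-file "repository" with a bug in calc.py.
repo = {
    "calc.py": "def add(a, b):\n    return a - b  # add is wrong\n",
    "util.py": "def noop():\n    pass\n",
}
issue = "add returns the wrong result"


def propose(issue: str, source: str) -> List[str]:
    # Hypothetical stand-in for sampled LLM patches: one no-op, one real fix.
    return [source, source.replace("a - b", "a + b")]


def run_tests(candidate: Repo) -> bool:
    # Execute the candidate file and check one regression case.
    scope: dict = {}
    exec(candidate["calc.py"], scope)
    return scope["add"](2, 3) == 5


fix = agentless(issue, repo, propose, run_tests)
print(fix.path if fix else "no passing patch")
```

In this toy run the unchanged candidate fails validation and the second candidate is selected, which mirrors the idea of filtering generated patches by tests rather than letting the model iterate on feedback.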

FAQ

Common questions about the SWE-bench Verified (Agentless) benchmark and leaderboard.

What is the SWE-bench Verified (Agentless) leaderboard?

The SWE-bench Verified (Agentless) leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, Kimi K2 Instruct by Moonshot AI leads with a score of 0.518. The average score across all models is 0.438.

What is the highest SWE-bench Verified (Agentless) score?

The highest SWE-bench Verified (Agentless) score is 0.518, achieved by Kimi K2 Instruct from Moonshot AI.

How many models are evaluated on SWE-bench Verified (Agentless)?

Two models have been evaluated on the SWE-bench Verified (Agentless) benchmark. Both results are self-reported; none is marked as verified.

Where can I find the SWE-bench Verified (Agentless) paper?

The source paper, "Agentless: Demystifying LLM-based Software Engineering Agents", is available at https://arxiv.org/abs/2407.01489. It describes the Agentless method (localization, repair, and patch validation) and reports results on SWE-bench Lite.

What categories does SWE-bench Verified (Agentless) cover?

SWE-bench Verified (Agentless) is categorized under reasoning and general. The benchmark evaluates text models.

What is the best open-source model on SWE-bench Verified (Agentless)?

Kimi K2 Instruct by Moonshot AI is the top-ranked open-source model on SWE-bench Verified (Agentless), with a score of 0.518 (rank #1).

How recent are the SWE-bench Verified (Agentless) leaderboard results?

The SWE-bench Verified (Agentless) leaderboard was last updated in July 2026 and currently includes 2 evaluated models.