
SWE-bench Verified (Agentless)

A human-validated subset of SWE-bench that evaluates language models' ability to resolve real-world GitHub issues using an agentless approach. The benchmark tests models on software engineering problems that require understanding a codebase and coordinating changes across multiple functions, classes, and files at once.
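Each instance pairs a GitHub issue with the repository state at the time it was filed; a model must produce a patch, which is then judged by running the repository's tests. As a rough illustration, the sketch below loads the verified subset from the Hugging Face Hub and inspects one instance. It assumes the `datasets` library and the public `princeton-nlp/SWE-bench_Verified` dataset; the field names follow the SWE-bench release schema and may differ slightly between versions.

```python
# Minimal sketch: load SWE-bench Verified and look at one task instance.
# Assumes the Hugging Face `datasets` library is installed and that the
# public `princeton-nlp/SWE-bench_Verified` dataset is available.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(f"{len(ds)} verified instances")

example = ds[0]
print(example["instance_id"])              # unique identifier for the task
print(example["repo"])                     # GitHub repository the issue comes from
print(example["base_commit"])              # commit to check out before applying a patch
print(example["problem_statement"][:300])  # the issue text given to the model
# FAIL_TO_PASS lists the tests a correct patch must turn from failing to passing.
print(example["FAIL_TO_PASS"])
```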

Paper: https://arxiv.org/abs/2407.01489
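The linked paper describes the agentless setup: rather than letting the model drive an open-ended agent loop, evaluation runs a fixed pipeline of fault localization, patch generation (repair), and patch validation against the repository's tests. The sketch below only illustrates that shape; the helpers it calls (localize_files, generate_patch, apply_patch, run_tests) are hypothetical placeholders, not functions from the SWE-bench or Agentless code.

```python
# Illustrative outline of an agentless pipeline (localize -> repair -> validate).
# localize_files, generate_patch, apply_patch, and run_tests are hypothetical
# stand-ins for the real implementation described in the paper.

def resolve_issue(instance, model):
    # 1. Localization: narrow the repository down to the files and functions
    #    most relevant to the issue text, without any agentic tool use.
    suspect_locations = localize_files(instance["repo"],
                                       instance["base_commit"],
                                       instance["problem_statement"],
                                       model)

    # 2. Repair: ask the model for candidate patches over those locations only.
    candidate_patches = generate_patch(suspect_locations,
                                       instance["problem_statement"],
                                       model,
                                       num_samples=4)

    # 3. Validation: apply each candidate and keep one whose FAIL_TO_PASS tests
    #    now pass while the PASS_TO_PASS tests keep passing.
    for patch in candidate_patches:
        workdir = apply_patch(instance["repo"], instance["base_commit"], patch)
        if run_tests(workdir, instance["FAIL_TO_PASS"]) and \
           run_tests(workdir, instance["PASS_TO_PASS"]):
            return patch
    return None  # unresolved instance
```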

Progress Over Time

[Timeline chart omitted: it shows model performance evolution on SWE-bench Verified (Agentless) over time, marking the state-of-the-art frontier and distinguishing open from proprietary models.]

SWE-bench Verified (Agentless) Leaderboard

1 model • 0 verified

Rank  Organization   Model              Parameters  Score
1     Moonshot AI    Kimi K2 Instruct   1.0T        0.518 (self-reported)

FAQ

Common questions about SWE-bench Verified (Agentless)

What is SWE-bench Verified (Agentless)?
A human-validated subset of SWE-bench that evaluates language models' ability to resolve real-world GitHub issues using an agentless approach. The benchmark tests models on software engineering problems that require understanding a codebase and coordinating changes across multiple functions, classes, and files at once.

Where can I find the SWE-bench Verified (Agentless) paper?
The SWE-bench Verified (Agentless) paper is available at https://arxiv.org/abs/2407.01489. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How do models rank on the SWE-bench Verified (Agentless) leaderboard?
The leaderboard ranks 1 AI model based on its performance on this benchmark. Currently, Kimi K2 Instruct by Moonshot AI leads with a score of 0.518. The average score across all models is 0.518.

What is the highest SWE-bench Verified (Agentless) score?
The highest SWE-bench Verified (Agentless) score is 0.518, achieved by Kimi K2 Instruct from Moonshot AI.

How many models have been evaluated?
1 model has been evaluated on the SWE-bench Verified (Agentless) benchmark, with 0 verified results and 1 self-reported result.

How is SWE-bench Verified (Agentless) categorized?
SWE-bench Verified (Agentless) is categorized under general and reasoning. The benchmark evaluates text models.