
SWE-Lancer

SWE-Lancer is a benchmark for evaluating large language models on real-world freelance software engineering tasks sourced from Upwork. It contains over 1,400 tasks with a total value of $1 million USD, ranging from $50 bug fixes to $32,000 feature implementations. The benchmark includes both independent engineering tasks, graded via end-to-end tests, and managerial tasks, assessed against the original engineering managers' choices.
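The grading scheme described above can be sketched in a few lines: credit the dollar value of each independent task whose end-to-end tests pass. This is a minimal illustration, assuming a simple results schema; the field names (`payout_usd`, `passed`) are hypothetical, not the benchmark's actual data format.

```python
# Hypothetical per-task results; payouts mirror the range described above
# ($50 bug fixes up to $32,000 feature implementations).
tasks = [
    {"payout_usd": 50, "passed": True},       # small bug fix
    {"payout_usd": 1_000, "passed": False},   # mid-size task
    {"payout_usd": 32_000, "passed": False},  # large feature implementation
]

def earned_dollars(results):
    """Sum the payouts of tasks whose end-to-end tests passed."""
    return sum(t["payout_usd"] for t in results if t["passed"])

def pass_rate(results):
    """Fraction of tasks solved."""
    return sum(t["passed"] for t in results) / len(results)

print(earned_dollars(tasks))  # 50
print(pass_rate(tasks))
```

Because payouts vary by three orders of magnitude, dollars earned and raw pass rate can diverge sharply: solving many cheap tasks moves the pass rate but barely moves earnings.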

Paper

The SWE-Lancer paper is available at https://arxiv.org/abs/2502.12115.

Progress Over Time

[Interactive timeline showing model performance evolution on SWE-Lancer, plotting the state-of-the-art frontier and distinguishing open from proprietary models.]
SWE-Lancer Leaderboard

4 models evaluated.

| Rank | Model | Organization | Context | Cost (input / output per 1M tokens) | License |
|------|-------|--------------|---------|-------------------------------------|---------|
| 1 | GPT-5.1 Codex | OpenAI | 400K | $1.25 / $10.00 | |
| 2 | (not captured) | OpenAI | 128K | $75.00 / $150.00 | |
| 3 | (not captured) | OpenAI | 128K | $2.50 / $10.00 | |
| 4 | (not captured) | OpenAI | 200K | $1.10 / $4.40 | |
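The Cost column lists prices per 1M input and output tokens. A small helper can estimate the API cost of a single run from those prices; this is an illustrative sketch, not part of the benchmark harness, and the token counts below are made up.

```python
def run_cost_usd(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Estimate API cost in USD from per-1M-token prices, as in the Cost column."""
    return (input_tokens / 1e6) * price_in_per_m + (output_tokens / 1e6) * price_out_per_m

# e.g. a 200K-token prompt with a 20K-token response at $1.25 / $10.00 per 1M tokens:
# 0.2 * 1.25 + 0.02 * 10.00 = 0.45
cost = run_cost_usd(200_000, 20_000, 1.25, 10.00)
print(f"${cost:.2f}")  # $0.45
```

Note how output pricing dominates for models like the rank-2 entry ($150.00 per 1M output tokens), so long agentic transcripts can cost far more than the prompt size alone suggests.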

FAQ

Common questions about SWE-Lancer

What is SWE-Lancer?
SWE-Lancer is a benchmark of over 1,400 real-world freelance software engineering tasks from Upwork, valued at $1 million USD in total and ranging from $50 bug fixes to $32,000 feature implementations. It includes independent engineering tasks graded via end-to-end tests and managerial tasks assessed against the original engineering managers' choices.

Where can I read the SWE-Lancer paper?
The SWE-Lancer paper is available at https://arxiv.org/abs/2502.12115. It details the benchmark's methodology, dataset creation, and evaluation criteria.

Which model leads the SWE-Lancer leaderboard?
The leaderboard ranks 4 AI models. GPT-5.1 Codex by OpenAI currently leads with a score of 0.663; the average score across all models is 0.386.

What is the highest SWE-Lancer score?
The highest score is 0.663, achieved by GPT-5.1 Codex from OpenAI.

How many models have been evaluated?
4 models have been evaluated on SWE-Lancer, with 0 verified results and 4 self-reported results.

What does SWE-Lancer evaluate?
SWE-Lancer is categorized under code and reasoning, and it evaluates text models.