SWE-Lancer
A benchmark for evaluating large language models on real-world freelance software engineering tasks from Upwork. It contains over 1,400 tasks with a combined value of $1 million USD, ranging from $50 bug fixes to $32,000 feature implementations, and includes both independent engineering tasks graded via end-to-end tests and managerial tasks assessed against the original engineering managers' choices.
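Concretely, the pass/fail payout scheme can be pictured with a sketch like the one below. This is illustrative only, not the official SWE-Lancer harness: the `Task` record, per-task test command, and grading helpers are all assumptions.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    payout_usd: float        # e.g. 50.0 for a small bug fix, 32000.0 for a large feature
    test_command: list[str]  # command that runs this task's end-to-end test suite

def grade_independent_task(task: Task, repo_dir: str) -> float:
    """Award the task's full payout only if the end-to-end tests pass
    against the model's patched repository; otherwise award nothing."""
    result = subprocess.run(task.test_command, cwd=repo_dir)
    return task.payout_usd if result.returncode == 0 else 0.0

def grade_managerial_task(model_choice: str, manager_choice: str) -> float:
    """Managerial tasks: the model selects a freelancer proposal and is
    graded against the choice the original engineering manager made."""
    return 1.0 if model_choice == manager_choice else 0.0
```

Under this framing, a model's headline result is simply the sum of payouts it earns across attempted tasks, which is why task values rather than task counts dominate the metric.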
Progress Over Time
[Interactive timeline showing model performance on SWE-Lancer over time, with a state-of-the-art frontier and markers distinguishing open from proprietary models.]
SWE-Lancer Leaderboard
4 models
| Rank | Model | Organization | Score | Context | Cost ($/1M tokens, input / output) | License |
|---|---|---|---|---|---|---|
| 1 | GPT-5.1 Codex | OpenAI | 0.663 | 400K | $1.25 / $10.00 | — |
| 2 | — | OpenAI | — | 128K | $75.00 / $150.00 | — |
| 3 | — | OpenAI | — | 128K | $2.50 / $10.00 | — |
| 4 | — | OpenAI | — | 200K | $1.10 / $4.40 | — |
FAQ
Common questions about SWE-Lancer
What is SWE-Lancer?
SWE-Lancer is a benchmark of over 1,400 real-world freelance software engineering tasks sourced from Upwork, with a combined value of $1 million USD. Tasks range from $50 bug fixes to $32,000 feature implementations and include both independent engineering tasks graded via end-to-end tests and managerial tasks assessed against the original engineering managers' choices.
Where can I find the SWE-Lancer paper?
The SWE-Lancer paper is available at https://arxiv.org/abs/2502.12115. It details the benchmark's methodology, dataset construction, and evaluation criteria.
How do models rank on SWE-Lancer?
The SWE-Lancer leaderboard currently ranks 4 AI models. GPT-5.1 Codex by OpenAI leads with a score of 0.663, and the average score across all models is 0.386.
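Since only the top score and the mean are published, a quick back-of-the-envelope check (a minimal sketch; the individual scores of the other three models are not reported here) recovers the combined score of the remaining models:

```python
# Reported SWE-Lancer leaderboard statistics.
n_models = 4
top_score = 0.663   # GPT-5.1 Codex (OpenAI)
mean_score = 0.386  # average over all four models

# The mean implies a total of n * mean. Subtracting the top score
# gives the combined score of the three unnamed models.
others_combined = n_models * mean_score - top_score
print(f"{others_combined:.3f}")  # 0.881
```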
What is the highest score on SWE-Lancer?
The highest SWE-Lancer score is 0.663, achieved by GPT-5.1 Codex from OpenAI.
How many models have been evaluated on SWE-Lancer?
4 models have been evaluated on the SWE-Lancer benchmark, with 0 verified results and 4 self-reported results.
What categories does SWE-Lancer cover?
SWE-Lancer is categorized under code and reasoning, and it evaluates text models.