
SWE-Lancer (IC-Diamond subset)

SWE-Lancer (IC-Diamond subset) is a benchmark of real-world freelance software engineering tasks from Upwork, ranging from $50 bug fixes to $32,000 feature implementations. It evaluates AI models on independent engineering tasks using end-to-end tests triple-verified by experienced software engineers, and includes managerial tasks where models choose between technical implementation proposals.

Paper: https://arxiv.org/abs/2502.12115

Progress Over Time

[Interactive timeline showing model performance evolution on SWE-Lancer (IC-Diamond subset); models are split into open and proprietary, with the state-of-the-art frontier highlighted.]

SWE-Lancer (IC-Diamond subset) Leaderboard

6 models evaluated (all results self-reported). Costs are listed as input / output prices per 1M tokens. Entries that were not captured are shown as "—".

#   Model   Organization   Context   Cost (in / out per 1M tokens)   License
1   GPT-5   OpenAI         400K      $1.25 / $10.00                  —
2   —       —              400K      $1.75 / $14.00                  —
3   —       OpenAI         400K      $1.75 / $14.00                  —
4   —       OpenAI         128K      $75.00 / $150.00                —
5   —       OpenAI         128K      $2.50 / $10.00                  —
6   —       OpenAI         200K      $1.10 / $4.40                   —
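The per-1M-token prices in the leaderboard can be turned into a dollar cost for a single model call. A minimal sketch, using GPT-5's listed rates ($1.25 input / $10.00 output per 1M tokens) with hypothetical token counts:

```python
# Estimate the dollar cost of one model call from per-1M-token prices.
# Rates come from the leaderboard above; token counts are hypothetical.

def call_cost(input_tokens: int, output_tokens: int,
              price_in: float, price_out: float) -> float:
    """Cost in USD, given prices in dollars per 1M tokens."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Example: a task with 50K prompt tokens and 5K completion tokens on GPT-5.
cost = call_cost(50_000, 5_000, price_in=1.25, price_out=10.00)
print(f"${cost:.4f}")  # $0.1125
```

Multiplying this per-call figure by the number of benchmark tasks and attempts gives a rough budget for reproducing a run.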

FAQ

Common questions about SWE-Lancer (IC-Diamond subset)

Q: What is SWE-Lancer (IC-Diamond subset)?
A: SWE-Lancer (IC-Diamond subset) is a benchmark of real-world freelance software engineering tasks from Upwork, ranging from $50 bug fixes to $32,000 feature implementations. It evaluates AI models on independent engineering tasks using end-to-end tests triple-verified by experienced software engineers, and includes managerial tasks where models choose between technical implementation proposals.

Q: Where can I read the paper?
A: The SWE-Lancer (IC-Diamond subset) paper is available at https://arxiv.org/abs/2502.12115. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

Q: Which model leads the leaderboard?
A: The SWE-Lancer (IC-Diamond subset) leaderboard ranks 6 AI models by their performance on this benchmark. Currently, GPT-5 by OpenAI leads with a score of 1.000. The average score across all models is 0.489.

Q: What is the highest score?
A: The highest SWE-Lancer (IC-Diamond subset) score is 1.000, achieved by GPT-5 from OpenAI.

Q: How many models have been evaluated?
A: 6 models have been evaluated on the SWE-Lancer (IC-Diamond subset) benchmark, with 0 verified results and 6 self-reported results.

Q: What categories does this benchmark cover?
A: SWE-Lancer (IC-Diamond subset) is categorized under code and reasoning. The benchmark evaluates text models.