SWE-Lancer (IC-Diamond subset)
SWE-Lancer (IC-Diamond subset) is a benchmark of real-world freelance software engineering tasks sourced from Upwork, ranging from $50 bug fixes to $32,000 feature implementations. It evaluates AI models on individual contributor (IC) engineering tasks using end-to-end tests that were triple-verified by experienced software engineers; the full SWE-Lancer benchmark additionally includes managerial tasks in which models choose between competing technical implementation proposals.
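As a rough sketch of how results on a benchmark like this can be aggregated (this is not the official SWE-Lancer harness; the task IDs, payouts, and outcomes below are hypothetical), each task carries the dollar value of the original Upwork job and is scored pass/fail by its end-to-end tests, so results can be rolled up into both a pass rate and a total payout earned:

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str       # hypothetical identifier
    payout_usd: float  # value of the original Upwork task
    passed: bool       # did the model's patch pass all end-to-end tests?

def aggregate(results: list[TaskResult]) -> dict[str, float]:
    """Roll per-task pass/fail outcomes into a pass rate and payout totals."""
    passed = [r for r in results if r.passed]
    return {
        "pass_rate": len(passed) / len(results) if results else 0.0,
        "dollars_earned": sum(r.payout_usd for r in passed),
        "dollars_available": sum(r.payout_usd for r in results),
    }

# Hypothetical example: three tasks, from a small bug fix to a large feature.
results = [
    TaskResult("bugfix-001", 50.0, True),
    TaskResult("feature-002", 1_000.0, False),
    TaskResult("feature-003", 32_000.0, True),
]
print(aggregate(results))
# {'pass_rate': 0.666..., 'dollars_earned': 32050.0, 'dollars_available': 33050.0}
```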
Progress Over Time
Interactive timeline showing model performance evolution on SWE-Lancer (IC-Diamond subset), with a state-of-the-art frontier line and separate markers for open and proprietary models.
SWE-Lancer (IC-Diamond subset) Leaderboard
The leaderboard currently lists 6 models.
FAQ
Common questions about SWE-Lancer (IC-Diamond subset)
What is SWE-Lancer (IC-Diamond subset)?
SWE-Lancer (IC-Diamond subset) is a benchmark of real-world freelance software engineering tasks sourced from Upwork, ranging from $50 bug fixes to $32,000 feature implementations. It evaluates AI models on individual contributor (IC) engineering tasks using end-to-end tests that were triple-verified by experienced software engineers; the full SWE-Lancer benchmark additionally includes managerial tasks in which models choose between competing technical implementation proposals.
Where can I learn more about the benchmark?
The SWE-Lancer paper, which introduces the benchmark and its Diamond subset, is available at https://arxiv.org/abs/2502.12115. It describes the benchmark methodology, dataset construction, and evaluation criteria in detail.
How do models rank on the leaderboard?
The SWE-Lancer (IC-Diamond subset) leaderboard ranks 6 AI models by their performance on the benchmark. GPT-5 by OpenAI currently leads with a score of 1.000, and the average score across all models is 0.489.
What is the highest score?
The highest SWE-Lancer (IC-Diamond subset) score is 1.000, achieved by GPT-5 from OpenAI.
How many models have been evaluated?
6 models have been evaluated on the SWE-Lancer (IC-Diamond subset) benchmark; all 6 results are self-reported, and none have been independently verified.
How is the benchmark categorized?
SWE-Lancer (IC-Diamond subset) falls under the code and reasoning categories, and the benchmark evaluates text models.