OSWorld

OSWorld is a first-of-its-kind scalable, real computer environment for multimodal agents. It supports task setup, execution-based evaluation, and interactive learning across Ubuntu, Windows, and macOS, and comprises 369 computer tasks involving real web and desktop applications, OS file I/O, and multi-application workflows.

Progress Over Time

[Interactive timeline of model performance on OSWorld over time. Legend: state-of-the-art frontier, open models, proprietary models.]

OSWorld Leaderboard

17 models • 0 verified
| Rank | Organization | Score | Parameters | Context | Cost |
|------|--------------|-------|------------|---------|------|
| 1 | - | 0.727 | - | 200K | $5.00 / $25.00 |
| 2 | - | 0.725 | - | 200K | $3.00 / $15.00 |
| 3 | Alibaba Cloud / Qwen Team | 0.667 | 236B | 262K | $0.30 / $1.50 |
| 4 | - | 0.663 | - | 200K | $5.00 / $25.00 |
| 5 | - | 0.614 | - | 200K | $3.00 / $15.00 |
| 6 | - | 0.507 | - | 200K | $1.00 / $5.00 |
| 7 | Alibaba Cloud / Qwen Team | 0.410 | 33B | - | - |
| 8 | Alibaba Cloud / Qwen Team | 0.381 | 236B | 262K | $0.45 / $3.49 |
| 9 | Alibaba Cloud / Qwen Team | 0.339 | 9B | 262K | $0.08 / $0.50 |
| 9 | Alibaba Cloud / Qwen Team | 0.339 | 9B | 262K | $0.18 / $2.09 |
| 11 | Alibaba Cloud / Qwen Team | 0.326 | 33B | - | - |
| 12 | Alibaba Cloud / Qwen Team | 0.314 | 4B | 262K | $0.10 / $1.00 |
| 13 | Alibaba Cloud / Qwen Team | 0.306 | 31B | 262K | $0.20 / $1.00 |
| 14 | Alibaba Cloud / Qwen Team | 0.303 | 31B | 262K | $0.20 / $0.70 |
| 15 | Alibaba Cloud / Qwen Team | 0.262 | 4B | 262K | $0.10 / $0.60 |
| 16 | Alibaba Cloud / Qwen Team | 0.088 | 72B | - | - |
| 17 | Alibaba Cloud / Qwen Team | 0.059 | 34B | - | - |

FAQ

Common questions about OSWorld

The OSWorld paper is available at https://arxiv.org/abs/2404.07972. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The OSWorld leaderboard ranks 17 AI models based on their performance on this benchmark. Currently, Claude Opus 4.6 by Anthropic leads with a score of 0.727. The average score across all models is 0.414.
The highest OSWorld score is 0.727, achieved by Claude Opus 4.6 from Anthropic.
17 models have been evaluated on the OSWorld benchmark, with 0 verified results and 17 self-reported results.
OSWorld is categorized under agents, general, multimodal, and vision. The benchmark evaluates multimodal models.
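
The headline figures quoted in this FAQ (17 models, a top score of 0.727, an average of 0.414) can be checked against the leaderboard entries. The following is a minimal sketch in Python. It assumes that each leaderboard score is the three-decimal value at the start of its row, and the variable names are illustrative rather than part of any OSWorld tooling.

```python
from statistics import mean

# Scores copied from the leaderboard in rank order (17 entries).
# Assumption: every score is reported to three decimal places.
scores = [
    0.727, 0.725, 0.667, 0.663, 0.614, 0.507, 0.410, 0.381, 0.339,
    0.339, 0.326, 0.314, 0.306, 0.303, 0.262, 0.088, 0.059,
]

best = max(scores)                 # highest score on the leaderboard
average = round(mean(scores), 3)   # arithmetic mean, rounded like the page

print(f"models: {len(scores)}, best: {best:.3f}, average: {average:.3f}")
# prints "models: 17, best: 0.727, average: 0.414"
```

The unweighted mean reproduces the reported average, which suggests the page averages all 17 self-reported entries equally, including the two tied at rank 9.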

Sub-benchmarks