OSWorld
OSWorld is the first-of-its-kind scalable, real computer environment for multimodal agents. It supports task setup, execution-based evaluation, and interactive learning across Ubuntu, Windows, and macOS, and comprises 369 computer tasks involving real web and desktop applications, OS file I/O, and multi-application workflows.
[Figure: Progress Over Time. Interactive timeline of model performance on OSWorld, showing the state-of-the-art frontier for open and proprietary models.]
OSWorld Leaderboard
17 models • 0 verified
| Rank | Organization | Score | Parameters | Context | Cost | License |
|---|---|---|---|---|---|---|
| 1 | Anthropic | 0.727 | — | 200K | $5.00 / $25.00 | |
| 2 | Anthropic | 0.725 | — | 200K | $3.00 / $15.00 | |
| 3 | Alibaba Cloud / Qwen Team | 0.667 | 236B | 262K | $0.30 / $1.50 | |
| 4 | Anthropic | 0.663 | — | 200K | $5.00 / $25.00 | |
| 5 | Anthropic | 0.614 | — | 200K | $3.00 / $15.00 | |
| 6 | Anthropic | 0.507 | — | 200K | $1.00 / $5.00 | |
| 7 | Alibaba Cloud / Qwen Team | 0.410 | 33B | — | — | |
| 8 | Alibaba Cloud / Qwen Team | 0.381 | 236B | 262K | $0.45 / $3.49 | |
| 9 | Alibaba Cloud / Qwen Team | 0.339 | 9B | 262K | $0.08 / $0.50 | |
| 9 | Alibaba Cloud / Qwen Team | 0.339 | 9B | 262K | $0.18 / $2.09 | |
| 11 | Alibaba Cloud / Qwen Team | 0.326 | 33B | — | — | |
| 12 | Alibaba Cloud / Qwen Team | 0.314 | 4B | 262K | $0.10 / $1.00 | |
| 13 | Alibaba Cloud / Qwen Team | 0.306 | 31B | 262K | $0.20 / $1.00 | |
| 14 | Alibaba Cloud / Qwen Team | 0.303 | 31B | 262K | $0.20 / $0.70 | |
| 15 | Alibaba Cloud / Qwen Team | 0.262 | 4B | 262K | $0.10 / $0.60 | |
| 16 | Alibaba Cloud / Qwen Team | 0.088 | 72B | — | — | |
| 17 | Alibaba Cloud / Qwen Team | 0.059 | 34B | — | — | |
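The table shows two entries tied at 0.339 that both carry rank 9, followed by rank 11. That pattern is consistent with standard competition ranking, in which tied entries share a rank and the next rank is skipped. The leaderboard does not document how it assigns ranks, so the following is only a minimal sketch under that assumption; the function name `competition_ranks` is hypothetical, and the scores are copied from the table.

```python
# Scores copied from the leaderboard table, in the order listed.
scores = [0.727, 0.725, 0.667, 0.663, 0.614, 0.507, 0.410, 0.381, 0.339,
          0.339, 0.326, 0.314, 0.306, 0.303, 0.262, 0.088, 0.059]


def competition_ranks(values):
    """Rank values from highest to lowest.

    Tied values share a rank, and the rank after a tie skips ahead
    (1, 2, 2, 4 style). This is an assumed scheme, not the site's
    documented method.
    """
    ordered = sorted(values, reverse=True)
    ranks = []
    for position, value in enumerate(ordered, start=1):
        # ordered[position - 2] is the previous entry in the sorted list.
        if ranks and value == ordered[position - 2]:
            ranks.append(ranks[-1])   # tie: reuse the previous rank
        else:
            ranks.append(position)    # no tie: rank equals position
    return ranks


ranks = competition_ranks(scores)
print(ranks)
```

Under this scheme the two 0.339 entries both receive rank 9 and the 0.326 entry receives rank 11, matching the rank column above.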
FAQ
Common questions about OSWorld
The OSWorld paper is available at https://arxiv.org/abs/2404.07972. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The OSWorld leaderboard ranks 17 AI models by their score on this benchmark. Claude Opus 4.6 from Anthropic currently leads with the highest score, 0.727, and the average score across all 17 models is 0.414. None of the 17 results is verified; all are self-reported.
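The summary figures quoted here can be rechecked from the leaderboard scores. The sketch below assumes the average is a plain arithmetic mean rounded to three decimals, which the page does not state explicitly; the score list is copied from the table.

```python
from statistics import mean

# Scores copied from the leaderboard table (17 entries).
scores = [0.727, 0.725, 0.667, 0.663, 0.614, 0.507, 0.410, 0.381, 0.339,
          0.339, 0.326, 0.314, 0.306, 0.303, 0.262, 0.088, 0.059]

model_count = len(scores)          # number of evaluated models
best = max(scores)                 # highest reported score
average = round(mean(scores), 3)   # unweighted mean, three decimals (assumed)

print(model_count, best, average)
```

The scores sum to 7.030, and 7.030 / 17 ≈ 0.4135, which rounds to the reported 0.414.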
OSWorld is categorized under agents, general, multimodal, and vision. The benchmark evaluates multimodal models.
Sub-benchmarks
OSWorld-G
OSWorld-G (Grounding) evaluates screenshot grounding accuracy for OS automation tasks.
image • Max 100
OSWorld-Verified
OSWorld-Verified is a verified subset of OSWorld, a scalable real computer environment for multimodal agents supporting task setup, execution-based evaluation, and interactive learning across Ubuntu, Windows, and macOS.
multimodal • Max 1