OSWorld

OSWorld is a first-of-its-kind scalable, real computer environment for multimodal agents. It supports task setup, execution-based evaluation, and interactive learning across Ubuntu, Windows, and macOS, and comprises 369 computer tasks involving real web and desktop applications, OS file I/O, and multi-application workflows.
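To make "task setup" and "execution-based evaluation" concrete, here is a minimal Python sketch of the general pattern: prepare an initial state, let an agent act, then score the resulting state rather than the agent's own report. This is not OSWorld's API. The `Task` record, `run_task`, and both agents are hypothetical names invented for illustration, and a temporary directory stands in for a real desktop environment.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Task:
    """Hypothetical task record (not OSWorld's actual schema)."""
    instruction: str
    setup: Callable[[Path], None]   # prepares the initial state
    check: Callable[[Path], bool]   # inspects the final state


def run_task(task: Task, agent: Callable[[str, Path], None]) -> float:
    """Set up a fresh workspace, let the agent act, then score the end state."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        task.setup(workdir)
        agent(task.instruction, workdir)
        return 1.0 if task.check(workdir) else 0.0


def setup_notes(workdir: Path) -> None:
    (workdir / "notes.txt").write_text("draft")


def check_renamed(workdir: Path) -> bool:
    # Success is judged from the files that exist afterwards,
    # not from anything the agent claims to have done.
    return (workdir / "final.txt").exists() and not (workdir / "notes.txt").exists()


rename_task = Task("Rename notes.txt to final.txt", setup_notes, check_renamed)


def scripted_agent(instruction: str, workdir: Path) -> None:
    # Stand-in for a real multimodal agent.
    (workdir / "notes.txt").rename(workdir / "final.txt")


def idle_agent(instruction: str, workdir: Path) -> None:
    # An agent that does nothing, so the check should fail.
    pass


print(run_task(rename_task, scripted_agent))  # 1.0
print(run_task(rename_task, idle_agent))      # 0.0
```

Keeping the checker separate from the agent means any agent can be scored on the same task without changes to the task definition.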

Progress Over Time

[Interactive timeline: model performance on OSWorld over time. Legend: state-of-the-art frontier, open models, proprietary models.]

OSWorld Leaderboard

18 models

Model names and scores are not present in this listing; "-" marks a value that is not given.

Rank  Organization               Params  Context  Cost             License
1     -                          -       1.0M     $5.00 / $25.00   -
2     -                          -       200K     $3.00 / $15.00   -
3     Alibaba Cloud / Qwen Team  236B    262K     $0.30 / $1.49    -
4     -                          -       200K     $5.00 / $25.00   -
5     Zhipu AI                   -       -        -                -
6     -                          -       200K     $3.00 / $15.00   -
7     -                          -       200K     $1.00 / $5.00    -
8     Alibaba Cloud / Qwen Team  33B     -        -                -
9     Alibaba Cloud / Qwen Team  236B    262K     $0.45 / $3.49    -
10    Alibaba Cloud / Qwen Team  9B      262K     $0.08 / $0.50    -
10    Alibaba Cloud / Qwen Team  9B      262K     $0.18 / $2.09    -
12    Alibaba Cloud / Qwen Team  33B     -        -                -
13    Alibaba Cloud / Qwen Team  4B      262K     $0.10 / $1.00    -
14    Alibaba Cloud / Qwen Team  31B     262K     $0.20 / $1.00    -
15    Alibaba Cloud / Qwen Team  31B     262K     $0.20 / $0.70    -
16    Alibaba Cloud / Qwen Team  4B      262K     $0.10 / $0.60    -
17    Alibaba Cloud / Qwen Team  72B     -        -                -
18    Alibaba Cloud / Qwen Team  34B     -        -                -
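
The leaderboard lists rank 10 twice and then continues at 12. That pattern is consistent with standard competition ranking, where tied entries share a rank and the next rank is skipped, although the page does not state its tie rule. The sketch below illustrates that scheme in Python; the function name and the scores are made up for the example and are not leaderboard data.

```python
def competition_ranks(scores):
    """Return 1-based ranks (highest score first); tied scores share a rank."""
    ordered = sorted(scores, reverse=True)
    # index() finds the first position of a score in the sorted list,
    # so every tied entry receives the same (best) rank.
    return [ordered.index(s) + 1 for s in scores]


# Hypothetical scores: two entries tie for second place.
example_scores = [0.50, 0.40, 0.40, 0.30]
print(competition_ranks(example_scores))  # [1, 2, 2, 4]
```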

FAQ

Common questions about OSWorld

The OSWorld paper is available at https://arxiv.org/abs/2404.07972. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The OSWorld leaderboard ranks 18 AI models by their performance on this benchmark. Claude Opus 4.6 by Anthropic currently leads with the highest score, 0.727, and the average score across all models is 0.425.
All 18 results are self-reported; none are verified.
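As an illustration of how summary figures like these can be derived from a list of results, here is a small Python sketch. It assumes the average is an unweighted mean and that every result is either verified or self-reported; neither is stated on the page. The entries, field names, and scores are hypothetical placeholders, not the actual leaderboard data.

```python
from statistics import mean

# Hypothetical entries; the real per-model scores are not listed here.
entries = [
    {"model": "model-a", "score": 0.70, "verified": False},
    {"model": "model-b", "score": 0.50, "verified": False},
    {"model": "model-c", "score": 0.30, "verified": False},
]


def summarize(entries):
    """Compute the leader, the mean score, and verified/self-reported counts."""
    leader = max(entries, key=lambda e: e["score"])
    verified = sum(1 for e in entries if e["verified"])
    return {
        "models": len(entries),
        "leader": leader["model"],
        "top_score": leader["score"],
        "average": round(mean(e["score"] for e in entries), 3),
        "verified": verified,
        # Assumption: anything not verified counts as self-reported.
        "self_reported": len(entries) - verified,
    }


summary = summarize(entries)
print(summary)
```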
OSWorld is categorized under agents, general, multimodal, and vision. The benchmark evaluates multimodal models.

Sub-benchmarks