
OSWorld-Verified

OSWorld-Verified is a verified subset of OSWorld, a scalable real computer environment for multimodal agents supporting task setup, execution-based evaluation, and interactive learning across Ubuntu, Windows, and macOS.
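The phrase "execution-based evaluation" means a task is scored by inspecting the resulting state of the machine after the agent acts, not by grading the agent's text output. A minimal sketch of that loop is below; every name here (`DesktopTask`, `FakeDesktopEnv`, the action format) is illustrative and is not OSWorld's actual API.

```python
# Sketch of an execution-based evaluation loop in the style OSWorld describes:
# set up a task, let the agent act on the environment, then score the final state.
# All class and function names are hypothetical, not OSWorld's real interface.
from dataclasses import dataclass


@dataclass
class DesktopTask:
    instruction: str
    setup: dict  # initial state of the simulated desktop

    def evaluate(self, state: dict) -> float:
        # Execution-based check: score the *resulting* machine state.
        # Here the task is solved iff the target file exists.
        return 1.0 if "report.txt" in state.get("files", {}) else 0.0


class FakeDesktopEnv:
    """Stand-in environment; a real one drives an actual OS via screenshots and actions."""

    def __init__(self, task: DesktopTask):
        self.state = dict(task.setup)

    def step(self, action: dict) -> dict:
        if action["type"] == "create_file":
            self.state.setdefault("files", {})[action["name"]] = ""
        return self.state  # the next observation


def run_episode(task: DesktopTask, policy) -> float:
    env = FakeDesktopEnv(task)
    obs = env.state
    for _ in range(10):  # step budget per episode
        action = policy(task.instruction, obs)
        if action is None:  # agent declares it is done
            break
        obs = env.step(action)
    return task.evaluate(env.state)


# A trivial scripted "agent" that solves the task in one step.
def scripted_policy(instruction, obs):
    if "report.txt" not in obs.get("files", {}):
        return {"type": "create_file", "name": "report.txt"}
    return None


task = DesktopTask(instruction="Create report.txt on the desktop", setup={"files": {}})
score = run_episode(task, scripted_policy)
print(score)  # 1.0
```

The key design point this illustrates is that the checker (`evaluate`) is independent of the agent: any policy, scripted or model-driven, is graded the same way, which is what makes leaderboard scores comparable.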

Paper: https://arxiv.org/abs/2404.07972

Progress Over Time

[Interactive timeline showing model performance evolution on OSWorld-Verified; series shown: state-of-the-art frontier, open models, and proprietary models]

OSWorld-Verified Leaderboard

12 models

Rank  Organization               Params  Context  Cost (input / output)
1     Anthropic                  —       —        $25.00 / $125.00
2     OpenAI                     —       1.1M     $5.00 / $30.00
3     —                          —       1.0M     $5.00 / $25.00
4     OpenAI                     —       1.0M     $2.50 / $15.00
5     Moonshot AI                1.0T    262K     $0.95 / $4.00
6     —                          —       400K     $0.75 / $4.50
7     —                          —       400K     $1.75 / $14.00
8     Alibaba Cloud / Qwen Team  —       1.0M     $0.50 / $3.00
9     Alibaba Cloud / Qwen Team  122B    262K     $0.40 / $3.20
10    Alibaba Cloud / Qwen Team  27B     262K     $0.30 / $2.40
11    Alibaba Cloud / Qwen Team  35B     262K     $0.25 / $2.00
12    —                          —       400K     $0.20 / $1.25

FAQ

Common questions about OSWorld-Verified

What is OSWorld-Verified?
OSWorld-Verified is a verified subset of OSWorld, a scalable real computer environment for multimodal agents supporting task setup, execution-based evaluation, and interactive learning across Ubuntu, Windows, and macOS.

Where can I find the OSWorld-Verified paper?
The OSWorld-Verified paper is available at https://arxiv.org/abs/2404.07972. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How are models ranked on OSWorld-Verified?
The OSWorld-Verified leaderboard ranks 12 AI models based on their performance on this benchmark. Currently, Claude Mythos Preview by Anthropic leads with a score of 0.796. The average score across all models is 0.659.

What is the highest OSWorld-Verified score?
The highest OSWorld-Verified score is 0.796, achieved by Claude Mythos Preview from Anthropic.

How many models have been evaluated on OSWorld-Verified?
12 models have been evaluated on the OSWorld-Verified benchmark, with 0 verified results and 12 self-reported results.

What categories does OSWorld-Verified fall under?
OSWorld-Verified is categorized under vision, agents, general, and multimodal. The benchmark evaluates multimodal models.