
OSWorld-Verified

OSWorld-Verified is a verified subset of OSWorld, a scalable real computer environment for multimodal agents supporting task setup, execution-based evaluation, and interactive learning across Ubuntu, Windows, and macOS.

Paper: https://arxiv.org/abs/2404.07972

[Figure: Progress Over Time. Timeline of model performance on OSWorld-Verified, marking the state-of-the-art frontier and distinguishing open from proprietary models.]

OSWorld-Verified Leaderboard

5 models • 0 verified
Rank  Organization               Score  Parameters  Context  Cost
1     OpenAI                     0.750  -           1.0M     $2.50 / $15.00
2     -                          0.647  -           400K     $1.75 / $14.00
3     Alibaba Cloud / Qwen Team  0.580  122B        262K     $0.40 / $3.20
4     Alibaba Cloud / Qwen Team  0.562  27B         -        -
5     Alibaba Cloud / Qwen Team  0.545  35B         262K     $0.25 / $2.00

FAQ

Common questions about OSWorld-Verified

The OSWorld-Verified paper is available at https://arxiv.org/abs/2404.07972. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The OSWorld-Verified leaderboard ranks 5 AI models based on their performance on this benchmark. Currently, GPT-5.4 by OpenAI leads with a score of 0.750. The average score across all models is 0.617.
The highest OSWorld-Verified score is 0.750, achieved by GPT-5.4 from OpenAI.
5 models have been evaluated on the OSWorld-Verified benchmark, with 0 verified results and 5 self-reported results.
OSWorld-Verified is categorized under agents, general, multimodal, and vision. The benchmark evaluates multimodal models.
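
The summary figures quoted in the FAQ (top score 0.750, average 0.617) can be checked against the five listed scores. The sketch below is only an illustration of that arithmetic: the source names just the leading model (GPT-5.4), so the labels "rank-2" through "rank-5" are hypothetical placeholders, and the function name `summarize` is an assumption rather than anything the leaderboard itself provides.

```python
from statistics import mean

# Scores as listed on the leaderboard, in rank order.
# Only "GPT-5.4" is named in the source; the other keys are
# hypothetical placeholders for the unnamed entries.
scores = {
    "GPT-5.4": 0.750,
    "rank-2": 0.647,
    "rank-3": 0.580,
    "rank-4": 0.562,
    "rank-5": 0.545,
}


def summarize(scores):
    """Return (leader, top score, mean score rounded to 3 decimals)."""
    leader = max(scores, key=scores.get)
    return leader, scores[leader], round(mean(scores.values()), 3)


leader, top, average = summarize(scores)
print(f"{len(scores)} models; leader: {leader} ({top:.3f}); average: {average:.3f}")
```

The unrounded mean is 3.084 / 5 = 0.6168, which rounds to the 0.617 reported above.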