OSWorld-Verified
OSWorld-Verified is a verified subset of OSWorld, a scalable real computer environment for multimodal agents supporting task setup, execution-based evaluation, and interactive learning across Ubuntu, Windows, and macOS.
OSWorld-Verified Leaderboard
10 models
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | Anthropic | — | — | $25.00 / $125.00 | |
| 2 | Anthropic | — | 1.0M | $5.00 / $25.00 | |
| 3 | OpenAI | — | 1.0M | $2.50 / $15.00 | |
| 4 | OpenAI | — | 400K | $0.75 / $4.50 | |
| 5 | OpenAI | — | 400K | $1.75 / $14.00 | |
| 6 | Alibaba Cloud / Qwen Team | — | — | — | |
| 7 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 | |
| 8 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 | |
| 9 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 | |
| 10 | OpenAI | — | 400K | $0.20 / $1.25 | |
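The Cost column gives two prices per model. The page does not state the units, so the sketch below assumes the first figure is the input price and the second the output price, both in USD per million tokens. Under that assumption, the cost of a single run could be estimated roughly as follows. `parse_cost` and `estimate_cost` are hypothetical helper names, and the token counts are illustrative only.

```python
def parse_cost(cell: str) -> tuple[float, float]:
    """Split a cell such as '$5.00 / $25.00' into (input_price, output_price)."""
    input_part, output_part = cell.split("/")
    return (
        float(input_part.strip().lstrip("$")),
        float(output_part.strip().lstrip("$")),
    )


def estimate_cost(cell: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one run from a leaderboard cost cell."""
    # ASSUMPTION: prices are USD per 1,000,000 tokens, input price first.
    input_price, output_price = parse_cost(cell)
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Illustrative token counts, not measured values.
print(estimate_cost("$5.00 / $25.00", input_tokens=1_000_000, output_tokens=100_000))
```

Rows whose cost is listed as "—" have no price to parse, so `parse_cost` raises a `ValueError` for them; a caller would need to skip those rows.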
FAQ
Common questions about OSWorld-Verified
The paper introducing OSWorld, the benchmark from which OSWorld-Verified is derived, is available at https://arxiv.org/abs/2404.07972. It describes the benchmark methodology, dataset creation, and evaluation criteria.
The OSWorld-Verified leaderboard ranks 10 AI models based on their performance on this benchmark. Currently, Claude Mythos Preview by Anthropic leads with a score of 0.796. The average score across all models is 0.640.
The highest OSWorld-Verified score is 0.796, achieved by Claude Mythos Preview from Anthropic.
Ten models have been evaluated on the OSWorld-Verified benchmark. All 10 results are self-reported; none are verified.
OSWorld-Verified is categorized under agents, general, multimodal, and vision. The benchmark evaluates multimodal models.