OSWorld-Verified Leaderboard

Progress Over Time

[Interactive timeline: model performance evolution on OSWorld-Verified. Legend: State-of-the-art frontier, Open, Proprietary.]

OSWorld-Verified Leaderboard

12 models

Rank	Model	Organization	Parameters	Context	Cost
1	Claude Mythos Preview	Anthropic	—	—	$25.00 / $125.00
2	GPT-5.5	OpenAI	—	1.1M	$5.00 / $30.00
3	Claude Opus 4.7	Anthropic	—	1.0M	$5.00 / $25.00
4	GPT-5.4	OpenAI	—	1.0M	$2.50 / $15.00
5	Kimi K2.6	Moonshot AI	1.0T	262K	$0.95 / $4.00
6	GPT-5.4 mini	OpenAI	—	400K	$0.75 / $4.50
7	GPT-5.3 Codex	OpenAI	—	400K	$1.75 / $14.00
8	Qwen3.6 Plus	Alibaba Cloud / Qwen Team	—	1.0M	$0.50 / $3.00
9	Qwen3.5-122B-A10B	Alibaba Cloud / Qwen Team	122B	262K	$0.40 / $3.20
10	Qwen3.5-27B	Alibaba Cloud / Qwen Team	27B	262K	$0.30 / $2.40
11	Qwen3.5-35B-A3B	Alibaba Cloud / Qwen Team	35B	262K	$0.25 / $2.00
12	GPT-5.4 nano	OpenAI	—	400K	$0.20 / $1.25
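
The context and cost columns are stored as display strings, so comparing models programmatically requires parsing them first. The following is a minimal sketch, not part of the leaderboard itself. It assumes the suffixes K, M, B, and T mean decimal thousands, millions, billions, and trillions, and it treats the cost cell as an ordered pair of prices without assuming what each price covers, since the page does not state the unit. The helper names `parse_size` and `parse_cost` are hypothetical.

```python
from decimal import Decimal

# Assumption: suffixes are decimal multipliers (K = 10^3, M = 10^6, ...).
SUFFIXES = {"K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}
MISSING = {"", "—"}  # the table uses a dash for unavailable values


def parse_size(text):
    """Convert a cell such as '262K' or '1.1M' to an integer, or None if missing."""
    text = text.strip()
    if text in MISSING:
        return None
    return int(Decimal(text[:-1]) * SUFFIXES[text[-1]])


def parse_cost(text):
    """Convert a cell such as '$0.95 / $4.00' to a pair of Decimals, or None if missing."""
    text = text.strip()
    if text in MISSING:
        return None
    first, second = (Decimal(part.strip().lstrip("$")) for part in text.split("/"))
    return first, second


# A few rows copied from the table: (model, context, cost).
rows = [
    ("Claude Mythos Preview", "—", "$25.00 / $125.00"),
    ("GPT-5.5", "1.1M", "$5.00 / $30.00"),
    ("Kimi K2.6", "262K", "$0.95 / $4.00"),
    ("GPT-5.4 nano", "400K", "$0.20 / $1.25"),
]

parsed = [(name, parse_size(ctx), parse_cost(cost)) for name, ctx, cost in rows]

# Model with the largest listed context window among rows that report one.
largest = max((row for row in parsed if row[1] is not None), key=lambda row: row[1])
print(largest[0], largest[1])
```

Using `Decimal` rather than `float` keeps values such as 1.1M exact, which matters if the parsed numbers are later compared or summed.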

FAQ

Common questions about OSWorld-Verified

What is the OSWorld-Verified benchmark?

OSWorld-Verified is a verified subset of OSWorld, a scalable real computer environment for multimodal agents supporting task setup, execution-based evaluation, and interactive learning across Ubuntu, Windows, and macOS.

Where can I find the OSWorld-Verified paper?

The OSWorld-Verified paper is available at https://arxiv.org/abs/2404.07972. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

What is the OSWorld-Verified leaderboard?

The OSWorld-Verified leaderboard ranks 12 AI models based on their performance on this benchmark. Currently, Claude Mythos Preview by Anthropic leads with a score of 0.796. The average score across all models is 0.659.

What is the highest OSWorld-Verified score?

The highest OSWorld-Verified score is 0.796, achieved by Claude Mythos Preview from Anthropic.

How many models are evaluated on OSWorld-Verified?

12 models have been evaluated on the OSWorld-Verified benchmark, with 0 verified results and 12 self-reported results.

What categories does OSWorld-Verified cover?

OSWorld-Verified is categorized under agents, general, multimodal, and vision. The benchmark evaluates multimodal models.
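
The summary figures quoted for the leaderboard (12 models, a top score of 0.796, and an average of 0.659) are enough to estimate how the rest of the field compares with the leader. The sketch below is illustrative arithmetic only. Both inputs are already rounded to three decimals on the page, so the result is approximate, and the variable names are hypothetical.

```python
from decimal import Decimal

n_models = 12
top_score = Decimal("0.796")   # leader's reported score
mean_score = Decimal("0.659")  # reported average across all 12 models

# Sum of all scores implied by the reported average.
total = mean_score * n_models

# Average of the remaining 11 models once the leader is excluded.
rest_mean = (total - top_score) / (n_models - 1)

# Gap between the leader and the average of the other models.
gap = top_score - rest_mean

print(round(rest_mean, 3), round(gap, 3))
```

Under these rounded inputs, the other eleven models average roughly 0.647, about 0.149 below the leading score.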