OSWorld

Name: OSWorld Leaderboard — AI Model Scores
Creator: LLM Stats
License: https://llm-stats.com/legal/terms-of-service

Progress Over Time

[Interactive timeline of model performance on OSWorld. Legend: state-of-the-art frontier, open, proprietary.]

OSWorld Leaderboard

20 models

Rank	Model	Organization	Parameters	Context	Cost (input / output per 1M tokens)
1	Seed 2.1 Pro	ByteDance	—	—	—
2	Seed 2.1 Turbo	ByteDance	—	—	—
3	Claude Opus 4.6	Anthropic	—	1.0M	$5.00 / $25.00
4	Claude Sonnet 4.6	Anthropic	—	200K	$3.00 / $15.00
5	Qwen3 VL 235B A22B Instruct	Alibaba Cloud / Qwen Team	236B	—	—
6	Claude Opus 4.5	Anthropic	—	—	—
7	GLM-5V-Turbo	Zhipu AI	—	—	—
8	Claude Sonnet 4.5	Anthropic	—	200K	$3.00 / $15.00
9	Claude Haiku 4.5	Anthropic	—	200K	$1.00 / $5.00
10	Qwen3 VL 32B Thinking	Alibaba Cloud / Qwen Team	33B	—	—
11	Qwen3 VL 235B A22B Thinking	Alibaba Cloud / Qwen Team	236B	—	—
12	Qwen3 VL 8B Thinking	Alibaba Cloud / Qwen Team	9B	—	—
12	Qwen3 VL 8B Instruct	Alibaba Cloud / Qwen Team	9B	—	—
14	Qwen3 VL 32B Instruct	Alibaba Cloud / Qwen Team	33B	—	—
15	Qwen3 VL 4B Thinking	Alibaba Cloud / Qwen Team	4B	262K	$0.10 / $1.00
16	Qwen3 VL 30B A3B Thinking	Alibaba Cloud / Qwen Team	31B	—	—
17	Qwen3 VL 30B A3B Instruct	Alibaba Cloud / Qwen Team	31B	—	—
18	Qwen3 VL 4B Instruct	Alibaba Cloud / Qwen Team	4B	262K	$0.10 / $0.60
19	Qwen2.5 VL 72B Instruct	Alibaba Cloud / Qwen Team	72B	—	—
20	Qwen2.5 VL 32B Instruct	Alibaba Cloud / Qwen Team	34B	—	—

Sub-benchmarks

OSWorld 2.0

OSWorld 2.0 is a benchmark of 108 long-horizon, real-world computer-use workflows spanning everyday and professional tasks. Each task is an end-to-end workflow that takes human users a median of about 1.6 hours, scored with a binary-completion metric, and targets challenges such as dynamic environments, cross-source reasoning, and implicit-state inference.

Modality: multimodal. Maximum score: 1.
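
As a rough illustration of a binary-completion metric, the sketch below scores each task as 1 if it was completed and 0 otherwise, then averages over tasks, which yields a value between 0 and 1. The function name and the sample outcomes are hypothetical and are not taken from OSWorld 2.0 itself.

```python
from typing import Sequence


def binary_completion_score(completed: Sequence[bool]) -> float:
    """Return the fraction of tasks completed (hypothetical helper)."""
    if not completed:
        raise ValueError("at least one task outcome is required")
    # Each task contributes 1 if completed and 0 otherwise; no partial credit.
    return sum(1 for done in completed if done) / len(completed)


# Made-up outcomes for four tasks, for illustration only.
outcomes = [True, False, True, True]
score = binary_completion_score(outcomes)
print(f"{score:.2f}")  # prints 0.75 (3 of 4 tasks completed)
```

Under this reading, a workflow that is nearly finished still counts as 0, which is one reason scores on long-horizon tasks can be low.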

OSWorld-G

OSWorld-G (Grounding) evaluates screenshot grounding accuracy for OS automation tasks.

Modality: image. Maximum score: 100.

OSWorld-Verified

OSWorld-Verified is a verified subset of OSWorld, a scalable real computer environment for multimodal agents supporting task setup, execution-based evaluation, and interactive learning across Ubuntu, Windows, and macOS.

Modality: multimodal. Maximum score: 1.

About this benchmark

What is OSWorld?

OSWorld is the first-of-its-kind scalable, real computer environment for multimodal agents. It supports task setup, execution-based evaluation, and interactive learning across Ubuntu, Windows, and macOS, and its benchmark comprises 369 computer tasks involving real web and desktop applications, OS file I/O, and multi-application workflows.

OSWorld is a multimodal benchmark categorized under multimodal, general, agents, and vision. LLM Stats tracks 20 models on this benchmark, scored on a 0–1 scale. The current average is 0.460, with the leader at 0.788.

Current leaders

Seed 2.1 Pro from ByteDance currently leads the OSWorld leaderboard with a score of 0.788 across 20 evaluated AI models.

Seed 2.1 Pro (ByteDance): 78.8%

Seed 2.1 Turbo (ByteDance): 76.4%

Claude Opus 4.6 (Anthropic): 72.7%

Top open-weight model: Qwen3 VL 235B A22B Instruct (rank #5), 66.7%

Source paper

Title: OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Authors: Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, and 13 others
Published: April 11, 2024
arXiv: 2404.07972

Abstract

Autonomous agents that accomplish complex computer tasks with minimal human interventions have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains, failing to reflect the diverse and complex nature of real-world computer use, thereby limiting the scope of tasks and agent scalability. To address this issue, we introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems such as Ubuntu, Windows, and macOS. OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications. Building upon OSWorld, we create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications. Each task example is derived from real-world computer use cases and includes a detailed initial state setup configuration and a custom execution-based evaluation script for reliable, reproducible evaluation. Extensive evaluation of state-of-the-art LLM/VLM-based agents on OSWorld reveals significant deficiencies in their ability to serve as computer assistants. While humans can accomplish over 72.36% of the tasks, the best model achieves only 12.24% success, primarily struggling with GUI grounding and operational knowledge. Comprehensive analysis using OSWorld provides valuable insights for developing multimodal generalist agents that were not possible with previous benchmarks. Our code, environment, baseline models, and data are publicly available at https://os-world.github.io.
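
To make the idea of execution-based evaluation concrete, here is a minimal sketch in which a checker inspects the state left behind after an agent acts, rather than comparing the agent's actions against a reference trace. The task, file names, and function names are all hypothetical; this is not one of OSWorld's actual setup configurations or evaluation scripts.

```python
import tempfile
from pathlib import Path
from typing import Callable

CONTENT = "quarterly summary\n"


def setup_task(workdir: Path) -> None:
    """Create the initial state for a hypothetical task: rename notes.txt to report.txt."""
    (workdir / "notes.txt").write_text(CONTENT)


def check_task(workdir: Path) -> bool:
    """Pass only if the final file-system state matches the goal."""
    old, new = workdir / "notes.txt", workdir / "report.txt"
    return (not old.exists()) and new.exists() and new.read_text() == CONTENT


def run_episode(agent: Callable[[Path], None]) -> bool:
    """Set up a fresh environment, let the agent act, then evaluate the result."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        setup_task(workdir)
        agent(workdir)
        return check_task(workdir)


def renaming_agent(workdir: Path) -> None:
    """Stand-in agent that performs the rename."""
    (workdir / "notes.txt").rename(workdir / "report.txt")


def idle_agent(workdir: Path) -> None:
    """Stand-in agent that does nothing."""


print(run_episode(renaming_agent))  # True
print(run_episode(idle_agent))      # False
```

Because the check looks only at the resulting state, any sequence of actions that reaches the goal passes, which suits open-ended tasks with many valid solution paths.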

FAQ

Common questions about the OSWorld benchmark and leaderboard.

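What is the OSWorld benchmark?

OSWorld is a benchmark of 369 computer tasks built on a scalable, real computer environment for multimodal agents. The tasks involve real web and desktop applications, OS file I/O, and multi-application workflows across Ubuntu, Windows, and macOS.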

What is the OSWorld leaderboard?

The OSWorld leaderboard ranks 20 AI models based on their performance on this benchmark. Currently, Seed 2.1 Pro by ByteDance leads with a score of 0.788. The average score across all models is 0.460.

What is the highest OSWorld score?

The highest OSWorld score is 0.788, achieved by Seed 2.1 Pro from ByteDance.

How many models are evaluated on OSWorld?

20 models have been evaluated on the OSWorld benchmark, with 0 verified results and 20 self-reported results.

Where can I find the OSWorld paper?

The OSWorld paper is available at https://arxiv.org/abs/2404.07972. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does OSWorld cover?

OSWorld is categorized under multimodal, general, agents, and vision. The benchmark evaluates multimodal models.

Are there variants of OSWorld?

Yes. OSWorld has three related variants: OSWorld 2.0, OSWorld-G, and OSWorld-Verified.

What is the best open-source model on OSWorld?

Qwen3 VL 235B A22B Instruct by Alibaba Cloud / Qwen Team is the top-ranked open-source model on OSWorld, with a score of 0.667 (rank #5).

Which model offers the best value on OSWorld?

Among models scoring within 10% of the leader, Claude Sonnet 4.6 from Anthropic is the cheapest, at $3.00 per million input tokens with a score of 0.725.
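
The selection described above can be sketched as follows, using only the scores and input prices stated on this page. Two assumptions are made here: "within 10% of the leader" is read as a score of at least 90% of the leader's score, and models with no listed price are excluded from the comparison. The `Entry` class and `best_value` function are hypothetical names, not part of any LLM Stats API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    model: str
    score: float
    input_price: Optional[float]  # USD per million input tokens; None if not listed


# Scores and prices as stated on this page.
entries = [
    Entry("Seed 2.1 Pro", 0.788, None),
    Entry("Seed 2.1 Turbo", 0.764, None),
    Entry("Claude Opus 4.6", 0.727, 5.00),
    Entry("Claude Sonnet 4.6", 0.725, 3.00),
    Entry("Qwen3 VL 235B A22B Instruct", 0.667, None),
]


def best_value(items: List[Entry], tolerance: float = 0.10) -> Optional[Entry]:
    """Return the cheapest priced model scoring within `tolerance` of the leader."""
    leader_score = max(e.score for e in items)
    cutoff = leader_score * (1 - tolerance)  # assumption: relative, not absolute, margin
    candidates = [e for e in items if e.score >= cutoff and e.input_price is not None]
    return min(candidates, key=lambda e: e.input_price) if candidates else None


pick = best_value(entries)
print(pick.model if pick else "no priced model within the cutoff")
```

With the figures above, the cutoff is about 0.709, so Claude Opus 4.6 and Claude Sonnet 4.6 are the only priced candidates, and the cheaper of the two is selected.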

How recent are the OSWorld leaderboard results?

The OSWorld leaderboard was last updated in July 2026 and currently includes 20 evaluated models.