OSWorld Extended Leaderboard

Progress Over Time

Timeline of model performance on OSWorld Extended, showing the state-of-the-art frontier and distinguishing open from proprietary models.

OSWorld Extended Leaderboard

1 model

Rank	Model			Context	Cost (input / output, USD per 1M tokens)	License
1	Claude 3.5 Sonnet (Anthropic)		—	200K	$3.00 / $15.00

FAQ

Common questions about OSWorld Extended

What is the OSWorld Extended benchmark?

OSWorld is a scalable benchmark built on a real computer environment for evaluating multimodal agents on open-ended tasks across Ubuntu, Windows, and macOS. It comprises 369 computer tasks involving real web and desktop applications, OS file I/O, and multi-application workflows. The benchmark evaluates an agent's ability to operate computer interfaces from screenshots and actions in realistic computing environments.

Where can I find the OSWorld Extended paper?

The OSWorld Extended paper is available at https://arxiv.org/abs/2404.07972. It describes the benchmark methodology, dataset creation, and evaluation criteria in detail.

What is the OSWorld Extended leaderboard?

The OSWorld Extended leaderboard ranks AI models by their performance on this benchmark. It currently lists 1 model: Claude 3.5 Sonnet by Anthropic, which leads with a score of 0.220. With only one model evaluated, the average score is also 0.220.

What is the highest OSWorld Extended score?

The highest OSWorld Extended score is 0.220, achieved by Claude 3.5 Sonnet from Anthropic.

How many models are evaluated on OSWorld Extended?

1 model has been evaluated on the OSWorld Extended benchmark, with 0 verified results and 1 self-reported result.

What categories does OSWorld Extended cover?

OSWorld Extended is categorized under agents, general, multimodal, and reasoning. The benchmark evaluates multimodal models.