
OSWorld Extended

OSWorld is a scalable, real computer environment benchmark for evaluating multimodal agents on open-ended tasks across Ubuntu, Windows, and macOS. It comprises 369 computer tasks involving real web and desktop applications, OS file I/O, and multi-application workflows. The benchmark evaluates agents' ability to interact with computer interfaces using screenshots and actions in realistic computing environments.
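To make "screenshots and actions" concrete, the sketch below shows the kind of observe-act loop an agent runs against a single task. It is illustrative only: the DesktopEnv class, its method signatures, and the task-config path are assumptions modeled loosely on the interface published in the OSWorld repository, not details documented on this page.

```python
# Illustrative sketch of the screenshot-in, action-out loop that OSWorld
# evaluates. DesktopEnv, its methods, and the task-config path are
# assumptions modeled loosely on the public OSWorld repo
# (https://github.com/xlang-ai/OSWorld), not taken from this page.
import json

from desktop_env.desktop_env import DesktopEnv  # assumed import path


def trivial_agent(screenshot: bytes) -> str:
    """Placeholder policy: a real agent would pass the screenshot to a
    multimodal model and return executable pyautogui code as a string."""
    return "import pyautogui; pyautogui.hotkey('ctrl', 's')"


# Hypothetical path to one of the benchmark's 369 task definitions.
with open("evaluation_examples/examples/libreoffice_calc/task.json") as f:
    task_config = json.load(f)

env = DesktopEnv(action_space="pyautogui")  # actions are pyautogui code strings
obs = env.reset(task_config=task_config)    # provision the VM and set up the task

done, steps = False, 0
while not done and steps < 15:              # cap episode length for the sketch
    # The agent sees only the current screen and emits the next action.
    action = trivial_agent(obs["screenshot"])
    obs, reward, done, info = env.step(action)
    steps += 1

env.close()
```

In the real benchmark, the placeholder policy is replaced by a multimodal model, and task success is judged by an execution-based checker rather than by the agent's own assessment.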

Paper: https://arxiv.org/abs/2404.07972

Progress Over Time

[Interactive timeline showing model performance evolution on OSWorld Extended; legend: state-of-the-art frontier, open vs. proprietary models]

OSWorld Extended Leaderboard

1 model

Rank  Model              Score  Context  Cost ($/1M tokens, in/out)  License
1     Claude 3.5 Sonnet  0.220  200K     $3.00 / $15.00              Proprietary

FAQ

Common questions about OSWorld Extended

Q: What is OSWorld Extended?
A: OSWorld is a scalable, real-computer-environment benchmark for evaluating multimodal agents on open-ended tasks across Ubuntu, Windows, and macOS. It comprises 369 tasks involving real web and desktop applications, OS file I/O, and multi-application workflows, and measures agents' ability to operate computer interfaces from screenshots in realistic computing environments.

Q: Where can I find the OSWorld Extended paper?
A: The paper is available at https://arxiv.org/abs/2404.07972. It details the benchmark methodology, dataset creation, and evaluation criteria.

Q: Which model leads the OSWorld Extended leaderboard?
A: The leaderboard currently ranks 1 AI model. Claude 3.5 Sonnet by Anthropic leads with a score of 0.220; with a single entry, that is also the average score.

Q: What is the highest OSWorld Extended score?
A: The highest score is 0.220, achieved by Claude 3.5 Sonnet from Anthropic.

Q: How many models have been evaluated?
A: 1 model has been evaluated on OSWorld Extended, with 0 verified results and 1 self-reported result.

Q: What categories does OSWorld Extended fall under?
A: OSWorld Extended is categorized under agents, general, multimodal, and reasoning; it evaluates multimodal models.