OSWorld Extended
OSWorld is a scalable, real computer environment benchmark for evaluating multimodal agents on open-ended tasks across Ubuntu, Windows, and macOS. It comprises 369 computer tasks involving real web and desktop applications, OS file I/O, and multi-application workflows. The benchmark evaluates agents' ability to interact with computer interfaces using screenshots and actions in realistic computing environments.
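The screenshot-in, action-out interaction loop described above can be sketched as follows. This is a minimal, self-contained illustration with a stub environment; the class names, action schema, and `policy` function are hypothetical stand-ins, not OSWorld's actual API:

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """One step of environment feedback: raw screenshot bytes."""
    screenshot: bytes


@dataclass
class StubComputerEnv:
    """Stand-in for a desktop environment (hypothetical; OSWorld's real API differs)."""
    max_steps: int = 3
    step_count: int = 0
    trace: list = field(default_factory=list)

    def reset(self) -> Observation:
        self.step_count = 0
        self.trace.clear()
        return Observation(screenshot=b"\x89PNG...")

    def step(self, action: dict) -> tuple[Observation, bool]:
        # Record the action and report whether the episode is over.
        self.step_count += 1
        self.trace.append(action)
        done = self.step_count >= self.max_steps
        return Observation(screenshot=b"\x89PNG..."), done


def policy(obs: Observation) -> dict:
    """Placeholder for a multimodal model call mapping a screenshot to a GUI action."""
    return {"type": "click", "x": 100, "y": 200}


def run_episode(env: StubComputerEnv) -> list:
    """Observe -> act until the environment signals completion."""
    obs = env.reset()
    done = False
    while not done:
        action = policy(obs)
        obs, done = env.step(action)
    return env.trace


trace = run_episode(StubComputerEnv())
```

In a real evaluation the policy would be a multimodal model, the actions would be executed inside a VM, and task success would be checked by per-task evaluation scripts rather than a step counter.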
Progress Over Time
Interactive timeline showing model performance evolution on OSWorld Extended
OSWorld Extended Leaderboard
1 model
| Rank | Model | Organization | Score | Context | Cost (input / output per 1M tokens) | License |
|---|---|---|---|---|---|---|
| 1 | Claude 3.5 Sonnet | Anthropic | 0.220 | 200K | $3.00 / $15.00 | — |
FAQ
Common questions about OSWorld Extended
The OSWorld Extended paper is available at https://arxiv.org/abs/2404.07972. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The OSWorld Extended leaderboard currently ranks 1 AI model. Claude 3.5 Sonnet by Anthropic leads with a score of 0.220, which is also the average score, since it is the only entry.
The highest OSWorld Extended score is 0.220, achieved by Claude 3.5 Sonnet from Anthropic.
1 model has been evaluated on the OSWorld Extended benchmark, with 0 verified results and 1 self-reported result.
OSWorld Extended is categorized under agents, general, multimodal, and reasoning. The benchmark evaluates multimodal models.