ScreenSpot Pro

Progress Over Time

[Interactive timeline: model performance on ScreenSpot Pro over time, showing the state-of-the-art frontier for open and proprietary models]

ScreenSpot Pro Leaderboard

23 models
Rank | Organization | Parameters | Context | Cost per 1M tokens (input / output)
1 | Anthropic | - | 1.0M | $5.00 / $25.00
2 | OpenAI | - | 400K | $1.75 / $14.00
3 | Meta | - | - | -
4 | Alibaba Cloud / Qwen Team | - | 1.0M | $0.32 / $1.28
5 | - | - | - | -
6 | Alibaba Cloud / Qwen Team | 122B | - | -
7 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40
8 | - | - | 1.0M | $0.50 / $3.00
9 | Alibaba Cloud / Qwen Team | 35B | - | -
10 | Alibaba Cloud / Qwen Team | - | 1.0M | $0.50 / $3.00
11 | Alibaba Cloud / Qwen Team | 236B | - | -
12 | Alibaba Cloud / Qwen Team | 236B | - | -
13 | Alibaba Cloud / Qwen Team | 31B | - | -
14 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60
15 | Alibaba Cloud / Qwen Team | 33B | - | -
16 | Alibaba Cloud / Qwen Team | 31B | - | -
17 | Alibaba Cloud / Qwen Team | 33B | - | -
18 | Alibaba Cloud / Qwen Team | 9B | - | -
19 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00
20 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09
21 | Alibaba Cloud / Qwen Team | 72B | - | -
22 | Alibaba Cloud / Qwen Team | 34B | - | -
23 | Alibaba Cloud / Qwen Team | 8B | - | -
About this benchmark

What is ScreenSpot Pro?

ScreenSpot-Pro is a novel GUI grounding benchmark designed to rigorously evaluate the grounding capabilities of multimodal large language models (MLLMs) in professional high-resolution computing environments. The benchmark comprises 1,581 instructions across 23 applications spanning 5 industries and 3 operating systems, featuring authentic high-resolution images from professional domains with expert annotations. Unlike previous benchmarks that focus on cropped screenshots in consumer applications, ScreenSpot-Pro addresses the complexity and diversity of real-world professional software scenarios, revealing significant performance gaps in current MLLM GUI perception capabilities.

ScreenSpot Pro is a multimodal benchmark covering spatial reasoning, grounding, and vision tasks. LLM Stats tracks 23 models on this benchmark, scored on a 0–1 scale. The current average score is 0.624, and the leader scores 0.879.

Current leaders

Claude Opus 4.8 from Anthropic currently leads the ScreenSpot Pro leaderboard of 23 evaluated AI models with a score of 0.879.

1. Claude Opus 4.8 (Anthropic): 87.9%
2. GPT-5.2 (OpenAI): 86.3%
3. Muse Spark (Meta): 84.1%
Top open-weight model: Qwen3.5-122B-A10B (#6 overall): 70.4%

Source paper

Title: ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
Authors: Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, and 4 others
Abstract

Recent advancements in Multi-modal Large Language Models (MLLMs) have led to significant progress in developing GUI agents for general tasks such as web browsing and mobile phone use. However, their application in professional domains remains under-explored. These specialized workflows introduce unique challenges for GUI perception models, including high-resolution displays, smaller target sizes, and complex environments. In this paper, we introduce ScreenSpot-Pro, a new benchmark designed to rigorously evaluate the grounding capabilities of MLLMs in high-resolution professional settings. The benchmark comprises authentic high-resolution images from a variety of professional domains with expert annotations. It spans 23 applications across five industries and three operating systems. Existing GUI grounding models perform poorly on this dataset, with the best model achieving only 18.9%. Our experiments reveal that strategically reducing the search area enhances accuracy. Based on this insight, we propose ScreenSeekeR, a visual search method that utilizes the GUI knowledge of a strong planner to guide a cascaded search, achieving state-of-the-art performance with 48.1% without any additional training. We hope that our benchmark and findings will advance the development of GUI agents for professional applications. Code, data and leaderboard can be found at https://gui-agent.github.io/grounding-leaderboard.
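
The abstract's observation that reducing the search area improves accuracy can be illustrated with a simple crop-and-remap step. This is not ScreenSeekeR itself, only a minimal sketch: it assumes a hypothetical grounding function `locate` that returns a point in the coordinates of whatever crop it is given, and the names `Region`, `to_screen`, and `ground_in_region` are illustrative rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable

Point = tuple[float, float]


@dataclass(frozen=True)
class Region:
    """Rectangular crop of the full screenshot, in screen pixel coordinates."""
    left: int
    top: int
    width: int
    height: int


def to_screen(point_in_crop: Point, region: Region) -> Point:
    """Translate a point from crop-local coordinates back to screen coordinates."""
    x, y = point_in_crop
    if not (0 <= x <= region.width and 0 <= y <= region.height):
        raise ValueError("point lies outside the crop")
    return (region.left + x, region.top + y)


def ground_in_region(
    locate: Callable[[Region, str], Point], region: Region, instruction: str
) -> Point:
    """Run a (hypothetical) grounding model on a crop, then remap its answer."""
    return to_screen(locate(region, instruction), region)


# Stand-in for a real model (assumption): always answers with the crop centre.
def fake_locate(region: Region, instruction: str) -> Point:
    return (region.width / 2, region.height / 2)


# Made-up crop covering a toolbar strip on a 3840-pixel-wide screenshot.
toolbar = Region(left=2560, top=0, width=1280, height=200)
click = ground_in_region(fake_locate, toolbar, "click the export button")
```

In a smaller crop the target occupies a larger share of the model's input. The remap step is what keeps the final answer in full-screen coordinates, which is where a benchmark would score it.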

FAQ

Common questions about the ScreenSpot Pro benchmark and leaderboard.

What is the ScreenSpot Pro leaderboard?

The ScreenSpot Pro leaderboard ranks 23 AI models based on their performance on this benchmark. Currently, Claude Opus 4.8 by Anthropic leads with a score of 0.879. The average score across all models is 0.624.

What is the highest ScreenSpot Pro score?

The highest ScreenSpot Pro score is 0.879, achieved by Claude Opus 4.8 from Anthropic.

How many models are evaluated on ScreenSpot Pro?

23 models have been evaluated on the ScreenSpot Pro benchmark, with 0 verified results and 23 self-reported results.

Where can I find the ScreenSpot Pro paper?

The ScreenSpot Pro paper is available at https://arxiv.org/abs/2504.07981. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does ScreenSpot Pro cover?

ScreenSpot Pro is categorized under multimodal, spatial reasoning, grounding, and vision. The benchmark evaluates multimodal models.

What is the best open-source model on ScreenSpot Pro?

Qwen3.5-122B-A10B by Alibaba Cloud / Qwen Team is the top-ranked open-source model on ScreenSpot Pro, with a score of 0.704 (rank #6).

Which model offers the best value on ScreenSpot Pro?

Among models scoring within 10% of the leader, GPT-5.2 from OpenAI is the cheapest, at $1.75 per million input tokens with a score of 0.863.
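
The selection rule above can be sketched as follows. The scores are the ones quoted on this page and the input prices come from the leaderboard rows for ranks 1 and 2. Reading "within 10% of the leader" as a score of at least 90% of the leader's score is an assumption, and `best_value` is a hypothetical helper name. Models whose price is not listed here are skipped.

```python
from typing import Optional

# (model, score, input price in $ per 1M tokens); None = price not listed on this page.
MODELS: list[tuple[str, float, Optional[float]]] = [
    ("Claude Opus 4.8", 0.879, 5.00),
    ("GPT-5.2", 0.863, 1.75),
    ("Muse Spark", 0.841, None),
    ("Qwen3.5-122B-A10B", 0.704, None),
]


def best_value(models, tolerance: float = 0.10):
    """Cheapest priced model whose score is within `tolerance` of the leader.

    Assumption: "within 10%" means score >= (1 - tolerance) * leader score.
    """
    leader_score = max(score for _, score, _ in models)
    threshold = (1 - tolerance) * leader_score
    candidates = [
        (name, score, price)
        for name, score, price in models
        if score >= threshold and price is not None
    ]
    # Pick the lowest input price among the remaining candidates.
    return min(candidates, key=lambda m: m[2]) if candidates else None


winner = best_value(MODELS)
```

With these four entries the threshold is 0.9 × 0.879 ≈ 0.791. That leaves Claude Opus 4.8 and GPT-5.2 as priced candidates, and GPT-5.2 has the lower input price.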

How is ScreenSpot Pro scored?

ScreenSpot Pro is scored by accuracy, reported on a 0–1 scale. Higher scores indicate better performance.
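
As a rough illustration of accuracy on a grounding task, the sketch below assumes a convention commonly used for GUI grounding benchmarks: a prediction counts as correct when the predicted click point falls inside the target element's bounding box. The names `Box`, `is_hit`, and `grounding_accuracy` are hypothetical and the sample data is made up. The paper is the reference for the exact evaluation protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical helper)."""
    left: float
    top: float
    right: float
    bottom: float


def is_hit(point: tuple[float, float], box: Box) -> bool:
    """Return True if the predicted click point lies inside the target box."""
    x, y = point
    return box.left <= x <= box.right and box.top <= y <= box.bottom


def grounding_accuracy(predictions, targets) -> float:
    """Fraction of predictions that land inside their target box (0-1 scale)."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    hits = sum(is_hit(p, t) for p, t in zip(predictions, targets))
    return hits / len(predictions)


# Made-up example on a 3840x2160 screenshot: two hits, one miss.
targets = [
    Box(100, 100, 132, 132),
    Box(3000, 40, 3024, 64),
    Box(500, 2000, 540, 2020),
]
predictions = [(110, 120), (3010, 50), (10, 10)]
accuracy = grounding_accuracy(predictions, targets)
```

Under this convention, the small targets typical of high-resolution professional software leave little margin for error, since a click a few pixels off already counts as a miss.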

How recent are the ScreenSpot Pro leaderboard results?

The ScreenSpot Pro leaderboard was last updated in July 2026 and currently includes 23 evaluated models.