
ScreenSpot Pro

ScreenSpot-Pro is a GUI grounding benchmark designed to rigorously evaluate the grounding capabilities of multimodal large language models (MLLMs) in professional, high-resolution computing environments. The benchmark comprises 1,581 instructions across 23 applications spanning 5 industries and 3 operating systems, featuring authentic high-resolution screenshots from professional domains with expert annotations. Unlike previous benchmarks that focus on cropped screenshots of consumer applications, ScreenSpot-Pro captures the complexity and diversity of real-world professional software, revealing significant gaps in the GUI perception capabilities of current MLLMs.
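GUI grounding benchmarks in the ScreenSpot family are typically scored as click accuracy: the model outputs an (x, y) point for the element named in the instruction, and a prediction counts as correct if the point falls inside the element's ground-truth bounding box. The sketch below illustrates that metric; the function names and data layout are illustrative, not the benchmark's official evaluation harness.

```python
def point_in_bbox(point, bbox):
    """Return True if (x, y) lies inside bbox = (left, top, right, bottom)."""
    x, y = point
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom

def click_accuracy(predictions, ground_truth_boxes):
    """Fraction of predicted click points that land inside their target's bbox."""
    hits = sum(
        point_in_bbox(pred, box)
        for pred, box in zip(predictions, ground_truth_boxes)
    )
    return hits / len(ground_truth_boxes)

# Toy example: 2 of 3 predicted clicks land inside their target boxes.
preds = [(105, 210), (400, 50), (12, 900)]
boxes = [(100, 200, 150, 240), (390, 40, 420, 70), (500, 500, 600, 600)]
print(click_accuracy(preds, boxes))  # 2 of 3 correct
```

On high-resolution professional screenshots the target elements are often tiny relative to the image, which is why this point-in-box criterion is much harder to satisfy here than on consumer-app benchmarks.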

Paper

The ScreenSpot Pro paper is available at https://arxiv.org/abs/2504.07981.

Progress Over Time

Interactive timeline showing model performance evolution on ScreenSpot Pro

ScreenSpot Pro Leaderboard

20 models • 0 verified
Rank  Organization               Params  Context  Cost
1     OpenAI                     -       -        -
2     -                          -       -        -
3     Alibaba Cloud / Qwen Team  122B    -        -
4     Alibaba Cloud / Qwen Team  27B     -        -
5     -                          -       -        -
6     Alibaba Cloud / Qwen Team  35B     -        -
7     Alibaba Cloud / Qwen Team  -       -        -
8     Alibaba Cloud / Qwen Team  236B    262K     $0.30 / $1.49
9     Alibaba Cloud / Qwen Team  236B    262K     $0.45
10    Alibaba Cloud / Qwen Team  31B     -        -
11    Alibaba Cloud / Qwen Team  4B      -        -
12    Alibaba Cloud / Qwen Team  33B     -        -
13    Alibaba Cloud / Qwen Team  31B     -        -
14    Alibaba Cloud / Qwen Team  33B     -        -
15    Alibaba Cloud / Qwen Team  9B      -        -
16    Alibaba Cloud / Qwen Team  4B      -        -
17    Alibaba Cloud / Qwen Team  9B      -        -
18    Alibaba Cloud / Qwen Team  72B     -        -
19    Alibaba Cloud / Qwen Team  34B     -        -
20    Alibaba Cloud / Qwen Team  8B      -        -

FAQ

Common questions about ScreenSpot Pro

Where can I find the ScreenSpot Pro paper?
The ScreenSpot Pro paper is available at https://arxiv.org/abs/2504.07981. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How are models ranked on ScreenSpot Pro?
The ScreenSpot Pro leaderboard ranks 20 AI models by their performance on this benchmark. Currently, GPT-5.2 by OpenAI leads with a score of 0.863; the average score across all models is 0.592.

What is the highest ScreenSpot Pro score?
The highest ScreenSpot Pro score is 0.863, achieved by GPT-5.2 from OpenAI.

How many models have been evaluated on ScreenSpot Pro?
20 models have been evaluated on the ScreenSpot Pro benchmark, with 0 verified results and 20 self-reported results.

What categories does ScreenSpot Pro belong to?
ScreenSpot Pro is categorized under grounding, multimodal, spatial reasoning, and vision, and it evaluates multimodal models.