ScreenSpot

ScreenSpot is the first realistic GUI grounding benchmark spanning mobile, desktop, and web environments. The dataset comprises over 1,200 instructions from iOS, Android, macOS, Windows, and Web, with each target element annotated by type (text or icon/widget). It is designed to evaluate visual GUI agents' ability to accurately locate screen elements from natural language instructions.
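As a rough illustration of how such a grounding benchmark is typically scored, the sketch below counts a prediction as correct when the predicted click point falls inside the ground-truth element's bounding box, then averages over samples. This is a minimal assumed scoring rule, not the benchmark's official evaluation code; the data layout is illustrative.

```python
# Minimal sketch of point-in-bounding-box grounding accuracy.
# Assumption: bbox = (left, top, right, bottom) in pixel coordinates.

def is_hit(pred_xy, bbox):
    """Return True if the predicted (x, y) click lands inside bbox."""
    x, y = pred_xy
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom

def grounding_accuracy(samples):
    """samples: iterable of (pred_xy, bbox) pairs; returns hit rate."""
    samples = list(samples)
    hits = sum(is_hit(pred, box) for pred, box in samples)
    return hits / len(samples)

# Illustrative usage with made-up coordinates:
samples = [
    ((120, 45), (100, 30, 180, 60)),  # inside the box  -> hit
    ((10, 10), (100, 30, 180, 60)),   # outside the box -> miss
]
print(grounding_accuracy(samples))  # 0.5
```

Per-category scores (text vs. icon/widget, or per platform) follow by filtering the sample list before averaging.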

Paper: https://arxiv.org/abs/2401.10935

Progress Over Time

[Interactive timeline: model performance evolution on ScreenSpot, with the state-of-the-art frontier marked and models labeled Open or Proprietary.]

ScreenSpot Leaderboard

13 models • 0 verified
Rank  Organization                Score  Params  Context  Cost (input / output)
1     Alibaba Cloud / Qwen Team   0.958  33B     —        —
2     Alibaba Cloud / Qwen Team   0.957  33B     —        —
3     Alibaba Cloud / Qwen Team   0.954  236B    262K     $0.30 / $1.50
3     Alibaba Cloud / Qwen Team   0.954  236B    262K     $0.45 / $3.49
5     Alibaba Cloud / Qwen Team   0.947  31B     262K     $0.20 / $0.70
5     Alibaba Cloud / Qwen Team   0.947  31B     262K     $0.20 / $1.00
7     Alibaba Cloud / Qwen Team   0.944  9B      262K     $0.08 / $0.50
8     Alibaba Cloud / Qwen Team   0.940  4B      262K     $0.10 / $0.60
9     Alibaba Cloud / Qwen Team   0.936  9B      262K     $0.18 / $2.09
10    Alibaba Cloud / Qwen Team   0.929  4B      262K     $0.10 / $1.00
11    Alibaba Cloud / Qwen Team   0.885  34B     —        —
12    Alibaba Cloud / Qwen Team   0.871  72B     —        —
13    Alibaba Cloud / Qwen Team   0.847  8B      —        —

FAQ

Common questions about ScreenSpot

What is ScreenSpot?
ScreenSpot is the first realistic GUI grounding benchmark spanning mobile, desktop, and web environments. The dataset comprises over 1,200 instructions from iOS, Android, macOS, Windows, and Web, with each target element annotated by type (text or icon/widget). It is designed to evaluate visual GUI agents' ability to accurately locate screen elements from natural language instructions.

Where can I find the ScreenSpot paper?
The ScreenSpot paper is available at https://arxiv.org/abs/2401.10935. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the ScreenSpot leaderboard?
The ScreenSpot leaderboard ranks 13 AI models by their performance on this benchmark. Currently, Qwen3 VL 32B Instruct by Alibaba Cloud / Qwen Team leads with a score of 0.958. The average score across all models is 0.928.

What is the highest ScreenSpot score?
The highest ScreenSpot score is 0.958, achieved by Qwen3 VL 32B Instruct from Alibaba Cloud / Qwen Team.

How many models have been evaluated on ScreenSpot?
13 models have been evaluated on the ScreenSpot benchmark, with 0 verified results and 13 self-reported results.

What categories does ScreenSpot fall under?
ScreenSpot is categorized under grounding, multimodal, spatial reasoning, and vision. The benchmark evaluates multimodal models.