ScreenSpot
ScreenSpot is the first realistic GUI grounding benchmark that encompasses mobile, desktop, and web environments. The dataset comprises over 1,200 instructions from iOS, Android, macOS, Windows, and Web environments, each annotated with an element type (text or icon/widget). It is designed to evaluate how accurately visual GUI agents can locate screen elements from natural language instructions.
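The standard way GUI grounding benchmarks of this kind are scored is point-in-box accuracy: a prediction counts as correct when the model's predicted click point falls inside the ground-truth bounding box of the target element. A minimal sketch of that scoring rule (coordinates and helper names are illustrative, not from the benchmark's released code):

```python
def point_in_bbox(point, bbox):
    """bbox is (left, top, right, bottom) in pixels; point is (x, y)."""
    x, y = point
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom

def grounding_accuracy(predictions, bboxes):
    """Fraction of predicted click points that land inside their target boxes."""
    hits = sum(point_in_bbox(p, b) for p, b in zip(predictions, bboxes))
    return hits / len(predictions)

# Toy example with made-up coordinates: first point hits, second misses.
preds = [(150, 40), (300, 500)]
boxes = [(100, 20, 200, 60), (0, 0, 50, 50)]
print(grounding_accuracy(preds, boxes))  # → 0.5
```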
Progress Over Time
[Interactive timeline of model performance on ScreenSpot over time, showing the state-of-the-art frontier for open and proprietary models]
ScreenSpot Leaderboard
13 models • 0 verified
| Rank | Organization | Score | Params | Context | Cost (input / output) |
|---|---|---|---|---|---|
| 1 | Alibaba Cloud / Qwen Team | 0.958 | 33B | — | — |
| 2 | Alibaba Cloud / Qwen Team | 0.957 | 33B | — | — |
| 3 | Alibaba Cloud / Qwen Team | 0.954 | 236B | 262K | $0.30 / $1.50 |
| 3 | Alibaba Cloud / Qwen Team | 0.954 | 236B | 262K | $0.45 / $3.49 |
| 5 | Alibaba Cloud / Qwen Team | 0.947 | 31B | 262K | $0.20 / $0.70 |
| 5 | Alibaba Cloud / Qwen Team | 0.947 | 31B | 262K | $0.20 / $1.00 |
| 7 | Alibaba Cloud / Qwen Team | 0.944 | 9B | 262K | $0.08 / $0.50 |
| 8 | Alibaba Cloud / Qwen Team | 0.940 | 4B | 262K | $0.10 / $0.60 |
| 9 | Alibaba Cloud / Qwen Team | 0.936 | 9B | 262K | $0.18 / $2.09 |
| 10 | Alibaba Cloud / Qwen Team | 0.929 | 4B | 262K | $0.10 / $1.00 |
| 11 | Alibaba Cloud / Qwen Team | 0.885 | 34B | — | — |
| 12 | Alibaba Cloud / Qwen Team | 0.871 | 72B | — | — |
| 13 | Alibaba Cloud / Qwen Team | 0.847 | 8B | — | — |
FAQ
Common questions about ScreenSpot
**What is ScreenSpot?**

ScreenSpot is the first realistic GUI grounding benchmark that encompasses mobile, desktop, and web environments. The dataset comprises over 1,200 instructions from iOS, Android, macOS, Windows, and Web environments, each annotated with an element type (text or icon/widget). It is designed to evaluate how accurately visual GUI agents can locate screen elements from natural language instructions.
**Where can I find the ScreenSpot paper?**

The ScreenSpot paper is available at https://arxiv.org/abs/2401.10935. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
**How are models ranked on ScreenSpot?**

The ScreenSpot leaderboard ranks 13 AI models by their performance on this benchmark. Currently, Qwen3 VL 32B Instruct by Alibaba Cloud / Qwen Team leads with a score of 0.958. The average score across all models is 0.928.
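The quoted average can be reproduced directly from the leaderboard's score column; a quick arithmetic check (scores copied from the table above):

```python
# Scores copied from the ScreenSpot leaderboard table.
scores = [0.958, 0.957, 0.954, 0.954, 0.947, 0.947, 0.944,
          0.940, 0.936, 0.929, 0.885, 0.871, 0.847]

average = sum(scores) / len(scores)
print(round(average, 3))  # → 0.928
```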
**What is the highest ScreenSpot score?**

The highest ScreenSpot score is 0.958, achieved by Qwen3 VL 32B Instruct from Alibaba Cloud / Qwen Team.
**How many models have been evaluated on ScreenSpot?**

13 models have been evaluated on the ScreenSpot benchmark, with 0 verified results and 13 self-reported results.
**What categories does ScreenSpot cover?**

ScreenSpot is categorized under grounding, multimodal, spatial reasoning, and vision. The benchmark evaluates multimodal models.