ScreenSpot Pro
ScreenSpot Pro Leaderboard
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | Anthropic | — | 1.0M | $5.00 / $25.00 | |
| 2 | OpenAI | — | 400K | $1.75 / $14.00 | |
| 3 | Meta | — | — | — | |
| 4 | Alibaba Cloud / Qwen Team | — | 1.0M | $0.32 / $1.28 | |
| 5 | Google | — | — | — | |
| 6 | Alibaba Cloud / Qwen Team | 122B | — | — | |
| 7 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 | |
| 8 | Google | — | 1.0M | $0.50 / $3.00 | |
| 9 | Alibaba Cloud / Qwen Team | 35B | — | — | |
| 10 | Alibaba Cloud / Qwen Team | — | 1.0M | $0.50 / $3.00 | |
| 11 | Alibaba Cloud / Qwen Team | 236B | — | — | |
| 12 | Alibaba Cloud / Qwen Team | 236B | — | — | |
| 13 | Alibaba Cloud / Qwen Team | 31B | — | — | |
| 14 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60 | |
| 15 | Alibaba Cloud / Qwen Team | 33B | — | — | |
| 16 | Alibaba Cloud / Qwen Team | 31B | — | — | |
| 17 | Alibaba Cloud / Qwen Team | 33B | — | — | |
| 18 | Alibaba Cloud / Qwen Team | 9B | — | — | |
| 19 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 | |
| 20 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 | |
| 21 | Alibaba Cloud / Qwen Team | 72B | — | — | |
| 22 | Alibaba Cloud / Qwen Team | 34B | — | — | |
| 23 | Alibaba Cloud / Qwen Team | 8B | — | — | |
What is ScreenSpot Pro?
ScreenSpot-Pro is a novel GUI grounding benchmark designed to rigorously evaluate the grounding capabilities of multimodal large language models (MLLMs) in professional high-resolution computing environments. The benchmark comprises 1,581 instructions across 23 applications spanning 5 industries and 3 operating systems, featuring authentic high-resolution images from professional domains with expert annotations. Unlike previous benchmarks that focus on cropped screenshots in consumer applications, ScreenSpot-Pro addresses the complexity and diversity of real-world professional software scenarios, revealing significant performance gaps in current MLLM GUI perception capabilities.
ScreenSpot Pro is a multimodal benchmark covering spatial reasoning, grounding, and vision tasks. LLM Stats tracks 23 models on this benchmark, scored on a 0–1 scale. The current average is about 0.6, and the leader scores 0.879.
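The 0–1 scores above can be read as the fraction of instructions a model grounds correctly. This page does not spell out the scoring rule, so the sketch below assumes a common convention for GUI grounding benchmarks: a prediction counts as correct when the predicted click point falls inside the annotated target box. The `Box` class and the `grounding_accuracy` function are hypothetical names introduced for illustration, not part of any official evaluation harness.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned target region in pixel coordinates (hypothetical layout)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # Edges are treated as inside the box; this is an assumption.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(
    points: Sequence[Tuple[float, float]],
    boxes: Sequence[Box],
) -> float:
    """Return the fraction of predicted points that land in their target box."""
    if len(points) != len(boxes):
        raise ValueError("need exactly one predicted point per target box")
    if not boxes:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(points, boxes))
    return hits / len(boxes)


# Three toy instructions: the first and third predictions hit, the second misses.
predictions = [(10, 10), (50, 50), (200, 200)]
targets = [Box(0, 0, 20, 20), Box(0, 0, 20, 20), Box(150, 150, 250, 250)]
score = grounding_accuracy(predictions, targets)
```

Under this assumed rule, a score of 0.879 would correspond to roughly 88 correct predictions out of every 100 instructions, which is the same quantity the paper's abstract reports as a percentage.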
Current leaders
Claude Opus 4.8 from Anthropic currently leads the ScreenSpot Pro leaderboard with a score of 0.879, the highest among the 23 evaluated AI models.
Source paper
- Title: ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
- Authors: Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, and 4 others
- Published: arXiv 2504.07981
Abstract
Recent advancements in Multi-modal Large Language Models (MLLMs) have led to significant progress in developing GUI agents for general tasks such as web browsing and mobile phone use. However, their application in professional domains remains under-explored. These specialized workflows introduce unique challenges for GUI perception models, including high-resolution displays, smaller target sizes, and complex environments. In this paper, we introduce ScreenSpot-Pro, a new benchmark designed to rigorously evaluate the grounding capabilities of MLLMs in high-resolution professional settings. The benchmark comprises authentic high-resolution images from a variety of professional domains with expert annotations. It spans 23 applications across five industries and three operating systems. Existing GUI grounding models perform poorly on this dataset, with the best model achieving only 18.9%. Our experiments reveal that strategically reducing the search area enhances accuracy. Based on this insight, we propose ScreenSeekeR, a visual search method that utilizes the GUI knowledge of a strong planner to guide a cascaded search, achieving state-of-the-art performance with 48.1% without any additional training. We hope that our benchmark and findings will advance the development of GUI agents for professional applications. Code, data and leaderboard can be found at https://gui-agent.github.io/grounding-leaderboard.
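The abstract's observation that reducing the search area improves accuracy can be illustrated with simple coordinate arithmetic. The sketch below is not ScreenSeekeR and does not reproduce the paper's cascaded search. It only shows two things any crop-based approach needs: how much larger a small target becomes relative to the view after cropping, and how a point predicted inside the crop maps back to full-screenshot pixels. The function names and the example sizes (a 3840x2160 screenshot, a 24x24 icon) are hypothetical.

```python
from typing import Tuple

# (left, top, right, bottom) of a crop, in full-screenshot pixels.
Region = Tuple[int, int, int, int]


def target_area_fraction(
    target_size: Tuple[int, int],
    view_size: Tuple[int, int],
) -> float:
    """Fraction of the visible area that the target occupies."""
    target_w, target_h = target_size
    view_w, view_h = view_size
    return (target_w * target_h) / (view_w * view_h)


def crop_to_full(point_in_crop: Tuple[int, int], region: Region) -> Tuple[int, int]:
    """Map a point predicted inside a crop back to full-screenshot pixels."""
    x, y = point_in_crop
    left, top, right, bottom = region
    if not (0 <= x <= right - left and 0 <= y <= bottom - top):
        raise ValueError("point lies outside the cropped region")
    return (left + x, top + y)


# Hypothetical example: a 24x24 icon on a 3840x2160 screenshot.
full_fraction = target_area_fraction((24, 24), (3840, 2160))

# Cropping to a 960x540 region makes the same icon 16 times larger
# relative to what the model sees.
crop_region: Region = (1920, 1080, 2880, 1620)
crop_fraction = target_area_fraction((24, 24), (960, 540))
gain = crop_fraction / full_fraction

# A click predicted at (100, 50) inside the crop, in full-image pixels.
full_point = crop_to_full((100, 50), crop_region)
```

The mapping step matters because benchmark annotations are expressed in full-screenshot coordinates, so a prediction made on a crop has to be translated back before it can be scored.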
FAQ
Common questions about the ScreenSpot Pro benchmark and leaderboard.