RealWorldQA

RealWorldQA is a benchmark designed to evaluate the basic real-world spatial understanding of multimodal models. The initial release consists of over 700 anonymized images taken from vehicles and other real-world scenarios, each accompanied by a question and an easily verifiable answer. xAI released it as part of the Grok-1.5 Vision preview to test how well models understand natural scenes and spatial relationships in everyday visual contexts.
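Because each image comes with a question and an easily verifiable answer, a score on this benchmark can be read as the fraction of questions answered correctly. The sketch below illustrates that idea only. The matching rule (case-insensitive comparison after trimming whitespace and a trailing period) is an assumption, not xAI's documented scoring protocol, and the names `normalize` and `accuracy` are hypothetical.

```python
def normalize(text: str) -> str:
    """Lowercase, trim whitespace, and drop trailing periods so 'Yes.' matches 'yes'."""
    return text.strip().lower().rstrip(".")


def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of predictions that equal the reference answer after normalization."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty set")
    correct = sum(normalize(p) == normalize(a) for p, a in zip(predictions, answers))
    return correct / len(answers)


# Hypothetical model outputs and reference answers (illustrative, not benchmark data).
preds = ["Yes.", "left", "3", "B"]
refs = ["yes", "right", "3", "B"]
print(accuracy(preds, refs))  # 3 of the 4 answers match
```

A stricter or looser matching rule would change the resulting number, which is one reason self-reported scores from different sources are not always directly comparable.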

Progress Over Time

[Figure: interactive timeline of model performance on RealWorldQA, marking the state-of-the-art frontier and distinguishing open from proprietary models.]

RealWorldQA Leaderboard

20 models
Rank  Organization               Parameters  Context  Cost
1     Alibaba Cloud / Qwen Team  -           -        -
2     Alibaba Cloud / Qwen Team  122B        262K     $0.40 / $3.20
3     Alibaba Cloud / Qwen Team  35B         262K     $0.25 / $2.00
4     Alibaba Cloud / Qwen Team  27B         -        -
5     Alibaba Cloud / Qwen Team  236B        262K     $0.45 / $3.49
6     Alibaba Cloud / Qwen Team  236B        262K     $0.30 / $1.49
7     Alibaba Cloud / Qwen Team  33B         -        -
8     Alibaba Cloud / Qwen Team  33B         -        -
9     Alibaba Cloud / Qwen Team  73B         -        -
10    Alibaba Cloud / Qwen Team  31B         262K     $0.20 / $1.00
11    Alibaba Cloud / Qwen Team  31B         262K     $0.20 / $0.70
12    Alibaba Cloud / Qwen Team  9B          262K     $0.18 / $2.09
13    Alibaba Cloud / Qwen Team  4B          262K     $0.10 / $1.00
14    Alibaba Cloud / Qwen Team  9B          262K     $0.08 / $0.50
15    Alibaba Cloud / Qwen Team  4B          262K     $0.10 / $0.60
16    Alibaba Cloud / Qwen Team  7B          -        -
17    -                          -           -        -
18    DeepSeek                   27B         129K     -
19    -                          16B         -        -
20    -                          3B          -        -

FAQ

Common questions about RealWorldQA

The RealWorldQA leaderboard ranks 20 AI models by their performance on this benchmark. Qwen3.6 Plus from Alibaba Cloud / Qwen Team currently leads with the highest score, 0.854. The average score across all models is 0.756.
All 20 results on the RealWorldQA leaderboard are self-reported; none are verified.
RealWorldQA is categorized under spatial reasoning and vision, and it evaluates multimodal models.