RealWorldQA
RealWorldQA is a benchmark designed to evaluate the basic real-world spatial understanding of multimodal models. The initial release consists of over 700 anonymized images taken from vehicles and other real-world scenarios, each accompanied by a question and an easily verifiable answer. It was released by xAI as part of its Grok-1.5 Vision preview to test models' ability to understand natural scenes and spatial relationships in everyday visual contexts.
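Because each image comes with a question and an easily verifiable answer, a score on this kind of benchmark can be pictured as the fraction of questions answered correctly. The sketch below is a minimal illustration under that assumption. The page does not state the official scoring rule, so the `Example` type, the `normalize` and `accuracy` helpers, and the sample items are all hypothetical and are not taken from the dataset or from xAI's evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One benchmark item: a question about an image and its reference answer."""
    question: str
    reference: str


def normalize(answer: str) -> str:
    """Trim whitespace, lowercase, and drop trailing periods before comparing."""
    return answer.strip().lower().rstrip(".")


def accuracy(examples: list[Example], predictions: list[str]) -> float:
    """Fraction of predictions that match the reference after normalization."""
    if len(examples) != len(predictions):
        raise ValueError("examples and predictions must have the same length")
    if not examples:
        return 0.0
    correct = sum(
        normalize(pred) == normalize(ex.reference)
        for ex, pred in zip(examples, predictions)
    )
    return correct / len(examples)


# Hypothetical items, invented for illustration only (not from the dataset).
examples = [
    Example("Which is closer to the camera, the car or the cyclist?", "The cyclist"),
    Example("How many lanes go straight ahead?", "2"),
    Example("Is the traffic light green?", "No"),
    Example("Which direction does the arrow sign point?", "Left"),
]
predictions = ["the cyclist.", "2", "Yes", "left"]

score = accuracy(examples, predictions)
print(f"{score:.3f}")  # prints 0.750 (3 of 4 answers match)
```

Exact matching after light normalization is the simplest possible choice here; a real evaluation harness may parse answers differently, so numbers produced this way are not comparable to the leaderboard figures.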
RealWorldQA Leaderboard
20 models
| Rank | Model / Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | Qwen3.6 Plus, Alibaba Cloud / Qwen Team | — | — | — | |
| 2 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 | |
| 3 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 | |
| 4 | Alibaba Cloud / Qwen Team | 27B | — | — | |
| 5 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 / $3.49 | |
| 6 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.49 | |
| 7 | Alibaba Cloud / Qwen Team | 33B | — | — | |
| 8 | Alibaba Cloud / Qwen Team | 33B | — | — | |
| 9 | Alibaba Cloud / Qwen Team | 73B | — | — | |
| 10 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $1.00 | |
| 11 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $0.70 | |
| 12 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 | |
| 13 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 | |
| 14 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.08 / $0.50 | |
| 15 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60 | |
| 16 | Alibaba Cloud / Qwen Team | 7B | — | — | |
| 17 | xAI | — | — | — | |
| 18 | DeepSeek | 27B | 129K | — | |
| 19 | DeepSeek | 16B | — | — | |
| 20 | DeepSeek | 3B | — | — | |
FAQ
Common questions about RealWorldQA
The RealWorldQA leaderboard ranks 20 AI models based on their performance on this benchmark. Currently, Qwen3.6 Plus by Alibaba Cloud / Qwen Team leads with a score of 0.854. The average score across all models is 0.756.
The highest RealWorldQA score is 0.854, achieved by Qwen3.6 Plus from Alibaba Cloud / Qwen Team.
20 models have been evaluated on the RealWorldQA benchmark, with 0 verified results and 20 self-reported results.
RealWorldQA is categorized under spatial reasoning and vision. The benchmark evaluates multimodal models.