RealWorldQA
RealWorldQA Leaderboard
| Rank | Organization | Parameters | Context | Cost | License | |
|---|---|---|---|---|---|---|
| 1 | Alibaba Cloud / Qwen Team | — | 1.0M | $0.32 / $1.28 | ||
| 2 | ByteDance | — | — | — | ||
| 3 | ByteDance | — | — | — | ||
| 4 | Alibaba Cloud / Qwen Team | — | 1.0M | $0.50 / $3.00 | ||
| 5 | Alibaba Cloud / Qwen Team | 35B | — | — | ||
| 6 | Alibaba Cloud / Qwen Team | 122B | — | — | ||
| 7 | Alibaba Cloud / Qwen Team | 35B | — | — | ||
| 7 | Alibaba Cloud / Qwen Team | 28B | 262K | $0.60 / $3.60 | ||
| 9 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 | ||
| 10 | Alibaba Cloud / Qwen Team | 236B | — | — | ||
| 11 | Alibaba Cloud / Qwen Team | 236B | — | — | ||
| 12 | Alibaba Cloud / Qwen Team | 33B | — | — | ||
| 13 | Alibaba Cloud / Qwen Team | 33B | — | — | ||
| 14 | Alibaba Cloud / Qwen Team | 73B | — | — | ||
| 15 | Alibaba Cloud / Qwen Team | 31B | — | — | ||
| 16 | Alibaba Cloud / Qwen Team | 31B | — | — | ||
| 17 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 | ||
| 18 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 | ||
| 19 | Alibaba Cloud / Qwen Team | 9B | — | — | ||
| 20 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60 | ||
| 21 | Alibaba Cloud / Qwen Team | 7B | — | — | ||
| 22 | xAI | — | — | — | ||
| 23 | DeepSeek | 27B | — | — | ||
| 24 | DeepSeek | 16B | — | — | ||
| 25 | DeepSeek | 3B | — | — |
What is RealWorldQA?
RealWorldQA is a benchmark designed to evaluate the basic real-world spatial understanding of multimodal models. The initial release consists of over 700 anonymized images taken from vehicles and other real-world scenarios, each accompanied by a question and an easily verifiable answer. xAI released it as part of its Grok-1.5 Vision preview to test models' ability to understand natural scenes and spatial relationships in everyday visual contexts.
As a multimodal benchmark, RealWorldQA evaluates models on spatial reasoning and vision tasks. LLM Stats tracks 25 models on it, scored on a 0–1 scale. The current average is about 0.8, and the leader scores 0.869 (about 0.9 when rounded to one decimal place).
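The page does not say how the 0–1 score is computed. A minimal sketch, assuming the score is simply the fraction of questions answered correctly under a normalized exact-match comparison, might look like the following. The `normalize` and `accuracy` helpers, the sample answers, and the model names are all hypothetical and are not taken from xAI's or LLM Stats' tooling.

```python
from statistics import mean


def normalize(answer: str) -> str:
    """Trim whitespace, lowercase, and drop trailing periods so trivial variants match."""
    return answer.strip().lower().rstrip(".")


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one question is required")
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical answers for four questions; not taken from the real dataset.
references = ["Left", "Yes", "2", "B"]
predictions = ["left", "yes.", "3", "B"]
score = accuracy(predictions, references)  # 3 of 4 answers match

# Hypothetical per-model scores, used only to illustrate the summary statistics.
scores = {"model-a": 0.75, "model-b": 0.5, "model-c": 1.0}
leader = max(scores, key=scores.get)  # model with the highest score
average = mean(scores.values())       # leaderboard-wide average
```

Under this assumption, the leaderboard average and the leading score are a plain mean and maximum over per-model accuracies. A real evaluation harness may use a different matching rule or answer parser.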
Compare leaders on the best AI for spatial reasoning and best AI for vision leaderboards.
Current leaders
Qwen3.7-Plus from Alibaba Cloud / Qwen Team currently leads the RealWorldQA leaderboard with a score of 0.869, the highest among the 25 evaluated AI models.