RealWorldQA

Name: RealWorldQA Leaderboard — AI Model Scores
Creator: LLM Stats
License: https://llm-stats.com/legal/terms-of-service

RealWorldQA Leaderboard

25 models

Rank	Model / Organization	Parameters	Context	Cost ($ per 1M tokens, input / output)
1	Qwen3.7-Plus Alibaba Cloud / Qwen Team	—	1.0M	$0.32 / $1.28
2	Seed 2.1 Pro ByteDance	—	—	—
3	Seed 2.1 Turbo ByteDance	—	—	—
4	Qwen3.6 Plus Alibaba Cloud / Qwen Team	—	1.0M	$0.50 / $3.00
5	Qwen3.6-35B-A3B Alibaba Cloud / Qwen Team	35B	—	—
6	Qwen3.5-122B-A10B Alibaba Cloud / Qwen Team	122B	—	—
7	Qwen3.5-35B-A3B Alibaba Cloud / Qwen Team	35B	—	—
7	Qwen3.6-27B Alibaba Cloud / Qwen Team	28B	262K	$0.60 / $3.60
9	Qwen3.5-27B Alibaba Cloud / Qwen Team	27B	262K	$0.30 / $2.40
10	Qwen3 VL 235B A22B Thinking Alibaba Cloud / Qwen Team	236B	—	—
11	Qwen3 VL 235B A22B Instruct Alibaba Cloud / Qwen Team	236B	—	—
12	Qwen3 VL 32B Instruct Alibaba Cloud / Qwen Team	33B	—	—
13	Qwen3 VL 32B Thinking Alibaba Cloud / Qwen Team	33B	—	—
14	Qwen2-VL-72B-Instruct Alibaba Cloud / Qwen Team	73B	—	—
15	Qwen3 VL 30B A3B Thinking Alibaba Cloud / Qwen Team	31B	—	—
16	Qwen3 VL 30B A3B Instruct Alibaba Cloud / Qwen Team	31B	—	—
17	Qwen3 VL 8B Thinking Alibaba Cloud / Qwen Team	9B	262K	$0.18 / $2.09
18	Qwen3 VL 4B Thinking Alibaba Cloud / Qwen Team	4B	262K	$0.10 / $1.00
19	Qwen3 VL 8B Instruct Alibaba Cloud / Qwen Team	9B	—	—
20	Qwen3 VL 4B Instruct Alibaba Cloud / Qwen Team	4B	262K	$0.10 / $0.60
21	Qwen2.5-Omni-7B Alibaba Cloud / Qwen Team	7B	—	—
22	Grok-1.5V xAI	—	—	—
23	DeepSeek VL2 DeepSeek	27B	—	—
24	DeepSeek VL2 Small DeepSeek	16B	—	—
25	DeepSeek VL2 Tiny DeepSeek	3B	—	—

About this benchmark

What is RealWorldQA?

RealWorldQA is a benchmark designed to evaluate the basic real-world spatial understanding of multimodal models. The initial release consists of over 700 anonymized images taken from vehicles and other real-world scenarios, each accompanied by a question and an easily verifiable answer. xAI released it as part of its Grok-1.5 Vision preview to test models' ability to understand natural scenes and spatial relationships in everyday visual contexts.

LLM Stats categorizes RealWorldQA under spatial reasoning and vision and tracks 25 models on it, scored on a 0–1 scale. The current average is 0.776, with the leader at 0.869.

Current leaders

Qwen3.7-Plus from Alibaba Cloud / Qwen Team currently leads the RealWorldQA leaderboard with a score of 0.869 across 25 evaluated AI models.

Qwen3.7-Plus (Alibaba Cloud / Qwen Team): 86.9%

Seed 2.1 Pro (ByteDance): 86.7%

Seed 2.1 Turbo (ByteDance): 86.3%

Top open-weight model: Qwen3.6-35B-A3B (rank #5): 85.3%

FAQ

Common questions about the RealWorldQA benchmark and leaderboard.

What is the RealWorldQA leaderboard?

The RealWorldQA leaderboard ranks 25 AI models based on their performance on this benchmark. Currently, Qwen3.7-Plus by Alibaba Cloud / Qwen Team leads with a score of 0.869. The average score across all models is 0.776.

What is the highest RealWorldQA score?

The highest RealWorldQA score is 0.869, achieved by Qwen3.7-Plus from Alibaba Cloud / Qwen Team.

How many models are evaluated on RealWorldQA?

LLM Stats lists 25 models evaluated on the RealWorldQA benchmark. All 25 results are self-reported; none are verified.

What categories does RealWorldQA cover?

RealWorldQA is categorized under spatial reasoning and vision. The benchmark evaluates multimodal models.

What is the best open-source model on RealWorldQA?

Qwen3.6-35B-A3B by Alibaba Cloud / Qwen Team is the top-ranked open-source model on RealWorldQA, with a score of 0.853 (rank #5).

Which model offers the best value on RealWorldQA?

Among models scoring within 10% of the leader, Qwen3.5-27B from Alibaba Cloud / Qwen Team is the cheapest, at $0.30 per million input tokens with a score of 0.837.
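
The selection rule described above can be sketched in a few lines of Python. This is an illustrative reading of the rule, not LLM Stats' actual method: it assumes "within 10% of the leader" means a score of at least 90% of the top score, and it skips models with no listed price. The Entry class and best_value function are hypothetical names, and the sample data includes only the models whose scores appear on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    model: str
    score: float                  # RealWorldQA score on the 0-1 scale
    input_price: Optional[float]  # USD per million input tokens; None if unlisted


def best_value(entries, tolerance=0.10):
    """Return the cheapest priced entry scoring within `tolerance` of the leader."""
    leader_score = max(e.score for e in entries)
    # Assumption: "within 10%" means score >= 90% of the leader's score.
    threshold = leader_score * (1.0 - tolerance)
    candidates = [
        e for e in entries
        if e.score >= threshold and e.input_price is not None
    ]
    if not candidates:
        return None
    # Cheapest first; break price ties in favour of the higher score.
    return min(candidates, key=lambda e: (e.input_price, -e.score))


# Scores and input prices as listed on this page (None = no price listed).
entries = [
    Entry("Qwen3.7-Plus", 0.869, 0.32),
    Entry("Seed 2.1 Pro", 0.867, None),
    Entry("Seed 2.1 Turbo", 0.863, None),
    Entry("Qwen3.6-35B-A3B", 0.853, None),
    Entry("Qwen3.5-27B", 0.837, 0.30),
]

winner = best_value(entries)
print(winner.model)
```

With this data the sketch selects Qwen3.5-27B, which matches the answer above. A different reading of "within 10%" (for example, within 0.10 absolute score points) would change the threshold and possibly the result once more models are included.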

How recent are the RealWorldQA leaderboard results?

The RealWorldQA leaderboard was last updated in July 2026 and currently includes 25 evaluated models.