ERQA Leaderboard

Progress Over Time

Interactive timeline showing how model performance on ERQA has evolved. Legend: state-of-the-art frontier, open models, proprietary models.

ERQA Leaderboard

18 models

Rank	Model	Organization	Parameters	Context	Cost
1	GPT-5	OpenAI	—	—	—
1	Qwen3.6 Plus	Alibaba Cloud / Qwen Team	—	—	—
3	Qwen3.5-35B-A3B	Alibaba Cloud / Qwen Team	35B	262K	$0.25 / $2.00
4	Muse Spark	Meta	—	—	—
5	o3	OpenAI	—	200K	$2.00 / $8.00
6	Qwen3.5-122B-A10B	Alibaba Cloud / Qwen Team	122B	262K	$0.40 / $3.20
7	Qwen3.5-27B	Alibaba Cloud / Qwen Team	27B	262K	$0.30 / $2.40
8	Qwen3 VL 235B A22B Thinking	Alibaba Cloud / Qwen Team	236B	262K	$0.45 / $3.49
9	Qwen3 VL 32B Thinking	Alibaba Cloud / Qwen Team	33B	—	—
10	Qwen3 VL 235B A22B Instruct	Alibaba Cloud / Qwen Team	236B	262K	$0.30 / $1.50
11	Qwen3 VL 32B Instruct	Alibaba Cloud / Qwen Team	33B	—	—
12	Qwen3 VL 4B Thinking	Alibaba Cloud / Qwen Team	4B	262K	$0.10 / $1.00
13	Qwen3 VL 8B Thinking	Alibaba Cloud / Qwen Team	9B	262K	$0.18 / $2.09
14	Qwen3 VL 8B Instruct	Alibaba Cloud / Qwen Team	9B	262K	$0.08 / $0.50
15	Qwen3 VL 30B A3B Thinking	Alibaba Cloud / Qwen Team	31B	262K	$0.20 / $1.00
16	Qwen3 VL 30B A3B Instruct	Alibaba Cloud / Qwen Team	31B	262K	$0.20 / $0.70
17	Qwen3 VL 4B Instruct	Alibaba Cloud / Qwen Team	4B	262K	$0.10 / $0.60
18	GPT-4o	OpenAI	—	128K	$2.50 / $10.00
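
For readers who want to work with the table programmatically, the following is a minimal sketch of how the Context and Cost cells could be parsed. The function names are hypothetical, and two things are assumptions rather than facts stated on this page: that "K" can be treated as a factor of 1,000 (the listed values are probably rounded), and that the two prices in a Cost cell form an ordered pair. The page does not say what the two prices mean, so the code returns them without labels.

```python
from typing import Optional, Tuple

# The table marks unreported values with this dash character.
MISSING = "—"


def parse_context(cell: str) -> Optional[int]:
    """Convert a Context cell such as '262K' into an approximate token count.

    Assumption: 'K' means a factor of 1,000. Returns None for missing values.
    """
    cell = cell.strip()
    if cell == MISSING:
        return None
    if cell.upper().endswith("K"):
        return int(float(cell[:-1]) * 1_000)
    return int(cell)


def parse_cost(cell: str) -> Optional[Tuple[float, float]]:
    """Split a Cost cell such as '$0.25 / $2.00' into its two dollar amounts.

    Returns None for missing values. The meaning of each amount is not
    stated in the table, so the pair is returned unlabeled.
    """
    cell = cell.strip()
    if cell == MISSING:
        return None
    first, second = (part.strip().lstrip("$") for part in cell.split("/"))
    return float(first), float(second)


if __name__ == "__main__":
    print(parse_context("262K"))
    print(parse_cost("$0.25 / $2.00"))
    print(parse_cost(MISSING))
```

Returning `None` for the dash keeps missing entries distinguishable from a real value of zero when rows are filtered or sorted.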

FAQ

Common questions about ERQA

What is the ERQA benchmark?

ERQA (Embodied Reasoning Question Answering) is a benchmark of 400 multiple-choice visual questions covering spatial reasoning, trajectory reasoning, action reasoning, state estimation, and multi-view reasoning. It evaluates AI capabilities in physical-world interactions.

Where can I find the ERQA paper?

The ERQA paper is available at https://arxiv.org/abs/2503.20020. It describes the benchmark methodology, dataset creation, and evaluation criteria.

Where can I find the ERQA dataset?

The ERQA dataset is available at https://github.com/embodiedreasoning/ERQA.

What is the ERQA leaderboard?

The ERQA leaderboard ranks 18 AI models by their performance on this benchmark. GPT-5 by OpenAI currently leads with a score of 0.657. The average score across all models is 0.532.

What is the highest ERQA score?

The highest ERQA score is 0.657, achieved by GPT-5 from OpenAI.

How many models are evaluated on ERQA?

18 models have been evaluated on the ERQA benchmark. All 18 results are self-reported; none have been verified.

What categories does ERQA cover?

ERQA is categorized under reasoning, spatial reasoning, and vision. It evaluates multimodal models.
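
The FAQ describes ERQA as 400 multiple-choice questions and reports scores between 0 and 1. As an illustration only, the sketch below assumes the score is plain accuracy, meaning the fraction of questions answered with the correct choice. This page does not state the scoring rule, so treat that as an assumption; the function name and the toy answer letters are hypothetical and not taken from the dataset.

```python
from typing import Sequence


def multiple_choice_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the fraction of predictions that match the reference answers.

    Assumption: each item has exactly one correct choice label (for example
    'A' to 'D'), and labels are compared case-insensitively.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


if __name__ == "__main__":
    # Toy example with four made-up items; three predictions are correct.
    preds = ["A", "b", "C", "D"]
    gold = ["A", "B", "C", "A"]
    print(multiple_choice_accuracy(preds, gold))  # 0.75
```

Under this assumption, a leaderboard score is the same quantity computed over all 400 benchmark questions.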