ERQA

ERQA (Embodied Reasoning Question Answering) is a benchmark of 400 multiple-choice visual questions spanning spatial reasoning, trajectory reasoning, action reasoning, state estimation, and multi-view reasoning, designed to evaluate AI capabilities in physical-world interaction.


Progress Over Time

[Interactive timeline showing model performance evolution on ERQA, distinguishing the state-of-the-art frontier and open vs. proprietary models.]

ERQA Leaderboard

18 models

Rank  Organization                Params  Context  Cost (input / output)
1     OpenAI (GPT-5)              -       -        -
1     Alibaba Cloud / Qwen Team   -       -        -
3     Alibaba Cloud / Qwen Team   35B     262K     $0.25 / $2.00
4     -                           -       -        -
5     OpenAI                      -       200K     $2.00 / $8.00
6     Alibaba Cloud / Qwen Team   122B    262K     $0.40 / $3.20
7     Alibaba Cloud / Qwen Team   27B     262K     $0.30 / $2.40
8     Alibaba Cloud / Qwen Team   236B    262K     $0.45 / $3.49
9     Alibaba Cloud / Qwen Team   33B     -        -
10    Alibaba Cloud / Qwen Team   236B    262K     $0.30 / $1.50
11    Alibaba Cloud / Qwen Team   33B     -        -
12    Alibaba Cloud / Qwen Team   4B      262K     $0.10 / $1.00
13    Alibaba Cloud / Qwen Team   9B      262K     $0.18 / $2.09
14    Alibaba Cloud / Qwen Team   9B      262K     $0.08 / $0.50
15    Alibaba Cloud / Qwen Team   31B     262K     $0.20 / $1.00
16    Alibaba Cloud / Qwen Team   31B     262K     $0.20 / $0.70
17    Alibaba Cloud / Qwen Team   4B      262K     $0.10 / $0.60
18    OpenAI                      -       128K     $2.50 / $10.00

FAQ

Common questions about ERQA

Q: What is ERQA?
A: ERQA (Embodied Reasoning Question Answering) is a benchmark of 400 multiple-choice visual questions spanning spatial reasoning, trajectory reasoning, action reasoning, state estimation, and multi-view reasoning, designed to evaluate AI capabilities in physical-world interaction.

Q: Where is the ERQA paper available?
A: The ERQA paper is available at https://arxiv.org/abs/2503.20020. It details the benchmark methodology, dataset creation, and evaluation criteria.

Q: Where is the ERQA dataset available?
A: The ERQA dataset is available at https://github.com/embodiedreasoning/ERQA.
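Since ERQA items are multiple-choice, evaluating a model reduces to exact-match accuracy over the 400 questions. Below is a minimal sketch of that scoring loop, assuming a hypothetical item schema (`question`, `choices`, `answer` fields) and a stand-in `predict` function; the official evaluation harness in the GitHub repository may structure items differently.

```python
# Hypothetical sketch of scoring a model on ERQA-style multiple-choice items.
# The item fields and predict() stub are illustrative assumptions, not the
# official ERQA harness.

def score(items, predict):
    """Return accuracy: the fraction of items where the predicted
    answer letter matches the ground-truth letter."""
    correct = sum(1 for item in items if predict(item) == item["answer"])
    return correct / len(items)

# Toy items with the shape one might expect: a question, lettered
# choices, and a ground-truth answer letter.
items = [
    {"question": "Which object is left of the mug?",
     "choices": ["A", "B", "C", "D"], "answer": "B"},
    {"question": "Which grasp completes the pick-up?",
     "choices": ["A", "B", "C", "D"], "answer": "A"},
]

# A trivial baseline that always answers "A"; over a large 4-choice
# set it approximates 25% accuracy.
always_a = lambda item: "A"
print(score(items, always_a))  # 0.5 on this 2-item toy set
```

Leaderboard scores such as 0.657 are exactly this ratio: correct answers divided by the 400 total questions.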
Q: Which model leads the ERQA leaderboard?
A: The leaderboard ranks 18 AI models by their performance on this benchmark. GPT-5 by OpenAI currently leads with a score of 0.657; the average score across all models is 0.532.

Q: What is the highest ERQA score?
A: 0.657, achieved by GPT-5 from OpenAI.

Q: How many models have been evaluated on ERQA?
A: 18 models, with 0 verified results and 18 self-reported results.

Q: What categories does ERQA belong to?
A: ERQA is categorized under reasoning, spatial reasoning, and vision, and evaluates multimodal models.