ERQA
ERQA (Embodied Reasoning Question Answering) is a benchmark of 400 multiple-choice visual questions spanning spatial reasoning, trajectory reasoning, action reasoning, state estimation, and multi-view reasoning, designed to evaluate AI capabilities in physical-world interaction.
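Leaderboard scores on a multiple-choice benchmark like this are typically exact-match accuracies over answer choices. A minimal sketch of that scoring, assuming answers are single letters compared case-insensitively (a standard convention, not confirmed by this page):

```python
def erqa_accuracy(predictions, answers):
    """Exact-match accuracy over multiple-choice letter answers."""
    assert len(predictions) == len(answers), "one prediction per question"
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)

# Hypothetical example: 2 of 3 questions answered correctly.
print(erqa_accuracy(["A", "c", "B"], ["A", "C", "D"]))  # 0.666...
```

A score of 0.657 on ERQA thus corresponds to roughly 263 of the 400 questions answered correctly.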
Progress Over Time: interactive timeline showing model performance evolution on ERQA, with a state-of-the-art frontier and filters for open vs. proprietary models. [chart omitted]
ERQA Leaderboard (18 models)
| Rank | Organization | Params | Context | Cost (input / output, per 1M tokens) |
|---|---|---|---|---|
| 1 | OpenAI | — | — | — |
| 1 | Alibaba Cloud / Qwen Team | — | — | — |
| 3 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 |
| 4 | Meta | — | — | — |
| 5 | OpenAI | — | 200K | $2.00 / $8.00 |
| 6 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 |
| 7 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 |
| 8 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 / $3.49 |
| 9 | Alibaba Cloud / Qwen Team | 33B | — | — |
| 10 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.50 |
| 11 | Alibaba Cloud / Qwen Team | 33B | — | — |
| 12 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 |
| 13 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 |
| 14 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.08 / $0.50 |
| 15 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $1.00 |
| 16 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $0.70 |
| 17 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60 |
| 18 | OpenAI | — | 128K | $2.50 / $10.00 |
FAQ
Common questions about ERQA
What is ERQA?
ERQA (Embodied Reasoning Question Answering) is a benchmark of 400 multiple-choice visual questions spanning spatial reasoning, trajectory reasoning, action reasoning, state estimation, and multi-view reasoning, designed to evaluate AI capabilities in physical-world interaction.

Where can I find the ERQA paper?
The ERQA paper is available at https://arxiv.org/abs/2503.20020. It details the benchmark methodology, dataset creation, and evaluation criteria.

Where can I find the ERQA dataset?
The ERQA dataset is available at https://github.com/embodiedreasoning/ERQA.

Which model leads the ERQA leaderboard?
The leaderboard ranks 18 AI models by their ERQA score. GPT-5 by OpenAI currently leads with a score of 0.657; the average score across all models is 0.532.

What is the highest ERQA score?
The highest ERQA score is 0.657, achieved by GPT-5 from OpenAI.

How many models have been evaluated on ERQA?
18 models have been evaluated, with 0 verified results and 18 self-reported results.

What categories does ERQA belong to?
ERQA is categorized under reasoning, spatial reasoning, and vision, and evaluates multimodal models.