MuirBench
MuirBench is a comprehensive benchmark for robust multi-image understanding in multimodal LLMs. It consists of 12 diverse multi-image tasks involving 10 categories of multi-image relations (e.g., multiview, temporal relations, narrative, complementary). It comprises 11,264 images and 2,600 multiple-choice questions created in a pairwise manner: each standard instance is paired with an unanswerable variant for reliable assessment.
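The pairwise design can be illustrated with a short sketch. This is not the benchmark's official evaluation code: the `Instance` fields, the `pair_id` key, the option labels, and the choice of plain accuracy as the metric are all assumptions made for illustration. It shows how scores might be reported separately for standard (answerable) items and their unanswerable variants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One multiple-choice item (hypothetical schema, not the dataset's own)."""
    pair_id: str      # assumed id shared by a standard item and its unanswerable twin
    answerable: bool  # False for the unanswerable variant
    answer: str       # gold option label; for unanswerable items, the option that says so


def accuracy(instances, predictions):
    """Fraction of instances whose predicted option equals the gold option."""
    if not instances:
        return 0.0
    correct = sum(
        predictions.get((i.pair_id, i.answerable)) == i.answer for i in instances
    )
    return correct / len(instances)


def split_accuracy(instances, predictions):
    """Accuracy overall, on answerable items, and on unanswerable items."""
    answerable = [i for i in instances if i.answerable]
    unanswerable = [i for i in instances if not i.answerable]
    return {
        "overall": accuracy(instances, predictions),
        "answerable": accuracy(answerable, predictions),
        "unanswerable": accuracy(unanswerable, predictions),
    }


# Toy data: two pairs, i.e. four instances (labels are made up).
instances = [
    Instance("q1", True, "A"),
    Instance("q1", False, "D"),
    Instance("q2", True, "B"),
    Instance("q2", False, "D"),
]
# Predictions keyed by (pair_id, answerable); the model misses one unanswerable item.
predictions = {
    ("q1", True): "A",
    ("q1", False): "D",
    ("q2", True): "B",
    ("q2", False): "C",
}
scores = split_accuracy(instances, predictions)
```

Reporting the two subsets separately makes it visible when a model answers standard questions well but fails to recognize that a question cannot be answered.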
MuirBench Leaderboard
11 models
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | Alibaba Cloud / Qwen Team | 33B | — | — | |
| 2 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 / $3.49 | |
| 3 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $1.00 | |
| 4 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 | |
| 5 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 | |
| 6 | Alibaba Cloud / Qwen Team | 33B | — | — | |
| 6 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.49 | |
| 8 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.08 / $0.50 | |
| 9 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60 | |
| 10 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $0.70 | |
| 11 | Alibaba Cloud / Qwen Team | 7B | — | — | |
FAQ
Common questions about MuirBench
The MuirBench paper is available at https://arxiv.org/abs/2406.09411. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The MuirBench leaderboard ranks 11 AI models by their performance on this benchmark. Qwen3 VL 32B Thinking by Alibaba Cloud / Qwen Team currently leads with the highest score, 0.803. The average score across all models is 0.714.
11 models have been evaluated on the MuirBench benchmark. All 11 results are self-reported; none are verified.
MuirBench is categorized under vision, multimodal, and reasoning. The benchmark evaluates multimodal models.