MMMU-Pro

A more robust multi-discipline multimodal understanding benchmark that strengthens MMMU through a three-step process: filtering out questions answerable from text alone, augmenting the candidate options, and introducing a vision-only input setting. Model performance drops substantially (by 16.8-26.9% relative to the original MMMU), providing a more rigorous evaluation that more closely mimics real-world scenarios.
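
The benchmark is distributed in settings that mirror this process (an option-augmented standard set and a vision-only set). Below is a minimal loading sketch using the Hugging Face datasets library; the dataset id MMMU/MMMU_Pro and the config names are assumptions to verify against the dataset card, not guaranteed identifiers.

```python
# Minimal sketch (not an official loader): fetching MMMU-Pro's settings
# with Hugging Face `datasets`. The dataset id "MMMU/MMMU_Pro" and the
# config names below are assumptions -- verify them on the dataset card.
from datasets import load_dataset

# Standard setting: candidate options augmented from 4 to up to 10,
# dropping the random-guess baseline from 25% to 10%.
standard = load_dataset("MMMU/MMMU_Pro", "standard (10 options)", split="test")

# Vision-only setting: question and options are embedded in a screenshot,
# so the model must read and reason over the image alone.
vision = load_dataset("MMMU/MMMU_Pro", "vision", split="test")

print(f"standard: {len(standard)} items, vision: {len(vision)} items")
```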

Paper: https://arxiv.org/abs/2409.02813

Progress Over Time

[Interactive timeline showing model performance evolution on MMMU-Pro, plotting open and proprietary models against the state-of-the-art frontier.]

MMMU-Pro Leaderboard

42 models
Rank  Organization               Params  Context  Cost (input / output per 1M tokens)
1     OpenAI                     —       1.0M     $2.50 / $15.00
1     —                          —       1.0M     $0.50 / $3.00
3     —                          —       —        —
4     —                          —       1.0M     $2.50 / $15.00
5     OpenAI                     —       400K     $1.75 / $14.00
6     Alibaba Cloud / Qwen Team  —       —        —
7     Moonshot AI                1.0T    262K     $0.60 / $2.50
8     OpenAI                     —       400K     $1.25 / $10.00
9     —                          —       1.0M     $5.00 / $25.00
10    Alibaba Cloud / Qwen Team  122B    262K     $0.40 / $3.20
10    Google                     31B     —        —
12    —                          —       1.0M     $0.25 / $1.50
13    —                          —       400K     $0.75 / $4.50
14    OpenAI                     —       200K     $2.00 / $8.00
15    —                          —       200K     $3.00 / $15.00
16    Alibaba Cloud / Qwen Team  35B     262K     $0.25 / $2.00
17    Alibaba Cloud / Qwen Team  27B     —        —
18    —                          25B     —        —
19    Alibaba Cloud / Qwen Team  236B    262K     $0.45 / $3.49
20    Alibaba Cloud / Qwen Team  236B    262K     $0.30 / $1.49
20    Alibaba Cloud / Qwen Team  33B     —        —
22    —                          —       400K     $0.20 / $1.25
23    Alibaba Cloud / Qwen Team  33B     —        —
24    Alibaba Cloud / Qwen Team  31B     262K     $0.20 / $1.00
25    Alibaba Cloud / Qwen Team  9B      262K     $0.18 / $2.09
25    Alibaba Cloud / Qwen Team  31B     262K     $0.20 / $0.70
27    Mistral AI                 119B    256K     $0.15 / $0.60
28    OpenAI                     —       128K     $2.50 / $10.00
29    —                          400B    1.0M     $0.17 / $0.60
30    Alibaba Cloud / Qwen Team  4B      262K     $0.10 / $1.00
31    Alibaba Cloud / Qwen Team  9B      262K     $0.08 / $0.50
32    Alibaba Cloud / Qwen Team  4B      262K     $0.10 / $0.60
33    Google                     8B      —        —
34    Alibaba Cloud / Qwen Team  72B     —        —
35    Alibaba Cloud / Qwen Team  34B     —        —
36    Alibaba Cloud / Qwen Team  73B     —        —
37    —                          90B     128K     $0.35 / $0.40
38    Google                     5B      —        —
39    —                          6B      128K     $0.05 / $0.10
40    Alibaba Cloud / Qwen Team  8B      —        —
41    Alibaba Cloud / Qwen Team  7B      —        —
42    —                          11B     128K     $0.05 / $0.05
(— = not available)
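
The Cost column reads as input / output prices, per the usual per-million-token convention on such leaderboards. A back-of-the-envelope sketch for pricing one full run; the per-question token counts are illustrative guesses, and 1,730 is the question count reported for MMMU-Pro's standard set:

```python
# Back-of-the-envelope sketch: estimating the dollar cost of one benchmark
# run from the per-1M-token input/output rates in the Cost column.
def run_cost(n_questions: int,
             in_tokens_per_q: float, out_tokens_per_q: float,
             price_in_per_m: float, price_out_per_m: float) -> float:
    """Total USD cost for n_questions at the given per-1M-token rates."""
    total_in = n_questions * in_tokens_per_q    # total input tokens
    total_out = n_questions * out_tokens_per_q  # total output tokens
    return total_in / 1e6 * price_in_per_m + total_out / 1e6 * price_out_per_m

# Illustrative numbers: ~1,500 input and ~300 output tokens per question,
# priced at $2.50 / $15.00 per 1M tokens (the rank-1 row above).
print(f"${run_cost(1730, 1500, 300, 2.50, 15.00):.2f}")  # -> $14.27
```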

FAQ

Common questions about MMMU-Pro

What is MMMU-Pro?
MMMU-Pro is a more robust multi-discipline multimodal understanding benchmark that strengthens MMMU by filtering out questions answerable from text alone, augmenting the candidate options, and adding a vision-only input setting. Model performance drops by 16.8-26.9% relative to the original MMMU, making it a more rigorous evaluation that more closely mimics real-world scenarios.

Where can I find the MMMU-Pro paper?
The MMMU-Pro paper is available at https://arxiv.org/abs/2409.02813. It details the benchmark methodology, dataset creation, and evaluation criteria.

How are models ranked on MMMU-Pro?
The MMMU-Pro leaderboard ranks 42 AI models by their performance on the benchmark. Currently, GPT-5.4 by OpenAI leads with a score of 0.812. The average score across all models is 0.643.

What is the highest MMMU-Pro score?
The highest MMMU-Pro score is 0.812, achieved by GPT-5.4 from OpenAI.

How many models have been evaluated?
42 models have been evaluated on MMMU-Pro: 0 with verified results and 42 with self-reported results.

What categories does MMMU-Pro fall under?
MMMU-Pro is categorized under vision, general, multimodal, and reasoning, and it evaluates multimodal models.
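
Since MMMU-Pro items are multiple-choice (up to ten options, A-J), a leaderboard score reduces to option-letter accuracy. A minimal sketch of such a scorer follows; the regex heuristic for extracting a letter from free-form output is an illustrative assumption, not the official answer parser.

```python
import re

def extract_choice(response: str) -> str | None:
    """Return the last standalone option letter (A-J) in a model response."""
    letters = re.findall(r"\b([A-J])\b", response.upper())
    return letters[-1] if letters else None

def accuracy(responses: list[str], answers: list[str]) -> float:
    """Fraction of responses whose extracted letter matches the gold letter."""
    hits = sum(extract_choice(r) == a.upper() for r, a in zip(responses, answers))
    return hits / len(answers)

# Toy usage with made-up responses:
preds = ["The answer is C.", "B", "Option (A) seems right"]
gold = ["C", "B", "D"]
print(accuracy(preds, gold))  # 2 of 3 correct -> 0.666...
```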