MMMU-Pro
A more robust multi-discipline multimodal understanding benchmark that enhances MMMU through a three-step process: filtering out questions answerable from text alone, augmenting the candidate options, and introducing a vision-only input setting. Model performance drops significantly (by 16.8–26.9%) compared to the original MMMU, providing a more rigorous evaluation that more closely mimics real-world scenarios.
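For concreteness, the benchmark's two evaluation settings can be loaded from the Hugging Face Hub. A minimal sketch, assuming the `datasets` library and the public `MMMU/MMMU_Pro` dataset; the config and field names below are assumptions and should be verified against the dataset card:

```python
# Minimal sketch: loading MMMU-Pro's two evaluation settings.
# Config names ("standard (10 options)", "vision") and field names
# are assumptions; check the dataset card before relying on them.
from datasets import load_dataset

# Standard setting: questions with augmented candidate options (up to 10).
standard = load_dataset("MMMU/MMMU_Pro", "standard (10 options)", split="test")

# Vision-only setting: the question is embedded in a screenshot,
# so the model must read it from the image alone.
vision = load_dataset("MMMU/MMMU_Pro", "vision", split="test")

example = standard[0]
print(example["question"])  # assumed field name
print(example["options"])   # assumed field name: augmented option list
```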
Progress Over Time
[Interactive timeline showing model performance evolution on MMMU-Pro, with the state-of-the-art frontier marked and open vs. proprietary models distinguished.]
MMMU-Pro Leaderboard
42 models
| Rank | Model | Organization | Params | Context | Cost (input / output per 1M tokens) |
|---|---|---|---|---|---|
| 1 | — | OpenAI | — | 1.0M | $2.50 / $15.00 |
| 1 | — | Google | — | 1.0M | $0.50 / $3.00 |
| 3 | — | Google | — | — | — |
| 4 | — | Google | — | 1.0M | $2.50 / $15.00 |
| 5 | — | OpenAI | — | 400K | $1.75 / $14.00 |
| 6 | Qwen3.6 Plus | Alibaba Cloud / Qwen Team | — | — | — |
| 7 | — | Moonshot AI | 1.0T | 262K | $0.60 / $2.50 |
| 8 | — | OpenAI | — | 400K | $1.25 / $10.00 |
| 9 | — | Anthropic | — | 1.0M | $5.00 / $25.00 |
| 10 | — | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 |
| 10 | Gemma 4 31B | Google | 31B | — | — |
| 12 | — | Google | — | 1.0M | $0.25 / $1.50 |
| 13 | — | OpenAI | — | 400K | $0.75 / $4.50 |
| 14 | — | OpenAI | — | 200K | $2.00 / $8.00 |
| 15 | — | Anthropic | — | 200K | $3.00 / $15.00 |
| 16 | — | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 |
| 17 | — | Alibaba Cloud / Qwen Team | 27B | — | — |
| 18 | — | Google | 25B | — | — |
| 19 | — | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 / $3.49 |
| 20 | — | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.49 |
| 20 | — | Alibaba Cloud / Qwen Team | 33B | — | — |
| 22 | — | OpenAI | — | 400K | $0.20 / $1.25 |
| 23 | — | Alibaba Cloud / Qwen Team | 33B | — | — |
| 24 | — | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $1.00 |
| 25 | — | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 |
| 25 | — | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $0.70 |
| 27 | — | Mistral AI | 119B | 256K | $0.15 / $0.60 |
| 28 | — | OpenAI | — | 128K | $2.50 / $10.00 |
| 29 | — | Meta | 400B | 1.0M | $0.17 / $0.60 |
| 30 | — | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 |
| 31 | — | Alibaba Cloud / Qwen Team | 9B | 262K | $0.08 / $0.50 |
| 32 | — | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60 |
| 33 | Gemma 4 E4B | Google | 8B | — | — |
| 34 | — | Alibaba Cloud / Qwen Team | 72B | — | — |
| 35 | — | Alibaba Cloud / Qwen Team | 34B | — | — |
| 36 | — | Alibaba Cloud / Qwen Team | 73B | — | — |
| 37 | — | — | 90B | 128K | $0.35 / $0.40 |
| 38 | Gemma 4 E2B | Google | 5B | — | — |
| 39 | — | Microsoft | 6B | 128K | $0.05 / $0.10 |
| 40 | — | Alibaba Cloud / Qwen Team | 8B | — | — |
| 41 | — | Alibaba Cloud / Qwen Team | 7B | — | — |
| 42 | — | — | 11B | 128K | $0.05 / $0.05 |
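Reading the Cost column: pricing on leaderboards like this is typically quoted as input / output cost per million tokens (an assumption here), so the cost of an evaluation run is a simple linear function of token counts. A minimal sketch with illustrative numbers:

```python
# Sketch: estimating API cost from the leaderboard's Cost column,
# assuming prices are input / output dollars per 1M tokens.
def run_cost(input_tokens: int, output_tokens: int,
             in_price_per_m: float, out_price_per_m: float) -> float:
    """Return the total cost in dollars for one run."""
    return (input_tokens / 1_000_000) * in_price_per_m \
         + (output_tokens / 1_000_000) * out_price_per_m

# Example with the table's "$2.50 / $15.00" entry: a hypothetical run
# consuming 3M input tokens and producing 500K output tokens.
print(f"${run_cost(3_000_000, 500_000, 2.50, 15.00):.2f}")  # $15.00
```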
FAQ
Common questions about MMMU-Pro
What is MMMU-Pro?
MMMU-Pro is a more robust multi-discipline multimodal understanding benchmark that enhances MMMU through a three-step process: filtering out questions answerable from text alone, augmenting the candidate options, and introducing a vision-only input setting. Model performance drops significantly (by 16.8–26.9%) compared to the original MMMU, providing a more rigorous evaluation that more closely mimics real-world scenarios.
Where can I find the MMMU-Pro paper?
The MMMU-Pro paper is available at https://arxiv.org/abs/2409.02813. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
How are models ranked on MMMU-Pro?
The MMMU-Pro leaderboard ranks 42 AI models by their performance on this benchmark. Currently, GPT-5.4 by OpenAI leads with a score of 0.812. The average score across all models is 0.643.
What is the highest MMMU-Pro score?
The highest MMMU-Pro score is 0.812, achieved by GPT-5.4 from OpenAI.
How many models have been evaluated on MMMU-Pro?
42 models have been evaluated on the MMMU-Pro benchmark, with 0 verified results and 42 self-reported results.
What categories does MMMU-Pro cover?
MMMU-Pro is categorized under vision, general, multimodal, and reasoning. The benchmark evaluates multimodal models.