MMMU-Pro
MMMU-Pro Leaderboard
| Rank | Organization | Parameters | Context | Cost | License | |
|---|---|---|---|---|---|---|
| 1 | Google | — | 1.0M | $1.50 / $9.00 | ||
| 2 | OpenAI | — | 1.1M | $5.00 / $30.00 | ||
| 3 | ByteDance | — | — | — | ||
| 4 | ByteDance | — | — | — | ||
| 5 | OpenAI | — | 1.0M | $2.50 / $15.00 | ||
| 5 | Google | — | 1.0M | $0.50 / $3.00 | ||
| 7 | Google | — | — | — | ||
| 8 | Google | — | 1.0M | $2.50 / $15.00 | ||
| 9 | Meta | — | — | — | ||
| 10 | Moonshot AI | 1.0T | 262K | $0.75 / $3.50 | ||
| 11 | OpenAI | — | 400K | $1.75 / $14.00 | ||
| 12 | Alibaba Cloud / Qwen Team | — | 1.0M | $0.32 / $1.28 | ||
| 13 | Alibaba Cloud / Qwen Team | — | 1.0M | $0.50 / $3.00 | ||
| 14 | Moonshot AI | 1.0T | — | — | ||
| 15 | OpenAI | — | — | — | ||
| 16 | MiniMax | — | 1.0M | $0.30 / $1.20 | ||
| 17 | Xiaomi | 311B | 1.0M | $0.17 / $0.34 | ||
| 18 | Anthropic | — | 1.0M | $5.00 / $25.00 | ||
| 19 | Google | 31B | 262K | $0.13 / $0.38 | ||
| 19 | Alibaba Cloud / Qwen Team | 122B | — | — | ||
| 21 | Google | — | 1.0M | $0.25 / $1.50 | ||
| 22 | OpenAI | — | 400K | $0.75 / $4.50 | ||
| 23 | OpenAI | — | — | — | ||
| 24 | OpenAI | — | 400K | $5.00 / $30.00 | ||
| 25 | Alibaba Cloud / Qwen Team | 28B | 262K | $0.60 / $3.60 | ||
| 26 | Anthropic | — | 200K | $3.00 / $15.00 | ||
| 27 | Alibaba Cloud / Qwen Team | 35B | — | — | ||
| 28 | Alibaba Cloud / Qwen Team | 35B | — | — | ||
| 29 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 | ||
| 30 | Google | 25B | 262K | $0.13 / $0.40 | ||
| 31 | Alibaba Cloud / Qwen Team | 236B | — | — | ||
| 32 | Google | 12B | — | — | ||
| 33 | Alibaba Cloud / Qwen Team | 33B | — | — | ||
| 33 | Alibaba Cloud / Qwen Team | 236B | — | — | ||
| 35 | OpenAI | — | 400K | $0.20 / $1.25 | ||
| 36 | Alibaba Cloud / Qwen Team | 33B | — | — | ||
| 37 | Amazon | — | — | — | ||
| 38 | Cohere | 218B | — | — | ||
| 38 | Alibaba Cloud / Qwen Team | 31B | — | — | ||
| 40 | Amazon | — | 1.0M | $0.30 / $2.50 | ||
| 41 | Amazon | — | — | — | ||
| 42 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 | ||
| 42 | Alibaba Cloud / Qwen Team | 31B | — | — | ||
| 44 | Mistral AI | 119B | 256K | $0.15 / $0.60 | ||
| 45 | OpenAI | — | 128K | $2.50 / $10.00 | ||
| 46 | Meta | 400B | — | — | ||
| 47 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 | ||
| 48 | Alibaba Cloud / Qwen Team | 9B | — | — | ||
| 49 | Google | 25B | — | — | ||
| 50 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60 |
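The Rank column repeats a rank for tied entries and then skips the next one (for example 5, 5, 7 and 19, 19, 21), which matches standard competition ranking. The sketch below shows one way such ranks could be derived from 0–1 scores. It is an illustration only: the function name `competition_ranks`, the model names, and the scores are hypothetical, and exact score equality is assumed to define a tie, which may not be how the leaderboard itself decides ties.

```python
from typing import List, Tuple


def competition_ranks(scores: List[Tuple[str, float]]) -> List[Tuple[int, str, float]]:
    """Assign "1, 2, 2, 4"-style ranks: ties share a rank, the next rank is skipped."""
    # Sort by score, highest first; sorted() is stable, so ties keep input order.
    ordered = sorted(scores, key=lambda item: item[1], reverse=True)
    ranked: List[Tuple[int, str, float]] = []
    for position, (name, score) in enumerate(ordered, start=1):
        if ranked and score == ranked[-1][2]:
            rank = ranked[-1][0]  # same score as the previous row: reuse its rank
        else:
            rank = position  # otherwise the rank is the 1-based position
        ranked.append((rank, name, score))
    return ranked


# Hypothetical model names and scores, for illustration only.
example = [("model-a", 0.81), ("model-b", 0.79), ("model-c", 0.79), ("model-d", 0.75)]
for rank, name, score in competition_ranks(example):
    print(rank, name, score)
```

With the hypothetical input above, the two entries at 0.79 share rank 2 and the next entry receives rank 4.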
What is MMMU-Pro?
MMMU-Pro is a more robust multi-discipline multimodal understanding benchmark that builds on MMMU through a three-step process: filtering out questions that text-only models can answer, augmenting the candidate options, and introducing a vision-only input setting in which the question is embedded in the image. Model performance is substantially lower on MMMU-Pro than on the original MMMU, with drops ranging from 16.8% to 26.9% across models, so it provides a more rigorous evaluation that more closely mimics real-world scenarios.
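The first two steps can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: `is_text_only_answerable`, `chance_accuracy`, both thresholds, and the model names in the example are assumptions made for this sketch. It flags a question as answerable without the image when enough text-only models get it right often enough, and it shows why adding candidate options lowers the accuracy of random guessing.

```python
from typing import Dict, List


def is_text_only_answerable(
    correct_by_model: Dict[str, List[bool]],
    min_correct_fraction: float = 0.5,  # assumed per-model threshold
    min_models: int = 3,                # assumed number of models that must pass
) -> bool:
    """Return True if enough text-only models answer correctly often enough."""
    models_passing = 0
    for attempts in correct_by_model.values():
        # A model "passes" if its fraction of correct attempts exceeds the threshold.
        if attempts and sum(attempts) / len(attempts) > min_correct_fraction:
            models_passing += 1
    return models_passing >= min_models


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on one multiple-choice question."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return 1.0 / num_options


# Hypothetical results: each list holds one boolean per attempt without the image.
results = {
    "text-model-1": [True, True, True, False],
    "text-model-2": [True, True, False, True],
    "text-model-3": [True, True, True, True],
    "text-model-4": [False, False, True, False],
}
print(is_text_only_answerable(results))  # three of four models pass, so it is filtered
print(chance_accuracy(4), chance_accuracy(8))
```

A question flagged this way would be dropped from the benchmark, since it does not require the image. The third step, the vision-only setting, changes the input format rather than the question pool, so it is not covered by this sketch.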
MMMU-Pro is a benchmark that evaluates models on multimodal, reasoning, general, and vision tasks. LLM Stats tracks 60 models on this benchmark, scored on a 0–1 scale. The current average is about 0.7, and the leader scores 0.836.
Current leaders
Gemini 3.5 Flash from Google currently leads the MMMU-Pro leaderboard among 60 evaluated AI models, with a score of 0.836.
Source paper
- Title: MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
- Authors: Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, and 9 others
- Published: arXiv:2409.02813
Abstract
This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images. This setting challenges AI to truly "see" and "read" simultaneously, testing a fundamental human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, ranging from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool, closely mimicking real-world scenarios and offering valuable directions for future research in multimodal AI.
FAQ
Common questions about the MMMU-Pro benchmark and leaderboard.