MMMU-Pro

Progress Over Time

[Interactive timeline: model performance on MMMU-Pro over time, with the state-of-the-art frontier shown for open and proprietary models]

MMMU-Pro Leaderboard

60 models
| Rank | Organization | Parameters | Context | Cost per 1M tokens (input / output) |
|---|---|---|---|---|
| 1 | | | 1.0M | $1.50 / $9.00 |
| 2 | OpenAI | | 1.1M | $5.00 / $30.00 |
| 3 | ByteDance | | | |
| 4 | | | | |
| 5 | OpenAI | | 1.0M | $2.50 / $15.00 |
| 5 | | | 1.0M | $0.50 / $3.00 |
| 7 | | | | |
| 8 | | | 1.0M | $2.50 / $15.00 |
| 9 | | | | |
| 10 | Moonshot AI | 1.0T | 262K | $0.75 / $3.50 |
| 11 | OpenAI | | 400K | $1.75 / $14.00 |
| 12 | Alibaba Cloud / Qwen Team | | 1.0M | $0.32 / $1.28 |
| 13 | Alibaba Cloud / Qwen Team | | 1.0M | $0.50 / $3.00 |
| 14 | Moonshot AI | 1.0T | | |
| 15 | OpenAI | | | |
| 16 | MiniMax | | 1.0M | $0.30 / $1.20 |
| 17 | Xiaomi | 311B | 1.0M | $0.17 / $0.34 |
| 18 | | | 1.0M | $5.00 / $25.00 |
| 19 | | 31B | 262K | $0.13 / $0.38 |
| 19 | Alibaba Cloud / Qwen Team | 122B | | |
| 21 | | | 1.0M | $0.25 / $1.50 |
| 22 | | | 400K | $0.75 / $4.50 |
| 23 | OpenAI | | | |
| 24 | | | 400K | $5.00 / $30.00 |
| 25 | Alibaba Cloud / Qwen Team | 28B | 262K | $0.60 / $3.60 |
| 26 | | | 200K | $3.00 / $15.00 |
| 27 | Alibaba Cloud / Qwen Team | 35B | | |
| 28 | Alibaba Cloud / Qwen Team | 35B | | |
| 29 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 |
| 30 | | 25B | 262K | $0.13 / $0.40 |
| 31 | Alibaba Cloud / Qwen Team | 236B | | |
| 32 | | 12B | | |
| 33 | Alibaba Cloud / Qwen Team | 33B | | |
| 33 | Alibaba Cloud / Qwen Team | 236B | | |
| 35 | | | 400K | $0.20 / $1.25 |
| 36 | Alibaba Cloud / Qwen Team | 33B | | |
| 37 | | | | |
| 38 | | 218B | | |
| 38 | Alibaba Cloud / Qwen Team | 31B | | |
| 40 | | | 1.0M | $0.30 / $2.50 |
| 41 | | | | |
| 42 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 |
| 42 | Alibaba Cloud / Qwen Team | 31B | | |
| 44 | Mistral AI | 119B | 256K | $0.15 / $0.60 |
| 45 | OpenAI | | 128K | $2.50 / $10.00 |
| 46 | | 400B | | |
| 47 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 |
| 48 | Alibaba Cloud / Qwen Team | 9B | | |
| 49 | | 25B | | |
| 50 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60 |

Showing 1–50 of 60
About this benchmark

What is MMMU-Pro?

MMMU-Pro is a more robust version of MMMU, the multi-discipline multimodal understanding benchmark. It is built from MMMU through a three-step process: filtering out questions that text-only models can answer, augmenting the candidate options, and introducing a vision-only input setting in which the question is embedded in the image. Models score substantially lower on MMMU-Pro than on the original MMMU, with drops ranging from 16.8% to 26.9% across models, which makes it a more rigorous evaluation that more closely mimics real-world scenarios.
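The three construction steps can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Question` fields, the `max_text_only_correct` threshold, the distractor list, and the shape of the vision-only record are all hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Question:
    qid: str
    text: str
    options: tuple[str, ...]
    answer: str
    # Hypothetical field: how many text-only models (no image access)
    # answered this question correctly.
    text_only_correct: int = 0


def filter_text_answerable(questions, max_text_only_correct):
    """Step 1: drop questions that text-only models can already solve."""
    return [q for q in questions if q.text_only_correct <= max_text_only_correct]


def augment_options(question, distractors):
    """Step 2: add extra candidate options so guessing is less effective."""
    new = tuple(d for d in distractors if d not in question.options)
    return replace(question, options=question.options + new)


def to_vision_only(question):
    """Step 3: build a record whose question text lives in the image only.

    Rendering the text into an actual image is out of scope here; the
    record just carries the text that would be rendered.
    """
    rendered = question.text + "\n" + "\n".join(question.options)
    return {"qid": question.qid, "text_prompt": "", "text_to_render": rendered}


questions = [
    Question("q1", "What is 2 + 2?", ("3", "4"), "4", text_only_correct=4),
    Question("q2", "Which labeled region is the aorta?", ("A", "B", "C", "D"), "B"),
]

kept = filter_text_answerable(questions, max_text_only_correct=1)
augmented = augment_options(kept[0], ["E", "F", "B"])  # "B" is already an option
vision_item = to_vision_only(augmented)
chance_level = 1 / len(augmented.options)  # random-guess accuracy after step 2
```

In this toy run only `q2` survives the filter, and adding options lowers the random-guess baseline, which is one reason scores on the harder benchmark are lower.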

MMMU-Pro is a multimodal benchmark evaluating models on multimodal, reasoning, general, and vision tasks. LLM Stats tracks 60 models on this benchmark, scored on a 0–1 scale. The current average is 0.671, with the leader at 0.836.


Current leaders

Gemini 3.5 Flash from Google currently leads the MMMU-Pro leaderboard with a score of 0.836 across 60 evaluated AI models.

1. Gemini 3.5 Flash (Google): 83.6%
2. GPT-5.5 (OpenAI): 83.2%
3. Seed 2.1 Pro (ByteDance): 82.7%
OSS: Kimi K2.6 (#10, open-weight): 80.1%

Source paper

Title: MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
Authors: Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, and 9 others
Abstract

This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images. This setting challenges AI to truly "see" and "read" simultaneously, testing a fundamental human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, ranging from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool, closely mimicking real-world scenarios and offering valuable directions for future research in multimodal AI.

FAQ

Common questions about the MMMU-Pro benchmark and leaderboard.

What is the MMMU-Pro benchmark?

MMMU-Pro is a more robust version of MMMU, the multi-discipline multimodal understanding benchmark. It is built from MMMU through a three-step process: filtering out questions that text-only models can answer, augmenting the candidate options, and introducing a vision-only input setting in which the question is embedded in the image. Models score substantially lower on MMMU-Pro than on the original MMMU, with drops ranging from 16.8% to 26.9% across models, which makes it a more rigorous evaluation that more closely mimics real-world scenarios.

What is the MMMU-Pro leaderboard?

The MMMU-Pro leaderboard ranks 60 AI models based on their performance on this benchmark. Currently, Gemini 3.5 Flash by Google leads with a score of 0.836. The average score across all models is 0.671.

What is the highest MMMU-Pro score?

The highest MMMU-Pro score is 0.836, achieved by Gemini 3.5 Flash from Google.

How many models are evaluated on MMMU-Pro?

60 models have been evaluated on the MMMU-Pro benchmark, with 0 verified results and 60 self-reported results.

Where can I find the MMMU-Pro paper?

The MMMU-Pro paper is available at https://arxiv.org/abs/2409.02813. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does MMMU-Pro cover?

MMMU-Pro is categorized under multimodal, reasoning, general, and vision. The benchmark evaluates multimodal models.

What is the best open-source model on MMMU-Pro?

Kimi K2.6 by Moonshot AI is the top-ranked open-source model on MMMU-Pro, with a score of 0.801 (rank #10).

Which model offers the best value on MMMU-Pro?

Among models scoring within 10% of the leader, Gemma 4 31B from Google is the cheapest, at $0.13 per million input tokens with a score of 0.769.
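A best-value pick of this kind can be sketched as a filter followed by a minimum. The sketch below is an assumption about how such a rule might work, not the site's actual logic: it reads "within 10% of the leader" as a relative cutoff (score at least 90% of the top score), and it pairs the input prices from the leaderboard rows at ranks 1, 2, and 10 with the named models at those ranks. The function name `best_value` and the dictionary keys are hypothetical.

```python
# Scores and input prices (USD per million input tokens) as read from this page.
models = [
    {"name": "Gemini 3.5 Flash", "score": 0.836, "input_price": 1.50},
    {"name": "GPT-5.5", "score": 0.832, "input_price": 5.00},
    {"name": "Kimi K2.6", "score": 0.801, "input_price": 0.75},
    {"name": "Gemma 4 31B", "score": 0.769, "input_price": 0.13},
]


def best_value(models, tolerance=0.10):
    """Return the cheapest model whose score is within `tolerance` of the leader.

    Assumes a relative cutoff: score >= leader_score * (1 - tolerance).
    Models without a listed price are skipped.
    """
    leader_score = max(m["score"] for m in models)
    cutoff = leader_score * (1 - tolerance)
    eligible = [
        m for m in models
        if m["score"] >= cutoff and m["input_price"] is not None
    ]
    return min(eligible, key=lambda m: m["input_price"])


pick = best_value(models)
print(pick["name"], pick["input_price"])
```

With a leader at 0.836 the cutoff is about 0.752, so a model scoring 0.769 qualifies. An absolute reading (within 0.10 of the leader's score) would give a cutoff of 0.736 and the same pick for this data.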

How is MMMU-Pro scored?

MMMU-Pro is scored by accuracy, reported on a 0–1 scale. Higher scores indicate better performance.
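As a rough illustration of accuracy on a 0–1 scale, the sketch below counts exact matches between predicted and reference option letters. The helper name `accuracy` and the sample answers are made up for the example; the benchmark's real answer-extraction rules are not shown on this page.

```python
def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Made-up multiple-choice responses, for illustration only.
predictions = ["B", "D", "A", "H", "C"]
answers = ["B", "D", "C", "H", "C"]

score = accuracy(predictions, answers)  # 4 of 5 correct
as_percent = f"{score:.1%}"             # same value shown as a percentage
```

The same conversion relates the two notations used on this page: a score of 0.836 on the 0–1 scale is the 83.6% shown in the leader list.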

How recent are the MMMU-Pro leaderboard results?

The MMMU-Pro leaderboard was last updated in July 2026 and currently includes 60 evaluated models.