MMLU-Pro

A more robust and challenging multi-task language-understanding benchmark that extends MMLU by expanding the multiple-choice options from 4 to 10, removing trivial and noisy questions, and emphasizing reasoning-intensive tasks. It features over 12,000 curated questions across 14 domains and reduces model accuracy by 16-33% relative to the original MMLU.
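The mechanics of the 10-option format can be sketched in a few lines: an evaluation harness renders the options as lettered choices (A-J) and checks the model's predicted letter against the gold answer index. The field names below (question, options, answer_index) are modeled on the publicly released TIGER-Lab/MMLU-Pro dataset; the sample item itself is invented for illustration.

```python
# Minimal sketch of scoring one MMLU-Pro-style item.
# The record layout mirrors the public dataset release; the sample is made up.
import string


def format_prompt(question: str, options: list[str]) -> str:
    """Render a question with up to 10 lettered choices (A-J)."""
    lines = [question]
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)


def is_correct(predicted_letter: str, answer_index: int) -> bool:
    """Compare a model's letter prediction against the gold answer index."""
    return string.ascii_uppercase[answer_index] == predicted_letter.strip().upper()


sample = {
    "question": "What is 7 * 8?",
    "options": ["54", "55", "56", "57", "58", "59", "60", "61", "62", "63"],
    "answer_index": 2,  # "56" corresponds to letter C
}

prompt = format_prompt(sample["question"], sample["options"])
print(prompt.splitlines()[3])                      # "C. 56"
print(is_correct("C", sample["answer_index"]))     # True
```

With ten options instead of four, random-guessing accuracy falls from 25% to 10%, which is part of why scores drop so sharply relative to the original MMLU.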

Paper: https://arxiv.org/abs/2406.01574

Progress Over Time

[Interactive timeline showing model performance evolution on MMLU-Pro; the original page marks the state-of-the-art frontier and distinguishes open from proprietary models.]

MMLU-Pro Leaderboard

115 models

[Interactive leaderboard table with columns for rank, context window, cost, and license; the captured page shows entries 1-50 of 115 (page 1 of 3). Model names and scores were lost in extraction. The recoverable rows are dominated by Alibaba Cloud / Qwen Team, with further entries from Moonshot AI, Zhipu AI, MiniMax, LG AI Research, Sarvam AI, and Mistral AI; parameter counts range from 4B to 1.0T, context windows from 33K to 1.0M tokens, and prices are listed as input/output pairs such as $0.30 / $1.20.]

FAQ

Common questions about MMLU-Pro

What is MMLU-Pro?
MMLU-Pro is a more robust and challenging multi-task language-understanding benchmark that extends MMLU by expanding the multiple-choice options from 4 to 10, removing trivial questions, and emphasizing reasoning-intensive tasks. It features over 12,000 curated questions across 14 domains and reduces model accuracy by 16-33% relative to the original MMLU.

Where can I read the MMLU-Pro paper?
The MMLU-Pro paper is available at https://arxiv.org/abs/2406.01574. It details the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the MMLU-Pro leaderboard?
The leaderboard ranks 115 AI models by their performance on this benchmark. Currently, Qwen3.6 Plus by Alibaba Cloud / Qwen Team leads with a score of 0.885. The average score across all models is 0.705.

What is the highest MMLU-Pro score?
The highest MMLU-Pro score is 0.885, achieved by Qwen3.6 Plus from Alibaba Cloud / Qwen Team.

How many models have been evaluated on MMLU-Pro?
115 models have been evaluated on the MMLU-Pro benchmark; all 115 results are self-reported, and none have been independently verified.

What does MMLU-Pro cover?
MMLU-Pro is categorized under finance, general, healthcare, language, legal, math, and reasoning. The benchmark evaluates text models.