MMLU-Pro
MMLU-Pro is a more robust and challenging multi-task language understanding benchmark that extends MMLU by expanding the multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. It features over 12,000 curated questions across 14 domains and produces a 16–33% accuracy drop relative to the original MMLU.
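For readers who want to inspect the data directly, here is a minimal sketch of loading the benchmark, assuming the TIGER-Lab/MMLU-Pro release on the Hugging Face Hub (the dataset name and field names below come from that release and may differ from other mirrors):

```python
# Minimal sketch: load and inspect MMLU-Pro.
# Assumes the TIGER-Lab/MMLU-Pro dataset on the Hugging Face Hub;
# field names ("category", "question", "options", "answer") follow that release.
from datasets import load_dataset

ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

example = ds[0]
print(example["category"])            # one of the 14 domains
print(example["question"])
for letter, option in zip("ABCDEFGHIJ", example["options"]):
    print(f"{letter}. {option}")      # up to 10 answer options per question
print("gold:", example["answer"])     # gold option letter
```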
Progress Over Time
[Interactive timeline of model performance on MMLU-Pro over time, tracing the state-of-the-art frontier and distinguishing open from proprietary models.]
MMLU-Pro Leaderboard
115 models
| Rank | Organization | Params | Context | Cost (input / output) |
|---|---|---|---|---|
| 1 | Alibaba Cloud / Qwen Team | — | — | — |
| 2 | MiniMax | 230B | 1.0M | $0.30 / $1.20 |
| 3 | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 / $3.60 |
| 4 | Moonshot AI | 1.0T | 262K | $0.60 / $3.00 |
| 5 | Baidu | — | — | — |
| 6 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 |
| 7 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 |
| 8 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 |
| 9 | Google | 31B | 262K | $0.14 / $0.40 |
| 10 | DeepSeek | 685B | 164K | $0.26 / $0.38 |
| 10 | DeepSeek | 685B | — | — |
| 10 | DeepSeek | 671B | 131K | $0.55 / $2.19 |
| 10 | DeepSeek | 685B | — | — |
| 14 | Xiaomi | 309B | 256K | $0.10 / $0.30 |
| 15 | Zhipu AI | 355B | — | — |
| 15 | Moonshot AI | 1.0T | — | — |
| 17 | Alibaba Cloud / Qwen Team | 235B | 262K | $0.30 / $3.00 |
| 18 | Zhipu AI | 358B | 205K | $0.60 / $2.20 |
| 19 | LG AI Research | 236B | 33K | $0.60 / $1.00 |
| 19 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 / $3.49 |
| 21 | — | 120B | — | — |
| 22 | DeepSeek | 671B | 164K | $0.27 / $1.00 |
| 23 | Alibaba Cloud / Qwen Team | 235B | 262K | $0.15 / $0.80 |
| 24 | Alibaba Cloud / Qwen Team | 80B | 66K | $0.15 / $1.50 |
| 25 | Meituan | 560B | 128K | $0.30 / $1.20 |
| 26 | Meituan | 560B | — | — |
| 26 | Google | 25B | 262K | $0.13 / $0.40 |
| 28 | Alibaba Cloud / Qwen Team | 9B | — | — |
| 28 | Moonshot AI | 1.0T | 262K | $0.60 / $2.50 |
| 30 | Alibaba Cloud / Qwen Team | 33B | — | — |
| 31 | MiniMax | 230B | 1.0M | $0.30 / $1.20 |
| 32 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.50 |
| 33 | Sarvam AI | 105B | — | — |
| 34 | Zhipu AI | 106B | — | — |
| 35 | DeepSeek | 671B | 164K | $0.28 / $1.14 |
| 36 | MiniMax | 456B | 1.0M | $0.55 / $2.20 |
| 36 | Moonshot AI | 1.0T | 200K | $0.50 / $0.50 |
| 36 | Moonshot AI | 1.0T | — | — |
| 39 | OpenAI | 117B | 131K | $0.10 / $0.50 |
| 40 | Alibaba Cloud / Qwen Team | 80B | 66K | $0.15 / $1.50 |
| 40 | MiniMax | 456B | — | — |
| 42 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $1.00 |
| 42 | Meta | 400B | 1.0M | $0.17 / $0.85 |
| 44 | Sarvam AI | 30B | — | — |
| 45 | Alibaba Cloud / Qwen Team | 4B | — | — |
| 46 | Alibaba Cloud / Qwen Team | 33B | — | — |
| 47 | — | 32B | 262K | $0.06 / $0.24 |
| 48 | Meituan | 69B | 256K | $0.10 / $0.40 |
| 49 | Mistral AI | 119B | 256K | $0.15 / $0.60 |
| 50 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $0.70 |
Showing ranks 1–50 of 115 models (page 1 of 3).
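The Cost column appears to list per-million-token prices for input and output. As a rough illustration of what those prices imply for a full benchmark run, here is a sketch of a cost estimate; the per-question token counts are illustrative guesses, not figures from this page:

```python
# Rough cost estimate for one full MMLU-Pro run, assuming the Cost column
# is USD per 1M tokens (input / output). Token counts per question are
# illustrative assumptions, not measured values.
NUM_QUESTIONS = 12_000        # "over 12,000 curated questions"
IN_TOKENS_PER_Q = 600         # assumed: question + 10 options + instructions
OUT_TOKENS_PER_Q = 400        # assumed: chain-of-thought answer

def eval_cost(price_in: float, price_out: float) -> float:
    """Total USD for a full run, given per-1M-token input/output prices."""
    in_millions = NUM_QUESTIONS * IN_TOKENS_PER_Q / 1e6
    out_millions = NUM_QUESTIONS * OUT_TOKENS_PER_Q / 1e6
    return in_millions * price_in + out_millions * price_out

# Example: the rank-2 row above lists $0.30 / $1.20.
print(f"${eval_cost(0.30, 1.20):.2f}")  # -> $7.92
```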
FAQ
Common questions about MMLU-Pro
What is MMLU-Pro?
MMLU-Pro is a more robust, reasoning-focused extension of MMLU: it expands the multiple-choice options from 4 to 10, eliminates trivial questions, and features over 12,000 curated questions across 14 domains, producing a 16–33% accuracy drop relative to the original MMLU.
Where can I read the MMLU-Pro paper?
The MMLU-Pro paper is available at https://arxiv.org/abs/2406.01574. It details the benchmark methodology, dataset creation, and evaluation criteria.
How does the MMLU-Pro leaderboard work?
The leaderboard ranks 115 AI models by their performance on the benchmark. Currently, Qwen3.6 Plus by Alibaba Cloud / Qwen Team leads with a score of 0.885; the average score across all models is 0.705.
What is the highest MMLU-Pro score?
The highest MMLU-Pro score is 0.885, achieved by Qwen3.6 Plus from Alibaba Cloud / Qwen Team.
How many models have been evaluated?
115 models have been evaluated on the MMLU-Pro benchmark; all 115 results are self-reported, and none have been independently verified.
What does MMLU-Pro cover?
MMLU-Pro is categorized under finance, general, healthcare, language, legal, math, and reasoning, and it evaluates text models.
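As background on how scores like these are typically produced, here is a minimal scoring sketch: the model answers with chain-of-thought, a final option letter (A–J) is extracted by regex, and exact-match accuracy is reported. The extraction pattern and helper names below are illustrative assumptions modeled on common MMLU-Pro evaluation scripts, not this leaderboard's exact methodology:

```python
import re

# Assumes model outputs end with a phrase like "the answer is (C)";
# the pattern is an illustrative assumption, not this page's methodology.
ANSWER_RE = re.compile(r"answer is \(?([A-J])\)?", re.IGNORECASE)

def extract_choice(completion: str) -> str | None:
    """Pull the final A-J option letter out of a chain-of-thought answer."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1].upper() if matches else None

def accuracy(completions: list[str], golds: list[str]) -> float:
    """Exact-match accuracy over option letters; failed extractions count as wrong."""
    correct = sum(extract_choice(c) == g for c, g in zip(completions, golds))
    return correct / len(golds)

# Hypothetical usage: one correct answer out of two -> 0.5
print(accuracy(["... so the answer is (B).", "I think the answer is C"], ["B", "A"]))
```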