MMLU-Pro
A more robust and challenging multi-task language understanding benchmark that extends MMLU by expanding the multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. It features over 12,000 curated questions across 14 domains and lowers accuracy by 16% to 33% relative to the original MMLU.
Qwen3.7 Max from Alibaba Cloud / Qwen Team currently leads the MMLU-Pro leaderboard of 120 evaluated AI models with a score of 0.896.
What MMLU-Pro measures
MMLU-Pro is a text benchmark that evaluates large language models on math, reasoning, finance, general, healthcare, language, and legal tasks. LLM Stats tracks 120 models on this benchmark, with a maximum possible score of 1. The current average across reported models is 0.7, and the leader reaches 0.9.
Publication
- Paper: MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
- Authors: Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, and 13 others
- Published: arXiv 2406.01574
Abstract
In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. Additionally, MMLU-Pro eliminates the trivial and noisy questions in MMLU. Our experimental results show that MMLU-Pro not only raises the challenge, causing a significant drop in accuracy by 16% to 33% compared to MMLU but also demonstrates greater stability under varying prompts. With 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in MMLU-Pro. Additionally, we found that models utilizing Chain of Thought (CoT) reasoning achieved better performance on MMLU-Pro compared to direct answering, which is in stark contrast to the findings on the original MMLU, indicating that MMLU-Pro includes more complex reasoning questions. Our assessments confirm that MMLU-Pro is a more discriminative benchmark to better track progress in the field.
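The abstract's two central changes, ten answer options instead of four and Chain of Thought answering, can be illustrated with a minimal scoring sketch. Everything here is an assumption made for illustration: the "answer is (X)" phrasing, the regular expression, and the function names `extract_choice`, `accuracy`, and `chance_accuracy` are hypothetical and are not taken from the paper or from any official evaluation script.

```python
import re
from typing import Optional, Sequence

# Assumption: the model ends its reasoning with a phrase such as
# "the answer is (C)". The ten options are labeled A through J.
ANSWER_PATTERN = re.compile(r"answer is \(?([A-J])\)?")


def extract_choice(response: str) -> Optional[str]:
    """Return the last option letter announced in a response, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


def accuracy(responses: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold letter."""
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing."""
    return 1 / num_options


if __name__ == "__main__":
    # Hypothetical responses, written only to exercise the functions.
    responses = [
        "Let me think... The answer is (C).",
        "First B seems right, so the answer is B. On reflection, the answer is (J).",
        "I am not sure.",
    ]
    gold = ["C", "J", "A"]
    print(accuracy(responses, gold))
    print(chance_accuracy(4), chance_accuracy(10))
```

Taking the last match rather than the first lets a response revise an earlier guess during its reasoning, and an unparseable response simply counts as wrong. The random-guess floor falls from 1/4 with four options to 1/10 with ten, which is one reason scores on the two benchmarks are not directly comparable.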
Qwen3.7 Max leads with 89.6%, followed by Qwen3.6 Plus at 88.5% and MiniMax M2.1 at 88.0%.
MMLU-Pro Leaderboard
| Rank | Organization | Parameters | Context | Cost | License | |
|---|---|---|---|---|---|---|
| 1 | Alibaba Cloud / Qwen Team | — | 1.0M | $1.25 / $3.75 | ||
| 2 | Alibaba Cloud / Qwen Team | — | 1.0M | $0.50 / $3.00 | ||
| 3 | MiniMax | 230B | 1.0M | $0.30 / $1.20 | ||
| 4 | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 / $3.60 | ||
| 5 | DeepSeek | 1.6T | 1.0M | $1.74 / $3.48 | ||
| 6 | Moonshot AI | 1.0T | — | — | ||
| 7 | Baidu | — | — | — | ||
| 8 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 | ||
| 9 | Alibaba Cloud / Qwen Team | 28B | 262K | $0.60 / $3.60 | ||
| 9 | DeepSeek | 284B | 1.0M | $0.14 / $0.28 | ||
| 11 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 | ||
| 12 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 | ||
| 13 | Google | 31B | 262K | $0.14 / $0.40 | ||
| 13 | Alibaba Cloud / Qwen Team | 35B | — | — | ||
| 15 | DeepSeek | 685B | — | — | ||
| 15 | DeepSeek | 685B | — | — | ||
| 15 | DeepSeek | 685B | — | — | ||
| 15 | DeepSeek | 671B | 131K | $0.55 / $2.19 | ||
| 19 | Xiaomi | 309B | — | — | ||
| 20 | Moonshot AI | 1.0T | — | — | ||
| 20 | Zhipu AI | 355B | — | — | ||
| 22 | Alibaba Cloud / Qwen Team | 235B | — | — | ||
| 23 | Zhipu AI | 358B | — | — | ||
| 24 | LG AI Research | 236B | — | — | ||
| 24 | Alibaba Cloud / Qwen Team | 236B | — | — | ||
| 26 | | 120B | — | — | ||
| 27 | DeepSeek | 671B | — | — | ||
| 28 | Alibaba Cloud / Qwen Team | 235B | — | — | ||
| 29 | Alibaba Cloud / Qwen Team | 80B | — | — | ||
| 30 | Meituan | 560B | 128K | $0.30 / $1.20 | ||
| 31 | Meituan | 560B | — | — | ||
| 31 | Google | 25B | 262K | $0.13 / $0.40 | ||
| 33 | Moonshot AI | 1.0T | — | — | ||
| 33 | Alibaba Cloud / Qwen Team | 9B | — | — | ||
| 35 | Alibaba Cloud / Qwen Team | 33B | — | — | ||
| 36 | MiniMax | 230B | 1.0M | $0.30 / $1.20 | ||
| 37 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.49 | ||
| 38 | Sarvam AI | 105B | — | — | ||
| 39 | Zhipu AI | 106B | — | — | ||
| 40 | DeepSeek | 671B | 164K | $0.28 / $1.14 | ||
| 41 | Moonshot AI | 1.0T | — | — | ||
| 41 | Moonshot AI | 1.0T | — | — | ||
| 41 | MiniMax | 456B | — | — | ||
| 44 | OpenAI | 117B | 131K | $0.10 / $0.50 | ||
| 45 | MiniMax | 456B | — | — | ||
| 45 | Alibaba Cloud / Qwen Team | 80B | — | — | ||
| 47 | Meta | 400B | — | — | ||
| 47 | Alibaba Cloud / Qwen Team | 31B | — | — | ||
| 49 | Sarvam AI | 30B | — | — | ||
| 50 | Alibaba Cloud / Qwen Team | 4B | — | — | ||
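The rank column above repeats a number for tied entries and then skips ahead (for example 9, 9, 11 and 15, 15, 15, 15, 19), which is consistent with standard competition ranking. The sketch below shows that scheme under stated assumptions: the function name `competition_ranks` is hypothetical, the example scores are made up for illustration, and nothing here is confirmed to be how the leaderboard itself computes ranks.

```python
from typing import List, Sequence, Tuple


def competition_ranks(scores: Sequence[float]) -> List[Tuple[int, float]]:
    """Rank scores in descending order; tied scores share the lowest rank.

    After a tie, the next distinct score takes its 1-based position in the
    sorted list, so rank numbers are skipped (1, 2, 2, 4).
    """
    ordered = sorted(scores, reverse=True)
    ranks: List[int] = []
    for position, score in enumerate(ordered, start=1):
        if ranks and score == ordered[position - 2]:
            ranks.append(ranks[-1])  # same score as the previous entry
        else:
            ranks.append(position)
    return list(zip(ranks, ordered))


if __name__ == "__main__":
    # Illustrative scores only, not leaderboard data.
    for rank, score in competition_ranks([0.85, 0.90, 0.80, 0.85]):
        print(rank, score)
```

Exact equality is used for ties here, which assumes scores are compared at their reported precision; scores stored with more digits would need rounding first.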
More evaluations to explore
Related benchmarks in the same category
- GPQA: A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult, with PhD experts reaching 65% accuracy.
- AIME 2025: All 30 problems from the 2025 American Invitational Mathematics Examination (AIME I and AIME II), testing olympiad-level mathematical reasoning with integer answers from 000-999. Used as an AI benchmark to evaluate large language models' ability to solve complex mathematical problems requiring multi-step logical deductions and structured symbolic reasoning.
- MMLU: Massive Multitask Language Understanding benchmark testing knowledge across 57 diverse subjects including STEM, humanities, social sciences, and professional domains.
- SWE-Bench Verified: A verified subset of 500 software engineering problems from real GitHub issues, validated by human annotators for evaluating language models' ability to resolve real-world coding issues by generating patches for Python codebases.
- Humanity's Last Exam (HLE): A multi-modal academic benchmark with 2,500 questions across mathematics, humanities, and natural sciences, designed to test LLM capabilities at the frontier of human knowledge with unambiguous, verifiable solutions.
- LiveCodeBench: A holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four different scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.