MMLU
Massive Multitask Language Understanding benchmark testing knowledge across 57 diverse subjects including STEM, humanities, social sciences, and professional domains
Progress Over Time
Interactive timeline showing model performance evolution on MMLU, tracing the state-of-the-art frontier across open and proprietary models.
MMLU Leaderboard
99 models
| Rank | Organization | Params | Context | Cost (input / output, per 1M tokens) |
|---|---|---|---|---|
| 1 | OpenAI | — | 400K | $1.25 / $10.00 |
| 2 | OpenAI | — | 200K | $15.00 / $60.00 |
| 3 | OpenAI | — | 128K | $15.00 / $60.00 |
| 3 | OpenAI | — | 128K | $75.00 / $150.00 |
| 5 | Sarvam AI | 105B | — | — |
| 5 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 / $3.49 |
| 7 | Anthropic | — | 200K | $3.00 / $15.00 |
| 7 | Anthropic | — | 200K | $3.00 / $15.00 |
| 9 | OpenAI | — | 1.0M | $2.00 / $8.00 |
| 9 | Moonshot AI | 1.0T | 262K | $0.60 / $2.50 |
| 11 | OpenAI | 117B | 131K | $0.09 / $0.45 |
| 12 | Meituan | 560B | 128K | $0.30 / $1.20 |
| 13 | Moonshot AI | 1.0T | 200K | $0.50 / $0.50 |
| 13 | Moonshot AI | 1.0T | — | — |
| 15 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.49 |
| 16 | Alibaba Cloud / Qwen Team | 33B | — | — |
| 16 | OpenAI | — | 128K | $2.50 / $10.00 |
| 18 | DeepSeek | 671B | 131K | $0.27 / $1.10 |
| 19 | Alibaba Cloud / Qwen Team | 235B | 128K | $0.10 / $0.10 |
| 20 | Moonshot AI | 1.0T | — | — |
| 21 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $1.00 |
| 22 | xAI | — | 128K | $2.00 / $10.00 |
| 22 | OpenAI | — | 1.0M | $0.40 / $1.60 |
| 24 | Moonshot AI | — | — | — |
| 25 | — | 405B | 128K | $0.89 / $0.89 |
| 26 | OpenAI | — | 200K | $1.10 / $4.40 |
| 27 | Anthropic | — | 200K | $15.00 / $75.00 |
| 28 | OpenAI | — | 128K | $10.00 / $30.00 |
| 29 | Alibaba Cloud / Qwen Team | 33B | — | — |
| 29 | OpenAI | — | 33K | $30.00 / $60.00 |
| 31 | xAI | — | — | — |
| 32 | — | 90B | 128K | $0.35 / $0.40 |
| 32 | — | 70B | 128K | $0.20 / $0.20 |
| 34 | Amazon | — | 300K | $0.80 / $3.20 |
| 34 | Google | — | 2.1M | $2.50 / $10.00 |
| 36 | OpenAI | — | 128K | $2.50 / $10.00 |
| 37 | Meituan | 69B | 256K | $0.10 / $0.40 |
| 38 | Meta | 400B | 1.0M | $0.17 / $0.60 |
| 39 | OpenAI | 21B | 131K | $0.05 / $0.20 |
| 40 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 |
| 40 | OpenAI | — | 128K | $3.00 / $12.00 |
| 42 | Sarvam AI | 30B | — | — |
| 43 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $0.70 |
| 44 | Microsoft | 15B | 16K | $0.07 / $0.14 |
| 45 | Mistral AI | 123B | 128K | $2.00 / $6.00 |
| 46 | — | 70B | 128K | $0.20 / $0.20 |
| 47 | Alibaba Cloud / Qwen Team | 33B | — | — |
| 48 | Alibaba Cloud / Qwen Team | 72B | — | — |
| 49 | OpenAI | — | 128K | $0.15 / $0.60 |
| 50 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 |
Showing models 1–50 of 99.
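The rank column above uses standard competition ranking: tied models share a rank, and the next rank is skipped (e.g. two models at rank 3, then rank 5). A minimal sketch of that scheme, with a hypothetical `competition_ranks` helper (not part of the leaderboard itself):

```python
def competition_ranks(scores):
    """Assign "1224"-style competition ranks to scores (higher is better):
    tied scores share a rank, and the rank after a tie is skipped."""
    ordered = sorted(scores, reverse=True)
    rank_of = {}
    for position, score in enumerate(ordered, start=1):
        # The first (best) position a score appears at fixes its rank.
        rank_of.setdefault(score, position)
    return [rank_of[s] for s in scores]

print(competition_ranks([0.925, 0.91, 0.90, 0.90, 0.88, 0.88, 0.87]))
# → [1, 2, 3, 3, 5, 5, 7]
```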
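Cost figures in the table are quoted per million tokens, input price before output price. Under that convention, the cost of a single request can be estimated as a sketch (the `request_cost` helper below is hypothetical, not an API of the site):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Estimate the dollar cost of one request given per-1M-token prices,
    matching the table's "input / output" cost convention."""
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000

# Example: 4,000 input and 1,000 output tokens at $1.25 / $10.00 per 1M:
# 4000 * 1.25 / 1e6 + 1000 * 10.00 / 1e6 = 0.005 + 0.010 = $0.015
print(f"${request_cost(4000, 1000, 1.25, 10.00):.3f}")  # → $0.015
```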
FAQ
Common questions about MMLU
What is MMLU?
MMLU (Massive Multitask Language Understanding) is a benchmark that tests knowledge across 57 diverse subjects, spanning STEM, the humanities, the social sciences, and professional domains.
Where can I read the MMLU paper?
The MMLU paper is available at https://arxiv.org/abs/2009.03300. It details the benchmark's methodology, dataset creation, and evaluation criteria.
How does the MMLU leaderboard rank models?
The MMLU leaderboard ranks 99 AI models by their performance on the benchmark. Currently, GPT-5 by OpenAI leads with a score of 0.925; the average score across all models is 0.801.
What is the highest MMLU score?
The highest MMLU score is 0.925, achieved by GPT-5 from OpenAI.
How many models have been evaluated on MMLU?
99 models have been evaluated on the MMLU benchmark, with 0 verified results and 98 self-reported results.
What categories does MMLU cover?
MMLU is categorized under finance, general, healthcare, language, legal, math, and reasoning. The benchmark evaluates text models.