MMLU

The Massive Multitask Language Understanding (MMLU) benchmark tests knowledge across 57 diverse subjects, including STEM, the humanities, the social sciences, and professional domains.
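As a minimal sketch of what those items look like in practice, the snippet below loads the benchmark and renders one question as a multiple-choice prompt. It assumes the `cais/mmlu` mirror on the Hugging Face Hub and its field names (`question`, `subject`, `choices`, `answer`); this page does not specify a particular distribution.

```python
from datasets import load_dataset

# Assumes the "cais/mmlu" mirror on the Hugging Face Hub; the "all" config
# concatenates the items from all 57 subjects into a single split.
mmlu = load_dataset("cais/mmlu", "all", split="test")

LETTERS = "ABCD"

def format_prompt(item):
    """Render one MMLU item as a zero-shot multiple-choice prompt."""
    lines = [item["question"]]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(item["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)

item = mmlu[0]
print("Subject:", item["subject"])
print(format_prompt(item))
print("Gold answer:", LETTERS[item["answer"]])
```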

Paper: https://arxiv.org/abs/2009.03300

Progress Over Time

Interactive timeline showing model performance evolution on MMLU, with the state-of-the-art frontier highlighted and open vs. proprietary models distinguished.

MMLU Leaderboard

99 models evaluated

| Rank | Organization | Parameters | Context | Price per 1M tokens (input / output) |
|---|---|---|---|---|
| 1 | OpenAI | – | 400K | $1.25 / $10.00 |
| 2 | OpenAI | – | 200K | $15.00 / $60.00 |
| 3 | – | – | 128K | $15.00 / $60.00 |
| 3 | OpenAI | – | 128K | $75.00 / $150.00 |
| 5 | Sarvam AI | 105B | – | – |
| 5 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 / $3.49 |
| 7 | – | – | 200K | $3.00 / $15.00 |
| 7 | – | – | 200K | $3.00 / $15.00 |
| 9 | OpenAI | – | 1.0M | $2.00 / $8.00 |
| 9 | Moonshot AI | 1.0T | 262K | $0.60 / $2.50 |
| 11 | – | 117B | 131K | $0.09 / $0.45 |
| 12 | – | 560B | 128K | $0.30 / $1.20 |
| 13 | Moonshot AI | 1.0T | 200K | $0.50 / $0.50 |
| 13 | – | 1.0T | – | – |
| 15 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.49 |
| 16 | Alibaba Cloud / Qwen Team | 33B | – | – |
| 16 | OpenAI | – | 128K | $2.50 / $10.00 |
| 18 | DeepSeek | 671B | 131K | $0.27 / $1.10 |
| 19 | Alibaba Cloud / Qwen Team | 235B | 128K | $0.10 / $0.10 |
| 20 | Moonshot AI | 1.0T | – | – |
| 21 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $1.00 |
| 22 | – | – | 128K | $2.00 / $10.00 |
| 22 | – | – | 1.0M | $0.40 / $1.60 |
| 24 | Moonshot AI | – | – | – |
| 25 | – | 405B | 128K | $0.89 / $0.89 |
| 26 | OpenAI | – | 200K | $1.10 / $4.40 |
| 27 | Anthropic | – | 200K | $15.00 / $75.00 |
| 28 | – | – | 128K | $10.00 / $30.00 |
| 29 | Alibaba Cloud / Qwen Team | 33B | – | – |
| 29 | OpenAI | – | 33K | $30.00 / $60.00 |
| 31 | – | – | – | – |
| 32 | – | 90B | 128K | $0.35 / $0.40 |
| 32 | – | 70B | 128K | $0.20 / $0.20 |
| 34 | Amazon | – | 300K | $0.80 / $3.20 |
| 34 | – | – | 2.1M | $2.50 / $10.00 |
| 36 | OpenAI | – | 128K | $2.50 / $10.00 |
| 37 | – | 69B | 256K | $0.10 / $0.40 |
| 38 | – | 400B | 1.0M | $0.17 / $0.60 |
| 39 | – | 21B | 131K | $0.05 / $0.20 |
| 40 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 |
| 40 | OpenAI | – | 128K | $3.00 / $12.00 |
| 42 | Sarvam AI | 30B | – | – |
| 43 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $0.70 |
| 44 | Microsoft | 15B | 16K | $0.07 / $0.14 |
| 45 | Mistral AI | 123B | 128K | $2.00 / $6.00 |
| 46 | – | 70B | 128K | $0.20 / $0.20 |
| 47 | Alibaba Cloud / Qwen Team | 33B | – | – |
| 48 | Alibaba Cloud / Qwen Team | 72B | – | – |
| 49 | – | – | 128K | $0.15 / $0.60 |
| 50 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 |

Showing 1–50 of 99 models (page 1 of 2).
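The sketch below gives a rough sense of how the per-1M-token prices in the table translate into the cost of a full evaluation run. The question count and per-question token figures are illustrative assumptions, not values taken from this leaderboard.

```python
def eval_cost_usd(n_questions, in_tokens_per_q, out_tokens_per_q,
                  price_in_per_m, price_out_per_m):
    """Estimate the cost of one benchmark run from per-1M-token prices."""
    input_cost = n_questions * in_tokens_per_q / 1e6 * price_in_per_m
    output_cost = n_questions * out_tokens_per_q / 1e6 * price_out_per_m
    return input_cost + output_cost

# Assumed figures: ~14,000 test questions, ~300 prompt tokens and ~5 answer
# tokens per question, priced at the rank-1 row's $1.25 / $10.00 tier:
print(f"${eval_cost_usd(14_000, 300, 5, 1.25, 10.00):.2f}")    # ≈ $5.95
# The same run at the most expensive tier in the table ($75.00 / $150.00):
print(f"${eval_cost_usd(14_000, 300, 5, 75.00, 150.00):.2f}")   # ≈ $325.50
```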

FAQ

Common questions about MMLU

What is MMLU?
MMLU (Massive Multitask Language Understanding) is a benchmark that tests knowledge across 57 diverse subjects, including STEM, the humanities, the social sciences, and professional domains.

Where can I find the MMLU paper?
The MMLU paper is available at https://arxiv.org/abs/2009.03300. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How do models rank on the MMLU leaderboard?
The MMLU leaderboard ranks 99 AI models by their performance on this benchmark. GPT-5 by OpenAI currently leads with a score of 0.925, and the average score across all models is 0.801.

What is the highest MMLU score?
The highest MMLU score is 0.925, achieved by GPT-5 from OpenAI.

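These scores are accuracies, i.e. the fraction of questions answered correctly. A minimal sketch of that computation, assuming a flat micro-average over questions (some leaderboards instead macro-average per subject):

```python
def mmlu_accuracy(predictions, answers):
    """Fraction of questions answered correctly; 0.925 means 92.5% correct."""
    assert len(predictions) == len(answers)
    # predictions and answers are choice indices (0-3) or letters, compared directly
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)

# Example: 3 of 4 correct -> 0.75
print(mmlu_accuracy([0, 2, 1, 3], [0, 2, 1, 0]))
```
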
How many models have been evaluated on MMLU?
99 models have been evaluated on the MMLU benchmark, with 0 verified results and 98 self-reported results.

What categories does MMLU cover?
MMLU is categorized under finance, general, healthcare, language, legal, math, and reasoning. The benchmark evaluates text models.