MMLU-ProX

MMLU-ProX is an extended version of MMLU-Pro that provides additional challenging multiple-choice questions for evaluating language models across diverse academic and professional domains. It builds on the foundation of the Massive Multitask Language Understanding (MMLU) benchmark framework.

Progress Over Time

[Interactive timeline showing model performance evolution on MMLU-ProX, with open and proprietary models plotted against the state-of-the-art frontier.]

MMLU-ProX Leaderboard

29 models

Rank  Model              Organization               Params  Context  Cost (input / output)
1     Qwen3.5-397B-A17B  Alibaba Cloud / Qwen Team  397B    262K     $0.60 / $3.60
1                        Alibaba Cloud / Qwen Team
3                        Alibaba Cloud / Qwen Team  122B    262K     $0.40 / $3.20
3                        Alibaba Cloud / Qwen Team  27B     262K     $0.30 / $2.40
5                        Alibaba Cloud / Qwen Team  235B    262K     $0.30 / $3.00
5                        Alibaba Cloud / Qwen Team  35B     262K     $0.25 / $2.00
7                        Alibaba Cloud / Qwen Team  236B    262K     $0.45 / $3.49
8                        Alibaba Cloud / Qwen Team  235B    262K     $0.15 / $0.80
9                                                   120B    262K     $0.10 / $0.50
10                       Alibaba Cloud / Qwen Team  80B     66K      $0.15 / $1.50
11                       Alibaba Cloud / Qwen Team  236B    262K     $0.30 / $1.49
12                       Alibaba Cloud / Qwen Team  33B
13                       Alibaba Cloud / Qwen Team  80B     66K      $0.15 / $1.50
14                       Alibaba Cloud / Qwen Team  9B
15                       Alibaba Cloud / Qwen Team  31B     262K     $0.20 / $1.00
16                       Alibaba Cloud / Qwen Team  33B
17                       Alibaba Cloud / Qwen Team  4B
18                       Alibaba Cloud / Qwen Team  31B     262K     $0.20 / $0.70
19                       Alibaba Cloud / Qwen Team  9B      262K     $0.18 / $2.09
20                       Alibaba Cloud / Qwen Team  9B      262K     $0.08 / $0.50
21                       Alibaba Cloud / Qwen Team  4B      262K     $0.10 / $1.00
22                                                  32B     262K     $0.06 / $0.24
23                       Alibaba Cloud / Qwen Team  4B      262K     $0.10 / $0.60
24                       Alibaba Cloud / Qwen Team  2B
25                       Alibaba Cloud / Qwen Team  800M
26                                                  2B
26                                                  8B      32K      $20.00 / $40.00
28                                                  8B
28                                                  2B

FAQ

Common questions about MMLU-ProX

What is MMLU-ProX?
MMLU-ProX is an extended version of MMLU-Pro that provides additional challenging multiple-choice questions for evaluating language models across diverse academic and professional domains. It builds on the foundation of the Massive Multitask Language Understanding (MMLU) benchmark framework.

Where can I read the MMLU-ProX paper?
The MMLU-ProX paper is available at https://arxiv.org/abs/2406.01574. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How are models ranked on the MMLU-ProX leaderboard?
The MMLU-ProX leaderboard ranks 29 AI models by their performance on this benchmark. Currently, Qwen3.5-397B-A17B by the Alibaba Cloud / Qwen Team leads with a score of 0.847. The average score across all models is 0.647.

What is the highest MMLU-ProX score?
The highest MMLU-ProX score is 0.847, achieved by Qwen3.5-397B-A17B from the Alibaba Cloud / Qwen Team.

How many models have been evaluated on MMLU-ProX?
29 models have been evaluated on the MMLU-ProX benchmark, with 0 verified results and 29 self-reported results.

What categories does MMLU-ProX cover?
MMLU-ProX is categorized under finance, general, healthcare, language, legal, math, and reasoning. The benchmark evaluates text models.