MMLU-ProX
MMLU-ProX is an extended version of MMLU-Pro that provides additional challenging multiple-choice questions for evaluating language models across diverse academic and professional domains. It is built on the Massive Multitask Language Understanding (MMLU) benchmark framework.
Progress Over Time
[Interactive timeline of model performance on MMLU-ProX; the legend distinguishes the state-of-the-art frontier and open vs. proprietary models.]
MMLU-ProX Leaderboard
29 models
| Rank | Organization | Parameters | Context | Cost (input / output) |
|---|---|---|---|---|
| 1 | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 / $3.60 |
| 1 | Alibaba Cloud / Qwen Team | — | — | — |
| 3 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 |
| 3 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 |
| 5 | Alibaba Cloud / Qwen Team | 235B | 262K | $0.30 / $3.00 |
| 5 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 |
| 7 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 / $3.49 |
| 8 | Alibaba Cloud / Qwen Team | 235B | 262K | $0.15 / $0.80 |
| 9 | — | 120B | 262K | $0.10 / $0.50 |
| 10 | Alibaba Cloud / Qwen Team | 80B | 66K | $0.15 / $1.50 |
| 11 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.49 |
| 12 | Alibaba Cloud / Qwen Team | 33B | — | — |
| 13 | Alibaba Cloud / Qwen Team | 80B | 66K | $0.15 / $1.50 |
| 14 | Alibaba Cloud / Qwen Team | 9B | — | — |
| 15 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $1.00 |
| 16 | Alibaba Cloud / Qwen Team | 33B | — | — |
| 17 | Alibaba Cloud / Qwen Team | 4B | — | — |
| 18 | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $0.70 |
| 19 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 |
| 20 | Alibaba Cloud / Qwen Team | 9B | 262K | $0.08 / $0.50 |
| 21 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 |
| 22 | — | 32B | 262K | $0.06 / $0.24 |
| 23 | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60 |
| 24 | Alibaba Cloud / Qwen Team | 2B | — | — |
| 25 | Alibaba Cloud / Qwen Team | 800M | — | — |
| 26 | — | 2B | — | — |
| 26 | Google | 8B | 32K | $20.00 / $40.00 |
| 28 | Google | 8B | — | — |
| 28 | — | 2B | — | — |
FAQ
Common questions about MMLU-ProX
What is MMLU-ProX?
MMLU-ProX is an extended version of MMLU-Pro that provides additional challenging multiple-choice questions for evaluating language models across diverse academic and professional domains. It is built on the Massive Multitask Language Understanding (MMLU) benchmark framework.
Where can I read the MMLU-ProX paper?
The MMLU-ProX paper is available at https://arxiv.org/abs/2406.01574. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
How are models ranked on MMLU-ProX?
The MMLU-ProX leaderboard ranks 29 AI models by their performance on this benchmark. Currently, Qwen3.5-397B-A17B by Alibaba Cloud / Qwen Team leads with a score of 0.847. The average score across all models is 0.647.
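The ranking and average above come from simple aggregation over per-model scores. A minimal sketch, using hypothetical placeholder scores (the actual per-model values are not listed on this page):

```python
# Hypothetical per-model MMLU-ProX scores; the real values are not
# included on this page, so these are illustrative placeholders only.
scores = [0.847, 0.812, 0.705, 0.623]  # one entry per evaluated model

# The leaderboard sorts models by score, descending; the top entry leads.
top = max(scores)
average = sum(scores) / len(scores)

print(f"top score: {top:.3f}")
print(f"average score: {average:.3f}")
```

With the real 29 scores, the same computation would yield the 0.847 top score and 0.647 average reported above.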
What is the highest MMLU-ProX score?
The highest MMLU-ProX score is 0.847, achieved by Qwen3.5-397B-A17B from Alibaba Cloud / Qwen Team.
How many models have been evaluated on MMLU-ProX?
29 models have been evaluated on the MMLU-ProX benchmark; all 29 results are self-reported, and none have been independently verified.
What domains does MMLU-ProX cover?
MMLU-ProX is categorized under finance, general, healthcare, language, legal, math, and reasoning. The benchmark evaluates text models.