
Multilingual MMLU

MMLU-ProX is a comprehensive multilingual benchmark covering 29 typologically diverse languages, building upon MMLU-Pro. Each language version contains the same 11,829 questions, enabling direct cross-linguistic comparison. The benchmark evaluates large language models' reasoning capabilities across linguistic and cultural boundaries through challenging, reasoning-focused questions with 10 answer choices.
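Because every language version shares the same item IDs, per-language accuracy can be compared directly on identical questions. A minimal sketch of that scoring follows; the record layout and function name are illustrative assumptions, not the benchmark's official evaluation harness:

```python
from collections import defaultdict

# Each item appears once per language with the same question_id,
# ten answer options labeled A-J, and one gold letter.
CHOICES = "ABCDEFGHIJ"

def score_per_language(records):
    """records: iterable of dicts with keys
    'lang', 'question_id', 'gold', 'prediction'.
    Returns {lang: exact-match accuracy} over that language's items."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        assert r["gold"] in CHOICES and r["prediction"] in CHOICES
        total[r["lang"]] += 1
        correct[r["lang"]] += r["prediction"] == r["gold"]
    return {lang: correct[lang] / total[lang] for lang in total}

# Toy example: the same two questions answered in two languages.
records = [
    {"lang": "en", "question_id": 1, "gold": "C", "prediction": "C"},
    {"lang": "en", "question_id": 2, "gold": "J", "prediction": "A"},
    {"lang": "sw", "question_id": 1, "gold": "C", "prediction": "C"},
    {"lang": "sw", "question_id": 2, "gold": "J", "prediction": "J"},
]
print(score_per_language(records))  # {'en': 0.5, 'sw': 1.0}
```

Because the item sets are parallel, a gap between two languages' scores reflects the language, not a difference in question difficulty.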

Paper: https://arxiv.org/abs/2503.10497


Multilingual MMLU Leaderboard

5 models (all scores self-reported; the context, cost, and license columns survive only partially from the original table)

1. o3-mini — OpenAI — 200K context — $1.10 / $4.40 per 1M tokens (input/output) — score 0.807
2–4. models of 214B, 38B, and 43B parameters (names not recovered)
5. Microsoft — 4B parameters

FAQ

Common questions about Multilingual MMLU

What is Multilingual MMLU?
MMLU-ProX is a multilingual extension of MMLU-Pro covering 29 typologically diverse languages, with the same 11,829 reasoning-focused, 10-choice questions in every language; see the full description at the top of this page.

Where can I find the Multilingual MMLU paper?
The Multilingual MMLU paper is available at https://arxiv.org/abs/2503.10497. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How do models rank on the Multilingual MMLU leaderboard?
The leaderboard ranks 5 AI models based on their performance on this benchmark. Currently, o3-mini by OpenAI leads with a score of 0.807; the average score across all models is 0.680.

What is the highest Multilingual MMLU score?
The highest Multilingual MMLU score is 0.807, achieved by o3-mini from OpenAI.

How many models have been evaluated?
5 models have been evaluated on the Multilingual MMLU benchmark, with 0 verified results and 5 self-reported results.

What categories does Multilingual MMLU fall under?
Multilingual MMLU is categorized under general, language, and reasoning. The benchmark evaluates text models with multilingual support.