MMMLU
Multilingual Massive Multitask Language Understanding (MMMLU) is a dataset released by OpenAI, featuring professionally translated MMLU test questions in 14 languages: Arabic, Bengali, German, Spanish, French, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese, Swahili, Yoruba, and Chinese. It contains roughly 15,908 multiple-choice questions per language, covering 57 subjects.
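For illustration, here is a minimal sketch of how an MMMLU-style multiple-choice item might be formatted into an A–D prompt and scored by exact match on the answer letter. The field names (`question`, `choices`, `answer`) follow the common MMLU layout and are an assumption here, not the official schema; the sample item is invented.

```python
# Sketch: formatting and scoring one MMLU-style multiple-choice item.
# Field names (question / choices / answer) are assumed, not the official schema.
LETTERS = "ABCD"

def format_prompt(item: dict) -> str:
    """Render a question and its four options as an A-D multiple-choice prompt."""
    lines = [item["question"]]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, item["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)

def score(predictions: list[str], answers: list[str]) -> float:
    """Exact-match accuracy over predicted answer letters (case-insensitive)."""
    correct = sum(p.strip().upper() == a.strip().upper()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)

# Invented example in the style of a translated (French) item:
item = {
    "question": "Quelle est la capitale de la France ?",
    "choices": ["Lyon", "Paris", "Marseille", "Nice"],
    "answer": "B",
}
print(format_prompt(item))
print(score(["B"], [item["answer"]]))  # → 1.0
```

Reported benchmark scores are then simply this accuracy averaged over all items, typically per language or overall.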
Progress Over Time
[Interactive timeline showing model performance evolution on MMMLU; series: state-of-the-art frontier, open models, proprietary models]
MMMLU Leaderboard
45 models
| Rank | Organization | Parameters | Context | Cost (input / output per 1M tokens) |
|---|---|---|---|---|
| 1 | Anthropic | — | — | $25.00 / $125.00 |
| 2 | Google | — | 1.0M | $2.50 / $15.00 |
| 3 | Google | — | — | — |
| 3 | Google | — | 1.0M | $0.50 / $3.00 |
| 5 | Anthropic | — | 1.0M | $5.00 / $25.00 |
| 6 | Anthropic | — | 1.0M | $5.00 / $25.00 |
| 7 | Anthropic | — | 200K | $5.00 / $25.00 |
| 8 | OpenAI | — | 400K | $1.75 / $14.00 |
| 9 | Anthropic | — | 200K | $15.00 / $75.00 |
| 9 | Alibaba Cloud / Qwen Team | — | — | — |
| 11 | Anthropic | — | 200K | $3.00 / $15.00 |
| 12 | Anthropic | — | 200K | $3.00 / $15.00 |
| 13 | Google | — | 1.0M | $0.25 / $1.50 |
| 14 | Anthropic | — | — | — |
| 15 | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 / $3.60 |
| 16 | Google | 31B | 262K | $0.14 / $0.40 |
| 17 | OpenAI | — | 200K | $15.00 / $60.00 |
| 18 | OpenAI | — | 1.0M | $2.00 / $8.00 |
| 19 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 |
| 19 | Alibaba Cloud / Qwen Team | 235B | 128K | $0.10 / $0.10 |
| 21 | Anthropic | — | — | — |
| 22 | Google | 25B | 262K | $0.13 / $0.40 |
| 23 | Anthropic | — | 200K | $3.00 / $15.00 |
| 24 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 |
| 25 | LG AI Research | 236B | 33K | $0.60 / $1.00 |
| 26 | Mistral AI | 675B | — | — |
| 26 | Mistral AI | 675B | 262K | $0.50 / $1.50 |
| 26 | — | 675B | — | — |
| 26 | — | 675B | — | — |
| 30 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 |
| 31 | OpenAI | — | 128K | $75.00 / $150.00 |
| 32 | OpenAI | 117B | 131K | $0.10 / $0.50 |
| 33 | Anthropic | — | 200K | $1.00 / $5.00 |
| 34 | OpenAI | — | 128K | $2.50 / $10.00 |
| 35 | Alibaba Cloud / Qwen Team | 9B | — | — |
| 36 | OpenAI | — | 1.0M | $0.40 / $1.60 |
| 37 | Google | 8B | — | — |
| 38 | Alibaba Cloud / Qwen Team | 4B | — | — |
| 39 | Mistral AI | 675B | 128K | $2.00 / $5.00 |
| 40 | Microsoft | 60B | — | — |
| 41 | Google | 5B | — | — |
| 42 | OpenAI | — | 1.0M | $0.10 / $0.40 |
| 43 | Alibaba Cloud / Qwen Team | 2B | — | — |
| 44 | Microsoft | 4B | 128K | $0.10 / $0.10 |
| 45 | Alibaba Cloud / Qwen Team | 800M | — | — |
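The Cost column lists separate prices for input and output tokens, each per one million tokens. A quick sketch of estimating a single request's cost from those two figures (the prices used below are taken from the table; the token counts are illustrative):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Estimate USD cost of one request given per-1M-token input/output prices."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000

# e.g. a model priced at $2.50 / $15.00 per 1M tokens, with a
# 2,000-token prompt and a 500-token completion:
cost = request_cost(2_000, 500, 2.50, 15.00)
print(f"${cost:.4f}")  # → $0.0125
```

Because output tokens are often several times more expensive than input tokens, completion length tends to dominate the bill for long responses.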
FAQ
Common questions about MMMLU
MMMLU is based on the original MMLU benchmark, whose paper is available at https://arxiv.org/abs/2009.03300. That paper details the benchmark methodology, dataset creation, and evaluation criteria; the professionally translated multilingual version was released separately by OpenAI.
The MMMLU leaderboard ranks 45 AI models based on their performance on this benchmark. Currently, Claude Mythos Preview by Anthropic leads with a score of 0.927. The average score across all models is 0.830.
The highest MMMLU score is 0.927, achieved by Claude Mythos Preview from Anthropic.
45 models have been evaluated on the MMMLU benchmark; all 45 results are self-reported, and none have been independently verified.
MMMLU is categorized under general, language, math, and reasoning. The benchmark evaluates text models with multilingual support.