OpenAI MMLU
MMLU (Massive Multitask Language Understanding) is a comprehensive benchmark that measures a text model's multitask accuracy across 57 diverse academic and professional subjects. The test covers elementary mathematics, US history, computer science, law, morality, business ethics, clinical knowledge, and many other domains spanning STEM, humanities, social sciences, and professional fields. To attain high accuracy, models must possess extensive world knowledge and problem-solving ability.
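Each MMLU item is a four-option multiple-choice question, and a model's score is the fraction of items it answers correctly, aggregated across the 57 subjects. The sketch below illustrates that scoring loop; it is a minimal example under stated assumptions: items are plain Python dicts with `question`, `choices`, and `answer` fields, and `query_model` is a hypothetical stand-in for whatever model or API is being evaluated, not part of any particular library.

```python
# Minimal sketch of MMLU-style multiple-choice scoring.
# Assumes each item is a dict with "question", "choices" (4 strings), and
# "answer" (index of the correct choice); query_model is a placeholder
# for the model under evaluation.

CHOICE_LETTERS = ["A", "B", "C", "D"]

def format_prompt(item: dict) -> str:
    """Render one question as a multiple-choice prompt ending in 'Answer:'."""
    lines = [item["question"]]
    for letter, choice in zip(CHOICE_LETTERS, item["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

def score(items: list[dict], query_model) -> float:
    """Return accuracy: the fraction of items where the model picks the correct letter."""
    correct = 0
    for item in items:
        predicted = query_model(format_prompt(item)).strip().upper()[:1]
        if predicted == CHOICE_LETTERS[item["answer"]]:
            correct += 1
    return correct / len(items)

if __name__ == "__main__":
    sample = [{
        "question": "What is 7 * 8?",
        "choices": ["54", "56", "58", "64"],
        "answer": 1,
    }]
    # A stub "model" that always answers B, just to exercise the loop.
    print(score(sample, lambda prompt: "B"))  # 1.0
```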
Progress Over Time
Interactive timeline of model performance on OpenAI MMLU over time, tracing the state-of-the-art frontier and distinguishing open from proprietary models.
OpenAI MMLU Leaderboard
2 models
| Rank | Model | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|---|
| 1 | Gemma 3n E4B Instructed | Google | 8B | 32K | $20.00 / $40.00 | — |
| 2 | — | Google | 8B | — | — | — |
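The leaderboard's summary statistics (top score, average score, and model count, quoted in the FAQ below) are straightforward aggregates over the per-model results. The snippet below is a small illustrative sketch with made-up placeholder entries, not the site's actual data pipeline.

```python
# Sketch of deriving leaderboard summary statistics from per-model results.
# The entries here are made-up placeholders, not real evaluation records.
results = [
    {"model": "model-a", "organization": "Org X", "score": 0.40},
    {"model": "model-b", "organization": "Org Y", "score": 0.20},
]

# Rank models from best to worst score.
ranked = sorted(results, key=lambda r: r["score"], reverse=True)
best = ranked[0]
average = sum(r["score"] for r in results) / len(results)

print(f"Top model: {best['model']} ({best['score']:.3f})")
print(f"Average score across {len(results)} models: {average:.3f}")
```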
FAQ
Common questions about OpenAI MMLU
The underlying MMLU paper is available at https://arxiv.org/abs/2009.03300; it details the benchmark methodology, dataset creation, and evaluation criteria.
The OpenAI MMLU leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, Gemma 3n E4B Instructed by Google leads with a score of 0.356. The average score across all models is 0.289.
The highest OpenAI MMLU score is 0.356, achieved by Gemma 3n E4B Instructed from Google.
2 models have been evaluated on the OpenAI MMLU benchmark, with 0 verified results and 2 self-reported results.
OpenAI MMLU is categorized under chemistry, economics, finance, general, healthcare, legal, math, physics, psychology, and reasoning. The benchmark evaluates text models.