
OpenAI MMLU

MMLU (Massive Multitask Language Understanding) is a comprehensive benchmark that measures a text model's multitask accuracy across 57 diverse academic and professional subjects. The test covers elementary mathematics, US history, computer science, law, morality, business ethics, clinical knowledge, and many other domains spanning STEM, humanities, social sciences, and professional fields. To attain high accuracy, models must possess extensive world knowledge and problem-solving ability.
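For illustration, the sketch below shows how multitask accuracy on an MMLU-style benchmark is typically computed: score each multiple-choice question, group by subject, and aggregate. It assumes the Hugging Face datasets library and the cais/mmlu dataset layout (question, choices, answer, subject fields); predict_choice is a hypothetical stand-in for the model under test, and this is not the evaluation harness behind the scores on this page.

```python
# Minimal sketch of MMLU-style multitask accuracy (assumptions noted above).
from collections import defaultdict

from datasets import load_dataset


def predict_choice(question: str, choices: list[str]) -> int:
    """Hypothetical model call: return the index of the selected answer."""
    raise NotImplementedError("Replace with a real model call.")


def evaluate_mmlu(split: str = "test") -> dict[str, float]:
    # Load all 57 subjects in one split (assumes the cais/mmlu "all" config).
    ds = load_dataset("cais/mmlu", "all", split=split)

    correct = defaultdict(int)
    total = defaultdict(int)
    for row in ds:
        pred = predict_choice(row["question"], row["choices"])
        total[row["subject"]] += 1
        if pred == row["answer"]:
            correct[row["subject"]] += 1

    # Per-subject accuracy, then an overall figure as the macro average
    # over subjects (one common reporting convention; others micro-average).
    per_subject = {s: correct[s] / total[s] for s in total}
    per_subject["overall"] = sum(per_subject.values()) / len(per_subject)
    return per_subject
```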

Paper: https://arxiv.org/abs/2009.03300

Progress Over Time

[Interactive timeline showing model performance evolution on OpenAI MMLU, with the state-of-the-art frontier highlighted and filters for open and proprietary models.]
OpenAI MMLU Leaderboard

2 models

Rank  Model                    Parameters  Context  Cost             License
1     Gemma 3n E4B Instructed  8B          32K      $20.00 / $40.00  —
2     —                        8B          —        —                —

FAQ

Common questions about OpenAI MMLU

The OpenAI MMLU paper is available at https://arxiv.org/abs/2009.03300. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The OpenAI MMLU leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, Gemma 3n E4B Instructed by Google leads with a score of 0.356. The average score across all models is 0.289.
The highest OpenAI MMLU score is 0.356, achieved by Gemma 3n E4B Instructed from Google.
2 models have been evaluated on the OpenAI MMLU benchmark, with 0 verified results and 2 self-reported results.
OpenAI MMLU is categorized under chemistry, economics, finance, general, healthcare, legal, math, physics, psychology, and reasoning. The benchmark evaluates text models.