CMMLU

CMMLU (Chinese Massive Multitask Language Understanding) is a comprehensive Chinese benchmark that evaluates the knowledge and reasoning capabilities of large language models across 67 subjects. The benchmark covers natural sciences, social sciences, engineering, and humanities with multiple-choice questions ranging from basic to advanced professional levels.
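Since CMMLU items are multiple-choice, evaluation typically reduces to extracting the model's chosen option letter and computing accuracy against the answer key. Below is a minimal sketch of that scoring step; the example questions, outputs, and the simple first-letter extraction heuristic are illustrative placeholders, not the official evaluation harness.

```python
# Sketch of CMMLU-style multiple-choice scoring (illustrative only).

def extract_choice(model_output: str) -> str:
    """Return the first A-D option letter found in the model's output.

    A naive heuristic: real harnesses use stricter answer parsing.
    """
    for ch in model_output.upper():
        if ch in "ABCD":
            return ch
    return ""

def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of questions where the extracted choice matches the key."""
    correct = sum(
        extract_choice(p) == a for p, a in zip(predictions, answers)
    )
    return correct / len(answers)

# Three hypothetical model outputs scored against a hypothetical answer key.
preds = ["答案是 B", "选项 A", "D"]
key = ["B", "A", "C"]
print(accuracy(preds, key))  # 2 of 3 extracted choices match the key
```

Reported leaderboard scores in the 0.7-0.9 range correspond to this per-question accuracy averaged over all 67 subjects.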

Paper

The CMMLU paper is available at https://arxiv.org/abs/2306.09212.

Progress Over Time

[Interactive timeline showing model performance evolution on CMMLU, with a state-of-the-art frontier line and open vs. proprietary models distinguished.]

CMMLU Leaderboard

4 models ranked.

Rank 1: Qwen2 72B Instruct (Alibaba Cloud / Qwen Team), 72B parameters, score 0.901.

[Per-model context window, cost, and license details from the original table are not recoverable here.]

FAQ

Common questions about CMMLU

What is CMMLU?
CMMLU (Chinese Massive Multitask Language Understanding) is a comprehensive Chinese benchmark that evaluates the knowledge and reasoning capabilities of large language models across 67 subjects, covering natural sciences, social sciences, engineering, and humanities with multiple-choice questions ranging from basic to advanced professional levels.

Where can I find the CMMLU paper?
The CMMLU paper is available at https://arxiv.org/abs/2306.09212. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How are models ranked on the CMMLU leaderboard?
The leaderboard ranks 4 AI models based on their performance on this benchmark. Currently, Qwen2 72B Instruct by Alibaba Cloud / Qwen Team leads with a score of 0.901. The average score across all models is 0.742.

What is the highest CMMLU score?
The highest CMMLU score is 0.901, achieved by Qwen2 72B Instruct from Alibaba Cloud / Qwen Team.

How many models have been evaluated?
4 models have been evaluated on the CMMLU benchmark, with 0 verified results and 4 self-reported results.

What categories does CMMLU fall under?
CMMLU is categorized under general, language, and reasoning. The benchmark evaluates text models with multilingual support.