CMMLU
CMMLU (Chinese Massive Multitask Language Understanding) is a comprehensive Chinese-language benchmark that evaluates the knowledge and reasoning capabilities of large language models across 67 subjects. It spans the natural sciences, social sciences, engineering, and the humanities, with multiple-choice questions ranging from elementary to advanced professional difficulty.
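Since CMMLU is a multiple-choice benchmark, scoring comes down to comparing a model's chosen option letter against the gold answer and reporting accuracy. A minimal sketch of that scoring step (the function name and toy answers below are illustrative, not part of the official CMMLU harness):

```python
# Hypothetical sketch of CMMLU-style multiple-choice scoring:
# each question has options A-D and one gold letter; accuracy is
# the fraction of predictions that exactly match the gold answer.

def score_mcq(predictions, gold):
    """Return accuracy of predicted option letters against gold letters."""
    if len(predictions) != len(gold):
        raise ValueError("prediction/gold length mismatch")
    correct = sum(p.strip().upper() == g.strip().upper()
                  for p, g in zip(predictions, gold))
    return correct / len(gold)

# Toy example with made-up answers (3 of 4 correct):
preds = ["A", "C", "B", "D"]
golds = ["A", "C", "D", "D"]
print(score_mcq(preds, golds))  # 0.75
```

Real evaluations differ mainly in how the option letter is extracted from the model's free-form output (e.g., highest-likelihood option vs. regex over generated text); the accuracy computation itself is this simple.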
Progress Over Time
[Interactive timeline: model performance evolution on CMMLU, with a state-of-the-art frontier line; models are marked as Open or Proprietary.]
CMMLU Leaderboard
4 models
| Rank | Model | Organization | Params | Context | Cost (input / output) | License |
|---|---|---|---|---|---|---|
| 1 | Qwen2 72B Instruct | Alibaba Cloud / Qwen Team | 72B | — | — | — |
| 2 | — | Meituan | 560B | 128K | $0.30 / $1.20 | — |
| 3 | — | Meituan | 69B | 256K | $0.10 / $0.40 | — |
| 4 | — | Baidu | 21B | 128K | $0.40 / $4.00 | — |
FAQ
Common questions about CMMLU
The CMMLU paper is available at https://arxiv.org/abs/2306.09212; it details the benchmark's methodology, dataset construction, and evaluation criteria.
The CMMLU leaderboard ranks 4 AI models based on their performance on this benchmark. Currently, Qwen2 72B Instruct by Alibaba Cloud / Qwen Team leads with a score of 0.901. The average score across all models is 0.742.
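The two reported figures above imply how the rest of the field performs: with 4 models averaging 0.742 and the leader at 0.901, the mean of the remaining three follows by simple arithmetic. A quick back-of-envelope check:

```python
# Back-of-envelope check using the figures reported on this page:
# 4 models, overall mean score 0.742, top score 0.901.
n_models = 4
overall_mean = 0.742
top_score = 0.901

# Total of all scores, then remove the leader to get the rest.
total = overall_mean * n_models                      # 2.968
rest_mean = (total - top_score) / (n_models - 1)
print(round(rest_mean, 3))  # 0.689
```

So the three models below the leader average roughly 0.689, about 21 points behind the top score.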
The highest CMMLU score is 0.901, achieved by Qwen2 72B Instruct from Alibaba Cloud / Qwen Team.
4 models have been evaluated on the CMMLU benchmark; all 4 results are self-reported, and none have been independently verified.
CMMLU is categorized under general, language, and reasoning. The benchmark evaluates text-only models and supports multilingual evaluation.