CMMLU
CMMLU Leaderboard
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | Xiaomi | 1.0T | 1.0M | $0.43 / $0.87 | |
| 2 | Alibaba Cloud / Qwen Team | 72B | — | — | |
| 3 | Meituan | 560B | — | — | |
| 4 | Meituan | 69B | 256K | $0.10 / $0.40 | |
| 5 | OpenBMB | 9B | — | — | |
| 6 | Baidu | 21B | — | — | |
What is CMMLU?
CMMLU (Chinese Massive Multitask Language Understanding) is a comprehensive Chinese benchmark that evaluates the knowledge and reasoning capabilities of large language models across 67 different subject topics. The benchmark covers natural sciences, social sciences, engineering, and humanities with multiple-choice questions ranging from basic to advanced professional levels.
CMMLU is a text benchmark that evaluates models on language, reasoning, and general tasks. LLM Stats tracks 6 models on this benchmark, with scores reported on a 0–1 scale. The current average score is about 0.8, and the leading model scores 0.902.
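As a rough illustration of what a score on a 0–1 scale means for a multiple-choice benchmark, the sketch below computes plain accuracy (the fraction of questions answered correctly) and the expected accuracy of uniform random guessing. It assumes four answer options per question, which matches the 25% random baseline quoted in the paper's abstract. The function names and the sample answers are hypothetical and are not taken from CMMLU or from LLM Stats.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the fraction of questions answered correctly, on a 0-1 scale."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def random_baseline(num_options: int) -> float:
    """Expected accuracy of uniform random guessing among num_options choices."""
    return 1 / num_options


# Hypothetical predictions and gold labels, for illustration only.
predicted = ["A", "C", "B", "D"]
gold = ["A", "C", "D", "D"]

print(accuracy(predicted, gold))  # 0.75 (3 of 4 correct)
print(random_baseline(4))         # 0.25 (assumes four options per question)
```

Under these assumptions, a leaderboard score of 0.902 would correspond to roughly 90% of questions answered correctly, well above the 0.25 expected from guessing.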
Current leaders
Among the 6 evaluated models, MiMo-V2.5-Pro from Xiaomi currently leads the CMMLU leaderboard with a score of 0.902.
Source paper
- Title: CMMLU: Measuring massive multitask language understanding in Chinese
- Authors: Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, and 4 others
- Published: arXiv:2306.09212
Abstract
As the capabilities of large language models (LLMs) continue to advance, evaluating their performance becomes increasingly crucial and challenging. This paper aims to bridge this gap by introducing CMMLU, a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities. We conduct a thorough evaluation of 18 advanced multilingual- and Chinese-oriented LLMs, assessing their performance across different subjects and settings. The results reveal that most existing LLMs struggle to achieve an average accuracy of 50%, even when provided with in-context examples and chain-of-thought prompts, whereas the random baseline stands at 25%. This highlights significant room for improvement in LLMs. Additionally, we conduct extensive experiments to identify factors impacting the models' performance and propose directions for enhancing LLMs. CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.