CMMLU

[Figure: Progress Over Time. Interactive timeline of model performance on CMMLU; legend: state-of-the-art frontier, open models, proprietary models.]

CMMLU Leaderboard

6 models (each row: rank, then organization, size, context, and cost where the page lists them)
1. 1.0T, 1.0M context, $0.43 / $0.87
2. Alibaba Cloud / Qwen Team, 72B
3. 560B
4. 69B, 256K context, $0.10 / $0.40
5. 9B
6. 21B

About this benchmark

What is CMMLU?

CMMLU (Chinese Massive Multitask Language Understanding) is a comprehensive Chinese benchmark that evaluates the knowledge and reasoning capabilities of large language models across 67 different subject topics. The benchmark covers natural sciences, social sciences, engineering, and humanities with multiple-choice questions ranging from basic to advanced professional levels.

CMMLU is a text benchmark that evaluates models on language, reasoning, and general tasks. LLM Stats tracks 6 models on this benchmark, scored on a 0–1 scale. The current average is 0.781, and the leader scores 0.902.

Current leaders

MiMo-V2.5-Pro from Xiaomi currently leads the 6 models on the CMMLU leaderboard with a score of 0.902.

1. MiMo-V2.5-Pro (Xiaomi): 90.2%
2. Qwen2 72B Instruct (Alibaba Cloud / Qwen Team): 90.1%
3. LongCat-Flash-Chat (Meituan): 84.3%
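
The page reports scores on a 0–1 scale in its prose and as percentages in the ranking. A minimal Python sketch of that ranking and formatting step, using only the three results listed here; the `rank_lines` helper and the tuple layout are illustrative assumptions, not LLM Stats code:

```python
# Top three CMMLU results as listed on this page:
# (model, organization, score on a 0-1 scale).
results = [
    ("LongCat-Flash-Chat", "Meituan", 0.843),
    ("MiMo-V2.5-Pro", "Xiaomi", 0.902),
    ("Qwen2 72B Instruct", "Alibaba Cloud / Qwen Team", 0.901),
]


def rank_lines(rows):
    """Sort rows by score (highest first) and render one display line per model."""
    ordered = sorted(rows, key=lambda row: row[2], reverse=True)
    return [
        f"{rank}. {model} ({org}): {score:.1%}"  # .1% turns 0.902 into "90.2%"
        for rank, (model, org, score) in enumerate(ordered, start=1)
    ]


for line in rank_lines(results):
    print(line)
```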

Source paper

Title: CMMLU: Measuring massive multitask language understanding in Chinese
Authors: Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, and 4 others
Published: June 2023 (arXiv:2306.09212)
Abstract

As the capabilities of large language models (LLMs) continue to advance, evaluating their performance becomes increasingly crucial and challenging. This paper aims to bridge this gap by introducing CMMLU, a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities. We conduct a thorough evaluation of 18 advanced multilingual- and Chinese-oriented LLMs, assessing their performance across different subjects and settings. The results reveal that most existing LLMs struggle to achieve an average accuracy of 50%, even when provided with in-context examples and chain-of-thought prompts, whereas the random baseline stands at 25%. This highlights significant room for improvement in LLMs. Additionally, we conduct extensive experiments to identify factors impacting the models' performance and propose directions for enhancing LLMs. CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
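
The accuracy figures in the abstract can be read against the 25% random baseline, which corresponds to guessing uniformly among four answer options. A minimal Python sketch of both quantities, assuming answers are stored as option letters; the function names and the sample data are hypothetical and not taken from the paper's evaluation code:

```python
def accuracy(predictions, answers):
    """Fraction of questions where the predicted option matches the gold option."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(pred == gold for pred, gold in zip(predictions, answers))
    return correct / len(answers)


def random_baseline(num_choices=4):
    """Expected accuracy of uniform random guessing among num_choices options."""
    return 1 / num_choices


# Hypothetical predictions and gold answers for four questions.
predicted = ["A", "B", "C", "D"]
gold = ["A", "B", "D", "D"]

print(accuracy(predicted, gold))   # 3 of 4 correct
print(random_baseline())           # four options per question
```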

FAQ

Common questions about the CMMLU benchmark and leaderboard.

What is the CMMLU leaderboard?

The CMMLU leaderboard ranks 6 AI models based on their performance on this benchmark. Currently, MiMo-V2.5-Pro by Xiaomi leads with a score of 0.902. The average score across all models is 0.781.

What is the highest CMMLU score?

The highest CMMLU score is 0.902, achieved by MiMo-V2.5-Pro from Xiaomi.

How many models are evaluated on CMMLU?

Six models have been evaluated on the CMMLU benchmark. All 6 results are self-reported; none has been verified.

Where can I find the CMMLU paper?

The CMMLU paper is available at https://arxiv.org/abs/2306.09212. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does CMMLU cover?

CMMLU is categorized under language, reasoning, and general. The benchmark evaluates text models with multilingual support.

What is the best open-source model on CMMLU?

MiMo-V2.5-Pro by Xiaomi is the top-ranked open-source model on CMMLU, with a score of 0.902 (rank #1).

Which model offers the best value on CMMLU?

Among models scoring within 10% of the leader, LongCat-Flash-Lite from Meituan is the cheapest, at $0.10 per million input tokens with a score of 0.825.
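
The selection rule in this answer can be sketched in Python. The sketch assumes "within 10% of the leader" means a score of at least 90% of the top score, and that the $0.43 price in the leaderboard's first row belongs to MiMo-V2.5-Pro; the page confirms neither. The `best_value` function and the record layout are hypothetical:

```python
# Scores and input prices (USD per million input tokens) as stated on this page.
# None marks a price the page does not state; those models are skipped.
# Assumption: the $0.43 input price belongs to the rank-1 model, MiMo-V2.5-Pro.
models = [
    {"name": "MiMo-V2.5-Pro", "score": 0.902, "input_price": 0.43},
    {"name": "Qwen2 72B Instruct", "score": 0.901, "input_price": None},
    {"name": "LongCat-Flash-Chat", "score": 0.843, "input_price": None},
    {"name": "LongCat-Flash-Lite", "score": 0.825, "input_price": 0.10},
]


def best_value(rows, tolerance=0.10):
    """Return the cheapest priced model scoring at least (1 - tolerance) * top score."""
    leader_score = max(row["score"] for row in rows)
    cutoff = leader_score * (1 - tolerance)  # relative reading of "within 10%"
    eligible = [
        row for row in rows
        if row["score"] >= cutoff and row["input_price"] is not None
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda row: row["input_price"])


print(best_value(models)["name"])
```

With a 10% tolerance the cutoff is 0.902 × 0.9 = 0.8118, so LongCat-Flash-Lite at 0.825 qualifies and is the cheapest of the priced candidates.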

How recent are the CMMLU leaderboard results?

The CMMLU leaderboard was last updated in July 2026 and currently includes 6 evaluated models.