C-Eval


Progress Over Time

[Interactive timeline: model performance on C-Eval over time, showing the state-of-the-art frontier for open and proprietary models.]

C-Eval Leaderboard

18 models
| Rank | Model | Organization | Parameters | Context | Cost (input / output per 1M tokens) |
|---|---|---|---|---|---|
| 1 | Qwen3.6 Plus | Alibaba Cloud / Qwen Team | - | 1.0M | $0.50 / $3.00 |
| 2 | Qwen3.5-397B-A17B | Alibaba Cloud / Qwen Team | 397B | - | - |
| 3 | Kimi K2 Base | Moonshot AI | 1.0T | - | - |
| 4 | - | Alibaba Cloud / Qwen Team | 122B | - | - |
| 5 | - | - | 1.0T | 1.0M | $0.43 / $0.87 |
| 6 | - | Alibaba Cloud / Qwen Team | 28B | 262K | $0.60 / $3.60 |
| 7 | Qwen3.5-27B | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 |
| 8 | - | Alibaba Cloud / Qwen Team | 35B | - | - |
| 9 | - | Alibaba Cloud / Qwen Team | 35B | - | - |
| 10 | - | Moonshot AI | - | - | - |
| 11 | - | Alibaba Cloud / Qwen Team | 9B | - | - |
| 12 | - | DeepSeek | 671B | - | - |
| 13 | - | Alibaba Cloud / Qwen Team | 4B | - | - |
| 14 | - | Alibaba Cloud / Qwen Team | 72B | - | - |
| 15 | - | Alibaba Cloud / Qwen Team | 8B | - | - |
| 16 | - | Alibaba Cloud / Qwen Team | 2B | - | - |
| 17 | - | Alibaba Cloud / Qwen Team | 800M | - | - |
| 18 | - | - | 21B | - | - |

About this benchmark

What is C-Eval?

C-Eval is a comprehensive Chinese evaluation suite designed to assess advanced knowledge and reasoning abilities of foundation models in a Chinese context. It comprises 13,948 multiple-choice questions across 52 diverse disciplines spanning humanities, science, and engineering, with four difficulty levels: middle school, high school, college, and professional. The benchmark includes C-Eval Hard, a subset of very challenging subjects requiring advanced reasoning abilities.
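
Because every question is multiple-choice, a score reduces to the fraction of questions answered correctly. The following is a minimal sketch of that calculation, broken down by the four difficulty levels named above. The records are invented for illustration (they are not real C-Eval questions or model outputs), the function name `accuracy_by_level` is hypothetical, and the official C-Eval aggregation may differ, for example by averaging over subjects rather than over questions.

```python
from collections import defaultdict

# Hypothetical records: (difficulty level, gold answer, model prediction).
# The level names come from the benchmark description; the rows themselves
# are made up for this sketch.
RECORDS = [
    ("middle school", "A", "A"),
    ("high school", "C", "C"),
    ("high school", "B", "D"),
    ("college", "D", "D"),
    ("professional", "A", "B"),
    ("professional", "C", "C"),
]


def accuracy_by_level(records):
    """Return ({level: fraction correct}, overall fraction correct)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for level, gold, predicted in records:
        total[level] += 1
        # Normalize the prediction so " c " and "C" count as the same choice.
        if predicted.strip().upper() == gold:
            correct[level] += 1
    per_level = {level: correct[level] / total[level] for level in total}
    # Overall accuracy here is a per-question (micro) average.
    overall = sum(correct.values()) / sum(total.values())
    return per_level, overall


per_level, overall = accuracy_by_level(RECORDS)
for level, acc in per_level.items():
    print(f"{level}: {acc:.3f}")
print(f"overall: {overall:.3f}")
```

An accuracy computed this way lies on the same 0–1 scale the leaderboard uses; multiplying by 100 gives the percentage form.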

C-Eval is a text benchmark evaluating models on reasoning and general tasks. LLM Stats tracks 18 models on this benchmark, scored on a 0–1 scale. The current average is 0.832, with the leader at 0.933.


Current leaders

Qwen3.6 Plus from Alibaba Cloud / Qwen Team currently leads the C-Eval leaderboard with a score of 0.933 across 18 evaluated AI models.

1. Qwen3.6 Plus, Alibaba Cloud / Qwen Team: 93.3%
2. Qwen3.5-397B-A17B, Alibaba Cloud / Qwen Team: 93.0%
3. Kimi K2 Base, Moonshot AI: 92.5%

Source paper

Title: C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models
Authors: Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, and 9 others
Published: May 2023 (arXiv:2305.08322)

Abstract

New NLP benchmarks are urgently needed to align with the rapid development of large language models (LLMs). We present C-Eval, the first comprehensive Chinese evaluation suite designed to assess advanced knowledge and reasoning abilities of foundation models in a Chinese context. C-Eval comprises multiple-choice questions across four difficulty levels: middle school, high school, college, and professional. The questions span 52 diverse disciplines, ranging from humanities to science and engineering. C-Eval is accompanied by C-Eval Hard, a subset of very challenging subjects in C-Eval that requires advanced reasoning abilities to solve. We conduct a comprehensive evaluation of the most advanced LLMs on C-Eval, including both English- and Chinese-oriented models. Results indicate that only GPT-4 could achieve an average accuracy of over 60%, suggesting that there is still significant room for improvement for current LLMs. We anticipate C-Eval will help analyze important strengths and shortcomings of foundation models, and foster their development and growth for Chinese users.

FAQ

Common questions about the C-Eval benchmark and leaderboard.

What is the C-Eval leaderboard?

The C-Eval leaderboard ranks 18 AI models based on their performance on this benchmark. Currently, Qwen3.6 Plus by Alibaba Cloud / Qwen Team leads with a score of 0.933. The average score across all models is 0.832.

What is the highest C-Eval score?

The highest C-Eval score is 0.933, achieved by Qwen3.6 Plus from Alibaba Cloud / Qwen Team.

How many models are evaluated on C-Eval?

18 models have been evaluated on the C-Eval benchmark. All 18 results are self-reported; none has been independently verified.

Where can I find the C-Eval paper?

The C-Eval paper is available at https://arxiv.org/abs/2305.08322. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does C-Eval cover?

C-Eval is categorized under reasoning and general. It is a text-only benchmark, and its questions are written in Chinese.

What is the best open-source model on C-Eval?

Qwen3.5-397B-A17B by Alibaba Cloud / Qwen Team is the top-ranked open-source model on C-Eval, with a score of 0.930 (rank #2).

Which model offers the best value on C-Eval?

Among models scoring within 10% of the leader, Qwen3.5-27B from Alibaba Cloud / Qwen Team is the cheapest, at $0.30 per million input tokens with a score of 0.905.
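
The selection rule behind this answer can be sketched as a filter followed by a minimum. The sketch below assumes that "within 10% of the leader" means a score of at least 90% of the top score, and that models without a listed price are excluded; both are assumptions, not documented LLM Stats behavior. The names `Entry` and `best_value` are hypothetical, and the data covers only the four models whose scores appear on this page, with `None` where no price is listed.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    model: str
    score: float                  # C-Eval score on the 0-1 scale
    input_price: Optional[float]  # USD per million input tokens; None if unlisted


# Only the models whose scores are stated on this page.
ENTRIES = [
    Entry("Qwen3.6 Plus", 0.933, 0.50),
    Entry("Qwen3.5-397B-A17B", 0.930, None),
    Entry("Kimi K2 Base", 0.925, None),
    Entry("Qwen3.5-27B", 0.905, 0.30),
]


def best_value(entries, tolerance=0.10):
    """Cheapest priced entry whose score is within `tolerance` of the leader.

    Assumption: "within 10%" is relative, i.e. score >= leader * (1 - 0.10).
    Returns None if no priced entry qualifies.
    """
    leader_score = max(e.score for e in entries)
    threshold = leader_score * (1 - tolerance)
    eligible = [
        e for e in entries
        if e.score >= threshold and e.input_price is not None
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda e: e.input_price)


winner = best_value(ENTRIES)
print(winner.model, winner.input_price, winner.score)
```

With the leader at 0.933, the assumed threshold is about 0.840, so all four listed models qualify on score and the choice comes down to the lowest listed input price.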

How recent are the C-Eval leaderboard results?

The C-Eval leaderboard was last updated in June 2026 and currently includes 18 evaluated models.