C-Eval

C-Eval is a comprehensive Chinese evaluation suite designed to assess advanced knowledge and reasoning abilities of foundation models in a Chinese context. It comprises 13,948 multiple-choice questions across 52 diverse disciplines spanning humanities, science, and engineering, with four difficulty levels: middle school, high school, college, and professional. The benchmark includes C-Eval Hard, a subset of very challenging subjects requiring advanced reasoning abilities.
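For hands-on use, the dataset is commonly accessed through Hugging Face. Below is a minimal sketch of loading one subject and rendering a question as a multiple-choice prompt, assuming the `datasets` library and the community `ceval/ceval-exam` dataset ID; the field names (question, A/B/C/D, answer) follow that release and may differ in other mirrors.

```python
# Minimal sketch: load one C-Eval subject and format a question as a
# multiple-choice prompt. Assumes the Hugging Face `datasets` library and
# the `ceval/ceval-exam` dataset ID; field names follow that release.
from datasets import load_dataset

subject = "high_school_physics"  # one of the 52 disciplines
ceval = load_dataset("ceval/ceval-exam", name=subject)

def format_prompt(example: dict) -> str:
    """Render a single question in the standard A/B/C/D layout."""
    return (
        f"{example['question']}\n"
        f"A. {example['A']}\n"
        f"B. {example['B']}\n"
        f"C. {example['C']}\n"
        f"D. {example['D']}\n"
        "Answer:"
    )

# The dev split carries labeled examples suitable for few-shot prompting;
# the test split withholds answers for leaderboard submission.
print(format_prompt(ceval["dev"][0]))
```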

Paper

The C-Eval paper is available at https://arxiv.org/abs/2305.08322.
Progress Over Time

[Interactive chart: timeline of the state-of-the-art frontier on C-Eval, with open and proprietary models distinguished]

C-Eval Leaderboard

14 models • 0 verified
| Rank | Model | Organization | Params | Context | Cost (input / output) |
|------|-------|--------------|--------|---------|------------------------|
| 1 | Qwen3.5-397B-A17B | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 / $3.60 |
| 2 | | Moonshot AI | 1.0T | | |
| 3 | | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 |
| 4 | | Alibaba Cloud / Qwen Team | 27B | | |
| 5 | | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 |
| 6 | | Moonshot AI | | | |
| 7 | | Alibaba Cloud / Qwen Team | 9B | | |
| 8 | | DeepSeek | 671B | 131K | $0.27 / $1.10 |
| 9 | | Alibaba Cloud / Qwen Team | 4B | | |
| 10 | | Alibaba Cloud / Qwen Team | 72B | | |
| 11 | | Alibaba Cloud / Qwen Team | 8B | | |
| 12 | | Alibaba Cloud / Qwen Team | 2B | | |
| 13 | | Alibaba Cloud / Qwen Team | 800M | | |
| 14 | | | 21B | 128K | $0.40 / $4.00 |

FAQ

Common questions about C-Eval

Q: What is C-Eval?
A: C-Eval is a comprehensive Chinese evaluation suite of 13,948 multiple-choice questions spanning 52 disciplines across humanities, science, and engineering, at four difficulty levels (middle school, high school, college, professional). It includes C-Eval Hard, a subset of especially challenging subjects requiring advanced reasoning.

Q: Where can I find the C-Eval paper?
A: The C-Eval paper is available at https://arxiv.org/abs/2305.08322. It details the benchmark methodology, dataset creation, and evaluation criteria.

Q: How are models ranked on the C-Eval leaderboard?
A: The leaderboard ranks 14 AI models by their benchmark score. Currently, Qwen3.5-397B-A17B by Alibaba Cloud / Qwen Team leads with a score of 0.930, and the average score across all models is 0.808. A sketch of how such scores are computed follows this FAQ.
Q: What is the highest C-Eval score?
A: The highest C-Eval score is 0.930, achieved by Qwen3.5-397B-A17B from Alibaba Cloud / Qwen Team.

Q: How many models have been evaluated on C-Eval?
A: 14 models have been evaluated on the C-Eval benchmark, with 0 verified results and 14 self-reported results.

Q: What categories does C-Eval fall under?
A: C-Eval is categorized under general and reasoning. It evaluates text models and is tagged as multilingual.
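As noted in the ranking answer above, a C-Eval score is multiple-choice accuracy. The sketch below shows one plausible computation, assuming exact match on the predicted option letter and a simple unweighted mean over subjects; `predict` is a hypothetical stand-in for a real model call, and the official leaderboard may aggregate differently.

```python
# Minimal sketch of C-Eval-style scoring: exact match on the predicted
# option letter, then an unweighted mean over subjects. `predict` is a
# hypothetical stand-in for an actual model call.
from statistics import mean

def score_subject(examples: list[dict], predict) -> float:
    """Accuracy over one subject: fraction of questions answered correctly."""
    correct = sum(
        1 for ex in examples
        if predict(ex).strip().upper()[:1] == ex["answer"]
    )
    return correct / len(examples)

def overall_score(per_subject: dict[str, float]) -> float:
    """Unweighted average of per-subject accuracies."""
    return mean(per_subject.values())

# Example with dummy numbers (not real leaderboard data):
print(overall_score({"high_school_physics": 0.91, "law": 0.86}))
```

Exact match on the option letter keeps scoring deterministic for chat models; for base models, evaluation harnesses often instead compare per-option log-likelihoods and pick the highest.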