C-Eval
C-Eval is a comprehensive Chinese evaluation suite designed to assess advanced knowledge and reasoning abilities of foundation models in a Chinese context. It comprises 13,948 multiple-choice questions across 52 diverse disciplines spanning humanities, science, and engineering, with four difficulty levels: middle school, high school, college, and professional. The benchmark includes C-Eval Hard, a subset of very challenging subjects requiring advanced reasoning abilities.
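Since C-Eval is a multiple-choice benchmark, evaluation reduces to prompting a model with a question plus four labeled options and measuring the fraction of items answered correctly. The sketch below illustrates that scoring loop with a made-up item; it is not drawn from the actual dataset or the official evaluation harness.

```python
# Sketch of C-Eval-style multiple-choice scoring: the model picks one
# of four options (A-D) and accuracy is the fraction of correct picks.
# The example item is invented for illustration only.

CHOICE_LABELS = ["A", "B", "C", "D"]

def format_item(question: str, choices: list[str]) -> str:
    """Render one multiple-choice item as a plain-text prompt."""
    lines = [question]
    for label, choice in zip(CHOICE_LABELS, choices):
        lines.append(f"{label}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of predicted option letters that match the gold letters."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Hypothetical usage: one formatted prompt and a three-item score.
prompt = format_item("1 + 1 = ?", ["1", "2", "3", "4"])
print(prompt)
print(accuracy(["B", "C", "A"], ["B", "D", "A"]))  # 2 of 3 correct
```

A real harness would also handle answer extraction from free-form model output (e.g., matching the first option letter generated), which the official paper discusses in more detail.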
Progress Over Time
Interactive timeline showing model performance evolution on C-Eval
C-Eval Leaderboard
14 models • 0 verified
| # | Organization | Params | Context | Cost (input / output) | License |
|---|---|---|---|---|---|
| 1 | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 / $3.60 | — |
| 2 | Moonshot AI | 1.0T | — | — | — |
| 3 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 | — |
| 4 | Alibaba Cloud / Qwen Team | 27B | — | — | — |
| 5 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 | — |
| 6 | Moonshot AI | — | — | — | — |
| 7 | Alibaba Cloud / Qwen Team | 9B | — | — | — |
| 8 | DeepSeek | 671B | 131K | $0.27 / $1.10 | — |
| 9 | Alibaba Cloud / Qwen Team | 4B | — | — | — |
| 10 | Alibaba Cloud / Qwen Team | 72B | — | — | — |
| 11 | Alibaba Cloud / Qwen Team | 8B | — | — | — |
| 12 | Alibaba Cloud / Qwen Team | 2B | — | — | — |
| 13 | Alibaba Cloud / Qwen Team | 800M | — | — | — |
| 14 | Baidu | 21B | 128K | $0.40 / $4.00 | — |
FAQ
Common questions about C-Eval
C-Eval is a Chinese evaluation suite of 13,948 multiple-choice questions spanning 52 disciplines and four difficulty levels (middle school, high school, college, and professional), including the more challenging C-Eval Hard subset; see the overview at the top of this page for the full description.
The C-Eval paper is available at https://arxiv.org/abs/2305.08322. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The C-Eval leaderboard ranks 14 AI models based on their performance on this benchmark. Currently, Qwen3.5-397B-A17B by Alibaba Cloud / Qwen Team leads with a score of 0.930. The average score across all models is 0.808.
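The reported average is the arithmetic mean of the per-model leaderboard scores. The snippet below shows that computation with placeholder values chosen only for illustration; they are not the actual per-model scores.

```python
# The leaderboard's "average score" is the arithmetic mean of each
# model's benchmark score. The values below are hypothetical
# placeholders, not the real leaderboard data.

def mean_score(scores: list[float]) -> float:
    """Arithmetic mean of a list of benchmark scores."""
    return sum(scores) / len(scores)

placeholder_scores = [0.930, 0.850, 0.780, 0.672]  # illustrative only
print(round(mean_score(placeholder_scores), 3))  # prints 0.808
```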
The highest C-Eval score is 0.930, achieved by Qwen3.5-397B-A17B from Alibaba Cloud / Qwen Team.
14 models have been evaluated on the C-Eval benchmark, with 0 verified results and 14 self-reported results.
C-Eval is categorized under general knowledge and reasoning. It evaluates text models and requires multilingual (specifically Chinese-language) capability.