C-Eval
C-Eval Leaderboard
| Rank | Organization | Parameters | Context | Cost |
|---|---|---|---|---|
| 1 | Alibaba Cloud / Qwen Team | — | 1.0M | $0.50 / $3.00 |
| 2 | Alibaba Cloud / Qwen Team | 397B | — | — |
| 3 | Moonshot AI | 1.0T | — | — |
| 4 | Alibaba Cloud / Qwen Team | 122B | — | — |
| 5 | Xiaomi | 1.0T | 1.0M | $0.43 / $0.87 |
| 6 | Alibaba Cloud / Qwen Team | 28B | 262K | $0.60 / $3.60 |
| 7 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 |
| 8 | Alibaba Cloud / Qwen Team | 35B | — | — |
| 9 | Alibaba Cloud / Qwen Team | 35B | — | — |
| 10 | Moonshot AI | — | — | — |
| 11 | Alibaba Cloud / Qwen Team | 9B | — | — |
| 12 | DeepSeek | 671B | — | — |
| 13 | Alibaba Cloud / Qwen Team | 4B | — | — |
| 14 | Alibaba Cloud / Qwen Team | 72B | — | — |
| 15 | Alibaba Cloud / Qwen Team | 8B | — | — |
| 16 | Alibaba Cloud / Qwen Team | 2B | — | — |
| 17 | Alibaba Cloud / Qwen Team | 800M | — | — |
| 18 | Baidu | 21B | — | — |
What is C-Eval?
C-Eval is a comprehensive Chinese evaluation suite designed to assess advanced knowledge and reasoning abilities of foundation models in a Chinese context. It comprises 13,948 multiple-choice questions across 52 diverse disciplines spanning humanities, science, and engineering, with four difficulty levels: middle school, high school, college, and professional. The benchmark includes C-Eval Hard, a subset of very challenging subjects requiring advanced reasoning abilities.
C-Eval is a text benchmark that evaluates models on reasoning and general tasks. LLM Stats tracks 18 models on this benchmark, scored on a 0–1 scale. The current average is about 0.8, and the leader scores 0.933.
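Since C-Eval consists of multiple-choice questions and scores are reported on a 0–1 scale, a score can be read as the fraction of questions answered correctly. The following is a minimal sketch of that calculation, not the official C-Eval or LLM Stats scoring code. The record layout `(subject, predicted, gold)`, the function name `score_multiple_choice`, and the sample subjects are all hypothetical, and the choice to pool all questions for the overall score (rather than averaging per-subject scores) is an assumption.

```python
from collections import defaultdict


def score_multiple_choice(records):
    """Return (per_subject_accuracy, overall_accuracy), each on a 0-1 scale.

    `records` is an iterable of (subject, predicted, gold) tuples whose
    answers are option letters such as "A". This layout is hypothetical.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if predicted.strip().upper() == gold.strip().upper():
            correct[subject] += 1
    if not total:
        raise ValueError("no records to score")
    per_subject = {s: correct[s] / total[s] for s in total}
    # Assumption: the overall score pools every question (micro-average).
    overall = sum(correct.values()) / sum(total.values())
    return per_subject, overall


# Illustrative data only; these are not real benchmark results.
records = [
    ("high_school_physics", "A", "A"),
    ("high_school_physics", "c", "B"),
    ("college_economics", "D", "D"),
    ("college_economics", "B", "B"),
]
per_subject, overall = score_multiple_choice(records)
print(per_subject)
print(f"overall accuracy: {overall:.3f}")
```

Keeping the per-subject breakdown alongside the pooled score makes it easier to see which disciplines or difficulty levels pull a model's total down.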
Current leaders
Qwen3.6 Plus from Alibaba Cloud / Qwen Team currently leads the C-Eval leaderboard with a score of 0.933 across 18 evaluated AI models.
Source paper
- Title: C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models
- Authors: Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, and 9 others
- Published: arXiv:2305.08322
Abstract
New NLP benchmarks are urgently needed to align with the rapid development of large language models (LLMs). We present C-Eval, the first comprehensive Chinese evaluation suite designed to assess advanced knowledge and reasoning abilities of foundation models in a Chinese context. C-Eval comprises multiple-choice questions across four difficulty levels: middle school, high school, college, and professional. The questions span 52 diverse disciplines, ranging from humanities to science and engineering. C-Eval is accompanied by C-Eval Hard, a subset of very challenging subjects in C-Eval that requires advanced reasoning abilities to solve. We conduct a comprehensive evaluation of the most advanced LLMs on C-Eval, including both English- and Chinese-oriented models. Results indicate that only GPT-4 could achieve an average accuracy of over 60%, suggesting that there is still significant room for improvement for current LLMs. We anticipate C-Eval will help analyze important strengths and shortcomings of foundation models, and foster their development and growth for Chinese users.