CSimpleQA
Chinese SimpleQA (CSimpleQA) is the first comprehensive Chinese benchmark for evaluating the factuality of language models when they answer short questions. It contains 3,000 high-quality questions spanning 6 major topics and 99 subtopics, and is designed to assess Chinese factual knowledge across humanities, science, engineering, culture, and society.
CSimpleQA Leaderboard
7 models
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | DeepSeek | 1.6T | 1.0M | $1.74 / $3.48 | |
| 2 | Alibaba Cloud / Qwen Team | 235B | 262K | $0.15 / $0.80 | |
| 3 | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.50 | |
| 4 | DeepSeek | 284B | 1.0M | $0.14 / $0.28 | |
| 5 | Moonshot AI | 1.0T | 200K | $0.50 / $0.50 | |
| 6 | Moonshot AI | 1.0T | — | — | |
| 7 | DeepSeek | 671B | 131K | $0.27 / $1.10 | |
FAQ
Common questions about CSimpleQA
The CSimpleQA paper is available at https://arxiv.org/abs/2411.07140. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The CSimpleQA leaderboard ranks 7 AI models based on their performance on this benchmark. Currently, DeepSeek-V4-Pro-Max by DeepSeek leads with a score of 0.844. The average score across all models is 0.788.
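The aggregate figures quoted here (7 models, a mean of 0.788, a top score of 0.844) are enough to estimate how the other six models score on average. The following is a minimal sketch of that arithmetic; the function name `mean_of_rest` is illustrative, and because the published figures are rounded to three decimals, the result is approximate.

```python
def mean_of_rest(n_models: int, overall_mean: float, top_score: float) -> float:
    """Mean score of every model except the leader, derived from aggregates."""
    if n_models < 2:
        raise ValueError("need at least two models to exclude the leader")
    total = n_models * overall_mean           # sum of all scores
    return (total - top_score) / (n_models - 1)


# Figures as reported on the leaderboard (rounded, so results are approximate).
rest = mean_of_rest(n_models=7, overall_mean=0.788, top_score=0.844)
print(f"mean of the other six models: {rest:.3f}")
print(f"leader's margin over that mean: {0.844 - rest:.3f}")
```

Under these rounded inputs, the remaining six models average roughly 0.779, so the leader sits about 0.065 above the rest of the field.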
The highest CSimpleQA score is 0.844, achieved by DeepSeek-V4-Pro-Max from DeepSeek.
Seven models have been evaluated on the CSimpleQA benchmark. All 7 results are self-reported; none have been verified.
CSimpleQA is categorized under general and language. The benchmark evaluates text models with multilingual support.