CSimpleQA

Chinese SimpleQA is the first comprehensive Chinese benchmark for evaluating the ability of language models to answer short factual questions. It contains 3,000 high-quality questions spanning 6 major topics and 99 diverse subtopics, designed to assess Chinese factual knowledge across the humanities, science, engineering, culture, and society.
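SimpleQA-style benchmarks are typically scored by posing each short question to the model under test and having a judge model classify the response as CORRECT, INCORRECT, or NOT_ATTEMPTED against the reference answer. The sketch below illustrates that loop under those assumptions; the judge prompt, the gpt-4o judge choice, and the csimpleqa_predictions.jsonl file are illustrative placeholders, not the official evaluation harness.

```python
# Minimal sketch of SimpleQA-style factuality grading, assuming an
# OpenAI-compatible chat endpoint. The prompt, judge model, and input
# file below are placeholders, not the benchmark's official harness.
import json
from openai import OpenAI

client = OpenAI()

GRADER_PROMPT = """You are grading a short factual answer.
Question: {question}
Gold answer: {gold}
Model answer: {pred}
Reply with exactly one word: CORRECT, INCORRECT, or NOT_ATTEMPTED."""

def grade(question: str, gold: str, pred: str) -> str:
    """Ask the judge model to classify one prediction."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # judge model; placeholder choice
        messages=[{
            "role": "user",
            "content": GRADER_PROMPT.format(question=question, gold=gold, pred=pred),
        }],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def evaluate(rows: list[dict]) -> float:
    """Fraction of predictions the judge marks CORRECT."""
    verdicts = [grade(r["question"], r["answer"], r["prediction"]) for r in rows]
    return sum(v == "CORRECT" for v in verdicts) / len(verdicts)

if __name__ == "__main__":
    # Each line: {"question": ..., "answer": ..., "prediction": ...}
    with open("csimpleqa_predictions.jsonl", encoding="utf-8") as f:
        rows = [json.loads(line) for line in f]
    print(f"Correct: {evaluate(rows):.3f}")
```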

Paper: https://arxiv.org/abs/2411.07140

Progress Over Time

[Interactive timeline of model performance on CSimpleQA, tracing the state-of-the-art frontier and distinguishing open from proprietary models.]

CSimpleQA Leaderboard

7 models (— = not listed)

| # | Model | Organization | Score | Params | Context | Cost (input / output) |
|---|-------|--------------|-------|--------|---------|-----------------------|
| 1 | DeepSeek-V4-Pro-Max | DeepSeek | 0.844 | 1.6T | 1.0M | $1.74 / $3.48 |
| 2 | — | Alibaba Cloud / Qwen Team | — | 235B | 262K | $0.15 / $0.80 |
| 3 | — | Alibaba Cloud / Qwen Team | — | 236B | 262K | $0.30 / $1.50 |
| 4 | — | — | — | 284B | 1.0M | $0.14 / $0.28 |
| 5 | — | Moonshot AI | — | 1.0T | 200K | $0.50 / $0.50 |
| 6 | — | Moonshot AI | — | 1.0T | — | — |
| 7 | — | DeepSeek | — | 671B | 131K | $0.27 / $1.10 |
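The Cost column gives an input / output price pair; leaderboards of this kind usually quote USD per 1M tokens, which is assumed below. Under that assumption, here is a back-of-envelope estimate of one full 3,000-question run; the token counts are guesses, since prompt templates and answer lengths vary by model.

```python
# Rough cost estimate for one full CSimpleQA run, assuming the table's
# prices are USD per 1M tokens. Token counts are illustrative guesses.
N_QUESTIONS = 3_000
AVG_INPUT_TOKENS = 120   # short question plus instructions (assumption)
AVG_OUTPUT_TOKENS = 40   # short factual answer (assumption)

def run_cost(price_in: float, price_out: float) -> float:
    """Total USD for N_QUESTIONS at the given per-1M-token prices."""
    cost_in = N_QUESTIONS * AVG_INPUT_TOKENS / 1e6 * price_in
    cost_out = N_QUESTIONS * AVG_OUTPUT_TOKENS / 1e6 * price_out
    return cost_in + cost_out

# Cheapest vs. most expensive rows in the table above:
print(f"$0.14 / $0.28 model: ${run_cost(0.14, 0.28):.2f}")
print(f"$1.74 / $3.48 model: ${run_cost(1.74, 3.48):.2f}")
```

Even at the highest listed rates, a single pass over the 3,000 questions stays in the low single-digit dollars under these assumptions.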

FAQ

Common questions about CSimpleQA

What is CSimpleQA?
Chinese SimpleQA is the first comprehensive Chinese benchmark for evaluating the ability of language models to answer short factual questions. It contains 3,000 high-quality questions spanning 6 major topics and 99 diverse subtopics, designed to assess Chinese factual knowledge across the humanities, science, engineering, culture, and society.

Where can I find the CSimpleQA paper?
The paper is available at https://arxiv.org/abs/2411.07140. It details the benchmark methodology, dataset creation, and evaluation criteria.

How are models ranked on the CSimpleQA leaderboard?
The leaderboard ranks 7 AI models by their performance on this benchmark. DeepSeek-V4-Pro-Max by DeepSeek currently leads with a score of 0.844; the average score across all models is 0.788.

What is the highest CSimpleQA score?
The highest CSimpleQA score is 0.844, achieved by DeepSeek-V4-Pro-Max from DeepSeek.

How many models have been evaluated on CSimpleQA?
7 models have been evaluated on the benchmark: 0 with verified results and 7 with self-reported results.

What categories does CSimpleQA belong to?
CSimpleQA is categorized under general and language. It evaluates text models with multilingual support.