CSimpleQA Leaderboard

Progress Over Time

[Interactive timeline showing how model performance on CSimpleQA has evolved. Legend: state-of-the-art frontier, open models, proprietary models.]

CSimpleQA Leaderboard

7 models

Rank	Model (Organization)	Parameters	Context	Cost
1	DeepSeek-V4-Pro-Max (DeepSeek)	1.6T	1.0M	$1.74 / $3.48
2	Qwen3-235B-A22B-Instruct-2507 (Alibaba Cloud / Qwen Team)	235B	262K	$0.15 / $0.80
3	Qwen3 VL 235B A22B Instruct (Alibaba Cloud / Qwen Team)	236B	262K	$0.30 / $1.50
4	DeepSeek-V4-Flash-Max (DeepSeek)	284B	1.0M	$0.14 / $0.28
5	Kimi K2 Instruct (Moonshot AI)	1.0T	200K	$0.50 / $0.50
6	Kimi K2 Base (Moonshot AI)	1.0T	—	—
7	DeepSeek-V3 (DeepSeek)	671B	131K	$0.27 / $1.10
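
For readers who want to work with these figures programmatically, the following is a minimal sketch of how tab-separated rows like the ones above could be parsed. It assumes five fields per row (rank, model, parameters, context, cost), treats K/M/B/T as decimal multipliers, and reads the cost field as a pair of prices without assuming what each price refers to. The names parse_size, parse_cost, parse_row, and Row are hypothetical helpers, not part of any leaderboard API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed decimal multipliers for the size suffixes seen in the table.
SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}
MISSING = "—"  # the table marks unavailable values with a dash


def parse_size(text: str) -> Optional[float]:
    """Convert a string such as '235B' or '1.0M' to a number; None if missing."""
    text = text.strip()
    if not text or text == MISSING:
        return None
    return float(text[:-1]) * SUFFIXES[text[-1].upper()]


def parse_cost(text: str) -> Optional[Tuple[float, float]]:
    """Convert '$0.15 / $0.80' to (0.15, 0.80); None if missing."""
    text = text.strip()
    if not text or text == MISSING:
        return None
    first, second = (part.strip().lstrip("$") for part in text.split("/"))
    return float(first), float(second)


@dataclass
class Row:
    rank: int
    model: str
    parameters: Optional[float]
    context: Optional[float]
    cost: Optional[Tuple[float, float]]


def parse_row(line: str) -> Row:
    """Split one tab-separated table row into typed fields."""
    rank, model, params, context, cost = line.split("\t")
    return Row(int(rank), model.strip(), parse_size(params),
               parse_size(context), parse_cost(cost))


# Sample rows copied from the table, with the model name only.
rows = [parse_row(line) for line in [
    "2\tQwen3-235B-A22B-Instruct-2507\t235B\t262K\t$0.15 / $0.80",
    "6\tKimi K2 Base\t1.0T\t—\t—",
    "7\tDeepSeek-V3\t671B\t131K\t$0.27 / $1.10",
]]
```

Missing values are kept as None rather than zero, so a later sort or average can skip them explicitly instead of treating an unlisted price as free.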

FAQ

Common questions about CSimpleQA

What is the CSimpleQA benchmark?

Chinese SimpleQA (CSimpleQA) is the first comprehensive Chinese benchmark for evaluating the factuality of language models when answering short questions. It contains 3,000 high-quality questions spanning 6 major topics and 99 diverse subtopics, and is designed to assess Chinese factual knowledge across humanities, science, engineering, culture, and society.

Where can I find the CSimpleQA paper?

The CSimpleQA paper is available at https://arxiv.org/abs/2411.07140. It describes the benchmark methodology, dataset creation, and evaluation criteria in detail.

What is the CSimpleQA leaderboard?

The CSimpleQA leaderboard ranks 7 AI models by their performance on this benchmark. DeepSeek-V4-Pro-Max by DeepSeek currently leads with a score of 0.844. The average score across all models is 0.788.

What is the highest CSimpleQA score?

The highest CSimpleQA score is 0.844, achieved by DeepSeek-V4-Pro-Max from DeepSeek.

How many models are evaluated on CSimpleQA?

7 models have been evaluated on the CSimpleQA benchmark. All 7 results are self-reported; none have been verified.

What categories does CSimpleQA cover?

CSimpleQA is categorized under general and language. The benchmark evaluates text models with multilingual support.