
BrowseComp-zh

A high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web, consisting of 289 multi-hop questions spanning 11 diverse domains including Film & TV, Technology, Medicine, and History. Questions are reverse-engineered from short, objective, and easily verifiable answers, requiring sophisticated reasoning and information reconciliation beyond basic retrieval. The benchmark addresses linguistic, infrastructural, and censorship-related complexities in Chinese web environments.
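Because BrowseComp-zh answers are short, objective, and easily verifiable, accuracy via normalized exact match is a natural first-pass scoring rule. The sketch below is illustrative only, not the official grader; `normalize` and `score` are hypothetical names:

```python
def normalize(text: str) -> str:
    """Lowercase and keep only alphanumeric characters (CJK characters
    count as alphanumeric in Unicode, so Chinese answers survive intact)."""
    return "".join(ch for ch in text.lower().strip() if ch.isalnum())

def score(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match the reference answer
    after normalization."""
    assert len(predictions) == len(references)
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)

# One match out of two: whitespace is normalized away, wrong answer is not.
print(score(["北京", " 上海 "], ["北京", "广州"]))  # 0.5
```

A leaderboard score such as 0.703 would then correspond to this fraction computed over the full set of 289 questions.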

Paper: https://arxiv.org/abs/2504.19314

Progress Over Time

[Interactive timeline showing model performance evolution on BrowseComp-zh; legend: state-of-the-art frontier, open vs. proprietary models]

BrowseComp-zh Leaderboard

12 models • 0 verified

| # | Model | Organization | Params | Context | Cost (in / out) | License |
|---|-------|--------------|--------|---------|-----------------|---------|
| 1 | Qwen3.5-397B-A17B | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 / $3.60 | — |
| 2 | — | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 | — |
| 3 | — | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 | — |
| 4 | — | — | 560B | 128K | $0.30 / $1.20 | — |
| 5 | — | Zhipu AI | 358B | 205K | $0.60 / $2.20 | — |
| 6 | — | — | 685B | — | — | — |
| 7 | — | — | 1.0T | — | — | — |
| 8 | — | Alibaba Cloud / Qwen Team | 27B | — | — | — |
| 9 | — | — | 671B | 164K | $0.27 / $1.00 | — |
| 10 | — | MiniMax | 230B | 1.0M | $0.30 / $1.20 | — |
| 11 | — | — | 685B | — | — | — |
| 12 | — | — | 671B | 131K | $0.50 / $2.15 | — |

FAQ

Common questions about BrowseComp-zh

What is BrowseComp-zh?
BrowseComp-zh is a high-difficulty benchmark of 289 multi-hop questions spanning 11 domains (including Film & TV, Technology, Medicine, and History), purpose-built to evaluate LLM agents on the Chinese web; see the full description above.

Where can I find the BrowseComp-zh paper?
The paper is available at https://arxiv.org/abs/2504.19314. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the BrowseComp-zh leaderboard?
The leaderboard ranks 12 AI models. Qwen3.5-397B-A17B by Alibaba Cloud / Qwen Team currently leads with a score of 0.703; the average score across all models is 0.597.

What is the highest BrowseComp-zh score?
The highest score is 0.703, achieved by Qwen3.5-397B-A17B from Alibaba Cloud / Qwen Team.

How many models have been evaluated on BrowseComp-zh?
12 models have been evaluated, with 0 verified results and 12 self-reported results.

What categories does BrowseComp-zh fall under?
BrowseComp-zh is categorized under reasoning and search. The benchmark evaluates text models with multilingual support.