BrowseComp-zh
A high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web, consisting of 289 multi-hop questions spanning 11 diverse domains including Film & TV, Technology, Medicine, and History. Questions are reverse-engineered from short, objective, and easily verifiable answers, requiring sophisticated reasoning and information reconciliation beyond basic retrieval. The benchmark addresses linguistic, infrastructural, and censorship-related complexities in Chinese web environments.
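Because each question is reverse-engineered from a short, objective, verifiable answer, a model's leaderboard score can be read as simple accuracy over the 289 questions. The sketch below is illustrative only: the `normalize` helper and exact-match rule are assumptions, and the official evaluation may grade answers differently.

```python
def normalize(text: str) -> str:
    # Illustrative normalization (assumption): trim whitespace, lowercase.
    return text.strip().lower()

def score(predictions: list[str], references: list[str]) -> float:
    """Fraction of questions where the predicted answer exactly
    matches the short reference answer after normalization."""
    assert len(predictions) == len(references)
    correct = sum(
        normalize(p) == normalize(r)
        for p, r in zip(predictions, references)
    )
    return correct / len(references)

# Example: 2 of 3 short answers match, giving a score of ~0.667.
print(score(["1997", " Beijing ", "1889"], ["1997", "beijing", "1901"]))
```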
Progress Over Time
[Interactive timeline: model performance evolution on BrowseComp-zh, with a state-of-the-art frontier line and open vs. proprietary models distinguished in the legend]
BrowseComp-zh Leaderboard
12 models • 0 verified
| # | Organization | Params | Context | Cost (input / output) |
|---|---|---|---|---|
| 1 | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 / $3.60 |
| 2 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 |
| 3 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 |
| 4 | Meituan | 560B | 128K | $0.30 / $1.20 |
| 5 | Zhipu AI | 358B | 205K | $0.60 / $2.20 |
| 6 | DeepSeek | 685B | — | — |
| 7 | Moonshot AI | 1.0T | — | — |
| 8 | Alibaba Cloud / Qwen Team | 27B | — | — |
| 9 | DeepSeek | 671B | 164K | $0.27 / $1.00 |
| 10 | MiniMax | 230B | 1.0M | $0.30 / $1.20 |
| 11 | DeepSeek | 685B | — | — |
| 12 | DeepSeek | 671B | 131K | $0.50 / $2.15 |
FAQ
Common questions about BrowseComp-zh
**What is BrowseComp-zh?** A high-difficulty benchmark of 289 multi-hop questions across 11 domains, purpose-built to evaluate LLM agents on the Chinese web, including its linguistic, infrastructural, and censorship-related complexities.

**Where is the BrowseComp-zh paper?** The BrowseComp-zh paper is available at https://arxiv.org/abs/2504.19314. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

**How are models ranked?** The BrowseComp-zh leaderboard ranks 12 AI models by their performance on this benchmark. Currently, Qwen3.5-397B-A17B by Alibaba Cloud / Qwen Team leads with a score of 0.703. The average score across all models is 0.597.

**What is the highest BrowseComp-zh score?** The highest score is 0.703, achieved by Qwen3.5-397B-A17B from Alibaba Cloud / Qwen Team.

**How many models have been evaluated?** 12 models have been evaluated on the BrowseComp-zh benchmark, with 0 verified results and 12 self-reported results.

**What categories does BrowseComp-zh cover?** BrowseComp-zh is categorized under reasoning and search. The benchmark evaluates text models with multilingual support.