BrowseComp-zh
BrowseComp-zh Leaderboard
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | Alibaba Cloud / Qwen Team | 397B | — | — | |
| 2 | Alibaba Cloud / Qwen Team | 122B | — | — | |
| 3 | Alibaba Cloud / Qwen Team | 35B | — | — | |
| 4 | Meituan | 560B | — | — | |
| 5 | Zhipu AI | 358B | — | — | |
| 6 | DeepSeek | 685B | — | — | |
| 6 | DeepSeek | 685B | — | — | |
| 8 | Moonshot AI | 1.0T | — | — | |
| 9 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 | |
| 10 | DeepSeek | 671B | — | — | |
| 11 | MiniMax | 230B | 1.0M | $0.30 / $1.20 | |
| 12 | DeepSeek | 685B | — | — | |
| 13 | DeepSeek | 671B | 131K | $0.55 / $2.19 | |
What is BrowseComp-zh?
BrowseComp-zh is a high-difficulty benchmark purpose-built to evaluate LLM agents on the Chinese web. It consists of 289 multi-hop questions spanning 11 domains, including Film & TV, Technology, Medicine, and History. Each question is reverse-engineered from a short, objective, and easily verifiable answer, so solving it requires reasoning and information reconciliation beyond basic retrieval. The benchmark addresses the linguistic, infrastructural, and censorship-related complexities of the Chinese web.
BrowseComp-zh is a text benchmark that evaluates models on reasoning and search tasks. LLM Stats tracks 13 models on it, scored on a 0–1 scale. The current average score is roughly 0.6, and the top score is roughly 0.7.
Current leaders
Qwen3.5-397B-A17B from Alibaba Cloud / Qwen Team currently leads the BrowseComp-zh leaderboard with a score of 0.703 across 13 evaluated AI models.
Source paper
- Title: BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese
- Authors: Peilin Zhou, Bruce Leon, Xiang Ying, Can Zhang, and 12 others
- Published: arXiv:2504.19314
Abstract
As large language models (LLMs) evolve into tool-using agents, the ability to browse the web in real-time has become a critical yardstick for measuring their reasoning and retrieval competence. Existing benchmarks such as BrowseComp concentrate on English and overlook the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems -- most notably Chinese. To address this gap, we introduce BrowseComp-ZH, a high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web. BrowseComp-ZH consists of 289 multi-hop questions spanning 11 diverse domains. Each question is reverse-engineered from a short, objective, and easily verifiable answer (e.g., a date, number, or proper noun). A two-stage quality control protocol is applied to strive for high question difficulty and answer uniqueness. We benchmark over 20 state-of-the-art language models and agentic search systems on our proposed BrowseComp-ZH. Despite their strong conversational and retrieval capabilities, most models struggle severely: a large number achieve accuracy rates below 10%, and only a handful exceed 20%. Even the best-performing system, OpenAI's DeepResearch, reaches just 42.9%. These results demonstrate the considerable difficulty of BrowseComp-ZH, where success demands not only effective retrieval strategies, but also sophisticated reasoning and information reconciliation -- capabilities that current models still struggle to master. Our dataset, construction guidelines, and benchmark results have been publicly released at https://github.com/PALIN2018/BrowseComp-ZH.
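The leaderboard reports scores on a 0–1 scale, while the abstract reports accuracy as a percentage, so the two sets of figures are easy to misread side by side. The following is a minimal sketch of the conversion, assuming a leaderboard score is plain accuracy over all 289 questions. That assumption is not confirmed by the source, and evaluation setups may differ between the paper and the leaderboard. The helper names `to_percent` and `approx_correct` are hypothetical.

```python
TOTAL_QUESTIONS = 289  # benchmark size stated in the abstract


def to_percent(score: float) -> float:
    """Convert a 0-1 leaderboard score to a percentage."""
    return score * 100.0


def approx_correct(score: float, total: int = TOTAL_QUESTIONS) -> int:
    """Estimate the number of correct answers behind a 0-1 score.

    Assumes the score is simple accuracy over `total` questions
    (an assumption, not something the source states).
    """
    return round(score * total)


leader_score = 0.703       # current leaderboard leader, 0-1 scale
paper_best = 42.9 / 100.0  # best system in the paper, given there as 42.9%

for label, score in [("leaderboard leader", leader_score),
                     ("paper best", paper_best)]:
    print(f"{label}: {to_percent(score):.1f}% "
          f"(about {approx_correct(score)} of {TOTAL_QUESTIONS} questions)")
```

Under these assumptions the question counts are rounded estimates, useful only for a rough sense of scale rather than as reported results.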
FAQ
Common questions about the BrowseComp-zh benchmark and leaderboard.