BrowseComp-zh

BrowseComp-zh Leaderboard

13 models

Rank  Organization               Parameters  Context  Cost           License
1     Alibaba Cloud / Qwen Team  397B        -        -              -
2     Alibaba Cloud / Qwen Team  122B        -        -              -
3     Alibaba Cloud / Qwen Team  35B         -        -              -
4     -                          560B        -        -              -
5     Zhipu AI                   358B        -        -              -
6     -                          685B        -        -              -
6     -                          685B        -        -              -
8     -                          1.0T        -        -              -
9     Alibaba Cloud / Qwen Team  27B         262K     $0.30 / $2.40  -
10    -                          671B        -        -              -
11    MiniMax                    230B        1.0M     $0.30 / $1.20  -
12    -                          685B        -        -              -
13    -                          671B        131K     $0.55 / $2.19  -

About this benchmark

What is BrowseComp-zh?

BrowseComp-zh is a high-difficulty benchmark built to evaluate LLM agents on the Chinese web. It consists of 289 multi-hop questions spanning 11 domains, including Film & TV, Technology, Medicine, and History. Each question is reverse-engineered from a short, objective, easily verifiable answer, so solving it requires reasoning and information reconciliation beyond basic retrieval. The benchmark addresses the linguistic, infrastructural, and censorship-related complexities of Chinese web environments.

BrowseComp-zh is a text benchmark that evaluates models on reasoning and search tasks. LLM Stats tracks 13 models on this benchmark, scored on a 0–1 scale. The current average score is 0.601, and the leader scores 0.703.


Current leaders

Qwen3.5-397B-A17B from Alibaba Cloud / Qwen Team currently leads the BrowseComp-zh leaderboard with a score of 0.703 across 13 evaluated AI models.

Rank  Model              Organization               Score
1     Qwen3.5-397B-A17B  Alibaba Cloud / Qwen Team  70.3%
2     Qwen3.5-122B-A10B  Alibaba Cloud / Qwen Team  69.9%
3     Qwen3.5-35B-A3B    Alibaba Cloud / Qwen Team  69.5%
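
The leader list shows percentages while the rest of the page quotes scores on a 0–1 scale. The short Python sketch below converts between the two and estimates how many of the 289 questions each score would correspond to. It assumes the score is plain accuracy over all 289 questions, which this page does not confirm; the helper names `as_percent` and `approx_correct` are illustrative only.

```python
# Top-three scores as listed on this page (0-1 scale).
scores = {
    "Qwen3.5-397B-A17B": 0.703,
    "Qwen3.5-122B-A10B": 0.699,
    "Qwen3.5-35B-A3B": 0.695,
}

NUM_QUESTIONS = 289  # benchmark size stated in the description


def as_percent(score: float) -> str:
    """Format a 0-1 score as a percentage with one decimal place."""
    return f"{score * 100:.1f}%"


def approx_correct(score: float, total: int = NUM_QUESTIONS) -> int:
    """Estimate the number of correctly answered questions.

    Assumption: the score is plain accuracy over all `total` questions.
    """
    return round(score * total)


for name, score in scores.items():
    print(name, as_percent(score), approx_correct(score))
```

Under that assumption, the top three models are separated by roughly two questions, so small differences in rank should be read with caution.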

Source paper

Title: BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese
Authors: Peilin Zhou, Bruce Leon, Xiang Ying, Can Zhang, and 12 others
Published: April 2025 (arXiv:2504.19314)

Abstract

As large language models (LLMs) evolve into tool-using agents, the ability to browse the web in real-time has become a critical yardstick for measuring their reasoning and retrieval competence. Existing benchmarks such as BrowseComp concentrate on English and overlook the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems -- most notably Chinese. To address this gap, we introduce BrowseComp-ZH, a high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web. BrowseComp-ZH consists of 289 multi-hop questions spanning 11 diverse domains. Each question is reverse-engineered from a short, objective, and easily verifiable answer (e.g., a date, number, or proper noun). A two-stage quality control protocol is applied to strive for high question difficulty and answer uniqueness. We benchmark over 20 state-of-the-art language models and agentic search systems on our proposed BrowseComp-ZH. Despite their strong conversational and retrieval capabilities, most models struggle severely: a large number achieve accuracy rates below 10%, and only a handful exceed 20%. Even the best-performing system, OpenAI's DeepResearch, reaches just 42.9%. These results demonstrate the considerable difficulty of BrowseComp-ZH, where success demands not only effective retrieval strategies, but also sophisticated reasoning and information reconciliation -- capabilities that current models still struggle to master. Our dataset, construction guidelines, and benchmark results have been publicly released at https://github.com/PALIN2018/BrowseComp-ZH.
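
Because each question has a short, objective answer (a date, number, or proper noun), scoring can in principle be as simple as comparing a normalized prediction with the reference. The Python sketch below illustrates that idea with a normalized exact-match accuracy. It is not the grading procedure used by the paper or by this leaderboard, which is not described on this page; the function names and the sample answers are hypothetical.

```python
import unicodedata


def normalize(answer: str) -> str:
    """Canonicalize a short answer before comparison."""
    # NFKC folds full-width digits and Latin letters (common in
    # Chinese text) into their ASCII equivalents.
    text = unicodedata.normalize("NFKC", answer)
    # Remove all whitespace and ignore letter case.
    return "".join(text.split()).casefold()


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that match their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Hypothetical answers, for illustration only.
predictions = ["２０２１年", " Qwen ", "北京"]
references = ["2021年", "qwen", "上海"]
print(accuracy(predictions, references))  # two of the three pairs match
```

Exact matching is strict: a correct answer phrased differently from the reference would be marked wrong, which is one reason real evaluations often use a more lenient judge.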

FAQ

Common questions about the BrowseComp-zh benchmark and leaderboard.

What is the BrowseComp-zh benchmark?

BrowseComp-zh is a benchmark of 289 multi-hop questions across 11 domains that evaluates how well LLM agents search and reason over the Chinese web. Each question is reverse-engineered from a short, objective, easily verifiable answer, so it demands reasoning and information reconciliation beyond basic retrieval.

What is the BrowseComp-zh leaderboard?

The BrowseComp-zh leaderboard ranks 13 AI models based on their performance on this benchmark. Currently, Qwen3.5-397B-A17B by Alibaba Cloud / Qwen Team leads with a score of 0.703. The average score across all models is 0.601.

What is the highest BrowseComp-zh score?

The highest BrowseComp-zh score is 0.703, achieved by Qwen3.5-397B-A17B from Alibaba Cloud / Qwen Team.

How many models are evaluated on BrowseComp-zh?

13 models have been evaluated on the BrowseComp-zh benchmark. All 13 results are self-reported; none have been independently verified.

Where can I find the BrowseComp-zh paper?

The BrowseComp-zh paper is available at https://arxiv.org/abs/2504.19314. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does BrowseComp-zh cover?

BrowseComp-zh is categorized under reasoning and search. The benchmark evaluates text models with multilingual support.

What's the difference between BrowseComp-zh and BrowseComp?

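BrowseComp-zh is a variant of BrowseComp built for the Chinese web. According to the source paper, BrowseComp concentrates on English, while BrowseComp-zh targets the linguistic, infrastructural, and censorship-related complexities of the Chinese information ecosystem. See the BrowseComp leaderboard for the broader benchmark and per-model comparison.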

What is the best open-source model on BrowseComp-zh?

Qwen3.5-397B-A17B by Alibaba Cloud / Qwen Team is the top-ranked open-source model on BrowseComp-zh, with a score of 0.703 (rank #1).

How recent are the BrowseComp-zh leaderboard results?

The BrowseComp-zh leaderboard was last updated in June 2026 and currently includes 13 evaluated models.