
BrowseComp

BrowseComp is a benchmark comprising 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. The benchmark measures agents' ability to exercise persistence in information gathering, demonstrate creativity in web navigation, and find concise, verifiable answers. Despite the difficulty of the questions, BrowseComp is simple and easy to use, as predicted answers are short and easily verifiable against reference answers.
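Because every reference answer is a short string, scoring reduces to checking a predicted answer against it. Below is a minimal sketch of that verification step; the exact-match normalization is an assumption for illustration (the paper's official evaluation uses a model-based grader to judge semantic equivalence), and all function names are illustrative:

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting
    differences are not counted as wrong answers."""
    return " ".join(text.lower().split())

def is_correct(predicted: str, reference: str) -> bool:
    """Judge a predicted short answer against the reference answer."""
    return normalize(predicted) == normalize(reference)

def accuracy(results: list[tuple[str, str]]) -> float:
    """Fraction of (predicted, reference) pairs judged correct."""
    return sum(is_correct(p, r) for p, r in results) / len(results)

# Example: two questions, one answered correctly.
print(accuracy([("Paris ", "paris"), ("1997", "1995")]))  # 0.5
```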

Paper: https://arxiv.org/abs/2504.12516

Progress Over Time

[Interactive timeline showing model performance evolution on BrowseComp, tracing the state-of-the-art frontier and distinguishing open from proprietary models.]

BrowseComp Leaderboard

39 models

| Rank | Organization | Params | Context | Price (input / output) |
|------|--------------|--------|---------|-------------------------|
| 1 | Anthropic | | | $25.00 / $125.00 |
| 2 | | | 1.0M | $2.50 / $15.00 |
| 3 | | | 1.0M | $5.00 / $25.00 |
| 4 | OpenAI | | 1.0M | $2.50 / $15.00 |
| 5 | Zhipu AI | 754B | 200K | $1.40 / $4.40 |
| 6 | | | 400K | $21.00 / $168.00 |
| 7 | ByteDance | | | |
| 8 | | 230B | 1.0M | $0.30 / $1.20 |
| 9 | Zhipu AI | 744B | 200K | $1.00 / $3.20 |
| 10 | Moonshot AI | 1.0T | 262K | $0.60 / $2.50 |
| 11 | | | 200K | $3.00 / $15.00 |
| 12 | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 / $3.60 |
| 12 | | 196B | 66K | $0.10 / $0.40 |
| 14 | OpenAI | | 400K | $1.75 / $14.00 |
| 15 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 |
| 16 | | 230B | 1.0M | $0.30 / $1.20 |
| 17 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 |
| 17 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 |
| 19 | | 1.0T | | |
| 20 | | 309B | 256K | $0.10 / $0.30 |
| 21 | | 560B | 128K | $0.30 / $1.20 |
| 22 | OpenAI | | 400K | $1.25 / $10.00 |
| 23 | Zhipu AI | 358B | 205K | $0.60 / $2.20 |
| 24 | OpenAI | | 200K | $1.10 / $4.40 |
| 25 | | 685B | 164K | $0.26 / $0.38 |
| 25 | | 685B | | |
| 27 | OpenAI | | 200K | $2.00 / $8.00 |
| 28 | Sarvam AI | 105B | | |
| 29 | Zhipu AI | 357B | 131K | $0.55 / $2.19 |
| 30 | | | 2.0M | $0.20 / $0.50 |
| 31 | MiniMax | 230B | 1.0M | $0.30 / $1.20 |
| 32 | | 30B | 128K | $0.07 / $0.40 |
| 33 | | 685B | | |
| 34 | Sarvam AI | 30B | | |
| 35 | | 120B | 262K | $0.10 / $0.50 |
| 36 | | 671B | 164K | $0.27 / $1.00 |
| 37 | Zhipu AI | 355B | 131K | $0.40 / $1.60 |
| 38 | Zhipu AI | 106B | | |
| 39 | | 671B | 131K | $0.50 / $2.15 |
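The price column pairs an input rate with an output rate. As a worked example of what a single browsing run might cost (assuming, as is conventional for such listings, that the rates are USD per million tokens; the token counts below are made-up illustrations):

```python
# Hypothetical usage for one multi-hop browsing run: repeated page
# fetches make these workloads heavily input-dominated.
input_tokens, output_tokens = 500_000, 20_000

# Rates from the rank-2 row above: $2.50 input / $15.00 output.
rate_in, rate_out = 2.50, 15.00

cost = (input_tokens / 1e6) * rate_in + (output_tokens / 1e6) * rate_out
print(f"${cost:.2f}")  # -> $1.55
```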

FAQ

Common questions about BrowseComp

What is BrowseComp?
BrowseComp is a benchmark of 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. It measures persistence in information gathering, creativity in web navigation, and the ability to produce concise, verifiable answers; predicted answers are short and easily checked against reference answers.

Where can I read the BrowseComp paper?
The BrowseComp paper is available at https://arxiv.org/abs/2504.12516. It details the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the BrowseComp leaderboard?
The leaderboard ranks 39 AI models by their performance on this benchmark. Claude Mythos Preview by Anthropic currently leads with a score of 0.869; the average score across all models is 0.573.

What is the highest BrowseComp score?
The highest BrowseComp score is 0.869, achieved by Claude Mythos Preview from Anthropic.

How many models have been evaluated on BrowseComp?
39 models have been evaluated on the BrowseComp benchmark, with 0 verified results and 39 self-reported results.

What categories does BrowseComp fall under?
BrowseComp is categorized under agents, reasoning, and search, and evaluates text models.

Sub-benchmarks

BrowseComp Long Context 128k

A challenging benchmark for evaluating web browsing agents' ability to persistently navigate the internet and find hard-to-locate, entangled information. Comprises 1,266 questions requiring strategic reasoning, creative search, and interpretation of retrieved content, with short and easily verifiable answers.

Modality: text · Max score: 1
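A 128k cap mostly stresses how an agent manages its accumulated browsing history. As an illustration only (this strategy, the characters-per-token heuristic, and the function names are assumptions, not part of the benchmark specification), one common scaffold-side approach is to drop the oldest page fetches until the transcript fits the budget:

```python
def trim_history(messages: list[str], budget: int = 128_000) -> list[str]:
    """Keep the first message (instructions) plus as many of the most
    recent browsing results as fit within a rough token budget."""
    def tokens(text: str) -> int:
        return len(text) // 4  # crude characters-per-token estimate

    system, *history = messages
    while history and tokens(system) + sum(map(tokens, history)) > budget:
        history.pop(0)  # discard the oldest page fetch first
    return [system, *history]
```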

BrowseComp Long Context 256k

BrowseComp is a benchmark for measuring the ability of agents to browse the web, comprising 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy to use, as predicted answers are short and easily verifiable against reference answers. The benchmark focuses on questions where answers are obscure, time-invariant, and well-supported by evidence scattered across the open web.

Modality: text · Max score: 1

BrowseComp-VL

BrowseComp-VL is the vision-language variant of BrowseComp, evaluating multimodal models on web browsing comprehension tasks that require processing visual web page content alongside text.

Modality: multimodal · Max score: 1

BrowseComp-zh

A high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web, consisting of 289 multi-hop questions spanning 11 diverse domains including Film & TV, Technology, Medicine, and History. Questions are reverse-engineered from short, objective, and easily verifiable answers, requiring sophisticated reasoning and information reconciliation beyond basic retrieval. The benchmark addresses linguistic, infrastructural, and censorship-related complexities in Chinese web environments.

Modality: text · Max score: 1