
BrowseComp

BrowseComp is a benchmark of 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. It measures an agent's ability to persist in information gathering, navigate the web creatively, and produce concise, verifiable answers. Despite the difficulty of the questions, BrowseComp is simple and easy to use: predicted answers are short and straightforward to check against reference answers.
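Grading therefore only needs to compare a model's short predicted answer against the short reference answer. The sketch below is a minimal illustrative check assuming a simple exact-match comparison after normalization; it is not the official evaluation code, which may use a model-based grader instead.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def is_correct(predicted: str, reference: str) -> bool:
    """Treat a prediction as correct if it matches the reference after normalization."""
    return normalize(predicted) == normalize(reference)


# Example: a prediction differing only in casing and punctuation still counts.
print(is_correct("  The Catcher in the Rye! ", "the catcher in the rye"))  # True
```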

Paper: https://arxiv.org/abs/2504.12516

Progress Over Time

[Interactive timeline of model performance on BrowseComp over time, showing the state-of-the-art frontier with separate series for open and proprietary models.]

BrowseComp Leaderboard

36 models • 0 verified
| Rank | Model / Organization | Params | Context | Cost ($ per 1M tokens, input / output) |
|------|----------------------|--------|---------|----------------------------------------|
| 1 | Gemini 3.1 Pro (Google) | — | 1.0M | $2.50 / $15.00 |
| 2 | — | — | 1.0M | $5.00 / $25.00 |
| 3 | OpenAI | — | 1.0M | $2.50 / $15.00 |
| 4 | — | — | 400K | $21.00 / $168.00 |
| 5 | ByteDance | — | — | — |
| 6 | — | 230B | 1.0M | $0.30 / $1.20 |
| 7 | Zhipu AI | 744B | 200K | $1.00 / $3.20 |
| 8 | Moonshot AI | 1.0T | 262K | $0.60 / $2.50 |
| 9 | — | — | 200K | $3.00 / $15.00 |
| 10 | Alibaba Cloud / Qwen Team | 397B | 262K | $0.60 / $3.60 |
| 10 | — | 196B | 66K | $0.10 / $0.40 |
| 12 | OpenAI | — | 400K | $1.75 / $14.00 |
| 13 | Alibaba Cloud / Qwen Team | 122B | 262K | $0.40 / $3.20 |
| 14 | — | 230B | 1.0M | $0.30 / $1.20 |
| 15 | Alibaba Cloud / Qwen Team | 35B | 262K | $0.25 / $2.00 |
| 15 | Alibaba Cloud / Qwen Team | 27B | — | — |
| 17 | — | 1.0T | — | — |
| 18 | — | 309B | 256K | $0.10 / $0.30 |
| 19 | — | 560B | 128K | $0.30 / $1.20 |
| 20 | OpenAI | — | 400K | $1.25 / $10.00 |
| 21 | Zhipu AI | 358B | 205K | $0.60 / $2.20 |
| 22 | OpenAI | — | 200K | $1.10 / $4.40 |
| 23 | — | 685B | — | — |
| 24 | OpenAI | — | 200K | $2.00 / $8.00 |
| 25 | Sarvam AI | 105B | — | — |
| 26 | Zhipu AI | 357B | 131K | $0.55 / $2.19 |
| 27 | — | — | 2.0M | $0.20 / $0.50 |
| 28 | MiniMax | 230B | 1.0M | $0.30 / $1.20 |
| 29 | — | 30B | 128K | $0.07 / $0.40 |
| 30 | — | 685B | — | — |
| 31 | Sarvam AI | 30B | — | — |
| 32 | — | 120B | 262K | $0.10 / $0.50 |
| 33 | — | 671B | 164K | $0.27 / $1.00 |
| 34 | Zhipu AI | 355B | 131K | $0.40 / $1.60 |
| 35 | Zhipu AI | 106B | — | — |
| 36 | — | 671B | 131K | $0.50 / $2.15 |
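The Cost column lists prices per million input and output tokens. As a rough illustration of how those prices translate into the cost of a full benchmark run, the sketch below multiplies them by assumed per-question token counts; those counts are hypothetical placeholders, not measured values, and real agentic browsing runs vary widely in length.

```python
# Hypothetical cost estimate for one pass over BrowseComp, using the
# per-1M-token prices from the Cost column.
QUESTIONS = 1266                 # number of BrowseComp questions
INPUT_TOKENS_PER_Q = 50_000      # assumed average input (prompts + retrieved pages)
OUTPUT_TOKENS_PER_Q = 5_000      # assumed average output (reasoning + final answer)


def run_cost(input_price_per_m: float, output_price_per_m: float) -> float:
    """Total cost in dollars for one full pass over the benchmark."""
    input_cost = QUESTIONS * INPUT_TOKENS_PER_Q / 1_000_000 * input_price_per_m
    output_cost = QUESTIONS * OUTPUT_TOKENS_PER_Q / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Example: a model priced at $2.50 / $15.00 per 1M tokens (the rank-1 pricing above).
print(f"${run_cost(2.50, 15.00):,.2f}")  # ≈ $253.20 under these assumptions
```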

FAQ

Common questions about BrowseComp

Q: What is BrowseComp?
A: BrowseComp is a benchmark of 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. It measures an agent's ability to persist in information gathering, navigate the web creatively, and produce concise, verifiable answers. Despite the difficulty of the questions, BrowseComp is simple and easy to use, as predicted answers are short and straightforward to check against reference answers.

Q: Where can I find the BrowseComp paper?
A: The BrowseComp paper is available at https://arxiv.org/abs/2504.12516. It describes the benchmark methodology, dataset creation, and evaluation criteria in detail.

Q: Which model leads the BrowseComp leaderboard?
A: The leaderboard ranks 36 AI models by their performance on the benchmark. Gemini 3.1 Pro by Google currently leads with a score of 0.859; the average score across all models is 0.560.

Q: What is the highest BrowseComp score?
A: The highest BrowseComp score is 0.859, achieved by Gemini 3.1 Pro from Google.

Q: How many models have been evaluated on BrowseComp?
A: 36 models have been evaluated on the BrowseComp benchmark, with 0 verified results and 36 self-reported results.

Q: What categories does BrowseComp belong to?
A: BrowseComp is categorized under agents, reasoning, and search, and it evaluates text models.

Sub-benchmarks