BrowseComp

Progress Over Time

Interactive timeline showing model performance evolution on BrowseComp, with series for the state-of-the-art frontier, open models, and proprietary models.

BrowseComp Leaderboard

52 models
Rank  Organization               Params  Context  Cost per 1M tokens (input / output)
1     -                          -       -        -
2     -                          -       -        -
3     Moonshot AI                1.0T    262K     $0.75 / $3.50
4     ByteDance                  -       -        -
5     -                          -       1.0M     $2.50 / $15.00
6     ByteDance                  -       -        -
7     Anthropic                  -       1.0M     $3.00 / $15.00
8     OpenAI                     -       1.1M     $5.00 / $30.00
9     -                          -       1.0M     $5.00 / $25.00
10    -                          -       1.0M     $5.00 / $25.00
11    MiniMax                    -       1.0M     $0.30 / $1.20
12    -                          1.6T    1.0M     $1.60 / $3.20
13    OpenAI                     -       1.0M     $2.50 / $15.00
14    -                          -       1.0M     $5.00 / $25.00
14    Zhipu AI                   754B    200K     $1.40 / $4.40
16    -                          -       -        -
17    ByteDance                  -       256K     $0.50 / $3.00
18    -                          230B    1.0M     $0.30 / $1.20
19    Zhipu AI                   744B    200K     $1.00 / $3.20
20    Moonshot AI                1.0T    -        -
21    -                          -       200K     $3.00 / $15.00
22    -                          284B    1.0M     $0.10 / $0.20
23    Alibaba Cloud / Qwen Team  397B    -        -
23    -                          196B    66K      $0.10 / $0.40
25    OpenAI                     -       400K     $1.75 / $14.00
26    Alibaba Cloud / Qwen Team  122B    -        -
27    -                          230B    1.0M     $0.30 / $1.20
28    Alibaba Cloud / Qwen Team  35B     -        -
28    Alibaba Cloud / Qwen Team  27B     262K     $0.30 / $2.40
30    -                          1.0T    -        -
31    -                          309B    -        -
32    -                          560B    -        -
33    OpenAI                     -       -        -
34    Zhipu AI                   358B    -        -
35    OpenAI                     -       -        -
36    -                          685B    -        -
36    -                          685B    -        -
38    OpenAI                     -       -        -
39    Sarvam AI                  105B    -        -
40    -                          128B    256K     $1.50 / $7.50
41    Zhipu AI                   357B    -        -
42    -                          -       2.0M     $0.20 / $0.50
43    -                          550B    -        -
44    MiniMax                    230B    1.0M     $0.30 / $1.20
45    -                          30B     -        -
46    -                          685B    -        -
47    Sarvam AI                  30B     -        -
48    -                          120B    -        -
49    -                          671B    -        -
50    Zhipu AI                   355B    -        -

(- = not listed)

Showing 1 to 50 of 52
Sub-benchmarks

BrowseComp Long Context 128k

A challenging benchmark for evaluating web browsing agents' ability to persistently navigate the internet and find hard-to-locate, entangled information. Comprises 1,266 questions requiring strategic reasoning, creative search, and interpretation of retrieved content, with short and easily verifiable answers.

Modality: text. Max score: 1.

BrowseComp Long Context 256k

BrowseComp is a benchmark for measuring the ability of agents to browse the web, comprising 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers. The benchmark focuses on questions where answers are obscure, time-invariant, and well-supported by evidence scattered across the open web.

Modality: text. Max score: 1.

BrowseComp-VL

BrowseComp-VL is the vision-language variant of BrowseComp, evaluating multimodal models on web browsing comprehension tasks that require processing visual web page content alongside text.

Modality: multimodal. Max score: 1.

BrowseComp-zh

A high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web, consisting of 289 multi-hop questions spanning 11 diverse domains including Film & TV, Technology, Medicine, and History. Questions are reverse-engineered from short, objective, and easily verifiable answers, requiring sophisticated reasoning and information reconciliation beyond basic retrieval. The benchmark addresses linguistic, infrastructural, and censorship-related complexities in Chinese web environments.

Modality: text. Max score: 1.

About this benchmark

What is BrowseComp?

BrowseComp is a benchmark comprising 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. The benchmark measures agents' ability to exercise persistence in information gathering, demonstrate creativity in web navigation, and find concise, verifiable answers. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers.
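
Because predicted answers are short and checked against reference answers, scoring can be illustrated with a small sketch. The code below is an assumption-laden illustration, not the benchmark's official grader: it swaps in a normalized exact-match comparison, and the helper names (`normalize`, `is_correct`, `accuracy`) are hypothetical. The harness published with the paper may grade answers differently.

```python
import re
import unicodedata


def normalize(answer: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", answer)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace anything that is not a word character or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def is_correct(predicted: str, reference: str) -> bool:
    """Hypothetical grader: normalized exact match (an assumption)."""
    return normalize(predicted) == normalize(reference)


def accuracy(pairs) -> float:
    """Fraction of (predicted, reference) pairs judged correct, on a 0-1 scale."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(is_correct(p, r) for p, r in pairs) / len(pairs)
```

Under this sketch a score is simply the fraction of questions answered correctly, which matches the 0 to 1 scale used on the leaderboard. Exact match is stricter than a human or model-based judge would be, so it would likely undercount answers that are correct but phrased differently.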

BrowseComp is a text benchmark that evaluates models on reasoning, search, and agent tasks. LLM Stats tracks 52 models on this benchmark, scored on a 0–1 scale. The current average is 0.624, with the leader at 0.901.

Current leaders

GPT-5.5 Pro from OpenAI currently leads the BrowseComp leaderboard with a score of 0.901 across 52 evaluated AI models.

1. GPT-5.5 Pro (OpenAI): 90.1%
2. Claude Mythos Preview (Anthropic): 86.9%
3. Kimi K2.6 (Moonshot AI): 86.3%

Source paper

Title: BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents
Authors: Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, and 6 others
Published: April 2025 (arXiv:2504.12516)
Abstract

We present BrowseComp, a simple yet challenging benchmark for measuring the ability for agents to browse the web. BrowseComp comprises 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers. BrowseComp for browsing agents can be seen as analogous to how programming competitions are an incomplete but useful benchmark for coding agents. While BrowseComp sidesteps challenges of a true user query distribution, like generating long answers or resolving ambiguity, it measures the important core capability of exercising persistence and creativity in finding information. BrowseComp can be found at https://github.com/openai/simple-evals.

FAQ

Common questions about the BrowseComp benchmark and leaderboard.

What is the BrowseComp leaderboard?

The BrowseComp leaderboard ranks 52 AI models based on their performance on this benchmark. Currently, GPT-5.5 Pro by OpenAI leads with a score of 0.901. The average score across all models is 0.624.

What is the highest BrowseComp score?

The highest BrowseComp score is 0.901, achieved by GPT-5.5 Pro from OpenAI.

How many models are evaluated on BrowseComp?

52 models have been evaluated on the BrowseComp benchmark, with 0 verified results and 52 self-reported results.

Where can I find the BrowseComp paper?

The BrowseComp paper is available at https://arxiv.org/abs/2504.12516. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does BrowseComp cover?

BrowseComp is categorized under reasoning, search, and agents. The benchmark evaluates text models.

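Are there variants of BrowseComp?

Yes. This page lists four sub-benchmarks: BrowseComp Long Context 128k, BrowseComp Long Context 256k, BrowseComp-VL (a vision-language variant), and BrowseComp-zh (289 multi-hop questions on the Chinese web).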

What is the best open-source model on BrowseComp?

Kimi K2.6 by Moonshot AI is the top-ranked open-source model on BrowseComp, with a score of 0.863 (rank #3).

Which model offers the best value on BrowseComp?

Among models scoring within 10% of the leader, MiniMax M3 from MiniMax is the cheapest, at $0.30 per million input tokens with a score of 0.835.
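
The selection rule in this answer can be sketched as a filter followed by a minimum. The code below is a minimal illustration under stated assumptions: "within 10% of the leader" is read as a relative threshold (score at least 90% of the leader's score), the `Entry` type and `best_value` function are hypothetical names, and only the two priced models whose scores appear on this page are included, plus one clearly hypothetical entry to show the filter at work.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Entry:
    name: str
    score: float        # BrowseComp score on a 0-1 scale
    input_price: float  # USD per million input tokens


def best_value(entries: Iterable[Entry], leader_score: float,
               tolerance: float = 0.10) -> Optional[Entry]:
    """Cheapest entry whose score is within `tolerance` of the leader.

    Assumption: "within 10%" is relative, i.e. score >= leader * (1 - tolerance).
    """
    threshold = leader_score * (1.0 - tolerance)
    eligible = [e for e in entries if e.score >= threshold]
    if not eligible:
        return None
    # Lowest input price wins; ties go to the higher score.
    return min(eligible, key=lambda e: (e.input_price, -e.score))


ENTRIES = [
    Entry("Kimi K2.6", 0.863, 0.75),    # score and price as listed on this page
    Entry("MiniMax M3", 0.835, 0.30),   # score and price as listed on this page
    Entry("hypothetical-cheap-model", 0.70, 0.05),  # invented, below the threshold
]

winner = best_value(ENTRIES, leader_score=0.901)
print(winner.name if winner else "no eligible model")
```

With a leader score of 0.901 the threshold is about 0.811, so the hypothetical cheap model is filtered out despite its lower price. An absolute reading of "within 10%" (score at least 0.801) would give the same result for these entries.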

How recent are the BrowseComp leaderboard results?

The BrowseComp leaderboard was last updated in June 2026 and currently includes 52 evaluated models.