BrowseComp
BrowseComp Leaderboard
| Rank | Organization | Parameters | Context | Cost | License | |
|---|---|---|---|---|---|---|
| 1 | OpenAI | — | — | — | ||
| 2 | Anthropic | — | — | — | ||
| 3 | Moonshot AI | 1.0T | 262K | $0.75 / $3.50 | ||
| 4 | ByteDance (Seed 2.1 Pro) | — | — | — | ||
| 5 | Google | — | 1.0M | $2.50 / $15.00 | ||
| 6 | ByteDance | — | — | — | ||
| 7 | Anthropic | — | 1.0M | $3.00 / $15.00 | ||
| 8 | OpenAI | — | 1.1M | $5.00 / $30.00 | ||
| 9 | Anthropic | — | 1.0M | $5.00 / $25.00 | ||
| 10 | Anthropic | — | 1.0M | $5.00 / $25.00 | ||
| 11 | MiniMax | — | 1.0M | $0.30 / $1.20 | ||
| 12 | DeepSeek | 1.6T | 1.0M | $1.60 / $3.20 | ||
| 13 | OpenAI | — | 1.0M | $2.50 / $15.00 | ||
| 14 | Anthropic | — | 1.0M | $5.00 / $25.00 | ||
| 14 | Zhipu AI | 754B | 200K | $1.40 / $4.40 | ||
| 16 | OpenAI | — | — | — | ||
| 17 | ByteDance | — | 256K | $0.50 / $3.00 | ||
| 18 | MiniMax | 230B | 1.0M | $0.30 / $1.20 | ||
| 19 | Zhipu AI | 744B | 200K | $1.00 / $3.20 | ||
| 20 | Moonshot AI | 1.0T | — | — | ||
| 21 | Anthropic | — | 200K | $3.00 / $15.00 | ||
| 22 | DeepSeek | 284B | 1.0M | $0.10 / $0.20 | ||
| 23 | Alibaba Cloud / Qwen Team | 397B | — | — | ||
| 23 | StepFun | 196B | 66K | $0.10 / $0.40 | ||
| 25 | OpenAI | — | 400K | $1.75 / $14.00 | ||
| 26 | Alibaba Cloud / Qwen Team | 122B | — | — | ||
| 27 | MiniMax | 230B | 1.0M | $0.30 / $1.20 | ||
| 28 | Alibaba Cloud / Qwen Team | 35B | — | — | ||
| 28 | Alibaba Cloud / Qwen Team | 27B | 262K | $0.30 / $2.40 | ||
| 30 | Moonshot AI | 1.0T | — | — | ||
| 31 | Xiaomi | 309B | — | — | ||
| 32 | Meituan | 560B | — | — | ||
| 33 | OpenAI | — | — | — | ||
| 34 | Zhipu AI | 358B | — | — | ||
| 35 | OpenAI | — | — | — | ||
| 36 | DeepSeek | 685B | — | — | ||
| 36 | DeepSeek | 685B | — | — | ||
| 38 | OpenAI | — | — | — | ||
| 39 | Sarvam AI | 105B | — | — | ||
| 40 | Mistral AI | 128B | 256K | $1.50 / $7.50 | ||
| 41 | Zhipu AI | 357B | — | — | ||
| 42 | xAI | — | 2.0M | $0.20 / $0.50 | ||
| 43 | — | 550B | — | — | ||
| 44 | MiniMax | 230B | 1.0M | $0.30 / $1.20 | ||
| 45 | Zhipu AI | 30B | — | — | ||
| 46 | DeepSeek | 685B | — | — | ||
| 47 | Sarvam AI | 30B | — | — | ||
| 48 | — | 120B | — | — | ||
| 49 | DeepSeek | 671B | — | — | ||
| 50 | Zhipu AI | 355B | — | — |
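
The repeated rank numbers in the table (14, 14, then 16; 23, 23, then 25) are consistent with standard competition ranking, where tied entries share a rank and the following rank is skipped. The sketch below shows one way such ranks could be derived from a list of scores. The function name `competition_ranks` and the example scores are hypothetical, and nothing here is taken from the leaderboard's actual implementation.

```python
from typing import List


def competition_ranks(scores: List[float]) -> List[int]:
    """Return 1-based competition ranks; a higher score earns a better rank.

    Entries with equal scores share a rank, and the next distinct score
    receives a rank equal to its position in the sorted order.
    """
    # Indices sorted by descending score. sorted() is stable, so tied
    # entries keep their original relative order.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        previous = order[position - 2] if position > 1 else None
        if previous is not None and scores[idx] == scores[previous]:
            # Tie with the entry just above: reuse its rank.
            ranks[idx] = ranks[previous]
        else:
            ranks[idx] = position
    return ranks


# Hypothetical scores, not values from the leaderboard.
example = [0.90, 0.80, 0.80, 0.70]
print(competition_ranks(example))
```

Exact float comparison is used for ties here on the assumption that scores are compared as reported (for example, rounded to three decimals); a real pipeline might round first.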
Sub-benchmarks
BrowseComp Long Context 128k
A challenging benchmark for evaluating web browsing agents' ability to persistently navigate the internet and find hard-to-locate, entangled information. Comprises 1,266 questions requiring strategic reasoning, creative search, and interpretation of retrieved content, with short and easily verifiable answers.
BrowseComp Long Context 256k
BrowseComp is a benchmark for measuring the ability of agents to browse the web, comprising 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy to use, as predicted answers are short and easily verifiable against reference answers. The benchmark focuses on questions whose answers are obscure, time-invariant, and well supported by evidence scattered across the open web.
BrowseComp-VL
BrowseComp-VL is the vision-language variant of BrowseComp, evaluating multimodal models on web browsing comprehension tasks that require processing visual web page content alongside text.
BrowseComp-zh
A high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web, consisting of 289 multi-hop questions spanning 11 diverse domains including Film & TV, Technology, Medicine, and History. Questions are reverse-engineered from short, objective, and easily verifiable answers, requiring sophisticated reasoning and information reconciliation beyond basic retrieval. The benchmark addresses linguistic, infrastructural, and censorship-related complexities in Chinese web environments.
What is BrowseComp?
BrowseComp is a benchmark comprising 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. The benchmark measures agents' ability to persist in gathering information, navigate the web creatively, and find concise, verifiable answers. Despite the difficulty of the questions, BrowseComp is simple and easy to use, as predicted answers are short and easily verifiable against reference answers.
BrowseComp is a text benchmark that evaluates models on reasoning, search, and agent tasks. LLM Stats tracks 52 models on this benchmark, scored on a 0–1 scale. The current average is 0.6, with the leader at 0.9.
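
Because predicted answers are short and checked against reference answers, a score on a 0–1 scale can be read as the fraction of questions answered correctly. The following is a minimal sketch of that idea using normalized exact match. It is an illustrative simplification, not the benchmark's official grading procedure, which this page does not describe. The helper names `normalize` and `accuracy` and the sample answers are hypothetical.

```python
import re
import string
from typing import Sequence


def normalize(answer: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = answer.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions matching their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    correct = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


# Hypothetical answers, not items from the benchmark.
preds = ["Paris.", "  the  Nile", "1989"]
refs = ["paris", "The Nile", "1990"]
print(round(accuracy(preds, refs), 3))
```

Exact match is strict: a correct answer phrased differently from the reference would be counted as wrong, which is one reason a real evaluation may use a more lenient grader.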
Current leaders
GPT-5.5 Pro from OpenAI currently leads the BrowseComp leaderboard with a score of 0.901, the highest among the 52 evaluated AI models.
Source paper
- Title: BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents
- Authors: Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, and 6 others
- Published: arXiv:2504.12516
Abstract
We present BrowseComp, a simple yet challenging benchmark for measuring the ability for agents to browse the web. BrowseComp comprises 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers. BrowseComp for browsing agents can be seen as analogous to how programming competitions are an incomplete but useful benchmark for coding agents. While BrowseComp sidesteps challenges of a true user query distribution, like generating long answers or resolving ambiguity, it measures the important core capability of exercising persistence and creativity in finding information. BrowseComp can be found at https://github.com/openai/simple-evals.