BrowseComp Long Context 128k
BrowseComp Long Context 128k Leaderboard
| Rank | Organization | | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | OpenAI | — | 400K | $1.75 / $14.00 | |
| 2 | OpenAI | — | 400K | $1.25 / $10.00 | |
| 2 | OpenAI | — | 400K | $1.25 / $10.00 | |
| 2 | OpenAI | — | — | — | |
| 2 | OpenAI | — | — | — | |
What is BrowseComp Long Context 128k?
A challenging benchmark for evaluating web browsing agents' ability to persistently navigate the internet and find hard-to-locate, entangled information. Comprises 1,266 questions requiring strategic reasoning, creative search, and interpretation of retrieved content, with short and easily verifiable answers.
BrowseComp Long Context 128k is a text benchmark that evaluates models on reasoning and search tasks. LLM Stats tracks 5 models on this benchmark, each scored on a 0–1 scale. The current average score is 0.9, and the leader scores 0.920.
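To make the 0–1 scale concrete, here is a minimal sketch of how an accuracy score over short, verifiable answers could be computed. It is not the benchmark's official grader, which this page does not describe. It assumes normalized exact-match comparison as a stand-in, and the `normalize` and `accuracy` helpers and the toy data are hypothetical.

```python
import re
import string


def normalize(answer: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    answer = answer.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", answer).strip()


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that match their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one question is required")
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical toy data, not taken from the benchmark.
predictions = ["Paris", " the  Eiffel Tower.", "1889", "Gustave Eiffel"]
references = ["paris", "The Eiffel Tower", "1887", "Gustave Eiffel"]

score = accuracy(predictions, references)
print(f"{score:.3f}")  # 3 of 4 answers match, so this prints 0.750
```

Exact match is only an approximation here: it is simple and deterministic, but it would mark a correct paraphrase as wrong, so a real grader may need to be more lenient.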
Compare leaders on the best AI for reasoning and best AI for search leaderboards.
Current leaders
GPT-5.2 from OpenAI currently leads the BrowseComp Long Context 128k leaderboard with a score of 0.920, the highest among the 5 evaluated models.
Source paper
- Title: BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents
- Authors: Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, and 6 others
- Published: arXiv:2504.12516
Abstract
We present BrowseComp, a simple yet challenging benchmark for measuring the ability for agents to browse the web. BrowseComp comprises 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers. BrowseComp for browsing agents can be seen as analogous to how programming competitions are an incomplete but useful benchmark for coding agents. While BrowseComp sidesteps challenges of a true user query distribution, like generating long answers or resolving ambiguity, it measures the important core capability of exercising persistence and creativity in finding information. BrowseComp can be found at https://github.com/openai/simple-evals.