
BrowseComp Long Context 256k

BrowseComp is a benchmark for measuring the ability of agents to browse the web, comprising 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy to use, as predicted answers are short and easily verifiable against reference answers. The benchmark focuses on questions where answers are obscure, time-invariant, and well-supported by evidence scattered across the open web.

Paper: https://arxiv.org/abs/2504.12516

Progress Over Time

Interactive timeline showing model performance evolution on BrowseComp Long Context 256k


BrowseComp Long Context 256k Leaderboard

2 models • 0 verified

| Rank | Model   | Organization | Context | Cost (input) | Cost (output) | License |
|------|---------|--------------|---------|--------------|---------------|---------|
| 1    | GPT-5.2 | OpenAI       | 400K    | $1.75        | $14.00        |         |
| 2    |         | OpenAI       | 400K    | $1.25        | $10.00        |         |

FAQ

Common questions about BrowseComp Long Context 256k

What is BrowseComp Long Context 256k?
BrowseComp is a benchmark for measuring the ability of agents to browse the web, comprising 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy to use, as predicted answers are short and easily verifiable against reference answers. The benchmark focuses on questions where answers are obscure, time-invariant, and well-supported by evidence scattered across the open web.
Where can I find the BrowseComp Long Context 256k paper?
The BrowseComp Long Context 256k paper is available at https://arxiv.org/abs/2504.12516. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
Which model leads the BrowseComp Long Context 256k leaderboard?
The BrowseComp Long Context 256k leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, GPT-5.2 by OpenAI leads with a score of 0.898. The average score across all models is 0.893.
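The reported average can be cross-checked against the leaderboard's own numbers; a minimal sketch, assuming the average is a simple mean over the two scores:

```python
# With two models, the reported mean implies the second (unnamed) model's score.
top_score = 0.898   # GPT-5.2, from the leaderboard
mean_score = 0.893  # reported average across all models
n_models = 2

implied_second = mean_score * n_models - top_score
print(round(implied_second, 3))  # 0.888
```

This only holds if the site's "average" is an unweighted mean; if scores are rounded for display, the implied value may be off in the last digit.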
What is the highest BrowseComp Long Context 256k score?
The highest BrowseComp Long Context 256k score is 0.898, achieved by GPT-5.2 from OpenAI.
How many models have been evaluated on BrowseComp Long Context 256k?
2 models have been evaluated on the BrowseComp Long Context 256k benchmark, with 0 verified results and 2 self-reported results.
What categories does BrowseComp Long Context 256k fall under?
BrowseComp Long Context 256k is categorized under reasoning and search. The benchmark evaluates text models.