
BrowseComp Long Context 128k

A challenging benchmark for evaluating web browsing agents' ability to persistently navigate the internet and find hard-to-locate, entangled information. It comprises 1,266 questions requiring strategic reasoning, creative search, and interpretation of retrieved content, with short and easily verifiable answers.

Paper: https://arxiv.org/abs/2504.12516

Progress Over Time

[Interactive timeline showing model performance evolution on BrowseComp Long Context 128k; legend: state-of-the-art frontier, open vs. proprietary models]

BrowseComp Long Context 128k Leaderboard

5 models • 0 verified
| Rank | Model | Organization | Context | Cost | License |
|------|-------|--------------|---------|------|---------|
| 1 | GPT-5.2 | OpenAI | 400K | $1.75 / $14.00 | |
| 2 | (unknown) | OpenAI | 400K | $1.25 / $10.00 | |
| 2 | (unknown) | | 400K | $1.25 / $10.00 | |
| 2 | (unknown) | | 400K | $1.25 / $10.00 | |
| 2 | (unknown) | OpenAI | 400K | $1.25 / $10.00 | |
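The cost columns above list two dollar figures per model. A common convention on leaderboards like this is USD per million tokens for input and output respectively, though the page does not state the unit; assuming that convention, a minimal sketch of estimating the cost of a single long-context run:

```python
# Hypothetical cost estimate. Assumes the leaderboard's two prices are
# USD per 1M input tokens and USD per 1M output tokens -- the page does
# not state the unit, so treat this as an illustration only.
def run_cost(input_tokens: int, output_tokens: int,
             price_in: float, price_out: float) -> float:
    """Estimated USD cost of one run given per-1M-token prices."""
    return (input_tokens / 1_000_000) * price_in \
         + (output_tokens / 1_000_000) * price_out

# e.g. a full 128k-token context with a short answer, at the rank-1 prices
cost = run_cost(128_000, 2_000, 1.75, 14.00)
print(f"${cost:.3f}")  # roughly $0.25 per question under these assumptions
```

At these rates the 128k-token context dominates the cost of each question, which is why long-context benchmarks are comparatively expensive to run end to end.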

FAQ

Common questions about BrowseComp Long Context 128k

What is BrowseComp Long Context 128k?
A challenging benchmark for evaluating web browsing agents' ability to persistently navigate the internet and find hard-to-locate, entangled information. It comprises 1,266 questions requiring strategic reasoning, creative search, and interpretation of retrieved content, with short and easily verifiable answers.

Where can I find the BrowseComp Long Context 128k paper?
The paper is available at https://arxiv.org/abs/2504.12516. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the leaderboard?
The BrowseComp Long Context 128k leaderboard ranks 5 AI models by their performance on this benchmark. GPT-5.2 by OpenAI currently leads with a score of 0.920; the average score across all models is 0.904.

What is the highest score?
The highest BrowseComp Long Context 128k score is 0.920, achieved by GPT-5.2 from OpenAI.

How many models have been evaluated?
5 models have been evaluated on the BrowseComp Long Context 128k benchmark, with 0 verified results and 5 self-reported results.

What categories does the benchmark belong to?
BrowseComp Long Context 128k is categorized under reasoning and search. The benchmark evaluates text models.