BrowseComp Long Context 256k
BrowseComp is a benchmark for measuring the ability of agents to browse the web, comprising 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy to use, as predicted answers are short and easily verifiable against reference answers. The benchmark focuses on questions whose answers are obscure, time-invariant, and well-supported by evidence scattered across the open web.
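The property that makes the benchmark "easy to use" — short answers checked against short references — can be illustrated with a normalized exact-match comparison. This is a sketch only: the function names and normalization rules below are illustrative assumptions, not the benchmark's official grading procedure.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace (illustrative rules)."""
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text)

def is_correct(predicted: str, reference: str) -> bool:
    """Return True when the normalized prediction matches the reference answer."""
    return normalize(predicted) == normalize(reference)

# Short, factual answers are straightforward to verify once normalized.
print(is_correct("  Marie Curie. ", "marie curie"))  # True
```

Because answers are this compact, disagreements between a prediction and a reference are unambiguous in a way that long free-form responses are not.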
Progress Over Time
[Interactive timeline showing model performance evolution on BrowseComp Long Context 256k, with a state-of-the-art frontier and open vs. proprietary models distinguished]
BrowseComp Long Context 256k Leaderboard
2 models • 0 verified
FAQ
Common questions about BrowseComp Long Context 256k
The BrowseComp Long Context 256k paper is available at https://arxiv.org/abs/2504.12516. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The BrowseComp Long Context 256k leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, GPT-5.2 by OpenAI leads with a score of 0.898. The average score across all models is 0.893.
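With only two models on the leaderboard, the reported mean pins down the second model's score. A quick arithmetic check (variable names here are illustrative):

```python
top_score = 0.898   # GPT-5.2 (OpenAI), the current leader
mean_score = 0.893  # reported average across both models

# mean = (top + other) / 2  =>  other = 2 * mean - top
other_score = round(2 * mean_score - top_score, 3)
print(other_score)  # 0.888
```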
The highest BrowseComp Long Context 256k score is 0.898, achieved by GPT-5.2 from OpenAI.
2 models have been evaluated on the BrowseComp Long Context 256k benchmark, with 0 verified results and 2 self-reported results.
BrowseComp Long Context 256k is categorized under reasoning and search. The benchmark evaluates text models.