BrowseComp Long Context 256k Benchmark Leaderboard
BrowseComp is a benchmark for measuring the ability of agents to browse the web, comprising 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy to use: predicted answers are short and easily verifiable against reference answers. The benchmark focuses on questions whose answers are obscure, time-invariant, and well supported by evidence scattered across the open web.
GPT-5.2 from OpenAI currently leads the BrowseComp Long Context 256k leaderboard with a score of 0.898; 2 AI models have been evaluated on this leaderboard to date.