BrowseComp Long Context 256k

Progress Over Time

[Interactive timeline: model performance on BrowseComp Long Context 256k over time, showing the state-of-the-art frontier for open and proprietary models.]

BrowseComp Long Context 256k Leaderboard

2 models

Rank. Model (Organization): Context; Cost; License
1. GPT-5.2 (OpenAI): 400K context; $1.75 / $14.00; license not listed
2. GPT-5 (OpenAI): context, cost, and license not listed

About this benchmark

What is BrowseComp Long Context 256k?

BrowseComp is a benchmark for measuring the ability of agents to browse the web, comprising 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers. The benchmark focuses on questions where answers are obscure, time-invariant, and well-supported by evidence scattered across the open web.

BrowseComp Long Context 256k is a text benchmark evaluating models on reasoning and search tasks. LLM Stats tracks 2 models on this benchmark, scored on a 0–1 scale. The current average is 0.893, with the leader at 0.898.

Compare leaders on the best AI for reasoning and best AI for search leaderboards.

Current leaders

GPT-5.2 from OpenAI currently leads the BrowseComp Long Context 256k leaderboard of 2 evaluated AI models with a score of 0.898.

1. GPT-5.2 (OpenAI): 89.8%
2. GPT-5 (OpenAI): 88.8%

Source paper

Title
BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents
Authors
Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, and 6 others
Abstract

We present BrowseComp, a simple yet challenging benchmark for measuring the ability for agents to browse the web. BrowseComp comprises 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers. BrowseComp for browsing agents can be seen as analogous to how programming competitions are an incomplete but useful benchmark for coding agents. While BrowseComp sidesteps challenges of a true user query distribution, like generating long answers or resolving ambiguity, it measures the important core capability of exercising persistence and creativity in finding information. BrowseComp can be found at https://github.com/openai/simple-evals.

FAQ

Common questions about the BrowseComp Long Context 256k benchmark and leaderboard.

What is the BrowseComp Long Context 256k leaderboard?

The BrowseComp Long Context 256k leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, GPT-5.2 by OpenAI leads with a score of 0.898. The average score across all models is 0.893.
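
The leader and average quoted above can be reproduced from the two listed scores. The following is a minimal sketch, not the site's own code; the `summarize` helper and the `scores` dictionary are hypothetical names introduced for illustration, and the values are copied from this page.

```python
from statistics import mean

# Scores as listed on this page (0-1 scale).
scores = {"GPT-5.2": 0.898, "GPT-5": 0.888}


def summarize(scores):
    """Return (leader name, leader score, average rounded to 3 decimals).

    Hypothetical helper: the page does not say how it rounds,
    so three decimal places is an assumption.
    """
    leader = max(scores, key=scores.get)
    average = round(mean(scores.values()), 3)
    return leader, scores[leader], average


leader, top_score, average = summarize(scores)
print(leader, top_score, average)
```

With only two models, the average sits exactly halfway between the two scores, so it will shift noticeably as soon as a third result is added.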

What is the highest BrowseComp Long Context 256k score?

The highest BrowseComp Long Context 256k score is 0.898, achieved by GPT-5.2 from OpenAI.

How many models are evaluated on BrowseComp Long Context 256k?

Two models have been evaluated on the BrowseComp Long Context 256k benchmark. Both results are self-reported; none are verified.

Where can I find the BrowseComp Long Context 256k paper?

The source paper, "BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents," is available at https://arxiv.org/abs/2504.12516. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does BrowseComp Long Context 256k cover?

BrowseComp Long Context 256k is categorized under reasoning and search. The benchmark evaluates text models.

What's the difference between BrowseComp Long Context 256k and BrowseComp?

BrowseComp Long Context 256k is a variant of BrowseComp. See the BrowseComp leaderboard for the broader benchmark and per-model comparison.

Which model offers the best value on BrowseComp Long Context 256k?

Among models scoring within 10% of the leader, GPT-5.2 from OpenAI is the cheapest, at $1.75 per million input tokens with a score of 0.898.
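
The selection rule described above (keep models close to the leader, then pick the cheapest) can be sketched as follows. This is an illustration under stated assumptions, not the site's implementation: "within 10%" is read as a relative threshold (score at least 90% of the leader's), models without a listed price are skipped, and the `best_value` function and its field names are hypothetical. GPT-5's price is not shown on this page, so it is left as `None`.

```python
def best_value(models, tolerance=0.10):
    """Return the name of the cheapest model scoring within `tolerance` of the leader.

    Assumptions (not confirmed by the page): the threshold is relative
    to the top score, and models with no listed input price are ignored.
    """
    top = max(m["score"] for m in models)
    eligible = [
        m for m in models
        if m["score"] >= top * (1 - tolerance) and m["input_price"] is not None
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda m: m["input_price"])["name"]


# Values copied from this page; input_price is USD per million input tokens.
models = [
    {"name": "GPT-5.2", "score": 0.898, "input_price": 1.75},
    {"name": "GPT-5", "score": 0.888, "input_price": None},  # price not listed here
]

winner = best_value(models)
print(winner)
```

Because only one of the two models has a listed price, the result here is determined by data availability as much as by cost.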

How recent are the BrowseComp Long Context 256k leaderboard results?

The BrowseComp Long Context 256k leaderboard was last updated in June 2026 and currently includes 2 evaluated models.