Graphwalks BFS <128k

A graph reasoning benchmark that evaluates language models' ability to perform breadth-first search (BFS) operations on graphs with context length under 128k tokens, returning nodes reachable at specified depths.
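The operation being tested can be illustrated with a minimal sketch. It assumes a directed graph given as an edge list, and it reads "reachable at a specified depth" as "at shortest-path distance exactly equal to that depth". Both are assumptions, since the benchmark's exact prompt format and scoring are not described here. The function name `bfs_nodes_at_depth` and the sample graph are hypothetical.

```python
from collections import deque


def bfs_nodes_at_depth(edges, start, depth):
    """Return the set of nodes whose shortest-path distance from `start`
    is exactly `depth` (assumed reading of the task; edges are directed)."""
    # Build an adjacency list from the (source, destination) pairs.
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == depth:
            continue  # no need to expand past the requested depth
        for neighbor in adjacency.get(node, []):
            if neighbor not in distance:  # first visit gives the shortest distance
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)

    return {node for node, d in distance.items() if d == depth}


# Hypothetical example graph: a -> b, a -> c, b -> d, c -> d, d -> e
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
print(sorted(bfs_nodes_at_depth(edges, "a", 1)))  # ['b', 'c']
print(sorted(bfs_nodes_at_depth(edges, "a", 2)))  # ['d']
```

A model under evaluation has to produce the same kind of answer from the textual edge list alone, which becomes harder as the graph grows toward the 128k-token limit.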

Progress Over Time

[Figure: timeline of model performance on Graphwalks BFS <128k. Legend: state-of-the-art frontier, open models, proprietary models.]

Graphwalks BFS <128k Leaderboard

11 models

Rank  Organization  Context  Cost               License
1     OpenAI        400K     $1.75 / $14.00
2     OpenAI        1.0M     $2.50 / $15.00
3     OpenAI        400K     $1.25 / $10.00
4     -             400K     $0.75 / $4.50
5     -             400K     $0.20 / $1.25
6     OpenAI        128K     $75.00 / $150.00
7     OpenAI        1.0M     $2.00 / $8.00
7     -             1.0M     $0.40 / $1.60
9     OpenAI        200K     $1.10 / $4.40
10    OpenAI        128K     $2.50 / $10.00
11    -             1.0M     $0.10 / $0.40

FAQ

Common questions about Graphwalks BFS <128k

The Graphwalks BFS <128k leaderboard ranks 11 AI models by their score on this benchmark. GPT-5.2 from OpenAI currently leads with the highest score, 0.940; the average score across all models is 0.662.
Eleven models have been evaluated on the Graphwalks BFS <128k benchmark. All 11 results are self-reported; none have been verified.
Graphwalks BFS <128k is categorized under reasoning and spatial reasoning. The benchmark evaluates text models.