
GraphWalks BFS 256K-1M

GraphWalks BFS 256K-1M is a long-context reasoning benchmark created by OpenAI that tests whether a model can perform breadth-first search (BFS) traversal across massive graphs encoded in 256,000 to 1,024,000 tokens of context. The model receives a graph represented as an edge list and must follow parent-child relationships across the entire extended context window. This is not simple retrieval; it requires relational reasoning over hundreds of thousands of tokens. The dataset includes 100 problems at 256K context and 100 problems at 1,024K context. Claude Mythos Preview leads at 80.0%, more than doubling Opus 4.6 (38.7%) and far exceeding GPT-5.4 (21.4%). The large performance gap between models makes this one of the most discriminating benchmarks for long-context capability in 2026.
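As a rough illustration of the task described above, here is a minimal Python sketch of a BFS over an edge list that returns the nodes at exactly a given depth from a start node. The function name `nodes_at_depth`, the `(source, target)` tuple format, and the reading of "depth" as shortest-path distance are assumptions made for this example; the benchmark's actual prompt encodes the graph as text and may define the operation differently.

```python
def nodes_at_depth(edges, start, depth):
    """Return the set of nodes whose shortest distance from `start` is exactly `depth`.

    `edges` is an iterable of (source, target) pairs (an assumed format).
    """
    # Build an adjacency map: parent -> list of children.
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)

    visited = {start}
    frontier = {start}
    # Expand one BFS level per iteration.
    for _ in range(depth):
        next_frontier = set()
        for node in frontier:
            for child in children.get(node, ()):
                if child not in visited:
                    visited.add(child)
                    next_frontier.add(child)
        frontier = next_frontier
    return frontier


# Toy graph: a -> b, a -> c, b -> d, c -> d, d -> a
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "a")]
depth_two = nodes_at_depth(edges, "a", 2)
```

On this toy graph, `depth_two` contains only `d`: it is reached through both `b` and `c`, and the back edge to `a` is ignored because `a` was already visited. The benchmark asks a model to carry out the same bookkeeping over an edge list spanning hundreds of thousands of tokens.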

Updated 2026-04-07

At 256K-1M tokens, Claude Mythos Preview scores 80.0%. Opus 4.6 drops to 38.7%. GPT-5.4 collapses to 21.4%. That is the largest frontier gap on any single benchmark in 2026.

Scoring: F1 score between the predicted set of reachable nodes and the ground-truth set. Precision = correct nodes / predicted nodes. Recall = correct nodes / true nodes. F1 = harmonic mean of precision and recall.
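The scoring rule above can be sketched in a few lines of Python. This is an illustrative implementation of set-based F1 as described, not the benchmark's official grader; the function name `set_f1` and the handling of empty sets are assumptions.

```python
def set_f1(predicted, truth):
    """F1 between a predicted node set and the ground-truth node set."""
    predicted, truth = set(predicted), set(truth)
    correct = len(predicted & truth)

    # Assumption: two empty sets count as a perfect match.
    if not predicted and not truth:
        return 1.0
    # No overlap means precision or recall is zero, so F1 is zero.
    if correct == 0:
        return 0.0

    precision = correct / len(predicted)  # correct nodes / predicted nodes
    recall = correct / len(truth)         # correct nodes / true nodes
    return 2 * precision * recall / (precision + recall)


# Two of three predicted nodes are right; two of four true nodes are found.
score = set_f1({"b", "c", "e"}, {"b", "c", "d", "f"})
```

Here precision is 2/3 and recall is 1/2, so `score` is 4/7, roughly 0.571. Because F1 is a harmonic mean, a model cannot score well by listing every node in the graph (high recall, low precision) or by returning only one safe node (high precision, low recall).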

Models tested: 4
Top score: 80.0 (Claude Mythos Preview)
Median: 80.0 (min 80.0)
Top-5 spread: σ 0.0
Settled
GraphWalks BFS 256K-1M · Top 4

#1 Claude Mythos Preview: 80.0
#2 Claude Sonnet 4.6: 73.8 (verified)
#3 Claude Opus 4.6: 38.7 (verified)
#4 GPT-5.4: 21.4 (verified)

Source: benchgecko.ai/benchmark/graphwalks-bfs-256k

4 models tested · sorted by score · includes 4 verified scores

Details
Category: Agent
Creator: OpenAI
Max score: 100
Dataset: 1,150 problems
Modality: Text
Format: Parquet
License: MIT
Scoring: F1 score between the predicted set of reachable nodes and the ground-truth set. Precision = correct nodes / predicted nodes. Recall = correct nodes / true nodes. F1 = harmonic mean of precision and recall.
Models: 4
Published: 2025-04-12
Updated: 2026-04-07
Changelog
2025-04-12

Initial dataset published.

2026-02-27

Bugfix: 24/400 parent samples had incorrect ground truth (root node included). BFS prompt clarified to specify nodes at exactly the desired depth. Credit to Opus 4.6 system card.

Tests
Long-context reasoning, Graph traversal, Multi-hop logic, Relational reasoning
Does not test
Speed, Safety, Vision, Tool use, Multilingual, Cost efficiency
Gecko's Take

GraphWalks BFS 256K-1M exposes every "1M context window" marketing claim. If a model cannot score above 40% here, its long-context window is decoration.
