GraphWalks BFS 256K-1M
GraphWalks BFS 256K-1M is a long-context reasoning benchmark created by OpenAI that tests whether a model can perform breadth-first search (BFS) traversal across massive graphs encoded in 256,000 to 1,024,000 tokens of context. The model receives a graph represented as an edge list and must follow parent-child relationships across the entire extended context window. This is not simple retrieval; it requires relational reasoning over hundreds of thousands of tokens. The dataset includes 100 problems at 256K context and 100 problems at 1,024K context. Claude Mythos Preview leads at 80.0%, followed by Claude Sonnet 4.6 at 73.8%; the leader more than doubles Opus 4.6 (38.7%) and far exceeds GPT-5.4 (21.4%). The wide spread between models makes this one of the most discriminating benchmarks for long-context capability in 2026.
At 256K-1M tokens, Claude Mythos scores 80%. Opus 4.6 drops to 38.7%. GPT-5.4 collapses to 21.4%. The largest frontier gap on any single benchmark in 2026.
Scoring: F1 score between the predicted set of reachable nodes and the ground-truth set. Precision = correct nodes / predicted nodes. Recall = correct nodes / true nodes. F1 = harmonic mean of precision and recall.
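The scoring rule above can be illustrated with a minimal sketch. The function name `set_f1` and the sample node IDs are hypothetical and not taken from the official grader, and returning 0 when nothing overlaps is an assumption of this sketch.

```python
def set_f1(predicted: set[str], truth: set[str]) -> float:
    """F1 between a predicted node set and the ground-truth node set."""
    overlap = len(predicted & truth)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0  # assumption: an answer with no correct nodes scores zero
    return 2 * precision * recall / (precision + recall)


# Hypothetical answer: 2 of 3 predicted nodes are correct, 2 of 4 true nodes found.
score = set_f1({"G", "H", "X"}, {"G", "H", "I", "J"})
print(round(score, 3))  # 0.571
```

Here precision is 2/3 and recall is 2/4, so the harmonic mean is 4/7. Because F1 penalizes both extra and missing nodes, a model cannot score well by returning every node in the graph.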
Full rankings
4 models tested · sorted by score · includes 4 verified scores
| # | Model | Score |
|---|---|---|
| 1 | Claude Mythos Preview | 80.0 |
| 2 | Claude Sonnet 4.6 | 73.8 |
| 3 | Claude Opus 4.6 | 38.7 |
| 4 | GPT-5.4 | 21.4 |
Context gap
Short vs long context performance
GPT-5.4 scores 93% on GraphWalks 0-128K but collapses to 21.4% at 256K-1M. Most models show similar degradation curves. Only Mythos holds above 75% across the full range.
How it works
Evaluation methodology
The model receives a directed graph encoded as a plain-text edge list (e.g., "A -> B, A -> C, B -> D") and must perform breadth-first search from a specified starting node to a given depth, then return the set of nodes reached at that depth. The dataset contains 1,150 problems total: 100 at 256K tokens, 100 at 1,024K tokens, and the rest at shorter context lengths. Evaluation uses 3-shot prompting, with three worked examples prepended to each problem. Graphs range from hundreds to thousands of nodes, so the model has to track relationships across the entire context window.
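The traversal the model is asked to carry out can be sketched in ordinary code. This is an illustrative reference, not the dataset's own generator: `parse_edges` and `nodes_at_depth` are hypothetical names, and treating "depth" as shortest distance from the start node is an assumption.

```python
from collections import defaultdict


def parse_edges(text: str) -> dict[str, list[str]]:
    """Build an adjacency list from lines of the form 'A -> B'."""
    graph: dict[str, list[str]] = defaultdict(list)
    for line in text.splitlines():
        if "->" not in line:
            continue  # skip blank lines and non-edge text
        src, dst = (part.strip() for part in line.split("->", 1))
        graph[src].append(dst)
    return graph


def nodes_at_depth(graph: dict[str, list[str]], start: str, depth: int) -> set[str]:
    """Return nodes whose shortest distance from `start` is exactly `depth`."""
    visited = {start}
    frontier = {start}
    for _ in range(depth):
        next_frontier = set()
        for node in frontier:
            for child in graph.get(node, []):
                if child not in visited:  # first visit fixes the node's depth
                    visited.add(child)
                    next_frontier.add(child)
        frontier = next_frontier
    return frontier


edges = """A -> B
A -> C
B -> D
B -> E
C -> F
D -> G
E -> H"""
graph = parse_edges(edges)
print(sorted(nodes_at_depth(graph, "A", 3)))  # ['G', 'H']
```

The code is trivial for a program; the benchmark is hard because the model must do the same bookkeeping over edges scattered across hundreds of thousands of tokens.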
Example prompt from the dataset
Given the following directed graph edges:
A -> B
A -> C
B -> D
B -> E
C -> F
D -> G
E -> H
...
[256,000+ tokens of edges]
Starting from node A, perform a breadth-first search to depth 3. List all reachable nodes at exactly depth 3.
Answer: {G, H, ...}
Industry relevance
Why teams track this benchmark
Most frontier models advertise 1M+ context windows, but GraphWalks BFS 256K-1M separates real long-context reasoning from marketing copy. A model that cannot score above 40% here is not meaningfully processing its full context window. This benchmark is the gold standard for validating long-context claims.
Practical takeaways
By role
Use this benchmark to validate if a model truly works at 200K+ tokens before building retrieval, legal analysis, or codebase understanding features on it.
Mythos's dominance at 80% signals Anthropic's long-context moat. Watch for competitors closing the gap: whoever cracks 90% next shifts the market.
MIT-licensed dataset on HuggingFace with grading code included. 1,150 problems across multiple context lengths. Fully reproducible evaluation.
Data schema
Dataset columns
| Column | Type | Description |
|---|---|---|
| prompt | string | 3-shot example followed by the graph edge list and the operation to perform. |
| answer_nodes | list | Ground-truth list of node IDs the model should return. |
| prompt_chars | int64 | Character count of the prompt (2.7K to 1.75M). |
| problem_type | string | Either "bfs" or "parents" for the graph operation requested. |
| date_added | string | Date the problem was added to the dataset. |
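For readers working with these columns, a small sketch of filtering rows by operation and prompt size may help. Only the column names come from the table; the `Problem` type, the `select` helper, the placeholder rows, and the 1,000,000-character cutoff are hypothetical values chosen for illustration.

```python
from typing import TypedDict


class Problem(TypedDict):
    """One dataset row, mirroring the columns listed in the table."""
    prompt: str
    answer_nodes: list[str]
    prompt_chars: int
    problem_type: str  # "bfs" or "parents"
    date_added: str


def select(rows: list[Problem], problem_type: str, min_chars: int) -> list[Problem]:
    """Keep rows of one operation whose prompt has at least `min_chars` characters."""
    return [
        row for row in rows
        if row["problem_type"] == problem_type and row["prompt_chars"] >= min_chars
    ]


# Placeholder rows, not real dataset entries.
rows: list[Problem] = [
    {"prompt": "...", "answer_nodes": ["G", "H"], "prompt_chars": 2_700,
     "problem_type": "bfs", "date_added": "YYYY-MM-DD"},
    {"prompt": "...", "answer_nodes": ["Q"], "prompt_chars": 1_750_000,
     "problem_type": "bfs", "date_added": "YYYY-MM-DD"},
    {"prompt": "...", "answer_nodes": ["B"], "prompt_chars": 1_200_000,
     "problem_type": "parents", "date_added": "YYYY-MM-DD"},
]

long_bfs = select(rows, "bfs", 1_000_000)
print(len(long_bfs))  # 1
```

Note that `prompt_chars` counts characters, not tokens, so any cutoff used to isolate the 256K-1M token range depends on the tokenizer.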
Extraction and grading
How answers are scored
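# Answer extraction
import re

def get_list(response: str) -> tuple[list[str], bool]:
    """Return (node_ids, parse_failed) from the last line of a response."""
    line = response.split("\n")[-1]
    if "Final Answer:" not in line:
        return [], True
    # Capture only the bracket contents so the "Final Answer:" prefix
    # does not end up in the first node ID.
    list_part = re.search(r"Final Answer: ?\[(.*)\]", line)
    if list_part:
        result = list_part.group(1).split(",")
        # Drop empty strings so "[]" yields [] rather than [""].
        return [item.strip() for item in result if item.strip()], False
    return [], True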
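# Grading (per problem)
n_golden = len(truth_set)     # size of the ground-truth node set
n_sampled = len(sampled_set)  # size of the model's predicted node set
n_overlap = len(sampled_set & truth_set)
recall = n_overlap / n_golden if n_golden > 0 else 0
precision = n_overlap / n_sampled if n_sampled > 0 else 0
# With no overlap both terms are 0, so F1 falls back to 0.
f1 = 2 * (recall * precision) / (recall + precision) \
    if recall + precision > 0 else 0

Frequently asked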
About GraphWalks BFS 256K-1M
What does GraphWalks BFS 256K-1M measure?
GraphWalks BFS 256K-1M is a long-context reasoning benchmark created by OpenAI. It tests whether a model can perform breadth-first search (BFS) traversal across massive graphs encoded in 256,000 to 1,024,000 tokens of context. The model receives a graph as an edge list and must follow parent-child relationships across the entire context window, which requires relational reasoning rather than simple retrieval. The dataset includes 100 problems at 256K context and 100 problems at 1,024K context. Four models have been tested on it, with scores ranging from 21.4 to 80.0 out of 100.
Which model leads on GraphWalks BFS 256K-1M?
Claude Mythos Preview from Anthropic leads GraphWalks BFS 256K-1M with a score of 80.0. The median score across the 4 tested models is 56.25.
Is GraphWalks BFS 256K-1M saturated?
No. The top score is 80.0 out of 100 (80%), so there is still meaningful room for improvement on GraphWalks BFS 256K-1M.
What makes GraphWalks BFS 256K-1M distinctive?
GraphWalks BFS 256K-1M is an agent benchmark with limited overlap to the rest of the catalog; it measures capabilities that are not well covered by other benchmarks we track.
How often is GraphWalks BFS 256K-1M data refreshed?
BenchGecko pulls updates daily. New model scores on GraphWalks BFS 256K-1M appear as soon as they are published by Epoch AI or the model provider.
- Category: Agent
- Creator: OpenAI
- Max score: 100
- Dataset: 1,150 problems
- Modality: Text
- Format: Parquet
- License: MIT
- Scoring: F1 score between the predicted set of reachable nodes and the ground-truth set. Precision = correct nodes / predicted nodes. Recall = correct nodes / true nodes. F1 = harmonic mean of precision and recall.
- Models: 4
- Published: 2025-04-12
- Updated: 2026-04-07
Initial dataset published.
Bugfix: 24/400 parent samples had incorrect ground truth (root node included). BFS prompt clarified to specify nodes at exactly the desired depth. Credit to Opus 4.6 system card.
“GraphWalks BFS 256K-1M exposes every "1M context window" marketing claim. If a model cannot score above 40% here, its long-context window is decoration.”
Top on GraphWalks BFS 256K-1M
Claude Mythos Preview · 80.0
Claude Sonnet 4.6 · 73.8
Claude Opus 4.6 · 38.7
GPT-5.4 · 21.4
More agent benchmarks
Same category · related evaluations