Which model leads on GraphWalks BFS 256K-1M?

Claude Mythos Preview from Anthropic leads GraphWalks BFS 256K-1M with a score of 80.0. The median score across 1 tested model is 80.0.

Is GraphWalks BFS 256K-1M saturated?

No. The top score is 80.0 out of 100 (80%), so there is still meaningful room for improvement on GraphWalks BFS 256K-1M.

What makes GraphWalks BFS 256K-1M distinctive?

GraphWalks BFS 256K-1M is an agent benchmark with limited overlap with the rest of the catalog: it measures capabilities that are not well covered by the other benchmarks we track.

How often is GraphWalks BFS 256K-1M data refreshed?

BenchGecko pulls updates daily. New model scores on GraphWalks BFS 256K-1M appear as soon as they are published by Epoch AI or the model provider.

Benchmark · Agent · Settled

GraphWalks BFS 256K-1M

Name: GraphWalks BFS 256K-1M Benchmark
Creator: BenchGecko
License: https://creativecommons.org/licenses/by/4.0/

GraphWalks BFS 256K-1M is a long-context reasoning benchmark created by OpenAI that tests whether a model can perform breadth-first search (BFS) traversal across massive graphs encoded in 256,000 to 1,024,000 tokens of context. The model receives a graph represented as an edge list and must follow parent-child relationships across the entire extended context window. This is not simple retrieval: it requires relational reasoning over hundreds of thousands of tokens. The dataset includes 100 problems at 256K context and 100 problems at 1,024K context. Claude Mythos Preview leads at 80.0%, more than double Claude Opus 4.6 (38.7%) and far ahead of GPT-5.4 (21.4%). The large performance gap between models makes this one of the most discriminating benchmarks for real long-context capability in 2026.

Updated 2026-04-07

At 256K-1M tokens, Claude Mythos Preview scores 80%. Opus 4.6 drops to 38.7%. GPT-5.4 collapses to 21.4%. That is the largest frontier gap on any single benchmark in 2026.

Scoring: F1 score between the predicted set of reachable nodes and the ground-truth set. Precision = correct nodes / predicted nodes. Recall = correct nodes / true nodes. F1 = harmonic mean of precision and recall.
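As a worked illustration of the scoring rule above, the sketch below computes precision, recall, and F1 for a single made-up prediction. The node IDs and variable names are hypothetical and chosen only for this example; they do not come from the dataset.

```python
# Hypothetical case: 4 true nodes, 5 predicted nodes, 3 of them correct.
truth = {"n1", "n2", "n3", "n4"}
predicted = {"n1", "n2", "n3", "x8", "x9"}

correct = predicted & truth                 # nodes that are both predicted and true
precision = len(correct) / len(predicted)   # 3 / 5
recall = len(correct) / len(truth)          # 3 / 4

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, a model cannot raise its score by listing many extra nodes: the two spurious nodes here pull precision down to 0.60 even though recall is 0.75.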

Models tested

Top score: 80.0 (Claude Mythos Preview)
Median: 80.0 (min 80.0)
Top-5 spread: σ 0.0 · Settled

Full rankings

4 models tested · sorted by score · includes 4 verified scores

#	Model	Score	Price	Source
1	Claude Mythos Preview · Anthropic	80.0	-	Anthropic Mythos System Card, Apr 2026
2	Claude Sonnet 4.6 · Anthropic (verified)	73.8	-	Reddit/Community (omitted from Anthropic official)
3	Claude Opus 4.6 · Anthropic (verified)	38.7	-	Anthropic Opus 4.6 System Card, Feb 2026
4	GPT-5.4 · OpenAI (verified)	21.4	-	OpenAI GPT-5.4 Blog, Mar 2026

How it works

Evaluation methodology

The model receives a directed graph encoded as a plain-text edge list (e.g., "A -> B, A -> C, B -> D") and must perform breadth-first search from a specified starting node to a given depth, then return the set of nodes found at exactly that depth. The dataset contains 1,150 problems in total: 100 at 256K tokens, 100 at 1,024K tokens, and the rest at shorter context lengths. Evaluation uses 3-shot prompting, with three worked examples prepended to each problem. Graphs range from hundreds to thousands of nodes, so the model has to track relationships across the entire context window.


Example prompt from the dataset

Given the following directed graph edges:
A -> B
A -> C
B -> D
B -> E
C -> F
D -> G
E -> H
...
[256,000+ tokens of edges]

Starting from node A, perform a breadth-first search to depth 3. List all reachable nodes at exactly depth 3.

Answer: {G, H, ...}
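For reference, the toy graph in the example prompt above can be solved with an ordinary level-by-level BFS. The sketch below is only an illustration of the task, not the benchmark's own tooling: the function name `nodes_at_depth` is hypothetical, it assumes one "X -> Y" edge per line, and it interprets "exactly depth 3" as a shortest-path distance of 3 from the start node. It covers only the seven edges shown, not the elided ones.

```python
from collections import defaultdict


def nodes_at_depth(edge_lines: list[str], start: str, depth: int) -> set[str]:
    """Return the nodes whose shortest distance from `start` is exactly `depth`."""
    # Build an adjacency list from lines such as "A -> B".
    children = defaultdict(list)
    for line in edge_lines:
        src, dst = (part.strip() for part in line.split("->"))
        children[src].append(dst)

    visited = {start}
    frontier = {start}
    # Expand one BFS level per iteration, never revisiting a node.
    for _ in range(depth):
        next_frontier = set()
        for node in frontier:
            for child in children[node]:
                if child not in visited:
                    visited.add(child)
                    next_frontier.add(child)
        frontier = next_frontier
    return frontier


edges = ["A -> B", "A -> C", "B -> D", "B -> E", "C -> F", "D -> G", "E -> H"]
result = nodes_at_depth(edges, "A", 3)
print(sorted(result))  # ['G', 'H']
```

The algorithm itself is trivial for a program; the benchmark's difficulty comes from asking a language model to carry out the same bookkeeping over an edge list that spans hundreds of thousands of tokens.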

Industry relevance

Why teams track this benchmark

Most frontier models advertise 1M+ context windows, but GraphWalks BFS 256K-1M separates real long-context reasoning from marketing copy. A model that cannot score above 40% here is not meaningfully processing its full context window. This benchmark is the gold standard for validating long-context claims.

Practical takeaways

By role

Builder

Use this benchmark to validate whether a model truly works at 200K+ tokens before building retrieval, legal analysis, or codebase understanding features on it.

Investor

Mythos's dominance at 80% signals Anthropic's long-context moat. Watch for competitors closing the gap: whoever cracks 90% next shifts the market.

Researcher

MIT-licensed dataset on HuggingFace with grading code included. 1,150 problems across multiple context lengths. Fully reproducible evaluation.

Data schema

Dataset columns

Column	Type	Description
prompt	string	3-shot example followed by the graph edge list and the operation to perform.
answer_nodes	list	Ground-truth list of node IDs the model should return.
prompt_chars	int64	Character count of the prompt (2.7K to 1.75M).
problem_type	string	Either "bfs" or "parents" for the graph operation requested.
date_added	string	Date the problem was added to the dataset.
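As a rough sketch of how the columns above might be used, the snippet below selects BFS problems over a size threshold from a list of rows. Everything here is illustrative: the rows are hypothetical stand-ins (real prompts are far longer), `select_problems` is a made-up helper name, the `date_added` values are placeholders, and the 500,000-character cutoff is an arbitrary assumption rather than the benchmark's definition of the 256K-1M bucket.

```python
# Hypothetical rows shaped like the schema above; values are placeholders.
rows = [
    {"prompt": "...", "answer_nodes": ["n7", "n9"], "prompt_chars": 2_700,
     "problem_type": "bfs", "date_added": "2025-04-12"},
    {"prompt": "...", "answer_nodes": ["n3"], "prompt_chars": 1_750_000,
     "problem_type": "bfs", "date_added": "2025-04-12"},
    {"prompt": "...", "answer_nodes": [], "prompt_chars": 900_000,
     "problem_type": "parents", "date_added": "2025-04-12"},
]


def select_problems(rows: list[dict], problem_type: str, min_chars: int = 0) -> list[dict]:
    """Keep rows of one problem type whose prompt has at least `min_chars` characters."""
    return [
        row for row in rows
        if row["problem_type"] == problem_type and row["prompt_chars"] >= min_chars
    ]


# Assumed cutoff for "long" prompts; pick a value that matches your tokenizer.
long_bfs = select_problems(rows, "bfs", min_chars=500_000)
print(len(long_bfs))  # 1
```

Note that `prompt_chars` counts characters, not tokens, so any mapping to the 256K and 1,024K token buckets depends on the tokenizer in use.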

Extraction and grading

How answers are scored

Python

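import re

# Answer extraction: returns (nodes, parse_failed)
def get_list(response: str) -> tuple[list[str], bool]:
    # Only the last line of the response is inspected.
    line = response.split("\n")[-1]
    if "Final Answer:" not in line:
        return [], True
    # Capture only the text between the brackets.
    list_part = re.search(r"Final Answer: ?\[(.*)\]", line)
    if list_part:
        result = list_part.group(1).split(",")
        # An empty list yields [] rather than [""].
        return [item.strip() for item in result if item.strip()], False
    return [], True

# Grading (per problem): sampled_set holds the predicted nodes, truth_set the ground truth
n_golden, n_sampled = len(truth_set), len(sampled_set)
n_overlap = len(sampled_set & truth_set)
recall = n_overlap / n_golden if n_golden > 0 else 0
precision = n_overlap / n_sampled if n_sampled > 0 else 0
if recall + precision > 0:
    f1 = 2 * (recall * precision) / (recall + precision)
else:
    # No overlap: full credit only when both sets are empty.
    f1 = 1 if n_golden == 0 and n_sampled == 0 else 0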

Details

Category: Agent
Creator: OpenAI
Max score: 100
Dataset: 1,150 problems
Modality: Text
Format: Parquet
License: MIT
Scoring: F1 score between the predicted set of reachable nodes and the ground-truth set. Precision = correct nodes / predicted nodes. Recall = correct nodes / true nodes. F1 = harmonic mean of precision and recall.
Models: 4
Published: 2025-04-12
Updated: 2026-04-07

Changelog

2025-04-12

Initial dataset published.

2026-02-27

Bugfix: 24/400 parent samples had incorrect ground truth (root node included). BFS prompt clarified to specify nodes at exactly the desired depth. Credit to Opus 4.6 system card.

Tests

Long-context reasoning · Graph traversal · Multi-hop logic · Relational reasoning

Does not test

Speed · Safety · Vision · Tool use · Multilingual · Cost efficiency

Links

Dataset: OpenAI GraphWalks Dataset (HuggingFace)
Anthropic Mythos System Card, Apr 2026
OpenAI GPT-5.4 Blog, Mar 2026
Anthropic Opus 4.6 System Card, Feb 2026

Gecko's Take

“GraphWalks BFS 256K-1M exposes every "1M context window" marketing claim. If a model cannot score above 40% here, its long-context window is decoration.”

Related benchmarks

APEX-Agents · 17 models
The Agent Company · 13 models
OSWorld · 9 models