Claude Mythos Preview vs Claude Mythos Preview

Side by side · benchmarks, pricing, and signals you can act on.

Winner summary

Claude Mythos Preview ties on 14/14 benchmarks

Both picks are Claude Mythos Preview, so all 14 shared benchmarks are tied. Scores are identical in reasoning · knowledge · agentic · coding · math.

Claude Mythos Preview

14 / 14

Claude Mythos Preview

14 / 14

Category leads

reasoning · Claude Mythos Preview
knowledge · Claude Mythos Preview
agentic · Claude Mythos Preview
coding · Claude Mythos Preview
math · Claude Mythos Preview

Hype vs Reality

Attention vs performance

Claude Mythos Preview

#4 by perf·#2 by attention

DESERVED

Claude Mythos Preview

#4 by perf·#2 by attention

DESERVED

Best value

Pricing unknown

Claude Mythos Preview

—

no price

Claude Mythos Preview

—

no price

Vendor risk

Who is behind the model

Anthropic

$380.0B·Tier 1

Medium risk

Anthropic

$380.0B·Tier 1

Medium risk

Head to head

14 benchmarks · 2 models

Claude Mythos Preview · Claude Mythos Preview

CharXiv Reasoning

CharXiv Reasoning · tests a model's ability to understand and reason about charts, plots, and figures extracted from real arXiv scientific papers. The model must answer questions that require reading axes, comparing data series, identifying trends, and drawing conclusions from visual scientific data. This benchmark specifically measures the intersection of visual understanding and scientific reasoning · a critical capability for research assistants and document analysis. Performance varies dramatically with and without tool use, making it a key differentiator for multimodal and agentic AI systems.

Claude Mythos Preview

86.1

Claude Mythos Preview

86.1

CharXiv Reasoning (with tools)

CharXiv Reasoning (with tools) · the tool-augmented variant of CharXiv Reasoning. Models can use code execution, image processing, and other tools to analyze charts from arXiv papers. Claude Mythos Preview scores 93.2% with tools vs 86.1% without · demonstrating how tool use dramatically improves visual scientific reasoning. The gap between tool-augmented and bare performance is a key signal for agent capability.

Claude Mythos Preview

93.2

Claude Mythos Preview

93.2
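The gap between bare and tool-augmented scores can be worked out directly. The following is a minimal sketch using the CharXiv Reasoning and HLE scores listed on this page; the `SCORES` layout and the `tool_uplift` helper are illustrative names, not part of any benchmark tooling.

```python
# Scores (percent) copied from this page. The dictionary layout is illustrative.
SCORES = {
    "CharXiv Reasoning": {"bare": 86.1, "with_tools": 93.2},
    "HLE": {"bare": 56.8, "with_tools": 64.7},
}


def tool_uplift(bare: float, with_tools: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    gain = with_tools - bare
    return round(gain, 1), round(100 * gain / bare, 1)


for name, s in SCORES.items():
    points, relative = tool_uplift(s["bare"], s["with_tools"])
    print(f"{name}: +{points} points ({relative}% relative)")
```

On these numbers the absolute gain is 7.1 points for CharXiv Reasoning and 7.9 points for HLE. The relative figure is larger for HLE because its bare score starts lower.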

GPQA Diamond

GPQA Diamond (Graduate-Level Google-Proof Q&A, Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.

Claude Mythos Preview

94.5

Claude Mythos Preview

94.5

GraphWalks BFS 256K-1M

GraphWalks BFS 256K-1M · a long-context reasoning benchmark created by OpenAI that tests whether a model can perform breadth-first search (BFS) traversal across massive graphs encoded in 256,000 to 1,024,000 tokens of context. The model receives a graph represented as an edge list and must follow parent-child relationships across the entire extended context window. This is not simple retrieval · it requires true relational reasoning over hundreds of thousands of tokens. The dataset includes 100 problems at 256K context and 100 problems at 1,024K context. Claude Mythos Preview leads at 80.0%, more than doubling Opus 4.6 (38.7%) and far exceeding GPT-5.4 (21.4%). The massive performance gap between models makes this one of the most discriminating benchmarks for real long-context capability in 2026.

Claude Mythos Preview

80.0

Claude Mythos Preview

80.0
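The traversal this benchmark asks for can be illustrated on a tiny graph. This is a sketch only: it assumes the edge list is given as (parent, child) pairs, and the `bfs_reachable` function and the sample graph are hypothetical. It does not reproduce the benchmark's actual prompt format or grading.

```python
from collections import defaultdict, deque


def bfs_reachable(edges, start, max_depth):
    """Return the nodes reachable from `start` in at most `max_depth` hops.

    `edges` is an iterable of (parent, child) pairs. The start node itself
    is excluded from the result.
    """
    # Build an adjacency list from the flat edge list.
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand past the requested depth
        for nxt in children[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))

    seen.discard(start)
    return seen


# Hypothetical five-edge graph, for illustration only.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
print(sorted(bfs_reachable(edges, "a", 2)))  # ['b', 'c', 'd']
```

A program holds the adjacency list in memory. A model working from the raw context has to locate each node's outgoing edges wherever they occur in the token stream, which is why the task goes beyond simple retrieval.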

HLE

HLE (Humanity's Last Exam) · a reasoning benchmark designed to be the hardest public evaluation of AI. Questions span mathematics, physics, philosophy, and logic, and are curated to be at or beyond the frontier of human expert capability. Models are tested with and without tool augmentation. Claude Opus 4.7 scores 46.9% without tools and 54.7% with tools; Claude Mythos Preview scores 56.8% without tools and 64.7% with tools, making HLE one of the few benchmarks where scores without tools remain below 60%.

Claude Mythos Preview

56.8

Claude Mythos Preview

56.8

HLE (with tools)

Claude Mythos Preview

64.7

Claude Mythos Preview

64.7

MMMLU

Claude Mythos Preview

92.7

Claude Mythos Preview

92.7

OSWorld

OSWorld · tests AI agents on real-world computer tasks across operating systems, including web browsing, file management, and application use.

Claude Mythos Preview

79.6

Claude Mythos Preview

79.6

SWE-bench Multilingual

Claude Mythos Preview

87.3

Claude Mythos Preview

87.3

SWE-bench Multimodal

Claude Mythos Preview

59.0

Claude Mythos Preview

59.0

SWE-bench Pro

Claude Mythos Preview

77.8

Claude Mythos Preview

77.8

SWE-bench Verified

SWE-bench Verified · 500 human-validated tasks from 12 real Python repositories (Django, Flask, scikit-learn, sympy, and others). Each task requires the model to produce a git patch that resolves a real GitHub issue and passes the test suite. The verified subset eliminates ambiguous tasks from the original SWE-bench. Claude Mythos Preview leads at 93.9%, crossing 90% for the first time in 2026. Opus 4.6 scores 80.8%. The benchmark remains the most-cited evaluation for code-generation capability.

Claude Mythos Preview

93.9

Claude Mythos Preview

93.9

Terminal-Bench 2.0

Terminal-Bench 2.0 · evaluates AI agents on real terminal-based coding tasks · writing scripts, debugging, running tests, and managing projects entirely through command-line interaction. Tests both code quality and terminal fluency. Claude Opus 4.7 scores 69.4%, demonstrating significant agentic terminal competence.

Claude Mythos Preview

82.0

Claude Mythos Preview

82.0

USAMO

Claude Mythos Preview

97.6

Claude Mythos Preview

97.6

Full benchmark table

Benchmark	Claude Mythos Preview	Claude Mythos Preview
CharXiv Reasoning	86.1	86.1
CharXiv Reasoning (with tools)	93.2	93.2
GPQA Diamond	94.5	94.5
GraphWalks BFS 256K-1M	80.0	80.0
HLE	56.8	56.8
HLE (with tools)	64.7	64.7
MMMLU	92.7	92.7
OSWorld	79.6	79.6
SWE-bench Multilingual	87.3	87.3
SWE-bench Multimodal	59.0	59.0
SWE-bench Pro	77.8	77.8
SWE-bench Verified	93.9	93.9
Terminal-Bench 2.0	82.0	82.0
USAMO	97.6	97.6
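A head-to-head tally over the table above can be sketched as a per-benchmark comparison. The `tally` function below is hypothetical and treats equal scores as ties; that is an assumption about how such a count might reasonably work, not a description of how this page computes its summary.

```python
def tally(rows):
    """Count wins and ties. `rows` holds (benchmark, score_a, score_b) tuples."""
    wins_a = wins_b = ties = 0
    for _name, a, b in rows:
        if a > b:
            wins_a += 1
        elif b > a:
            wins_b += 1
        else:
            ties += 1
    return wins_a, wins_b, ties


# Scores copied from the table above (both columns are the same model).
ROWS = [
    ("CharXiv Reasoning", 86.1, 86.1),
    ("CharXiv Reasoning (with tools)", 93.2, 93.2),
    ("GPQA Diamond", 94.5, 94.5),
    ("GraphWalks BFS 256K-1M", 80.0, 80.0),
    ("HLE", 56.8, 56.8),
    ("HLE (with tools)", 64.7, 64.7),
    ("MMMLU", 92.7, 92.7),
    ("OSWorld", 79.6, 79.6),
    ("SWE-bench Multilingual", 87.3, 87.3),
    ("SWE-bench Multimodal", 59.0, 59.0),
    ("SWE-bench Pro", 77.8, 77.8),
    ("SWE-bench Verified", 93.9, 93.9),
    ("Terminal-Bench 2.0", 82.0, 82.0),
    ("USAMO", 97.6, 97.6),
]

print(tally(ROWS))  # (0, 0, 14)
```

With identical columns, every row lands in the tie bucket, so neither side records a win.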

Pricing · per 1M tokens · projected $/mo at 10M tokens

Model	Input	Output	Context	Projected $/mo
Claude Mythos Preview	—	—	1.0M tokens (~500 books)	—
Claude Mythos Preview	—	—	1.0M tokens (~500 books)	—
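A projected monthly figure of this kind could be derived from per-1M-token prices as sketched below. The page does not say how the 10M tokens are split between input and output, so the `input_share` parameter (defaulting to an even split) is an assumption, and `projected_monthly_cost` is a hypothetical helper. Unknown prices are passed through as `None` rather than guessed.

```python
from typing import Optional


def projected_monthly_cost(
    input_price: Optional[float],
    output_price: Optional[float],
    monthly_tokens: int = 10_000_000,
    input_share: float = 0.5,  # assumption: the page does not state the split
) -> Optional[float]:
    """Projected $/month from per-1M-token prices, or None if a price is unknown."""
    if input_price is None or output_price is None:
        return None
    input_tokens = monthly_tokens * input_share
    output_tokens = monthly_tokens - input_tokens
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Both prices are listed as unknown on this page, so no projection is possible.
print(projected_monthly_cost(None, None))  # None
```

Returning `None` for a missing price keeps the unknown cells in the table as unknowns instead of turning them into a misleading zero.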

People also compared

Claude Mythos Preview vs GPT-5.5
Claude Mythos Preview vs Claude Opus 4.6
Claude Mythos Preview vs GPT-5.4
Claude Mythos Preview vs Gemini 3.1 Pro Preview
Claude Mythos Preview vs o3 Pro
Claude Mythos Preview vs GPT-5.5 Pro
Claude Mythos Preview vs GPT-5 Chat
Claude Mythos Preview vs Qwen3.5 397B A17B