Claude Mythos Preview vs Claude Mythos Preview
Side by side · benchmarks, pricing, and signals you can act on.
Winner summary
Tied on 14/14 benchmarks
Both picks are Claude Mythos Preview, so all 14 shared benchmarks are tied with identical scores across reasoning · knowledge · agentic.
Category leads
reasoning · Claude Mythos Preview
knowledge · Claude Mythos Preview
agentic · Claude Mythos Preview
coding · Claude Mythos Preview
math · Claude Mythos Preview
Hype vs Reality
Attention vs performance
Claude Mythos Preview
#4 by perf·#2 by attention
Claude Mythos Preview
#4 by perf·#2 by attention
Best value
Pricing unknown
Claude Mythos Preview
—
no price
Claude Mythos Preview
—
no price
Vendor risk
Who is behind the model
Anthropic
$380.0B·Tier 1
Anthropic
$380.0B·Tier 1
Head to head
14 benchmarks · 2 models
Claude Mythos Preview · Claude Mythos Preview
CharXiv Reasoning
CharXiv Reasoning · tests a model's ability to understand and reason about charts, plots, and figures extracted from real arXiv scientific papers. The model must answer questions that require reading axes, comparing data series, identifying trends, and drawing conclusions from visual scientific data. This benchmark specifically measures the intersection of visual understanding and scientific reasoning · a critical capability for research assistants and document analysis. Performance varies dramatically with and without tool use, making it a key differentiator for multimodal and agentic AI systems.
Claude Mythos Preview
86.1
Claude Mythos Preview
86.1
CharXiv Reasoning (with tools)
CharXiv Reasoning (with tools) · the tool-augmented variant of CharXiv Reasoning. Models can use code execution, image processing, and other tools to analyze charts from arXiv papers. Claude Mythos Preview scores 93.2% with tools vs 86.1% without · a 7.1-point gain that shows how much tool use improves visual scientific reasoning. The gap between tool-augmented and bare performance is a key signal for agent capability.
Claude Mythos Preview
93.2
Claude Mythos Preview
93.2
GPQA Diamond
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
Claude Mythos Preview
94.5
Claude Mythos Preview
94.5
GraphWalks BFS 256K-1M
GraphWalks BFS 256K-1M · a long-context reasoning benchmark created by OpenAI that tests whether a model can perform breadth-first search (BFS) traversal across massive graphs encoded in 256,000 to 1,024,000 tokens of context. The model receives a graph represented as an edge list and must follow parent-child relationships across the entire extended context window. This is not simple retrieval · it requires true relational reasoning over hundreds of thousands of tokens. The dataset includes 100 problems at 256K context and 100 problems at 1,024K context. Claude Mythos Preview leads at 80.0%, more than doubling Opus 4.6 (38.7%) and far exceeding GPT-5.4 (21.4%). The massive performance gap between models makes this one of the most discriminating benchmarks for real long-context capability in 2026.
Claude Mythos Preview
80.0
Claude Mythos Preview
80.0
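The traversal that GraphWalks BFS asks for can be sketched in a few lines. This is a minimal illustration, not the benchmark's own harness: the function name `bfs_levels`, the depth limit, and the toy edge list are all hypothetical and chosen only to show breadth-first search over a parent-child edge list.

```python
from collections import defaultdict, deque


def bfs_levels(edges, start, max_depth):
    """Return {node: depth} for every node reachable from `start`
    within `max_depth` hops, following parent -> child edges."""
    # Build an adjacency list from the flat edge list.
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand past the requested depth
        for nxt in children[node]:
            if nxt not in depth:  # first visit is the shortest path in BFS
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return depth


# Hypothetical toy graph; real benchmark graphs span hundreds of thousands of tokens.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
result = bfs_levels(edges, "a", 2)
print(result)
```

A program does this trivially; the benchmark is hard because the model must carry out the same bookkeeping over an edge list spread across the whole context window.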
HLE
HLE (Humanity's Last Exam) · a reasoning benchmark designed to be the hardest public evaluation of AI. Questions span mathematics, physics, philosophy, and logic · curated to be at or beyond the frontier of human expert capability. Tested with and without tool augmentation. Claude Opus 4.7 scores 46.9% without tools and 54.7% with tools; Claude Mythos Preview scores 56.8% without tools and 64.7% with tools · making it one of the few benchmarks where top scores without tools remain below 60%.
Claude Mythos Preview
56.8
Claude Mythos Preview
56.8
HLE (with tools)
Claude Mythos Preview
64.7
Claude Mythos Preview
64.7
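The with-tools and without-tools pairs on this page (CharXiv Reasoning and HLE) can be turned into a percentage-point gain with simple subtraction. The sketch below uses only scores listed above; the helper name `tool_gain` is hypothetical, and rounding to one decimal is an assumption made to match the precision of the published scores.

```python
# Scores copied from this page: (without tools, with tools)
scores = {
    "CharXiv Reasoning": (86.1, 93.2),
    "HLE": (56.8, 64.7),
}


def tool_gain(without_tools, with_tools):
    """Percentage-point gain from tool use, rounded to one decimal place."""
    return round(with_tools - without_tools, 1)


gains = {name: tool_gain(*pair) for name, pair in scores.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain} points with tools")
```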
MMMLU
Claude Mythos Preview
92.7
Claude Mythos Preview
92.7
OSWorld
OSWorld · tests AI agents on real-world computer tasks across operating systems, including web browsing, file management, and application use.
Claude Mythos Preview
79.6
Claude Mythos Preview
79.6
SWE-bench Multilingual
Claude Mythos Preview
87.3
Claude Mythos Preview
87.3
SWE-bench Multimodal
Claude Mythos Preview
59.0
Claude Mythos Preview
59.0
SWE-bench Pro
Claude Mythos Preview
77.8
Claude Mythos Preview
77.8
SWE-bench Verified
SWE-bench Verified · 500 human-validated tasks from 12 real Python repositories (Django, Flask, scikit-learn, sympy, and others). Each task requires the model to produce a git patch that resolves a real GitHub issue and passes the test suite. The verified subset eliminates ambiguous tasks from the original SWE-bench. Claude Mythos Preview leads at 93.9%, crossing 90% for the first time in 2026. Opus 4.6 scores 80.8%. The benchmark remains the most-cited evaluation for code-generation capability.
Claude Mythos Preview
93.9
Claude Mythos Preview
93.9
Terminal-Bench 2.0
Terminal-Bench 2.0 · evaluates AI agents on real terminal-based coding tasks · writing scripts, debugging, running tests, and managing projects entirely through command-line interaction. Tests both code quality and terminal fluency. Claude Opus 4.7 scores 69.4%, demonstrating significant agentic terminal competence.
Claude Mythos Preview
82.0
Claude Mythos Preview
82.0
USAMO
Claude Mythos Preview
97.6
Claude Mythos Preview
97.6
Full benchmark table
| Benchmark | Claude Mythos Preview | Claude Mythos Preview |
|---|---|---|
| CharXiv Reasoning | 86.1 | 86.1 |
| CharXiv Reasoning (with tools) | 93.2 | 93.2 |
| GPQA Diamond | 94.5 | 94.5 |
| GraphWalks BFS 256K-1M | 80.0 | 80.0 |
| HLE | 56.8 | 56.8 |
| HLE (with tools) | 64.7 | 64.7 |
| MMMLU | 92.7 | 92.7 |
| OSWorld | 79.6 | 79.6 |
| SWE-bench Multilingual | 87.3 | 87.3 |
| SWE-bench Multimodal | 59.0 | 59.0 |
| SWE-bench Pro | 77.8 | 77.8 |
| SWE-bench Verified | 93.9 | 93.9 |
| Terminal-Bench 2.0 | 82.0 | 82.0 |
| USAMO | 97.6 | 97.6 |
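A head-to-head tally over the table can be sketched as below. The page does not say how it handles equal scores, so counting ties separately is an assumption, and the names `tally` and `rows` are hypothetical. The scores are copied from the table above.

```python
def tally(rows):
    """Count wins for column a, wins for column b, and ties."""
    counts = {"a": 0, "b": 0, "tie": 0}
    for _name, a, b in rows:
        if a > b:
            counts["a"] += 1
        elif b > a:
            counts["b"] += 1
        else:
            counts["tie"] += 1  # assumption: equal scores count as a tie
    return counts


# (benchmark, first column score, second column score)
rows = [
    ("CharXiv Reasoning", 86.1, 86.1),
    ("CharXiv Reasoning (with tools)", 93.2, 93.2),
    ("GPQA Diamond", 94.5, 94.5),
    ("GraphWalks BFS 256K-1M", 80.0, 80.0),
    ("HLE", 56.8, 56.8),
    ("HLE (with tools)", 64.7, 64.7),
    ("MMMLU", 92.7, 92.7),
    ("OSWorld", 79.6, 79.6),
    ("SWE-bench Multilingual", 87.3, 87.3),
    ("SWE-bench Multimodal", 59.0, 59.0),
    ("SWE-bench Pro", 77.8, 77.8),
    ("SWE-bench Verified", 93.9, 93.9),
    ("Terminal-Bench 2.0", 82.0, 82.0),
    ("USAMO", 97.6, 97.6),
]

print(tally(rows))
```

Because both columns are the same model, every row lands in the tie bucket under this counting rule.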
Pricing · per 1M tokens · projected $/mo at 10M tokens
| Model | Input | Output | Context | Projected $/mo |
|---|---|---|---|---|
| Claude Mythos Preview | — | — | 1.0M tokens (~500 books) | — |
| Claude Mythos Preview | — | — | 1.0M tokens (~500 books) | — |
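The projected $/mo column could be derived from per-1M-token prices roughly as follows. This is a sketch under stated assumptions: the page does not say how the 10M monthly tokens split between input and output, so the 50/50 `input_share` default is a guess, and the function name `projected_monthly_cost` is hypothetical. With no published price, the result stays unknown.

```python
from typing import Optional


def projected_monthly_cost(
    input_price: Optional[float],
    output_price: Optional[float],
    monthly_tokens: int = 10_000_000,
    input_share: float = 0.5,  # assumption: the page does not state the input/output mix
) -> Optional[float]:
    """Projected monthly cost in dollars, given prices per 1M tokens.

    Returns None when either price is unknown, matching the "—" cells above.
    """
    if input_price is None or output_price is None:
        return None
    millions = monthly_tokens / 1_000_000
    blended = input_share * input_price + (1 - input_share) * output_price
    return millions * blended


# No price is listed for Claude Mythos Preview, so the projection is unknown.
unknown = projected_monthly_cost(None, None)

# Purely hypothetical prices, used only to show the arithmetic.
example = projected_monthly_cost(2.0, 10.0)
print(unknown, example)
```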