Anthropic's most capable model · Claude Mythos Preview. Tops SWE-bench Verified (93.9%), GPQA Diamond (94.5%), USAMO (97.6%), and HLE with tools (64.7%). Adaptive thinking at max effort, context up to 1M tokens.
Tested on 14 benchmarks with an 81.8% average score. Top scores: USAMO (97.6%), GPQA Diamond (94.5%), SWE-bench Verified (93.9%).
Real-world software engineering tasks from GitHub issues. Models must diagnose bugs and write patches that pass test suites. Human-verified subset of SWE-bench.
SWE-bench extended to non-Python languages. Tests coding ability across Java, JS, Go, Rust, and more.
Complex terminal-based engineering tasks. Models must use command-line tools, navigate filesystems, and debug systems through shell interaction.
Chart reasoning with tool use. Models can use code execution to analyze scientific figures.
Chart and figure reasoning from arXiv papers. Tests ability to interpret scientific visualizations.
Graph traversal benchmark at 256K context. Tests ability to follow breadth-first search paths in large graph structures.
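To illustrate the kind of traversal the benchmark checks (not the benchmark's actual harness, which embeds graphs in long text contexts), here is a minimal breadth-first search sketch over a hypothetical adjacency-dict graph:

```python
from collections import deque

def bfs_order(graph, start):
    """Return nodes in breadth-first order from `start`.

    `graph` is an adjacency dict mapping each node to a list of
    neighbors. Illustrative sketch only; the benchmark itself asks
    models to follow BFS paths through graphs described in text.
    """
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order

# Small example: A branches to B and C, both of which lead to D.
g = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs_order(g, "A"))  # → ['A', 'B', 'C', 'D']
```

At 256K tokens the difficulty is not the algorithm itself but keeping track of which edges were already visited across a very long context.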
USA Mathematical Olympiad problems. Among the hardest math competitions, requiring elegant proofs and deep mathematical insight.
- Type: text
- Context: 1.0M tokens (~500 books)
- Released: Apr 2026
- License: Proprietary
- Status: Preview