Claude Sonnet 4 significantly improves on its predecessor, Sonnet 3.7, excelling in both coding and reasoning tasks with greater precision and controllability. It achieves state-of-the-art performance on SWE-bench (72.7%),...
Tested on 27 benchmarks with a 44.6% average. Top scores: MASK (95.3%), OpenCompass IFEval (88.3%), MATH Level 5 (84.4%).
Closest alternative: gpt-oss-120b scores 43.7 (~100% as good) at $0.04/1M input tokens · 99% cheaper
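A minimal sketch of where the "99% cheaper" figure plausibly comes from, assuming Claude Sonnet 4 list pricing of $3 per 1M input tokens (the Sonnet 4 price is not stated on this page and is an assumption here):

```python
# Rough cost comparison on input-token pricing alone.
sonnet4_input = 3.00   # $ per 1M input tokens (assumed Sonnet 4 list price)
gpt_oss_input = 0.04   # $ per 1M input tokens (from this page)

savings = 1 - gpt_oss_input / sonnet4_input  # fraction saved per input token
print(f"{savings:.0%}")  # → 99%
```

Output-token pricing would shift the exact ratio, but at this gap the rounded figure stays at roughly 99%.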
SWE-bench Verified solved using only bash commands, no specialized frameworks. Tests raw terminal-based problem solving.
Aider's multi-language code-editing benchmark. Tests editing ability across Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more.
OpenCompass LiveCodeBench v6. Fresh competitive programming problems that evaluate code generation without memorization.
Abstraction and Reasoning Corpus. Tests fluid intelligence through novel visual pattern recognition puzzles. Core measure of general intelligence.
Deceptively simple questions that humans find easy but AI models often get wrong. Tests common sense and reasoning gaps.
ARC-AGI-2, the harder sequel to ARC. More complex abstract reasoning patterns that test generalization beyond training data.
Competition-level math from AMC, AIME, and olympiad problems. Level 5 is the hardest tier, requiring creative problem-solving.
Mock AIME (American Invitational Mathematics Exam) problems from OTIS. Tests mathematical competition performance.
OpenCompass evaluation on AIME 2025 problems. Tests mathematical reasoning on fresh competition problems.
- Type: multimodal
- Context: 1.0M tokens (~750,000 words)
- Released: May 2025
- License: Proprietary
- Status: Active
- Cost / Message: ~$0.021
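The ~$0.021 per-message figure above can be reproduced as a back-of-the-envelope estimate. This sketch assumes Claude Sonnet 4 list pricing of $3 per 1M input tokens and $15 per 1M output tokens, and a hypothetical "typical" message of ~2,000 input and ~1,000 output tokens; none of these inputs appear on this page, so treat them as illustrative assumptions:

```python
# Hedged sketch: deriving an approximate per-message cost from token prices.
INPUT_PRICE_PER_TOKEN = 3 / 1_000_000    # assumed: $3 per 1M input tokens
OUTPUT_PRICE_PER_TOKEN = 15 / 1_000_000  # assumed: $15 per 1M output tokens

def message_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of a single message."""
    return (input_tokens * INPUT_PRICE_PER_TOKEN
            + output_tokens * OUTPUT_PRICE_PER_TOKEN)

# Hypothetical typical message: ~2,000 tokens in, ~1,000 tokens out.
print(round(message_cost(2_000, 1_000), 3))  # → 0.021
```

Because output tokens cost 5x input tokens under these assumptions, the output side dominates the estimate ($0.015 of the $0.021).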