For tasks that demand low latency, GPT‑4.1 nano is the fastest and cheapest model in the GPT-4.1 series. It delivers exceptional performance at a small size with its 1 million-token context window.
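For orientation, here is a minimal sketch of calling the model through the OpenAI Python SDK. The model id `gpt-4.1-nano` and the chat-completions call follow OpenAI's published API rather than anything on this page, and the prompt is a placeholder:

```python
# Minimal sketch: calling GPT-4.1 nano via the OpenAI Python SDK.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set;
# the model id "gpt-4.1-nano" follows OpenAI's published naming.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "Summarize this in one sentence: ..."}],
)
print(response.choices[0].message.content)
```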
Tested on 14 benchmarks with an average score of 35.2%. Top scores: HELM IFEval (84.3%), HELM WildBench (81.1%), MATH Level 5 (70.0%).
Gemma 3 27B scores 25.1 (about 71% as good) at $0.08/1M input tokens · 20% cheaper
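The comparison above is simple arithmetic over the quoted figures. A sketch, assuming GPT-4.1 nano's list price of $0.10 per 1M input tokens (an assumption, not shown on this page):

```python
# Back-of-envelope comparison, using the averages quoted above.
# The GPT-4.1 nano input price of $0.10/1M tokens is an assumption.
nano_avg, gemma_avg = 35.2, 25.1   # benchmark averages (%)
nano_in, gemma_in = 0.10, 0.08     # $ per 1M input tokens

print(f"relative score: {gemma_avg / nano_avg:.0%}")    # ~71%
print(f"price discount: {1 - gemma_in / nano_in:.0%}")  # 20%
```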
Unusual and adversarial machine learning challenges. Tests robustness of reasoning about edge cases in ML systems.
Multi-language code editing from Aider. Tests editing ability across Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more.
Stanford HELM WildBench evaluation. Tests reasoning on challenging real-world tasks.
ARC-AGI-2, the harder sequel to ARC. More complex abstract reasoning patterns that test generalization beyond training data.
Abstraction and Reasoning Corpus. Tests fluid intelligence through novel visual pattern recognition puzzles. Designed as a measure of general intelligence.
Competition-level math from AMC, AIME, and olympiad problems. Level 5 is the hardest tier, requiring creative problem-solving.
Stanford HELM evaluation of mathematical reasoning across diverse problem types.
Mock AIME (American Invitational Mathematics Exam) problems from OTIS. Tests mathematical competition performance.
- Type: multimodal
- Context: 1.0M tokens (~524 books)
- Released: Apr 2025
- License: Proprietary
- Status: Active
- Cost / Message: ~$0.001 (see the estimate below)
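The per-message figure is an estimate that depends entirely on message length. A sketch, assuming a ~1K-token prompt, a ~2K-token reply, and OpenAI's list prices of $0.10/1M input and $0.40/1M output tokens (assumptions, not taken from this page):

```python
# Sketch of where a ~$0.001 per-message figure can come from.
# Token counts and prices are assumptions; actual cost depends
# entirely on how long the prompt and reply are.
input_tokens, output_tokens = 1_000, 2_000
price_in, price_out = 0.10 / 1e6, 0.40 / 1e6  # $ per token

cost = input_tokens * price_in + output_tokens * price_out
print(f"${cost:.4f}")  # ~$0.0009, i.e. roughly $0.001
```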