GPT-4.1 Mini is a mid-sized model delivering performance competitive with GPT-4o at substantially lower latency and cost. It retains a 1 million token context window and scores 45.1% on hard...
Tested on 16 benchmarks with a 44.5% average. Top scores: HELM — IFEval (90.4%), MATH Level 5 (87.3%), HELM — WildBench (83.8%).
For comparison, gpt-oss-120b scores 43.7 (101% as good) at $0.04/1M input tokens, about 90% cheaper.
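A minimal sketch of the comparison arithmetic. The $0.40/1M input rate for GPT-4.1 Mini is an assumption taken from OpenAI's published pricing and is not stated on this page; only the $0.04/1M figure for gpt-oss-120b comes from the listing above.

```python
# Sketch: relative input-token cost of gpt-oss-120b vs. GPT-4.1 Mini.
# The $0.40/1M rate for GPT-4.1 Mini is assumed from OpenAI's published
# pricing; the page itself only gives gpt-oss-120b's $0.04/1M figure.

GPT_41_MINI_INPUT = 0.40   # USD per 1M input tokens (assumed)
GPT_OSS_120B_INPUT = 0.04  # USD per 1M input tokens (from the page)

savings = 1 - GPT_OSS_120B_INPUT / GPT_41_MINI_INPUT
print(f"gpt-oss-120b is {savings:.0%} cheaper on input tokens")  # -> 90% cheaper
```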
- Unusual and adversarial machine learning challenges. Tests robustness of reasoning about edge cases in ML systems.
- Multi-language code editing from Aider. Tests editing ability across Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more.
- SWE-bench Verified solved using only bash commands, with no specialized frameworks. Tests raw terminal-based problem solving.
- Stanford HELM WildBench evaluation. Tests reasoning on challenging real-world tasks.
- Abstraction and Reasoning Corpus (ARC). Tests fluid intelligence through novel visual pattern-recognition puzzles; designed as a core measure of general intelligence.
- ARC-AGI-2, the harder sequel to ARC. More complex abstract reasoning patterns that test generalization beyond the training data.
- Competition-level math from AMC, AIME, and Olympiad problems. Level 5 is the hardest tier, requiring creative problem-solving.
- Stanford HELM evaluation of mathematical reasoning across diverse problem types.
- Mock AIME (American Invitational Mathematics Exam) problems from OTIS. Tests mathematical competition performance.
- Type: multimodal
- Context: 1.0M tokens (~524 books)
- Released: Apr 2025
- License: Proprietary
- Status: Active
- Cost / Message: ~$0.002
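The per-message figure follows from per-token rates. A minimal sketch of the derivation, assuming a typical exchange of roughly 1K input and 1K output tokens at OpenAI's published GPT-4.1 Mini rates ($0.40/1M input, $1.60/1M output); both the rates and the message size are assumptions, not figures from this page.

```python
# Sketch: deriving the ~$0.002/message estimate from per-token rates.
# Rates and message size are assumptions, not figures from this page.

INPUT_RATE = 0.40 / 1_000_000    # USD per input token (assumed)
OUTPUT_RATE = 1.60 / 1_000_000   # USD per output token (assumed)

input_tokens, output_tokens = 1_000, 1_000  # a typical short exchange
cost = input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE
print(f"~${cost:.4f} per message")  # -> ~$0.0020
```

Under these assumptions the estimate lands exactly on the listed ~$0.002; longer prompts or responses scale the cost linearly.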