The o-series of models are trained with reinforcement learning to think before they answer and perform complex reasoning. The o3-pro model uses more compute to think harder and provide consistently...
Tested on 8 benchmarks, with a 61.2% average. Top scores: Fiction.LiveBench (97.2%), Lech Mazur Writing (86.3%), Aider polyglot (84.9%).
gpt-oss-20b (free) scores 61.0 (100% as good) at $0.00/1M input · 100% cheaper
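The "100% as good" and "100% cheaper" figures above look like rounded ratios. The page does not state how they are computed, so the following is only a minimal sketch. It assumes that the relative score is the alternative's score divided by this model's 61.2 average, and that the saving is one minus the price ratio. The function names and the baseline price of 1.0 are hypothetical placeholders, not values from the page.

```python
def relative_score(alternative: float, baseline: float) -> float:
    """Return the alternative's score as a percentage of the baseline score."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return 100.0 * alternative / baseline


def percent_cheaper(alternative_price: float, baseline_price: float) -> float:
    """Return how much cheaper the alternative is, as a percentage of the baseline price."""
    if baseline_price <= 0:
        raise ValueError("baseline price must be positive")
    return 100.0 * (1.0 - alternative_price / baseline_price)


# Scores quoted on the page: 61.0 for the alternative, 61.2 for this model.
ratio = relative_score(61.0, 61.2)
print(f"{ratio:.1f}% as good")  # just under 100%, which rounds to 100%

# Hypothetical baseline price: any positive value gives the same result
# when the alternative costs $0.00.
print(f"{percent_cheaper(0.00, 1.0):.0f}% cheaper")
```

Under these assumptions the alternative scores slightly below this model, and the displayed 100% is the result of rounding to a whole percent.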
Multi-language code editing from Aider. Tests editing ability across Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more.
Unusual and adversarial machine learning challenges. Tests robustness of reasoning about edge cases in ML systems.
Abstraction and Reasoning Corpus (ARC). Tests fluid intelligence through novel visual pattern-recognition puzzles. Intended as a core measure of general intelligence.
ARC-AGI 2, a harder sequel to ARC. Uses more complex abstract reasoning patterns to test generalization beyond training data.
LiveBench fiction analysis. Tests literary comprehension and creative text understanding.
Writing quality evaluation by Lech Mazur. Tests prose quality, coherence, and stylistic ability.
SEAL Pro Reasoning Legal. Tests legal reasoning and case analysis ability.
- Type: multimodal
- Context: 200K tokens (roughly 150,000 words of English text)
- Released: Jun 2025
- License: Proprietary
- Status: Active
- Cost / Message: ~$0.120