o1 is OpenAI's latest and strongest model family, designed to spend more time thinking before responding. The o1 model series is trained with large-scale reinforcement learning to reason...
Tested on 14 benchmarks with a 56.4% average score. Top scores: MATH Level 5 (94.7%), Aider — Code Editing (84.2%), Fiction.LiveBench (83.3%).
Qwen3 235B A22B Thinking 2507 scores 59.4 (100% as good) at $0.15 per 1M input tokens, about 99% cheaper.
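The "99% cheaper" figure follows from the ratio of per-token input prices. A minimal sketch of the arithmetic, assuming an o1 input price of $15.00 per 1M tokens (that price is not stated in this listing):

```python
# Sketch: deriving the "~99% cheaper" claim from per-1M-token input prices.
o1_price = 15.00    # USD per 1M input tokens (assumed, not from the listing)
qwen_price = 0.15   # USD per 1M input tokens (from the listing)

# Fractional savings relative to o1, expressed as a percentage.
savings_pct = (1 - qwen_price / o1_price) * 100
print(f"{savings_pct:.0f}% cheaper")  # prints "99% cheaper"
```

Under that assumed baseline, $0.15 is 1% of $15.00, which is where the 99% savings comes from.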
Code editing benchmark from the Aider project. Measures ability to apply targeted code changes while maintaining correctness and style.
Multi-language code editing from Aider. Tests editing ability across Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more.
Computer-aided design evaluation. Tests understanding of CAD concepts, 3D modeling, and engineering design principles.
Abstraction and Reasoning Corpus. Tests fluid intelligence through novel visual pattern recognition puzzles. Core measure of general intelligence.
Deceptively simple questions that humans find easy but AI models often get wrong. Tests common sense and reasoning gaps.
Competition-level math from AMC, AIME, and olympiad problems. Level 5 is the hardest tier, requiring creative problem-solving.
Mock AIME (American Invitational Mathematics Exam) problems from OTIS. Tests mathematical competition performance.
Original research-level math problems created by professional mathematicians. Problems are unpublished and cannot be memorized.
- Type: multimodal
- Context: 200K tokens (~100 books)
- Released: Dec 2024
- License: Proprietary
- Status: Active
- Cost / Message: ~$0.090