GPT-5.3-Codex is OpenAI’s most advanced agentic coding model, combining the frontier software engineering performance of GPT-5.2-Codex with the broader reasoning and professional knowledge capabilities of GPT-5.2. It achieves state-of-the-art results...
Tested on 9 benchmarks with a 52.2% average. Top scores: WeirdML (79.3%), Terminal Bench (77.3%), SWE-bench Verified (74.8%).
For comparison, MiMo-V2-Flash scores 81.7 (102% of GPT-5.3-Codex's score) at $0.09/1M input tokens, roughly 95% cheaper.
WeirdML: Unusual and adversarial machine learning challenges. Tests robustness of reasoning about edge cases in ML systems.
Terminal Bench: Complex terminal-based engineering tasks. Models must use command-line tools, navigate filesystems, and debug systems through shell interaction.
SWE-bench Verified: Real-world software engineering tasks from GitHub issues. Models must diagnose bugs and write patches that pass test suites. Human-verified subset of SWE-bench.
Evaluates post-training behaviors, including instruction following and the balance between safety and helpfulness.
SEAL SWE Atlas (Codebase Q&A): Tests understanding of large codebases through question answering.
Agent performance evaluation testing multi-step tool use, planning, and execution in realistic environments.
- Type: multimodal
- Context: 400K tokens (~200 books)
- Released: Feb 2026
- License: Proprietary
- Status: Active
- Cost / Message: ~$0.018