GPT-5.4 is OpenAI’s latest frontier model, unifying the Codex and GPT lines into a single system. It features a 1M+ token context window (922K input, 128K output) with support for...
Tested on 22 benchmarks with a 58.9% average. Top scores: Chatbot Arena Elo (Overall): 1467.7; Chatbot Arena Elo (Coding): 1411.1; OTIS Mock AIME 2024-2025: 95.3%. The Arena figures are Elo ratings, not percentages.
Llama 3.3 70B Instruct scores 75.9 (99% as good) at $0.10 per 1M input tokens, 96% cheaper.
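The "96% cheaper" figure is a relative price difference. A minimal sketch of that arithmetic follows, assuming the comparison is on input-token price only. The $2.50 baseline is a hypothetical value chosen to be consistent with the quoted 96%; it is not a price stated on this page, and `percent_cheaper` is an illustrative helper name.

```python
def percent_cheaper(baseline_price: float, alternative_price: float) -> float:
    """Return how much cheaper the alternative is, as a percentage of the baseline."""
    if baseline_price <= 0:
        raise ValueError("baseline_price must be positive")
    return (baseline_price - alternative_price) / baseline_price * 100


# Hypothetical baseline: $2.50 per 1M input tokens (an assumption, not from the page).
# Alternative: $0.10 per 1M input tokens (the Llama 3.3 70B Instruct figure above).
saving = percent_cheaper(2.50, 0.10)
print(f"{saving:.0f}% cheaper")
```

A price difference alone does not account for output-token pricing or for how many tokens each model needs per task, so the real saving on a given workload may differ.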
Complex terminal-based engineering tasks. Models must use command-line tools, navigate filesystems, and debug systems through shell interaction.
Unusual and adversarial machine learning challenges. Tests robustness of reasoning about edge cases in ML systems.
Real-world software engineering tasks from GitHub issues. Models must diagnose bugs and write patches that pass test suites. Human-verified subset of SWE-bench.
Abstraction and Reasoning Corpus. Tests fluid intelligence through novel visual pattern recognition puzzles. Core measure of general intelligence.
ARC-AGI 2, harder sequel to ARC. More complex abstract reasoning patterns that test generalization ability beyond training data.
Mock AIME (American Invitational Mathematics Examination) problems from OTIS. Tests mathematical competition performance.
- Type: multimodal
- Context: 1.1M tokens (~525 books)
- Released: Mar 2026
- License: Proprietary
- Status: Active
- Cost / Message: ~$0.020
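The 1.1M-token context figure appears to be the rounded sum of the input and output limits quoted in the description (922K and 128K). A minimal sketch of treating these as two separate budgets follows, assuming "K" means exactly 1,000 tokens and that token counts are supplied by the caller. `fits_context` and the constants are hypothetical names for illustration, not part of any OpenAI API.

```python
MAX_INPUT_TOKENS = 922_000   # assumed: "922K input" read as 922,000
MAX_OUTPUT_TOKENS = 128_000  # assumed: "128K output" read as 128,000


def fits_context(prompt_tokens: int, requested_output_tokens: int) -> bool:
    """Check a request against the separate input and output limits."""
    return (
        0 <= prompt_tokens <= MAX_INPUT_TOKENS
        and 0 <= requested_output_tokens <= MAX_OUTPUT_TOKENS
    )


total = MAX_INPUT_TOKENS + MAX_OUTPUT_TOKENS
print(total)  # 1050000, which rounds to the 1.1M shown above
```

Checking the two limits separately matters because a prompt that fits the combined total can still exceed the input limit on its own.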