GPT-4o mini is OpenAI's newest model after [GPT-4 Omni](/models/openai/gpt-4o); it accepts both text and image inputs and produces text outputs. As OpenAI's most advanced small model, it is many multiples more affordable...
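Since the model takes text and image inputs in the same request, a minimal sketch of the request payload may help. The message schema below follows OpenAI's published Chat Completions format for multimodal input; the image URL is a placeholder, not a real asset.

```python
# Sketch of a multimodal chat request payload (OpenAI Chat Completions schema).
# The image URL is a placeholder; no network call is made here.
payload = {
    "model": "gpt-4o-mini",
    "messages": [
        {
            "role": "user",
            "content": [
                # Text and image parts ride in the same user message.
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.png"},
                },
            ],
        }
    ],
}

print(payload["model"])
```

The reply from the model is text only, matching the text-output limitation noted above.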
Tested on 20 benchmarks with a 43.2% average. Top scores: Chatbot Arena Elo — Overall (1317.2), GSM8K (91.3%), HELM — WildBench (79.1%).
Llama 3.1 8B Instruct scores 34.3 (103% as good) at $0.02/1M input tokens · 87% cheaper
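The "87% cheaper" figure can be checked with simple arithmetic. The Llama 3.1 8B input price ($0.02/1M tokens) comes from the comparison above; the GPT-4o mini input price of $0.15/1M tokens is an assumption based on OpenAI's published pricing, not a figure from this page.

```python
# Verifying the "87% cheaper" claim from the comparison above.
gpt4o_mini_input = 0.15  # $ per 1M input tokens (assumed, from OpenAI pricing)
llama_8b_input = 0.02    # $ per 1M input tokens (from the comparison)

# Relative savings: (0.15 - 0.02) / 0.15 ≈ 0.867
savings = (gpt4o_mini_input - llama_8b_input) / gpt4o_mini_input
print(f"{savings:.0%}")  # → 87%
```

The rounded result matches the 87% shown in the comparison.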
Unusual and adversarial machine learning challenges. Tests robustness of reasoning about edge cases in ML systems.
Multi-language code editing from Aider. Tests editing ability across Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more.
Stanford HELM WildBench evaluation. Tests reasoning on challenging real-world tasks.
ARC-AGI 2, the harder sequel to ARC. More complex abstract reasoning patterns that test generalization ability beyond training data.
Grade school math word problems. 8,500 problems testing multi-step arithmetic reasoning. A foundational math benchmark.
Competition-level math from AMC, AIME, and olympiad problems. Level 5 is the hardest tier, requiring creative problem-solving.
Stanford HELM evaluation of mathematical reasoning across diverse problem types.
- Type: multimodal
- Context: 128K tokens (~96,000 words)
- Released: Jul 2024
- License: Proprietary
- Status: Active
- Cost / Message: ~$0.001
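The ~$0.001 per-message figure can be sanity-checked with a rough calculation. The token counts and the $0.15/$0.60 per-1M input/output prices below are assumptions for illustration, not figures from this page.

```python
# Rough sanity check of the ~$0.001 cost-per-message figure.
# Prices and token counts are assumptions, not from this page.
input_price = 0.15 / 1_000_000   # $ per input token (assumed)
output_price = 0.60 / 1_000_000  # $ per output token (assumed)

# A typical chat turn: ~1,000 tokens in, ~1,000 tokens out (assumed).
cost = 1_000 * input_price + 1_000 * output_price
print(f"${cost:.4f}")
```

Under these assumptions a turn costs well under a tenth of a cent, consistent with the ~$0.001 figure above.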