GPT-5 Mini is a compact version of GPT-5, designed to handle lighter-weight reasoning tasks. It provides the same instruction-following and safety-tuning benefits as GPT-5, but with reduced latency and cost....
Tested on 28 benchmarks, with an average score of 56.0%. Top scores: MATH level 5 (97.8%), HELM IFEval (92.7%), OTIS Mock AIME 2024-2025 (86.7%).
Qwen3 30B A3B Thinking 2507 scores 63.5 (101% as good) at $0.08 per 1M input tokens, which is 68% cheaper.
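The comparison above gives only the alternative model's figures and two ratios. As a rough sketch, the baseline values those ratios imply can be backed out with simple arithmetic. This assumes that "68% cheaper" refers to the input-token price and that "101% as good" is a plain ratio of scores; the page confirms neither reading. The function names are hypothetical, and the displayed figures are rounded, so the results are approximate.

```python
def implied_baseline_price(alt_price: float, fraction_cheaper: float) -> float:
    """Baseline price implied when an alternative is `fraction_cheaper` cheaper.

    Assumes alt_price = baseline * (1 - fraction_cheaper).
    """
    return alt_price / (1.0 - fraction_cheaper)


def implied_baseline_score(alt_score: float, relative_quality: float) -> float:
    """Baseline score implied when an alternative is `relative_quality` times as good.

    Assumes alt_score = baseline * relative_quality.
    """
    return alt_score / relative_quality


# Figures quoted in the comparison line (rounded on the page).
price = implied_baseline_price(alt_price=0.08, fraction_cheaper=0.68)
score = implied_baseline_score(alt_score=63.5, relative_quality=1.01)

print(f"implied baseline input price: ${price:.2f} per 1M tokens")
print(f"implied baseline score: {score:.1f}")
```

Under these assumptions the implied baseline is about $0.25 per 1M input tokens with a score of about 62.9. That score differs from the 56.0% benchmark average quoted earlier, so the comparison appears to use a different aggregate.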
Regularly refreshed coding problems that avoid data contamination. New problems added monthly to prevent memorization.
Real-world software engineering tasks from GitHub issues. Models must diagnose bugs and write patches that pass test suites. Human-verified subset of SWE-bench.
SWE-bench Verified solved using only bash commands, no specialized frameworks. Tests raw terminal-based problem solving.
Stanford HELM WildBench evaluation. Tests reasoning on challenging real-world tasks.
Regularly refreshed reasoning problems testing logical deduction, spatial reasoning, and analytical thinking.
Abstraction and Reasoning Corpus. Tests fluid intelligence through novel visual pattern recognition puzzles. Core measure of general intelligence.
Competition-level math from AMC, AIME, and olympiad problems. Level 5 is the hardest tier, requiring creative problem-solving.
Mock AIME (American Invitational Mathematics Exam) problems from OTIS. Tests mathematical competition performance.
Regularly updated math problems that test numerical reasoning, algebra, calculus, and combinatorics.
- Type: multimodal
- Context: 400K tokens (~200 books)
- Released: Aug 2025
- License: Proprietary
- Status: Active
- Cost / Message: ~$0.003