Best AI Models for Math
A math-focused composite built from arithmetic, competition-style, and frontier math benchmarks. Saturated benchmarks are still useful signals, but coverage across harder tests gives a clearer picture of confidence.
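For intuition, here is a minimal sketch of how a composite like this could be computed, assuming equal weighting over whichever benchmark scores a model has. The benchmark names and numbers are illustrative placeholders, not BenchGecko's actual weights or data.

```python
# Minimal sketch: an equally weighted composite over available scores.
# Benchmark names and values are illustrative; BenchGecko's actual
# weighting scheme is not published here.

def composite(scores: dict[str, float | None]) -> float | None:
    """Average the available 0-100 benchmark scores, skipping gaps."""
    available = [s for s in scores.values() if s is not None]
    if not available:
        return None
    return sum(available) / len(available)

# Hypothetical model with no FrontierMath result yet.
print(composite({"GSM8K": 96.0, "MATH": 88.5, "AIME": 71.0, "FrontierMath": None}))
# -> 85.1666...
```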
Ranked model table
Scores are based on the visible benchmark set and available metadata.
| Rank | Model | Vendor | Score | Evidence | Input price ($/M tokens) | Context window |
|---|---|---|---|---|---|---|
| #1 | GPT-5.4 Pro | OpenAI | 90.0 | 2 benchmarks · Medium | $30.00 | 1.1M |
| #2 | GPT-5.4 | OpenAI | 82.2 | 3 benchmarks · Medium | $2.50 | 1.1M |
| #3 | Qwen3 Max | Alibaba Qwen | 78.8 | 2 benchmarks · Medium | $0.78 | 262K |
| #4 | R1 0528 | DeepSeek | 75.3 | 2 benchmarks · Medium | $0.50 | 164K |
| #5 | Claude Opus 4.6 | Anthropic | 74.2 | 3 benchmarks · Medium | $5.00 | 1M |
| #6 | GPT-5.2 | OpenAI | 71.3 | 3 benchmarks · Medium | $1.75 | 400K |
| #7 | GPT-5 | OpenAI | 69.6 | 4 benchmarks · High | $1.25 | 400K |
| #8 | gpt-oss-120b | OpenAI | 69.6 | 2 benchmarks · Medium | $0.04 | 131K |
| #9 | R1 | DeepSeek | 67.4 | 2 benchmarks · Medium | $0.70 | 64K |
| #10 | GPT-5 Mini | OpenAI | 65.0 | 5 benchmarks · High | $0.25 | 400K |
| #11 | o4 Mini | OpenAI | 59.5 | 4 benchmarks · High | $1.10 | 200K |
| #12 | o1 | OpenAI | 58.9 | 3 benchmarks · Medium | $15.00 | 200K |
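As one way to use the table, the sketch below filters rows to an input-price budget and re-sorts by score. The row values are copied from a subset of the table above; the $1.00/M budget is only an example.

```python
# Sketch: shortlist models under an input-price budget, sorted by score.
# Rows are (model, score, input $/M tokens, context tokens), copied from
# a subset of the table above; the budget is an arbitrary example.

rows = [
    ("GPT-5.4", 82.2, 2.50, 1_100_000),
    ("Qwen3 Max", 78.8, 0.78, 262_000),
    ("R1 0528", 75.3, 0.50, 164_000),
    ("gpt-oss-120b", 69.6, 0.04, 131_000),
    ("GPT-5 Mini", 65.0, 0.25, 400_000),
]

budget = 1.00  # max input price, $/M tokens
shortlist = sorted((r for r in rows if r[2] <= budget), key=lambda r: -r[1])
for model, score, price, _ in shortlist:
    print(f"{model}: score {score} at ${price}/M")
```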
Math scores depend heavily on benchmark format. Use the linked benchmark pages before choosing a model for high-stakes mathematical work.
BenchGecko ranks models from published benchmark scores and model metadata. Scores do not measure every use case, and missing data can affect rankings.
Related rankings
Best AI Models for Reasoning
Reasoning models ranked from public benchmark scores across GPQA Diamond, BBH, ARC-AGI, SimpleBench, and related tests.
Best AI Models for Coding
Coding models ranked from published coding benchmark scores, listed prices, and model metadata tracked by BenchGecko.
Best Open-weight AI Models
Open-weight AI models ranked from available benchmark data, coverage confidence, pricing metadata, and listed license signals.
Questions
Which benchmarks are used for math rankings?
This page uses GSM8K, MATH-level tasks, AIME-style tasks, and FrontierMath when those scores are available.
Why show confidence labels?
A model with two math scores is less proven than a model with broader evidence. Confidence labels make that coverage visible.
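BenchGecko does not publish its exact thresholds, but the pattern in the table above (2-3 benchmarks labeled Medium, 4-5 labeled High) suggests a simple count-based mapping. The cutoffs below are an assumption that mirrors that visible pattern.

```python
# Assumed mapping from benchmark count to a coverage label. The cutoffs
# mirror the visible table pattern (2-3 -> Medium, 4-5 -> High); they are
# not BenchGecko's documented rules.

def confidence_label(n_benchmarks: int) -> str:
    if n_benchmarks >= 4:
        return "High"
    if n_benchmarks >= 2:
        return "Medium"
    return "Low"

print(confidence_label(5))  # "High", matching GPT-5 Mini's 5-benchmark row
```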
Can this ranking replace manual math validation?
No. Use it as a benchmark-backed shortlist, then validate models on your own problem set.
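Here is a minimal sketch of that validation step, assuming a hypothetical ask_model(prompt) wrapper around whatever API or local runtime you use. The problems and the substring check are placeholders to adapt to your own problem set.

```python
# Sketch of a tiny exact-answer validation harness. ask_model is a
# hypothetical stand-in for your provider's API call; replace it before use.

PROBLEMS = [
    ("What is 17 * 24? Answer with the number only.", "408"),
    ("What is the sum of the first 10 positive integers? Answer with the number only.", "55"),
]

def ask_model(prompt: str) -> str:
    raise NotImplementedError("Wire this to your model provider or local runtime.")

def accuracy() -> float:
    """Fraction of problems whose expected answer appears in the reply."""
    correct = sum(
        1 for question, expected in PROBLEMS
        if expected in ask_model(question)
    )
    return correct / len(PROBLEMS)
```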