Math ranking
Data updated · May 6, 2026 · 12 ranked models

Best AI Models for Math

A math-focused composite built from arithmetic, competition-style, and frontier math benchmarks. Saturated benchmarks still provide useful signal, but models with coverage across harder tests earn clearer confidence labels.

Rank 1 · Medium confidence · GPT-5.4 Pro (OpenAI) · composite score 90
Rank 2 · Medium confidence · GPT-5.4 (OpenAI) · composite score 82.2
Rank 3 · Medium confidence · Qwen3 Max (Alibaba Qwen) · composite score 78.8

Scores are based on the visible benchmark set and available metadata.

Missing prices stay missing; they are never estimated or imputed (see the sketch after the table).
Rank  Model            Maker         Score  Evidence               Input price  Context
#1    GPT-5.4 Pro      OpenAI        90     2 benchmarks · Medium  $30/M        1.1M
#2    GPT-5.4          OpenAI        82.2   3 benchmarks · Medium  $2.50/M      1.1M
#3    Qwen3 Max        Alibaba Qwen  78.8   2 benchmarks · Medium  $0.78/M      262K
#4    R1 0528          DeepSeek      75.3   2 benchmarks · Medium  $0.50/M      164K
#5    Claude Opus 4.6  Anthropic     74.2   3 benchmarks · Medium  $5.00/M      1M
#6    GPT-5.2          OpenAI        71.3   3 benchmarks · Medium  $1.75/M      400K
#7    GPT-5            OpenAI        69.6   4 benchmarks · High    $1.25/M      400K
#8    gpt-oss-120b     OpenAI        69.6   2 benchmarks · Medium  $0.04/M      131K
#9    R1               DeepSeek      67.4   2 benchmarks · Medium  $0.70/M      64K
#10   GPT-5 Mini       OpenAI        65     5 benchmarks · High    $0.25/M      400K
#11   o4 Mini          OpenAI        59.5   4 benchmarks · High    $1.10/M      200K
#12   o1               OpenAI        58.9   3 benchmarks · Medium  $15/M        200K
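BenchGecko does not publish its aggregation formula, so the following is a minimal sketch, assuming the composite is a plain mean of whatever benchmark scores a model has, with missing values skipped rather than imputed (the same policy the page applies to prices). The `Model` record, its field names, and the example scores are illustrative placeholders, not BenchGecko's schema or data:

```python
from dataclasses import dataclass, field

@dataclass
class Model:
    # Hypothetical record; field names are illustrative, not BenchGecko's schema.
    name: str
    benchmarks: dict[str, float | None] = field(default_factory=dict)  # None = no published score
    input_price_per_m: float | None = None  # missing prices stay None, never estimated

def composite_score(model: Model) -> tuple[float | None, int]:
    """Mean of available benchmark scores; missing entries are skipped, not imputed."""
    available = [s for s in model.benchmarks.values() if s is not None]
    if not available:
        return None, 0
    return sum(available) / len(available), len(available)

# Placeholder scores for illustration only.
example = Model(
    name="GPT-5",
    benchmarks={"GSM8K": 96.0, "MATH": 88.0, "AIME": 60.0, "FrontierMath": None},
)
score, evidence = composite_score(example)
print(f"{example.name}: {score:.1f} across {evidence} benchmarks")
```

A real pipeline would presumably normalize each benchmark to a shared scale before averaging, since near-saturated GSM8K percentages and FrontierMath solve rates are not directly comparable.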
Strict caveat

Math scores depend heavily on benchmark format. Use the linked benchmark pages before choosing a model for high-stakes mathematical work.

BenchGecko ranks models from published benchmark scores and model metadata. Scores do not measure every use case, and missing data can affect rankings.

Related ranking

Reasoning models ranked from public benchmark scores across GPQA Diamond, BBH, ARC-AGI, SimpleBench, and related tests.

Related ranking

Coding models ranked from published coding benchmark scores, listed prices, and model metadata tracked by BenchGecko.

Related ranking

Open-weight AI models ranked from available benchmark data, coverage confidence, pricing metadata, and listed license signals.

Which benchmarks are used for math rankings?

This page uses GSM8K, MATH-level tasks, AIME-style tasks, and FrontierMath when those scores are available.

Why show confidence labels?

A model with two math scores is less proven than a model with broader evidence. Confidence labels make that coverage visible.
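The page does not state the exact thresholds, but in the table above every model with four or more benchmark scores is labeled High and every model with two or three is labeled Medium. A rule consistent with that pattern, continuing the sketch above:

```python
def confidence_label(evidence_count: int) -> str:
    # Thresholds inferred from the visible table (4+ scores -> High, 2-3 -> Medium);
    # BenchGecko's actual rule may differ.
    if evidence_count >= 4:
        return "High"
    if evidence_count >= 2:
        return "Medium"
    return "Low"  # assumption: no single-benchmark model appears in this table
```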

Can this ranking replace manual math validation?

No. Use it as a benchmark-backed shortlist, then validate models on your own problem set.