Reasoning ranking · Data updated May 6, 2026 · 12 ranked models

Best AI Models for Reasoning

A composite ranking for multi-step reasoning, abstract inference, and difficult question answering. The table favors models with evidence across more than one reasoning benchmark.

Rank 1 · OpenAI · High confidence
Rank 2 · Anthropic · High confidence
Rank 3 · OpenAI · Medium confidence

Scores are based on the visible benchmark set and available metadata.

Missing prices stay missing
| Rank | Model | Provider | Score | Evidence | Input price | Context |
|---|---|---|---|---|---|---|
| #1 | GPT-5.4 Pro | OpenAI | 94.5 | 4 benchmarks · High | $30/M | 1.1M |
| #2 | Claude Opus 4.6 | Anthropic | 90.3 | 5 benchmarks · High | $5.00/M | 1M |
| #3 | GPT-5.5 | OpenAI | 89.7 | 2 benchmarks · Medium | $5.00/M | 400K |
| #4 | GPT-5.4 | OpenAI | 87.6 | 3 benchmarks · Medium | $2.50/M | 1.1M |
| #5 | Claude Sonnet 4.6 | Anthropic | 77.6 | 3 benchmarks · Medium | $3.00/M | 1M |
| #6 | GPT-5.2 | OpenAI | 73.1 | 5 benchmarks · High | $1.75/M | 400K |
| #7 | GPT-5.2 Pro | OpenAI | 71.1 | 3 benchmarks · Medium | $21/M | 400K |
| #8 | Claude Opus 4.5 | Anthropic | 70.6 | 5 benchmarks · High | $5.00/M | 200K |
| #9 | GPT-5 Pro | OpenAI | 62.7 | 4 benchmarks · High | $15/M | 400K |
| #10 | GLM 4.7 | z-ai | 60.9 | 2 benchmarks · Medium | $0.38/M | 203K |
| #11 | GPT-5.1 | OpenAI | 60.8 | 5 benchmarks · High | $1.25/M | 400K |
| #12 | Grok 4 | xAI | 60 | 4 benchmarks · High | $3.00/M | 256K |
Strict caveat

Reasoning benchmarks are proxies. A high score here does not guarantee better answers in every professional or domain-specific setting.

BenchGecko ranks models from published benchmark scores and model metadata. Scores do not measure every use case, and missing data can affect rankings.

Related rankings

Math models ranked from public benchmark scores across GSM8K, MATH-level tests, AIME-style tasks, and FrontierMath where available.

Coding models ranked from published coding benchmark scores, listed prices, and model metadata tracked by BenchGecko.

Multimodal models ranked from public benchmark scores across video, image, chart, and visual reasoning tests where available.

What makes a model good at reasoning?

For this ranking, a model counts as good at reasoning when it scores well on published reasoning benchmarks such as GPQA Diamond, BBH, ARC-AGI, SimpleBench, and HLE.

Why not use one benchmark only?

Single benchmarks can be saturated or narrow. The composite uses multiple reasoning tests and labels confidence by evidence coverage.
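As a rough illustration of labeling confidence by evidence coverage, the sketch below maps a benchmark count to a label. The thresholds are inferred from the rows on this page, where models with 4 or 5 benchmarks are marked High and models with 2 or 3 are marked Medium. They are not a published BenchGecko rule, and the `Low` label and the function name `confidence_label` are assumptions made for the example.

```python
def confidence_label(n_benchmarks: int) -> str:
    """Map the number of benchmarks with a published score to a label.

    Thresholds are inferred from the ranking table, not from any
    documented BenchGecko rule; "Low" is an assumed fallback.
    """
    if n_benchmarks < 0:
        raise ValueError("benchmark count cannot be negative")
    if n_benchmarks >= 4:
        return "High"
    if n_benchmarks >= 2:
        return "Medium"
    return "Low"


# Benchmark counts as listed in the table for three of the ranked models.
for model, count in [("GPT-5.4 Pro", 4), ("GPT-5.5", 2), ("GPT-5.1", 5)]:
    print(model, confidence_label(count))
```

Under these assumed thresholds, the labels printed for the three models agree with the ones shown in the table.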

Are missing scores counted as zero?

No. Missing scores reduce coverage confidence but are not treated as failed benchmark attempts.
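The difference between skipping a missing score and counting it as zero can be sketched as follows. This is a minimal illustration only: the equal-weight mean, the function names, and the example scores are all assumptions, since the page does not publish the composite formula.

```python
from typing import Mapping, Optional


def composite_score(scores: Mapping[str, Optional[float]]) -> Optional[float]:
    """Equal-weight mean over the benchmarks that have a score.

    Missing entries (None) are skipped, not treated as 0. The
    equal-weight mean is an assumption made for this sketch.
    """
    present = [s for s in scores.values() if s is not None]
    if not present:
        return None  # no evidence at all: no composite, rather than 0
    return sum(present) / len(present)


def coverage(scores: Mapping[str, Optional[float]]) -> float:
    """Fraction of benchmarks that have a published score."""
    if not scores:
        return 0.0
    return sum(s is not None for s in scores.values()) / len(scores)


# Hypothetical scores for one model; None marks a benchmark with no result.
example = {
    "GPQA Diamond": 80.0,
    "BBH": 90.0,
    "ARC-AGI": None,
    "SimpleBench": None,
    "HLE": None,
}

skipped = composite_score(example)  # mean of 80.0 and 90.0
zero_filled = sum(s or 0.0 for s in example.values()) / len(example)
print(skipped, zero_filled, coverage(example))
```

In this hypothetical case the model keeps a composite of 85.0 with coverage of 0.4, whereas zero-filling would drag the composite down to 34.0. The low coverage is what would lower the confidence label, not the score itself.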