Gemini 3 Pro vs GPT-5 Chat vs GPT-5 Mini
Side by side · benchmarks, pricing, and signals you can act on.
Winner summary
Gemini 3 Pro wins 16/18 benchmarks
Gemini 3 Pro wins 16 of 18 shared benchmarks, leading in knowledge · math · reasoning · arena · coding. GPT-5 Mini takes language, winning HELM IFEval and HELM Omni-MATH outright.
Category leads
knowledge · Gemini 3 Pro
language · GPT-5 Mini
math · Gemini 3 Pro
reasoning · Gemini 3 Pro
arena · Gemini 3 Pro
coding · Gemini 3 Pro
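The 16-of-18 tally can be reproduced mechanically from the full benchmark table at the bottom of this page. A minimal sketch (scores copied from that table; a model with no reported score on a benchmark simply doesn't compete there):

```python
# Tally per-benchmark winners from the shared scores on this page.
# Only models with a reported score compete on a given benchmark.
scores = {
    "HELM · GPQA":      {"Gemini 3 Pro": 80.3, "GPT-5 Chat": 79.1, "GPT-5 Mini": 75.6},
    "HELM · IFEval":    {"Gemini 3 Pro": 87.6, "GPT-5 Chat": 87.5, "GPT-5 Mini": 92.7},
    "HELM · MMLU-Pro":  {"Gemini 3 Pro": 90.3, "GPT-5 Chat": 86.3, "GPT-5 Mini": 83.5},
    "HELM · Omni-MATH": {"Gemini 3 Pro": 55.6, "GPT-5 Chat": 64.7, "GPT-5 Mini": 72.2},
    "ARC-AGI":          {"Gemini 3 Pro": 75.0, "GPT-5 Mini": 54.3},
    # ...the remaining rows of the full table follow the same shape.
}

wins: dict[str, int] = {}
for bench, results in scores.items():
    winner = max(results, key=results.get)  # highest score takes the benchmark
    wins[winner] = wins.get(winner, 0) + 1

print(wins)  # over all 18 rows: {'Gemini 3 Pro': 16, 'GPT-5 Mini': 2}
```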
Hype vs Reality
Attention vs performance
| Model | Performance rank | Attention rank |
|---|---|---|
| Gemini 3 Pro | #40 | no signal |
| GPT-5 Chat | #3 | #1 |
| GPT-5 Mini | #65 | no signal |
Best value
GPT-5 Mini · 3.4x better value than GPT-5 Chat

| Model | Value | Blended price |
|---|---|---|
| Gemini 3 Pro | — | no price |
| GPT-5 Chat | 14.6 pts/$ | $5.63/M |
| GPT-5 Mini | 49.8 pts/$ | $1.13/M |
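The figures above are consistent with a blended price that is the plain average of input and output price, and with pts/$ dividing an aggregate benchmark score by that blended price; both readings are inferences from the numbers, not stated on this page. A quick check of the 3.4x claim:

```python
# Blended $/M appears to be the plain average of input and output price:
# (1.25 + 10.00) / 2 = 5.625 -> displayed as $5.63/M (GPT-5 Chat)
# (0.25 +  2.00) / 2 = 1.125 -> displayed as $1.13/M (GPT-5 Mini)
def blended_price(input_per_m: float, output_per_m: float) -> float:
    return (input_per_m + output_per_m) / 2

print(blended_price(1.25, 10.00))  # 5.625
print(blended_price(0.25, 2.00))   # 1.125
print(f"{49.8 / 14.6:.1f}x")       # 3.4x -> GPT-5 Mini vs GPT-5 Chat in pts/$
```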
Vendor risk
Who is behind the model
| Model | Vendor | Valuation | Risk tier |
|---|---|---|---|
| Gemini 3 Pro | Google DeepMind | $4.00T | Tier 1 |
| GPT-5 Chat | OpenAI | $840.0B | Tier 1 |
| GPT-5 Mini | OpenAI | $840.0B | Tier 1 |
Head to head
18 benchmarks · 3 models
Gemini 3 Pro · GPT-5 Chat · GPT-5 Mini
HELM · GPQA
Gemini 3 Pro leads by +1.2
Gemini 3 Pro 80.3 · GPT-5 Chat 79.1 · GPT-5 Mini 75.6
HELM · IFEval
GPT-5 Mini leads by +5.1
Gemini 3 Pro 87.6 · GPT-5 Chat 87.5 · GPT-5 Mini 92.7
HELM · MMLU-Pro
Gemini 3 Pro leads by +4.0
Gemini 3 Pro 90.3 · GPT-5 Chat 86.3 · GPT-5 Mini 83.5
HELM · Omni-MATH
GPT-5 Mini leads by +7.5
Gemini 3 Pro 55.6 · GPT-5 Chat 64.7 · GPT-5 Mini 72.2
HELM · WildBench
Gemini 3 Pro leads by +0.2
Gemini 3 Pro 85.9 · GPT-5 Chat 85.7 · GPT-5 Mini 85.5
ARC-AGI
Gemini 3 Pro leads by +20.7
ARC-AGI · the original Abstraction and Reasoning Corpus, testing whether AI can solve novel visual pattern recognition tasks without memorization.
Gemini 3 Pro 75.0 · GPT-5 Mini 54.3
ARC-AGI-2
Gemini 3 Pro leads by +26.7
ARC-AGI-2 · the second iteration of the Abstraction and Reasoning Corpus, testing novel pattern recognition and abstract reasoning without prior training data.
Gemini 3 Pro 31.1 · GPT-5 Mini 4.4
Chatbot Arena Elo · Overall
Gemini 3 Pro leads by +60.2
Gemini 3 Pro 1486.2 · GPT-5 Chat 1426.0
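An Elo gap maps to an expected head-to-head win rate via the standard logistic formula P = 1 / (1 + 10^(−Δ/400)). Assuming Chatbot Arena's usual 400-point scale, the ~60-point lead shown here implies Gemini 3 Pro would be preferred in roughly 59% of pairwise matchups; a sketch of the conversion:

```python
# Expected win rate for the higher-rated model under the Elo model,
# assuming Chatbot Arena's standard 400-point logistic scale.
def elo_win_probability(delta: float, scale: float = 400.0) -> float:
    return 1.0 / (1.0 + 10.0 ** (-delta / scale))

delta = 1486.2 - 1426.0  # Gemini 3 Pro minus GPT-5 Chat, from the card above
print(f"{elo_win_probability(delta):.1%}")  # ~58.6%
```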
FrontierMath-2025-02-28-Private
Gemini 3 Pro leads by +10.4
FrontierMath (Feb 2025) · original research-level math problems created by mathematicians, testing capabilities at the boundary of current AI mathematical reasoning.
Gemini 3 Pro 37.6 · GPT-5 Mini 27.2
FrontierMath-Tier-4-2025-07-01-Private
Gemini 3 Pro leads by +12.5
FrontierMath Tier 4 (Jul 2025) · the most challenging tier of frontier mathematics, containing problems that push the absolute limits of AI mathematical reasoning.
Gemini 3 Pro 18.8 · GPT-5 Mini 6.3
GPQA diamond
Gemini 3 Pro leads by +23.5
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
Gemini 3 Pro 90.2 · GPT-5 Mini 66.7
HLE
Gemini 3 Pro leads by +19.0
HLE (Humanity's Last Exam) · a reasoning benchmark designed to be the hardest public evaluation of AI. Questions span mathematics, physics, philosophy, and logic · curated to be at or beyond the frontier of human expert capability. Tested with and without tool augmentation. Claude Opus 4.7 scores 46.9% without tools and 54.7% with tools · making it one of the few benchmarks where the top score is below 60%.
Gemini 3 Pro 34.4 · GPT-5 Mini 15.4
OTIS Mock AIME 2024-2025
Gemini 3 Pro leads by +4.7
OTIS Mock AIME 2024-2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.
Gemini 3 Pro 91.4 · GPT-5 Mini 86.7
SimpleQA Verified
Gemini 3 Pro leads by +51.9
SimpleQA Verified · short factual questions with verified answers, measuring factual accuracy and the tendency to hallucinate or provide incorrect information.
Gemini 3 Pro 72.9 · GPT-5 Mini 21.0
SWE-bench Verified
Gemini 3 Pro leads by +8.2
SWE-bench Verified · 500 human-validated tasks from 12 real Python repositories (Django, Flask, scikit-learn, sympy, and others). Each task requires the model to produce a git patch that resolves a real GitHub issue and passes the test suite. The verified subset eliminates ambiguous tasks from the original SWE-bench. Claude Mythos Preview leads at 93.9%, crossing 90% for the first time in 2026. Opus 4.6 scores 80.8%. The benchmark remains the most-cited evaluation for code-generation capability.
Gemini 3 Pro 72.9 · GPT-5 Mini 64.7
Terminal Bench
Gemini 3 Pro leads by +34.6
Terminal-Bench 2.0 · evaluates AI agents on real terminal-based coding tasks · writing scripts, debugging, running tests, and managing projects entirely through command-line interaction. Tests both code quality and terminal fluency. Claude Opus 4.7 scores 69.4%, demonstrating significant agentic terminal competence.
Gemini 3 Pro 69.4 · GPT-5 Mini 34.8
VPCT
Gemini 3 Pro leads by +76.2
VPCT (Visual Pattern Completion Test) · tests visual reasoning and pattern recognition by having models complete visual sequences and transformations.
Gemini 3 Pro 86.5 · GPT-5 Mini 10.3
WeirdML
Gemini 3 Pro leads by +17.2
WeirdML · tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.
Gemini 3 Pro 69.9 · GPT-5 Mini 52.7
Full benchmark table
| Benchmark | Gemini 3 Pro | GPT-5 Chat | GPT-5 Mini |
|---|---|---|---|
| HELM · GPQA | 80.3 | 79.1 | 75.6 |
| HELM · IFEval | 87.6 | 87.5 | 92.7 |
| HELM · MMLU-Pro | 90.3 | 86.3 | 83.5 |
| HELM · Omni-MATH | 55.6 | 64.7 | 72.2 |
| HELM · WildBench | 85.9 | 85.7 | 85.5 |
| ARC-AGI | 75.0 | — | 54.3 |
| ARC-AGI-2 | 31.1 | — | 4.4 |
| Chatbot Arena Elo · Overall | 1486.2 | 1426.0 | — |
| FrontierMath-2025-02-28-Private | 37.6 | — | 27.2 |
| FrontierMath-Tier-4-2025-07-01-Private | 18.8 | — | 6.3 |
| GPQA diamond | 90.2 | — | 66.7 |
| HLE | 34.4 | — | 15.4 |
| OTIS Mock AIME 2024-2025 | 91.4 | — | 86.7 |
| SimpleQA Verified | 72.9 | — | 21.0 |
| SWE-bench Verified | 72.9 | — | 64.7 |
| Terminal Bench | 69.4 | — | 34.8 |
| VPCT | 86.5 | — | 10.3 |
| WeirdML | 69.9 | — | 52.7 |
Pricing · per 1M tokens · projected $/mo at 10M tokens
| Model | Input | Output | Context | Projected $/mo |
|---|---|---|---|---|
| Gemini 3 Pro | — | — | — | — |
| GPT-5 Chat | $1.25 | $10.00 | 128K tokens (~64 books) | $34.38 |
| GPT-5 Mini | $0.25 | $2.00 | 400K tokens (~200 books) | $6.88 |
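The projected monthly figures are consistent with a 10M-token month split 3:1 between input and output tokens; that split is an inference from the numbers above, not something the page states. A quick reproduction:

```python
# Projected $/mo at 10M tokens, assuming a 3:1 input:output split
# (7.5M input + 2.5M output reproduces the figures above).
def monthly_cost(input_per_m: float, output_per_m: float,
                 total_m: float = 10.0, input_share: float = 0.75) -> float:
    input_tokens_m = total_m * input_share
    output_tokens_m = total_m * (1 - input_share)
    return input_tokens_m * input_per_m + output_tokens_m * output_per_m

print(monthly_cost(1.25, 10.00))  # 34.375 -> $34.38 (GPT-5 Chat)
print(monthly_cost(0.25, 2.00))   # 6.875  -> $6.88  (GPT-5 Mini)
```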