
Gemini 2.5 Flash vs Gemini 2.5 Pro

Side by side · benchmarks, pricing, and signals you can act on.

Winner summary

Gemini 2.5 Pro wins 22 of 25 shared benchmarks. Leads in coding · reasoning · arena.

Category leads
coding · Gemini 2.5 Pro
reasoning · Gemini 2.5 Pro
arena · Gemini 2.5 Pro
knowledge · Gemini 2.5 Pro
math · Gemini 2.5 Pro
language · Gemini 2.5 Flash
agentic · Gemini 2.5 Flash
Hype vs Reality
Gemini 2.5 Flash
#142 by perf · #14 by attention
OVERHYPED
Gemini 2.5 Pro
#59 by perf · no signal
QUIET
Best value
Gemini 2.5 Flash · 2.9x better value than Gemini 2.5 Pro
Gemini 2.5 Flash · 28.6 pts/$ · $1.40/M
Gemini 2.5 Pro · 10.0 pts/$ · $5.63/M
Vendor risk
Google DeepMind · $4.00T · Tier 1 · Low risk (same vendor for both models)
Head to head
Gemini 2.5 Flash · Gemini 2.5 Pro
Aider polyglot
Gemini 2.5 Pro leads by +36.0
Aider Polyglot · measures how well AI models can edit code across multiple programming languages using the Aider coding assistant framework.
Gemini 2.5 Flash
47.1
Gemini 2.5 Pro
83.1
ARC-AGI
Gemini 2.5 Pro leads by +8.7
ARC-AGI · the original Abstraction and Reasoning Corpus, testing whether AI can solve novel visual pattern recognition tasks without memorization.
Gemini 2.5 Flash
32.3
Gemini 2.5 Pro
41.0
ARC-AGI-2
Gemini 2.5 Pro leads by +2.3
ARC-AGI-2 · the second iteration of the Abstraction and Reasoning Corpus, testing novel pattern recognition and abstract reasoning without prior training data.
Gemini 2.5 Flash
2.5
Gemini 2.5 Pro
4.9
Chatbot Arena Elo · Overall
Gemini 2.5 Pro leads by +37.2
Gemini 2.5 Flash
1411.0
Gemini 2.5 Pro
1448.2
Balrog
Gemini 2.5 Pro leads by +9.8
Balrog · benchmarks AI agents on text-based adventure games, testing language understanding, strategic planning, and long-horizon reasoning.
Gemini 2.5 Flash
33.5
Gemini 2.5 Pro
43.3
DeepResearch Bench
Gemini 2.5 Pro leads by +20.5
DeepResearch Bench · evaluates AI on complex multi-step research tasks requiring information gathering, synthesis, and producing comprehensive analyses.
Gemini 2.5 Flash
29.2
Gemini 2.5 Pro
49.7
Fiction.LiveBench
Gemini 2.5 Pro leads by +44.5
Fiction.LiveBench · a continuously updated benchmark using recently published fiction to test reading comprehension and reasoning, preventing data contamination.
Gemini 2.5 Flash
47.2
Gemini 2.5 Pro
91.7
FrontierMath-2025-02-28-Private
Gemini 2.5 Pro leads by +9.3
FrontierMath (Feb 2025) · original research-level math problems created by mathematicians, testing capabilities at the boundary of current AI mathematical reasoning.
Gemini 2.5 Flash
4.8
Gemini 2.5 Pro
14.1
FrontierMath-Tier-4-2025-07-01-Private
Tie · both models score 4.2
FrontierMath Tier 4 (Jul 2025) · the most challenging tier of frontier mathematics, containing problems that push the absolute limits of AI mathematical reasoning.
Gemini 2.5 Flash
4.2
Gemini 2.5 Pro
4.2
GeoBench
Gemini 2.5 Pro leads by +8.0
GeoBench · tests geographic knowledge and spatial reasoning across countries, landmarks, coordinates, and geopolitical understanding.
Gemini 2.5 Flash
73.0
Gemini 2.5 Pro
81.0
HELM · GPQA
Gemini 2.5 Pro leads by +35.9
Gemini 2.5 Flash
39.0
Gemini 2.5 Pro
74.9
HELM · IFEval
Gemini 2.5 Flash leads by +5.8
Gemini 2.5 Flash
89.8
Gemini 2.5 Pro
84.0
HELM · MMLU-Pro
Gemini 2.5 Pro leads by +22.4
Gemini 2.5 Flash
63.9
Gemini 2.5 Pro
86.3
HELM · Omni-MATH
Gemini 2.5 Pro leads by +3.2
Gemini 2.5 Flash
38.4
Gemini 2.5 Pro
41.6
HELM · WildBench
Gemini 2.5 Pro leads by +4.0
Gemini 2.5 Flash
81.7
Gemini 2.5 Pro
85.7
HLE
Gemini 2.5 Pro leads by +10.0
HLE (Humanity's Last Exam) · crowdsourced expert-level questions designed to be among the hardest possible challenges for AI systems across all domains.
Gemini 2.5 Flash
7.7
Gemini 2.5 Pro
17.7
Lech Mazur Writing
Gemini 2.5 Pro leads by +9.5
Lech Mazur Writing · evaluates creative writing ability, assessing prose quality, narrative coherence, and stylistic sophistication.
Gemini 2.5 Flash
76.5
Gemini 2.5 Pro
86.0
OTIS Mock AIME 2024-2025
Gemini 2.5 Pro leads by +11.7
OTIS Mock AIME 2024–2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.
Gemini 2.5 Flash
73.0
Gemini 2.5 Pro
84.7
AudioMultiChallenge
Gemini 2.5 Pro leads by +6.9
Gemini 2.5 Flash
40.0
Gemini 2.5 Pro
46.9
AudioMultiChallenge · Text Output
Gemini 2.5 Pro leads by +6.9
Gemini 2.5 Flash
40.0
Gemini 2.5 Pro
46.9
SimpleBench
Gemini 2.5 Pro leads by +25.4
SimpleBench · tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.
Gemini 2.5 Flash
29.4
Gemini 2.5 Pro
54.9
Terminal Bench
Gemini 2.5 Pro leads by +15.5
Terminal Bench · tests the ability to accomplish real-world tasks using terminal commands, evaluating shell scripting and CLI tool proficiency.
Gemini 2.5 Flash
17.1
Gemini 2.5 Pro
32.6
The Agent Company
Gemini 2.5 Flash leads by +10.8
The Agent Company · tests AI agents on realistic corporate tasks like email management, code review, data analysis, and cross-tool workflows.
Gemini 2.5 Flash
41.1
Gemini 2.5 Pro
30.3
VPCT
Gemini 2.5 Pro leads by +12.6
VPCT (Visual Pattern Completion Test) · tests visual reasoning and pattern recognition by having models complete visual sequences and transformations.
Gemini 2.5 Flash
7.0
Gemini 2.5 Pro
19.6
WeirdML
Gemini 2.5 Pro leads by +13.1
WeirdML · tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.
Gemini 2.5 Flash
41.0
Gemini 2.5 Pro
54.0
Full benchmark table
Benchmark · Gemini 2.5 Flash · Gemini 2.5 Pro
Aider polyglot · 47.1 · 83.1
ARC-AGI · 32.3 · 41.0
ARC-AGI-2 · 2.5 · 4.9
Chatbot Arena Elo · Overall · 1411.0 · 1448.2
Balrog · 33.5 · 43.3
DeepResearch Bench · 29.2 · 49.7
Fiction.LiveBench · 47.2 · 91.7
FrontierMath-2025-02-28-Private · 4.8 · 14.1
FrontierMath-Tier-4-2025-07-01-Private · 4.2 · 4.2
GeoBench · 73.0 · 81.0
HELM · GPQA · 39.0 · 74.9
HELM · IFEval · 89.8 · 84.0
HELM · MMLU-Pro · 63.9 · 86.3
HELM · Omni-MATH · 38.4 · 41.6
HELM · WildBench · 81.7 · 85.7
HLE · 7.7 · 17.7
Lech Mazur Writing · 76.5 · 86.0
OTIS Mock AIME 2024-2025 · 73.0 · 84.7
AudioMultiChallenge · 40.0 · 46.9
AudioMultiChallenge · Text Output · 40.0 · 46.9
SimpleBench · 29.4 · 54.9
Terminal Bench · 17.1 · 32.6
The Agent Company · 41.1 · 30.3
VPCT · 7.0 · 19.6
WeirdML · 41.0 · 54.0
Pricing · per 1M tokens · projected $/mo at 10M tokens
Model · Input · Output · Context · Projected $/mo
Gemini 2.5 Flash · $0.30 · $2.50 · 1.0M tokens (~524 books) · $8.50
Gemini 2.5 Pro · $1.25 · $10.00 · 1.0M tokens (~524 books) · $34.38
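The blended $/M rates and projected monthly costs above can be reproduced with simple arithmetic. A minimal sketch, assuming a 50/50 input/output mix for the blended rate and a 75% input / 25% output split for the monthly projection; both splits are inferred from the published figures, not stated by the page:

```python
def blended_per_million(inp: float, out: float) -> float:
    """Blended $/M as a simple mean of input and output price (assumed 50/50 mix)."""
    return (inp + out) / 2

def projected_monthly(inp: float, out: float, total_m: float = 10.0,
                      input_share: float = 0.75) -> float:
    """Cost at `total_m` million tokens/month with the assumed input share."""
    return total_m * (input_share * inp + (1 - input_share) * out)

flash = (0.30, 2.50)   # $/M input, $/M output
pro = (1.25, 10.00)

print(blended_per_million(*flash))   # 1.4   -> shown as $1.40/M
print(blended_per_million(*pro))     # 5.625 -> page rounds up to $5.63/M
print(projected_monthly(*flash))     # 8.5   -> $8.50/mo
print(projected_monthly(*pro))       # 34.375 -> shown as $34.38/mo
print(round(28.6 / 10.0, 1))         # 2.9 -- the "2.9x better value" figure
```

Under these assumptions every published number checks out to the cent, which suggests the dashboard uses exactly this weighting.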