Gemini 2.5 Flash vs Gemini 2.5 Pro
Side by side · benchmarks, pricing, and signals you can act on.
Winner summary
Gemini 2.5 Pro wins 22 of 25 shared benchmarks, leading in coding, reasoning, and arena.
Category leads
coding · Gemini 2.5 Pro
reasoning · Gemini 2.5 Pro
arena · Gemini 2.5 Pro
knowledge · Gemini 2.5 Pro
math · Gemini 2.5 Pro
language · Gemini 2.5 Flash
agentic · Gemini 2.5 Flash
Hype vs Reality
Attention vs performance
Gemini 2.5 Flash
#142 by perf · #14 by attention
Gemini 2.5 Pro
#59 by perf · no signal
Best value
Gemini 2.5 Flash
2.9x better value than Gemini 2.5 Pro
Gemini 2.5 Flash
28.6 pts/$
$1.40/M
Gemini 2.5 Pro
10.0 pts/$
$5.63/M
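The value figures above can be reproduced with simple arithmetic. A minimal sketch, assuming the blended $/M price is the plain 50/50 average of the input and output prices (an inference that matches the published $1.40 and $5.63, not something the page states) and that "2.9x better value" is the ratio of the two pts/$ figures:

```python
# Sketch: reproduce the blended prices and the value ratio shown above.
# Assumption: blended $/M = simple mean of input and output price per 1M tokens.

def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Blended price per 1M tokens as the mean of the input and output rates."""
    return (input_per_m + output_per_m) / 2

flash = blended_price(0.30, 2.50)   # Gemini 2.5 Flash ≈ $1.40/M
pro = blended_price(1.25, 10.00)    # Gemini 2.5 Pro = $5.625/M, shown as $5.63
ratio = 28.6 / 10.0                 # pts/$ ratio ≈ 2.9x
print(flash, pro, round(ratio, 1))
```

A usage-weighted blend (input tokens usually dominate) would shift both figures down, so treat the 50/50 mean as a rough comparator, not a bill estimate.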
Vendor risk
Who is behind the model
Both models · Google DeepMind · $4.00T · Tier 1
Head to head
25 benchmarks · 2 models
Gemini 2.5 Flash · Gemini 2.5 Pro
Aider Polyglot
Gemini 2.5 Pro leads by +36.0
Aider Polyglot · measures how well AI models can edit code across multiple programming languages using the Aider coding assistant framework.
Gemini 2.5 Flash
47.1
Gemini 2.5 Pro
83.1
ARC-AGI
Gemini 2.5 Pro leads by +8.7
ARC-AGI · the original Abstraction and Reasoning Corpus, testing whether AI can solve novel visual pattern recognition tasks without memorization.
Gemini 2.5 Flash
32.3
Gemini 2.5 Pro
41.0
ARC-AGI-2
Gemini 2.5 Pro leads by +2.3
ARC-AGI-2 · the second iteration of the Abstraction and Reasoning Corpus, testing novel pattern recognition and abstract reasoning without prior training data.
Gemini 2.5 Flash
2.5
Gemini 2.5 Pro
4.9
Chatbot Arena Elo · Overall
Gemini 2.5 Pro leads by +37.2
Gemini 2.5 Flash
1411.0
Gemini 2.5 Pro
1448.2
Balrog
Gemini 2.5 Pro leads by +9.8
Balrog · benchmarks AI agents on text-based adventure games, testing language understanding, strategic planning, and long-horizon reasoning.
Gemini 2.5 Flash
33.5
Gemini 2.5 Pro
43.3
DeepResearch Bench
Gemini 2.5 Pro leads by +20.5
DeepResearch Bench · evaluates AI on complex multi-step research tasks requiring information gathering, synthesis, and producing comprehensive analyses.
Gemini 2.5 Flash
29.2
Gemini 2.5 Pro
49.7
Fiction.LiveBench
Gemini 2.5 Pro leads by +44.5
Fiction.LiveBench · a continuously updated benchmark using recently published fiction to test reading comprehension and reasoning, preventing data contamination.
Gemini 2.5 Flash
47.2
Gemini 2.5 Pro
91.7
FrontierMath-2025-02-28-Private
Gemini 2.5 Pro leads by +9.3
FrontierMath (Feb 2025) · original research-level math problems created by mathematicians, testing capabilities at the boundary of current AI mathematical reasoning.
Gemini 2.5 Flash
4.8
Gemini 2.5 Pro
14.1
FrontierMath-Tier-4-2025-07-01-Private
Tied at 4.2
FrontierMath Tier 4 (Jul 2025) · the most challenging tier of frontier mathematics, containing problems that push the absolute limits of AI mathematical reasoning.
Gemini 2.5 Flash
4.2
Gemini 2.5 Pro
4.2
GeoBench
Gemini 2.5 Pro leads by +8.0
GeoBench · tests geographic knowledge and spatial reasoning across countries, landmarks, coordinates, and geopolitical understanding.
Gemini 2.5 Flash
73.0
Gemini 2.5 Pro
81.0
HELM · GPQA
Gemini 2.5 Pro leads by +35.9
Gemini 2.5 Flash
39.0
Gemini 2.5 Pro
74.9
HELM · IFEval
Gemini 2.5 Flash leads by +5.8
Gemini 2.5 Flash
89.8
Gemini 2.5 Pro
84.0
HELM · MMLU-Pro
Gemini 2.5 Pro leads by +22.4
Gemini 2.5 Flash
63.9
Gemini 2.5 Pro
86.3
HELM · Omni-MATH
Gemini 2.5 Pro leads by +3.2
Gemini 2.5 Flash
38.4
Gemini 2.5 Pro
41.6
HELM · WildBench
Gemini 2.5 Pro leads by +4.0
Gemini 2.5 Flash
81.7
Gemini 2.5 Pro
85.7
HLE
Gemini 2.5 Pro leads by +10.0
HLE (Humanity's Last Exam) · crowdsourced expert-level questions designed to be among the hardest possible challenges for AI systems across all domains.
Gemini 2.5 Flash
7.7
Gemini 2.5 Pro
17.7
Lech Mazur Writing
Gemini 2.5 Pro leads by +9.5
Lech Mazur Writing · evaluates creative writing ability, assessing prose quality, narrative coherence, and stylistic sophistication.
Gemini 2.5 Flash
76.5
Gemini 2.5 Pro
86.0
OTIS Mock AIME 2024-2025
Gemini 2.5 Pro leads by +11.7
OTIS Mock AIME 2024–2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.
Gemini 2.5 Flash
73.0
Gemini 2.5 Pro
84.7
AudioMultiChallenge
Gemini 2.5 Pro leads by +6.9
Gemini 2.5 Flash
40.0
Gemini 2.5 Pro
46.9
AudioMultiChallenge · Text Output
Gemini 2.5 Pro leads by +6.9
Gemini 2.5 Flash
40.0
Gemini 2.5 Pro
46.9
SimpleBench
Gemini 2.5 Pro leads by +25.4
SimpleBench · tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.
Gemini 2.5 Flash
29.4
Gemini 2.5 Pro
54.9
Terminal Bench
Gemini 2.5 Pro leads by +15.5
Terminal Bench · tests the ability to accomplish real-world tasks using terminal commands, evaluating shell scripting and CLI tool proficiency.
Gemini 2.5 Flash
17.1
Gemini 2.5 Pro
32.6
The Agent Company
Gemini 2.5 Flash leads by +10.8
The Agent Company · tests AI agents on realistic corporate tasks like email management, code review, data analysis, and cross-tool workflows.
Gemini 2.5 Flash
41.1
Gemini 2.5 Pro
30.3
VPCT
Gemini 2.5 Pro leads by +12.6
VPCT (Visual Pattern Completion Test) · tests visual reasoning and pattern recognition by having models complete visual sequences and transformations.
Gemini 2.5 Flash
7.0
Gemini 2.5 Pro
19.6
WeirdML
Gemini 2.5 Pro leads by +13.1
WeirdML · tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.
Gemini 2.5 Flash
41.0
Gemini 2.5 Pro
54.0
Full benchmark table
| Benchmark | Gemini 2.5 Flash | Gemini 2.5 Pro |
|---|---|---|
| Aider Polyglot | 47.1 | 83.1 |
| ARC-AGI | 32.3 | 41.0 |
| ARC-AGI-2 | 2.5 | 4.9 |
| Chatbot Arena Elo · Overall | 1411.0 | 1448.2 |
| Balrog | 33.5 | 43.3 |
| DeepResearch Bench | 29.2 | 49.7 |
| Fiction.LiveBench | 47.2 | 91.7 |
| FrontierMath-2025-02-28-Private | 4.8 | 14.1 |
| FrontierMath-Tier-4-2025-07-01-Private | 4.2 | 4.2 |
| GeoBench | 73.0 | 81.0 |
| HELM · GPQA | 39.0 | 74.9 |
| HELM · IFEval | 89.8 | 84.0 |
| HELM · MMLU-Pro | 63.9 | 86.3 |
| HELM · Omni-MATH | 38.4 | 41.6 |
| HELM · WildBench | 81.7 | 85.7 |
| HLE | 7.7 | 17.7 |
| Lech Mazur Writing | 76.5 | 86.0 |
| OTIS Mock AIME 2024–2025 | 73.0 | 84.7 |
| AudioMultiChallenge | 40.0 | 46.9 |
| AudioMultiChallenge · Text Output | 40.0 | 46.9 |
| SimpleBench | 29.4 | 54.9 |
| Terminal Bench | 17.1 | 32.6 |
| The Agent Company | 41.1 | 30.3 |
| VPCT | 7.0 | 19.6 |
| WeirdML | 41.0 | 54.0 |
Pricing · per 1M tokens · projected $/mo at 10M tokens
| Model | Input | Output | Context | Projected $/mo |
|---|---|---|---|---|
| Gemini 2.5 Flash | $0.30 | $2.50 | 1.0M tokens (~524 books) | $8.50 |
| Gemini 2.5 Pro | $1.25 | $10.00 | 1.0M tokens (~524 books) | $34.38 |
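The projected $/mo column follows from the per-token prices. A hedged sketch, assuming a 75% input / 25% output split of the 10M monthly tokens (this split reproduces the published $8.50 and $34.38; the page does not state it explicitly):

```python
# Sketch: reproduce the projected monthly cost at 10M tokens/month.
# Assumption: 75% of tokens are input, 25% output (inferred, not stated).

def monthly_cost(input_per_m: float, output_per_m: float,
                 total_m: float = 10.0, input_share: float = 0.75) -> float:
    """Projected monthly spend given $-per-1M-token input and output prices."""
    return (total_m * input_share * input_per_m
            + total_m * (1 - input_share) * output_per_m)

print(monthly_cost(0.30, 2.50))   # Gemini 2.5 Flash ≈ $8.50
print(monthly_cost(1.25, 10.00))  # Gemini 2.5 Pro = $34.375, shown as $34.38
```

Adjust `input_share` to your own traffic mix; output-heavy workloads (long generations) will cost noticeably more than this projection.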