Gemini 2.5 Pro vs Gemini 2.5 Flash
Side by side · benchmarks, pricing, and signals you can act on.
Winner summary
Gemini 2.5 Pro leads on 22 of 25 benchmarks
Gemini 2.5 Pro leads 22 of the 25 shared benchmarks below (Gemini 2.5 Flash leads 2, and 1 is a tie), with category leads in coding · reasoning · arena.
Category leads
coding · Gemini 2.5 Pro
reasoning · Gemini 2.5 Pro
arena · Gemini 2.5 Pro
knowledge · Gemini 2.5 Pro
math · Gemini 2.5 Pro
language · Gemini 2.5 Flash
agentic · Gemini 2.5 Flash
Hype vs Reality
Attention vs performance
Gemini 2.5 Pro
#59 by performance · no attention signal
Gemini 2.5 Flash
#142 by performance · #14 by attention
Best value
Gemini 2.5 Flash
2.9x better value than Gemini 2.5 Pro
Gemini 2.5 Pro
10.0 pts/$ · $5.63 per 1M tokens (blended)
Gemini 2.5 Flash
28.6 pts/$ · $1.40 per 1M tokens (blended)
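
The pts/$ figures above appear to be a benchmark score divided by a blended per-million-token price; the blend is not stated on the page, but a plain 50/50 average of the input and output list prices from the pricing table below reproduces the $5.63/M and $1.40/M figures. A minimal sketch under that assumption (the 50/50 split and the helper name are illustrative, not the site's published methodology):

```python
# Blended $/1M tokens assuming a 50/50 input/output mix (assumption, not stated on the page).
def blended_price(input_per_m: float, output_per_m: float) -> float:
    return (input_per_m + output_per_m) / 2

pro_blended = blended_price(1.25, 10.00)   # 5.625 -> shown as $5.63/M
flash_blended = blended_price(0.30, 2.50)  # 1.40  -> shown as $1.40/M

# Value ratio quoted above, from the page's points-per-dollar figures.
pro_pts_per_dollar, flash_pts_per_dollar = 10.0, 28.6
print(round(flash_pts_per_dollar / pro_pts_per_dollar, 1))  # 2.9 -> "2.9x better value"
```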
Vendor risk
Who is behind the model
Google DeepMind (both models)
$4.00T · Tier 1
Head to head
25 benchmarks · 2 models
Gemini 2.5 Pro · Gemini 2.5 Flash
Aider Polyglot
Gemini 2.5 Pro leads by +36.0
Aider Polyglot · measures how well AI models can edit code across multiple programming languages using the Aider coding assistant framework.
Gemini 2.5 Pro
83.1
Gemini 2.5 Flash
47.1
ARC-AGI
Gemini 2.5 Pro leads by +8.7
ARC-AGI · the original Abstraction and Reasoning Corpus, testing whether AI can solve novel visual pattern recognition tasks without memorization.
Gemini 2.5 Pro
41.0
Gemini 2.5 Flash
32.3
ARC-AGI-2
Gemini 2.5 Pro leads by +2.3
ARC-AGI-2 · the second iteration of the Abstraction and Reasoning Corpus, testing novel pattern recognition and abstract reasoning without prior training data.
Gemini 2.5 Pro
4.9
Gemini 2.5 Flash
2.5
Chatbot Arena Elo · Overall
Gemini 2.5 Pro leads by +37.2
Gemini 2.5 Pro
1448.2
Gemini 2.5 Flash
1411.0
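
Chatbot Arena ratings are Elo-style, so the gap can be read as an expected head-to-head preference rate. A minimal sketch applying the standard Elo expectation formula to the two ratings above (an illustration of the formula, not an official Arena statistic):

```python
# Expected preference rate implied by an Elo-style rating gap (standard Elo formula).
pro_elo, flash_elo = 1448.2, 1411.0
expected_pro_preference = 1 / (1 + 10 ** ((flash_elo - pro_elo) / 400))
print(f"{expected_pro_preference:.1%}")  # ~55.3% — a +37.2 gap is a modest but consistent edge
```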
Balrog
Gemini 2.5 Pro leads by +9.8
Balrog · benchmarks AI agents on text-based adventure games, testing language understanding, strategic planning, and long-horizon reasoning.
Gemini 2.5 Pro
43.3
Gemini 2.5 Flash
33.5
DeepResearch Bench
Gemini 2.5 Pro leads by +20.5
DeepResearch Bench · evaluates AI on complex multi-step research tasks requiring information gathering, synthesis, and producing comprehensive analyses.
Gemini 2.5 Pro
49.7
Gemini 2.5 Flash
29.2
Fiction.LiveBench
Gemini 2.5 Pro leads by +44.5
Fiction.LiveBench · a continuously updated benchmark using recently published fiction to test reading comprehension and reasoning, preventing data contamination.
Gemini 2.5 Pro
91.7
Gemini 2.5 Flash
47.2
FrontierMath-2025-02-28-Private
Gemini 2.5 Pro leads by +9.3
FrontierMath (Feb 2025) · original research-level math problems created by mathematicians, testing capabilities at the boundary of current AI mathematical reasoning.
Gemini 2.5 Pro
14.1
Gemini 2.5 Flash
4.8
FrontierMath-Tier-4-2025-07-01-Private
Both models tie at 4.2
FrontierMath Tier 4 (Jul 2025) · the most challenging tier of frontier mathematics, containing problems that push the absolute limits of AI mathematical reasoning.
Gemini 2.5 Pro
4.2
Gemini 2.5 Flash
4.2
GeoBench
Gemini 2.5 Pro leads by +8.0
GeoBench · tests geographic knowledge and spatial reasoning across countries, landmarks, coordinates, and geopolitical understanding.
Gemini 2.5 Pro
81.0
Gemini 2.5 Flash
73.0
HELM · GPQA
Gemini 2.5 Pro leads by +35.9
Gemini 2.5 Pro
74.9
Gemini 2.5 Flash
39.0
HELM · IFEval
Gemini 2.5 Flash leads by +5.8
Gemini 2.5 Pro
84.0
Gemini 2.5 Flash
89.8
HELM · MMLU-Pro
Gemini 2.5 Pro leads by +22.4
Gemini 2.5 Pro
86.3
Gemini 2.5 Flash
63.9
HELM · Omni-MATH
Gemini 2.5 Pro leads by +3.2
Gemini 2.5 Pro
41.6
Gemini 2.5 Flash
38.4
HELM · WildBench
Gemini 2.5 Pro leads by +4.0
Gemini 2.5 Pro
85.7
Gemini 2.5 Flash
81.7
HLE
Gemini 2.5 Pro leads by +10.0
HLE (Humanity's Last Exam) · crowdsourced expert-level questions designed to be among the hardest possible challenges for AI systems across all domains.
Gemini 2.5 Pro
17.7
Gemini 2.5 Flash
7.7
Lech Mazur Writing
Gemini 2.5 Pro leads by +9.5
Lech Mazur Writing · evaluates creative writing ability, assessing prose quality, narrative coherence, and stylistic sophistication.
Gemini 2.5 Pro
86.0
Gemini 2.5 Flash
76.5
OTIS Mock AIME 2024-2025
Gemini 2.5 Pro leads by +11.7
OTIS Mock AIME 2024–2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.
Gemini 2.5 Pro
84.7
Gemini 2.5 Flash
73.0
AudioMultiChallenge
Gemini 2.5 Pro leads by +6.9
Gemini 2.5 Pro
46.9
Gemini 2.5 Flash
40.0
AudioMultiChallenge · Text Output
Gemini 2.5 Pro leads by +6.9
Gemini 2.5 Pro
46.9
Gemini 2.5 Flash
40.0
SimpleBench
Gemini 2.5 Pro leads by +25.4
SimpleBench · tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.
Gemini 2.5 Pro
54.9
Gemini 2.5 Flash
29.4
Terminal Bench
Gemini 2.5 Pro leads by +15.5
Terminal Bench · tests the ability to accomplish real-world tasks using terminal commands, evaluating shell scripting and CLI tool proficiency.
Gemini 2.5 Pro
32.6
Gemini 2.5 Flash
17.1
The Agent Company
Gemini 2.5 Flash leads by +10.8
The Agent Company · tests AI agents on realistic corporate tasks like email management, code review, data analysis, and cross-tool workflows.
Gemini 2.5 Pro
30.3
Gemini 2.5 Flash
41.1
VPCT
Gemini 2.5 Pro leads by +12.6
VPCT (Visual Pattern Completion Test) · tests visual reasoning and pattern recognition by having models complete visual sequences and transformations.
Gemini 2.5 Pro
19.6
Gemini 2.5 Flash
7.0
WeirdML
Gemini 2.5 Pro leads by +13.1
WeirdML · tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.
Gemini 2.5 Pro
54.0
Gemini 2.5 Flash
41.0
Full benchmark table
| Benchmark | Gemini 2.5 Pro | Gemini 2.5 Flash |
|---|---|---|
| Aider Polyglot | 83.1 | 47.1 |
| ARC-AGI | 41.0 | 32.3 |
| ARC-AGI-2 | 4.9 | 2.5 |
| Chatbot Arena Elo · Overall | 1448.2 | 1411.0 |
| Balrog | 43.3 | 33.5 |
| DeepResearch Bench | 49.7 | 29.2 |
| Fiction.LiveBench | 91.7 | 47.2 |
| FrontierMath-2025-02-28-Private | 14.1 | 4.8 |
| FrontierMath-Tier-4-2025-07-01-Private | 4.2 | 4.2 |
| GeoBench | 81.0 | 73.0 |
| HELM · GPQA | 74.9 | 39.0 |
| HELM · IFEval | 84.0 | 89.8 |
| HELM · MMLU-Pro | 86.3 | 63.9 |
| HELM · Omni-MATH | 41.6 | 38.4 |
| HELM · WildBench | 85.7 | 81.7 |
| HLE | 17.7 | 7.7 |
| Lech Mazur Writing | 86.0 | 76.5 |
| OTIS Mock AIME 2024–2025 | 84.7 | 73.0 |
| AudioMultiChallenge | 46.9 | 40.0 |
| AudioMultiChallenge · Text Output | 46.9 | 40.0 |
| SimpleBench | 54.9 | 29.4 |
| Terminal Bench | 32.6 | 17.1 |
| The Agent Company | 30.3 | 41.1 |
| VPCT | 19.6 | 7.0 |
| WeirdML | 54.0 | 41.0 |
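
The head-to-head tally in the winner summary can be reproduced directly from this table. A minimal Python sketch, with scores copied from the rows above and "higher is better" assumed for every benchmark (which matches how the per-benchmark leads are reported):

```python
# (benchmark, Gemini 2.5 Pro score, Gemini 2.5 Flash score) — copied from the table above.
scores = [
    ("Aider Polyglot", 83.1, 47.1), ("ARC-AGI", 41.0, 32.3), ("ARC-AGI-2", 4.9, 2.5),
    ("Chatbot Arena Elo · Overall", 1448.2, 1411.0), ("Balrog", 43.3, 33.5),
    ("DeepResearch Bench", 49.7, 29.2), ("Fiction.LiveBench", 91.7, 47.2),
    ("FrontierMath-2025-02-28-Private", 14.1, 4.8),
    ("FrontierMath-Tier-4-2025-07-01-Private", 4.2, 4.2),
    ("GeoBench", 81.0, 73.0), ("HELM · GPQA", 74.9, 39.0), ("HELM · IFEval", 84.0, 89.8),
    ("HELM · MMLU-Pro", 86.3, 63.9), ("HELM · Omni-MATH", 41.6, 38.4),
    ("HELM · WildBench", 85.7, 81.7), ("HLE", 17.7, 7.7), ("Lech Mazur Writing", 86.0, 76.5),
    ("OTIS Mock AIME 2024-2025", 84.7, 73.0), ("AudioMultiChallenge", 46.9, 40.0),
    ("AudioMultiChallenge · Text Output", 46.9, 40.0), ("SimpleBench", 54.9, 29.4),
    ("Terminal Bench", 32.6, 17.1), ("The Agent Company", 30.3, 41.1),
    ("VPCT", 19.6, 7.0), ("WeirdML", 54.0, 41.0),
]

pro_wins = sum(pro > flash for _, pro, flash in scores)
flash_wins = sum(flash > pro for _, pro, flash in scores)
ties = sum(pro == flash for _, pro, flash in scores)
print(pro_wins, flash_wins, ties)  # 22 2 1
```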
Pricing · per 1M tokens · projected $/mo at 10M tokens
| Model | Input | Output | Context | Projected $/mo |
|---|---|---|---|---|
| Gemini 2.5 Pro | $1.25 | $10.00 | 1.0M tokens (~524 books) | $34.38 |
| Gemini 2.5 Flash | $0.30 | $2.50 | 1.0M tokens (~524 books) | $8.50 |
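
The projected monthly figures are consistent with the 10M tokens being split roughly 3:1 between input and output; that split is inferred from the numbers above rather than stated on the page. A minimal sketch under that assumption:

```python
# Projected monthly cost at 10M tokens, assuming a 3:1 input:output split
# (the split is inferred from the published figures, not stated on the page).
def monthly_cost(input_per_m: float, output_per_m: float,
                 total_m_tokens: float = 10.0, input_share: float = 0.75) -> float:
    input_m = total_m_tokens * input_share
    output_m = total_m_tokens * (1 - input_share)
    return input_m * input_per_m + output_m * output_per_m

print(monthly_cost(1.25, 10.00))  # 34.375 -> shown as $34.38/mo (Gemini 2.5 Pro)
print(monthly_cost(0.30, 2.50))   # 8.5    -> shown as $8.50/mo (Gemini 2.5 Flash)
```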