Gemini 2.0 Flash vs Claude 3.5 Sonnet
Side by side · benchmarks, pricing, and signals you can act on.
Winner summary
Claude 3.5 Sonnet wins 10 of 18 shared benchmarks, leading in coding, arena, and knowledge.
Category leads
coding · Claude 3.5 Sonnet
arena · Claude 3.5 Sonnet
math · Gemini 2.0 Flash
knowledge · Claude 3.5 Sonnet
language · Claude 3.5 Sonnet
reasoning · Gemini 2.0 Flash
agentic · Claude 3.5 Sonnet
Hype vs Reality
Attention vs performance
Gemini 2.0 Flash · #101 by performance · no attention signal
Claude 3.5 Sonnet · #129 by performance · no attention signal
Best value
Winner · Gemini 2.0 Flash
Gemini 2.0 Flash · 192.0 pts/$ · $0.25/M
Claude 3.5 Sonnet · no price
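A plausible reading of the pts/$ figure is an aggregate benchmark score divided by the blended price per million tokens: at $0.25/M, 192.0 pts/$ would imply an aggregate score of 48.0 points. A minimal Python sketch under that assumption (the page does not state its actual aggregation method):

```python
def points_per_dollar(aggregate_score: float, blended_price_per_m: float) -> float:
    # Hypothetical value metric: benchmark points per dollar of blended
    # price per million tokens. The aggregation is an assumption, not
    # something the page documents.
    return aggregate_score / blended_price_per_m

print(points_per_dollar(48.0, 0.25))  # 192.0 pts/$, matching the card above
```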
Vendor risk
Who is behind each model
Google DeepMind · $4.00T · Tier 1
Anthropic · $380.0B · Tier 1
Head to head
18 benchmarks · 2 models
Aider Polyglot · Claude 3.5 Sonnet leads by +13.4
Gemini 2.0 Flash 38.2 · Claude 3.5 Sonnet 51.6
Measures how well AI models can edit code across multiple programming languages using the Aider coding assistant framework.
Chatbot Arena Elo · Overall · Claude 3.5 Sonnet leads by +11.4
Gemini 2.0 Flash 1360.0 · Claude 3.5 Sonnet 1371.4
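An 11.4-point Elo gap is small. Under the standard Elo expectation formula it is close to a coin flip; a minimal sketch (a simplification: the Arena's published rankings come from a more involved Bradley-Terry fit with confidence intervals):

```python
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """Expected probability that model A is preferred, under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# Claude 3.5 Sonnet (1371.4) vs Gemini 2.0 Flash (1360.0):
print(round(elo_win_prob(1371.4, 1360.0), 3))  # ~0.516, nearly a coin flip
```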
CadEval · Claude 3.5 Sonnet leads by +18.0
Gemini 2.0 Flash 30.0 · Claude 3.5 Sonnet 48.0
Evaluates the ability to generate and reason about Computer-Aided Design code, testing spatial reasoning and engineering knowledge.
FrontierMath-2025-02-28-Private · Gemini 2.0 Flash leads by +0.7
Gemini 2.0 Flash 1.7 · Claude 3.5 Sonnet 1.0
FrontierMath (Feb 2025) · original research-level math problems created by mathematicians, testing capabilities at the boundary of current AI mathematical reasoning.
GeoBench · Gemini 2.0 Flash leads by +15.0
Gemini 2.0 Flash 77.0 · Claude 3.5 Sonnet 62.0
Tests geographic knowledge and spatial reasoning across countries, landmarks, coordinates, and geopolitical understanding.
GPQA Diamond · Gemini 2.0 Flash leads by +13.5
Gemini 2.0 Flash 52.2 · Claude 3.5 Sonnet 38.7
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
HELM · GPQA · Claude 3.5 Sonnet leads by +0.9
Gemini 2.0 Flash 55.6 · Claude 3.5 Sonnet 56.5
HELM · IFEval · Claude 3.5 Sonnet leads by +1.5
Gemini 2.0 Flash 84.1 · Claude 3.5 Sonnet 85.6
HELM · MMLU-Pro · Claude 3.5 Sonnet leads by +4.0
Gemini 2.0 Flash 73.7 · Claude 3.5 Sonnet 77.7
HELM · Omni-MATH · Gemini 2.0 Flash leads by +18.3
Gemini 2.0 Flash 45.9 · Claude 3.5 Sonnet 27.6
HELM · WildBench · Gemini 2.0 Flash leads by +0.8
Gemini 2.0 Flash 80.0 · Claude 3.5 Sonnet 79.2
Lech Mazur Writing · Claude 3.5 Sonnet leads by +8.8
Gemini 2.0 Flash 71.5 · Claude 3.5 Sonnet 80.3
Evaluates creative writing ability, assessing prose quality, narrative coherence, and stylistic sophistication.
MATH Level 5 · Gemini 2.0 Flash leads by +30.5
Gemini 2.0 Flash 82.2 · Claude 3.5 Sonnet 51.7
The hardest tier of the MATH benchmark, featuring competition-level problems from AMC, AIME, and Olympiad-style mathematics.
MMLU · Claude 3.5 Sonnet leads by +9.1
Gemini 2.0 Flash 72.9 · Claude 3.5 Sonnet 82.0
Massive Multitask Language Understanding · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.
OTIS Mock AIME 2024-2025 · Gemini 2.0 Flash leads by +24.6
Gemini 2.0 Flash 31.0 · Claude 3.5 Sonnet 6.4
Simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.
SimpleBench · Gemini 2.0 Flash leads by +4.3
Gemini 2.0 Flash 17.3 · Claude 3.5 Sonnet 13.0
Tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.
The Agent Company · Claude 3.5 Sonnet leads by +12.6
Gemini 2.0 Flash 11.4 · Claude 3.5 Sonnet 24.0
Tests AI agents on realistic corporate tasks like email management, code review, data analysis, and cross-tool workflows.
WeirdML · Claude 3.5 Sonnet leads by +5.2
Gemini 2.0 Flash 25.8 · Claude 3.5 Sonnet 31.0
Tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.
Full benchmark table
| Benchmark | Gemini 2.0 Flash | Claude 3.5 Sonnet |
|---|---|---|
| Aider Polyglot | 38.2 | 51.6 |
| Chatbot Arena Elo · Overall | 1360.0 | 1371.4 |
| CadEval | 30.0 | 48.0 |
| FrontierMath-2025-02-28-Private | 1.7 | 1.0 |
| GeoBench | 77.0 | 62.0 |
| GPQA Diamond | 52.2 | 38.7 |
| HELM · GPQA | 55.6 | 56.5 |
| HELM · IFEval | 84.1 | 85.6 |
| HELM · MMLU-Pro | 73.7 | 77.7 |
| HELM · Omni-MATH | 45.9 | 27.6 |
| HELM · WildBench | 80.0 | 79.2 |
| Lech Mazur Writing | 71.5 | 80.3 |
| MATH Level 5 | 82.2 | 51.7 |
| MMLU | 72.9 | 82.0 |
| OTIS Mock AIME 2024-2025 | 31.0 | 6.4 |
| SimpleBench | 17.3 | 13.0 |
| The Agent Company | 11.4 | 24.0 |
| WeirdML | 25.8 | 31.0 |
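The 10-of-18 tally in the winner summary can be reproduced from this table. A minimal sketch with the score pairs copied from the rows above (every benchmark listed is higher-is-better):

```python
# (benchmark, gemini_2_0_flash, claude_3_5_sonnet) — values from the table above
scores = [
    ("Aider Polyglot", 38.2, 51.6),
    ("Chatbot Arena Elo · Overall", 1360.0, 1371.4),
    ("CadEval", 30.0, 48.0),
    ("FrontierMath-2025-02-28-Private", 1.7, 1.0),
    ("GeoBench", 77.0, 62.0),
    ("GPQA Diamond", 52.2, 38.7),
    ("HELM · GPQA", 55.6, 56.5),
    ("HELM · IFEval", 84.1, 85.6),
    ("HELM · MMLU-Pro", 73.7, 77.7),
    ("HELM · Omni-MATH", 45.9, 27.6),
    ("HELM · WildBench", 80.0, 79.2),
    ("Lech Mazur Writing", 71.5, 80.3),
    ("MATH Level 5", 82.2, 51.7),
    ("MMLU", 72.9, 82.0),
    ("OTIS Mock AIME 2024-2025", 31.0, 6.4),
    ("SimpleBench", 17.3, 13.0),
    ("The Agent Company", 11.4, 24.0),
    ("WeirdML", 25.8, 31.0),
]
claude_wins = sum(claude > gemini for _, gemini, claude in scores)
print(f"Claude 3.5 Sonnet wins {claude_wins} of {len(scores)}")  # 10 of 18
```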
Pricing · per 1M tokens · projected $/mo at 10M tokens
| Model | Input | Output | Context | Projected $/mo |
|---|---|---|---|---|
| Gemini 2.0 Flash | $0.10 | $0.40 | 1.0M tokens (~500 books) | $1.75 |
| Claude 3.5 Sonnet | — | — | — | — |
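The $1.75 projection is consistent with a 3:1 input-to-output split of the 10M monthly tokens (7.5M × $0.10 + 2.5M × $0.40). The page does not state its split, so the ratio in this sketch is an assumption:

```python
def projected_monthly_cost(input_price: float, output_price: float,
                           total_tokens_m: float = 10.0,
                           input_share: float = 0.75) -> float:
    # Assumed cost model: split the monthly token budget between input and
    # output, then price each side per million tokens. The 75% input share
    # is inferred from the table's $1.75 figure, not stated by the page.
    input_m = total_tokens_m * input_share
    output_m = total_tokens_m - input_m
    return input_m * input_price + output_m * output_price

# Gemini 2.0 Flash: 7.5M * $0.10 + 2.5M * $0.40 = $0.75 + $1.00 = $1.75
print(projected_monthly_cost(0.10, 0.40))  # 1.75, matching the table
```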