
Claude 3.5 Sonnet vs Gemini 2.0 Flash

Side by side · benchmarks, pricing, and signals you can act on.

Winner summary

Claude 3.5 Sonnet wins 10 of 18 shared benchmarks, leading in the coding, arena, and knowledge categories.

Category leads
coding · Claude 3.5 Sonnet
arena · Claude 3.5 Sonnet
math · Gemini 2.0 Flash
knowledge · Claude 3.5 Sonnet
language · Claude 3.5 Sonnet
reasoning · Gemini 2.0 Flash
agentic · Claude 3.5 Sonnet
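The 10-of-18 tally can be checked directly from the per-benchmark scores listed further down this page; a minimal sketch in Python:

```python
# Score pairs are (Claude 3.5 Sonnet, Gemini 2.0 Flash), in the order
# the benchmarks appear in the head-to-head section of this page.
scores = [
    (51.6, 38.2),      # Aider polyglot
    (1371.4, 1360.0),  # Chatbot Arena Elo · Overall
    (48.0, 30.0),      # CadEval
    (1.0, 1.7),        # FrontierMath (Feb 2025)
    (62.0, 77.0),      # GeoBench
    (38.7, 52.2),      # GPQA diamond
    (56.5, 55.6),      # HELM · GPQA
    (85.6, 84.1),      # HELM · IFEval
    (77.7, 73.7),      # HELM · MMLU-Pro
    (27.6, 45.9),      # HELM · Omni-MATH
    (79.2, 80.0),      # HELM · WildBench
    (80.3, 71.5),      # Lech Mazur Writing
    (51.7, 82.2),      # MATH level 5
    (82.0, 72.9),      # MMLU
    (6.4, 31.0),       # OTIS Mock AIME 2024-2025
    (13.0, 17.3),      # SimpleBench
    (24.0, 11.4),      # The Agent Company
    (31.0, 25.8),      # WeirdML
]
claude_wins = sum(c > g for c, g in scores)
gemini_wins = sum(g > c for c, g in scores)
print(claude_wins, gemini_wins)  # 10 8
```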
Hype vs Reality
Claude 3.5 Sonnet · #129 by performance · no hype signal · QUIET
Gemini 2.0 Flash · #101 by performance · no hype signal · QUIET
Best value
Claude 3.5 Sonnet · no price data
Gemini 2.0 Flash · 192.0 pts/$ at $0.25/M
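The page does not show how pts/$ is computed. One formula consistent with the 192.0 figure is an aggregate benchmark score divided by the blended per-million token price; both the formula and the 48.0-point aggregate below are assumptions back-derived from the displayed number, not published values:

```python
def points_per_dollar(aggregate_score: float, price_per_million: float) -> float:
    """Hypothetical value metric: benchmark points per dollar of blended
    token price. The formula and the 48.0 aggregate used below are
    assumptions that merely reproduce the page's 192.0 pts/$ figure."""
    return aggregate_score / price_per_million

value = points_per_dollar(48.0, 0.25)  # Gemini 2.0 Flash at $0.25/M
print(value)  # 192.0
```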
Vendor risk
Anthropic · $380.0B · Tier 1 · Medium risk
Google DeepMind · $4.00T · Tier 1 · Low risk
Head to head
Claude 3.5 Sonnet vs Gemini 2.0 Flash
Aider polyglot
Claude 3.5 Sonnet leads by +13.4
Aider Polyglot · measures how well AI models can edit code across multiple programming languages using the Aider coding assistant framework.
Claude 3.5 Sonnet 51.6 · Gemini 2.0 Flash 38.2
Chatbot Arena Elo · Overall
Claude 3.5 Sonnet leads by +11.4
Claude 3.5 Sonnet 1371.4 · Gemini 2.0 Flash 1360.0
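An 11.4-point Elo gap is small. The standard Elo expectation formula (a general chess-rating formula, not anything specific to this site) translates a rating difference into a head-to-head win probability:

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation: probability that A beats B,
    given a 400-point scale."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

p = elo_win_probability(1371.4, 1360.0)
print(round(p, 3))  # 0.516 — barely better than a coin flip
```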
CadEval
Claude 3.5 Sonnet leads by +18.0
CadEval · evaluates the ability to generate and reason about Computer-Aided Design code, testing spatial reasoning and engineering knowledge.
Claude 3.5 Sonnet 48.0 · Gemini 2.0 Flash 30.0
FrontierMath-2025-02-28-Private
Gemini 2.0 Flash leads by +0.7
FrontierMath (Feb 2025) · original research-level math problems created by mathematicians, testing capabilities at the boundary of current AI mathematical reasoning.
Claude 3.5 Sonnet 1.0 · Gemini 2.0 Flash 1.7
GeoBench
Gemini 2.0 Flash leads by +15.0
GeoBench · tests geographic knowledge and spatial reasoning across countries, landmarks, coordinates, and geopolitical understanding.
Claude 3.5 Sonnet 62.0 · Gemini 2.0 Flash 77.0
GPQA diamond
Gemini 2.0 Flash leads by +13.5
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
Claude 3.5 Sonnet 38.7 · Gemini 2.0 Flash 52.2
HELM · GPQA
Claude 3.5 Sonnet leads by +0.9
Claude 3.5 Sonnet 56.5 · Gemini 2.0 Flash 55.6
HELM · IFEval
Claude 3.5 Sonnet leads by +1.5
Claude 3.5 Sonnet 85.6 · Gemini 2.0 Flash 84.1
HELM · MMLU-Pro
Claude 3.5 Sonnet leads by +4.0
Claude 3.5 Sonnet 77.7 · Gemini 2.0 Flash 73.7
HELM · Omni-MATH
Gemini 2.0 Flash leads by +18.3
Claude 3.5 Sonnet 27.6 · Gemini 2.0 Flash 45.9
HELM · WildBench
Gemini 2.0 Flash leads by +0.8
Claude 3.5 Sonnet 79.2 · Gemini 2.0 Flash 80.0
Lech Mazur Writing
Claude 3.5 Sonnet leads by +8.8
Lech Mazur Writing · evaluates creative writing ability, assessing prose quality, narrative coherence, and stylistic sophistication.
Claude 3.5 Sonnet 80.3 · Gemini 2.0 Flash 71.5
MATH level 5
Gemini 2.0 Flash leads by +30.5
MATH Level 5 · the hardest tier of the MATH benchmark, featuring competition-level problems from AMC, AIME, and Olympiad-style mathematics.
Claude 3.5 Sonnet 51.7 · Gemini 2.0 Flash 82.2
MMLU
Claude 3.5 Sonnet leads by +9.1
Massive Multitask Language Understanding · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.
Claude 3.5 Sonnet 82.0 · Gemini 2.0 Flash 72.9
OTIS Mock AIME 2024-2025
Gemini 2.0 Flash leads by +24.6
OTIS Mock AIME 2024-2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.
Claude 3.5 Sonnet 6.4 · Gemini 2.0 Flash 31.0
SimpleBench
Gemini 2.0 Flash leads by +4.3
SimpleBench · tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.
Claude 3.5 Sonnet 13.0 · Gemini 2.0 Flash 17.3
The Agent Company
Claude 3.5 Sonnet leads by +12.6
The Agent Company · tests AI agents on realistic corporate tasks like email management, code review, data analysis, and cross-tool workflows.
Claude 3.5 Sonnet 24.0 · Gemini 2.0 Flash 11.4
WeirdML
Claude 3.5 Sonnet leads by +5.2
WeirdML · tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.
Claude 3.5 Sonnet 31.0 · Gemini 2.0 Flash 25.8
Full benchmark table
Benchmark (Claude 3.5 Sonnet · Gemini 2.0 Flash)
Aider polyglot
51.6 · 38.2
Chatbot Arena Elo · Overall
1371.4 · 1360.0
CadEval
48.0 · 30.0
FrontierMath-2025-02-28-Private
1.0 · 1.7
GeoBench
62.0 · 77.0
GPQA diamond
38.7 · 52.2
HELM · GPQA
56.5 · 55.6
HELM · IFEval
85.6 · 84.1
HELM · MMLU-Pro
77.7 · 73.7
HELM · Omni-MATH
27.6 · 45.9
HELM · WildBench
79.2 · 80.0
Lech Mazur Writing
80.3 · 71.5
MATH level 5
51.7 · 82.2
MMLU
82.0 · 72.9
OTIS Mock AIME 2024-2025
6.4 · 31.0
SimpleBench
13.0 · 17.3
The Agent Company
24.0 · 11.4
WeirdML
31.0 · 25.8
Pricing · per 1M tokens · projected $/mo at 10M tokens
Model · Input · Output · Context · Projected $/mo
Claude 3.5 Sonnet · no price data
Gemini 2.0 Flash · $0.10 · $0.40 · 1.0M tokens (~500 books) · $1.75
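The $1.75/mo projection is consistent with Gemini 2.0 Flash's listed rates under a 75/25 input/output split of the 10M monthly tokens. That split is an assumption reverse-engineered from the figure, not something the page states:

```python
def monthly_cost(input_mtok: float, output_mtok: float,
                 input_price: float, output_price: float) -> float:
    """Projected monthly cost in dollars; token volumes in millions,
    prices in $ per million tokens."""
    return input_mtok * input_price + output_mtok * output_price

# Gemini 2.0 Flash listed rates: $0.10/M input, $0.40/M output.
# Assumed 7.5M input + 2.5M output = 10M tokens/month (75/25 split).
cost = monthly_cost(7.5, 2.5, 0.10, 0.40)
print(f"${cost:.2f}")  # $1.75
```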