Claude 3.5 Sonnet vs Gemini 2.0 Flash
Side by side · benchmarks, pricing, and signals you can act on.
Winner summary
Claude 3.5 Sonnet wins on 10/18 benchmarks
Claude 3.5 Sonnet wins 10 of 18 shared benchmarks, with category leads in coding, arena, knowledge, language, and agentic.
Category leads
| Category | Leader |
|---|---|
| coding | Claude 3.5 Sonnet |
| arena | Claude 3.5 Sonnet |
| math | Gemini 2.0 Flash |
| knowledge | Claude 3.5 Sonnet |
| language | Claude 3.5 Sonnet |
| reasoning | Gemini 2.0 Flash |
| agentic | Claude 3.5 Sonnet |
Hype vs Reality
Attention vs performance
Claude 3.5 Sonnet · #129 by performance · no attention signal
Gemini 2.0 Flash · #101 by performance · no attention signal
Best value
Gemini 2.0 Flash

| Model | Value | Price |
|---|---|---|
| Claude 3.5 Sonnet | — | no price data |
| Gemini 2.0 Flash | 192.0 pts/$ | $0.25/M |
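The page does not state how the pts/$ figure is computed; a plausible reading is an aggregate benchmark score divided by a blended per-million-token price. Under that assumption, the listed 192.0 pts/$ at $0.25/M implies an aggregate score of 48.0 points for Gemini 2.0 Flash. A minimal sketch, with the formula itself an assumption:

```python
# Hypothetical "pts/$" value metric: aggregate benchmark points divided
# by a blended price per million tokens. The page does not publish its
# exact formula; this reconstruction is an assumption.

def value_score(perf_points: float, blended_price_per_m: float) -> float:
    """Benchmark points earned per dollar of blended token spend."""
    return perf_points / blended_price_per_m

# Working backwards from the card: 192.0 pts/$ x $0.25/M implies an
# aggregate performance score of 48.0 points.
print(value_score(48.0, 0.25))  # 192.0
```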
Vendor risk
Who is behind the model
| Model | Vendor | Valuation | Tier |
|---|---|---|---|
| Claude 3.5 Sonnet | Anthropic | $380.0B | Tier 1 |
| Gemini 2.0 Flash | Google DeepMind | $4.00T | Tier 1 |
Head to head
18 benchmarks · 2 models
Aider polyglot
Claude 3.5 Sonnet leads by +13.4
Aider Polyglot · measures how well AI models can edit code across multiple programming languages using the Aider coding assistant framework.
Claude 3.5 Sonnet 51.6 · Gemini 2.0 Flash 38.2
Chatbot Arena Elo · Overall
Claude 3.5 Sonnet leads by +11.4
Claude 3.5 Sonnet 1371.4 · Gemini 2.0 Flash 1360.0
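An 11.4-point Elo gap is small. Under the standard Elo model, the expected win probability for a d-point rating lead is 1/(1 + 10^(−d/400)), so this lead translates to roughly a 51.6% expected head-to-head win rate, close to a coin flip:

```python
# Reading the Elo gap via the standard Elo expected-score formula.

def elo_win_prob(delta: float) -> float:
    """Expected head-to-head win probability for a `delta`-point Elo lead."""
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))

print(round(elo_win_prob(1371.4 - 1360.0), 3))  # ~0.516
```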
CadEval
Claude 3.5 Sonnet leads by +18.0
CadEval · evaluates the ability to generate and reason about Computer-Aided Design code, testing spatial reasoning and engineering knowledge.
Claude 3.5 Sonnet 48.0 · Gemini 2.0 Flash 30.0
FrontierMath-2025-02-28-Private
Gemini 2.0 Flash leads by +0.7
FrontierMath (Feb 2025) · original research-level math problems created by mathematicians, testing capabilities at the boundary of current AI mathematical reasoning.
Claude 3.5 Sonnet 1.0 · Gemini 2.0 Flash 1.7
GeoBench
Gemini 2.0 Flash leads by +15.0
GeoBench · tests geographic knowledge and spatial reasoning across countries, landmarks, coordinates, and geopolitical understanding.
Claude 3.5 Sonnet 62.0 · Gemini 2.0 Flash 77.0
GPQA diamond
Gemini 2.0 Flash leads by +13.5
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
Claude 3.5 Sonnet 38.7 · Gemini 2.0 Flash 52.2
HELM · GPQA
Claude 3.5 Sonnet leads by +0.9
Claude 3.5 Sonnet 56.5 · Gemini 2.0 Flash 55.6
HELM · IFEval
Claude 3.5 Sonnet leads by +1.5
Claude 3.5 Sonnet 85.6 · Gemini 2.0 Flash 84.1
HELM · MMLU-Pro
Claude 3.5 Sonnet leads by +4.0
Claude 3.5 Sonnet 77.7 · Gemini 2.0 Flash 73.7
HELM · Omni-MATH
Gemini 2.0 Flash leads by +18.3
Claude 3.5 Sonnet 27.6 · Gemini 2.0 Flash 45.9
HELM · WildBench
Gemini 2.0 Flash leads by +0.8
Claude 3.5 Sonnet 79.2 · Gemini 2.0 Flash 80.0
Lech Mazur Writing
Claude 3.5 Sonnet leads by +8.8
Lech Mazur Writing · evaluates creative writing ability, assessing prose quality, narrative coherence, and stylistic sophistication.
Claude 3.5 Sonnet 80.3 · Gemini 2.0 Flash 71.5
MATH level 5
Gemini 2.0 Flash leads by +30.5
MATH Level 5 · the hardest tier of the MATH benchmark, featuring competition-level problems from AMC, AIME, and Olympiad-style mathematics.
Claude 3.5 Sonnet 51.7 · Gemini 2.0 Flash 82.2
MMLU
Claude 3.5 Sonnet leads by +9.1
Massive Multitask Language Understanding · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.
Claude 3.5 Sonnet 82.0 · Gemini 2.0 Flash 72.9
OTIS Mock AIME 2024-2025
Gemini 2.0 Flash leads by +24.6
OTIS Mock AIME 2024-2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.
Claude 3.5 Sonnet 6.4 · Gemini 2.0 Flash 31.0
SimpleBench
Gemini 2.0 Flash leads by +4.3
SimpleBench · tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.
Claude 3.5 Sonnet 13.0 · Gemini 2.0 Flash 17.3
The Agent Company
Claude 3.5 Sonnet leads by +12.6
The Agent Company · tests AI agents on realistic corporate tasks like email management, code review, data analysis, and cross-tool workflows.
Claude 3.5 Sonnet 24.0 · Gemini 2.0 Flash 11.4
WeirdML
Claude 3.5 Sonnet leads by +5.2
WeirdML · tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.
Claude 3.5 Sonnet 31.0 · Gemini 2.0 Flash 25.8
Full benchmark table
| Benchmark | Claude 3.5 Sonnet | Gemini 2.0 Flash |
|---|---|---|
| Aider polyglot | 51.6 | 38.2 |
| Chatbot Arena Elo · Overall | 1371.4 | 1360.0 |
| CadEval | 48.0 | 30.0 |
| FrontierMath-2025-02-28-Private | 1.0 | 1.7 |
| GeoBench | 62.0 | 77.0 |
| GPQA diamond | 38.7 | 52.2 |
| HELM · GPQA | 56.5 | 55.6 |
| HELM · IFEval | 85.6 | 84.1 |
| HELM · MMLU-Pro | 77.7 | 73.7 |
| HELM · Omni-MATH | 27.6 | 45.9 |
| HELM · WildBench | 79.2 | 80.0 |
| Lech Mazur Writing | 80.3 | 71.5 |
| MATH level 5 | 51.7 | 82.2 |
| MMLU | 82.0 | 72.9 |
| OTIS Mock AIME 2024-2025 | 6.4 | 31.0 |
| SimpleBench | 13.0 | 17.3 |
| The Agent Company | 24.0 | 11.4 |
| WeirdML | 31.0 | 25.8 |
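The 10/18 headline can be checked directly against this table. A minimal tally, with scores copied verbatim from the rows above:

```python
# Reproducing the "10/18" winner summary: count the benchmarks where
# each model posts the higher score.

scores = {  # benchmark: (Claude 3.5 Sonnet, Gemini 2.0 Flash)
    "Aider polyglot": (51.6, 38.2),
    "Chatbot Arena Elo · Overall": (1371.4, 1360.0),
    "CadEval": (48.0, 30.0),
    "FrontierMath-2025-02-28-Private": (1.0, 1.7),
    "GeoBench": (62.0, 77.0),
    "GPQA diamond": (38.7, 52.2),
    "HELM · GPQA": (56.5, 55.6),
    "HELM · IFEval": (85.6, 84.1),
    "HELM · MMLU-Pro": (77.7, 73.7),
    "HELM · Omni-MATH": (27.6, 45.9),
    "HELM · WildBench": (79.2, 80.0),
    "Lech Mazur Writing": (80.3, 71.5),
    "MATH level 5": (51.7, 82.2),
    "MMLU": (82.0, 72.9),
    "OTIS Mock AIME 2024-2025": (6.4, 31.0),
    "SimpleBench": (13.0, 17.3),
    "The Agent Company": (24.0, 11.4),
    "WeirdML": (31.0, 25.8),
}

claude_wins = sum(c > g for c, g in scores.values())
print(f"Claude 3.5 Sonnet wins {claude_wins}/{len(scores)}")  # 10/18
```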
Pricing · per 1M tokens · projected $/mo at 10M tokens
| Model | Input | Output | Context | Projected $/mo |
|---|---|---|---|---|
| Claude 3.5 Sonnet | — | — | — | — |
| Gemini 2.0 Flash | $0.10 | $0.40 | 1.0M tokens (~500 books) | $1.75 |
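The $1.75 projection for Gemini 2.0 Flash is consistent with a 75/25 input/output token split across 10M monthly tokens. That split is an assumption, since the page does not state one; note the Best value card's $0.25/M blended price instead matches a 50/50 mix, so the two cards appear to assume different mixes. A sketch:

```python
# Hypothetical reconstruction of the "projected $/mo" column: 10M monthly
# tokens with an assumed 75/25 input/output split, which is the split
# that reproduces the listed $1.75 for Gemini 2.0 Flash.

def monthly_cost(input_per_m: float, output_per_m: float,
                 total_m: float = 10.0, input_share: float = 0.75) -> float:
    """Projected monthly spend in dollars for `total_m` million tokens."""
    in_m = total_m * input_share
    out_m = total_m * (1.0 - input_share)
    return in_m * input_per_m + out_m * output_per_m

print(monthly_cost(0.10, 0.40))  # 7.5*0.10 + 2.5*0.40 = 1.75
```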