Gemini 1.5 Flash (May 2024) vs Llama 3.1 405B
Side by side · benchmarks, pricing, and signals you can act on.
Winner summary
Llama 3.1 405B wins 4 of 6 shared benchmarks, leading in knowledge and math.
Category leads
Knowledge · Llama 3.1 405B
Math · Llama 3.1 405B
Coding · Gemini 1.5 Flash (May 2024)
Hype vs Reality
Attention vs performance
Gemini 1.5 Flash (May 2024) · #105 by performance · no attention signal
Llama 3.1 405B · #153 by performance · no attention signal
Best value
Pricing unknown
Gemini 1.5 Flash (May 2024) · no price listed
Llama 3.1 405B · no price listed
Vendor risk
Who is behind the model
Gemini 1.5 Flash (May 2024) · Google DeepMind · $4.00T · Tier 1
Llama 3.1 405B · Meta AI · $1.50T · Tier 1
Head to head
6 benchmarks · 2 models
GPQA diamond
Llama 3.1 405B leads by +14.0
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
Gemini 1.5 Flash (May 2024) · 20.5
Llama 3.1 405B · 34.5
MATH level 5
Llama 3.1 405B leads by +24.7
MATH Level 5 · the hardest tier of the MATH benchmark, featuring competition-level problems from AMC, AIME, and Olympiad-style mathematics.
Gemini 1.5 Flash (May 2024) · 25.1
Llama 3.1 405B · 49.8
MMLU
Llama 3.1 405B leads by +8.8
Massive Multitask Language Understanding · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.
Gemini 1.5 Flash (May 2024) · 70.5
Llama 3.1 405B · 79.3
OTIS Mock AIME 2024-2025
Llama 3.1 405B leads by +5.8
OTIS Mock AIME 2024-2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.
Gemini 1.5 Flash (May 2024) · 3.8
Llama 3.1 405B · 9.6
PIQA
Gemini 1.5 Flash (May 2024) leads by +3.2
PIQA (Physical Interaction QA) · tests intuitive physical reasoning by asking models to select the correct approach for everyday physical tasks.
Gemini 1.5 Flash (May 2024) · 75.0
Llama 3.1 405B · 71.8
WeirdML
Gemini 1.5 Flash (May 2024) leads by +3.5
WeirdML · tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.
Gemini 1.5 Flash (May 2024) · 24.9
Llama 3.1 405B · 21.4
Full benchmark table
| Benchmark | Gemini 1.5 Flash (May 2024) | Llama 3.1 405B |
|---|---|---|
| GPQA diamond | 20.5 | 34.5 |
| MATH level 5 | 25.1 | 49.8 |
| MMLU | 70.5 | 79.3 |
| OTIS Mock AIME 2024-2025 | 3.8 | 9.6 |
| PIQA | 75.0 | 71.8 |
| WeirdML | 24.9 | 21.4 |
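The win tally in the summary and the "leads by" deltas in the head-to-head cards are plain arithmetic over the shared scores. A minimal sketch that reproduces them (scores transcribed from the table above; higher is better on every benchmark listed here; the tuple ordering and variable names are choices of this sketch, not part of the page):

```python
# Shared benchmark scores, transcribed from the table above.
# Tuple order: (Gemini 1.5 Flash (May 2024), Llama 3.1 405B). Higher is better.
SCORES = {
    "GPQA diamond":             (20.5, 34.5),
    "MATH level 5":             (25.1, 49.8),
    "MMLU":                     (70.5, 79.3),
    "OTIS Mock AIME 2024-2025": (3.8, 9.6),
    "PIQA":                     (75.0, 71.8),
    "WeirdML":                  (24.9, 21.4),
}
MODELS = ("Gemini 1.5 Flash (May 2024)", "Llama 3.1 405B")

wins = {m: 0 for m in MODELS}
for bench, (gemini, llama) in SCORES.items():
    leader = MODELS[1] if llama > gemini else MODELS[0]
    wins[leader] += 1
    print(f"{bench}: {leader} leads by +{abs(llama - gemini):.1f}")

for model, count in wins.items():
    print(f"{model} wins {count} of {len(SCORES)} shared benchmarks")
```

Running this reproduces the +14.0, +24.7, +8.8, +5.8, +3.2, and +3.5 leads shown in the cards and the 4-of-6 tally in the winner summary.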
Pricing · per 1M tokens · projected $/mo at 10M tokens
| Model | Input | Output | Context | Projected $/mo |
|---|---|---|---|---|
| Gemini 1.5 Flash (May 2024) | — | — | — | — |
| Llama 3.1 405B | — | — | — | — |
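No prices are listed for either model, so the projected column stays empty, but the "projected $/mo at 10M tokens" figure is straightforward once per-1M-token rates are known. A hedged sketch, assuming an even input/output split of the monthly volume (the page does not state how it divides the 10M tokens) and purely hypothetical rates for illustration:

```python
def projected_monthly_cost(input_per_1m: float, output_per_1m: float,
                           monthly_tokens: int = 10_000_000,
                           input_share: float = 0.5) -> float:
    """Projected monthly spend at a fixed token volume.

    `input_share` is an assumption of this sketch: the page does not say
    how the 10M monthly tokens divide between input and output.
    """
    input_tokens = monthly_tokens * input_share
    output_tokens = monthly_tokens - input_tokens
    return (input_tokens / 1_000_000) * input_per_1m \
         + (output_tokens / 1_000_000) * output_per_1m

# Hypothetical rates of $1.00 (input) and $2.00 (output) per 1M tokens,
# used only to show the arithmetic; neither model has listed pricing here.
print(f"${projected_monthly_cost(1.00, 2.00):.2f} per month at 10M tokens")  # $15.00
```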