Gemini 1.5 Pro (Feb 2024) vs Llama 3.1 405B

Side by side · benchmarks, pricing, and signals you can act on.

Winner summary

Llama 3.1 405B wins 5 of 9 shared benchmarks; Gemini 1.5 Pro (Feb 2024) wins 3, and 1 (Cybench) is tied. Llama 3.1 405B leads in knowledge, math, and agentic.

Category leads
reasoning · Gemini 1.5 Pro (Feb 2024)
coding · Gemini 1.5 Pro (Feb 2024)
knowledge · Llama 3.1 405B
math · Llama 3.1 405B
agentic · Llama 3.1 405B
Hype vs Reality
Gemini 1.5 Pro (Feb 2024)
#138 by perf · no signal · QUIET
Llama 3.1 405B
#153 by perf · no signal · QUIET
Best value
Gemini 1.5 Pro (Feb 2024)
no price
Llama 3.1 405B
no price
Vendor risk
Google DeepMind
$4.00T · Tier 1
Low risk
Meta AI
$1.50T · Tier 1
Low risk
Head to head
Gemini 1.5 Pro (Feb 2024) · Llama 3.1 405B
BBH
Gemini 1.5 Pro (Feb 2024) leads by +1.5
BIG-Bench Hard · a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average humans.
Gemini 1.5 Pro (Feb 2024)
78.7
Llama 3.1 405B
77.2
Cybench
Tied at 7.5
Cybench · evaluates AI on real Capture-The-Flag cybersecurity challenges, testing vulnerability analysis, exploitation, and security reasoning.
Gemini 1.5 Pro (Feb 2024)
7.5
Llama 3.1 405B
7.5
GPQA diamond
Llama 3.1 405B leads by +6.7
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
Gemini 1.5 Pro (Feb 2024)
27.8
Llama 3.1 405B
34.5
MATH level 5
Llama 3.1 405B leads by +9.0
MATH Level 5 · the hardest tier of the MATH benchmark, featuring competition-level problems from AMC, AIME, and Olympiad-style mathematics.
Gemini 1.5 Pro (Feb 2024)
40.8
Llama 3.1 405B
49.8
MMLU
Llama 3.1 405B leads by +2.4
Massive Multitask Language Understanding · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.
Gemini 1.5 Pro (Feb 2024)
76.9
Llama 3.1 405B
79.3
OTIS Mock AIME 2024-2025
Llama 3.1 405B leads by +2.9
OTIS Mock AIME 2024-2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.
Gemini 1.5 Pro (Feb 2024)
6.7
Llama 3.1 405B
9.6
SimpleBench
Gemini 1.5 Pro (Feb 2024) leads by +4.9
SimpleBench · tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.
Gemini 1.5 Pro (Feb 2024)
12.5
Llama 3.1 405B
7.6
The Agent Company
Llama 3.1 405B leads by +4.0
The Agent Company · tests AI agents on realistic corporate tasks like email management, code review, data analysis, and cross-tool workflows.
Gemini 1.5 Pro (Feb 2024)
3.4
Llama 3.1 405B
7.4
WeirdML
Gemini 1.5 Pro (Feb 2024) leads by +0.8
WeirdML · tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.
Gemini 1.5 Pro (Feb 2024)
22.2
Llama 3.1 405B
21.4
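
The "leads by" margins and the 5-of-9 win count can be rechecked from the scores listed above. The following is a minimal sketch, not the site's own method: it assumes a higher score is better on every benchmark and that margins are rounded to one decimal place. The names `SCORES`, `lead`, and `tally` are hypothetical.

```python
# Scores copied from the head-to-head section above.
# Tuple order: (Gemini 1.5 Pro (Feb 2024), Llama 3.1 405B)
GEMINI = "Gemini 1.5 Pro (Feb 2024)"
LLAMA = "Llama 3.1 405B"

SCORES = {
    "BBH": (78.7, 77.2),
    "Cybench": (7.5, 7.5),
    "GPQA diamond": (27.8, 34.5),
    "MATH level 5": (40.8, 49.8),
    "MMLU": (76.9, 79.3),
    "OTIS Mock AIME 2024-2025": (6.7, 9.6),
    "SimpleBench": (12.5, 7.6),
    "The Agent Company": (3.4, 7.4),
    "WeirdML": (22.2, 21.4),
}


def lead(gemini_score, llama_score):
    """Return (leader, margin); leader is None when the scores are tied.

    Assumption: higher is better, margin rounded to one decimal place.
    """
    margin = round(abs(gemini_score - llama_score), 1)
    if margin == 0:
        return None, 0.0
    leader = GEMINI if gemini_score > llama_score else LLAMA
    return leader, margin


def tally(scores):
    """Count wins per model, plus ties, across all shared benchmarks."""
    counts = {GEMINI: 0, LLAMA: 0, "tie": 0}
    for gemini_score, llama_score in scores.values():
        leader, _ = lead(gemini_score, llama_score)
        counts[leader if leader is not None else "tie"] += 1
    return counts


if __name__ == "__main__":
    for name, (g, l) in SCORES.items():
        leader, margin = lead(g, l)
        label = "tied" if leader is None else f"{leader} leads by +{margin}"
        print(f"{name}: {label}")
    print(tally(SCORES))
```

Rounding matters here because differences such as 34.5 - 27.8 are not exact in binary floating point.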
Full benchmark table
Benchmark                   Gemini 1.5 Pro (Feb 2024)   Llama 3.1 405B
BBH                         78.7                        77.2
Cybench                     7.5                         7.5
GPQA diamond                27.8                        34.5
MATH level 5                40.8                        49.8
MMLU                        76.9                        79.3
OTIS Mock AIME 2024-2025    6.7                         9.6
SimpleBench                 12.5                        7.6
The Agent Company           3.4                         7.4
WeirdML                     22.2                        21.4
Pricing · per 1M tokens · projected $/mo at 10M tokens
Model                       Input   Output   Context   Projected $/mo
Gemini 1.5 Pro (Feb 2024)
Llama 3.1 405B
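
No prices are listed for either model, so the projected column cannot be filled in from this page. As a rough sketch of how a "projected $/mo at 10M tokens" figure could be derived from per-1M-token prices, the function below assumes a simple split between input and output tokens. The function name, the `input_share` parameter, and the 50/50 default are assumptions rather than the site's formula, and the prices in the usage note are placeholders, not real prices for either model.

```python
def projected_monthly_cost(input_price_per_m, output_price_per_m,
                           monthly_tokens=10_000_000, input_share=0.5):
    """Estimate monthly spend in dollars from per-1M-token prices.

    Assumptions (not from the source): tokens are split between input
    and output by `input_share`, and pricing is linear in token count.
    Returns None when either price is missing, mirroring "no price".
    """
    if input_price_per_m is None or output_price_per_m is None:
        return None
    input_tokens = monthly_tokens * input_share
    output_tokens = monthly_tokens - input_tokens
    return ((input_tokens / 1_000_000) * input_price_per_m
            + (output_tokens / 1_000_000) * output_price_per_m)
```

For example, with placeholder prices of $1.00 input and $3.00 output per 1M tokens, `projected_monthly_cost(1.0, 3.0)` gives 20.0; with a missing price, `projected_monthly_cost(None, None)` gives `None`.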