o1-preview vs Llama 3.1 405B
Side by side · benchmarks, pricing, and signals you can act on.
Winner summary
o1-preview wins 5 of 6 shared benchmarks and leads in coding, math, and reasoning.
Category leads
coding · o1-preview
knowledge · Llama 3.1 405B
math · o1-preview
reasoning · o1-preview
Hype vs Reality
Attention vs performance
o1-preview
#136 by performance · no attention signal
Llama 3.1 405B
#153 by performance · no attention signal
Vendor risk
Who is behind the model
OpenAI
$840.0B · Tier 1
Meta AI
$1.50T · Tier 1
Head to head
6 benchmarks · 2 models
o1-preview · Llama 3.1 405B
Cybench
o1-preview leads by +2.5
Cybench · evaluates AI on real Capture-The-Flag cybersecurity challenges, testing vulnerability analysis, exploitation, and security reasoning.
o1-preview
10.0
Llama 3.1 405B
7.5
GPQA diamond
Llama 3.1 405B leads by +0.7
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
o1-preview
33.8
Llama 3.1 405B
34.5
MATH level 5
o1-preview leads by +31.9
MATH Level 5 · the hardest tier of the MATH benchmark, featuring competition-level problems from AMC, AIME, and Olympiad-style mathematics.
o1-preview
81.7
Llama 3.1 405B
49.8
OTIS Mock AIME 2024-2025
o1-preview leads by +21.4
OTIS Mock AIME 2024-2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.
o1-preview
31.0
Llama 3.1 405B
9.6
SimpleBench
o1-preview leads by +22.4
SimpleBench · tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.
o1-preview
30.0
Llama 3.1 405B
7.6
WeirdML
o1-preview leads by +26.2
WeirdML · tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.
o1-preview
47.6
Llama 3.1 405B
21.4
Full benchmark table
| Benchmark | o1-preview | Llama 3.1 405B |
|---|---|---|
| Cybench | 10.0 | 7.5 |
| GPQA diamond | 33.8 | 34.5 |
| MATH level 5 | 81.7 | 49.8 |
| OTIS Mock AIME 2024-2025 | 31.0 | 9.6 |
| SimpleBench | 30.0 | 7.6 |
| WeirdML | 47.6 | 21.4 |
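
The win count and per-benchmark margins can be recomputed from the table. The following is a minimal sketch, not the site's own method: the function name `head_to_head` and the one-decimal rounding are assumptions made for illustration. Because the inputs are the rounded scores shown, a computed margin can differ in the last digit from a figure derived from unrounded data.

```python
# Scores copied from the benchmark table: (o1-preview, Llama 3.1 405B).
scores = {
    "Cybench": (10.0, 7.5),
    "GPQA diamond": (33.8, 34.5),
    "MATH level 5": (81.7, 49.8),
    "OTIS Mock AIME 2024-2025": (31.0, 9.6),
    "SimpleBench": (30.0, 7.6),
    "WeirdML": (47.6, 21.4),
}


def head_to_head(scores):
    """Return (wins for first model, wins for second model, margins).

    A positive margin means the first model leads on that benchmark.
    Margins are rounded to one decimal place (an assumption of this sketch).
    """
    wins_a = 0
    wins_b = 0
    margins = {}
    for name, (a, b) in scores.items():
        margin = round(a - b, 1)
        margins[name] = margin
        if margin > 0:
            wins_a += 1
        elif margin < 0:
            wins_b += 1
    return wins_a, wins_b, margins


wins_a, wins_b, margins = head_to_head(scores)
print(f"o1-preview wins {wins_a} of {len(scores)}; Llama 3.1 405B wins {wins_b}")
for name, margin in margins.items():
    leader = "o1-preview" if margin > 0 else "Llama 3.1 405B"
    print(f"{name}: {leader} leads by +{abs(margin)}")
```

A tie (margin of zero) counts as a win for neither model in this sketch.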
Pricing · per 1M tokens · projected $/mo at 10M tokens
| Model | Input | Output | Context | Projected $/mo |
|---|---|---|---|---|
| n/a | n/a | n/a | n/a | n/a |
| n/a | n/a | n/a | n/a | n/a |
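
The pricing table does not state how the monthly projection is computed, and no prices are listed for either model. One plausible reading, sketched below, splits the 10M monthly tokens between input and output and multiplies each part by its per-1M price. The function `projected_monthly_cost`, the `input_share` parameter, and the 50/50 default split are all assumptions, and the prices in the usage line are placeholders rather than either model's actual pricing.

```python
def projected_monthly_cost(input_price_per_m, output_price_per_m,
                           monthly_tokens_m=10.0, input_share=0.5):
    """Projected monthly cost in dollars.

    Prices are in dollars per 1M tokens; monthly_tokens_m is the total
    monthly volume in millions of tokens. input_share is the assumed
    fraction of tokens that are input (hypothetical; the page does not
    state a split).
    """
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    input_tokens_m = monthly_tokens_m * input_share
    output_tokens_m = monthly_tokens_m - input_tokens_m
    return (input_tokens_m * input_price_per_m
            + output_tokens_m * output_price_per_m)


# Placeholder prices for illustration only: $1.00 input, $2.00 output per 1M tokens.
print(projected_monthly_cost(1.0, 2.0))
```

Output tokens are typically priced higher than input tokens, so the assumed split can change the projection substantially; it is worth setting `input_share` from real usage where that is known.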