Llama 3.1 405B vs Falcon-180B
Side by side · benchmarks, pricing, and signals you can act on.
Winner summary
Llama 3.1 405B wins 9 of 14 benchmarks
Across the 14 shared benchmarks, Llama 3.1 405B wins 9, with category leads in knowledge and reasoning.
Category leads
knowledge · Llama 3.1 405B
reasoning · Llama 3.1 405B
general · Falcon-180B
language · Falcon-180B
math · Falcon-180B
Hype vs Reality
Attention vs performance
Llama 3.1 405B · #153 by performance · no attention signal
Falcon-180B · #119 by performance · no attention signal
Vendor risk
Who is behind each model
Llama 3.1 405B · Meta AI · $1.50T · Tier 1
Falcon-180B · TII · private · undisclosed
Head to head
14 benchmarks · 2 models
Llama 3.1 405B vs Falcon-180B
ARC AI2 · Llama 3.1 405B leads by +36.7
AI2 Reasoning Challenge · tests grade-school level science knowledge with multiple-choice questions requiring reasoning beyond simple retrieval.
Llama 3.1 405B: 93.7 · Falcon-180B: 57.1

BBH · Llama 3.1 405B leads by +61.1
BIG-Bench Hard · a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average humans.
Llama 3.1 405B: 77.2 · Falcon-180B: 16.1

HellaSwag · Llama 3.1 405B leads by +0.3
HellaSwag · tests commonsense reasoning by asking models to predict the most plausible continuation of everyday scenarios.
Llama 3.1 405B: 85.6 · Falcon-180B: 85.3

BBH (HuggingFace) · Falcon-180B leads by +14.2
Llama 3.1 405B: 7.8 · Falcon-180B: 21.9

GPQA · Llama 3.1 405B leads by +3.1
Llama 3.1 405B: 5.9 · Falcon-180B: 2.8

IFEval · Falcon-180B leads by +14.5
Llama 3.1 405B: 18.1 · Falcon-180B: 32.6

MATH Level 5 · Falcon-180B leads by +2.8
Llama 3.1 405B: 0.0 · Falcon-180B: 2.8

MMLU-PRO · Llama 3.1 405B leads by +10.2
Llama 3.1 405B: 25.7 · Falcon-180B: 15.4

MUSR · Falcon-180B leads by +5.3
Llama 3.1 405B: 2.2 · Falcon-180B: 7.5

MMLU · Llama 3.1 405B leads by +18.5
Massive Multitask Language Understanding · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.
Llama 3.1 405B: 79.3 · Falcon-180B: 60.8

OpenBookQA · Falcon-180B leads by +20.0
OpenBookQA · science questions that require combining a given core fact with broad common knowledge, mimicking an open-book exam setting.
Llama 3.1 405B: 32.3 · Falcon-180B: 52.3

PIQA · Llama 3.1 405B leads by +2.0
PIQA (Physical Interaction QA) · tests intuitive physical reasoning by asking models to select the correct approach for everyday physical tasks.
Llama 3.1 405B: 71.8 · Falcon-180B: 69.8

TriviaQA · Llama 3.1 405B leads by +2.8
TriviaQA · reading comprehension benchmark with trivia questions, requiring models to find and reason over evidence from provided documents.
Llama 3.1 405B: 82.7 · Falcon-180B: 79.9

Winogrande · Llama 3.1 405B leads by +4.2
WinoGrande · large-scale commonsense reasoning benchmark where models must resolve ambiguous pronouns in carefully constructed sentence pairs.
Llama 3.1 405B: 78.4 · Falcon-180B: 74.2
Full benchmark table
| Benchmark | Llama 3.1 405B | Falcon-180B |
|---|---|---|
| ARC AI2 | 93.7 | 57.1 |
| BBH | 77.2 | 16.1 |
| HellaSwag | 85.6 | 85.3 |
| BBH (HuggingFace) | 7.8 | 21.9 |
| GPQA | 5.9 | 2.8 |
| IFEval | 18.1 | 32.6 |
| MATH Level 5 | 0.0 | 2.8 |
| MMLU-PRO | 25.7 | 15.4 |
| MUSR | 2.2 | 7.5 |
| MMLU | 79.3 | 60.8 |
| OpenBookQA | 32.3 | 52.3 |
| PIQA | 71.8 | 69.8 |
| TriviaQA | 82.7 | 79.9 |
| Winogrande | 78.4 | 74.2 |
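To make the "wins 9 of 14" tally easy to verify, here is a minimal sketch that recomputes the head-to-head count from the scores in the table above. The scores are copied from this page; the simple higher-score-wins rule is an assumption about how the summary was derived.

```python
# Minimal sketch: recompute the head-to-head tally from the benchmark table above.
# Scores are copied from this page; the higher-score-wins rule is an assumption
# about how the "wins 9 of 14" summary was derived (a tie would count for neither).

SCORES = {
    # benchmark: (Llama 3.1 405B, Falcon-180B)
    "ARC AI2": (93.7, 57.1),
    "BBH": (77.2, 16.1),
    "HellaSwag": (85.6, 85.3),
    "BBH (HuggingFace)": (7.8, 21.9),
    "GPQA": (5.9, 2.8),
    "IFEval": (18.1, 32.6),
    "MATH Level 5": (0.0, 2.8),
    "MMLU-PRO": (25.7, 15.4),
    "MUSR": (2.2, 7.5),
    "MMLU": (79.3, 60.8),
    "OpenBookQA": (32.3, 52.3),
    "PIQA": (71.8, 69.8),
    "TriviaQA": (82.7, 79.9),
    "Winogrande": (78.4, 74.2),
}

llama_wins = sum(1 for llama, falcon in SCORES.values() if llama > falcon)
falcon_wins = sum(1 for llama, falcon in SCORES.values() if falcon > llama)

print(f"Llama 3.1 405B wins {llama_wins} of {len(SCORES)} shared benchmarks")  # 9 of 14
print(f"Falcon-180B wins {falcon_wins} of {len(SCORES)} shared benchmarks")    # 5 of 14
```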
Pricing · per 1M tokens · projected $/mo at 10M tokens
| Model | Input | Output | Context | Projected $/mo |
|---|---|---|---|---|
| Llama 3.1 405B | — | — | — | — |
| Falcon-180B | — | — | — | — |
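Neither model lists pricing on this page, so the projected $/mo column is empty. For reference, a minimal sketch of how that column would typically be filled in from per-1M-token rates follows; the rates used below are placeholders, and the 50/50 input/output split is an assumption, not something stated on this page.

```python
# Minimal sketch of the "projected $/mo at 10M tokens" column, assuming it simply
# applies per-1M-token input/output rates to a 10M-token monthly volume.
# No pricing is listed for either model on this page, so the example rates are
# placeholders, and the 50/50 input/output split is an assumption.

MONTHLY_TOKENS = 10_000_000  # 10M tokens per month, as in the column header
INPUT_SHARE = 0.5            # assumed split between input and output tokens

def projected_monthly_cost(input_per_1m: float, output_per_1m: float) -> float:
    """Projected monthly spend given $/1M-token input and output rates."""
    input_tokens = MONTHLY_TOKENS * INPUT_SHARE
    output_tokens = MONTHLY_TOKENS * (1 - INPUT_SHARE)
    return (input_tokens / 1_000_000) * input_per_1m + \
           (output_tokens / 1_000_000) * output_per_1m

# Placeholder rates for illustration only (not the models' actual prices):
print(f"${projected_monthly_cost(3.00, 5.00):.2f}/mo")  # 5 * 3.00 + 5 * 5.00 = $40.00/mo
```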