Llama 3.1 405B vs Qwen2.5 72B Instruct
Side by side · benchmarks, pricing, and signals you can act on.
Winner summary
Llama 3.1 405B wins 9 of 17 shared benchmarks, leading the knowledge, reasoning, and agentic categories; Qwen2.5 72B Instruct leads general, language, and math.
Category leads
| Category | Leader |
|---|---|
| knowledge | Llama 3.1 405B |
| reasoning | Llama 3.1 405B |
| general | Qwen2.5 72B Instruct |
| language | Qwen2.5 72B Instruct |
| math | Qwen2.5 72B Instruct |
| agentic | Llama 3.1 405B |
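The 9-of-17 tally can be checked mechanically: count the benchmarks where each model scores higher. A minimal Python sketch, with every name and score copied from the head-to-head table below (nothing else assumed):

```python
# Sanity check of the headline win count; higher score wins a benchmark.
scores = {  # benchmark: (Llama 3.1 405B, Qwen2.5 72B Instruct)
    "ARC AI2": (93.7, 92.7),
    "BBH": (77.2, 73.1),
    "GPQA diamond": (34.5, 32.2),
    "HellaSwag": (85.6, 79.7),
    "BBH (HuggingFace)": (7.8, 61.9),
    "GPQA": (5.9, 16.7),
    "IFEval": (18.1, 86.4),
    "MATH Level 5": (0.0, 59.8),   # the page lists two distinct MATH Level 5 rows;
    "MATH level 5": (49.8, 63.2),  # both are kept here exactly as listed
    "MMLU-PRO": (25.7, 51.4),
    "MUSR": (2.2, 11.7),
    "MMLU": (79.3, 80.4),
    "OTIS Mock AIME 2024-2025": (9.6, 8.0),
    "PIQA": (71.8, 65.2),
    "The Agent Company": (7.4, 5.7),
    "TriviaQA": (82.7, 71.9),
    "Winogrande": (78.4, 64.6),
}
llama_wins = sum(l > q for l, q in scores.values())
print(f"Llama 3.1 405B wins {llama_wins}/{len(scores)} shared benchmarks")  # 9/17
```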
Hype vs Reality
Attention vs performance
| Model | Performance rank | Attention signal |
|---|---|---|
| Llama 3.1 405B | #151 | no signal |
| Qwen2.5 72B Instruct | #80 | no signal |
Best value: Qwen2.5 72B Instruct
| Model | Value | Price |
|---|---|---|
| Llama 3.1 405B | — | no price listed |
| Qwen2.5 72B Instruct | 208.6 pts/$ | $0.26/M |
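How "pts/$" is computed is not stated on this page. A plausible reconstruction, assuming it is the mean score over the 17 shared benchmarks divided by a blended $/M price, lands within a few percent of the listed figure:

```python
# Hedged reconstruction of the "pts/$" value score. The exact formula is
# not published here; mean-score-over-blended-price is an assumption.
qwen_scores = [92.7, 73.1, 32.2, 79.7, 61.9, 16.7, 86.4, 59.8, 51.4,
               11.7, 63.2, 80.4, 8.0, 65.2, 5.7, 71.9, 64.6]  # 17 shared benchmarks
mean_score = sum(qwen_scores) / len(qwen_scores)   # ~54.4
blended_price = (0.12 + 0.39) / 2                  # ~$0.26/M, simple input/output mean
print(round(mean_score / blended_price, 1))        # ~213 vs the listed 208.6, so the
                                                   # site likely weights tokens differently
```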
Vendor risk
Who is behind each model:
| Vendor | Market cap | Risk tier |
|---|---|---|
| Meta AI | $1.50T | Tier 1 |
| Alibaba (Qwen) | $293.0B | Tier 1 |
Head to head
17 benchmarks · 2 models
| Benchmark | Llama 3.1 405B | Qwen2.5 72B Instruct | Lead |
|---|---|---|---|
| ARC AI2 · AI2 Reasoning Challenge: grade-school science knowledge, multiple-choice questions requiring reasoning beyond simple retrieval | 93.7 | 92.7 | Llama 3.1 405B +1.1 |
| BBH · BIG-Bench Hard: a curated subset of 23 challenging BIG-Bench tasks where language models previously failed to outperform average humans | 77.2 | 73.1 | Llama 3.1 405B +4.1 |
| GPQA diamond · Graduate-Level Google-Proof QA (Diamond set): expert-crafted physics, biology, and chemistry questions that are difficult even for domain PhDs | 34.5 | 32.2 | Llama 3.1 405B +2.3 |
| HellaSwag · commonsense reasoning: predict the most plausible continuation of everyday scenarios | 85.6 | 79.7 | Llama 3.1 405B +5.9 |
| BBH (HuggingFace) | 7.8 | 61.9 | Qwen2.5 72B Instruct +54.1 |
| GPQA | 5.9 | 16.7 | Qwen2.5 72B Instruct +10.7 |
| IFEval | 18.1 | 86.4 | Qwen2.5 72B Instruct +68.2 |
| MATH Level 5 | 0.0 | 59.8 | Qwen2.5 72B Instruct +59.8 |
| MMLU-PRO | 25.7 | 51.4 | Qwen2.5 72B Instruct +25.7 |
| MUSR | 2.2 | 11.7 | Qwen2.5 72B Instruct +9.6 |
| MATH level 5 · the hardest tier of the MATH benchmark: competition-level problems from AMC, AIME, and Olympiad-style mathematics | 49.8 | 63.2 | Qwen2.5 72B Instruct +13.4 |
| MMLU · Massive Multitask Language Understanding: 57 subjects spanning STEM, humanities, social sciences, and more; the standard benchmark for broad knowledge | 79.3 | 80.4 | Qwen2.5 72B Instruct +1.1 |
| OTIS Mock AIME 2024–2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving | 9.6 | 8.0 | Llama 3.1 405B +1.7 |
| PIQA · Physical Interaction QA: intuitive physical reasoning, selecting the correct approach for everyday physical tasks | 71.8 | 65.2 | Llama 3.1 405B +6.6 |
| The Agent Company · realistic corporate agent tasks: email management, code review, data analysis, and cross-tool workflows | 7.4 | 5.7 | Llama 3.1 405B +1.7 |
| TriviaQA · reading comprehension over trivia questions, requiring models to find and reason over evidence in provided documents | 82.7 | 71.9 | Llama 3.1 405B +10.8 |
| Winogrande · WinoGrande: large-scale commonsense reasoning, resolving ambiguous pronouns in carefully constructed sentence pairs | 78.4 | 64.6 | Llama 3.1 405B +13.8 |
Pricing · per 1M tokens · projected $/mo at 10M tokens
| Model | Input | Output | Context | Projected $/mo |
|---|---|---|---|---|
| Llama 3.1 405B | — | — | — | — |
| Qwen2.5 72B Instruct | $0.12 | $0.39 | 33K tokens | $1.88 |
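The projected monthly figure follows from the per-token prices once an input:output split is fixed. The split is not stated on this page; a 3:1 assumption happens to reproduce the listed $1.88:

```python
# Hedged sketch of the "projected $/mo at 10M tokens" column.
# The 3:1 input:output split is an assumption, not a documented parameter.
input_price, output_price = 0.12, 0.39   # $/M tokens, Qwen2.5 72B Instruct
input_mtok, output_mtok = 7.5, 2.5       # 10M tokens/month, assumed 3:1 split
monthly = input_mtok * input_price + output_mtok * output_price
print(f"${monthly:.2f}")                 # $1.88
```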