Llama 3.1 70B Instruct vs Qwen2.5 72B Instruct
Side by side · benchmarks, pricing, and signals you can act on.
Winner summary
Qwen2.5 72B Instruct wins 11 of 15 shared benchmarks, leading in the coding, arena, and knowledge categories.
Category leads
coding · Qwen2.5 72B Instruct
arena · Qwen2.5 72B Instruct
knowledge · Qwen2.5 72B Instruct
general · Qwen2.5 72B Instruct
language · Llama 3.1 70B Instruct
math · Qwen2.5 72B Instruct
reasoning · Llama 3.1 70B Instruct
agentic · Llama 3.1 70B Instruct
Hype vs Reality
Attention vs performance
Llama 3.1 70B Instruct · #152 by performance · no attention signal
Qwen2.5 72B Instruct · #80 by performance · no attention signal
Best value
Qwen2.5 72B Instruct · 2.2x better value than Llama 3.1 70B Instruct
Llama 3.1 70B Instruct · 94.5 pts/$ · $0.40/M
Qwen2.5 72B Instruct · 208.6 pts/$ · $0.26/M
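The page does not spell out how the pts/$ figure is derived. A minimal sketch is below, assuming value = an aggregate benchmark score divided by a 50/50 blend of input and output price per 1M tokens; the aggregate scores (37.8 and 53.2) are back-solved from the pts/$ and price figures shown above and are purely illustrative, not the site's actual aggregation.

```python
# Sketch of a pts/$ "value" calculation. The blending rule and the aggregate
# scores below are assumptions chosen to reproduce the figures on this page.

def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Assumed 50/50 blend of input and output price per 1M tokens."""
    return (input_per_m + output_per_m) / 2

def points_per_dollar(score: float, input_per_m: float, output_per_m: float) -> float:
    """Value = aggregate benchmark score divided by blended $ per 1M tokens."""
    return score / blended_price(input_per_m, output_per_m)

# Hypothetical aggregate scores, back-solved for illustration only.
print(round(points_per_dollar(37.8, 0.40, 0.40), 1))  # Llama 3.1 70B Instruct -> 94.5
print(round(points_per_dollar(53.2, 0.12, 0.39), 1))  # Qwen2.5 72B Instruct  -> 208.6
```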
Vendor risk
Who is behind the model
Meta AI · $1.50T · Tier 1
Alibaba (Qwen) · $293.0B · Tier 1
Head to head
15 benchmarks · 2 models
Aider · Code Editing
Qwen2.5 72B Instruct leads by +6.8
Llama 3.1 70B Instruct 58.6 · Qwen2.5 72B Instruct 65.4
Chatbot Arena Elo · Overall
Qwen2.5 72B Instruct leads by +9.5
Llama 3.1 70B Instruct 1292.8 · Qwen2.5 72B Instruct 1302.3
Balrog
Llama 3.1 70B Instruct leads by +11.7
Balrog · benchmarks AI agents on text-based adventure games, testing language understanding, strategic planning, and long-horizon reasoning.
Llama 3.1 70B Instruct 27.9 · Qwen2.5 72B Instruct 16.2
CMMLU
Qwen2.5 72B Instruct leads by +21.3
Llama 3.1 70B Instruct 64.4 · Qwen2.5 72B Instruct 85.7
GPQA diamond
Qwen2.5 72B Instruct leads by +6.6
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
Llama 3.1 70B Instruct 25.6 · Qwen2.5 72B Instruct 32.2
BBH (HuggingFace)
Qwen2.5 72B Instruct leads by +6.0
Llama 3.1 70B Instruct 55.9 · Qwen2.5 72B Instruct 61.9
GPQA
Qwen2.5 72B Instruct leads by +2.5
Llama 3.1 70B Instruct 14.2 · Qwen2.5 72B Instruct 16.7
IFEval
Llama 3.1 70B Instruct leads by +0.3
Llama 3.1 70B Instruct 86.7 · Qwen2.5 72B Instruct 86.4
MATH Level 5
Qwen2.5 72B Instruct leads by +21.7
Llama 3.1 70B Instruct 38.1 · Qwen2.5 72B Instruct 59.8
MMLU-PRO
Qwen2.5 72B Instruct leads by +3.5
Llama 3.1 70B Instruct 47.9 · Qwen2.5 72B Instruct 51.4
MUSR
Llama 3.1 70B Instruct leads by +6.0
Llama 3.1 70B Instruct 17.7 · Qwen2.5 72B Instruct 11.7
MATH level 5
Qwen2.5 72B Instruct leads by +26.5
MATH Level 5 · the hardest tier of the MATH benchmark, featuring competition-level problems from AMC, AIME, and Olympiad-style mathematics.
Llama 3.1 70B Instruct 36.7 · Qwen2.5 72B Instruct 63.2
MMLU
Qwen2.5 72B Instruct leads by +6.9
Massive Multitask Language Understanding · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.
Llama 3.1 70B Instruct 73.5 · Qwen2.5 72B Instruct 80.4
OTIS Mock AIME 2024-2025
Qwen2.5 72B Instruct leads by +4.5
OTIS Mock AIME 2024–2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.
Llama 3.1 70B Instruct 3.5 · Qwen2.5 72B Instruct 8.0
The Agent Company
Llama 3.1 70B Instruct leads by +1.2
The Agent Company · tests AI agents on realistic corporate tasks like email management, code review, data analysis, and cross-tool workflows.
Llama 3.1 70B Instruct 6.9 · Qwen2.5 72B Instruct 5.7
Full benchmark table
| Benchmark | Llama 3.1 70B Instruct | Qwen2.5 72B Instruct |
|---|---|---|
| Aider · Code Editing | 58.6 | 65.4 |
| Chatbot Arena Elo · Overall | 1292.8 | 1302.3 |
| Balrog | 27.9 | 16.2 |
| CMMLU | 64.4 | 85.7 |
| GPQA diamond | 25.6 | 32.2 |
| BBH (HuggingFace) | 55.9 | 61.9 |
| GPQA | 14.2 | 16.7 |
| IFEval | 86.7 | 86.4 |
| MATH Level 5 | 38.1 | 59.8 |
| MMLU-PRO | 47.9 | 51.4 |
| MUSR | 17.7 | 11.7 |
| MATH level 5 | 36.7 | 63.2 |
| MMLU | 73.5 | 80.4 |
| OTIS Mock AIME 2024-2025 | 3.5 | 8.0 |
| The Agent Company | 6.9 | 5.7 |
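For readers who want to re-derive the "11 of 15" tally, the sketch below recomputes per-benchmark leaders from the table. The scores are copied from this page; the win rule (higher score wins, ties ignored) is an assumption about how the summary is computed.

```python
# Recompute head-to-head leaders and the overall win count from the table above.
# Scores are (Llama 3.1 70B Instruct, Qwen2.5 72B Instruct).
scores = {
    "Aider · Code Editing":        (58.6, 65.4),
    "Chatbot Arena Elo · Overall": (1292.8, 1302.3),
    "Balrog":                      (27.9, 16.2),
    "CMMLU":                       (64.4, 85.7),
    "GPQA diamond":                (25.6, 32.2),
    "BBH (HuggingFace)":           (55.9, 61.9),
    "GPQA":                        (14.2, 16.7),
    "IFEval":                      (86.7, 86.4),
    "MATH Level 5":                (38.1, 59.8),
    "MMLU-PRO":                    (47.9, 51.4),
    "MUSR":                        (17.7, 11.7),
    "MATH level 5":                (36.7, 63.2),
    "MMLU":                        (73.5, 80.4),
    "OTIS Mock AIME 2024-2025":    (3.5, 8.0),
    "The Agent Company":           (6.9, 5.7),
}

qwen_wins = sum(qwen > llama for llama, qwen in scores.values())
print(f"Qwen2.5 72B Instruct wins {qwen_wins} of {len(scores)} benchmarks")  # 11 of 15

for name, (llama, qwen) in scores.items():
    leader = "Qwen2.5 72B Instruct" if qwen > llama else "Llama 3.1 70B Instruct"
    print(f"{name}: {leader} leads by {abs(qwen - llama):+.1f}")
```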
Pricing · per 1M tokens · projected $/mo at 10M tokens
| Model | Input | Output | Context | Projected $/mo |
|---|---|---|---|---|
| Llama 3.1 70B Instruct | $0.40 | $0.40 | 131K tokens (~66 books) | $4.00 |
| Qwen2.5 72B Instruct | $0.12 | $0.39 | 33K tokens (~16 books) | $1.88 |
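The projected monthly figures imply an input/output token split that the page does not state. The sketch below assumes a hypothetical 75% input / 25% output mix at 10M tokens per month, which happens to reproduce both $4.00 and $1.88; the actual split used by the page may differ.

```python
# Sketch of the projected-$/mo column under an assumed 75/25 input/output split.
def monthly_cost(input_per_m: float, output_per_m: float,
                 tokens_m: float = 10.0, input_share: float = 0.75) -> float:
    """Projected monthly spend for tokens_m million tokens per month."""
    input_tokens = tokens_m * input_share          # millions of input tokens
    output_tokens = tokens_m * (1 - input_share)   # millions of output tokens
    return input_tokens * input_per_m + output_tokens * output_per_m

print(round(monthly_cost(0.40, 0.40), 2))  # Llama 3.1 70B Instruct -> 4.00
print(round(monthly_cost(0.12, 0.39), 2))  # Qwen2.5 72B Instruct  -> 1.88
```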