
DeepSeek V3 vs Llama 3.1 405B vs Qwen2.5 72B Instruct

Side by side · benchmarks, pricing, and signals you can act on.

Winner summary

DeepSeek V3 wins 9 of the 20 shared benchmarks, leading in knowledge, reasoning, and math.
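For readers who want to reproduce the tally, here is a minimal sketch of how such a head-to-head count can be computed: for each benchmark, only the models with a reported score are compared, and the highest scorer is credited with the win. The score excerpt below is copied from the head-to-head section further down; the site's exact tie-handling and rounding rules are not published, so treat this as an illustration rather than the page's actual scoring code.

```python
# Illustrative tally of per-benchmark wins (not the site's actual scoring code).
# Scores are an excerpt of the head-to-head section below; absent = not reported.
from collections import Counter

scores = {
    "BBH":          {"DeepSeek V3": 83.3, "Llama 3.1 405B": 77.2, "Qwen2.5 72B Instruct": 73.1},
    "GPQA diamond": {"DeepSeek V3": 42.0, "Llama 3.1 405B": 34.5, "Qwen2.5 72B Instruct": 32.2},
    "HellaSwag":    {"DeepSeek V3": 85.2, "Llama 3.1 405B": 85.6, "Qwen2.5 72B Instruct": 79.7},
    "MMLU-PRO":     {"Llama 3.1 405B": 25.7, "Qwen2.5 72B Instruct": 51.4},
}

wins = Counter()
for bench, results in scores.items():
    # Leader = highest reported score; margin = gap to the runner-up.
    leader, best = max(results.items(), key=lambda kv: kv[1])
    runner_up = sorted(results.values())[-2]
    wins[leader] += 1
    print(f"{bench}: {leader} leads by +{best - runner_up:.1f}")

print(dict(wins))  # e.g. {'DeepSeek V3': 2, 'Llama 3.1 405B': 1, 'Qwen2.5 72B Instruct': 1}
```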

Category leads
knowledge · DeepSeek V3
reasoning · DeepSeek V3
math · DeepSeek V3
arena · DeepSeek V3
general · Qwen2.5 72B Instruct
language · Qwen2.5 72B Instruct
agentic · Llama 3.1 405B
coding · DeepSeek V3
Hype vs Reality
DeepSeek V3 · #45 by perf · no signal · QUIET
Llama 3.1 405B · #153 by perf · no signal · QUIET
Qwen2.5 72B Instruct · #82 by perf · no signal · QUIET
Best value
Qwen2.5 72B Instruct offers roughly 1.4x better value than DeepSeek V3 (arithmetic sketched below).
DeepSeek V3 · 97.5 pts/$ · $0.60/M
Llama 3.1 405B · no price
Qwen2.5 72B Instruct · 140.0 pts/$ · $0.38/M
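A small sketch of the arithmetic behind these value figures: the blended $/M shown here matches the simple average of each model's input and output prices from the pricing table at the bottom of the page, and the "1.4x better value" multiple is the ratio of the two pts/$ figures. The aggregate "pts" score itself is the site's own and is not reproduced here, and the averaging rule is an inference from the displayed numbers, not a documented formula.

```python
# Sketch of the value arithmetic (assumptions noted inline; not the site's code).

def blended_price(input_per_m: float, output_per_m: float) -> float:
    # Assumption: the $/M shown under "Best value" is the simple average of
    # input and output prices; (0.32 + 0.89) / 2 ≈ 0.60 and (0.36 + 0.40) / 2 = 0.38
    # match the figures above.
    return (input_per_m + output_per_m) / 2

deepseek = {"pts_per_dollar": 97.5, "blended": blended_price(0.32, 0.89)}
qwen     = {"pts_per_dollar": 140.0, "blended": blended_price(0.36, 0.40)}

# "1.4x better value" is the ratio of the two pts/$ figures.
ratio = qwen["pts_per_dollar"] / deepseek["pts_per_dollar"]
print(f"Qwen2.5 72B Instruct vs DeepSeek V3: {ratio:.1f}x better value")
print(f"Blended $/M: DeepSeek V3 ${deepseek['blended']:.2f}, Qwen ${qwen['blended']:.2f}")
```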
Vendor risk
One or more vendors flagged
DeepSeek · $3.4B · Tier 1 · Higher risk
Meta AI · $1.50T · Tier 1 · Low risk
Alibaba (Qwen) · $293.0B · Tier 1 · Low risk
Head to head
DeepSeek V3 · Llama 3.1 405B · Qwen2.5 72B Instruct
ARC AI2
AI2 Reasoning Challenge · tests grade-school level science knowledge with multiple-choice questions requiring reasoning beyond simple retrieval.
DeepSeek V3 93.7 · Llama 3.1 405B 93.7 · Qwen2.5 72B Instruct 92.7
BBH
DeepSeek V3 leads by +6.1
BIG-Bench Hard · a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average humans.
DeepSeek V3 83.3 · Llama 3.1 405B 77.2 · Qwen2.5 72B Instruct 73.1
GPQA diamond
DeepSeek V3 leads by +7.5
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
DeepSeek V3 42.0 · Llama 3.1 405B 34.5 · Qwen2.5 72B Instruct 32.2
HellaSwag
Llama 3.1 405B leads by +0.4
HellaSwag · tests commonsense reasoning by asking models to predict the most plausible continuation of everyday scenarios.
DeepSeek V3 85.2 · Llama 3.1 405B 85.6 · Qwen2.5 72B Instruct 79.7
MATH level 5
DeepSeek V3 leads by +1.7
MATH Level 5 · the hardest tier of the MATH benchmark, featuring competition-level problems from AMC, AIME, and Olympiad-style mathematics.
DeepSeek V3 64.8 · Llama 3.1 405B 49.8 · Qwen2.5 72B Instruct 63.2
MMLU
DeepSeek V3 leads by +2.5
Massive Multitask Language Understanding · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.
DeepSeek V3 82.9 · Llama 3.1 405B 79.3 · Qwen2.5 72B Instruct 80.4
OTIS Mock AIME 2024-2025
DeepSeek V3 leads by +6.1
OTIS Mock AIME 2024-2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.
DeepSeek V3 15.8 · Llama 3.1 405B 9.6 · Qwen2.5 72B Instruct 8.0
PIQA
Llama 3.1 405B leads by +2.4
PIQA (Physical Interaction QA) · tests intuitive physical reasoning by asking models to select the correct approach for everyday physical tasks.
DeepSeek V3 69.4 · Llama 3.1 405B 71.8 · Qwen2.5 72B Instruct 65.2
TriviaQA
DeepSeek V3 leads by +0.2
TriviaQA · reading comprehension benchmark with trivia questions, requiring models to find and reason over evidence from provided documents.
DeepSeek V3 82.9 · Llama 3.1 405B 82.7 · Qwen2.5 72B Instruct 71.9
Winogrande
Llama 3.1 405B leads by +8.0
WinoGrande · large-scale commonsense reasoning benchmark where models must resolve ambiguous pronouns in carefully constructed sentence pairs.
DeepSeek V3 70.4 · Llama 3.1 405B 78.4 · Qwen2.5 72B Instruct 64.6
Chatbot Arena Elo · Overall
DeepSeek V3 leads by +55.8
DeepSeek V3 1358.2 · Qwen2.5 72B Instruct 1302.3
BBH (HuggingFace)
Qwen2.5 72B Instruct leads by +54.1
Llama 3.1 405B 7.8 · Qwen2.5 72B Instruct 61.9
GPQA
Qwen2.5 72B Instruct leads by +10.7
Llama 3.1 405B 5.9 · Qwen2.5 72B Instruct 16.7
IFEval
Qwen2.5 72B Instruct leads by +68.2
Llama 3.1 405B 18.1 · Qwen2.5 72B Instruct 86.4
MATH Level 5
Qwen2.5 72B Instruct leads by +59.8
Llama 3.1 405B 0.0 · Qwen2.5 72B Instruct 59.8
MMLU-PRO
Qwen2.5 72B Instruct leads by +25.7
Llama 3.1 405B 25.7 · Qwen2.5 72B Instruct 51.4
MUSR
Qwen2.5 72B Instruct leads by +9.6
Llama 3.1 405B 2.2 · Qwen2.5 72B Instruct 11.7
SimpleBench
Llama 3.1 405B leads by +4.9
SimpleBench · tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.
DeepSeek V3 2.7 · Llama 3.1 405B 7.6
The Agent Company
Llama 3.1 405B leads by +1.7
The Agent Company · tests AI agents on realistic corporate tasks like email management, code review, data analysis, and cross-tool workflows.
Llama 3.1 405B 7.4 · Qwen2.5 72B Instruct 5.7
WeirdML
DeepSeek V3 leads by +14.7
WeirdML · tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.
DeepSeek V3 36.1 · Llama 3.1 405B 21.4
Full benchmark table
Benchmark: DeepSeek V3 / Llama 3.1 405B / Qwen2.5 72B Instruct
ARC AI2: 93.7 / 93.7 / 92.7
BBH: 83.3 / 77.2 / 73.1
GPQA diamond: 42.0 / 34.5 / 32.2
HellaSwag: 85.2 / 85.6 / 79.7
MATH level 5: 64.8 / 49.8 / 63.2
MMLU: 82.9 / 79.3 / 80.4
OTIS Mock AIME 2024-2025: 15.8 / 9.6 / 8.0
PIQA: 69.4 / 71.8 / 65.2
TriviaQA: 82.9 / 82.7 / 71.9
Winogrande: 70.4 / 78.4 / 64.6
Chatbot Arena Elo · Overall: 1358.2 / — / 1302.3
BBH (HuggingFace): — / 7.8 / 61.9
GPQA: — / 5.9 / 16.7
IFEval: — / 18.1 / 86.4
MATH Level 5: — / 0.0 / 59.8
MMLU-PRO: — / 25.7 / 51.4
MUSR: — / 2.2 / 11.7
SimpleBench: 2.7 / 7.6 / —
The Agent Company: — / 7.4 / 5.7
WeirdML: 36.1 / 21.4 / —
Pricing · per 1M tokens · projected $/mo at 10M tokens
Model: Input / Output / Context / Projected $/mo
DeepSeek V3: $0.32 / $0.89 / 164K tokens (~82 books) / $4.63
Llama 3.1 405B: no price listed
Qwen2.5 72B Instruct: $0.36 / $0.40 / 33K tokens (~16 books) / $3.70
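As a closing sketch, the projected monthly figures can be reproduced from the per-token prices above. The input/output split behind the projection is not stated on the page; a 75% input / 25% output mix over the 10M monthly tokens matches the listed $4.63 and $3.70 almost exactly, so that split is used below purely as an assumption.

```python
# Sketch: projected monthly cost at 10M tokens/month (assumed 75% input / 25% output).
MONTHLY_TOKENS_M = 10.0  # 10M tokens per month, expressed in millions
INPUT_SHARE = 0.75       # assumption: split is not stated on the page

def projected_monthly_cost(input_per_m: float, output_per_m: float) -> float:
    input_m = MONTHLY_TOKENS_M * INPUT_SHARE
    output_m = MONTHLY_TOKENS_M * (1 - INPUT_SHARE)
    return input_m * input_per_m + output_m * output_per_m

# DeepSeek V3: 7.5 * $0.32 + 2.5 * $0.89 = $4.625  (listed as $4.63)
# Qwen2.5 72B Instruct: 7.5 * $0.36 + 2.5 * $0.40 = $3.70
for name, inp, out in [("DeepSeek V3", 0.32, 0.89), ("Qwen2.5 72B Instruct", 0.36, 0.40)]:
    print(f"{name}: ${projected_monthly_cost(inp, out):.3f}/mo")
```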