
DeepSeek V3 vs Llama 3.1 405B vs Qwen2.5 72B Instruct

Side by side · benchmarks, pricing, and signals you can act on.

Winner summary

DeepSeek V3 wins 9 of the 20 shared benchmarks, leading in knowledge, reasoning, and math.
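For readers who want to reproduce the tally, here is a minimal sketch of how such a head-to-head count can be computed: for each benchmark, only the models with a reported score are compared, and the highest scorer is credited with the win. The score excerpt below is copied from the head-to-head section further down; the site's exact tie-handling and rounding rules are not published, so treat this as an illustration rather than the page's actual scoring code.

```python
# Illustrative tally of per-benchmark wins (not the site's actual scoring code).
# Scores are an excerpt of the head-to-head section below; absent = not reported.
from collections import Counter

scores = {
    "BBH":          {"DeepSeek V3": 83.3, "Llama 3.1 405B": 77.2, "Qwen2.5 72B Instruct": 73.1},
    "GPQA diamond": {"DeepSeek V3": 42.0, "Llama 3.1 405B": 34.5, "Qwen2.5 72B Instruct": 32.2},
    "HellaSwag":    {"DeepSeek V3": 85.2, "Llama 3.1 405B": 85.6, "Qwen2.5 72B Instruct": 79.7},
    "MMLU-PRO":     {"Llama 3.1 405B": 25.7, "Qwen2.5 72B Instruct": 51.4},
}

wins = Counter()
for bench, results in scores.items():
    # Leader = highest reported score; margin = gap to the runner-up.
    leader, best = max(results.items(), key=lambda kv: kv[1])
    runner_up = sorted(results.values())[-2]
    wins[leader] += 1
    print(f"{bench}: {leader} leads by +{best - runner_up:.1f}")

print(dict(wins))  # e.g. {'DeepSeek V3': 2, 'Llama 3.1 405B': 1, 'Qwen2.5 72B Instruct': 1}
```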

Category leads
knowledge · DeepSeek V3
reasoning · DeepSeek V3
math · DeepSeek V3
arena · DeepSeek V3
general · Qwen2.5 72B Instruct
language · Qwen2.5 72B Instruct
agentic · Llama 3.1 405B
coding · DeepSeek V3
Hype vs Reality
DeepSeek V3 · #45 by perf · no signal · QUIET
Llama 3.1 405B · #153 by perf · no signal · QUIET
Qwen2.5 72B Instruct · #82 by perf · no signal · QUIET
Best value
Qwen2.5 72B Instruct offers roughly 1.4x better value than DeepSeek V3 (arithmetic sketched below).
DeepSeek V3 · 97.5 pts/$ · $0.60/M
Llama 3.1 405B · no price
Qwen2.5 72B Instruct · 140.0 pts/$ · $0.38/M
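A small sketch of the arithmetic behind these value figures: the blended $/M shown here matches the simple average of each model's input and output prices from the pricing table at the bottom of the page, and the "1.4x better value" multiple is the ratio of the two pts/$ figures. The aggregate "pts" score itself is the site's own and is not reproduced here, and the averaging rule is an inference from the displayed numbers, not a documented formula.

```python
# Sketch of the value arithmetic (assumptions noted inline; not the site's code).

def blended_price(input_per_m: float, output_per_m: float) -> float:
    # Assumption: the $/M shown under "Best value" is the simple average of
    # input and output prices; (0.32 + 0.89) / 2 ≈ 0.60 and (0.36 + 0.40) / 2 = 0.38
    # match the figures above.
    return (input_per_m + output_per_m) / 2

deepseek = {"pts_per_dollar": 97.5, "blended": blended_price(0.32, 0.89)}
qwen     = {"pts_per_dollar": 140.0, "blended": blended_price(0.36, 0.40)}

# "1.4x better value" is the ratio of the two pts/$ figures.
ratio = qwen["pts_per_dollar"] / deepseek["pts_per_dollar"]
print(f"Qwen2.5 72B Instruct vs DeepSeek V3: {ratio:.1f}x better value")
print(f"Blended $/M: DeepSeek V3 ${deepseek['blended']:.2f}, Qwen ${qwen['blended']:.2f}")
```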
Vendor risk
One or more vendors flagged
DeepSeek · $3.4B · Tier 1 · Higher risk
Meta AI · $1.50T · Tier 1 · Low risk
Alibaba (Qwen) · $293.0B · Tier 1 · Low risk
Head to head
DeepSeek V3 · Llama 3.1 405B · Qwen2.5 72B Instruct
ARC AI2
AI2 Reasoning Challenge · tests grade-school level science knowledge with multiple-choice questions requiring reasoning beyond simple retrieval.
DeepSeek V3 93.7 · Llama 3.1 405B 93.7 · Qwen2.5 72B Instruct 92.7
BBH
DeepSeek V3 leads by +6.1
BIG-Bench Hard · a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average humans.
DeepSeek V3 83.3 · Llama 3.1 405B 77.2 · Qwen2.5 72B Instruct 73.1
GPQA diamond
DeepSeek V3 leads by +7.5
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
DeepSeek V3 42.0 · Llama 3.1 405B 34.5 · Qwen2.5 72B Instruct 32.2
HellaSwag
Llama 3.1 405B leads by +0.4
HellaSwag · tests commonsense reasoning by asking models to predict the most plausible continuation of everyday scenarios.
DeepSeek V3 85.2 · Llama 3.1 405B 85.6 · Qwen2.5 72B Instruct 79.7
MATH level 5
DeepSeek V3 leads by +1.7
MATH Level 5 · the hardest tier of the MATH benchmark, featuring competition-level problems from AMC, AIME, and Olympiad-style mathematics.
DeepSeek V3 64.8 · Llama 3.1 405B 49.8 · Qwen2.5 72B Instruct 63.2
MMLU
DeepSeek V3 leads by +2.5
Massive Multitask Language Understanding · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.
DeepSeek V3 82.9 · Llama 3.1 405B 79.3 · Qwen2.5 72B Instruct 80.4
OTIS Mock AIME 2024-2025
DeepSeek V3 leads by +6.1
OTIS Mock AIME 2024-2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.
DeepSeek V3 15.8 · Llama 3.1 405B 9.6 · Qwen2.5 72B Instruct 8.0
PIQA
Llama 3.1 405B leads by +2.4
PIQA (Physical Interaction QA) · tests intuitive physical reasoning by asking models to select the correct approach for everyday physical tasks.
DeepSeek V3 69.4 · Llama 3.1 405B 71.8 · Qwen2.5 72B Instruct 65.2
TriviaQA
DeepSeek V3 leads by +0.2
TriviaQA · reading comprehension benchmark with trivia questions, requiring models to find and reason over evidence from provided documents.
DeepSeek V3 82.9 · Llama 3.1 405B 82.7 · Qwen2.5 72B Instruct 71.9
Winogrande
Llama 3.1 405B leads by +8.0
WinoGrande · large-scale commonsense reasoning benchmark where models must resolve ambiguous pronouns in carefully constructed sentence pairs.
DeepSeek V3 70.4 · Llama 3.1 405B 78.4 · Qwen2.5 72B Instruct 64.6
Chatbot Arena Elo · Overall
DeepSeek V3 leads by +55.8
DeepSeek V3 1358.2 · Qwen2.5 72B Instruct 1302.3
BBH (HuggingFace)
Qwen2.5 72B Instruct leads by +54.1
Llama 3.1 405B 7.8 · Qwen2.5 72B Instruct 61.9
GPQA
Qwen2.5 72B Instruct leads by +10.7
Llama 3.1 405B 5.9 · Qwen2.5 72B Instruct 16.7
IFEval
Qwen2.5 72B Instruct leads by +68.2
Llama 3.1 405B 18.1 · Qwen2.5 72B Instruct 86.4
MATH Level 5
Qwen2.5 72B Instruct leads by +59.8
Llama 3.1 405B 0.0 · Qwen2.5 72B Instruct 59.8
MMLU-PRO
Qwen2.5 72B Instruct leads by +25.7
Llama 3.1 405B 25.7 · Qwen2.5 72B Instruct 51.4
MUSR
Qwen2.5 72B Instruct leads by +9.6
Llama 3.1 405B 2.2 · Qwen2.5 72B Instruct 11.7
SimpleBench
Llama 3.1 405B leads by +4.9
SimpleBench · tests fundamental reasoning capabilities with straightforward problems designed to expose gaps in basic logical and spatial thinking.
DeepSeek V3 2.7 · Llama 3.1 405B 7.6
The Agent Company
Llama 3.1 405B leads by +1.7
The Agent Company · tests AI agents on realistic corporate tasks like email management, code review, data analysis, and cross-tool workflows.
Llama 3.1 405B 7.4 · Qwen2.5 72B Instruct 5.7
WeirdML
DeepSeek V3 leads by +14.7
WeirdML · tests models on unusual and adversarial machine learning tasks that require creative problem-solving beyond standard patterns.
DeepSeek V3 36.1 · Llama 3.1 405B 21.4
Full benchmark table
Benchmark: DeepSeek V3 / Llama 3.1 405B / Qwen2.5 72B Instruct
ARC AI2: 93.7 / 93.7 / 92.7
BBH: 83.3 / 77.2 / 73.1
GPQA diamond: 42.0 / 34.5 / 32.2
HellaSwag: 85.2 / 85.6 / 79.7
MATH level 5: 64.8 / 49.8 / 63.2
MMLU: 82.9 / 79.3 / 80.4
OTIS Mock AIME 2024-2025: 15.8 / 9.6 / 8.0
PIQA: 69.4 / 71.8 / 65.2
TriviaQA: 82.9 / 82.7 / 71.9
Winogrande: 70.4 / 78.4 / 64.6
Chatbot Arena Elo · Overall: 1358.2 / — / 1302.3
BBH (HuggingFace): — / 7.8 / 61.9
GPQA: — / 5.9 / 16.7
IFEval: — / 18.1 / 86.4
MATH Level 5: — / 0.0 / 59.8
MMLU-PRO: — / 25.7 / 51.4
MUSR: — / 2.2 / 11.7
SimpleBench: 2.7 / 7.6 / —
The Agent Company: — / 7.4 / 5.7
WeirdML: 36.1 / 21.4 / —
Pricing · per 1M tokens · projected $/mo at 10M tokens
Model: Input / Output / Context / Projected $/mo
DeepSeek V3: $0.32 / $0.89 / 164K tokens (~82 books) / $4.63
Llama 3.1 405B: no price listed
Qwen2.5 72B Instruct: $0.36 / $0.40 / 33K tokens (~16 books) / $3.70
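As a closing sketch, the projected monthly figures can be reproduced from the per-token prices above. The input/output split behind the projection is not stated on the page; a 75% input / 25% output mix over the 10M monthly tokens matches the listed $4.63 and $3.70 almost exactly, so that split is used below purely as an assumption.

```python
# Sketch: projected monthly cost at 10M tokens/month (assumed 75% input / 25% output).
MONTHLY_TOKENS_M = 10.0  # 10M tokens per month, expressed in millions
INPUT_SHARE = 0.75       # assumption: split is not stated on the page

def projected_monthly_cost(input_per_m: float, output_per_m: float) -> float:
    input_m = MONTHLY_TOKENS_M * INPUT_SHARE
    output_m = MONTHLY_TOKENS_M * (1 - INPUT_SHARE)
    return input_m * input_per_m + output_m * output_per_m

# DeepSeek V3: 7.5 * $0.32 + 2.5 * $0.89 = $4.625  (listed as $4.63)
# Qwen2.5 72B Instruct: 7.5 * $0.36 + 2.5 * $0.40 = $3.70
for name, inp, out in [("DeepSeek V3", 0.32, 0.89), ("Qwen2.5 72B Instruct", 0.36, 0.40)]:
    print(f"{name}: ${projected_monthly_cost(inp, out):.3f}/mo")
```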