Llama 3.1 70B Instruct vs Qwen2.5 72B Instruct
Side by side · benchmarks, pricing, and signals you can act on.
Winner summary
Qwen2.5 72B Instruct wins 11 of 15 shared benchmarks, leading in the coding, arena, and knowledge categories.
Category leads
coding · Qwen2.5 72B Instruct
arena · Qwen2.5 72B Instruct
knowledge · Qwen2.5 72B Instruct
general · Qwen2.5 72B Instruct
language · Llama 3.1 70B Instruct
math · Qwen2.5 72B Instruct
reasoning · Llama 3.1 70B Instruct
agentic · Llama 3.1 70B Instruct
Hype vs Reality
Attention vs performance
Llama 3.1 70B Instruct · #152 by performance · no attention signal
Qwen2.5 72B Instruct · #80 by performance · no attention signal
Best value
Qwen2.5 72B Instruct · 2.2x better value than Llama 3.1 70B Instruct
Llama 3.1 70B Instruct · 94.5 pts/$ · $0.40/M
Qwen2.5 72B Instruct · 208.6 pts/$ · $0.26/M
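The page does not spell out how the pts/$ figure is derived. A minimal sketch is below, assuming value = an aggregate benchmark score divided by a 50/50 blend of input and output price per 1M tokens; the aggregate scores (37.8 and 53.2) are back-solved from the pts/$ and price figures shown above and are purely illustrative, not the site's actual aggregation.

```python
# Sketch of a pts/$ "value" calculation. The blending rule and the aggregate
# scores below are assumptions chosen to reproduce the figures on this page.

def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Assumed 50/50 blend of input and output price per 1M tokens."""
    return (input_per_m + output_per_m) / 2

def points_per_dollar(score: float, input_per_m: float, output_per_m: float) -> float:
    """Value = aggregate benchmark score divided by blended $ per 1M tokens."""
    return score / blended_price(input_per_m, output_per_m)

# Hypothetical aggregate scores, back-solved for illustration only.
print(round(points_per_dollar(37.8, 0.40, 0.40), 1))  # Llama 3.1 70B Instruct -> 94.5
print(round(points_per_dollar(53.2, 0.12, 0.39), 1))  # Qwen2.5 72B Instruct  -> 208.6
```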
Vendor risk
Who is behind the model
Meta AI · $1.50T · Tier 1
Alibaba (Qwen) · $293.0B · Tier 1
Head to head
15 benchmarks · 2 models
Aider · Code Editing
Qwen2.5 72B Instruct leads by +6.8
Llama 3.1 70B Instruct 58.6 · Qwen2.5 72B Instruct 65.4
Chatbot Arena Elo · Overall
Qwen2.5 72B Instruct leads by +9.5
Llama 3.1 70B Instruct 1292.8 · Qwen2.5 72B Instruct 1302.3
Balrog
Llama 3.1 70B Instruct leads by +11.7
Balrog · benchmarks AI agents on text-based adventure games, testing language understanding, strategic planning, and long-horizon reasoning.
Llama 3.1 70B Instruct 27.9 · Qwen2.5 72B Instruct 16.2
CMMLU
Qwen2.5 72B Instruct leads by +21.3
Llama 3.1 70B Instruct 64.4 · Qwen2.5 72B Instruct 85.7
GPQA diamond
Qwen2.5 72B Instruct leads by +6.6
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
Llama 3.1 70B Instruct 25.6 · Qwen2.5 72B Instruct 32.2
BBH (HuggingFace)
Qwen2.5 72B Instruct leads by +6.0
Llama 3.1 70B Instruct 55.9 · Qwen2.5 72B Instruct 61.9
GPQA
Qwen2.5 72B Instruct leads by +2.5
Llama 3.1 70B Instruct 14.2 · Qwen2.5 72B Instruct 16.7
IFEval
Llama 3.1 70B Instruct leads by +0.3
Llama 3.1 70B Instruct 86.7 · Qwen2.5 72B Instruct 86.4
MATH Level 5
Qwen2.5 72B Instruct leads by +21.7
Llama 3.1 70B Instruct 38.1 · Qwen2.5 72B Instruct 59.8
MMLU-PRO
Qwen2.5 72B Instruct leads by +3.5
Llama 3.1 70B Instruct 47.9 · Qwen2.5 72B Instruct 51.4
MUSR
Llama 3.1 70B Instruct leads by +6.0
Llama 3.1 70B Instruct 17.7 · Qwen2.5 72B Instruct 11.7
MATH level 5
Qwen2.5 72B Instruct leads by +26.5
MATH Level 5 · the hardest tier of the MATH benchmark, featuring competition-level problems from AMC, AIME, and Olympiad-style mathematics.
Llama 3.1 70B Instruct 36.7 · Qwen2.5 72B Instruct 63.2
MMLU
Qwen2.5 72B Instruct leads by +6.9
Massive Multitask Language Understanding · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.
Llama 3.1 70B Instruct 73.5 · Qwen2.5 72B Instruct 80.4
OTIS Mock AIME 2024-2025
Qwen2.5 72B Instruct leads by +4.5
OTIS Mock AIME 2024–2025 · simulated American Invitational Mathematics Examination problems testing advanced problem-solving skills.
Llama 3.1 70B Instruct 3.5 · Qwen2.5 72B Instruct 8.0
The Agent Company
Llama 3.1 70B Instruct leads by +1.2
The Agent Company · tests AI agents on realistic corporate tasks like email management, code review, data analysis, and cross-tool workflows.
Llama 3.1 70B Instruct 6.9 · Qwen2.5 72B Instruct 5.7
Full benchmark table
| Benchmark | Llama 3.1 70B Instruct | Qwen2.5 72B Instruct |
|---|---|---|
| Aider · Code Editing | 58.6 | 65.4 |
| Chatbot Arena Elo · Overall | 1292.8 | 1302.3 |
| Balrog | 27.9 | 16.2 |
| CMMLU | 64.4 | 85.7 |
| GPQA diamond | 25.6 | 32.2 |
| BBH (HuggingFace) | 55.9 | 61.9 |
| GPQA | 14.2 | 16.7 |
| IFEval | 86.7 | 86.4 |
| MATH Level 5 | 38.1 | 59.8 |
| MMLU-PRO | 47.9 | 51.4 |
| MUSR | 17.7 | 11.7 |
| MATH level 5 | 36.7 | 63.2 |
| MMLU | 73.5 | 80.4 |
| OTIS Mock AIME 2024-2025 | 3.5 | 8.0 |
| The Agent Company | 6.9 | 5.7 |
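For readers who want to re-derive the "11 of 15" tally, the sketch below recomputes per-benchmark leaders from the table. The scores are copied from this page; the win rule (higher score wins, ties ignored) is an assumption about how the summary is computed.

```python
# Recompute head-to-head leaders and the overall win count from the table above.
# Scores are (Llama 3.1 70B Instruct, Qwen2.5 72B Instruct).
scores = {
    "Aider · Code Editing":        (58.6, 65.4),
    "Chatbot Arena Elo · Overall": (1292.8, 1302.3),
    "Balrog":                      (27.9, 16.2),
    "CMMLU":                       (64.4, 85.7),
    "GPQA diamond":                (25.6, 32.2),
    "BBH (HuggingFace)":           (55.9, 61.9),
    "GPQA":                        (14.2, 16.7),
    "IFEval":                      (86.7, 86.4),
    "MATH Level 5":                (38.1, 59.8),
    "MMLU-PRO":                    (47.9, 51.4),
    "MUSR":                        (17.7, 11.7),
    "MATH level 5":                (36.7, 63.2),
    "MMLU":                        (73.5, 80.4),
    "OTIS Mock AIME 2024-2025":    (3.5, 8.0),
    "The Agent Company":           (6.9, 5.7),
}

qwen_wins = sum(qwen > llama for llama, qwen in scores.values())
print(f"Qwen2.5 72B Instruct wins {qwen_wins} of {len(scores)} benchmarks")  # 11 of 15

for name, (llama, qwen) in scores.items():
    leader = "Qwen2.5 72B Instruct" if qwen > llama else "Llama 3.1 70B Instruct"
    print(f"{name}: {leader} leads by {abs(qwen - llama):+.1f}")
```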
Pricing · per 1M tokens · projected $/mo at 10M tokens
| Model | Input | Output | Context | Projected $/mo |
|---|---|---|---|---|
| Llama 3.1 70B Instruct | $0.40 | $0.40 | 131K tokens (~66 books) | $4.00 |
| Qwen2.5 72B Instruct | $0.12 | $0.39 | 33K tokens (~16 books) | $1.88 |
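The projected monthly figures imply an input/output token split that the page does not state. The sketch below assumes a hypothetical 75% input / 25% output mix at 10M tokens per month, which happens to reproduce both $4.00 and $1.88; the actual split used by the page may differ.

```python
# Sketch of the projected-$/mo column under an assumed 75/25 input/output split.
def monthly_cost(input_per_m: float, output_per_m: float,
                 tokens_m: float = 10.0, input_share: float = 0.75) -> float:
    """Projected monthly spend for tokens_m million tokens per month."""
    input_tokens = tokens_m * input_share          # millions of input tokens
    output_tokens = tokens_m * (1 - input_share)   # millions of output tokens
    return input_tokens * input_per_m + output_tokens * output_per_m

print(round(monthly_cost(0.40, 0.40), 2))  # Llama 3.1 70B Instruct -> 4.00
print(round(monthly_cost(0.12, 0.39), 2))  # Qwen2.5 72B Instruct  -> 1.88
```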