phi-3-medium 14B vs Qwen2.5 72B Instruct
Side by side · benchmarks, pricing, and signals you can act on.
Winner summary
Qwen2.5 72B Instruct wins 6 of 8 shared benchmarks, leading in knowledge and math; phi-3-medium 14B takes the reasoning category.
Category leads
- knowledge · Qwen2.5 72B Instruct
- reasoning · phi-3-medium 14B
- math · Qwen2.5 72B Instruct
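For readers who want to reproduce the tally, here is a minimal Python sketch, assuming a simple highest-score-wins rule over the eight shared benchmarks listed in the head-to-head section below (the scores are copied from that section; the category groupings above are not recomputed here):

```python
# Minimal sketch: tally head-to-head wins from the shared-benchmark scores.
# Scores are (phi-3-medium 14B, Qwen2.5 72B Instruct), copied from this page.
scores = {
    "ARC AI2": (88.8, 92.7),
    "BBH": (75.2, 73.1),
    "GPQA Diamond": (3.5, 32.2),
    "HellaSwag": (76.5, 79.7),
    "MATH Level 5": (17.6, 63.2),
    "MMLU": (70.7, 80.4),
    "TriviaQA": (73.9, 71.9),
    "WinoGrande": (63.0, 64.6),
}
models = ("phi-3-medium 14B", "Qwen2.5 72B Instruct")

wins = [0, 0]
for a, b in scores.values():
    wins[0 if a > b else 1] += 1  # higher score wins the benchmark

print(f"{models[1]} wins {wins[1]} of {len(scores)} shared benchmarks")
# -> Qwen2.5 72B Instruct wins 6 of 8 shared benchmarks
```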
Hype vs Reality
Attention vs performance:
- phi-3-medium 14B · #48 by performance · no attention signal
- Qwen2.5 72B Instruct · #82 by performance · no attention signal
Best value · Qwen2.5 72B Instruct
- phi-3-medium 14B · no price
- Qwen2.5 72B Instruct · 140.0 pts/$ · $0.38/M
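How "140.0 pts/$" is computed is not shown on this page. One plausible reading is aggregate benchmark points divided by blended price per million tokens; the sketch below uses that assumption with the $0.38/M figure above, so it illustrates the shape of the metric rather than reproducing the exact number:

```python
# Hedged sketch: one plausible reading of "pts/$" (aggregate benchmark points
# per blended dollar for 1M tokens). The site's exact formula is not published
# here, so this does not have to reproduce the listed 140.0 pts/$.
qwen_scores = [92.7, 73.1, 32.2, 79.7, 63.2, 80.4, 71.9, 64.6]  # 8 shared benchmarks
avg_pts = sum(qwen_scores) / len(qwen_scores)   # ~69.7 points on average
price_per_m = 0.38                              # blended $/1M tokens from this page

print(round(avg_pts / price_per_m, 1))          # ~183.5 under these assumptions
```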
Vendor risk · who is behind the model
- phi-3-medium 14B · Microsoft · $3.00T · Big Tech
- Qwen2.5 72B Instruct · Alibaba (Qwen) · $293.0B · Tier 1
Head to head
8 benchmarks · 2 models
ARC AI2 · Qwen2.5 72B Instruct leads by +3.9
AI2 Reasoning Challenge · tests grade-school-level science knowledge with multiple-choice questions requiring reasoning beyond simple retrieval.
phi-3-medium 14B: 88.8 · Qwen2.5 72B Instruct: 92.7
BBH · phi-3-medium 14B leads by +2.1
BIG-Bench Hard · a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average humans.
phi-3-medium 14B: 75.2 · Qwen2.5 72B Instruct: 73.1
GPQA Diamond · Qwen2.5 72B Instruct leads by +28.7
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
phi-3-medium 14B: 3.5 · Qwen2.5 72B Instruct: 32.2
HellaSwag · Qwen2.5 72B Instruct leads by +3.2
HellaSwag · tests commonsense reasoning by asking models to predict the most plausible continuation of everyday scenarios.
phi-3-medium 14B: 76.5 · Qwen2.5 72B Instruct: 79.7
MATH Level 5 · Qwen2.5 72B Instruct leads by +45.6
MATH Level 5 · the hardest tier of the MATH benchmark, featuring competition-level problems from AMC, AIME, and Olympiad-style mathematics.
phi-3-medium 14B: 17.6 · Qwen2.5 72B Instruct: 63.2
MMLU · Qwen2.5 72B Instruct leads by +9.7
Massive Multitask Language Understanding · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.
phi-3-medium 14B: 70.7 · Qwen2.5 72B Instruct: 80.4
TriviaQA · phi-3-medium 14B leads by +2.0
TriviaQA · reading comprehension benchmark with trivia questions, requiring models to find and reason over evidence from provided documents.
phi-3-medium 14B: 73.9 · Qwen2.5 72B Instruct: 71.9
WinoGrande · Qwen2.5 72B Instruct leads by +1.6
WinoGrande · large-scale commonsense reasoning benchmark where models must resolve ambiguous pronouns in carefully constructed sentence pairs.
phi-3-medium 14B: 63.0 · Qwen2.5 72B Instruct: 64.6
Full benchmark table
| Benchmark | phi-3-medium 14B | Qwen2.5 72B Instruct |
|---|---|---|
| ARC AI2 | 88.8 | 92.7 |
| BBH | 75.2 | 73.1 |
| GPQA Diamond | 3.5 | 32.2 |
| HellaSwag | 76.5 | 79.7 |
| MATH Level 5 | 17.6 | 63.2 |
| MMLU | 70.7 | 80.4 |
| TriviaQA | 73.9 | 71.9 |
| WinoGrande | 63.0 | 64.6 |
Pricing · per 1M tokens · projected $/mo at 10M tokens
| Model | Input | Output | Context | Projected $/mo |
|---|---|---|---|---|
| phi-3-medium 14B | — | — | — | — |
| Qwen2.5 72B Instruct | $0.36 | $0.40 | 33K tokens (~16 books) | $3.70 |
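The projected monthly figure is consistent with blended price × 10M tokens. The page does not state its input/output token mix; a 75/25 split is an assumption that happens to reproduce the listed $3.70 (0.75 × $0.36 + 0.25 × $0.40 = $0.37 per 1M tokens, × 10 = $3.70). A minimal sketch under that assumption:

```python
# Hedged sketch: reconstruct the "projected $/mo at 10M tokens" column.
# The page does not state its input/output mix; a 75/25 split is assumed
# here because it reproduces the listed $3.70 for Qwen2.5 72B Instruct.
def projected_monthly_cost(input_per_m: float, output_per_m: float,
                           tokens_m: float = 10.0,
                           output_share: float = 0.25) -> float:
    """Blended $/1M tokens times monthly token volume in millions."""
    blended = input_per_m * (1 - output_share) + output_per_m * output_share
    return blended * tokens_m

print(f"${projected_monthly_cost(0.36, 0.40):.2f}/mo")  # -> $3.70/mo at 10M tokens
```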