
phi-3-medium 14B vs Mixtral 8x7B Instruct

Side by side · benchmarks, pricing, and signals you can act on.

Winner summary

phi-3-medium 14B wins 6 of 9 shared benchmarks. Leads in knowledge · math.

Category leads
Knowledge · phi-3-medium 14B
Math · phi-3-medium 14B
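For anyone who wants to reproduce the tally, here is a minimal sketch in Python using the scores from the tables below. The code and names are illustrative, not this site's implementation:

```python
# Count head-to-head wins across the nine shared benchmarks.
# Scores transcribed from the head-to-head section of this page.
phi3 = {"ANLI": 33.7, "ARC AI2": 88.8, "GPQA diamond": 3.5,
        "HellaSwag": 76.5, "MATH level 5": 17.6, "MMLU": 70.7,
        "OpenBookQA": 83.2, "TriviaQA": 73.9, "Winogrande": 63.0}
mixtral = {"ANLI": 32.8, "ARC AI2": 83.1, "GPQA diamond": 7.5,
           "HellaSwag": 82.3, "MATH level 5": 9.9, "MMLU": 60.8,
           "OpenBookQA": 81.1, "TriviaQA": 82.2, "Winogrande": 54.4}

shared = phi3.keys() & mixtral.keys()          # benchmarks both models report
wins = sum(phi3[b] > mixtral[b] for b in shared)
print(f"phi-3-medium 14B wins {wins} of {len(shared)} shared benchmarks")
# -> phi-3-medium 14B wins 6 of 9 shared benchmarks
```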
Hype vs Reality
phi-3-medium 14B · #46 by performance · no signal · quiet
Mixtral 8x7B Instruct · #52 by performance · no signal · quiet
Best value
phi-3-medium 14B · no price
Mixtral 8x7B Instruct · 107.0 pts/$ · $0.54/M tokens
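The page does not document how a figure like 107.0 pts/$ is constructed. One plausible reading, sketched below, divides a mean benchmark score by the blended per-million-token price; note that a naive mean of the nine shared scores gives roughly 101.7 pts/$, so the site's aggregate evidently weights benchmarks differently. Treat this as an assumption, not the site's formula:

```python
# Hypothetical points-per-dollar metric. "pts" as a plain mean of the
# nine shared benchmark scores is an assumption made for illustration.
mixtral_scores = [32.8, 83.1, 7.5, 82.3, 9.9, 60.8, 81.1, 82.2, 54.4]
price_per_million = 0.54  # blended $/1M tokens, from the pricing table below

pts = sum(mixtral_scores) / len(mixtral_scores)  # ~54.9
print(f"{pts / price_per_million:.1f} pts/$")    # ~101.7 under this naive mean
```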
Vendor risk
Microsoft · $3.00T · Big Tech · Low risk
Mistral AI · $14.0B · Tier 1 · Medium risk
Head to head
ANLI
phi-3-medium 14B leads by +0.9
ANLI (Adversarial NLI) · adversarially constructed natural language inference dataset where each round targets weaknesses found in previous model generations.
phi-3-medium 14B: 33.7 · Mixtral 8x7B Instruct: 32.8
ARC AI2
phi-3-medium 14B leads by +5.7
AI2 Reasoning Challenge · tests grade-school level science knowledge with multiple-choice questions requiring reasoning beyond simple retrieval.
phi-3-medium 14B: 88.8 · Mixtral 8x7B Instruct: 83.1
GPQA diamond
Mixtral 8x7B Instruct leads by +4.0
Graduate-Level Google-Proof QA (Diamond set) · expert-crafted questions in physics, biology, and chemistry that are difficult even for domain PhDs.
phi-3-medium 14B: 3.5 · Mixtral 8x7B Instruct: 7.5
HellaSwag
Mixtral 8x7B Instruct leads by +5.8
HellaSwag · tests commonsense reasoning by asking models to predict the most plausible continuation of everyday scenarios.
phi-3-medium 14B: 76.5 · Mixtral 8x7B Instruct: 82.3
MATH level 5
phi-3-medium 14B leads by +7.7
MATH Level 5 · the hardest tier of the MATH benchmark, featuring competition-level problems from AMC, AIME, and Olympiad-style mathematics.
phi-3-medium 14B: 17.6 · Mixtral 8x7B Instruct: 9.9
MMLU
phi-3-medium 14B leads by +9.9
Massive Multitask Language Understanding · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.
phi-3-medium 14B: 70.7 · Mixtral 8x7B Instruct: 60.8
OpenBookQA
phi-3-medium 14B leads by +2.1
OpenBookQA · science questions that require combining a given core fact with broad common knowledge, mimicking an open-book exam setting.
phi-3-medium 14B: 83.2 · Mixtral 8x7B Instruct: 81.1
TriviaQA
Mixtral 8x7B Instruct leads by +8.3
TriviaQA · reading comprehension benchmark with trivia questions, requiring models to find and reason over evidence from provided documents.
phi-3-medium 14B: 73.9 · Mixtral 8x7B Instruct: 82.2
Winogrande
phi-3-medium 14B leads by +8.6
WinoGrande · large-scale commonsense reasoning benchmark where models must resolve ambiguous pronouns in carefully constructed sentence pairs.
phi-3-medium 14B: 63.0 · Mixtral 8x7B Instruct: 54.4
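Each "leads by" figure above is just the absolute difference of the two displayed scores, rounded to one decimal. A compact check on two of the rows (data transcribed from this page; the loop is illustrative):

```python
# "leads by" deltas computed from the displayed one-decimal scores,
# keyed as (phi-3-medium 14B, Mixtral 8x7B Instruct).
rows = {"HellaSwag": (76.5, 82.3), "MATH level 5": (17.6, 9.9)}

for bench, (phi, mix) in rows.items():
    leader = "phi-3-medium 14B" if phi > mix else "Mixtral 8x7B Instruct"
    print(f"{bench}: {leader} leads by +{abs(phi - mix):.1f}")
# HellaSwag: Mixtral 8x7B Instruct leads by +5.8
# MATH level 5: phi-3-medium 14B leads by +7.7
```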
Full benchmark table
Benchmark · phi-3-medium 14B · Mixtral 8x7B Instruct
ANLI · 33.7 · 32.8
ARC AI2 · 88.8 · 83.1
GPQA diamond · 3.5 · 7.5
HellaSwag · 76.5 · 82.3
MATH level 5 · 17.6 · 9.9
MMLU · 70.7 · 60.8
OpenBookQA · 83.2 · 81.1
TriviaQA · 73.9 · 82.2
Winogrande · 63.0 · 54.4
Pricing · per 1M tokens · projected $/mo at 10M tokens
Model · Input · Output · Context · Projected $/mo
phi-3-medium 14B · no price · n/a · n/a · n/a
Mixtral 8x7B Instruct · $0.54 · $0.54 · 33K tokens (~16 books) · $5.40
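The projected monthly cost follows directly from the listed rates: 10M tokens at a blended $0.54 per million. A quick check, with illustrative variable names (the page does not specify an input/output split, which is moot here since the two rates are equal):

```python
# Reproducing the $5.40/mo projection for Mixtral 8x7B Instruct.
INPUT_RATE = 0.54             # $ per 1M input tokens
OUTPUT_RATE = 0.54            # $ per 1M output tokens
MONTHLY_TOKENS = 10_000_000   # the page's 10M-token projection

blended = (INPUT_RATE + OUTPUT_RATE) / 2       # equal rates -> 0.54
projected = MONTHLY_TOKENS / 1_000_000 * blended
print(f"${projected:.2f}/mo")                  # -> $5.40/mo
```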