
MPT-30B vs Phi 2

Side by side · benchmarks, pricing, and signals you can act on.

Winner summary

Phi 2 wins 10 of 13 shared benchmarks. Leads in knowledge · reasoning · general.

Category leads
knowledge · Phi 2
reasoning · Phi 2
general · Phi 2
language · Phi 2
math · Phi 2
Hype vs Reality
MPT-30B · #182 by perf · no signal · QUIET
Phi 2 · #185 by perf · no signal · QUIET
Best value
MPT-30B · no price
Phi 2 · no price
Vendor risk
Unknown
private · undisclosed
Unknown
Microsoft
$3.00T·Big Tech
Low risk
Head to head
MPT-30B · Phi 2
ARC AI2
Phi 2 leads by +33.7
AI2 Reasoning Challenge · tests grade-school level science knowledge with multiple-choice questions requiring reasoning beyond simple retrieval.
MPT-30B
34.1
Phi 2
67.9
BBH
Phi 2 leads by +28.5
BIG-Bench Hard · a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average humans.
MPT-30B
17.3
Phi 2
45.9
HellaSwag
MPT-30B leads by +30.4
HellaSwag · tests commonsense reasoning by asking models to predict the most plausible continuation of everyday scenarios.
MPT-30B
68.5
Phi 2
38.1
BBH (HuggingFace)
Phi 2 leads by +21.5
MPT-30B
6.5
Phi 2
28.0
GPQA
Phi 2 leads by +1.6
MPT-30B
1.3
Phi 2
2.9
IFEval
Phi 2 leads by +5.9
MPT-30B
21.5
Phi 2
27.4
MATH Level 5
Phi 2 leads by +1.4
MPT-30B
1.6
Phi 2
3.0
MMLU-PRO
Phi 2 leads by +15.8
MPT-30B
2.3
Phi 2
18.1
MUSR
Phi 2 leads by +10.9
MPT-30B
2.9
Phi 2
13.8
MMLU
Phi 2 leads by +14.0
Massive Multitask Language Understanding · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.
MPT-30B
30.5
Phi 2
44.5
OpenBookQA
Phi 2 leads by +28.8
OpenBookQA · science questions that require combining a given core fact with broad common knowledge, mimicking an open-book exam setting.
MPT-30B
36.0
Phi 2
64.8
TriviaQA
MPT-30B leads by +28.4
TriviaQA · reading comprehension benchmark with trivia questions, requiring models to find and reason over evidence from provided documents.
MPT-30B
73.6
Phi 2
45.2
WinoGrande
MPT-30B leads by +32.6
WinoGrande · large-scale commonsense reasoning benchmark where models must resolve ambiguous pronouns in carefully constructed sentence pairs.
MPT-30B
42.0
Phi 2
9.4
Full benchmark table
Benchmark            MPT-30B    Phi 2
ARC AI2                 34.1     67.9
BBH                     17.3     45.9
HellaSwag               68.5     38.1
BBH (HuggingFace)        6.5     28.0
GPQA                     1.3      2.9
IFEval                  21.5     27.4
MATH Level 5             1.6      3.0
MMLU-PRO                 2.3     18.1
MUSR                     2.9     13.8
MMLU                    30.5     44.5
OpenBookQA              36.0     64.8
TriviaQA                73.6     45.2
WinoGrande              42.0      9.4
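The win count in the summary can be checked against the scores listed on this page. The following is a minimal Python sketch, not the site's own method: the `SCORES` dictionary and the `tally` helper are hypothetical names introduced here, and the values are copied from the benchmark rows above.

```python
# Scores copied from this page as (MPT-30B, Phi 2).
# SCORES and tally() are illustrative names, not part of the site.
SCORES = {
    "ARC AI2": (34.1, 67.9),
    "BBH": (17.3, 45.9),
    "HellaSwag": (68.5, 38.1),
    "BBH (HuggingFace)": (6.5, 28.0),
    "GPQA": (1.3, 2.9),
    "IFEval": (21.5, 27.4),
    "MATH Level 5": (1.6, 3.0),
    "MMLU-PRO": (2.3, 18.1),
    "MUSR": (2.9, 13.8),
    "MMLU": (30.5, 44.5),
    "OpenBookQA": (36.0, 64.8),
    "TriviaQA": (73.6, 45.2),
    "WinoGrande": (42.0, 9.4),
}


def tally(scores):
    """Count wins per model and record (leader, lead) for each benchmark."""
    wins = {"MPT-30B": 0, "Phi 2": 0}
    leads = {}
    for name, (mpt, phi) in scores.items():
        if mpt == phi:
            continue  # a tie counts for neither model
        leader = "MPT-30B" if mpt > phi else "Phi 2"
        wins[leader] += 1
        # Lead is the absolute gap, rounded to one decimal like the page.
        leads[name] = (leader, round(abs(mpt - phi), 1))
    return wins, leads


wins, leads = tally(SCORES)
print(wins)  # {'MPT-30B': 3, 'Phi 2': 10}
```

Leads recomputed this way from the rounded scores come out 0.1 higher than the displayed figures for ARC AI2 and BBH, presumably because the page computes its leads from unrounded values.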
Pricing · per 1M tokens · projected $/mo at 10M tokens
Model      Input    Output    Context    Projected $/mo
MPT-30B
Phi 2
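For reference, a projected monthly figure at 10M tokens could be derived from per-1M-token prices roughly as sketched below. This is an assumption-laden illustration: the page does not state its formula, the `projected_monthly_cost` function is hypothetical, and the even split between input and output tokens is a guess. Neither model has a listed price here, so the sketch returns `None` for both.

```python
from typing import Optional


def projected_monthly_cost(
    input_price_per_1m: Optional[float],
    output_price_per_1m: Optional[float],
    monthly_tokens: int = 10_000_000,
    input_share: float = 0.5,  # assumed split; the page does not specify one
) -> Optional[float]:
    """Estimate monthly spend in dollars, or None when a price is missing."""
    if input_price_per_1m is None or output_price_per_1m is None:
        return None  # matches the "no price" state shown for both models
    input_tokens = monthly_tokens * input_share
    output_tokens = monthly_tokens - input_tokens
    return (
        (input_tokens / 1_000_000) * input_price_per_1m
        + (output_tokens / 1_000_000) * output_price_per_1m
    )


# Both models on this page have no listed price.
print(projected_monthly_cost(None, None))  # None
```

With hypothetical prices of $1.00 input and $2.00 output per 1M tokens, the same function would give $15.00 per month under the assumed even split.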