
Phi 2 vs Gemma 2B

Side by side · benchmarks, pricing, and signals you can act on.

Winner summary

Gemma 2B wins 8 of 14 shared benchmarks. Leads in knowledge · math.

Category leads
knowledge · Gemma 2B
reasoning · Phi 2
general · Phi 2
language · Phi 2
math · Gemma 2B
Hype vs Reality
Phi 2 · #183 by perf · no signal · QUIET
Gemma 2B · #187 by perf · no signal · QUIET
Best value
Phi 2 · no price
Gemma 2B · no price
Vendor risk
Microsoft · $3.00T · Big Tech · Low risk
Google DeepMind · $4.00T · Tier 1 · Low risk
Head to head
ANLI · Gemma 2B leads by +9.3
ANLI (Adversarial NLI) · adversarially constructed natural language inference dataset where each round targets weaknesses found in previous model generations.
Phi 2: 13.8 · Gemma 2B: 23.1

ARC AI2 · Phi 2 leads by +45.1
AI2 Reasoning Challenge · tests grade-school level science knowledge with multiple-choice questions requiring reasoning beyond simple retrieval.
Phi 2: 67.9 · Gemma 2B: 22.8

BBH · Phi 2 leads by +32.3
BIG-Bench Hard · a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average humans.
Phi 2: 45.9 · Gemma 2B: 13.6

HellaSwag · Gemma 2B leads by +23.7
HellaSwag · tests commonsense reasoning by asking models to predict the most plausible continuation of everyday scenarios.
Phi 2: 38.1 · Gemma 2B: 61.9

BBH (HuggingFace) · Phi 2 leads by +6.9
Phi 2: 28.0 · Gemma 2B: 21.1

GPQA · Gemma 2B leads by +2.0
Phi 2: 2.9 · Gemma 2B: 4.9

IFEval · Phi 2 leads by +0.8
Phi 2: 27.4 · Gemma 2B: 26.6

MATH Level 5 · Gemma 2B leads by +4.5
Phi 2: 3.0 · Gemma 2B: 7.4

MMLU-PRO · Gemma 2B leads by +3.6
Phi 2: 18.1 · Gemma 2B: 21.6

MUSR · Phi 2 leads by +2.9
Phi 2: 13.8 · Gemma 2B: 11.0

MMLU · Phi 2 leads by +21.5
Massive Multitask Language Understanding · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.
Phi 2: 44.5 · Gemma 2B: 23.1

OpenBookQA · Gemma 2B leads by +6.7
OpenBookQA · science questions that require combining a given core fact with broad common knowledge, mimicking an open-book exam setting.
Phi 2: 64.8 · Gemma 2B: 71.5

TriviaQA · Gemma 2B leads by +8.0
TriviaQA · reading comprehension benchmark with trivia questions, requiring models to find and reason over evidence from provided documents.
Phi 2: 45.2 · Gemma 2B: 53.2

Winogrande · Gemma 2B leads by +21.4
WinoGrande · large-scale commonsense reasoning benchmark where models must resolve ambiguous pronouns in carefully constructed sentence pairs.
Phi 2: 9.4 · Gemma 2B: 30.8
Full benchmark table
Benchmark · Phi 2 · Gemma 2B
ANLI · 13.8 · 23.1
ARC AI2 · 67.9 · 22.8
BBH · 45.9 · 13.6
HellaSwag · 38.1 · 61.9
BBH (HuggingFace) · 28.0 · 21.1
GPQA · 2.9 · 4.9
IFEval · 27.4 · 26.6
MATH Level 5 · 3.0 · 7.4
MMLU-PRO · 18.1 · 21.6
MUSR · 13.8 · 11.0
MMLU · 44.5 · 23.1
OpenBookQA · 64.8 · 71.5
TriviaQA · 45.2 · 53.2
Winogrande · 9.4 · 30.8
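The "wins 8 of 14 shared benchmarks" summary can be derived mechanically from the table above. A minimal sketch, assuming a simple higher-is-better comparison per benchmark (the scores are the ones listed; the counting logic is an illustration, not this site's actual code):

```python
# Shared benchmark scores from the table above, as (Phi 2, Gemma 2B).
# Higher is better for every benchmark listed.
SCORES = {
    "ANLI": (13.8, 23.1),
    "ARC AI2": (67.9, 22.8),
    "BBH": (45.9, 13.6),
    "HellaSwag": (38.1, 61.9),
    "BBH (HuggingFace)": (28.0, 21.1),
    "GPQA": (2.9, 4.9),
    "IFEval": (27.4, 26.6),
    "MATH Level 5": (3.0, 7.4),
    "MMLU-PRO": (18.1, 21.6),
    "MUSR": (13.8, 11.0),
    "MMLU": (44.5, 23.1),
    "OpenBookQA": (64.8, 71.5),
    "TriviaQA": (45.2, 53.2),
    "Winogrande": (9.4, 30.8),
}

def win_counts(scores):
    """Count how many shared benchmarks each model leads outright."""
    phi = sum(1 for a, b in scores.values() if a > b)
    gemma = sum(1 for a, b in scores.values() if b > a)
    return phi, gemma

phi_wins, gemma_wins = win_counts(SCORES)
print(f"Phi 2: {phi_wins} of {len(SCORES)} · Gemma 2B: {gemma_wins} of {len(SCORES)}")
# Gemma 2B leads 8 of 14, matching the winner summary.
```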
Pricing · per 1M tokens · projected $/mo at 10M tokens
Model · Input · Output · Context · Projected $/mo
Phi 2 · no price · no price · — · —
Gemma 2B · no price · no price · — · —
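Neither model lists hosted pricing here, but the "projected $/mo at 10M tokens" column follows a simple formula once per-1M-token rates exist. A sketch under stated assumptions — the prices below are hypothetical, and the 50/50 input/output split is an assumption, not something this page specifies:

```python
def projected_monthly_cost(input_price, output_price,
                           total_tokens=10_000_000, input_share=0.5):
    """Projected monthly cost given per-1M-token prices.

    input_share is an assumed input/output split; the page does not
    say how its 10M-token projection divides traffic.
    """
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens - input_tokens
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Hypothetical rates of $0.10 in / $0.20 out per 1M tokens -> ~$1.50/mo.
print(projected_monthly_cost(0.10, 0.20))
```

If a vendor later publishes rates for either model, plugging them in reproduces the projection column directly.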