Phi-1.5 vs Cerebras-GPT-13B
Side by side · benchmarks, pricing, and signals you can act on.
Winner summary
Phi-1.5 wins 4 of 5 shared benchmarks and leads the knowledge category. Cerebras-GPT-13B leads only on HellaSwag.
Category leads
knowledge · Phi-1.5
Hype vs Reality
Attention vs performance
Phi-1.5
#221 by perf · no signal
Cerebras-GPT-13B
#209 by perf · no signal
Vendor risk
Who is behind the model
Microsoft
$3.00T · Big Tech
OpenAI
$840.0B · Tier 1
Head to head
5 benchmarks · 2 models
Phi-1.5 · Cerebras-GPT-13B
ARC AI2
Phi-1.5 leads by +16.0
AI2 Reasoning Challenge · tests grade-school level science knowledge with multiple-choice questions requiring reasoning beyond simple retrieval.
Phi-1.5
25.9
Cerebras-GPT-13B
9.9
HellaSwag
Cerebras-GPT-13B leads by +15.8
HellaSwag · tests commonsense reasoning by asking models to predict the most plausible continuation of everyday scenarios.
Phi-1.5
30.1
Cerebras-GPT-13B
45.9
MMLU
Phi-1.5 leads by +15.2
Massive Multitask Language Understanding · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.
Phi-1.5
16.8
Cerebras-GPT-13B
1.6
OpenBookQA
Phi-1.5 leads by +1.9
OpenBookQA · science questions that require combining a given core fact with broad common knowledge, mimicking an open-book exam setting.
Phi-1.5
16.3
Cerebras-GPT-13B
14.4
WinoGrande
Phi-1.5 leads by +25.2
WinoGrande · large-scale commonsense reasoning benchmark where models must resolve ambiguous pronouns in carefully constructed sentence pairs.
Phi-1.5
46.8
Cerebras-GPT-13B
21.6
Full benchmark table
| Benchmark | Phi-1.5 | Cerebras-GPT-13B |
|---|---|---|
| ARC AI2 | 25.9 | 9.9 |
| HellaSwag | 30.1 | 45.9 |
| MMLU | 16.8 | 1.6 |
| OpenBookQA | 16.3 | 14.4 |
| WinoGrande | 46.8 | 21.6 |
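The win count and per-benchmark leads can be recomputed from the table scores. The following is a minimal sketch, not the site's actual method: the function names `head_to_head` and `win_counts` are hypothetical, and it assumes a lead is the absolute score difference rounded to one decimal place.

```python
# Scores copied from the full benchmark table: (Phi-1.5, Cerebras-GPT-13B).
SCORES = {
    "ARC AI2": (25.9, 9.9),
    "HellaSwag": (30.1, 45.9),
    "MMLU": (16.8, 1.6),
    "OpenBookQA": (16.3, 14.4),
    "WinoGrande": (46.8, 21.6),
}
MODELS = ("Phi-1.5", "Cerebras-GPT-13B")


def head_to_head(scores):
    """Return {benchmark: (leader, margin)}.

    Assumption: the margin is the absolute difference of the two
    displayed scores, rounded to one decimal. Ties are not handled
    because none occur in this table.
    """
    results = {}
    for benchmark, (first, second) in scores.items():
        leader = MODELS[0] if first > second else MODELS[1]
        results[benchmark] = (leader, round(abs(first - second), 1))
    return results


def win_counts(results):
    """Count how many benchmarks each model leads."""
    counts = {model: 0 for model in MODELS}
    for leader, _margin in results.values():
        counts[leader] += 1
    return counts


results = head_to_head(SCORES)
wins = win_counts(results)

for benchmark, (leader, margin) in results.items():
    print(f"{benchmark}: {leader} leads by +{margin}")
print(wins)
```

Working from the displayed scores keeps the summary line and the per-benchmark leads consistent with the table, since both are derived from the same five rows.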
Pricing · per 1M tokens · projected $/mo at 10M tokens
| Model | Input | Output | Context | Projected $/mo |
|---|---|---|---|---|
| — | — | — | — | — |
| — | — | — | — | — |
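No prices are listed for either model, so the projected column cannot be filled in here. As a rough sketch of how a projection at 10M tokens per month could be derived from per-1M-token prices, the function below uses a hypothetical name (`projected_monthly_cost`) and assumes a configurable split between input and output tokens. The page does not state how it splits the volume, so the 50/50 default is an assumption.

```python
def projected_monthly_cost(input_price, output_price,
                           monthly_tokens=10_000_000, input_share=0.5):
    """Estimate monthly spend in dollars from per-1M-token prices.

    input_price / output_price: dollars per 1M tokens.
    monthly_tokens: total tokens per month (10M matches the table heading).
    input_share: fraction of tokens billed as input (assumed, not from the page).
    """
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    input_tokens = monthly_tokens * input_share
    output_tokens = monthly_tokens - input_tokens
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Placeholder prices for illustration only; they are NOT the prices of
# Phi-1.5 or Cerebras-GPT-13B, which the table leaves blank.
example = projected_monthly_cost(input_price=0.10, output_price=0.30)
print(f"${example:.2f} per month")
```

Once real input and output prices are known, passing them in place of the placeholders gives a figure comparable to the "Projected $/mo" column.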