
Stable Beluga 2 vs Nemotron-4 15B

Side by side · benchmarks, pricing, and signals you can act on.

Winner summary

Stable Beluga 2 wins 6 of 6 shared benchmarks. Leads in knowledge · reasoning · math.

Category leads
knowledge · Stable Beluga 2
reasoning · Stable Beluga 2
math · Stable Beluga 2
Hype vs Reality
Stable Beluga 2 · #102 by perf · no signal · QUIET
Nemotron-4 15B · #78 by perf · no signal · QUIET
Best value
Stable Beluga 2 · no price
Nemotron-4 15B · no price
Vendor risk
Stable Beluga 2 · Unknown · private · undisclosed
Nemotron-4 15B · Unknown · private · undisclosed
Head to head
Stable Beluga 2 · Nemotron-4 15B
ARC AI2
Stable Beluga 2 leads by +40.8
AI2 Reasoning Challenge · tests grade-school level science knowledge with multiple-choice questions requiring reasoning beyond simple retrieval.
Stable Beluga 2
81.5
Nemotron-4 15B
40.7
BBH
Stable Beluga 2 leads by +14.2
BIG-Bench Hard · a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average humans.
Stable Beluga 2
59.1
Nemotron-4 15B
44.9
GSM8K
Stable Beluga 2 leads by +23.6
Grade School Math 8K · 8,500 linguistically diverse grade-school math word problems that require multi-step reasoning to solve.
Stable Beluga 2
69.6
Nemotron-4 15B
46.0
HellaSwag
Stable Beluga 2 leads by +2.3
HellaSwag · tests commonsense reasoning by asking models to predict the most plausible continuation of everyday scenarios.
Stable Beluga 2
78.8
Nemotron-4 15B
76.5
MMLU
Stable Beluga 2 leads by +13.2
Massive Multitask Language Understanding · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.
Stable Beluga 2
58.1
Nemotron-4 15B
44.9
PIQA
Stable Beluga 2 leads by +1.8
PIQA (Physical Interaction QA) · tests intuitive physical reasoning by asking models to select the correct approach for everyday physical tasks.
Stable Beluga 2
66.6
Nemotron-4 15B
64.8
Full benchmark table
Benchmark · Stable Beluga 2 · Nemotron-4 15B
ARC AI2 · 81.5 · 40.7
BBH · 59.1 · 44.9
GSM8K · 69.6 · 46.0
HellaSwag · 78.8 · 76.5
MMLU · 58.1 · 44.9
PIQA · 66.6 · 64.8
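The per-benchmark "leads by" deltas and the winner summary reduce to simple subtraction over the score table. As a minimal sketch (the score dictionary below is transcribed from the table above; higher is better on every benchmark listed):

```python
# Scores transcribed from the benchmark table: (Stable Beluga 2, Nemotron-4 15B).
scores = {
    "ARC AI2":   (81.5, 40.7),
    "BBH":       (59.1, 44.9),
    "GSM8K":     (69.6, 46.0),
    "HellaSwag": (78.8, 76.5),
    "MMLU":      (58.1, 44.9),
    "PIQA":      (66.6, 64.8),
}

wins = 0
for name, (beluga, nemotron) in scores.items():
    delta = beluga - nemotron          # positive delta = Stable Beluga 2 leads
    if delta > 0:
        wins += 1
    print(f"{name}: Stable Beluga 2 leads by {delta:+.1f}")

print(f"Stable Beluga 2 wins {wins} of {len(scores)} shared benchmarks")
```

Running this reproduces the headline figures: six wins out of six, with leads ranging from +1.8 (PIQA) to +40.8 (ARC AI2).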
Pricing · per 1M tokens · projected $/mo at 10M tokens
Model · Input · Output · Context · Projected $/mo
Stable Beluga 2 · no price
Nemotron-4 15B · no price