Stable Beluga 2 vs Nemotron-4 15B
Side by side · benchmarks, pricing, and signals you can act on.
Winner summary
Stable Beluga 2 wins all 6 of 6 shared benchmarks, leading in knowledge, reasoning, and math.
Category leads
knowledge · Stable Beluga 2
reasoning · Stable Beluga 2
math · Stable Beluga 2
Hype vs Reality
Attention vs performance
Stable Beluga 2 · #102 by performance · no attention signal
Nemotron-4 15B · #78 by performance · no attention signal
Vendor risk
Who is behind the model
Stable Beluga 2 · Unknown vendor · private · undisclosed
Nemotron-4 15B · Unknown vendor · private · undisclosed
Head to head
6 benchmarks · 2 models
ARC AI2
Stable Beluga 2 leads by +40.8
AI2 Reasoning Challenge · tests grade-school level science knowledge with multiple-choice questions requiring reasoning beyond simple retrieval.
Stable Beluga 2
81.5
Nemotron-4 15B
40.7
BBH
Stable Beluga 2 leads by +14.2
BIG-Bench Hard · a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average humans.
Stable Beluga 2
59.1
Nemotron-4 15B
44.9
GSM8K
Stable Beluga 2 leads by +23.6
Grade School Math 8K · 8,500 linguistically diverse grade-school math word problems that require multi-step reasoning to solve.
Stable Beluga 2
69.6
Nemotron-4 15B
46.0
HellaSwag
Stable Beluga 2 leads by +2.3
HellaSwag · tests commonsense reasoning by asking models to predict the most plausible continuation of everyday scenarios.
Stable Beluga 2
78.8
Nemotron-4 15B
76.5
MMLU
Stable Beluga 2 leads by +13.2
Massive Multitask Language Understanding · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.
Stable Beluga 2
58.1
Nemotron-4 15B
44.9
PIQA
Stable Beluga 2 leads by +1.8
PIQA (Physical Interaction QA) · tests intuitive physical reasoning by asking models to select the correct approach for everyday physical tasks.
Stable Beluga 2
66.6
Nemotron-4 15B
64.8
Full benchmark table
| Benchmark | Stable Beluga 2 | Nemotron-4 15B |
|---|---|---|
| ARC AI2 | 81.5 | 40.7 |
| BBH | 59.1 | 44.9 |
| GSM8K | 69.6 | 46.0 |
| HellaSwag | 78.8 | 76.5 |
| MMLU | 58.1 | 44.9 |
| PIQA | 66.6 | 64.8 |
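The per-benchmark leads shown above are simple score differences between the two models. A minimal sketch that recomputes them from the table (scores hardcoded from this page):

```python
# Scores from the benchmark table: (Stable Beluga 2, Nemotron-4 15B).
scores = {
    "ARC AI2":   (81.5, 40.7),
    "BBH":       (59.1, 44.9),
    "GSM8K":     (69.6, 46.0),
    "HellaSwag": (78.8, 76.5),
    "MMLU":      (58.1, 44.9),
    "PIQA":      (66.6, 64.8),
}

# Lead = Stable Beluga 2 score minus Nemotron-4 15B score, to one decimal.
leads = {name: round(a - b, 1) for name, (a, b) in scores.items()}

# A "win" is any positive lead; here that is all six shared benchmarks.
wins = sum(1 for d in leads.values() if d > 0)

print(leads)
print(f"Stable Beluga 2 wins {wins} of {len(scores)} shared benchmarks")
```

Note that the largest margin (ARC AI2, +40.8) dominates the comparison, while HellaSwag and PIQA are close to a tie.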
Pricing · per 1M tokens · projected $/mo at 10M tokens
| Model | Input | Output | Context | Projected $/mo |
|---|---|---|---|---|
| Stable Beluga 2 | — | — | — | — |
| Nemotron-4 15B | — | — | — | — |
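The projected $/mo figure follows the usual per-token arithmetic: split the monthly token budget between input and output, then price each side at its per-1M-token rate. Since both models' prices are undisclosed here, the rates in the sketch below are placeholders, not real pricing, and the 50/50 input/output split is an assumption:

```python
def projected_monthly_cost(input_per_1m, output_per_1m,
                           monthly_tokens=10_000_000, output_share=0.5):
    """Estimate monthly spend given per-1M-token input/output rates.

    output_share is the assumed fraction of tokens that are output;
    real workloads vary, so treat this as a rough projection only.
    """
    out_tokens = monthly_tokens * output_share
    in_tokens = monthly_tokens - out_tokens
    return (in_tokens / 1e6) * input_per_1m + (out_tokens / 1e6) * output_per_1m

# Placeholder rates (NOT these models' actual pricing):
# $0.20 per 1M input tokens, $0.60 per 1M output tokens.
print(projected_monthly_cost(0.20, 0.60))  # 5 * 0.20 + 5 * 0.60 = 4.0
```

With published prices for either model, the "—" cells above could be filled in by the same formula.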