Stable Beluga 2 vs Qwen-14B

Side by side · benchmarks, pricing, and signals you can act on.

Winner summary

Stable Beluga 2 wins on 6/6 benchmarks

Stable Beluga 2 wins all 6 shared benchmarks and leads in every category: knowledge, reasoning, and math.

Category leads

knowledge · Stable Beluga 2
reasoning · Stable Beluga 2
math · Stable Beluga 2

Hype vs Reality

Attention vs performance

Stable Beluga 2

#102 by performance · no signal

QUIET

Qwen-14B

#37 by performance · no signal

QUIET

Best value

Pricing unknown

Stable Beluga 2

—

no price

Qwen-14B

—

no price

Vendor risk

Who is behind the model

Stable Beluga 2

Unknown

private · undisclosed

Risk unknown

Qwen-14B

Alibaba (Qwen)

$293.0B · Tier 1

Low risk

Head to head

6 benchmarks · 2 models

Stable Beluga 2 · Qwen-14B

ARC AI2

Stable Beluga 2 leads by +2.3

AI2 Reasoning Challenge · tests grade-school level science knowledge with multiple-choice questions requiring reasoning beyond simple retrieval.

Stable Beluga 2

81.5

Qwen-14B

79.2

BBH

Stable Beluga 2 leads by +19.1

BIG-Bench Hard · a curated subset of 23 challenging tasks from BIG-Bench where language models previously failed to outperform average humans.

Stable Beluga 2

59.1

Qwen-14B

40.0

GSM8K

Stable Beluga 2 leads by +8.3

Grade School Math 8K · 8,500 linguistically diverse grade-school math word problems that require multi-step reasoning to solve.

Stable Beluga 2

69.6

Qwen-14B

61.3

LAMBADA

Stable Beluga 2 leads by +0.2

LAMBADA · measures the ability to predict the final word of a passage, requiring broad contextual understanding across long text spans.

Stable Beluga 2

71.3

Qwen-14B

71.1

MMLU

Stable Beluga 2 leads by +3.0

Massive Multitask Language Understanding · 57 subjects spanning STEM, humanities, social sciences, and more. The standard benchmark for broad knowledge.

Stable Beluga 2

58.1

Qwen-14B

55.1

PIQA

Stable Beluga 2 leads by +6.8

PIQA (Physical Interaction QA) · tests intuitive physical reasoning by asking models to select the correct approach for everyday physical tasks.

Stable Beluga 2

66.6

Qwen-14B

59.8

Full benchmark table

Benchmark	Stable Beluga 2	Qwen-14B
ARC AI2	81.5	79.2
BBH	59.1	40.0
GSM8K	69.6	61.3
LAMBADA	71.3	71.1
MMLU	58.1	55.1
PIQA	66.6	59.8
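
The per-benchmark leads and the 6/6 tally can be rechecked from the table with a few lines of Python. This is a minimal sketch, not the site's actual method: the names `SCORES`, `margins`, and `tally` are hypothetical, and the scores are copied by hand from the table. The displayed scores may themselves be rounded, so a recomputed margin can differ by 0.1 from the lead shown on the page.

```python
# Scores copied from the benchmark table: (Stable Beluga 2, Qwen-14B).
SCORES = {
    "ARC AI2": (81.5, 79.2),
    "BBH": (59.1, 40.0),
    "GSM8K": (69.6, 61.3),
    "LAMBADA": (71.3, 71.1),
    "MMLU": (58.1, 55.1),
    "PIQA": (66.6, 59.8),
}


def margins(scores):
    """Return {benchmark: first model's score minus the second's}, rounded to 0.1."""
    return {name: round(a - b, 1) for name, (a, b) in scores.items()}


def tally(scores):
    """Count benchmarks won by (first, second) model; ties count for neither."""
    first = sum(1 for a, b in scores.values() if a > b)
    second = sum(1 for a, b in scores.values() if b > a)
    return first, second


if __name__ == "__main__":
    for name, lead in margins(SCORES).items():
        print(f"{name}: {lead:+.1f}")
    print("wins (Stable Beluga 2, Qwen-14B):", tally(SCORES))
```

The margins vary widely: the LAMBADA gap is 0.2 points while the BBH gap is 19.1, so the win count alone overstates how uniform the lead is.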

Pricing · per 1M tokens · projected $/mo at 10M tokens

Model	Input	Output	Context	Projected $/mo
Stable Beluga 2	—	—	—	—
Qwen-14B	—	—	—	—
