
DeepSeek

Chinese AI lab that shipped DeepSeek V3 (MoE 671B) and R1 (open reasoning model) · crashed frontier pricing in 2025.


Level 1

DeepSeek is a Hangzhou-based AI research company. Their DeepSeek V3 (Dec 2024) was a 671B-parameter MoE with 37B parameters active per token, matching GPT-4-class benchmarks at roughly 1/30th the inference cost. DeepSeek R1 (Jan 2025) demonstrated that pure RL on verifiable rewards can produce reasoning behavior · and they open-sourced both the weights and the recipe. V3.2 (2025) and R2 (2026) continued the series. The DeepSeek moment reshaped expectations around what open-weight models could do.

Level 2

DeepSeek V3 architecture: 256 routed experts + 1 shared expert, 8 routed experts active per token, 671B total parameters, 37B active. Multi-head Latent Attention (MLA) reduces KV cache size by ~7× vs standard multi-head attention. DeepSeek R1 applied pure outcome-reward RL to a strong base model and observed emergent long-CoT reasoning, including self-reflection and self-correction · no process supervision needed. Distilled R1 variants (1.5B to 70B) perform well above their size class on reasoning benchmarks. DeepSeek's pricing: V3 at $0.27/M input (cache miss) and $1.10/M output, R1 at $2.19/M output · an order of magnitude cheaper than Western frontier APIs.
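The routing scheme above can be sketched in a few lines. This is an illustrative toy (hypothetical dimensions and random weights, not DeepSeek's actual gating function): a router scores all 256 routed experts per token, the top 8 are activated, and their gate weights are normalized.

```python
import numpy as np

N_ROUTED, TOP_K, D = 256, 8, 16  # 256 routed experts, 8 active; toy hidden dim

rng = np.random.default_rng(0)
router_w = rng.normal(size=(D, N_ROUTED))  # router projection (random for the sketch)
token = rng.normal(size=(D,))              # one token's hidden state

scores = token @ router_w                  # affinity score per routed expert
topk = np.argsort(scores)[-TOP_K:]         # indices of the 8 selected experts
gates = np.exp(scores[topk])
gates /= gates.sum()                       # normalized gate weights over the top-k

# Only these 8 routed experts (plus the always-on shared expert) run for
# this token · which is why only ~37B of 671B parameters are active.
print(len(topk), round(float(gates.sum()), 6))
```

Because each token touches only 8 of 256 routed experts, per-token FLOPs scale with the 37B active parameters, not the 671B total · the core of the cost advantage.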

Level 3

DeepSeek V3 training details (publicly disclosed): 14.8T tokens, FP8 training from scratch (first frontier-class model to do so), ~$5.6M disclosed GPU cost for the final run. Multi-head Latent Attention compresses the KV cache via low-rank projection. DualPipe pipelining overlaps communication and computation for efficient multi-GPU training. R1 training: GRPO (Group Relative Policy Optimization) · a PPO variant that normalizes rewards within a group of sampled completions per prompt, removing the need for a learned value network. The R1 paper (Jan 2025) showed that a pure-RL recipe without SFT warmup (R1-Zero) also works, contradicting conventional wisdom. Architecture + training efficiency are DeepSeek's differentiators · export controls mean they don't have access to H100s in the same quantity as Western labs.
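GRPO's group-relative advantage is simple enough to sketch directly. A minimal illustration, assuming binary outcome rewards on G sampled completions for one prompt (the rewards here are made up): each completion's advantage is its reward normalized against the group mean and standard deviation, so no value network is needed.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: normalize each completion's reward
    against the mean/std of its own sampling group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 sampled answers to one verifiable math prompt,
# outcome reward = 1.0 if the final answer checks out, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = grpo_advantages(rewards)
print(adv)  # correct completions get positive advantage, incorrect negative
```

These advantages then weight a clipped PPO-style policy-gradient update; the group statistics play the role the critic would in standard PPO, which is what makes the recipe cheaper to run.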

Why this matters now

DeepSeek V3 and R1 opened the door to open-weight frontier quality at 1/30th the price. Every pricing comparison since treats DeepSeek as the floor.

The takeaway for you
If you are a Researcher
  • MoE + MLA + FP8 training · three stacked efficiencies
  • Pure RL produces reasoning without process supervision (R1 result)
  • GRPO as a cheaper RLHF alternative
If you are a Builder
  • DeepSeek V3 for cheap high-quality inference · $0.27/M input
  • R1 for open-weight reasoning · $2.19/M output
  • Self-host for max cost savings if you have the VRAM (800GB+ for the 671B MoE)
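The 800GB+ self-hosting figure above is easy to sanity-check. A back-of-envelope sketch under rough assumptions (FP8 weights at 1 byte per parameter; KV cache and runtime overhead on top; real deployments vary):

```python
# Rough VRAM budget for self-hosting the full 671B MoE.
PARAMS = 671e9            # total parameters (all experts must be resident)
BYTES_PER_PARAM_FP8 = 1   # assumed FP8 weight storage

weights_gb = PARAMS * BYTES_PER_PARAM_FP8 / 1e9
print(f"FP8 weights alone: ~{weights_gb:.0f} GB")

# KV cache, activations, and framework overhead push a practical budget
# past 800 GB · i.e. a multi-GPU node, not a single accelerator. Note all
# 671B parameters must fit in memory even though only 37B are active per token.
```

MoE sparsity saves compute, not memory: every expert can be routed to, so every expert's weights must be loaded.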
If you are an Investor
  • DeepSeek crashed frontier pricing expectations · reshapes US lab margins
  • Demonstrates Chinese AI labs can match frontier at a fraction of Western capex
  • Export controls on NVIDIA GPUs didn't prevent frontier-class training
If you are a Curious Normie
  • Chinese AI that's almost as smart as GPT-5 but way cheaper
  • Open source · anyone can download and run it
  • Kicked off the "AI is getting cheaper fast" narrative
Gecko's take

DeepSeek V3 + R1 were the single biggest pricing event of 2025. Every API price floor is now set by what DeepSeek charges.

Hangzhou, China. Founded in 2023 by the quant hedge fund High-Flyer.