DeepSeek
Chinese AI lab that shipped DeepSeek V3 (a 671B-parameter MoE) and R1 (an open reasoning model); crashed frontier pricing in 2025.
Basic
DeepSeek is a Hangzhou-based AI research company. Their DeepSeek V3 (Dec 2024) was a 671B-parameter MoE with 37B active, matching GPT-4-class benchmarks at roughly 1/30th the inference cost. DeepSeek R1 (Jan 2025) demonstrated that pure RL on verifiable rewards produces reasoning behavior, and they open-sourced the weights and recipe. V3.2 (2025) and R2 (2026) continued the series. The DeepSeek moment reshaped expectations around what open-weight models could do.
Deep
DeepSeek V3 architecture: 256 routed experts plus 1 shared expert, 8 routed experts active per token, 671B total parameters, 37B active. Multi-head Latent Attention (MLA) reduces KV cache size by roughly 7× versus standard multi-head attention. DeepSeek R1 applied pure outcome-reward RL to a strong base model and observed emergent long-CoT reasoning, including self-reflection and self-correction, with no process supervision needed. Distilled R1 variants (1.5B to 70B) perform well above their size class on reasoning benchmarks. DeepSeek's pricing: V3 at $0.27/M input tokens, R1 at $2.19/M output tokens; an order of magnitude cheaper than Western frontier rates.
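The expert-selection step described above can be sketched in a few lines. This is a generic softmax top-k MoE gate, not DeepSeek's exact gating function (V3 uses sigmoid affinity scores with a bias-based load-balancing adjustment); all names and shapes here are illustrative:

```python
import numpy as np

def moe_route(hidden, gate_w, k=8):
    """Select the top-k routed experts for one token.

    V3 has 256 routed experts with 8 active per token, plus one
    always-on shared expert (not shown here). This sketch covers
    only the top-k selection and weight normalization.
    """
    logits = hidden @ gate_w               # one score per expert
    topk = np.argsort(logits)[-k:]         # indices of the k chosen experts
    weights = np.exp(logits[topk])
    weights /= weights.sum()               # normalize over selected experts
    return topk, weights

rng = np.random.default_rng(0)
h = rng.standard_normal(64)                # toy hidden state
W = rng.standard_normal((64, 256))         # gate matrix: 256 routed experts
experts, w = moe_route(h, W)               # 8 expert indices + mixing weights
```

The token's output is then the shared expert's output plus the weighted sum of the 8 selected experts' outputs; only those 8 experts' parameters are touched, which is why 37B of 671B parameters are active per token.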
Expert
DeepSeek V3 training details (publicly disclosed): 14.8T tokens, FP8 training from scratch (the first frontier-class model to do so), and an estimated $5-6M training cost. Multi-head Latent Attention compresses the KV cache via low-rank projection. DualPipe pipelining overlaps communication and computation for efficient multi-GPU training. R1 training used GRPO (Group Relative Policy Optimization), a PPO-style algorithm that computes advantages from group-relative rewards, removing the need for a learned value network. The R1 paper (Jan 2025) showed that a pure-RL recipe without SFT warmup (R1-Zero) also works, contradicting conventional wisdom. Architecture and training efficiency are DeepSeek's differentiators; they don't have access to H100s in the same quantity as Western labs.
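The group-relative advantage at the heart of GRPO is simple to state: sample a group of completions per prompt, score each with the outcome reward, and normalize each reward against the group's mean and standard deviation. A minimal sketch under those assumptions (variable names illustrative, and omitting the policy-gradient and KL terms of the full algorithm):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages for one prompt's sampled completions.

    Instead of a learned value network (as in PPO), GRPO uses the
    group itself as the baseline: reward minus group mean, divided
    by group std. Above-average completions get positive advantage.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# One prompt, four sampled completions scored by a verifier (1 = correct):
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Each completion's tokens are then reinforced in proportion to its advantage, so the baseline comes for free from the sampling already being done.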
DeepSeek V3 and R1 opened the door to open-weight frontier quality at roughly 1/30th the price. Every pricing comparison since treats DeepSeek as the floor.
Depending on why you're here
- MoE + MLA + FP8 training: three stacked efficiencies
- Pure RL produces reasoning without process supervision (the R1 result)
- GRPO as a cheaper RLHF alternative
- DeepSeek V3 for cheap high-quality inference: $0.27/M input tokens
- R1 for open-weight reasoning: $2.19/M output tokens
- Self-host for maximum cost savings if you have the VRAM (800GB+ for the 671B MoE)
- DeepSeek crashed frontier pricing expectations, reshaping US lab margins
- Demonstrates Chinese AI labs can match the frontier at a fraction of Western capex
- Export controls on NVIDIA GPUs didn't prevent frontier-class training
- Chinese AI that's almost as smart as GPT-5 but way cheaper
- Open weights: anyone can download and run it
- Kicked off the "AI is getting cheaper fast" narrative
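The pricing gap can be sanity-checked with back-of-envelope arithmetic; the Western frontier rate below is an illustrative assumption, not a quoted price:

```python
# Output-token price ratio: assumed Western frontier rate vs R1's listed rate.
R1_PRICE = 2.19        # $/M output tokens (from this page)
FRONTIER_PRICE = 60.0  # $/M output tokens; illustrative assumption only
ratio = FRONTIER_PRICE / R1_PRICE
print(f"R1 is ~{ratio:.0f}x cheaper per output token")
```

Even with generous assumptions for the frontier rate, the gap stays in the order-of-magnitude range, which is why DeepSeek sets the price floor.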
DeepSeek V3 and R1 were the biggest pricing events of 2025. Every API price floor is now set by what DeepSeek charges.
Read the primary sources
- DeepSeek V3 paper (2024), arxiv.org
- DeepSeek R1 paper (2025), arxiv.org