Distillation
Training a smaller student model to reproduce a larger teacher model's outputs · cheaper to serve with similar quality.
Basic
Distillation (Hinton et al., 2015) trains a student model on the teacher's soft outputs (probability distributions) rather than just the correct labels. The student learns not only what to predict but how confident to be, capturing more nuance than label-only training. Production AI uses distillation to ship smaller, faster variants: small tiers such as Claude Haiku, GPT-4o Mini, and Gemini Flash are widely understood to be distilled from their larger siblings (Claude Sonnet/Opus, GPT-4o, Gemini Pro).
Deep
Teacher-student distillation: the student minimizes KL divergence between its output distribution and the teacher's on a training set. Temperature softens the teacher distribution · higher temperatures surface information about non-top tokens. Distillation targets: logits (most common), hidden states (feature matching), attention maps (relation-matching), or generated sequences (sequence-level). For large-scale distillation of frontier models, sequence-level works best · train the student on teacher-generated outputs as if they were ground truth. Cost: producing a high-quality distillation corpus is expensive (you pay to run the teacher at scale), but once done, student training is cheap.
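The temperature mechanic above can be sketched in a few lines of plain Python. This is a minimal illustration, not a training loop: the 4-token vocabulary and all logit values are made up for the example, and the KL term is computed directly rather than via an autograd framework.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T flattens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical teacher logits over a toy 4-token vocabulary.
teacher_logits = [6.0, 2.5, 1.0, -1.0]

sharp = softmax(teacher_logits, T=1.0)  # top token dominates, rest near zero
soft = softmax(teacher_logits, T=5.0)   # non-top tokens get visible mass

# The student minimizes KL between the softened teacher and student outputs.
student_logits = [4.0, 3.0, 0.5, -0.5]
loss = kl(softmax(teacher_logits, T=5.0), softmax(student_logits, T=5.0))
```

Printing `sharp` vs `soft` shows why higher temperatures matter: at T=1 the runner-up token carries almost no probability, while at T=5 the teacher's relative preferences among all tokens become visible to the student.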
Expert
KL-divergence loss: L = T² · KL(softmax(z_t/T) || softmax(z_s/T)) where T is temperature and z_t, z_s are teacher/student logits. Large T (4-10) reveals dark knowledge · probability mass in non-top tokens. Modern pipelines combine distillation with SFT on high-quality human data: student trains on both teacher outputs and human-labeled examples. For MoE distillation, the student is often a dense model (e.g., DeepSeek R1-Distill variants) · experts don't transfer cleanly to smaller architectures. Quality recovery: a 10× smaller student typically matches 80-95% of teacher benchmark performance depending on task complexity.
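The combined objective described above (distillation plus SFT on labeled data) can be written as L = α·CE(y, z_s) + (1−α)·T²·KL(softmax(z_t/T) || softmax(z_s/T)). Here is a pure-Python sketch of that loss on made-up logits; `alpha`, the logit values, and the vocabulary size are illustrative assumptions, not values from any real pipeline.

```python
import math

def softmax(z, T=1.0):
    exps = [math.exp(v / T) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_loss(z_t, z_s, hard_label, T=4.0, alpha=0.5):
    """alpha * CE on the hard label + (1 - alpha) * T^2 * KL on soft targets.

    The T^2 factor compensates for soft-target gradients shrinking as 1/T^2,
    keeping the two terms on a comparable scale as T varies.
    """
    soft_term = T * T * kl(softmax(z_t, T), softmax(z_s, T))
    hard_term = -math.log(softmax(z_s, 1.0)[hard_label])  # cross-entropy
    return alpha * hard_term + (1 - alpha) * soft_term

# Hypothetical teacher/student logits over a toy 4-token vocabulary.
z_teacher = [6.0, 2.5, 1.0, -1.0]
z_student = [4.0, 3.0, 0.5, -0.5]
loss = distill_loss(z_teacher, z_student, hard_label=0)
```

A student whose logits exactly match the teacher's zeroes out the KL term, leaving only the cross-entropy on the human label, which is the sense in which the soft and hard objectives complement each other.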
Depending on why you're here
- Teacher-student on KL(soft(z_t/T) || soft(z_s/T))
- Sequence-level distillation > logit-level for large-scale
- Dark knowledge in soft distribution is what makes distillation > hard labels
- Use Haiku/Flash/Mini distilled variants when latency and cost matter
- Small distilled models hit 85-95% of frontier quality at 10% the price
- Distill your own models from frontier APIs · train cheaper specialists
- Distillation is the recipe behind the cheap-tier model boom
- Cost-per-quality improvements come from distillation + MoE + better training data
- Frontier lab revenue shifts to distilled variants as quality gaps close
- A big AI teaches a smaller AI everything it knows
- Smaller AI runs cheaper and faster, almost as good
- How AI got dramatically cheaper in 2024-2026
Distillation is why the cheap tier keeps getting smarter. Every frontier release now ships a distilled cousin within weeks.
Read the primary sources
- Hinton distillation paper (2015) · arxiv.org
- DeepSeek R1-Distill paper (2025) · arxiv.org