
Distillation

Training a smaller student model to reproduce a larger teacher model's outputs · cheaper to serve with similar quality.


Level 1

Distillation (Hinton et al., 2015) trains a student model on the teacher's soft outputs (probability distributions) rather than just correct labels. The student learns not just what to predict but how confident to be, capturing more nuance than label-only training. Production AI uses distillation to ship smaller, faster variants: Google describes Gemini Flash as distilled from Gemini Pro, and Claude Haiku and GPT-4o Mini are widely believed to follow the same recipe from their larger siblings.
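The difference between hard labels and soft targets is easy to see with toy numbers. A minimal sketch (the logits and class names are made up for illustration):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: higher T flattens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy teacher logits for classes ["cat", "dog", "car"] on a cat image.
teacher_logits = [5.0, 3.0, -2.0]

hard_label = [1.0, 0.0, 0.0]              # label-only training sees this
soft_T1 = softmax(teacher_logits, T=1.0)  # teacher also says "dog" is plausible
soft_T4 = softmax(teacher_logits, T=4.0)  # higher T surfaces even more of that
```

At T=1 the teacher already assigns roughly 12% to "dog", information the hard label throws away; raising T pushes even more mass onto the non-top classes.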

Level 2

Teacher-student distillation: the student minimizes KL divergence between its output distribution and the teacher's on a training set. Temperature softens the teacher distribution · higher temperatures surface information about non-top tokens. Distillation targets: logits (most common), hidden states (feature matching), attention maps (relation-matching), or generated sequences (sequence-level). For large-scale distillation of frontier models, sequence-level works best · train the student on teacher-generated outputs as if they were ground truth. Cost: producing a high-quality distillation corpus is expensive (you pay to run the teacher at scale), but once done, student training is cheap.
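Sequence-level distillation reduces to ordinary supervised fine-tuning on a teacher-generated corpus. A minimal sketch, where `teacher_generate` is a hypothetical stand-in for a real frontier API client:

```python
import json

def teacher_generate(prompt):
    """Stand-in for a call to the teacher model (hypothetical; replace
    with a real API client). Returns the teacher's completion."""
    return f"Teacher answer to: {prompt}"

def build_distillation_corpus(prompts):
    """Record teacher generations and treat them as ground-truth targets
    for the student's supervised fine-tuning."""
    return [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]

corpus = build_distillation_corpus(
    ["What is distillation?", "Why use a student model?"]
)
# Serialize in the prompt/completion JSONL shape most SFT trainers accept.
jsonl = "\n".join(json.dumps(ex) for ex in corpus)
```

This is where the cost asymmetry shows up: the expensive step is the `teacher_generate` calls at scale; everything after is a cheap standard training run.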

Level 3

KL-divergence loss: L = T² · KL(softmax(z_t/T) || softmax(z_s/T)) where T is temperature and z_t, z_s are teacher/student logits. Large T (4-10) reveals dark knowledge · probability mass in non-top tokens. Modern pipelines combine distillation with SFT on high-quality human data: student trains on both teacher outputs and human-labeled examples. For MoE distillation, the student is often a dense model (e.g., DeepSeek R1-Distill variants) · experts don't transfer cleanly to smaller architectures. Quality recovery: a 10× smaller student typically matches 80-95% of teacher benchmark performance depending on task complexity.
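The loss above can be written out directly. A minimal pure-Python sketch of the Hinton et al. formulation (toy logits, no autograd):

```python
import math

def softmax(logits, T):
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, T=4.0):
    """L = T^2 * KL(softmax(z_t/T) || softmax(z_s/T)).
    The T^2 factor keeps gradient magnitudes comparable as T varies."""
    p = softmax(teacher_logits, T)  # teacher distribution (target)
    q = softmax(student_logits, T)  # student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return T * T * kl

loss_same = kd_loss([5.0, 3.0, -2.0], [5.0, 3.0, -2.0])  # matched student
loss_diff = kd_loss([5.0, 3.0, -2.0], [1.0, 1.0, 1.0])   # uniform student
```

The loss is zero when the student reproduces the teacher's distribution and positive otherwise; in a real pipeline this term is typically mixed with a standard cross-entropy term on labels or teacher sequences.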

The takeaway for you
If you are a
Researcher
  • Teacher-student on KL(soft(z_t/T) || soft(z_s/T))
  • Sequence-level distillation > logit-level for large-scale
  • Dark knowledge in the soft distribution is what makes distillation > hard labels
If you are a
Builder
  • Use Haiku/Flash/Mini distilled variants when latency and cost matter
  • Small distilled models hit 85-95% of frontier quality at ~10% of the price
  • Distill your own models from frontier APIs · train cheaper specialists
If you are a
Investor
  • Distillation is the recipe behind the cheap-tier model boom
  • Cost-per-quality improvements come from distillation + MoE + better training data
  • Frontier lab revenue shifts to distilled variants as quality gaps close
If you are a
Curious Normie
  • A big AI teaches a smaller AI everything it knows
  • The smaller AI runs cheaper and faster, and is almost as good
  • This is how AI got dramatically cheaper in 2024-2026
Gecko's take

Distillation is why the cheap tier keeps getting smarter. Every frontier release now ships a distilled cousin within weeks.

Anthropic has not confirmed, but Haiku's quality-to-size ratio strongly suggests distillation from the Sonnet/Opus line.
Canonical sources