Distillation
Training a smaller student model to reproduce a larger teacher model's outputs · cheaper to serve with similar quality.
Basic
Distillation (Hinton et al., 2015) trains a student model on the teacher's soft outputs (probability distributions) rather than just the correct labels. The student learns not only what to predict but how confident to be, capturing more nuance than label-only training. Production AI uses distillation to ship smaller, faster variants: small tiers such as Claude Haiku, GPT-4o Mini, and Gemini Flash are widely understood to be distilled from their larger siblings (Claude Sonnet/Opus, GPT-4o, Gemini Pro).
Deep
Teacher-student distillation: the student minimizes KL divergence between its output distribution and the teacher's on a training set. Temperature softens the teacher distribution · higher temperatures surface information about non-top tokens. Distillation targets: logits (most common), hidden states (feature matching), attention maps (relation-matching), or generated sequences (sequence-level). For large-scale distillation of frontier models, sequence-level works best · train the student on teacher-generated outputs as if they were ground truth. Cost: producing a high-quality distillation corpus is expensive (you pay to run the teacher at scale), but once done, student training is cheap.
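The temperature mechanic above can be sketched in a few lines of plain Python. This is a minimal illustration, not a training loop: the 4-token vocabulary and all logit values are made up for the example, and the KL term is computed directly rather than via an autograd framework.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T flattens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical teacher logits over a toy 4-token vocabulary.
teacher_logits = [6.0, 2.5, 1.0, -1.0]

sharp = softmax(teacher_logits, T=1.0)  # top token dominates, rest near zero
soft = softmax(teacher_logits, T=5.0)   # non-top tokens get visible mass

# The student minimizes KL between the softened teacher and student outputs.
student_logits = [4.0, 3.0, 0.5, -0.5]
loss = kl(softmax(teacher_logits, T=5.0), softmax(student_logits, T=5.0))
```

Printing `sharp` vs `soft` shows why higher temperatures matter: at T=1 the runner-up token carries almost no probability, while at T=5 the teacher's relative preferences among all tokens become visible to the student.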
Expert
KL-divergence loss: L = T² · KL(softmax(z_t/T) || softmax(z_s/T)) where T is temperature and z_t, z_s are teacher/student logits. Large T (4-10) reveals dark knowledge · probability mass in non-top tokens. Modern pipelines combine distillation with SFT on high-quality human data: student trains on both teacher outputs and human-labeled examples. For MoE distillation, the student is often a dense model (e.g., DeepSeek R1-Distill variants) · experts don't transfer cleanly to smaller architectures. Quality recovery: a 10× smaller student typically matches 80-95% of teacher benchmark performance depending on task complexity.
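The combined objective described above (distillation plus SFT on labeled data) can be written as L = α·CE(y, z_s) + (1−α)·T²·KL(softmax(z_t/T) || softmax(z_s/T)). Here is a pure-Python sketch of that loss on made-up logits; `alpha`, the logit values, and the vocabulary size are illustrative assumptions, not values from any real pipeline.

```python
import math

def softmax(z, T=1.0):
    exps = [math.exp(v / T) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_loss(z_t, z_s, hard_label, T=4.0, alpha=0.5):
    """alpha * CE on the hard label + (1 - alpha) * T^2 * KL on soft targets.

    The T^2 factor compensates for soft-target gradients shrinking as 1/T^2,
    keeping the two terms on a comparable scale as T varies.
    """
    soft_term = T * T * kl(softmax(z_t, T), softmax(z_s, T))
    hard_term = -math.log(softmax(z_s, 1.0)[hard_label])  # cross-entropy
    return alpha * hard_term + (1 - alpha) * soft_term

# Hypothetical teacher/student logits over a toy 4-token vocabulary.
z_teacher = [6.0, 2.5, 1.0, -1.0]
z_student = [4.0, 3.0, 0.5, -0.5]
loss = distill_loss(z_teacher, z_student, hard_label=0)
```

A student whose logits exactly match the teacher's zeroes out the KL term, leaving only the cross-entropy on the human label, which is the sense in which the soft and hard objectives complement each other.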
Depending on why you're here
- Teacher-student on KL(soft(z_t/T) || soft(z_s/T))
- Sequence-level distillation > logit-level for large-scale
- Dark knowledge in soft distribution is what makes distillation > hard labels
- Use Haiku/Flash/Mini distilled variants when latency and cost matter
- Small distilled models hit 85-95% of frontier quality at 10% the price
- Distill your own models from frontier APIs · train cheaper specialists
- Distillation is the recipe behind the cheap-tier model boom
- Cost-per-quality improvements come from distillation + MoE + better training data
- Frontier lab revenue shifts to distilled variants as quality gaps close
- A big AI teaches a smaller AI everything it knows
- Smaller AI runs cheaper and faster, almost as good
- How AI got dramatically cheaper in 2024-2026
Distillation is why the cheap tier keeps getting smarter. Every frontier release now ships a distilled cousin within weeks.
Read the primary sources
- Hinton distillation paper (2015) · arxiv.org
- DeepSeek R1-Distill paper (2025) · arxiv.org