Scaling Laws
Empirical laws showing model quality improves predictably with more compute, parameters, and data · the foundation of modern AI scaling.
Basic
Kaplan (2020) and Chinchilla (2022) scaling laws showed that LLM loss decreases as a power law with compute, parameters, and data: double the compute and loss drops by a predictable amount. This predictability is what convinces labs to spend $100M+ on single training runs; the improvement is mathematically expected.
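How predictable? A minimal sketch, assuming a Kaplan-style compute exponent of roughly α ≈ 0.05 (the `loss_ratio` helper is hypothetical, for illustration only):

```python
# Power-law scaling: loss is proportional to compute^(-alpha).
# alpha ~= 0.05 is a rough Kaplan-style exponent for compute.
ALPHA = 0.05

def loss_ratio(compute_multiplier: float, alpha: float = ALPHA) -> float:
    """Fraction of the original loss remaining after scaling compute."""
    return compute_multiplier ** (-alpha)

# Doubling compute leaves ~96.6% of the loss (a ~3.4% drop);
# a 10x compute budget leaves ~89.1%.
print(round(loss_ratio(2), 4), round(loss_ratio(10), 4))
```

The small per-doubling gain is the point: a 3-4% loss reduction sounds modest, but because it is predictable, each extra order of magnitude of compute can be budgeted in advance.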
Deep
Kaplan scaling: loss ∝ compute^(-α), with α ≈ 0.05. Chinchilla refined this: for compute-optimal training, the ratio of parameters to tokens should be roughly 1:20. Chinchilla 70B trained on 1.4T tokens matched Gopher 280B trained on 300B tokens, confirming that most 2020-era models were under-trained. Post-Chinchilla, labs over-train aggressively (Llama 3 used 15T tokens on 70B params, a 214:1 token-to-parameter ratio) for better inference-time efficiency at the cost of training compute.
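The 20-tokens-per-parameter rule and the Llama 3 ratio above are simple arithmetic; a sketch (the `compute_optimal_tokens` helper is hypothetical):

```python
# Chinchilla rule of thumb: ~20 training tokens per parameter.
TOKENS_PER_PARAM = 20

def compute_optimal_tokens(n_params: float) -> float:
    """Compute-optimal training-token count for a given parameter count."""
    return TOKENS_PER_PARAM * n_params

# Chinchilla 70B: 20 * 70e9 = 1.4e12, i.e. the 1.4T tokens cited above.
print(compute_optimal_tokens(70e9) / 1e12)  # 1.4

# Llama 3 over-trains far past this: 15T tokens / 70B params ~ 214:1.
print(round(15e12 / 70e9))  # 214
```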
Expert
Kaplan's original paper: L(N, D) = (N_c/N)^{α_N} + (D_c/D)^{α_D}, where N is parameter count and D is training tokens. Chinchilla: optimal N ∝ C^0.5 and D ∝ C^0.5 for compute budget C. Emergent capabilities (Wei 2022) appear at specific scale thresholds and don't follow smooth scaling: chain-of-thought, instruction following, in-context learning. Post-training scaling (RLHF, SFT, reasoning RL) is a separate regime that may plateau differently. There is some evidence of scaling-law slowdown at frontier scale (2024-2025), though MoE and reasoning recipes suggest architecture shifts may continue the improvements.
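The parametric loss and the Chinchilla compute split can be sketched directly. The constants below are the approximate fitted values reported in Kaplan et al. (2020); treat them as illustrative, not as a fit for any modern model:

```python
# Kaplan-style parametric loss: L(N, D) = (N_c/N)^alpha_N + (D_c/D)^alpha_D.
# Approximate Kaplan (2020) fits -- illustrative values only.
N_C, ALPHA_N = 8.8e13, 0.076   # parameter-limited term
D_C, ALPHA_D = 5.4e13, 0.095   # data-limited term

def kaplan_loss(n_params: float, d_tokens: float) -> float:
    """Irreducible-loss-free Kaplan loss as a function of params and tokens."""
    return (N_C / n_params) ** ALPHA_N + (D_C / d_tokens) ** ALPHA_D

def chinchilla_allocation(c_multiplier: float) -> tuple[float, float]:
    """Chinchilla split: N and D each scale as C^0.5, so grow them together."""
    return c_multiplier ** 0.5, c_multiplier ** 0.5

# Both loss terms shrink as N and D grow; returns diminish once one
# term dominates, which is why the two axes are scaled in tandem.
print(kaplan_loss(7e10, 1.4e12) > kaplan_loss(7e11, 1.4e13))  # True

# 4x compute -> roughly 2x parameters and 2x tokens.
print(chinchilla_allocation(4))  # (2.0, 2.0)
```

Note the two regimes: holding D fixed while growing N (or vice versa) leaves one term floored, which is the quantitative sense in which Gopher-era models were under-trained.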
Scaling laws justify $10B+ capex on Stargate, Jupiter, Colossus. If laws break, the entire AI buildout thesis changes.
Depending on why you're here
- Kaplan 2020 and Chinchilla 2022 are the foundational papers
- Compute-optimal ratio: ~20 tokens per parameter (Chinchilla)
- Emergent capabilities appear at specific scale thresholds
- Scaling laws guide data collection budgets; the token-to-parameter ratio matters
- Fine-tuning rarely needs Chinchilla-optimal ratios; task data is king
- Over-trained small models (7B on 1T tokens) often beat bigger under-trained ones on narrow tasks
- Scaling laws are the thesis behind every training capex budget
- Law slowdown = thesis threat; watch the GPT-5 to GPT-6 quality delta carefully
- MoE + RL recipes may extend the laws even if dense scaling plateaus
- More compute + more data = smarter AI, in predictable amounts
- Why AI companies keep spending more each year
- If the pattern breaks, the AI boom thesis breaks
Scaling laws either hold or the AI bubble pops. Nothing in AI investing matters more than this single question.
Read the primary sources
- Kaplan scaling laws (2020), arxiv.org
- Chinchilla paper (2022), arxiv.org