
Scaling Laws

Empirical laws showing model quality improves predictably with more compute, parameters, and data · the foundation of modern AI scaling.


Level 1

Kaplan (2020) and Chinchilla (2022) scaling laws showed that LLM loss decreases as a power law with compute, parameters, and data. Double the compute and loss drops by a predictable amount. This prediction is what convinces labs to spend $100M+ on single training runs · the improvement is mathematically expected.
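The "predictable amount" can be made concrete with a toy calculation. This is a sketch only: it assumes a pure power law in compute with the α ≈ 0.05 exponent quoted in Level 2, not the papers' full fitted forms.

```python
# Sketch: a Kaplan-style power law predicts relative loss from compute.
# alpha ~ 0.05 is the illustrative compute exponent quoted in the text.
alpha = 0.05

def relative_loss(compute_multiple: float) -> float:
    """Loss relative to baseline after scaling compute by `compute_multiple`."""
    return compute_multiple ** -alpha

# Doubling compute shaves roughly 3.4% off the loss:
print(f"{1 - relative_loss(2):.3f}")  # → 0.034
```

Small per-doubling gains compound: the point of the law is that the gain is the *same* predictable fraction at 10x the budget, which is what makes $100M runs plannable.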

Level 2

Kaplan scaling: loss ∝ compute^(-α) where α ≈ 0.05. Chinchilla refined this: for compute-optimal training, the ratio of parameters to tokens should be roughly 1:20. Chinchilla 70B trained on 1.4T tokens matched Gopher 280B trained on 300B tokens · confirming that most 2020-era models were under-trained. Post-Chinchilla, labs over-train aggressively (Llama 3 used 15T tokens on 70B params = 214:1 ratio) for better inference-time efficiency at the cost of training compute.
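The compute-optimal split above can be sketched in a few lines. Two assumptions are baked in, neither stated exactly in the text: the standard rule-of-thumb that training FLOPs C ≈ 6·N·D, and the Chinchilla ~20 tokens-per-parameter ratio.

```python
# Sketch: split a FLOPs budget compute-optimally, assuming C ~ 6*N*D
# (standard training-FLOPs rule of thumb) and D/N ~ 20 (Chinchilla).
import math

def chinchilla_optimal(flops: float, tokens_per_param: float = 20.0):
    """Return (params N, tokens D) such that 6*N*D ~ flops and D ~ 20*N."""
    n = math.sqrt(flops / (6 * tokens_per_param))
    return n, tokens_per_param * n

# Chinchilla's own budget (~5.8e23 FLOPs) recovers ~70B params, ~1.4T tokens:
n, d = chinchilla_optimal(5.76e23)
print(f"N ≈ {n / 1e9:.0f}B params, D ≈ {d / 1e12:.2f}T tokens")
```

Plugging in Llama 3's 70B params at 15T tokens shows how far post-Chinchilla labs deliberately deviate from this optimum to buy inference efficiency.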

Level 3

Kaplan's combined law: L(N, D) = [(N_c/N)^{α_N/α_D} + D_c/D]^{α_D}, where N is parameters and D is data tokens. Chinchilla's parametric refit: L(N, D) = E + A/N^α + B/D^β, with compute-optimal N ∝ C^0.5 and D ∝ C^0.5 for compute budget C. Emergent capabilities (Wei 2022) appear at specific scale thresholds and don't follow smooth scaling · chain-of-thought, instruction following, in-context learning. Post-training scaling (RLHF, SFT, reasoning RL) is a separate regime that may plateau differently. There is some evidence of scaling-law slowdown at frontier scale (2024-2025), though MoE and reasoning recipes suggest architecture shifts may sustain improvements.
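The Chinchilla parametric loss can be evaluated directly. The constants below are the fits reported by Hoffmann et al. (2022); treat them as illustrative rather than exact, since later reanalyses have revised them.

```python
# Sketch: Chinchilla's parametric loss fit L(N, D) = E + A/N^alpha + B/D^beta.
# Constants are the published Hoffmann et al. (2022) fits; illustrative only.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params: float, d_tokens: float) -> float:
    """Predicted training loss for N parameters trained on D tokens."""
    return E + A / n_params ** ALPHA + B / d_tokens ** BETA

# Chinchilla (70B params, 1.4T tokens) vs Gopher (280B params, 300B tokens):
# the fit predicts the smaller, longer-trained model wins.
print(loss(70e9, 1.4e12), loss(280e9, 300e9))
```

Note how the data term B/D^β dominates Gopher's predicted loss: this is the under-training the Chinchilla paper diagnosed in 2020-era models.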

Why this matters now

Scaling laws justify $10B+ capex on Stargate, Jupiter, Colossus. If laws break, the entire AI buildout thesis changes.

The takeaway for you
If you are a
Researcher
  • Kaplan 2020 and Chinchilla 2022 are the foundational papers
  • Compute-optimal ratio: ~20 tokens per parameter (Chinchilla)
  • Emergent capabilities appear at specific scale thresholds
If you are a
Builder
  • Scaling laws guide data collection budgets · token-to-param ratio matters
  • Fine-tuning rarely needs Chinchilla-optimal · task data is king
  • Under-trained small models (7B on 1T tokens) often beat over-trained big ones on narrow tasks
If you are a
Investor
  • Scaling laws are the thesis behind every training capex budget
  • Law slowdown = thesis threat · watch GPT-5 to GPT-6 quality delta carefully
  • MoE + RL recipes may extend laws even if dense scaling plateaus
If you are a
Curious · Normie
  • More compute + more data = smarter AI, in predictable amounts
  • This is why AI companies keep spending more each year
  • If the pattern breaks, the AI boom thesis breaks
Gecko's take

Scaling laws either hold or the AI bubble pops. Nothing in AI investing matters more than this single question.

Canonical sources

Kaplan et al. at OpenAI (2020) for the original formulation; Hoffmann et al. at DeepMind (Chinchilla, 2022) for the refined compute-optimal ratios.