Scaling Laws
Empirical laws showing model quality improves predictably with more compute, parameters, and data · the foundation of modern AI scaling.
Basic
Kaplan (2020) and Chinchilla (2022) scaling laws showed that LLM loss decreases as a power law with compute, parameters, and data: double the compute and loss drops by a predictable amount. This predictability is what convinces labs to spend $100M+ on single training runs; the improvement is mathematically expected.
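How predictable? A minimal sketch, assuming a Kaplan-style compute exponent of roughly α ≈ 0.05 (the `loss_ratio` helper is hypothetical, for illustration only):

```python
# Power-law scaling: loss is proportional to compute^(-alpha).
# alpha ~= 0.05 is a rough Kaplan-style exponent for compute.
ALPHA = 0.05

def loss_ratio(compute_multiplier: float, alpha: float = ALPHA) -> float:
    """Fraction of the original loss remaining after scaling compute."""
    return compute_multiplier ** (-alpha)

# Doubling compute leaves ~96.6% of the loss (a ~3.4% drop);
# a 10x compute budget leaves ~89.1%.
print(round(loss_ratio(2), 4), round(loss_ratio(10), 4))
```

The small per-doubling gain is the point: a 3-4% loss reduction sounds modest, but because it is predictable, each extra order of magnitude of compute can be budgeted in advance.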
Deep
Kaplan scaling: loss ∝ compute^(-α), with α ≈ 0.05. Chinchilla refined this: for compute-optimal training, the ratio of parameters to tokens should be roughly 1:20. Chinchilla 70B trained on 1.4T tokens matched Gopher 280B trained on 300B tokens, confirming that most 2020-era models were under-trained. Post-Chinchilla, labs over-train aggressively (Llama 3 used 15T tokens on 70B params, a 214:1 token-to-parameter ratio) for better inference-time efficiency at the cost of training compute.
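The 20-tokens-per-parameter rule and the Llama 3 ratio above are simple arithmetic; a sketch (the `compute_optimal_tokens` helper is hypothetical):

```python
# Chinchilla rule of thumb: ~20 training tokens per parameter.
TOKENS_PER_PARAM = 20

def compute_optimal_tokens(n_params: float) -> float:
    """Compute-optimal training-token count for a given parameter count."""
    return TOKENS_PER_PARAM * n_params

# Chinchilla 70B: 20 * 70e9 = 1.4e12, i.e. the 1.4T tokens cited above.
print(compute_optimal_tokens(70e9) / 1e12)  # 1.4

# Llama 3 over-trains far past this: 15T tokens / 70B params ~ 214:1.
print(round(15e12 / 70e9))  # 214
```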
Expert
Kaplan's original paper: L(N, D) = (N_c/N)^{α_N} + (D_c/D)^{α_D}, where N is parameter count and D is training tokens. Chinchilla: optimal N ∝ C^0.5 and D ∝ C^0.5 for compute budget C. Emergent capabilities (Wei 2022) appear at specific scale thresholds and don't follow smooth scaling: chain-of-thought, instruction following, in-context learning. Post-training scaling (RLHF, SFT, reasoning RL) is a separate regime that may plateau differently. There is some evidence of scaling-law slowdown at frontier scale (2024-2025), though MoE and reasoning recipes suggest architecture shifts may continue the improvements.
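The parametric loss and the Chinchilla compute split can be sketched directly. The constants below are the approximate fitted values reported in Kaplan et al. (2020); treat them as illustrative, not as a fit for any modern model:

```python
# Kaplan-style parametric loss: L(N, D) = (N_c/N)^alpha_N + (D_c/D)^alpha_D.
# Approximate Kaplan (2020) fits -- illustrative values only.
N_C, ALPHA_N = 8.8e13, 0.076   # parameter-limited term
D_C, ALPHA_D = 5.4e13, 0.095   # data-limited term

def kaplan_loss(n_params: float, d_tokens: float) -> float:
    """Irreducible-loss-free Kaplan loss as a function of params and tokens."""
    return (N_C / n_params) ** ALPHA_N + (D_C / d_tokens) ** ALPHA_D

def chinchilla_allocation(c_multiplier: float) -> tuple[float, float]:
    """Chinchilla split: N and D each scale as C^0.5, so grow them together."""
    return c_multiplier ** 0.5, c_multiplier ** 0.5

# Both loss terms shrink as N and D grow; returns diminish once one
# term dominates, which is why the two axes are scaled in tandem.
print(kaplan_loss(7e10, 1.4e12) > kaplan_loss(7e11, 1.4e13))  # True

# 4x compute -> roughly 2x parameters and 2x tokens.
print(chinchilla_allocation(4))  # (2.0, 2.0)
```

Note the two regimes: holding D fixed while growing N (or vice versa) leaves one term floored, which is the quantitative sense in which Gopher-era models were under-trained.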
Scaling laws justify $10B+ capex on Stargate, Jupiter, Colossus. If laws break, the entire AI buildout thesis changes.
Depending on why you're here
- Kaplan 2020 and Chinchilla 2022 are the foundational papers
- Compute-optimal ratio: ~20 tokens per parameter (Chinchilla)
- Emergent capabilities appear at specific scale thresholds
- Scaling laws guide data collection budgets; the token-to-parameter ratio matters
- Fine-tuning rarely needs Chinchilla-optimal ratios; task data is king
- Over-trained small models (7B on 1T tokens) often beat bigger under-trained ones on narrow tasks
- Scaling laws are the thesis behind every training capex budget
- Law slowdown = thesis threat; watch the GPT-5 to GPT-6 quality delta carefully
- MoE + RL recipes may extend the laws even if dense scaling plateaus
- More compute + more data = smarter AI, in predictable amounts
- Why AI companies keep spending more each year
- If the pattern breaks, the AI boom thesis breaks
Scaling laws either hold or the AI bubble pops. Nothing in AI investing matters more than this single question.
Read the primary sources
- Kaplan scaling laws (2020), arxiv.org
- Chinchilla paper (2022), arxiv.org