Pretraining
The phase where a model learns from internet-scale text via next-token prediction · typically 15-30T tokens and millions of GPU hours.
Basic
Pretraining is step one of creating an LLM. The model reads trillions of tokens of web text, code, and books, predicting the next token at each step. This gives it a base understanding of language, facts, and code patterns. Pretraining is by far the most compute-intensive phase · $50-500M+ for frontier models. After pretraining, post-training (SFT, DPO, RLHF) shapes behavior.
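Next-token prediction concretely means minimizing cross-entropy loss: at each position the model scores every vocabulary token, and is penalized by how unlikely it rated the true next token. A minimal sketch with a toy four-token vocabulary (logit values are illustrative):

```python
import math

def next_token_loss(logits, target_id):
    """Cross-entropy for one prediction step: -log p(target | context),
    where p comes from a numerically stable softmax over the logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return -math.log(exps[target_id] / total)

# The model assigns raw scores (logits) over a 4-token vocabulary;
# the loss is low when the true next token got a high score.
logits = [2.0, 0.5, -1.0, 0.1]
loss = next_token_loss(logits, target_id=0)  # true next token is id 0
```

Summing this loss over trillions of positions, and following its gradient, is the entire pretraining objective.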
Deep
Data mix for 2026 pretraining: English web (~40%), code (~20%), books and scientific papers (~10%), multilingual (~20%), curated domain data (~10%). Quality filtering removes duplicate, toxic, or low-information content. Deduplication (MinHash, SimHash) is critical · duplicate data memorization is a concerning failure mode. Tokens processed: 15T for Llama 3.1, 15-30T+ for Llama 4 and GPT-5 class. Training compute measured in FLOPs: 10^25-10^26 for frontier class. FP8 pretraining has become standard since DeepSeek V3 (2024) demonstrated it at scale.
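MinHash deduplication works by hashing a document's shingles (overlapping n-grams) many times and keeping only each hash's minimum; near-duplicate documents share most of those minima, so comparing signatures approximates Jaccard similarity without comparing full texts. A self-contained sketch (parameter values are illustrative, not a production pipeline):

```python
import hashlib

def minhash_signature(text, num_hashes=64, shingle=5):
    """MinHash signature over character 5-gram shingles.
    Each slot is the minimum of one salted hash over all shingles."""
    shingles = {text[i:i + shingle] for i in range(len(text) - shingle + 1)}
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # distinct salt = distinct hash function
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "big")
            for s in shingles
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots estimates Jaccard similarity of shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

In a real pipeline the signatures feed into locality-sensitive hashing buckets so candidate duplicate pairs are found without all-pairs comparison.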
Expert
Pretraining hyperparameters: learning rate schedule (linear warmup, then cosine decay), batch size (millions of tokens), sequence length (8K to 128K+), optimizer (AdamW dominant). Data quality outweighs quantity past a threshold · the labs that pulled ahead in 2023-2025 invested heavily in data curation. Scaling curves show model quality improving predictably as a function of compute, with caveats around the Chinchilla compute-optimal parameter-to-token ratio. Training failures (loss spikes) are managed via gradient clipping, skipping problematic data batches, and restart-from-checkpoint protocols. A single frontier pretraining run can take 2-6 months on a 10,000+ GPU cluster.
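The warmup-then-cosine-decay schedule above can be written as a single function of the step count. A minimal sketch (the peak LR, warmup length, and decay floor are illustrative defaults, not any lab's actual settings):

```python
import math

def lr_at_step(step, max_steps, peak_lr=3e-4, warmup_steps=2000, min_lr_ratio=0.1):
    """Linear warmup to peak_lr over warmup_steps, then cosine decay
    down to min_lr_ratio * peak_lr at max_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # linear ramp from ~0
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 -> 0
    min_lr = peak_lr * min_lr_ratio
    return min_lr + (peak_lr - min_lr) * cosine
```

The warmup avoids destabilizing the randomly initialized model with a full-size learning rate, and the nonzero floor keeps the model learning through the final tokens of the run.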
Depending on why you're here
- Next-token prediction on 15-30T tokens
- Data quality > quantity past the threshold
- FP8 pretraining is the new default post-DeepSeek V3
- You do not pretrain · that is frontier lab work
- Fine-tune from a pretrained base for 10-1000× less cost
- Base model choice determines your ceiling · pick carefully
- Pretraining costs follow FLOPs curves · $100M → $500M → $1B per generation
- Efficiency gains (MoE, FP8) slow the cost growth but don't reverse it
- Frontier pretraining is a 3-5 lab oligopoly
- The expensive step where AI reads all the internet
- Costs hundreds of millions of dollars
- Only a handful of companies can afford to do it
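The FLOPs figures quoted earlier follow from the widely used approximation that training compute is about 6 FLOPs per parameter per token (forward plus backward pass). A quick sanity check (the model size and token count are illustrative, not any specific lab's numbers):

```python
def train_flops(params, tokens):
    """Standard dense-transformer approximation:
    total training FLOPs ~= 6 * parameters * tokens."""
    return 6 * params * tokens

# Illustrative: a 400B-parameter dense model trained on 15T tokens
flops = train_flops(400e9, 15e12)  # lands in the 10^25-10^26 frontier band
```

Multiplying that FLOP count by the cost per GPU-hour at realized utilization is what produces the hundred-million-dollar price tags above.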
Pretraining is the moat. The labs that can afford $500M training runs are the ones setting frontier quality for the rest of the decade.