
Pretraining

The phase where a model learns from internet-scale text via next-token prediction · typically 15-30T tokens and millions of GPU hours.


Level 1

Pretraining is step one of creating an LLM. The model reads trillions of tokens of web text, code, and books, predicting the next token at each step. This gives it a base understanding of language, facts, and code patterns. Pretraining is by far the most compute-intensive phase · $50-500M+ for frontier models. After pretraining, post-training (SFT, DPO, RLHF) shapes behavior.
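"Predicting the next token" means minimizing the negative log-likelihood of each actual next token. A minimal sketch in plain Python (toy two-token vocabulary and hand-picked probabilities, not a real model):

```python
import math

# Toy next-token prediction loss: at each position the model assigns a
# probability to every candidate token; the training loss is the negative
# log-likelihood of the token that actually came next, averaged over positions.
def next_token_loss(probs_per_step, targets):
    """probs_per_step: list of dicts mapping token -> predicted probability.
    targets: the actual next token at each step."""
    nll = [-math.log(p[t]) for p, t in zip(probs_per_step, targets)]
    return sum(nll) / len(nll)

# "the cat sat" -> predict "cat" after "the", then "sat" after "the cat"
probs = [{"cat": 0.5, "dog": 0.5}, {"sat": 0.25, "ran": 0.75}]
loss = next_token_loss(probs, ["cat", "sat"])  # mean of -ln(0.5) and -ln(0.25)
```

A real run computes this same cross-entropy over trillions of tokens with a vocabulary of ~100K entries; only the scale changes.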

Level 2

Data mix for 2026 pretraining: English web (~40%), code (~20%), books and scientific papers (~10%), multilingual (~20%), curated domain data (~10%). Quality filtering removes duplicate, toxic, or low-information content. Deduplication (MinHash, SimHash) is critical · duplicate data memorization is a concerning failure mode. Tokens processed: 15T for Llama 3.1, 15-30T+ for Llama 4 and GPT-5 class. Training compute measured in FLOPs: 10^25-10^26 for frontier class. FP8 pretraining has become standard since DeepSeek V3 (2024) demonstrated it at scale.
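MinHash deduplication, mentioned above, estimates Jaccard similarity between documents without comparing them pairwise in full. A simplified sketch (character shingles and per-seed MD5 hashing are illustrative choices; production pipelines use word n-grams and much faster hash families):

```python
import hashlib

def shingles(text, k=5):
    # Character k-grams; real pipelines typically use word n-grams.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(text, num_hashes=64):
    # One slot per seed: the minimum hash value over all shingles.
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text))
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    # Fraction of matching slots approximates true Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature("the quick brown fox jumps over the lazy dog")
b = minhash_signature("the quick brown fox jumped over the lazy dog")
c = minhash_signature("completely unrelated text about pretraining data")
# Near-duplicates score far higher than unrelated documents.
assert estimated_jaccard(a, b) > estimated_jaccard(a, c)
```

Documents whose signatures agree above a threshold are flagged as near-duplicates and dropped, which is what prevents the memorization failure mode described above.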

Level 3

Pretraining hyperparameters: learning rate schedule (cosine decay with warmup), batch size (millions of tokens), sequence length (8K to 128K+), optimizer (AdamW dominant). Data quality outweighs quantity past a threshold · the 2023-2025 winning labs invested heavily in data curation. Scaling curves show model quality improving predictably as a function of compute, with caveats about the Chinchilla compute-optimal token-to-parameter ratio. Training failures (loss spikes) are managed via gradient clipping, skipping bad data batches, and restart-from-checkpoint protocols. A single frontier pretraining run can take 2-6 months on a 10,000+ GPU cluster.
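The warmup-plus-cosine-decay schedule above can be sketched in a few lines. The specific values (peak LR, warmup length, total steps) are illustrative assumptions, not numbers from any particular run:

```python
import math

def lr_at_step(step, max_lr=3e-4, min_lr=3e-5, warmup=2_000, total=100_000):
    # Linear warmup from 0 to max_lr over the first `warmup` steps,
    # then cosine decay from max_lr down to min_lr by step `total`.
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

lr_at_step(0)        # 0.0 (start of warmup)
lr_at_step(2_000)    # 3e-4 (peak, end of warmup)
lr_at_step(100_000)  # 3e-5 (floor, end of decay)
```

Warmup avoids loss spikes from large early updates; the cosine tail lets the model settle into a lower-loss region before training stops.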

The takeaway for you
If you are a Researcher
  • Next-token prediction on 15-30T tokens
  • Data quality > quantity past the threshold
  • FP8 pretraining is the new default post-DeepSeek V3
If you are a Builder
  • You do not pretrain · that is frontier lab work
  • Fine-tune from a pretrained base for 10-1000× less cost
  • Base model choice determines your ceiling · pick carefully
If you are an Investor
  • Pretraining costs follow FLOPs curves · $100M → $500M → $1B per generation
  • Efficiency gains (MoE, FP8) slow cost growth but don't reverse it
  • Frontier pretraining is a 3-5 lab oligopoly
If you are a Curious Normie
  • The expensive step where AI reads the entire internet
  • Costs hundreds of millions of dollars
  • Only a handful of companies can afford to do it
Gecko's take

Pretraining is the moat. The labs that can afford $500M training runs are the ones setting frontier quality for the rest of the decade.

GPT-4 class: $50-100M estimated. GPT-5 class: $250-500M+. Costs are dominated by GPU rental ($2-3/hour per H100 × thousands of GPUs × months).
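The rental math is easy to check on the back of an envelope. The hourly rate comes from the range above; the cluster size and run length are assumed within the "thousands of GPUs" and "2-6 months" figures quoted earlier:

```python
# Back-of-envelope GPU rental cost for one pretraining run.
hourly_rate = 2.5    # $/hour per H100, midpoint of the $2-3 range above
gpu_count = 16_000   # assumed cluster size ("thousands" of GPUs)
months = 4           # assumed run length, within the 2-6 month range

hours = months * 30 * 24            # ~2,880 hours
cost = hourly_rate * gpu_count * hours
print(f"${cost:,.0f}")              # $115,200,000 for this scenario
```

Even this mid-sized scenario lands around $115M before counting data acquisition, storage, failed runs, and staff, which is why the oligopoly framing above holds.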