
Pretraining

The phase where a model learns from internet-scale text via next-token prediction · typically 15-30T tokens and millions of GPU hours.


Level 1

Pretraining is step one of creating an LLM. The model reads trillions of tokens of web text, code, and books, predicting the next token at each step. This gives it a base understanding of language, facts, and code patterns. Pretraining is by far the most compute-intensive phase · $50-500M+ for frontier models. After pretraining, post-training (SFT, DPO, RLHF) shapes behavior.
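"Predicting the next token" means minimizing the negative log-likelihood of each actual next token. A minimal sketch in plain Python (toy two-token vocabulary and hand-picked probabilities, not a real model):

```python
import math

# Toy next-token prediction loss: at each position the model assigns a
# probability to every candidate token; the training loss is the negative
# log-likelihood of the token that actually came next, averaged over positions.
def next_token_loss(probs_per_step, targets):
    """probs_per_step: list of dicts mapping token -> predicted probability.
    targets: the actual next token at each step."""
    nll = [-math.log(p[t]) for p, t in zip(probs_per_step, targets)]
    return sum(nll) / len(nll)

# "the cat sat" -> predict "cat" after "the", then "sat" after "the cat"
probs = [{"cat": 0.5, "dog": 0.5}, {"sat": 0.25, "ran": 0.75}]
loss = next_token_loss(probs, ["cat", "sat"])  # mean of -ln(0.5) and -ln(0.25)
```

A real run computes this same cross-entropy over trillions of tokens with a vocabulary of ~100K entries; only the scale changes.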

Level 2

Data mix for 2026 pretraining: English web (~40%), code (~20%), books and scientific papers (~10%), multilingual (~20%), curated domain data (~10%). Quality filtering removes duplicate, toxic, or low-information content. Deduplication (MinHash, SimHash) is critical · duplicate data memorization is a concerning failure mode. Tokens processed: 15T for Llama 3.1, 15-30T+ for Llama 4 and GPT-5 class. Training compute measured in FLOPs: 10^25-10^26 for frontier class. FP8 pretraining has become standard since DeepSeek V3 (2024) demonstrated it at scale.
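MinHash deduplication, mentioned above, estimates Jaccard similarity between documents without comparing them pairwise in full. A simplified sketch (character shingles and per-seed MD5 hashing are illustrative choices; production pipelines use word n-grams and much faster hash families):

```python
import hashlib

def shingles(text, k=5):
    # Character k-grams; real pipelines typically use word n-grams.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(text, num_hashes=64):
    # One slot per seed: the minimum hash value over all shingles.
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text))
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    # Fraction of matching slots approximates true Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature("the quick brown fox jumps over the lazy dog")
b = minhash_signature("the quick brown fox jumped over the lazy dog")
c = minhash_signature("completely unrelated text about pretraining data")
# Near-duplicates score far higher than unrelated documents.
assert estimated_jaccard(a, b) > estimated_jaccard(a, c)
```

Documents whose signatures agree above a threshold are flagged as near-duplicates and dropped, which is what prevents the memorization failure mode described above.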

Level 3

Pretraining hyperparameters: learning rate schedule (cosine decay with warmup), batch size (millions of tokens), sequence length (8K to 128K+), optimizer (AdamW dominant). Data quality outweighs quantity past a threshold · the 2023-2025 winning labs invested heavily in data curation. Scaling curves show model quality improving predictably as a function of compute, with caveats about the Chinchilla compute-optimal token-to-parameter ratio. Training failures (loss spikes) are managed via gradient clipping, skipping bad data batches, and restart-from-checkpoint protocols. A single frontier pretraining run can take 2-6 months on a 10,000+ GPU cluster.
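The warmup-plus-cosine-decay schedule above can be sketched in a few lines. The specific values (peak LR, warmup length, total steps) are illustrative assumptions, not numbers from any particular run:

```python
import math

def lr_at_step(step, max_lr=3e-4, min_lr=3e-5, warmup=2_000, total=100_000):
    # Linear warmup from 0 to max_lr over the first `warmup` steps,
    # then cosine decay from max_lr down to min_lr by step `total`.
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

lr_at_step(0)        # 0.0 (start of warmup)
lr_at_step(2_000)    # 3e-4 (peak, end of warmup)
lr_at_step(100_000)  # 3e-5 (floor, end of decay)
```

Warmup avoids loss spikes from large early updates; the cosine tail lets the model settle into a lower-loss region before training stops.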

The takeaway for you
If you are a Researcher
  • Next-token prediction on 15-30T tokens
  • Data quality > quantity past the threshold
  • FP8 pretraining is the new default post-DeepSeek V3
If you are a Builder
  • You do not pretrain · that is frontier lab work
  • Fine-tune from a pretrained base for 10-1000× less cost
  • Base model choice determines your ceiling · pick carefully
If you are an Investor
  • Pretraining costs follow FLOPs curves · $100M → $500M → $1B per generation
  • Efficiency gains (MoE, FP8) slow cost growth but don't reverse it
  • Frontier pretraining is a 3-5 lab oligopoly
If you are a Curious Normie
  • The expensive step where AI reads the entire internet
  • Costs hundreds of millions of dollars
  • Only a handful of companies can afford to do it
Gecko's take

Pretraining is the moat. The labs that can afford $500M training runs are the ones setting frontier quality for the rest of the decade.

GPT-4 class: $50-100M estimated. GPT-5 class: $250-500M+. Costs are dominated by GPU rental ($2-3/hour per H100 × thousands of GPUs × months).
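The rental math is easy to check on the back of an envelope. The hourly rate comes from the range above; the cluster size and run length are assumed within the "thousands of GPUs" and "2-6 months" figures quoted earlier:

```python
# Back-of-envelope GPU rental cost for one pretraining run.
hourly_rate = 2.5    # $/hour per H100, midpoint of the $2-3 range above
gpu_count = 16_000   # assumed cluster size ("thousands" of GPUs)
months = 4           # assumed run length, within the 2-6 month range

hours = months * 30 * 24            # ~2,880 hours
cost = hourly_rate * gpu_count * hours
print(f"${cost:,.0f}")              # $115,200,000 for this scenario
```

Even this mid-sized scenario lands around $115M before counting data acquisition, storage, failed runs, and staff, which is why the oligopoly framing above holds.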