Transformer
The 2017 "Attention Is All You Need" architecture · parallelizable, scalable, and the foundation of every modern LLM.
The 2017 "Attention Is All You Need" architecture · parallelizable, scalable, and the foundation of every modern LLM.
Basic
Transformers replaced recurrent networks (RNNs, LSTMs) by processing all tokens in parallel instead of sequentially. The core idea is self-attention: every token computes relationships with every other token, then aggregates information weighted by those relationships. Stacking many such layers produces a model that can capture long-range dependencies and train efficiently on modern GPUs.
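To make that concrete, here is a minimal single-head self-attention sketch in NumPy; the weight matrices, shapes, and names are illustrative assumptions, not taken from any particular model or library.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) projection matrices (illustrative)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv           # queries, keys, values for every token
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # every token scored against every other token
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # softmax: scores become mixing weights
    return weights @ V                         # each output is a weighted mix of value vectors

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))               # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = [rng.normal(size=(8, 8)) for _ in range(3)]
print(self_attention(tokens, Wq, Wk, Wv).shape)  # -> (4, 8)
```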
Deep
A transformer block has: multi-head self-attention, layer normalization, a feedforward MLP, and residual connections. "Attention" computes query, key, and value projections of each token, takes dot products of queries with all keys, softmaxes those scores into weights, and outputs a weighted sum of values. Multi-head attention runs several of these in parallel, each in a different representational subspace, capturing different relational patterns. Positional embeddings (absolute, RoPE, ALiBi) encode token position, since attention itself is order-agnostic. Decoder-only transformers (GPT family, Llama, Claude) are the dominant architecture for generative LLMs. MoE transformers replace the feedforward MLP with a bank of experts plus a router (see the MoE entry).
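As a rough sketch of how those pieces fit together, the pre-norm decoder block below (NumPy, with a causal mask so each token attends only to earlier positions) illustrates the structure described above; the parameter names and dimensions are assumptions for illustration, not any model's actual implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    seq, d_model = x.shape
    d_head = d_model // n_heads
    def project(W):                                       # (seq, d_model) -> (heads, seq, d_head)
        return (x @ W).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = project(Wq), project(Wk), project(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq) attention scores
    causal = np.triu(np.full((seq, seq), -np.inf), k=1)   # block attention to future tokens
    heads = softmax(scores + causal) @ V                  # weighted sums of values, per head
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo                                    # mix the heads back together

def transformer_block(x, p):
    # Pre-norm residual structure: x + Attn(LN(x)), then x + MLP(LN(x)).
    x = x + multi_head_attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"], p["Wo"], p["n_heads"])
    h = layer_norm(x)
    return x + np.maximum(h @ p["W1"], 0.0) @ p["W2"]     # two-layer MLP with ReLU

rng = np.random.default_rng(0)
d, seq, heads = 16, 6, 4
p = {"Wq": rng.normal(size=(d, d)), "Wk": rng.normal(size=(d, d)),
     "Wv": rng.normal(size=(d, d)), "Wo": rng.normal(size=(d, d)),
     "W1": rng.normal(size=(d, 4 * d)), "W2": rng.normal(size=(4 * d, d)),
     "n_heads": heads}
print(transformer_block(rng.normal(size=(seq, d)), p).shape)  # -> (6, 16)
```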
Expert
Attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Multi-head: h parallel attention heads with output = Concat(head_1, ..., head_h) W^O, enabling different representational subspaces. RoPE (Rotary Position Embedding) rotates query and key vectors by position-dependent angles, preserving relative position in dot products. ALiBi adds a linear bias to attention scores based on position distance. FlashAttention reformulates attention to use tiling + online softmax, reducing memory from O(n²) to O(n). Grouped-Query Attention (GQA) shares K/V across multiple query heads, reducing KV cache by 4-8×. Multi-Query Attention (MQA) takes this to the limit with a single K/V pair. Mixture-of-Experts scales the FFN component by routing each token to a subset of expert MLPs.
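As one example of the positional machinery, here is a minimal RoPE sketch in NumPy. The 10000^(-2i/d) frequency schedule and the half-split pairing of dimensions are common conventions, but treat the details as assumptions rather than a reference implementation.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate (seq_len, d) query or key vectors by position-dependent angles; d must be even."""
    seq, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # one rotation frequency per dimension pair
    angles = np.outer(np.arange(seq), freqs)    # (seq, d/2): angle grows with position
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]           # pair dimension i with dimension i + d/2
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

# Apply rope() to Q and K before taking QK^T; the resulting dot products then
# depend only on the relative distance between query and key positions.
```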
Depending on why you're here
- Multi-head self-attention + MLP + residuals + LayerNorm
- Decoder-only is dominant for generative LLMs
- FlashAttention, GQA, and MoE are the modern efficiency extensions
- You don't need to understand transformer internals to use an LLM API
- Knowing the KV cache exists helps you understand per-token pricing and context-window cost (see the sizing sketch after this list)
- Architecture-specific quirks (attention sinks, long-context quality) affect prompt engineering
- Transformer ubiquity = NVIDIA moat + compute concentration
- A successor architecture (state-space models, linear attention) could disrupt if it matches transformer scaling
- Watch Mamba, Hyena, and similar models as leading indicators of a post-transformer shift
- The 2017 breakthrough that led to ChatGPT
- All modern large language models are built from these blocks
- "Attention" just means the AI can focus on relevant parts of what you said
The transformer is 8 years old and shows no signs of being replaced. Every proposed successor so far adds complexity without killing the scaling-law champion.
Read the primary sources
- Attention Is All You Need (2017) · arxiv.org
- FlashAttention (2022) · arxiv.org