Transformer
The 2017 "Attention Is All You Need" architecture · parallelizable, scalable, and the foundation of every modern LLM.
The 2017 "Attention Is All You Need" architecture · parallelizable, scalable, and the foundation of every modern LLM.
Basic
Transformers replaced recurrent networks (RNNs, LSTMs) by processing all tokens in parallel instead of sequentially. The core idea is self-attention: every token computes relationships with every other token, then aggregates information weighted by those relationships. Stacking many such layers produces a model that can capture long-range dependencies and train efficiently on modern GPUs.
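To make that concrete, here is a minimal single-head self-attention sketch in NumPy; the weight matrices, shapes, and names are illustrative assumptions, not taken from any particular model or library.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) projection matrices (illustrative)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv           # queries, keys, values for every token
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # every token scored against every other token
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # softmax: scores become mixing weights
    return weights @ V                         # each output is a weighted mix of value vectors

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))               # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = [rng.normal(size=(8, 8)) for _ in range(3)]
print(self_attention(tokens, Wq, Wk, Wv).shape)  # -> (4, 8)
```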
Deep
A transformer block has: multi-head self-attention, layer normalization, a feedforward MLP, and residual connections. "Attention" computes query, key, and value projections of each token, takes dot products of queries with all keys, softmaxes those scores into weights, and outputs a weighted sum of values. Multi-head attention runs several of these in parallel, each in a different representational subspace, capturing different relational patterns. Positional embeddings (absolute, RoPE, ALiBi) encode token position, since attention itself is order-agnostic. Decoder-only transformers (GPT family, Llama, Claude) are the dominant architecture for generative LLMs. MoE transformers replace the feedforward MLP with a bank of experts plus a router (see the MoE entry).
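As a rough sketch of how those pieces fit together, the pre-norm decoder block below (NumPy, with a causal mask so each token attends only to earlier positions) illustrates the structure described above; the parameter names and dimensions are assumptions for illustration, not any model's actual implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    seq, d_model = x.shape
    d_head = d_model // n_heads
    def project(W):                                       # (seq, d_model) -> (heads, seq, d_head)
        return (x @ W).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = project(Wq), project(Wk), project(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq) attention scores
    causal = np.triu(np.full((seq, seq), -np.inf), k=1)   # block attention to future tokens
    heads = softmax(scores + causal) @ V                  # weighted sums of values, per head
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo                                    # mix the heads back together

def transformer_block(x, p):
    # Pre-norm residual structure: x + Attn(LN(x)), then x + MLP(LN(x)).
    x = x + multi_head_attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"], p["Wo"], p["n_heads"])
    h = layer_norm(x)
    return x + np.maximum(h @ p["W1"], 0.0) @ p["W2"]     # two-layer MLP with ReLU

rng = np.random.default_rng(0)
d, seq, heads = 16, 6, 4
p = {"Wq": rng.normal(size=(d, d)), "Wk": rng.normal(size=(d, d)),
     "Wv": rng.normal(size=(d, d)), "Wo": rng.normal(size=(d, d)),
     "W1": rng.normal(size=(d, 4 * d)), "W2": rng.normal(size=(4 * d, d)),
     "n_heads": heads}
print(transformer_block(rng.normal(size=(seq, d)), p).shape)  # -> (6, 16)
```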
Expert
Attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Multi-head: h parallel attention heads with output = Concat(head_1, ..., head_h) W^O, enabling different representational subspaces. RoPE (Rotary Position Embedding) rotates query and key vectors by position-dependent angles, preserving relative position in dot products. ALiBi adds a linear bias to attention scores based on position distance. FlashAttention reformulates attention to use tiling + online softmax, reducing memory from O(n²) to O(n). Grouped-Query Attention (GQA) shares K/V across multiple query heads, reducing KV cache by 4-8×. Multi-Query Attention (MQA) takes this to the limit with a single K/V pair. Mixture-of-Experts scales the FFN component by routing each token to a subset of expert MLPs.
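As one example of the positional machinery, here is a minimal RoPE sketch in NumPy. The 10000^(-2i/d) frequency schedule and the half-split pairing of dimensions are common conventions, but treat the details as assumptions rather than a reference implementation.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate (seq_len, d) query or key vectors by position-dependent angles; d must be even."""
    seq, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # one rotation frequency per dimension pair
    angles = np.outer(np.arange(seq), freqs)    # (seq, d/2): angle grows with position
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]           # pair dimension i with dimension i + d/2
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

# Apply rope() to Q and K before taking QK^T; the resulting dot products then
# depend only on the relative distance between query and key positions.
```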
Depending on why you're here
- Multi-head self-attention + MLP + residuals + LayerNorm
- Decoder-only is dominant for generative LLMs
- FlashAttention, GQA, and MoE are the modern efficiency extensions
- You don't need to understand transformer internals to use an LLM API
- Knowing the KV cache exists helps you understand per-token pricing and context-window cost (see the sizing sketch after this list)
- Architecture-specific quirks (attention sinks, long-context quality) affect prompt engineering
- Transformer ubiquity = NVIDIA moat + compute concentration
- A successor architecture (state-space models, linear attention) could disrupt if it matches transformer scaling
- Watch Mamba, Hyena, and similar models as leading indicators of a post-transformer shift
- The 2017 breakthrough that led to ChatGPT
- All modern large language models are built from these blocks
- "Attention" just means the AI can focus on relevant parts of what you said
The transformer is 8 years old and shows no signs of being replaced. Every proposed successor so far adds complexity without killing the scaling-law champion.
Read the primary sources
- Attention Is All You Need (2017) · arxiv.org
- FlashAttention (2022) · arxiv.org