AI Alignment
The research discipline focused on making AI systems do what humans actually want, not just what they're told.
Basic
Alignment is why every frontier AI is trained with RLHF (or DPO) on top of pretraining. Techniques include: instruction tuning (teach the model to follow directions), preference learning (rank outputs, train toward preferred), Constitutional AI (Anthropic's self-critique approach), and red-teaming (adversarial testing). Anthropic, OpenAI, Google DeepMind, and xAI all have dedicated alignment teams.
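The "rank outputs, train toward preferred" step above can be sketched with the standard Bradley-Terry pairwise objective used in reward modeling. This is a minimal illustration on scalar scores, not any lab's actual training code; the function name is ours.

```python
import math

def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry pairwise loss: push a reward model to score the
    human-preferred output above the rejected one.
    Loss = -log sigmoid(r_chosen - r_rejected)."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Low loss when the model already ranks the preferred output higher...
low = preference_loss(2.0, -1.0)
# ...high loss when it prefers the rejected output.
high = preference_loss(-1.0, 2.0)
assert low < high
```

In practice the scores come from a learned reward model over full responses, and the loss is averaged over a dataset of human preference pairs.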
Deep
Alignment research splits into (1) alignment techniques (how to train safe AI) and (2) safety research (what happens if those techniques fail). The modern recipe: SFT → RLHF or DPO → safety tuning. Constitutional AI replaces human feedback with AI-generated critique following a written constitution. Scalable oversight (training AI to help evaluate AI) is an active research area. Evaluation combines lab-run safety evals (e.g., those tied to Anthropic's Responsible Scaling Policy or OpenAI's Preparedness Framework) with red-team exercises. Alignment is necessary but not sufficient for safety; it is the foundation on which safety layers sit.
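The DPO step in the recipe above has a closed-form per-pair loss. A minimal sketch, assuming summed token log-probs under the policy and the frozen reference (SFT) model are already computed; the function name and values are illustrative:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Direct Preference Optimization loss for one preference pair.
    The policy is rewarded for increasing the chosen response's
    log-prob relative to the reference more than the rejected one's.
    Loss = -log sigmoid(beta * margin)."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already prefers the chosen response relative to reference: low loss.
good = dpo_loss(-5.0, -10.0, -8.0, -8.0)
# Policy prefers the rejected response: high loss.
bad = dpo_loss(-10.0, -5.0, -8.0, -8.0)
assert good < bad
```

The `beta` term controls how far the policy may drift from the reference model; higher values penalize deviation less per unit of preference margin but sharpen the sigmoid.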
Expert
Technical subfields: reward modeling, preference learning, inverse reinforcement learning, mechanistic interpretability, scalable oversight, and model editing. Constitutional AI uses a principles document to generate AI feedback for RLHF-style training. RLAIF (RL from AI Feedback) scales preference-data collection beyond human-annotator constraints. Chain-of-thought monitoring watches reasoning traces for misaligned planning. Goodhart's Law applies: optimizing proxy rewards for safety can produce models that perform safety instead of being safe. Open problems: honesty at scale, corrigibility, mesa-optimization, and outer alignment.
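The Goodhart failure mode can be shown with a toy model. Both functions below are hypothetical stand-ins: a "true safety" score that peaks at moderate optimization pressure, and a proxy reward that adds an exploitable term the optimizer can game without the system getting safer.

```python
def true_safety(x: float) -> float:
    # Hypothetical genuine-safety score; peaks at x = 1.0.
    return x - 0.5 * x * x

def proxy_reward(x: float) -> float:
    # Proxy adds a gameable bonus that keeps growing with pressure x.
    return true_safety(x) + 0.4 * x

# Grid-search over "optimization pressure" x in [0, 3].
xs = [i / 100 for i in range(0, 301)]
x_proxy = max(xs, key=proxy_reward)   # what proxy optimization selects
x_true = max(xs, key=true_safety)     # what we actually wanted

# Optimizing the proxy overshoots, and genuine safety is worse there:
assert x_proxy > x_true
assert true_safety(x_proxy) < true_safety(x_true)
```

The divergence is the whole point: past the pressure level where the proxy and the true objective agree, further optimization of the proxy actively degrades the thing it was meant to measure.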
As reasoning models get capable enough to plan multi-step actions, alignment shifts from "make it polite" to "make it pursue intended goals".
Depending on why you're here
- SFT → RLHF/DPO → safety tuning is the baseline stack
- Constitutional AI, RLAIF, and scalable oversight are the active research areas
- Open problems: honesty, corrigibility, mesa-optimization
- You inherit upstream alignment: models come pre-aligned
- Fine-tuning can break alignment: use test sets that cover safety-critical cases
- Guardrails are additional: don't rely on the model's baseline alignment alone
- Alignment talent is a talent-war axis: top labs compete heavily for it
- Regulatory environment favors labs with demonstrable alignment investment
- Enterprise buyers value a credible alignment story: it's a differentiator
- Making sure AI does what humans actually want, not weird stuff
- The reason AI has rules about not helping with dangerous tasks
- An active research field, not a solved problem
Alignment is the hardest unsolved problem in AI. Every capability gain makes it harder, not easier.