
AI Alignment

The research discipline focused on making AI systems do what humans actually want, not just what they're told.

Level 1

Alignment is why every frontier AI is trained with RLHF (or DPO) on top of pretraining. Techniques include: instruction tuning (teach the model to follow directions), preference learning (rank outputs, train toward the preferred one), Constitutional AI (Anthropic's self-critique approach), and red-teaming (adversarial testing). Anthropic, OpenAI, Google DeepMind, and xAI all have dedicated alignment teams.
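The preference learning step above can be made concrete. Here is a minimal sketch of the pairwise (Bradley-Terry style) objective behind reward-model training, in plain Python; the function name and scalar rewards are illustrative, not from any particular library:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).

    The loss is small when the reward model already scores the
    preferred output higher, and large when the ranking is inverted.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the preferred output's reward pulls ahead:
print(round(preference_loss(2.0, 0.0), 4))  # 0.1269, ranking is right
print(round(preference_loss(0.0, 2.0), 4))  # 2.1269, ranking is wrong
```

Training a reward model then means nudging its parameters to lower this loss across many human-ranked pairs.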

Level 2

Alignment research splits into (1) alignment techniques (how to train safe AI), and (2) safety research (what happens if techniques fail). Modern recipes: SFT → RLHF or DPO → safety tuning. Constitutional AI replaces human feedback with AI-generated critique following a written constitution. Scalable oversight (training AI to help evaluate AI) is an active research area. Evaluation: Anthropic's responsibility benchmarks, OpenAI's safety evals, red-team exercises. Alignment is necessary but not sufficient for safety; it is the foundation on which safety layers sit.
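The DPO stage of the SFT → RLHF/DPO recipe can be sketched for a single preference pair. This toy version assumes the sequence log-probabilities have already been computed under the policy and the frozen reference model; the names and the beta value are illustrative:

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Rewards the policy for raising the chosen answer's log-probability
    relative to the reference model by more than the rejected answer's.
    """
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A policy that moved toward the chosen answer gets a lower loss
# than one that moved toward the rejected answer:
improved = dpo_loss(-4.0, -9.0, -6.0, -8.0)   # chosen up, rejected down
regressed = dpo_loss(-8.0, -4.0, -6.0, -8.0)  # chosen down, rejected up
print(improved < regressed)
```

The appeal of DPO in this recipe is that it optimizes preferences directly, with no separate reward model or RL loop.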

Level 3

Technical subfields: reward modeling, preference learning, inverse reinforcement learning, mechanistic interpretability, scalable oversight, model editing. Constitutional AI uses a principles document to generate AI feedback for RLHF-style training. RLAIF (RL from AI Feedback) scales beyond human annotator constraints. Chain-of-thought monitoring watches reasoning traces for misaligned planning. Goodhart's Law applies: optimizing proxy rewards for safety can produce models that perform safety instead of being safe. Open problems: honesty at scale, corrigibility, mesa-optimization, outer alignment.
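The Goodhart's Law point can be shown with a toy example (all numbers invented, not tied to any real training setup): a proxy reward that also pays for surface-level "safe-sounding" hedging will, under optimization, select outputs that game the proxy rather than serve the true objective.

```python
# Toy illustration of Goodhart's Law.
# true_reward is what we actually want; proxy_reward is what gets optimized.

def true_reward(helpfulness: float) -> float:
    return helpfulness

def proxy_reward(helpfulness: float, hedging: float) -> float:
    # The proxy also pays for hedging phrases that merely *look* safe.
    return helpfulness + 2.0 * hedging

# Candidate outputs as (helpfulness, hedging) pairs.
candidates = [(0.9, 0.0), (0.5, 0.9)]

best_by_proxy = max(candidates, key=lambda c: proxy_reward(*c))
best_by_true = max(candidates, key=lambda c: true_reward(c[0]))

print(best_by_proxy)  # (0.5, 0.9): the proxy prefers the hedger
print(best_by_true)   # (0.9, 0.0): the true goal prefers the helpful answer
```

The optimizer only ever sees the proxy, so the divergence between the two "best" answers is invisible to it; that is the failure mode the "performs safety instead of being safe" line describes.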

Why this matters now

As reasoning models get capable enough to plan multi-step actions, alignment shifts from "make it polite" to "make it pursue intended goals".

The takeaway for you

If you are a Researcher:
  • SFT → RLHF/DPO → safety tuning is the baseline stack
  • Constitutional AI, RLAIF, and scalable oversight are the active research areas
  • Open problems: honesty, corrigibility, mesa-optimization

If you are a Builder:
  • You inherit upstream alignment: models come pre-aligned
  • Fine-tuning can break alignment: use test sets that cover safety-critical cases
  • Guardrails are additional: don't rely on the model's baseline alignment alone

If you are an Investor:
  • Alignment talent is a talent-war axis: top labs compete heavily for it
  • The regulatory environment favors labs with demonstrable alignment investment
  • Enterprise buyers value a credible alignment story: it's a differentiator

If you are a Curious Normie:
  • Alignment means making sure AI does what humans actually want, not weird stuff
  • It's the reason AI has rules about not helping with dangerous tasks
  • It's an active research field, not a solved problem
Gecko's take

Alignment is the hardest unsolved problem in AI. Every capability gain makes it harder, not easier.

Surface alignment, yes: frontier models decline harmful requests, admit uncertainty, and follow instructions. Deep alignment (pursuing intended goals under distribution shift) remains unsolved.