
AI Alignment

The research discipline focused on making AI systems do what humans actually want, not just what they're told.

Level 1

Alignment is why every frontier AI is trained with RLHF (or DPO) on top of pretraining. Techniques include: instruction tuning (teach the model to follow directions), preference learning (rank outputs, train toward the preferred one), Constitutional AI (Anthropic's self-critique approach), and red-teaming (adversarial testing). Anthropic, OpenAI, Google DeepMind, and xAI all have dedicated alignment teams.
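The preference learning step above can be made concrete. Here is a minimal sketch of the pairwise (Bradley-Terry style) objective behind reward-model training, in plain Python; the function name and scalar rewards are illustrative, not from any particular library:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).

    The loss is small when the reward model already scores the
    preferred output higher, and large when the ranking is inverted.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the preferred output's reward pulls ahead:
print(round(preference_loss(2.0, 0.0), 4))  # 0.1269, ranking is right
print(round(preference_loss(0.0, 2.0), 4))  # 2.1269, ranking is wrong
```

Training a reward model then means nudging its parameters to lower this loss across many human-ranked pairs.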

Level 2

Alignment research splits into (1) alignment techniques (how to train safe AI), and (2) safety research (what happens if techniques fail). Modern recipes: SFT → RLHF or DPO → safety tuning. Constitutional AI replaces human feedback with AI-generated critique following a written constitution. Scalable oversight (training AI to help evaluate AI) is an active research area. Evaluation: Anthropic's responsibility benchmarks, OpenAI's safety evals, red-team exercises. Alignment is necessary but not sufficient for safety; it is the foundation on which safety layers sit.
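The DPO stage of the SFT → RLHF/DPO recipe can be sketched for a single preference pair. This toy version assumes the sequence log-probabilities have already been computed under the policy and the frozen reference model; the names and the beta value are illustrative:

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Rewards the policy for raising the chosen answer's log-probability
    relative to the reference model by more than the rejected answer's.
    """
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A policy that moved toward the chosen answer gets a lower loss
# than one that moved toward the rejected answer:
improved = dpo_loss(-4.0, -9.0, -6.0, -8.0)   # chosen up, rejected down
regressed = dpo_loss(-8.0, -4.0, -6.0, -8.0)  # chosen down, rejected up
print(improved < regressed)
```

The appeal of DPO in this recipe is that it optimizes preferences directly, with no separate reward model or RL loop.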

Level 3

Technical subfields: reward modeling, preference learning, inverse reinforcement learning, mechanistic interpretability, scalable oversight, model editing. Constitutional AI uses a principles document to generate AI feedback for RLHF-style training. RLAIF (RL from AI Feedback) scales beyond human annotator constraints. Chain-of-thought monitoring watches reasoning traces for misaligned planning. Goodhart's Law applies: optimizing proxy rewards for safety can produce models that perform safety instead of being safe. Open problems: honesty at scale, corrigibility, mesa-optimization, outer alignment.
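The Goodhart's Law point can be shown with a toy example (all numbers invented, not tied to any real training setup): a proxy reward that also pays for surface-level "safe-sounding" hedging will, under optimization, select outputs that game the proxy rather than serve the true objective.

```python
# Toy illustration of Goodhart's Law.
# true_reward is what we actually want; proxy_reward is what gets optimized.

def true_reward(helpfulness: float) -> float:
    return helpfulness

def proxy_reward(helpfulness: float, hedging: float) -> float:
    # The proxy also pays for hedging phrases that merely *look* safe.
    return helpfulness + 2.0 * hedging

# Candidate outputs as (helpfulness, hedging) pairs.
candidates = [(0.9, 0.0), (0.5, 0.9)]

best_by_proxy = max(candidates, key=lambda c: proxy_reward(*c))
best_by_true = max(candidates, key=lambda c: true_reward(c[0]))

print(best_by_proxy)  # (0.5, 0.9): the proxy prefers the hedger
print(best_by_true)   # (0.9, 0.0): the true goal prefers the helpful answer
```

The optimizer only ever sees the proxy, so the divergence between the two "best" answers is invisible to it; that is the failure mode the "performs safety instead of being safe" line describes.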

Why this matters now

As reasoning models get capable enough to plan multi-step actions, alignment shifts from "make it polite" to "make it pursue intended goals".

The takeaway for you

If you are a Researcher:
  • SFT → RLHF/DPO → safety tuning is the baseline stack
  • Constitutional AI, RLAIF, and scalable oversight are the active research areas
  • Open problems: honesty, corrigibility, mesa-optimization

If you are a Builder:
  • You inherit upstream alignment: models come pre-aligned
  • Fine-tuning can break alignment: use test sets that cover safety-critical cases
  • Guardrails are additional: don't rely on the model's baseline alignment alone

If you are an Investor:
  • Alignment talent is a talent-war axis: top labs compete heavily for it
  • The regulatory environment favors labs with demonstrable alignment investment
  • Enterprise buyers value a credible alignment story: it's a differentiator

If you are a Curious Normie:
  • Alignment means making sure AI does what humans actually want, not weird stuff
  • It's the reason AI has rules about not helping with dangerous tasks
  • It's an active research field, not a solved problem
Gecko's take

Alignment is the hardest unsolved problem in AI. Every capability gain makes it harder, not easier.

Surface alignment, yes: frontier models decline harmful requests, admit uncertainty, and follow instructions. Deep alignment (pursuing intended goals under distribution shift) remains unsolved.