Chain of Thought
Prompting the model to show its reasoning steps before the final answer dramatically improves math, logic, and multi-step tasks.
Basic
Chain-of-thought (CoT) prompting was introduced in a 2022 Google paper, which showed that providing worked, step-by-step examples lets models solve problems they previously failed; a follow-up 2022 paper showed the zero-shot variant, where simply adding "Let's think step by step" has a similar effect. Modern models trained on CoT data produce step-by-step reasoning by default. Reasoning models (OpenAI o1, Claude with Extended Thinking, DeepSeek R1) extend this further by generating internal CoT traces that are hidden from the user but billed as output tokens.
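The zero-shot trigger is nothing more than string concatenation. A minimal sketch, with `build_cot_prompt` as an illustrative helper (not a real library function) and the actual model call left out:

```python
# Zero-shot CoT: append the trigger phrase to any question before
# sending it to a chat/completions API of your choice.
COT_TRIGGER = "Let's think step by step."

def build_cot_prompt(question: str) -> str:
    """Wrap a question with the zero-shot chain-of-thought trigger."""
    return f"{question}\n\n{COT_TRIGGER}"

prompt = build_cot_prompt(
    "A bat and a ball cost $1.10 together. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)
```

The resulting `prompt` is what you pass to the model; everything after this point is ordinary generation.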
Deep
CoT exploits the transformer's autoregressive generation: each token conditions on all previous tokens, so by generating intermediate reasoning the model spends more compute before committing to a final answer and can catch errors in earlier steps. Zero-shot CoT ("Let's think step by step") was the original trigger phrase; few-shot CoT instead provides worked examples in the prompt. Modern models fine-tuned on CoT-rich data produce CoT without any trigger. Self-consistency improves CoT by sampling multiple reasoning paths and taking a majority vote on the final answer. Tree-of-Thought extends CoT to branching exploration. For tasks with verifiable answers, sample-and-verify schemes dramatically outperform single-shot CoT.
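Self-consistency reduces to "sample k traces, extract each final answer, majority-vote." A sketch, with `sample_model` as a hypothetical stand-in for any temperature-sampled completion call:

```python
# Self-consistency: sample k independent reasoning traces and return
# the most common final answer.
from collections import Counter

def self_consistency(sample_model, prompt: str, k: int = 5) -> str:
    answers = [sample_model(prompt) for _ in range(k)]
    # Majority vote over final answers; ties break toward the answer
    # seen first (Counter preserves insertion order).
    return Counter(answers).most_common(1)[0][0]

# Demo with a fake sampler whose traces occasionally disagree:
fake_samples = iter(["42", "42", "17", "42", "42"])
result = self_consistency(lambda p: next(fake_samples), "Q?", k=5)
# result == "42"
```

In practice the answer-extraction step (parsing the final number or choice out of a free-form trace) is where most of the engineering effort goes.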
Expert
CoT's mechanism: each generated step token adds roughly one forward pass of compute, so an n-step reasoning trace spends on the order of n times the FLOPs of a direct-answer pass. This matches theoretical results showing that transformers need either depth or sequence length to solve certain problem classes. Self-consistency samples k traces and picks the most common final answer; it works well when the reasoning distribution is unimodal around the correct answer. Tree-of-Thought (2023) searches a tree of partial reasoning states with an explicit value function: more expensive, but it handles multi-modal reasoning distributions. Chain-of-Verification and Reflexion add self-critique loops, with diminishing returns past 3-5 iterations.
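The Tree-of-Thought search loop can be sketched as a beam search over partial reasoning states. Here `expand` and `value` are hypothetical stand-ins; in a real system both would call the model (to propose continuations and to score states), and the demo below substitutes a toy integer search so the control flow is runnable:

```python
# Toy Tree-of-Thought: breadth-first expansion of partial states,
# keeping the top-b states per depth under an explicit value function.
def tree_of_thought(root, expand, value, beam=2, depth=3):
    frontier = [root]
    for _ in range(depth):
        children = [c for state in frontier for c in expand(state)]
        if not children:
            break
        # Keep the `beam` highest-value partial states.
        frontier = sorted(children, key=value, reverse=True)[:beam]
    return max(frontier, key=value)

# Demo: states are integers expanding to 2n and 2n+1; value rewards
# closeness to a target, so the search steers toward it.
target = 13
best = tree_of_thought(
    1,
    expand=lambda n: [2 * n, 2 * n + 1],
    value=lambda n: -abs(n - target),
)
# best == 13
```

The branching factor and beam width are where ToT's extra cost comes from: every kept state at every depth is another batch of model calls.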
CoT-native training (DeepSeek R1's RL recipe) made CoT a core model capability rather than a prompt trick. Every frontier model now does CoT internally.
Depending on why you're here
- Generated intermediate steps give the model more compute per final-answer token
- Zero-shot trigger: "Let's think step by step"
- Self-consistency and Tree-of-Thought are the main extensions
- For math, logic, and multi-step tasks: include a CoT trigger or use a reasoning model
- For chat and short answers: skip CoT; it costs tokens without upside
- Hide or show CoT depending on UX; most users want the final answer only
- CoT-native training is the core recipe behind the reasoning-model boom
- Reasoning models charge for hidden CoT tokens, a new revenue stream
- CoT effectiveness plateaus around ~10K tokens, limiting the upside of pure test-time compute
- Asking the AI to "show its work" before answering
- Makes AI far better at math and logic puzzles
- The reason reasoning models like o1 work
CoT was a hack in 2022 and a core capability by 2025. The next benchmarks will measure CoT efficiency, not CoT presence.
Read the primary sources
- Chain-of-Thought paper (Google, 2022), arxiv.org
- Self-Consistency paper (2022), arxiv.org
- Tree-of-Thought paper (2023), arxiv.org