Reasoning Model
A model that generates internal reasoning tokens before producing the answer, trading inference cost for accuracy on math and logic.
Basic
Reasoning models like o1, o3, Claude Extended Thinking, and DeepSeek R1 produce a private chain of thought before their final answer. This spends more tokens and adds latency but sharply improves scores on benchmarks like GPQA Diamond, ARC-AGI, and MATH. You pay roughly 10× more per answer but get dramatically better correctness on hard problems.
Deep
Reasoning models are trained with reinforcement learning on chain-of-thought traces that maximize a correctness reward. At inference time they allocate a variable "thinking budget" of tokens that are generated but not shown to the user. OpenAI o1 spends between 1K and 100K reasoning tokens per problem depending on difficulty. Claude Extended Thinking exposes the reasoning transparently and charges for it. DeepSeek R1 open-sourced both the weights and the RL recipe, making reasoning models accessible to anyone. The trade-off is direct: more thinking tokens equals higher cost and latency but better accuracy on benchmarks that require multi-step inference. For pure recall or chat, reasoning models are overkill.
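Because hidden reasoning tokens are billed as output, a short visible answer can still be expensive. A minimal cost sketch, using the o1-preview rates quoted in this entry ($15/M input, $60/M output); the token counts in the example are illustrative, not measured:

```python
INPUT_PER_M = 15.00   # USD per 1M input tokens (o1-preview rate from this entry)
OUTPUT_PER_M = 60.00  # USD per 1M output tokens; hidden reasoning billed here too

def request_cost(prompt_tokens, reasoning_tokens, answer_tokens):
    """Cost of one call; reasoning tokens are invisible but billed as output."""
    billed_output = reasoning_tokens + answer_tokens
    return (prompt_tokens / 1e6) * INPUT_PER_M + (billed_output / 1e6) * OUTPUT_PER_M

# A hard problem: 500-token prompt, 20K hidden reasoning tokens, 300-token answer.
hard = request_cost(500, 20_000, 300)
# The same visible answer with no thinking budget.
plain = request_cost(500, 0, 300)
print(f"with reasoning: ${hard:.4f}, without: ${plain:.4f}, ratio: {hard/plain:.0f}x")
```

Note how the 300-token visible answer is a rounding error next to the 20K-token hidden trace: the thinking budget, not the answer, dominates the bill.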
Expert
A reasoning model is fine-tuned with RL against a reward that measures final-answer correctness on problems with verifiable answers (MATH, code, etc.). The policy learns to emit long intermediate traces that decompose the problem, self-verify partial results, and backtrack from dead ends. Test-time compute becomes a new scaling axis: OpenAI reported that o1 quality scales log-linearly with thinking tokens up to ~10K tokens per problem. Serving costs mirror this: o1-preview charges $15/M input + $60/M output, with hidden reasoning tokens billed as output. DeepSeek R1 (671B MoE reasoning model) demonstrated that the recipe generalizes: pure outcome-reward RL on a strong base model yields emergent reasoning behavior without any process-level supervision. Open questions: how far test-time compute scales before diminishing returns set in, and whether reasoning-style RL harms general capabilities.
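The log-linear scaling claim can be made concrete with a toy fit: accuracy grows roughly linearly in log(thinking tokens) until it saturates near the ~10K-token budget. The coefficients below are hypothetical, chosen only to illustrate the shape, not OpenAI's actual numbers:

```python
import math

def toy_accuracy(thinking_tokens, a=0.30, b=0.12, cap=10_000):
    """Hypothetical log-linear fit: accuracy ≈ a + b*log10(tokens),
    flattening once the thinking budget exceeds ~10K tokens per problem.
    Constants a, b are illustrative assumptions, not fitted values."""
    t = min(thinking_tokens, cap)  # past the cap, extra thinking buys nothing
    return min(1.0, a + b * math.log10(t))

for t in (100, 1_000, 10_000, 100_000):
    print(f"{t:>7} thinking tokens -> accuracy {toy_accuracy(t):.2f}")
```

Each 10× increase in thinking budget buys a fixed accuracy increment until the plateau, which is why cost grows much faster than quality at the high end.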
DeepSeek R1 open-sourced a frontier reasoning model in Jan 2025. OpenAI responded with o3 and dropped pricing in 2026. Reasoning capability is commoditizing fast.
Depending on why you're here
- RL on verifiable rewards drives emergent reasoning without process supervision
- Test-time compute scales log-linearly with thinking tokens up to ~10K
- See o1 report (OpenAI 2024) and DeepSeek R1 paper (Jan 2025)
- Reasoning models charge for hidden thinking tokens · budget accordingly
- Use for math, code, multi-step logic · not for chat or short answers
- Latency often 30-120s · plan UX around waiting
- Test-time compute is the new scaling axis · changes the capex story
- Commoditization risk: DeepSeek R1 is open-weight and matches o1 at 1/30th the price
- Inference-heavy revenue models outperform pure training-spend bets
- The AI pauses to think before answering hard questions
- Slower and more expensive but much smarter on tricky problems
- Use it for homework and code, not for chit-chat
Often confused with
CoT is a prompting trick any model can follow. Reasoning models are trained to do it by default via RL.
Agents call tools and take actions. Reasoning models think harder internally but don't act on the world.
Reasoning models changed the pricing ladder: the frontier no longer has a single price. It now runs roughly $2/M for chat, $20/M for reasoning, and $200/M for agent execution.
Routing easy questions to a chat model and only using reasoning for hard ones saves ~85% on a typical 1M-token/mo workload.
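A back-of-envelope version of that routing calculation, using the pricing ladder above ($2/M chat, $20/M reasoning). The 95% easy-traffic split is an assumption picked to reproduce the ~85% figure; your actual split depends on your query mix:

```python
CHAT_PER_M = 2.0        # USD per 1M tokens, chat model (from the ladder above)
REASONING_PER_M = 20.0  # USD per 1M tokens, reasoning model

def monthly_cost(total_tokens_m, easy_fraction):
    """Route easy traffic to the chat model, hard traffic to reasoning."""
    easy = total_tokens_m * easy_fraction * CHAT_PER_M
    hard = total_tokens_m * (1 - easy_fraction) * REASONING_PER_M
    return easy + hard

all_reasoning = monthly_cost(1.0, 0.0)  # send everything to the reasoning model
routed = monthly_cost(1.0, 0.95)        # assume 95% of a 1M-token/mo load is easy
savings = 1 - routed / all_reasoning
print(f"${all_reasoning:.2f}/mo -> ${routed:.2f}/mo, savings {savings:.0%}")
```

The hard part in practice is the classifier that decides which queries are "easy"; a misroute sends a hard problem to the cheap model and costs you correctness rather than dollars.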
Read the primary sources
- OpenAI o1 release (Sep 2024) · openai.com
- DeepSeek R1 paper (Jan 2025) · arxiv.org
- Anthropic Extended Thinking · www.anthropic.com