Reasoning Model
A model that generates internal reasoning tokens before producing the answer, trading inference cost for accuracy on math and logic.
Basic
Reasoning models like o1, o3, Claude Extended Thinking, and DeepSeek R1 produce a private chain of thought before their final answer. This spends more tokens and adds latency but sharply improves scores on benchmarks like GPQA Diamond, ARC-AGI, and MATH. You pay roughly 10× more per answer but get dramatically better correctness on hard problems.
Deep
Reasoning models are trained with reinforcement learning on chain-of-thought traces that maximize a correctness reward. At inference time they allocate a variable "thinking budget" of tokens that are generated but not shown to the user. OpenAI o1 spends between 1K and 100K reasoning tokens per problem depending on difficulty. Claude Extended Thinking exposes the reasoning transparently and charges for it. DeepSeek R1 open-sourced both the weights and the RL recipe, making reasoning models accessible to anyone. The trade-off is direct: more thinking tokens equals higher cost and latency but better accuracy on benchmarks that require multi-step inference. For pure recall or chat, reasoning models are overkill.
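Because hidden reasoning tokens are billed as output, a short visible answer can still be expensive. A minimal cost sketch, using the o1-preview rates quoted in this entry ($15/M input, $60/M output); the token counts in the example are illustrative, not measured:

```python
INPUT_PER_M = 15.00   # USD per 1M input tokens (o1-preview rate from this entry)
OUTPUT_PER_M = 60.00  # USD per 1M output tokens; hidden reasoning billed here too

def request_cost(prompt_tokens, reasoning_tokens, answer_tokens):
    """Cost of one call; reasoning tokens are invisible but billed as output."""
    billed_output = reasoning_tokens + answer_tokens
    return (prompt_tokens / 1e6) * INPUT_PER_M + (billed_output / 1e6) * OUTPUT_PER_M

# A hard problem: 500-token prompt, 20K hidden reasoning tokens, 300-token answer.
hard = request_cost(500, 20_000, 300)
# The same visible answer with no thinking budget.
plain = request_cost(500, 0, 300)
print(f"with reasoning: ${hard:.4f}, without: ${plain:.4f}, ratio: {hard/plain:.0f}x")
```

Note how the 300-token visible answer is a rounding error next to the 20K-token hidden trace: the thinking budget, not the answer, dominates the bill.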
Expert
A reasoning model is fine-tuned with RL against a reward that measures final-answer correctness on problems with verifiable answers (MATH, code, etc.). The policy learns to emit long intermediate traces that decompose the problem, self-verify partial results, and backtrack from dead ends. Test-time compute becomes a new scaling axis: OpenAI reported that o1 quality scales log-linearly with thinking tokens up to ~10K tokens per problem. Serving costs mirror this: o1-preview charges $15/M input + $60/M output, with hidden reasoning tokens billed as output. DeepSeek R1 (671B MoE reasoning model) demonstrated that the recipe generalizes: pure outcome-reward RL on a strong base model yields emergent reasoning behavior without any process-level supervision. Open questions: how far test-time compute scales before diminishing returns set in, and whether reasoning-style RL harms general capabilities.
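The log-linear scaling claim can be made concrete with a toy fit: accuracy grows roughly linearly in log(thinking tokens) until it saturates near the ~10K-token budget. The coefficients below are hypothetical, chosen only to illustrate the shape, not OpenAI's actual numbers:

```python
import math

def toy_accuracy(thinking_tokens, a=0.30, b=0.12, cap=10_000):
    """Hypothetical log-linear fit: accuracy ≈ a + b*log10(tokens),
    flattening once the thinking budget exceeds ~10K tokens per problem.
    Constants a, b are illustrative assumptions, not fitted values."""
    t = min(thinking_tokens, cap)  # past the cap, extra thinking buys nothing
    return min(1.0, a + b * math.log10(t))

for t in (100, 1_000, 10_000, 100_000):
    print(f"{t:>7} thinking tokens -> accuracy {toy_accuracy(t):.2f}")
```

Each 10× increase in thinking budget buys a fixed accuracy increment until the plateau, which is why cost grows much faster than quality at the high end.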
DeepSeek R1 open-sourced a frontier reasoning model in Jan 2025. OpenAI responded with o3 and dropped pricing in 2026. Reasoning capability is commoditizing fast.
Depending on why you're here
- RL on verifiable rewards drives emergent reasoning without process supervision
- Test-time compute scales log-linearly with thinking tokens up to ~10K
- See o1 report (OpenAI 2024) and DeepSeek R1 paper (Jan 2025)
- Reasoning models charge for hidden thinking tokens · budget accordingly
- Use for math, code, multi-step logic · not for chat or short answers
- Latency often 30-120s · plan UX around waiting
- Test-time compute is the new scaling axis · changes the capex story
- Commoditization risk: DeepSeek R1 is open-weight and matches o1 at 1/30th the price
- Inference-heavy revenue models outperform pure training-spend bets
- The AI pauses to think before answering hard questions
- Slower and more expensive but much smarter on tricky problems
- Use it for homework and code, not for chit-chat
Often confused with
CoT is a prompting trick any model can follow. Reasoning models are trained to do it by default via RL.
Agents call tools and take actions. Reasoning models think harder internally but don't act on the world.
Reasoning models changed the pricing ladder: the frontier no longer has a single price. It now runs roughly $2/M for chat, $20/M for reasoning, and $200/M for agent execution.
Routing easy questions to a chat model and only using reasoning for hard ones saves ~85% on a typical 1M-token/mo workload.
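A back-of-envelope version of that routing calculation, using the pricing ladder above ($2/M chat, $20/M reasoning). The 95% easy-traffic split is an assumption picked to reproduce the ~85% figure; your actual split depends on your query mix:

```python
CHAT_PER_M = 2.0        # USD per 1M tokens, chat model (from the ladder above)
REASONING_PER_M = 20.0  # USD per 1M tokens, reasoning model

def monthly_cost(total_tokens_m, easy_fraction):
    """Route easy traffic to the chat model, hard traffic to reasoning."""
    easy = total_tokens_m * easy_fraction * CHAT_PER_M
    hard = total_tokens_m * (1 - easy_fraction) * REASONING_PER_M
    return easy + hard

all_reasoning = monthly_cost(1.0, 0.0)  # send everything to the reasoning model
routed = monthly_cost(1.0, 0.95)        # assume 95% of a 1M-token/mo load is easy
savings = 1 - routed / all_reasoning
print(f"${all_reasoning:.2f}/mo -> ${routed:.2f}/mo, savings {savings:.0%}")
```

The hard part in practice is the classifier that decides which queries are "easy"; a misroute sends a hard problem to the cheap model and costs you correctness rather than dollars.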
Read the primary sources
- OpenAI o1 release (Sep 2024) · openai.com
- DeepSeek R1 paper (Jan 2025) · arxiv.org
- Anthropic Extended Thinking · www.anthropic.com