
Latency

Time from request sent to first response token · the number that determines UX feel.


Level 1

Latency (Time to First Token, TTFT) is the delay between hitting send and seeing the first character of the response. A good TTFT is under 500ms and feels instant; 1-2s is acceptable for chat; above 5s feels broken. Latency is distinct from tokens-per-second, which measures how fast the response streams after the first token.
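The TTFT-versus-streaming-rate distinction is easy to see in code. A minimal sketch that times any token stream; the stream here is simulated, not a real API client:

```python
import time

def measure_stream(stream):
    """Measure TTFT and post-first-token streaming rate from any token iterator."""
    start = time.monotonic()
    ttft = None
    count = 0
    for _token in stream:
        now = time.monotonic()
        if ttft is None:
            ttft = now - start          # latency: clock stops at the FIRST token
        count += 1
    total = time.monotonic() - start
    # tokens-per-second describes everything *after* the first token
    tps = (count - 1) / (total - ttft) if count > 1 and total > ttft else 0.0
    return ttft, tps

# Simulated stream: ~300ms to the first token, then ~20ms per token after.
def fake_stream():
    time.sleep(0.3)
    for _ in range(10):
        yield "tok"
        time.sleep(0.02)

ttft, tps = measure_stream(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, stream rate: {tps:.0f} tok/s")
```

The same wrapper works for a real streaming response, since it only needs an iterable of tokens.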

Level 2

TTFT depends on: queue wait (how many requests are ahead of yours), prefill time (proportional to input length), model size (bigger = slower), and geographic distance (network round-trip time, bounded by the speed of light between your server and theirs). Reasoning models add 30-120s of thinking latency before the first visible token. Groq and Cerebras hit sub-100ms TTFT with exotic hardware (LPU, wafer-scale chips). Most frontier APIs land at 400-2000ms TTFT at p50.
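The components above add up to a simple back-of-envelope model. All the numbers below are illustrative assumptions, not measured figures from any provider:

```python
def estimate_ttft_ms(queue_ms, input_tokens, prefill_tok_per_s,
                     network_rtt_ms, thinking_ms=0):
    """Sum the TTFT components: queue wait + prefill + network (+ reasoning)."""
    prefill_ms = input_tokens / prefill_tok_per_s * 1000
    return queue_ms + prefill_ms + network_rtt_ms + thinking_ms

# A 4k-token prompt, ~10k tok/s prefill, light queue, same-region endpoint:
fast = estimate_ttft_ms(queue_ms=50, input_tokens=4000,
                        prefill_tok_per_s=10_000, network_rtt_ms=20)

# Same prompt, cross-ocean endpoint, plus a reasoning model thinking for 30s:
slow = estimate_ttft_ms(queue_ms=50, input_tokens=4000,
                        prefill_tok_per_s=10_000,
                        network_rtt_ms=150, thinking_ms=30_000)

print(f"{fast:.0f} ms vs {slow:.0f} ms")  # 470 ms vs 30600 ms
```

Note how reasoning latency dwarfs every other term once it appears, which is why it gets called out separately.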

Level 3

Prefill cost is O(n²) in input tokens for full attention and O(n·w) for sliding-window attention with window size w. Grouped-Query Attention reduces the constant factor by shrinking the KV cache. KV cache reuse (prefix caching) eliminates prefill for repeated prompts · a massive TTFT win for multi-turn chat with a stable system prompt. Claude's prompt caching and OpenAI's cached input pricing reflect this optimization. Geographic routing (regional endpoints) reduces network latency from 100ms+ to under 20ms.
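The prefix-caching idea fits in a few lines. This is a structural sketch, not any provider's API: `prefill` is a stand-in for the real attention computation, and the cache key is just the joined prefix tokens.

```python
# Minimal sketch of prefix caching: if a request's prompt starts with a
# prefix we have already prefilled, only the new suffix pays prefill cost.
kv_cache: dict[str, object] = {}

def prefill(tokens: list[str]) -> object:
    """Placeholder for the real (expensive) KV computation."""
    return ("kv", tuple(tokens))

def serve(system_prompt: list[str], user_turn: list[str]):
    """Returns (kv, number_of_tokens_actually_prefilled)."""
    key = "|".join(system_prompt)
    if key in kv_cache:                   # cache hit: reuse system-prompt KV
        cached = kv_cache[key]
        suffix_kv = prefill(user_turn)    # only the new suffix is computed
        return (cached, suffix_kv), len(user_turn)
    kv_cache[key] = prefill(system_prompt)
    full_kv = prefill(system_prompt + user_turn)
    return full_kv, len(system_prompt) + len(user_turn)

sys_prompt = ["You", "are", "helpful"] * 100   # a long, stable system prompt
_, cost1 = serve(sys_prompt, ["hi"])            # cold: prefills all 301 tokens
_, cost2 = serve(sys_prompt, ["follow", "up"])  # warm: prefills only 2 tokens
print(cost1, cost2)
```

With an O(n²) prefill, skipping a 300-token shared prefix saves far more than the token count alone suggests, which is why a stable system prompt is the canonical win.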

The takeaway for you
If you are a Researcher
  • TTFT = queue + prefill + network
  • Prefix caching eliminates prefill for repeated prompts
  • Reasoning adds massive, variable latency
If you are a Builder
  • Target TTFT under 500ms for interactive chat UX
  • Use prompt caching for multi-turn · Claude offers a 90% discount on cached prefixes
  • Pick regional endpoints · saves 50-100ms of network latency
If you are an Investor
  • Groq / Cerebras compete on TTFT · a different market than throughput players
  • Prompt caching is a pricing lever for frontier labs
  • Latency-sensitive agents drive a premium for fast inference
If you are a Curious Normie
  • How fast the AI starts typing back at you
  • Under half a second feels instant
  • Reasoning models are slower but smarter
Gecko's take

TTFT determines whether AI feels magical or broken. Every UX win is a latency win.

Under 500ms for chat UX, under 100ms for real-time voice. Frontier APIs at p50: 400-2000ms. Groq/Cerebras: sub-100ms.