Latency
Time from request sent to first response token · the number that determines UX feel.
Basic
Latency (Time to First Token, TTFT) is the delay between hitting send and seeing the first character of the response. A good TTFT is under 500ms, which feels instant. 1-2s is acceptable for chat. Above 5s feels broken. Latency is distinct from tokens-per-second (how fast the response streams after the first token).
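The thresholds above can be sketched as a tiny classifier. The "sluggish" label for the unnamed 2-5s gap is my own filler, not from the text:

```python
def ttft_feel(ttft_ms: float) -> str:
    """Map a measured TTFT (milliseconds) to the UX buckets described above."""
    if ttft_ms < 500:
        return "instant"      # good TTFT: feels immediate
    if ttft_ms <= 2000:
        return "acceptable"   # fine for chat
    if ttft_ms <= 5000:
        return "sluggish"     # assumed label for the unnamed middle band
    return "broken"           # above 5s feels broken
```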
Deep
TTFT depends on: queue wait (how many requests are ahead), prefill time (proportional to input length), model size (bigger = slower), and geographic distance (speed of light from your server to theirs). Reasoning models add 30-120s thinking latency before the first visible token. Groq and Cerebras hit sub-100ms TTFT by using exotic hardware (LPU, wafer-scale). Most frontier APIs: 400-2000ms TTFT at p50.
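The decomposition above can be written as a back-of-the-envelope estimator. The prefill rate and all example numbers are assumptions for illustration, not published figures:

```python
def estimate_ttft_ms(queue_ms: float,
                     input_tokens: int,
                     prefill_tokens_per_s: float,
                     network_rtt_ms: float,
                     thinking_ms: float = 0.0) -> float:
    """TTFT ~= queue wait + prefill (input length / prefill rate)
    + network round trip (+ reasoning latency, if any)."""
    prefill_ms = input_tokens / prefill_tokens_per_s * 1000.0
    return queue_ms + prefill_ms + network_rtt_ms + thinking_ms

# Hypothetical example: 50ms queue, 4,000-token prompt at an assumed
# 10,000 tok/s prefill rate, 80ms network RTT -> 530ms estimated TTFT.
estimate = estimate_ttft_ms(50, 4000, 10_000, 80)
```

Note how a reasoning model's `thinking_ms` of 30,000-120,000 dwarfs every other term, which is why reasoning latency dominates perceived TTFT.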
Expert
Prefill cost is O(n²) in input tokens for full attention, O(n·w) for a sliding window of size w. Grouped-Query Attention reduces the multiplicative constant. KV cache reuse (prefix caching) eliminates prefill for repeated prompts, a massive TTFT win for multi-turn chat with a stable system prompt. Claude's prompt caching and OpenAI's cached input pricing reflect this optimization. Geographic routing (regional endpoints) reduces network latency from 100ms+ to under 20ms.
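A minimal sketch of the prefix-caching idea, assuming a toy model where "work" is just token count and the stored KV state is a placeholder object (real serving stacks cache actual attention KV tensors):

```python
import hashlib


class PrefixCache:
    """Toy sketch: skip prefill work for a previously seen prompt prefix."""

    def __init__(self) -> None:
        self._cache: dict[str, object] = {}  # prefix hash -> simulated KV state

    def _key(self, prefix: str) -> str:
        return hashlib.sha256(prefix.encode()).hexdigest()

    def prefill(self, prefix: str, suffix: str) -> tuple[int, bool]:
        """Return (tokens actually prefilled, cache hit?).

        On a hit, only the new suffix needs prefill; on a miss, the
        whole prompt does, and the prefix's KV state is stored for reuse.
        """
        key = self._key(prefix)
        hit = key in self._cache
        if not hit:
            self._cache[key] = object()  # stand-in for the real KV cache
        tokens = len(suffix.split()) if hit else len((prefix + " " + suffix).split())
        return tokens, hit
```

This is why a stable system prompt matters: any edit to the prefix changes its hash and forces a full re-prefill.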
Depending on why you're here
- TTFT = queue + prefill + network
- Prefix caching eliminates prefill for repeated prompts
- Reasoning adds massive variable latency
- Target TTFT under 500ms for interactive chat UX
- Use prompt caching for multi-turn; Claude offers a 90% discount on cached prefixes
- Pick regional endpoints; saves 50-100ms of network latency
- Groq and Cerebras compete on TTFT; a different market than throughput players
- Prompt caching is a pricing lever for frontier labs
- Latency-sensitive agents drive a premium for fast inference
- How fast AI starts typing back at you
- Under half a second feels instant
- Reasoning models are slower but smarter
TTFT determines whether AI feels magical or broken. Every UX win is a latency win.