
Latency

Time from request sent to first response token · the number that determines UX feel.


Level 1

Latency (Time to First Token, TTFT) is the delay between hitting send and seeing the first character of the response. A good TTFT is under 500ms and feels instant; 1-2s is acceptable for chat; above 5s feels broken. Latency is distinct from tokens-per-second, which measures how fast the response streams after the first token.
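The TTFT-versus-streaming-rate distinction is easy to see in code. A minimal sketch that times any token stream; the stream here is simulated, not a real API client:

```python
import time

def measure_stream(stream):
    """Measure TTFT and post-first-token streaming rate from any token iterator."""
    start = time.monotonic()
    ttft = None
    count = 0
    for _token in stream:
        now = time.monotonic()
        if ttft is None:
            ttft = now - start          # latency: clock stops at the FIRST token
        count += 1
    total = time.monotonic() - start
    # tokens-per-second describes everything *after* the first token
    tps = (count - 1) / (total - ttft) if count > 1 and total > ttft else 0.0
    return ttft, tps

# Simulated stream: ~300ms to the first token, then ~20ms per token after.
def fake_stream():
    time.sleep(0.3)
    for _ in range(10):
        yield "tok"
        time.sleep(0.02)

ttft, tps = measure_stream(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, stream rate: {tps:.0f} tok/s")
```

The same wrapper works for a real streaming response, since it only needs an iterable of tokens.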

Level 2

TTFT depends on: queue wait (how many requests are ahead of yours), prefill time (proportional to input length), model size (bigger = slower), and geographic distance (network round-trip time, bounded by the speed of light between your server and theirs). Reasoning models add 30-120s of thinking latency before the first visible token. Groq and Cerebras hit sub-100ms TTFT with exotic hardware (LPU, wafer-scale chips). Most frontier APIs land at 400-2000ms TTFT at p50.
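The components above add up to a simple back-of-envelope model. All the numbers below are illustrative assumptions, not measured figures from any provider:

```python
def estimate_ttft_ms(queue_ms, input_tokens, prefill_tok_per_s,
                     network_rtt_ms, thinking_ms=0):
    """Sum the TTFT components: queue wait + prefill + network (+ reasoning)."""
    prefill_ms = input_tokens / prefill_tok_per_s * 1000
    return queue_ms + prefill_ms + network_rtt_ms + thinking_ms

# A 4k-token prompt, ~10k tok/s prefill, light queue, same-region endpoint:
fast = estimate_ttft_ms(queue_ms=50, input_tokens=4000,
                        prefill_tok_per_s=10_000, network_rtt_ms=20)

# Same prompt, cross-ocean endpoint, plus a reasoning model thinking for 30s:
slow = estimate_ttft_ms(queue_ms=50, input_tokens=4000,
                        prefill_tok_per_s=10_000,
                        network_rtt_ms=150, thinking_ms=30_000)

print(f"{fast:.0f} ms vs {slow:.0f} ms")  # 470 ms vs 30600 ms
```

Note how reasoning latency dwarfs every other term once it appears, which is why it gets called out separately.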

Level 3

Prefill cost is O(n²) in input tokens for full attention and O(n·w) for sliding-window attention with window size w. Grouped-Query Attention reduces the constant factor by shrinking the KV cache. KV cache reuse (prefix caching) eliminates prefill for repeated prompts · a massive TTFT win for multi-turn chat with a stable system prompt. Claude's prompt caching and OpenAI's cached input pricing reflect this optimization. Geographic routing (regional endpoints) reduces network latency from 100ms+ to under 20ms.
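The prefix-caching idea fits in a few lines. This is a structural sketch, not any provider's API: `prefill` is a stand-in for the real attention computation, and the cache key is just the joined prefix tokens.

```python
# Minimal sketch of prefix caching: if a request's prompt starts with a
# prefix we have already prefilled, only the new suffix pays prefill cost.
kv_cache: dict[str, object] = {}

def prefill(tokens: list[str]) -> object:
    """Placeholder for the real (expensive) KV computation."""
    return ("kv", tuple(tokens))

def serve(system_prompt: list[str], user_turn: list[str]):
    """Returns (kv, number_of_tokens_actually_prefilled)."""
    key = "|".join(system_prompt)
    if key in kv_cache:                   # cache hit: reuse system-prompt KV
        cached = kv_cache[key]
        suffix_kv = prefill(user_turn)    # only the new suffix is computed
        return (cached, suffix_kv), len(user_turn)
    kv_cache[key] = prefill(system_prompt)
    full_kv = prefill(system_prompt + user_turn)
    return full_kv, len(system_prompt) + len(user_turn)

sys_prompt = ["You", "are", "helpful"] * 100   # a long, stable system prompt
_, cost1 = serve(sys_prompt, ["hi"])            # cold: prefills all 301 tokens
_, cost2 = serve(sys_prompt, ["follow", "up"])  # warm: prefills only 2 tokens
print(cost1, cost2)
```

With an O(n²) prefill, skipping a 300-token shared prefix saves far more than the token count alone suggests, which is why a stable system prompt is the canonical win.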

The takeaway for you
If you are a Researcher
  • TTFT = queue + prefill + network
  • Prefix caching eliminates prefill for repeated prompts
  • Reasoning adds massive, variable latency
If you are a Builder
  • Target TTFT under 500ms for interactive chat UX
  • Use prompt caching for multi-turn · Claude offers a 90% discount on cached prefixes
  • Pick regional endpoints · saves 50-100ms of network latency
If you are an Investor
  • Groq / Cerebras compete on TTFT · a different market than throughput players
  • Prompt caching is a pricing lever for frontier labs
  • Latency-sensitive agents drive a premium for fast inference
If you are a Curious Normie
  • How fast the AI starts typing back at you
  • Under half a second feels instant
  • Reasoning models are slower but smarter
Gecko's take

TTFT determines whether AI feels magical or broken. Every UX win is a latency win.

Under 500ms for chat UX, under 100ms for real-time voice. Frontier APIs at p50: 400-2000ms. Groq/Cerebras: sub-100ms.