Throughput
The total tokens-per-second a serving cluster handles across ALL concurrent requests · vs tokens-per-second of a single request.
Basic
Throughput is a system-level metric: how many total tokens per second your infrastructure serves, summed across all concurrent requests. Tokens-per-second, by contrast, is what a single request sees. A cluster with 10 users each getting 50 tok/s has 500 tok/s throughput. Optimizing for throughput vs latency is a key trade-off · batching more requests means higher throughput but higher per-request latency.
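The arithmetic above can be sketched in a couple of lines (the rates are the illustrative numbers from the example, not a benchmark):

```python
def aggregate_throughput(per_request_rates):
    """System throughput is the sum of the concurrent per-request rates (tok/s)."""
    return sum(per_request_rates)

# 10 concurrent users, each decoding at 50 tok/s
print(aggregate_throughput([50.0] * 10))  # 500.0 tok/s cluster-wide
```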
Deep
Throughput depends on batching strategy, quantization, parallelism (tensor, pipeline, expert), and KV cache management. Continuous batching (vLLM, TRT-LLM) packs requests dynamically, drastically raising utilization. FP8 weights and activations roughly doubles throughput on compatible hardware. Mixed-tier serving (a small model for easy requests, a big one for hard) is an emerging pattern. Per-GPU throughput benchmark: an H100 hits 2000-4000 tok/s aggregate on a 70B dense model with FP8 + continuous batching.
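A toy model makes the continuous-batching gain concrete. Everything here is assumed for illustration: one decode step per tick, a batch capacity of 4 slots, and a made-up mix of short and long requests. It is a sketch of the scheduling idea, not of vLLM's actual scheduler:

```python
import math

lengths = [100, 20, 20, 20, 100, 20, 20, 20]  # output tokens per request (assumed)
capacity = 4                                   # batch slots per decode step (assumed)

# Static batching: each batch of 4 runs until its LONGEST request finishes;
# shorter requests hold their slot idle as padding.
static_ticks = sum(max(lengths[i:i + capacity])
                   for i in range(0, len(lengths), capacity))

# Continuous batching: a freed slot is refilled immediately, so the GPU needs
# only total_tokens / capacity ticks (ignoring ramp-up and ramp-down).
continuous_ticks = math.ceil(sum(lengths) / capacity)

total_tokens = sum(lengths)
print(f"static:     {total_tokens / static_ticks:.1f} tok/tick")      # 1.6
print(f"continuous: {total_tokens / continuous_ticks:.1f} tok/tick")  # 4.0
```

With this request mix, static batching wastes more than half its slot-ticks on padding, which is exactly the utilization gap continuous batching closes.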
Expert
Throughput-latency trade-off: static batching maximizes throughput at the cost of per-request latency (the first requester waits for the batch to fill). Continuous batching optimizes for both by swapping requests in and out dynamically. Paged attention (vLLM) enables efficient KV cache management for variable-length sequences. Multi-tenant serving adds complications · priority lanes, rate limiting, quota enforcement. Cerebras and SambaNova claim 2000+ tok/s per request on 70B models via exotic hardware · normal GPU serving tops out lower per-request but achieves higher aggregate throughput.
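The "first requester waits for the batch to fill" cost is easy to quantify with a back-of-envelope calculation. The arrival rate and batch size below are assumed numbers, not measurements:

```python
arrival_rate = 2.0   # requests/second hitting the server (assumed)
batch_size = 16      # static batch size (assumed)

# The first request in a static batch must wait for (batch_size - 1) more
# arrivals before decoding even starts.
fill_wait_s = (batch_size - 1) / arrival_rate
print(f"first request queues ~{fill_wait_s:.1f}s before the batch launches")
# Continuous batching would admit it on the next scheduler step instead.
```

At low traffic this queueing delay dominates time-to-first-token, which is why static batching is only attractive for offline/batch workloads.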
Depending on why you're here
- Continuous batching + paged attention + FP8 is the 2026 default
- Per-GPU throughput: H100 hits 2K-4K tok/s on 70B dense with FP8
- Cerebras/SambaNova push per-request speed · GPU clusters win on aggregate
- Optimize for throughput on batch workloads, for latency on interactive ones
- vLLM + continuous batching is the open-source gold standard
- Monitor aggregate tokens/sec · individual request speed is less indicative
- Throughput optimization drives margin · every 2× in throughput halves the GPU fleet needed
- Serving software (vLLM, TRT-LLM, SGLang) is strategic infrastructure
- Cerebras/SambaNova bet differently · per-request speed vs aggregate
- How much AI work a data center can do per second
- Different from "how fast does it type" · that's per-user
- Why some AI providers can handle millions of users at once
Throughput per dollar is the number that actually matters. Every pricing war is a throughput war in disguise.
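Throughput per dollar reduces to one division. The GPU rental price below is an assumed placeholder, and the throughput is the mid-range of the 2000-4000 tok/s H100 figure cited above:

```python
gpu_cost_per_hour = 3.00   # assumed H100 rental price, $/hr
throughput_tok_s = 3000    # aggregate tok/s, mid-range of the cited benchmark

tokens_per_hour = throughput_tok_s * 3600
cost_per_million_tok = gpu_cost_per_hour / (tokens_per_hour / 1e6)
print(f"${cost_per_million_tok:.3f} per million output tokens")
# Doubling throughput halves this number · the pricing war in one line.
```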