Concepts Reading · ~3 min

Throughput

The total tokens-per-second a serving cluster handles across ALL concurrent requests · vs tokens-per-second of a single request.


Level 1

Throughput is a system-level metric: how many total tokens per second your infrastructure serves. Tokens-per-second is per-request. A cluster with 10 users each getting 50 tok/s has 500 tok/s throughput. Optimizing for throughput vs latency is a key trade-off · batching more requests raises throughput but increases per-request latency.
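The arithmetic above can be sketched in a few lines (a toy illustration; the function name is made up):

```python
def aggregate_throughput(per_request_tok_s: float, concurrent_requests: int) -> float:
    """System-level throughput: sum of token rates across all concurrent requests."""
    return per_request_tok_s * concurrent_requests

# The example from above: 10 users at 50 tok/s each.
print(aggregate_throughput(50, 10))  # 500.0 tok/s aggregate
```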

Level 2

Throughput depends on batching strategy, quantization, parallelism (tensor, pipeline, expert), and KV cache management. Continuous batching (vLLM, TRT-LLM) packs requests dynamically, drastically raising utilization. FP8 weights + activations can roughly double throughput on compatible hardware. Mixed-tier serving (small model for easy requests, big for hard) is an emerging pattern. Per-GPU throughput benchmark: an H100 hits 2000-4000 tok/s aggregate on a 70B dense model with FP8 + continuous batching.
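A toy simulation of why continuous batching beats static batching on mixed-length requests. It is heavily idealized (unit cost per token step, no scheduling overhead) and all names are illustrative, not the vLLM API:

```python
import heapq

def static_batch_time(lengths, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])  # whole batch waits for stragglers
    return total

def continuous_batch_time(lengths, batch_size):
    """Continuous batching (idealized): a finished slot is refilled immediately."""
    slots = [0] * batch_size       # min-heap of when each slot becomes free
    heapq.heapify(slots)
    for req in lengths:
        free_at = heapq.heappop(slots)        # earliest-free slot
        heapq.heappush(slots, free_at + req)  # run the next request there
    return max(slots)

# Mixed short and long requests: short ones no longer wait on long ones.
reqs = [10, 100, 10, 100, 10, 100, 10, 100]
print(static_batch_time(reqs, 4), continuous_batch_time(reqs, 4))  # 200 130
```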

Level 3

Throughput-latency trade-off: static batching maximizes throughput at the cost of per-request latency (first requester waits for batch to fill). Continuous batching optimizes for both by swapping requests in and out dynamically. Paged attention (vLLM) enables efficient KV cache management for variable-length sequences. Multi-tenant serving adds complications · priority lanes, rate limiting, quota enforcement. Cerebras and SambaNova claim 2000+ tok/s per request on 70B models via exotic hardware · normal GPU serving tops out lower per-request but higher aggregate.
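A minimal sketch of the paged-attention idea on the memory side: allocate fixed-size KV-cache blocks on demand instead of reserving max-length contiguous memory per request (toy code, not the vLLM implementation; block size and names are illustrative):

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (small fixed blocks, as in vLLM)

class PagedKVCache:
    """Toy block allocator: requests grow block-by-block, so no memory is
    reserved up front for tokens that may never be generated."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # req_id -> (block list, token count)

    def append_token(self, req_id: str):
        blocks, n = self.tables.get(req_id, ([], 0))
        if n % BLOCK_SIZE == 0:              # current block full (or first token)
            blocks = blocks + [self.free.pop()]  # allocate one block on demand
        self.tables[req_id] = (blocks, n + 1)

    def release(self, req_id: str):
        blocks, _ = self.tables.pop(req_id)
        self.free.extend(blocks)             # freed blocks are immediately reusable

cache = PagedKVCache(num_blocks=8)
for _ in range(20):                          # 20 tokens -> ceil(20/16) = 2 blocks
    cache.append_token("req-A")
print(len(cache.tables["req-A"][0]))         # 2
```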

The takeaway for you
If you are a Researcher
  • Continuous batching + paged attention + FP8 is the 2026 default
  • Per-GPU throughput: H100 hits 2K-4K tok/s on 70B dense with FP8
  • Cerebras/SambaNova push per-request speed · GPU clusters win on aggregate
If you are a Builder
  • Optimize for throughput if batch workload, latency if interactive
  • vLLM + continuous batching is the open-source gold standard
  • Monitor aggregate tokens/sec · individual requests are less indicative
If you are an Investor
  • Throughput optimization drives margin · 2× throughput halves the GPU fleet needed
  • Serving software (vLLM, TRT-LLM, SGLang) is strategic infrastructure
  • Cerebras/SambaNova bet differently · per-request speed vs aggregate
If you are a Curious Normie
  • How much AI work a data center can do per second
  • Different from "how fast does it type" · that's per-user
  • Why some AI providers can handle millions of users at once
Gecko's take

Throughput per dollar is the number that actually matters. Every pricing war is a throughput war in disguise.

Common confusion: throughput is not latency. Throughput is system-wide volume; latency is time-to-first-token for one request. They trade off.