Throughput
The total tokens-per-second a serving cluster handles across ALL concurrent requests · vs tokens-per-second of a single request.
Basic
Throughput is a system-level metric: how many total tokens per second your infrastructure serves, summed across all concurrent requests. Tokens-per-second, by contrast, is what a single request sees. A cluster with 10 users each getting 50 tok/s has 500 tok/s throughput. Optimizing for throughput vs latency is a key trade-off · batching more requests means higher throughput but higher per-request latency.
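The arithmetic above can be sketched in a couple of lines (the rates are the illustrative numbers from the example, not a benchmark):

```python
def aggregate_throughput(per_request_rates):
    """System throughput is the sum of the concurrent per-request rates (tok/s)."""
    return sum(per_request_rates)

# 10 concurrent users, each decoding at 50 tok/s
print(aggregate_throughput([50.0] * 10))  # 500.0 tok/s cluster-wide
```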
Deep
Throughput depends on batching strategy, quantization, parallelism (tensor, pipeline, expert), and KV cache management. Continuous batching (vLLM, TRT-LLM) packs requests dynamically, drastically raising utilization. FP8 weights and activations roughly doubles throughput on compatible hardware. Mixed-tier serving (a small model for easy requests, a big one for hard) is an emerging pattern. Per-GPU throughput benchmark: an H100 hits 2000-4000 tok/s aggregate on a 70B dense model with FP8 + continuous batching.
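A toy model makes the continuous-batching gain concrete. Everything here is assumed for illustration: one decode step per tick, a batch capacity of 4 slots, and a made-up mix of short and long requests. It is a sketch of the scheduling idea, not of vLLM's actual scheduler:

```python
import math

lengths = [100, 20, 20, 20, 100, 20, 20, 20]  # output tokens per request (assumed)
capacity = 4                                   # batch slots per decode step (assumed)

# Static batching: each batch of 4 runs until its LONGEST request finishes;
# shorter requests hold their slot idle as padding.
static_ticks = sum(max(lengths[i:i + capacity])
                   for i in range(0, len(lengths), capacity))

# Continuous batching: a freed slot is refilled immediately, so the GPU needs
# only total_tokens / capacity ticks (ignoring ramp-up and ramp-down).
continuous_ticks = math.ceil(sum(lengths) / capacity)

total_tokens = sum(lengths)
print(f"static:     {total_tokens / static_ticks:.1f} tok/tick")      # 1.6
print(f"continuous: {total_tokens / continuous_ticks:.1f} tok/tick")  # 4.0
```

With this request mix, static batching wastes more than half its slot-ticks on padding, which is exactly the utilization gap continuous batching closes.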
Expert
Throughput-latency trade-off: static batching maximizes throughput at the cost of per-request latency (the first requester waits for the batch to fill). Continuous batching optimizes for both by swapping requests in and out dynamically. Paged attention (vLLM) enables efficient KV cache management for variable-length sequences. Multi-tenant serving adds complications · priority lanes, rate limiting, quota enforcement. Cerebras and SambaNova claim 2000+ tok/s per request on 70B models via exotic hardware · normal GPU serving tops out lower per-request but achieves higher aggregate throughput.
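The "first requester waits for the batch to fill" cost is easy to quantify with a back-of-envelope calculation. The arrival rate and batch size below are assumed numbers, not measurements:

```python
arrival_rate = 2.0   # requests/second hitting the server (assumed)
batch_size = 16      # static batch size (assumed)

# The first request in a static batch must wait for (batch_size - 1) more
# arrivals before decoding even starts.
fill_wait_s = (batch_size - 1) / arrival_rate
print(f"first request queues ~{fill_wait_s:.1f}s before the batch launches")
# Continuous batching would admit it on the next scheduler step instead.
```

At low traffic this queueing delay dominates time-to-first-token, which is why static batching is only attractive for offline/batch workloads.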
Depending on why you're here
- Continuous batching + paged attention + FP8 is the 2026 default
- Per-GPU throughput: H100 hits 2K-4K tok/s on 70B dense with FP8
- Cerebras/SambaNova push per-request speed · GPU clusters win on aggregate
- Optimize for throughput on batch workloads, for latency on interactive ones
- vLLM + continuous batching is the open-source gold standard
- Monitor aggregate tokens/sec · individual request speed is less indicative
- Throughput optimization drives margin · every 2× in throughput halves the GPU fleet needed
- Serving software (vLLM, TRT-LLM, SGLang) is strategic infrastructure
- Cerebras/SambaNova bet differently · per-request speed vs aggregate
- How much AI work a data center can do per second
- Different from "how fast does it type" · that's per-user
- Why some AI providers can handle millions of users at once
Throughput per dollar is the number that actually matters. Every pricing war is a throughput war in disguise.
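Throughput per dollar reduces to one division. The GPU rental price below is an assumed placeholder, and the throughput is the mid-range of the 2000-4000 tok/s H100 figure cited above:

```python
gpu_cost_per_hour = 3.00   # assumed H100 rental price, $/hr
throughput_tok_s = 3000    # aggregate tok/s, mid-range of the cited benchmark

tokens_per_hour = throughput_tok_s * 3600
cost_per_million_tok = gpu_cost_per_hour / (tokens_per_hour / 1e6)
print(f"${cost_per_million_tok:.3f} per million output tokens")
# Doubling throughput halves this number · the pricing war in one line.
```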