Inference
The process of running a trained model to generate predictions; every API call is inference.
Basic
Inference is what happens every time you send a prompt and get a response. It's distinct from training (which teaches the model). Inference costs scale with model size, hardware, optimization (quantization, batching), and request volume. It represents the majority of ongoing AI compute spending in production systems: typically 60-80% of total model lifecycle cost.
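To see why inference dominates lifecycle cost, compare a one-time training bill against ongoing serving. All numbers below are illustrative assumptions (not measurements from any real deployment), chosen only to show how serving volume pushes inference into the 60-80% range:

```python
# Illustrative lifecycle-cost arithmetic. Every number is an assumption.
training_cost = 10_000_000      # one-time training run, USD (assumed)
cost_per_m_tokens = 2.00        # serving cost per million tokens, USD (assumed)
monthly_tokens_m = 1_000_000    # 1T tokens served per month, in millions (assumed)

monthly_inference = cost_per_m_tokens * monthly_tokens_m   # USD per month
year_inference = 12 * monthly_inference                    # USD in year one

share = year_inference / (year_inference + training_cost)
print(f"Inference share of year-one lifecycle cost: {share:.0%}")
```

At this (assumed) volume, inference is already about 71% of year-one spend, and the share only grows in later years because training was paid once.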
Deep
Inference has two phases: prefill (process the input prompt in parallel) and decode (generate output tokens one at a time). Prefill is compute-bound; decode is memory-bandwidth-bound due to KV cache reads. This is why long outputs are expensive: each generated token requires a full forward pass that re-reads the entire, growing KV cache. Optimizations include PagedAttention (efficient KV cache management), speculative decoding (a small draft model proposes tokens that the main model verifies in one pass), continuous batching (pack requests dynamically), and quantization (INT8/FP8 weights and activations).
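The two phases can be sketched with a toy single-head attention layer. The weights and shapes below are made up; the point is only the data flow: prefill processes all prompt tokens in one parallel pass, while decode appends one key/value row per step and re-reads the whole cache.

```python
import numpy as np

# Toy single-head attention (dimensions and weights are arbitrary stand-ins).
d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

# --- Prefill: all prompt tokens in one parallel pass (compute-bound) ---
prompt = rng.normal(size=(5, d))   # five prompt "embeddings"
K_cache = prompt @ Wk              # keys/values computed once and cached
V_cache = prompt @ Wv

# --- Decode: one token at a time; each step re-reads the whole cache ---
x = prompt[-1]
for _ in range(3):
    K_cache = np.vstack([K_cache, x @ Wk])   # cache grows one row per token
    V_cache = np.vstack([V_cache, x @ Wv])
    q = x @ Wq
    x = attend(q, K_cache, V_cache)          # reads the entire cache (bandwidth-bound)

print(K_cache.shape)   # cache grew from (5, 8) to (8, 8)
```

Note how decode never recomputes the prompt's keys and values: that is the KV cache's job, and streaming it from memory every step is exactly why decode is bandwidth-bound.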
Expert
Serving frameworks: vLLM, TensorRT-LLM, and SGLang are the production leaders. FlashAttention provides efficient attention kernels; PagedAttention treats the KV cache as paged virtual memory. Speculative decoding yields 2-3× throughput on match-heavy workloads. Continuous batching vs static batching: static wastes capacity when requests finish early; continuous packs new requests into the batch mid-flight. FP8 at both weight and activation gives roughly 2× throughput vs FP16 on H100+. Memory bandwidth is the dominant constraint for LLM decode: tokens/sec scales linearly with memory bandwidth, not FLOPS.
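The bandwidth claim can be checked with back-of-envelope arithmetic: every decoded token must stream all model weights from HBM, so single-sequence throughput is bounded by bandwidth divided by model size in bytes. The figures below assume an H100 SXM's nominal ~3.35 TB/s HBM3 bandwidth and a 70B-parameter dense model, and ignore KV cache traffic:

```python
# Decode throughput upper bound: tokens/s <= HBM bandwidth / weight bytes.
# Hardware and model numbers are nominal assumptions, not benchmarks.
hbm_bandwidth = 3.35e12   # H100 SXM HBM3, bytes/s (nominal spec)
params = 70e9             # 70B-parameter dense model

for name, bytes_per_param in [("FP16", 2), ("FP8", 1)]:
    weight_bytes = params * bytes_per_param
    tok_per_s = hbm_bandwidth / weight_bytes   # single-sequence bound
    print(f"{name}: ~{tok_per_s:.0f} tok/s per sequence (ignoring KV cache)")
```

Halving bytes per parameter doubles the bound (~24 vs ~48 tok/s here), which is the memory-side half of the "FP8 gives roughly 2× throughput" claim; batching then amortizes the same weight reads across many sequences.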
Depending on why you're here
- Prefill = compute-bound, decode = memory-bandwidth-bound
- vLLM + PagedAttention + continuous batching is the 2026 default
- FP8 weights + activations doubles throughput on H100+
- Optimize for your workload shape: long output = memory-constrained
- Use FP8 serving in production: minimal quality loss, 2× cheaper
- Batch aggressively for throughput, use speculative decoding for interactive latency
- Inference is 60-80% of model lifecycle cost, not training
- Cost compression via MoE + FP8 + distillation: 30× drop since 2023
- Inference infra is the new margin battleground
- What the AI does when it answers you
- Cheaper and faster than training
- The reason AI APIs keep getting cheaper
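The speculative-decoding bullet can be sketched with toy string "models": a cheap draft proposes a few tokens per round, and the target accepts the longest correct prefix plus one correction, so most rounds emit several tokens per expensive pass. Both models here are hard-coded stand-ins, not real networks:

```python
# Toy speculative decoding over characters. Both "models" are stubs.
target_text = list("the quick brown fox")

def target_next(prefix):
    # Expensive target stub: one correct token per call.
    return target_text[len(prefix)]

def draft_propose(prefix, k):
    # Cheap draft stub: fast but imperfect (wrong near the end, on purpose).
    guess = list("the quick brown fax")
    return guess[len(prefix):len(prefix) + k]

out, target_passes = [], 0
while len(out) < len(target_text):
    proposal = draft_propose(out, 4)
    target_passes += 1   # one parallel verify pass per round
    accepted = []
    for i, tok in enumerate(proposal):
        if len(out) + i < len(target_text) and tok == target_text[len(out) + i]:
            accepted.append(tok)   # draft token confirmed by the target
        else:
            break
    out += accepted
    if len(out) < len(target_text):
        out.append(target_next(out))   # the verify pass also emits one token
print("".join(out))
```

This toy run emits 19 tokens in 5 target passes instead of 19 sequential ones, which is where the interactive-latency win comes from when the draft's guesses mostly match.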
Inference optimization is where the next 10× cost reduction lives. Every frontier lab is racing to ship the best serving stack.