Inference
The process of running a trained model to generate predictions; every API call is inference.
Basic
Inference is what happens every time you send a prompt and get a response. It's distinct from training (which teaches the model). Inference costs scale with model size, hardware, optimization (quantization, batching), and request volume. It represents the majority of ongoing AI compute spending in production systems: typically 60-80% of total model lifecycle cost.
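To see why inference dominates lifecycle cost, compare a one-time training bill against ongoing serving. All numbers below are illustrative assumptions (not measurements from any real deployment), chosen only to show how serving volume pushes inference into the 60-80% range:

```python
# Illustrative lifecycle-cost arithmetic. Every number is an assumption.
training_cost = 10_000_000      # one-time training run, USD (assumed)
cost_per_m_tokens = 2.00        # serving cost per million tokens, USD (assumed)
monthly_tokens_m = 1_000_000    # 1T tokens served per month, in millions (assumed)

monthly_inference = cost_per_m_tokens * monthly_tokens_m   # USD per month
year_inference = 12 * monthly_inference                    # USD in year one

share = year_inference / (year_inference + training_cost)
print(f"Inference share of year-one lifecycle cost: {share:.0%}")
```

At this (assumed) volume, inference is already about 71% of year-one spend, and the share only grows in later years because training was paid once.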
Deep
Inference has two phases: prefill (process the input prompt in parallel) and decode (generate output tokens one at a time). Prefill is compute-bound; decode is memory-bandwidth-bound due to KV cache reads. This is why long outputs are expensive: each generated token requires a full forward pass that re-reads the entire, growing KV cache. Optimizations include PagedAttention (efficient KV cache management), speculative decoding (a small draft model proposes tokens that the main model verifies in one pass), continuous batching (pack requests dynamically), and quantization (INT8/FP8 weights and activations).
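The two phases can be sketched with a toy single-head attention layer. The weights and shapes below are made up; the point is only the data flow: prefill processes all prompt tokens in one parallel pass, while decode appends one key/value row per step and re-reads the whole cache.

```python
import numpy as np

# Toy single-head attention (dimensions and weights are arbitrary stand-ins).
d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

# --- Prefill: all prompt tokens in one parallel pass (compute-bound) ---
prompt = rng.normal(size=(5, d))   # five prompt "embeddings"
K_cache = prompt @ Wk              # keys/values computed once and cached
V_cache = prompt @ Wv

# --- Decode: one token at a time; each step re-reads the whole cache ---
x = prompt[-1]
for _ in range(3):
    K_cache = np.vstack([K_cache, x @ Wk])   # cache grows one row per token
    V_cache = np.vstack([V_cache, x @ Wv])
    q = x @ Wq
    x = attend(q, K_cache, V_cache)          # reads the entire cache (bandwidth-bound)

print(K_cache.shape)   # cache grew from (5, 8) to (8, 8)
```

Note how decode never recomputes the prompt's keys and values: that is the KV cache's job, and streaming it from memory every step is exactly why decode is bandwidth-bound.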
Expert
Serving frameworks: vLLM, TensorRT-LLM, and SGLang are the production leaders. FlashAttention provides efficient attention kernels; PagedAttention treats the KV cache as paged virtual memory. Speculative decoding yields 2-3× throughput on match-heavy workloads. Continuous batching vs static batching: static wastes capacity when requests finish early; continuous packs new requests into the batch mid-flight. FP8 at both weight and activation gives roughly 2× throughput vs FP16 on H100+. Memory bandwidth is the dominant constraint for LLM decode: tokens/sec scales linearly with memory bandwidth, not FLOPS.
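The bandwidth claim can be checked with back-of-envelope arithmetic: every decoded token must stream all model weights from HBM, so single-sequence throughput is bounded by bandwidth divided by model size in bytes. The figures below assume an H100 SXM's nominal ~3.35 TB/s HBM3 bandwidth and a 70B-parameter dense model, and ignore KV cache traffic:

```python
# Decode throughput upper bound: tokens/s <= HBM bandwidth / weight bytes.
# Hardware and model numbers are nominal assumptions, not benchmarks.
hbm_bandwidth = 3.35e12   # H100 SXM HBM3, bytes/s (nominal spec)
params = 70e9             # 70B-parameter dense model

for name, bytes_per_param in [("FP16", 2), ("FP8", 1)]:
    weight_bytes = params * bytes_per_param
    tok_per_s = hbm_bandwidth / weight_bytes   # single-sequence bound
    print(f"{name}: ~{tok_per_s:.0f} tok/s per sequence (ignoring KV cache)")
```

Halving bytes per parameter doubles the bound (~24 vs ~48 tok/s here), which is the memory-side half of the "FP8 gives roughly 2× throughput" claim; batching then amortizes the same weight reads across many sequences.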
Depending on why you're here
- Prefill = compute-bound, decode = memory-bandwidth-bound
- vLLM + PagedAttention + continuous batching is the 2026 default
- FP8 weights + activations doubles throughput on H100+
- Optimize for your workload shape: long output = memory-constrained
- Use FP8 serving in production: minimal quality loss, 2× cheaper
- Batch aggressively for throughput, use speculative decoding for interactive latency
- Inference is 60-80% of model lifecycle cost, not training
- Cost compression via MoE + FP8 + distillation: 30× drop since 2023
- Inference infra is the new margin battleground
- What the AI does when it answers you
- Cheaper and faster than training
- The reason AI APIs keep getting cheaper
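The speculative-decoding bullet can be sketched with toy string "models": a cheap draft proposes a few tokens per round, and the target accepts the longest correct prefix plus one correction, so most rounds emit several tokens per expensive pass. Both models here are hard-coded stand-ins, not real networks:

```python
# Toy speculative decoding over characters. Both "models" are stubs.
target_text = list("the quick brown fox")

def target_next(prefix):
    # Expensive target stub: one correct token per call.
    return target_text[len(prefix)]

def draft_propose(prefix, k):
    # Cheap draft stub: fast but imperfect (wrong near the end, on purpose).
    guess = list("the quick brown fax")
    return guess[len(prefix):len(prefix) + k]

out, target_passes = [], 0
while len(out) < len(target_text):
    proposal = draft_propose(out, 4)
    target_passes += 1   # one parallel verify pass per round
    accepted = []
    for i, tok in enumerate(proposal):
        if len(out) + i < len(target_text) and tok == target_text[len(out) + i]:
            accepted.append(tok)   # draft token confirmed by the target
        else:
            break
    out += accepted
    if len(out) < len(target_text):
        out.append(target_next(out))   # the verify pass also emits one token
print("".join(out))
```

This toy run emits 19 tokens in 5 target passes instead of 19 sequential ones, which is where the interactive-latency win comes from when the draft's guesses mostly match.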
Inference optimization is where the next 10× cost reduction lives. Every frontier lab is racing to ship the best serving stack.