KV Cache Compression
KV cache compression reduces the memory cost of long-context LLM inference · via quantization, eviction, or sparsity · cuts VRAM 2-8× for long contexts.
Basic
In transformer inference, every generated token requires storing key + value tensors for all prior tokens (the KV cache). For 128K context, this can exceed 30GB on a 70B model · often the bottleneck before compute. KV cache compression shrinks this memory via: quantization (4-bit/8-bit KV), eviction (drop less-important tokens), sliding windows, or sparse attention. Essential for serving long context efficiently.
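The 30GB figure can be sanity-checked with simple arithmetic. This sketch assumes a hypothetical 70B-class configuration (80 layers, 8 grouped-query KV heads, head dimension 128, FP16 cache) · illustrative numbers, not any specific model:

```python
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim
#                 * seq_len * bytes_per_element
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 70B-class config: 80 layers, 8 KV heads (GQA),
# head_dim 128, 128K context, FP16 (2 bytes per element).
gb = kv_cache_bytes(80, 8, 128, 131072, 2) / 1e9
print(f"{gb:.1f} GB per sequence")  # → 42.9 GB per sequence
```

Even with GQA already shrinking the head count, a single 128K-token sequence needs tens of gigabytes · which is why compression on top of the architecture still pays off.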
Deep
Techniques: (1) KV quantization (storing K/V tensors in INT8/INT4) cuts memory 2-4× with minor quality loss. (2) Eviction policies (H2O, StreamingLLM) drop tokens that receive little attention. (3) Head sharing (MQA, GQA) shrinks the cache architecturally by reducing the number of KV heads stored. (4) Prefix caching (Anthropic, OpenAI) reuses KV across sessions · not compression per se, but related. vLLM, SGLang, and TensorRT-LLM all ship KV cache management · critical for production long-context serving.
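As an illustration of technique (1), a minimal NumPy sketch of symmetric INT8 quantization with per-token scales · the function names are hypothetical, and real kernels typically quantize per channel or per group and fuse this into the attention op:

```python
import numpy as np

def quantize_kv(x):
    # Symmetric quantization: one scale per token row (last axis = head_dim).
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)   # guard all-zero rows
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale                              # cache int8 values + fp scales

def dequantize_kv(q, scale):
    return q.astype(np.float32) * scale

k = np.random.randn(4, 128).astype(np.float32)   # toy (tokens, head_dim) keys
q, s = quantize_kv(k)
roundtrip_err = np.abs(dequantize_kv(q, s) - k).max()
print(q.nbytes, k.nbytes)                        # int8 payload is 4x smaller than fp32
```

The stored cache shrinks 4× versus FP32 (2× versus FP16) at the cost of a small per-element rounding error · the "minor quality loss" in practice.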
Expert
Advanced methods: (a) H2O (Heavy-Hitter Oracle, NeurIPS 2023) · keeps only tokens with high accumulated attention scores; (b) StreamingLLM (ICLR 2024) · keeps initial "attention sink" tokens plus a sliding window, enabling infinite-length streams; (c) YaRN-style context extension paired with scale-aware quantization for RoPE-based models. Frontier production servers combine multiple techniques · e.g., vLLM's paged KV cache with dynamic quantization. Memory savings of 4-8× are standard in 2026 production with minor (<2%) quality impact.
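A toy sketch combining the H2O and StreamingLLM ideas · keep attention sinks, a recent window, and the heaviest-hitting middle tokens. The function and its parameters are illustrative assumptions, not the papers' exact algorithms:

```python
import numpy as np

def select_kept_tokens(attn_scores, n_sink=4, n_recent=8, n_heavy=16):
    """Pick KV entries to keep: first n_sink tokens (attention sinks),
    last n_recent tokens (sliding window), and the n_heavy remaining
    tokens with the highest accumulated attention (heavy hitters)."""
    acc = attn_scores.sum(axis=0)                # per-token accumulated mass
    n = acc.shape[0]
    keep = set(range(n_sink)) | set(range(max(0, n - n_recent), n))
    middle = [i for i in np.argsort(acc)[::-1] if i not in keep]
    keep |= set(middle[:n_heavy])
    return sorted(keep)                          # KV indices to retain

scores = np.random.rand(32, 64)                  # toy (queries, keys) attention
kept = select_kept_tokens(scores)
print(len(kept))  # 4 sinks + 8 recent + 16 heavy hitters = 28 of 64 kept
```

Here the cache shrinks from 64 to 28 entries; production implementations make the same decision per head and per layer, online, as new tokens arrive.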
Long-context (1M+) LLMs need KV cache compression to be economically viable · every major inference server now ships it.
Depending on why you're here
- Quantization, eviction, sparsity, sliding windows
- H2O, StreamingLLM, YaRN as key techniques
- 4-8× memory savings at <2% quality impact
- Check whether your inference server uses KV quantization
- For 100K+ contexts, KV compression is essential
- vLLM, SGLang, TensorRT-LLM all ship modern methods
- Makes long-context serving economics viable
- Key reason 1M+ context APIs are priced affordably in 2026
- Silicon + software co-design lever
- Ways to make AI remember less but still work
- Needed for AI that handles long documents
- Invisible to users · happens inside the AI
KV cache compression is the quiet reason 1M-context LLMs are affordable · silicon + software saving 4-8× memory invisibly.