Quantization
Compressing model weights to lower-precision numbers (INT8, INT4, FP8) to cut memory use and speed up inference.
Basic
Modern models are trained in FP16 or BF16 (16-bit floating point). Quantization converts weights to INT8, INT4, or FP8 after training. You trade a small quality hit for 2-4× less memory and 2-4× faster inference. Most open-weight models ship quantized variants (GGUF, AWQ, GPTQ). On closed APIs, providers quantize internally but don't expose the precision.
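The memory arithmetic behind those savings is simple. A back-of-envelope sketch for a hypothetical 7B-parameter model (weights only; real checkpoint files also store scales, metadata, and embeddings):

```python
# Back-of-envelope weight memory for a 7B-parameter model.
# Illustrative only: real quantized files add scales and metadata.
PARAMS = 7_000_000_000

def weights_gb(bits_per_weight: int) -> float:
    """GB needed to hold PARAMS weights at the given precision."""
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weights_gb(bits):.1f} GB")
```

At FP16 the weights alone need about 14 GB, which overflows a 12 GB consumer GPU; at INT4 they fit in about 3.5 GB.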
Deep
Post-Training Quantization (PTQ) is applied after training and is fast. Methods: GPTQ (second-order error minimization), AWQ (activation-aware), SmoothQuant (migrates outliers into the weights). Quantization-Aware Training (QAT) trains with a quantized forward pass and a floating-point backward pass; slower but higher quality. Weight-only vs. weight+activation: weight-only is more forgiving, while activation quantization requires calibration data. Typical quality loss at 4-bit: 1-3 points on benchmarks; at 2-bit: 5-15 points. GGUF k-quants mix precision across layers and blocks. Frontier serving uses FP8 for both weights and activations to leverage Tensor Core acceleration.
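The weight-only vs. weight+activation distinction above can be made concrete: weights are fully known before quantization, but an activation scale has to be estimated from representative inputs. A minimal calibration sketch (NumPy, hypothetical shapes and data) using a symmetric, max-calibrated INT8 scale:

```python
import numpy as np

def calibrate_activation_scale(batches, bits=8):
    """Symmetric scale from the largest magnitude seen on calibration data."""
    amax = max(np.abs(b).max() for b in batches)
    return amax / (2 ** (bits - 1) - 1)   # maps amax -> 127 for INT8

# Hypothetical calibration set: a few batches of one layer's activations.
rng = np.random.default_rng(0)
calib = [rng.standard_normal((32, 512)) for _ in range(8)]

s = calibrate_activation_scale(calib)
x = calib[0]
x_q = np.clip(np.round(x / s), -128, 127).astype(np.int8)  # quantize
x_hat = x_q.astype(np.float64) * s                         # dequantize
```

With too few calibration batches the scale under-covers rare outliers; LLM activations have exactly such outliers, which is the problem SmoothQuant-style migration addresses.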
Expert
Uniform quantization: W_q = round(W / s + z), where s is the scale and z the zero-point. Per-tensor scales are simplest; per-channel or per-group scales (group_size=128 is common) preserve outliers. GPTQ solves W_q = argmin_{W_q} ||WX − W_q X||² with Hessian-weighted updates. AWQ identifies the ~1% of salient channels from activation statistics and protects them via per-channel scaling rather than mixed precision, quantizing the rest aggressively. SmoothQuant migrates activation variance into the weights via a per-channel scaling vector, enabling uniform activation quantization. Kernel-level: Tensor Core GEMM natively accelerates FP8 × FP8; INT4 requires custom kernels (Marlin, ExLlamaV2).
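A sketch of the uniform formula above with per-group scales (NumPy; group_size=128 as in the text, asymmetric zero-point; an illustration of the math, not a production kernel):

```python
import numpy as np

def quantize_groups(W, bits=4, group_size=128):
    """W_q = round(W / s + z) with one (s, z) pair per group of weights."""
    qmax = 2 ** bits - 1
    G = W.reshape(-1, group_size)
    wmin = G.min(axis=1, keepdims=True)
    wmax = G.max(axis=1, keepdims=True)
    s = np.maximum((wmax - wmin) / qmax, 1e-12)  # per-group scale
    z = np.round(-wmin / s)                      # per-group zero-point
    Wq = np.clip(np.round(G / s + z), 0, qmax)
    return Wq.astype(np.uint8), s, z

def dequantize_groups(Wq, s, z, shape):
    """Recover approximate weights: W ≈ (W_q - z) * s."""
    return ((Wq.astype(np.float64) - z) * s).reshape(shape)

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512))
Wq, s, z = quantize_groups(W)
W_hat = dequantize_groups(Wq, s, z, W.shape)
```

Per-element reconstruction error is bounded by about one quantization step of that element's group; shrinking group_size tightens s around local outliers at the cost of storing more scales.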
Depending on why you're here
- Uniform quantization is the default; non-uniform schemes (GGUF) trade complexity for quality
- GPTQ, AWQ, and SmoothQuant are the dominant PTQ recipes
- FP8 is replacing INT8 at the Tensor Core frontier
- Use FP8 or INT8 weights for production serving; ~2× speedup with minimal quality loss
- Use INT4 for edge and consumer-GPU deployment; check benchmark-specific quality loss
- Benchmark your task before and after quantization
- Quantization extends the useful life of every GPU generation
- It commoditizes access to frontier-scale models on consumer hardware
- It weakens the closed-API pricing moat as quantized open models close the quality gap
- Shrinking AI model files so they run faster and fit on smaller hardware
- Trades a tiny bit of accuracy for big speed gains
- Why you can run open-source models on a laptop
FP8 weight-and-activation serving is the 2026 default. Anyone still running FP16 in production is overpaying by 2×.
Read the primary sources
- GPTQ paper (2022), arxiv.org
- AWQ paper (2023), arxiv.org
- SmoothQuant paper (2023), arxiv.org