
Quantization

Compressing model weights to lower-precision numbers (INT8, INT4, FP8) to cut memory use and speed up inference.

Level 1

Modern models are trained in FP16 or BF16 (16-bit floating point). Quantization converts weights to INT8, INT4, or FP8 after training. You trade a small quality hit for 2-4× less memory and 2-4× faster inference. Most open-weight models ship quantized variants (GGUF, AWQ, GPTQ). On closed APIs, providers quantize internally but don't expose the precision.
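The memory savings above are easy to sanity-check with back-of-envelope arithmetic. A rough sketch for a hypothetical 7B-parameter model (weights only; real deployments add KV cache, activations, and runtime overhead on top):

```python
# Back-of-envelope weight memory for a 7B-parameter model at common precisions.
# Illustrative only: ignores KV cache, activations, and framework overhead.
PARAMS = 7_000_000_000
BYTES_PER_PARAM = {"FP16/BF16": 2.0, "FP8/INT8": 1.0, "INT4": 0.5}

for fmt, nbytes in BYTES_PER_PARAM.items():
    gib = PARAMS * nbytes / 1024**3
    print(f"{fmt:>10}: {gib:5.1f} GiB")
# FP16 -> ~13 GiB, INT8 -> ~6.5 GiB, INT4 -> ~3.3 GiB
```

This is where the "2-4× less memory" figure comes from: halving bytes per weight halves the weight footprint.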

Level 2

Post-Training Quantization (PTQ) is applied after training and is fast. Dominant methods: GPTQ (second-order error minimization), AWQ (activation-aware scaling), SmoothQuant (migrates activation outliers into weights). Quantization-Aware Training (QAT) runs the forward pass quantized and the backward pass in float; it is slower but yields better quality. Weight-only quantization is more forgiving than weight+activation, which requires calibration data. Typical quality loss at 4-bit is 1-3 benchmark points; at 2-bit, 5-15 points. Lookup-based schemes (GGUF k-quants) mix precision across layers. Frontier serving uses FP8 for both weights and activations to leverage Tensor Core acceleration.
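The simplest PTQ baseline that the methods above improve on is weight-only round-to-nearest with a single symmetric scale per tensor. A minimal sketch (this is the naive baseline, not GPTQ or AWQ):

```python
import numpy as np

# Naive weight-only INT8 PTQ: symmetric per-tensor scale s = max|W| / 127.
# Round-to-nearest baseline; GPTQ/AWQ improve on this with calibration data.
def quantize_int8(w: np.ndarray):
    s = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / s), -127, 127).astype(np.int8)
    return q, s

def dequantize(q: np.ndarray, s: float) -> np.ndarray:
    return q.astype(np.float32) * s

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)
q, s = quantize_int8(w)
max_err = np.abs(w - dequantize(q, s)).max()
print(f"scale={s:.2e}, max abs error={max_err:.2e}")  # error bounded by s/2
```

The per-element error is bounded by half the scale, which is why a handful of outlier weights (which inflate `s`) hurt everything else — the motivation for per-channel/per-group scales and for AWQ's outlier handling.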

Level 3

Uniform quantization: W_q = round(W / s + z), where s is the scale and z the zero-point. Per-tensor scales are simplest; per-channel or per-group scales (group_size = 128 is common) better preserve outliers. GPTQ solves W_q = argmin ||WX − W_q X||² with Hessian-weighted updates. AWQ identifies the 1-2% of salient channels, keeps them in higher precision, and quantizes the rest aggressively. SmoothQuant migrates activation variance into the weights via a per-channel scaling vector, making activations easy to quantize uniformly. At the kernel level, Tensor Core GEMMs accelerate FP8 × FP8 natively; INT4 requires custom kernels (Marlin, ExLlamaV2).
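The uniform formula with per-group scales can be sketched directly. This is an illustration of W_q = round(W / s + z) with one (scale, zero-point) pair per group of 128 weights, not any particular library's kernel:

```python
import numpy as np

# Per-group asymmetric uniform quantization: W_q = round(W / s + z),
# with one (s, z) pair per group of `group_size` consecutive weights.
def quantize_groups(w: np.ndarray, bits: int = 4, group_size: int = 128):
    qmax = 2**bits - 1
    g = w.reshape(-1, group_size)                  # each row is one group
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    s = (hi - lo) / qmax                           # per-group scale
    z = np.round(-lo / s)                          # per-group zero-point
    q = np.clip(np.round(g / s + z), 0, qmax)
    return q, s, z

def dequantize_groups(q, s, z, shape):
    return ((q - z) * s).reshape(shape)

rng = np.random.default_rng(1)
w = rng.normal(size=(4, 256)).astype(np.float32)
q, s, z = quantize_groups(w, bits=4, group_size=128)
w_hat = dequantize_groups(q, s, z, w.shape)
print(f"mean abs error: {np.abs(w - w_hat).mean():.3f}")
```

Smaller groups mean more scales to store (overhead) but each scale covers fewer outliers, so reconstruction error drops: the group_size = 128 convention is that trade-off's usual sweet spot.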

The takeaway for you
If you are a
Researcher
  • Uniform quantization is the default; non-uniform schemes (GGUF) trade complexity for quality
  • GPTQ, AWQ, and SmoothQuant are the dominant PTQ recipes
  • FP8 is replacing INT8 at the Tensor Core frontier
If you are a
Builder
  • Use FP8 or INT8 weights for production serving: roughly 2× speedup with minimal quality loss
  • INT4 for edge and consumer-GPU deployment; check benchmark-specific quality loss
  • Benchmark your task before and after quantization
If you are a
Investor
  • Quantization extends the useful life of every GPU generation
  • Commoditizes access to frontier-scale models on consumer hardware
  • Weakens the closed-API pricing moat as quantized open models close the quality gap
If you are a
Curious Normie
  • Shrinking AI model files so they run faster and fit on smaller hardware
  • Trades a tiny bit of accuracy for big speed gains
  • Why you can run open-source models on a laptop
Gecko's take

FP8 weight-and-activation serving is the 2026 default. Anyone still running FP16 in production is overpaying by 2×.

For weights-only quantization: GPTQ or AWQ for INT4, GGUF k-quants for flexible bit-widths. For production inference: FP8 with SmoothQuant or an equivalent activation-scaling recipe. Always benchmark on your own task.
Canonical sources