Quantization
Compressing model weights to lower-precision numbers (INT8, INT4, FP8) to cut memory use and speed up inference.
Basic
Modern models are trained in FP16 or BF16 (16-bit floating point). Quantization converts weights to INT8, INT4, or FP8 after training. You trade a small quality hit for 2-4× less memory and 2-4× faster inference. Most open-weight models ship quantized variants (GGUF, AWQ, GPTQ). On closed APIs, providers quantize internally but don't expose the precision.
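The memory arithmetic behind those savings is simple. A back-of-envelope sketch for a hypothetical 7B-parameter model (weights only; real checkpoint files also store scales, metadata, and embeddings):

```python
# Back-of-envelope weight memory for a 7B-parameter model.
# Illustrative only: real quantized files add scales and metadata.
PARAMS = 7_000_000_000

def weights_gb(bits_per_weight: int) -> float:
    """GB needed to hold PARAMS weights at the given precision."""
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weights_gb(bits):.1f} GB")
```

At FP16 the weights alone need about 14 GB, which overflows a 12 GB consumer GPU; at INT4 they fit in about 3.5 GB.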
Deep
Post-Training Quantization (PTQ) is applied after training and is fast. Methods: GPTQ (second-order error minimization), AWQ (activation-aware), SmoothQuant (migrates outliers into the weights). Quantization-Aware Training (QAT) trains with a quantized forward pass and a floating-point backward pass; slower but higher quality. Weight-only vs. weight+activation: weight-only is more forgiving, while activation quantization requires calibration data. Typical quality loss at 4-bit: 1-3 points on benchmarks; at 2-bit: 5-15 points. GGUF k-quants mix precision across layers and blocks. Frontier serving uses FP8 for both weights and activations to leverage Tensor Core acceleration.
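The weight-only vs. weight+activation distinction above can be made concrete: weights are fully known before quantization, but an activation scale has to be estimated from representative inputs. A minimal calibration sketch (NumPy, hypothetical shapes and data) using a symmetric, max-calibrated INT8 scale:

```python
import numpy as np

def calibrate_activation_scale(batches, bits=8):
    """Symmetric scale from the largest magnitude seen on calibration data."""
    amax = max(np.abs(b).max() for b in batches)
    return amax / (2 ** (bits - 1) - 1)   # maps amax -> 127 for INT8

# Hypothetical calibration set: a few batches of one layer's activations.
rng = np.random.default_rng(0)
calib = [rng.standard_normal((32, 512)) for _ in range(8)]

s = calibrate_activation_scale(calib)
x = calib[0]
x_q = np.clip(np.round(x / s), -128, 127).astype(np.int8)  # quantize
x_hat = x_q.astype(np.float64) * s                         # dequantize
```

With too few calibration batches the scale under-covers rare outliers; LLM activations have exactly such outliers, which is the problem SmoothQuant-style migration addresses.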
Expert
Uniform quantization: W_q = round(W / s + z), where s is the scale and z the zero-point. Per-tensor scales are simplest; per-channel or per-group scales (group_size=128 is common) preserve outliers. GPTQ solves W_q = argmin_{W_q} ||WX − W_q X||² with Hessian-weighted updates. AWQ identifies the ~1% of salient channels from activation statistics and protects them via per-channel scaling rather than mixed precision, quantizing the rest aggressively. SmoothQuant migrates activation variance into the weights via a per-channel scaling vector, enabling uniform activation quantization. Kernel-level: Tensor Core GEMM natively accelerates FP8 × FP8; INT4 requires custom kernels (Marlin, ExLlamaV2).
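A sketch of the uniform formula above with per-group scales (NumPy; group_size=128 as in the text, asymmetric zero-point; an illustration of the math, not a production kernel):

```python
import numpy as np

def quantize_groups(W, bits=4, group_size=128):
    """W_q = round(W / s + z) with one (s, z) pair per group of weights."""
    qmax = 2 ** bits - 1
    G = W.reshape(-1, group_size)
    wmin = G.min(axis=1, keepdims=True)
    wmax = G.max(axis=1, keepdims=True)
    s = np.maximum((wmax - wmin) / qmax, 1e-12)  # per-group scale
    z = np.round(-wmin / s)                      # per-group zero-point
    Wq = np.clip(np.round(G / s + z), 0, qmax)
    return Wq.astype(np.uint8), s, z

def dequantize_groups(Wq, s, z, shape):
    """Recover approximate weights: W ≈ (W_q - z) * s."""
    return ((Wq.astype(np.float64) - z) * s).reshape(shape)

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512))
Wq, s, z = quantize_groups(W)
W_hat = dequantize_groups(Wq, s, z, W.shape)
```

Per-element reconstruction error is bounded by about one quantization step of that element's group; shrinking group_size tightens s around local outliers at the cost of storing more scales.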
Depending on why you're here
- Uniform quantization is the default; non-uniform schemes (GGUF) trade complexity for quality
- GPTQ, AWQ, and SmoothQuant are the dominant PTQ recipes
- FP8 is replacing INT8 at the Tensor Core frontier
- Use FP8 or INT8 weights for production serving; ~2× speedup with minimal quality loss
- Use INT4 for edge and consumer-GPU deployment; check benchmark-specific quality loss
- Benchmark your task before and after quantization
- Quantization extends the useful life of every GPU generation
- It commoditizes access to frontier-scale models on consumer hardware
- It weakens the closed-API pricing moat as quantized open models close the quality gap
- Shrinking AI model files so they run faster and fit on smaller hardware
- Trades a tiny bit of accuracy for big speed gains
- Why you can run open-source models on a laptop
FP8 weight-and-activation serving is the 2026 default. Anyone still running FP16 in production is overpaying by 2×.
Read the primary sources
- GPTQ paper (2022), arxiv.org
- AWQ paper (2023), arxiv.org
- SmoothQuant paper (2023), arxiv.org