Sliding Window Attention
Sliding window attention restricts each token to a local window of K neighbors · cuts attention compute from O(N²) to O(N·K) · used in Mistral, Gemma, and Llama 4 variants.
Basic
In standard transformer attention, every token attends to every previous token · O(N²) memory and compute. Sliding window attention limits each token to the previous K tokens (the window size). Mistral used 4,096-token windows in its early models; Gemma 2 uses 4K windows; Llama 4 variants use a local + global hybrid. This enables long-context handling without the quadratic blowup.
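To make the restriction concrete, here is a minimal sketch of a causal sliding-window attention mask (pure Python, illustrative only · real implementations build this as a tensor mask or fuse it into the kernel):

```python
def sliding_window_mask(n, k):
    """mask[i][j] is True iff query i may attend to key j (causal, window k)."""
    return [[(j <= i) and (j > i - k) for j in range(n)] for i in range(n)]

mask = sliding_window_mask(6, 3)
for row in mask:
    print("".join("X" if v else "." for v in row))
# X.....
# XX....
# XXX...
# .XXX..
# ..XXX.
# ...XXX
```

Each row has at most K entries, so the score matrix has O(N·K) nonzero positions instead of O(N²).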
Deep
Sliding window works because most useful context is local. Information that must travel farther than the window does so through "receptive field" expansion across stacked layers · a 32-layer model with a 4K window has an effective context of ~128K through depth. Some architectures (the original Mistral 7B) used a pure sliding window; newer ones (Gemma 2, Llama 4) mix local sliding-window layers with global-attention layers for the best of both. Compute cost: sliding window attention costs roughly K/N of full attention when N >> K, i.e. an ~N/K× saving.
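The two numbers above can be checked with quick arithmetic (figures match the 32-layer, 4K-window example in the text; they are illustrative, not from a specific model card):

```python
n, k, layers = 131_072, 4_096, 32

# Receptive-field expansion: each layer can move information up to ~k tokens
# backward, so depth * window bounds how far information can travel.
print(layers * k)  # 131072, i.e. ~128K effective context through depth

# Per-layer cost ratio: sliding window scores O(n*k) entries vs O(n^2) for
# full attention, so it costs roughly k/n of full attention when n >> k.
print(k / n)       # 0.03125 → ~32x cheaper at this context length
```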
Expert
Sliding window attention integrates with FlashAttention for memory-efficient implementation. The window size K is a hyperparameter: too small loses context; too large loses the efficiency gain. Hybrid architectures (e.g., Gemma 2: sliding and global layers interleaved) balance quality against speed. Mistral's effective context via sliding proved questionable in practice · retrieval from far positions degraded. Modern approaches pair sliding windows with explicit long-context mechanisms (KV-cache compression, retrieval) to preserve quality.
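A hybrid layer schedule in the interleaved style described above can be sketched as follows (the function name and pattern are hypothetical, not any library's real config API; Gemma 2 alternates sliding and global layers in roughly this way):

```python
def layer_schedule(num_layers, pattern=("sliding", "global")):
    """Assign each transformer layer an attention type by cycling a pattern."""
    return [pattern[i % len(pattern)] for i in range(num_layers)]

print(layer_schedule(8))
# ['sliding', 'global', 'sliding', 'global',
#  'sliding', 'global', 'sliding', 'global']
```

Changing the pattern (e.g., three sliding layers per global layer) trades more compute savings for less direct long-range mixing.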
Depending on why you're here
- Limits attention to the K nearest tokens · O(N·K) vs O(N²)
- Used in Mistral, Gemma 2, Llama 4 variants
- Hybrid local+global architectures restore quality
- Efficient long-context models often use sliding windows
- Quality degrades at far positions · pair with retrieval
- Check the model card for window size + hybrid layer config
- Sliding window enables cheaper long-context serving
- Compute savings of 3-10× on long contexts
- Quality trade-off mostly mitigated in hybrid designs
- A way to make AI only look at recent words, for speed
- Lets AI handle long documents faster
- Used in some of Google's and Mistral's models
Sliding window attention is the quiet architectural choice that makes long-context serving economically possible.