
Sliding Window Attention

TL;DR

Sliding window attention restricts each token to a local window of K neighbors · cuts attention compute from O(N²) to O(N·K) · used in Mistral, Gemma, and Llama 4 variants.

Level 1

In standard transformer attention, every token attends to every previous token · O(N²) memory and compute. Sliding window attention limits each token to the previous K tokens (the window size). Mistral used 4,096-token windows in early models; Gemma 2 uses 4K windows; Llama 4 uses a local + global hybrid. This enables long-context handling without quadratic blowup.
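
To make the masking concrete, here is a minimal sketch of causal sliding-window attention · PyTorch is assumed and the names are illustrative; note it still materializes the full N×N score matrix, so the real O(N·K) savings come from kernels that compute only the banded scores.

```python
# Causal sliding-window attention, naive version for clarity (not an
# optimized kernel): each position attends to itself and the previous
# `window - 1` positions only.
import torch

def sliding_window_attention(q, k, v, window: int):
    """q, k, v: (seq_len, d) tensors for a single head."""
    n, d = q.shape
    scores = q @ k.T / d ** 0.5                 # (n, n) raw attention scores
    pos = torch.arange(n)
    dist = pos[:, None] - pos[None, :]          # query index minus key index
    keep = (dist >= 0) & (dist < window)        # causal AND within the window
    scores = scores.masked_fill(~keep, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Toy usage: 8 tokens, head dim 16, window of 4.
q = k = v = torch.randn(8, 16)
print(sliding_window_attention(q, k, v, window=4).shape)  # torch.Size([8, 16])
```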

Level 2

Sliding window works because most useful context is local. Information that needs to travel farther relies on receptive-field expansion through stacked layers · a 32-layer model with a 4K window has a theoretical receptive field of 128K tokens through depth. Some architectures (the original Mistral 7B) used a pure sliding window; newer ones (Gemma 2, Llama 4) mix local sliding-window layers with global-attention layers for the best of both. Compute cost: sliding window attention costs roughly K/N of full attention (an ~N/K× saving) when N >> K.
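
A quick back-of-the-envelope check of those numbers · illustrative arithmetic only:

```python
# Receptive field grows by one window per layer; banded attention scores
# scale with N*K instead of N*N.
layers, window, seq_len = 32, 4096, 131_072

receptive_field = layers * window        # 32 * 4096 = 131,072 ≈ 128K tokens of reach
full_cost = seq_len * seq_len            # O(N²) attention-score entries
sliding_cost = seq_len * window          # O(N·K) attention-score entries

print(receptive_field)                   # 131072
print(sliding_cost / full_cost)          # 0.03125 → K/N of the full-attention score count
```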

Level 3

Sliding window attention integrates with FlashAttention for a memory-efficient implementation. The window size K is a hyperparameter: too small loses context, too large loses efficiency. Hybrid architectures (e.g., Gemma 2: sliding + global layers interleaved) balance quality vs. speed. Mistral's effective context via sliding windows was questionable in practice · retrieval from far positions degraded. Modern approaches pair sliding windows with explicit long-context mechanisms (KV-cache compression, retrieval) to preserve quality.
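
A hedged sketch of what a hybrid schedule might look like · the alternating pattern is loosely Gemma 2-style, and the function names are assumptions for illustration, not any library's API:

```python
# Mark every `global_every`-th layer as global attention, the rest as sliding,
# then roughly compare score-entry counts against an all-global stack.
def attention_plan(num_layers: int, global_every: int = 2):
    return ["global" if (i + 1) % global_every == 0 else "sliding"
            for i in range(num_layers)]

def approx_score_entries(plan, seq_len: int, window: int):
    """N*N entries for global layers, N*K for sliding layers."""
    return sum(seq_len * seq_len if kind == "global" else seq_len * window
               for kind in plan)

plan = attention_plan(num_layers=32)
print(plan[:4])  # ['sliding', 'global', 'sliding', 'global']
ratio = (approx_score_entries(plan, 131_072, 4_096)
         / approx_score_entries(["global"] * 32, 131_072, 4_096))
print(ratio)     # ≈ 0.52 of full attention's score count at 128K context
```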

The takeaway for you
If you are a Researcher
  • Limits attention to the K nearest tokens · O(N·K) vs O(N²)
  • Used in Mistral, Gemma 2, Llama 4 variants
  • Hybrid local+global architectures restore quality
If you are a Builder
  • Efficient long-context models often use sliding window
  • Quality degrades at far positions · pair with retrieval
  • Check the model card for window size + hybrid layer config (see the sketch after this list)
If you are an Investor
  • Sliding window enables cheaper long-context serving
  • Compute savings of 3-10× on long contexts
  • Quality trade-off mostly mitigated in hybrid designs
If you are a Curious Normie
  • A way to make AI only look at recent words, for speed
  • Lets AI handle long documents faster
  • Used in some of Google's and Mistral's models
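
One practical way to do that model-card check, assuming a Hugging Face transformers config that exposes `sliding_window` (as Mistral-style configs do · field names vary by architecture):

```python
# Inspect window size and depth from the published config.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
print(getattr(cfg, "sliding_window", None))     # 4096 for this config
print(getattr(cfg, "num_hidden_layers", None))  # depth → receptive field = layers × window
```
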
Gecko's take

Sliding window attention is the quiet architecture that makes long-context serving economically possible.

Window sizes of 2K-8K tokens are common. Mistral used 4K, Gemma 2 uses 4K, and Llama 4 Scout uses 8K with global-attention layers interleaved.