
Sliding Window Attention

TL;DR

Sliding window attention restricts each token to a local window of K neighbors · cuts attention compute from O(N²) to O(N·K) · used in Mistral, Gemma, and Llama 4 variants.

Level 1

In standard transformer attention, every token attends to every previous token · O(N²) memory and compute. Sliding window attention limits each token to the previous K tokens (the window size). Mistral used 4,096-token windows in early models; Gemma 2 uses 4K windows; Llama 4 uses a local + global hybrid. This enables long-context handling without quadratic blowup.
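
To make the masking concrete, here is a minimal sketch of causal sliding-window attention · PyTorch is assumed and the names are illustrative; note it still materializes the full N×N score matrix, so the real O(N·K) savings come from kernels that compute only the banded scores.

```python
# Causal sliding-window attention, naive version for clarity (not an
# optimized kernel): each position attends to itself and the previous
# `window - 1` positions only.
import torch

def sliding_window_attention(q, k, v, window: int):
    """q, k, v: (seq_len, d) tensors for a single head."""
    n, d = q.shape
    scores = q @ k.T / d ** 0.5                 # (n, n) raw attention scores
    pos = torch.arange(n)
    dist = pos[:, None] - pos[None, :]          # query index minus key index
    keep = (dist >= 0) & (dist < window)        # causal AND within the window
    scores = scores.masked_fill(~keep, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Toy usage: 8 tokens, head dim 16, window of 4.
q = k = v = torch.randn(8, 16)
print(sliding_window_attention(q, k, v, window=4).shape)  # torch.Size([8, 16])
```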

Level 2

Sliding window works because most useful context is local. Information that needs to travel farther relies on receptive-field expansion through stacked layers · a 32-layer model with a 4K window has a theoretical receptive field of 128K tokens through depth. Some architectures (the original Mistral 7B) used a pure sliding window; newer ones (Gemma 2, Llama 4) mix local sliding-window layers with global-attention layers for the best of both. Compute cost: sliding window attention costs roughly K/N of full attention (an ~N/K× saving) when N >> K.
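
A quick back-of-the-envelope check of those numbers · illustrative arithmetic only:

```python
# Receptive field grows by one window per layer; banded attention scores
# scale with N*K instead of N*N.
layers, window, seq_len = 32, 4096, 131_072

receptive_field = layers * window        # 32 * 4096 = 131,072 ≈ 128K tokens of reach
full_cost = seq_len * seq_len            # O(N²) attention-score entries
sliding_cost = seq_len * window          # O(N·K) attention-score entries

print(receptive_field)                   # 131072
print(sliding_cost / full_cost)          # 0.03125 → K/N of the full-attention score count
```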

Level 3

Sliding window attention integrates with FlashAttention for a memory-efficient implementation. The window size K is a hyperparameter: too small loses context, too large loses efficiency. Hybrid architectures (e.g., Gemma 2: sliding + global layers interleaved) balance quality vs. speed. Mistral's effective context via sliding windows was questionable in practice · retrieval from far positions degraded. Modern approaches pair sliding windows with explicit long-context mechanisms (KV-cache compression, retrieval) to preserve quality.
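
A hedged sketch of what a hybrid schedule might look like · the alternating pattern is loosely Gemma 2-style, and the function names are assumptions for illustration, not any library's API:

```python
# Mark every `global_every`-th layer as global attention, the rest as sliding,
# then roughly compare score-entry counts against an all-global stack.
def attention_plan(num_layers: int, global_every: int = 2):
    return ["global" if (i + 1) % global_every == 0 else "sliding"
            for i in range(num_layers)]

def approx_score_entries(plan, seq_len: int, window: int):
    """N*N entries for global layers, N*K for sliding layers."""
    return sum(seq_len * seq_len if kind == "global" else seq_len * window
               for kind in plan)

plan = attention_plan(num_layers=32)
print(plan[:4])  # ['sliding', 'global', 'sliding', 'global']
ratio = (approx_score_entries(plan, 131_072, 4_096)
         / approx_score_entries(["global"] * 32, 131_072, 4_096))
print(ratio)     # ≈ 0.52 of full attention's score count at 128K context
```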

The takeaway for you
If you are a Researcher
  • Limits attention to the K nearest tokens · O(N·K) vs O(N²)
  • Used in Mistral, Gemma 2, Llama 4 variants
  • Hybrid local+global architectures restore quality
If you are a Builder
  • Efficient long-context models often use sliding window
  • Quality degrades at far positions · pair with retrieval
  • Check the model card for window size + hybrid layer config (see the sketch after this list)
If you are an Investor
  • Sliding window enables cheaper long-context serving
  • Compute savings of 3-10× on long contexts
  • Quality trade-off mostly mitigated in hybrid designs
If you are a Curious Normie
  • A way to make AI only look at recent words, for speed
  • Lets AI handle long documents faster
  • Used in some of Google's and Mistral's models
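
One practical way to do that model-card check, assuming a Hugging Face transformers config that exposes `sliding_window` (as Mistral-style configs do · field names vary by architecture):

```python
# Inspect window size and depth from the published config.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
print(getattr(cfg, "sliding_window", None))     # 4096 for this config
print(getattr(cfg, "num_hidden_layers", None))  # depth → receptive field = layers × window
```
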
Gecko's take

Sliding window attention is the quiet architecture that makes long-context serving economically possible.

Window sizes of 2K-8K tokens are common. Mistral used 4K, Gemma 2 uses 4K, and Llama 4 Scout uses 8K with global-attention layers interleaved.