Context Window
The max number of tokens · input + output · a model can handle in a single request. Ranges from 32K to 2M in 2026.
Basic
Context windows in 2026: Gemini 2.5 Pro offers 2M tokens, GPT-5 offers 128K-400K depending on tier, Claude Sonnet 4 offers 200K, Llama 4 offers 10M (experimental). Bigger windows let the model consider more information at once · entire codebases, long documents, multi-hour conversations. Cost scales roughly linearly with tokens used, so big contexts get expensive fast.
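The linear cost scaling is easy to sketch. A minimal calculator, with illustrative prices only (not any provider's actual rates):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Linear per-token pricing; output is typically billed at a higher rate."""
    return (input_tokens / 1e6) * in_price_per_m + (output_tokens / 1e6) * out_price_per_m

# Filling a 2M-token window and generating a 10K-token reply
# at an assumed $1.25/M input, $10/M output:
print(round(request_cost(2_000_000, 10_000, 1.25, 10.0), 2))
```

At these placeholder rates the input dominates: $2.50 of the total comes from the window, only $0.10 from the reply.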
Deep
A context window includes the system prompt, user prompt, any retrieved documents (RAG), tool-call results, prior conversation turns, and the generated output. Models report context differently: effective context (the range over which quality is preserved) is often smaller than maximum context. "Needle in a haystack" tests measure retrieval quality at depth · most models degrade past 50-70% depth. The compute cost of attention is quadratic in sequence length, O(n²), but modern serving uses optimized attention (FlashAttention, ring attention) to push the practical cost lower. Pricing is typically linear per-token on input, with output billed at a multiplier of the input rate.
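Because every component above shares one window, long-context work is largely budget accounting. A minimal sketch, with invented token counts for illustration:

```python
# A request's context is the sum of every component; the request fails
# (or gets truncated) once the total exceeds the window.
budget = {
    "system_prompt": 1_500,
    "conversation_history": 40_000,
    "retrieved_documents": 80_000,   # RAG chunks
    "tool_results": 12_000,
    "reserved_for_output": 8_000,    # max tokens allowed for the reply
}
CONTEXT_WINDOW = 200_000

used = sum(budget.values())
assert used <= CONTEXT_WINDOW, "over budget: trim history or retrieval"
print(f"{used:,} of {CONTEXT_WINDOW:,} tokens ({used / CONTEXT_WINDOW:.0%})")
```

Note that output must be reserved up front: a prompt that fills the window leaves no room for the model to answer.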
Expert
Quadratic attention cost is mitigated through FlashAttention (a memory-efficient exact implementation), ring attention (distributed across devices), sliding-window attention (local only), and linear-attention variants. Effective context length (where the model actually uses information reliably) lags max context by 20-50%. Extended context often comes from position-extension techniques such as YaRN, LongRoPE, or position interpolation, which can degrade base capability. Frontier models trained from scratch with long context (Gemini 1.5 with ring attention) preserve quality at length better than position-extended models. KV cache size grows linearly with context · a 2M-token request on a GPT-4-class model needs 100GB+ of KV cache.
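The KV-cache arithmetic can be sketched directly. The model shape below is assumed (a GQA-style configuration roughly like Llama-3-70B), since frontier model internals are unpublished:

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    # One key and one value vector per layer, per KV head, per position;
    # the factor of 2 covers K and V, dtype_bytes=2 assumes fp16/bf16.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Assumed shape: 80 layers, 8 KV heads (grouped-query attention), head_dim 128.
gb = kv_cache_bytes(2_000_000, 80, 8, 128) / 1e9
print(f"{gb:.0f} GB for one 2M-token request")
```

Even with grouped-query attention cutting KV heads by an order of magnitude versus full multi-head attention, a single 2M-token request still demands hundreds of gigabytes · which is why long context strains provider margins.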
Depending on why you're here
- Quadratic attention cost · FlashAttention, ring attention mitigate
- Effective context < max context · measure with needle-in-haystack
- YaRN, LongRoPE are position-extension methods
- Budget tokens carefully · 2M context at $10/M = $20 per request
- RAG usually beats long context for large document sets
- Check effective context length before relying on deep retrieval
- Long context is a commoditizing feature · 1M+ is table stakes in 2026
- Infra cost of long context (KV cache, bandwidth) constrains provider margins
- Use-case dominance: agents + codebase understanding + legal docs
- How much text AI can read at once
- Bigger window = more document, more conversation, more code
- Costs more as the window grows
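The needle-in-haystack measurement mentioned above can be sketched as a prompt-construction harness. The model call is left as a stub, since the client API varies by provider; the needle text and filler are invented for illustration:

```python
def build_haystack(needle: str, filler: str, total_chars: int, depth: float) -> str:
    """Embed `needle` at fractional `depth` (0.0 = start, 1.0 = end) of filler text."""
    body = (filler * (total_chars // len(filler) + 1))[:total_chars]
    pos = int(total_chars * depth)
    return body[:pos] + "\n" + needle + "\n" + body[pos:]

NEEDLE = "The vault code is 7421."
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_haystack(NEEDLE, "The sky was grey over the harbor. ", 400_000, depth)
    # Send prompt + "What is the vault code?" to the model under test and
    # record whether the reply contains "7421"; accuracy vs. depth is the curve.
```

Running this across depths (and across total lengths up to the advertised window) maps out the effective context: the region where retrieval accuracy stays near 100%.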
Often confused with
Tokens are the units. Context window is the max count of tokens per request. 1M context window holds roughly 750K English words.
Context is within-request scope. Persistent memory (across requests) requires separate infrastructure · vector DB, session store.
Context windows hit diminishing returns past 200K for most workloads. 1M+ is for agents and codebase-scale retrieval, not chat.