Context Window
The max number of tokens · input + output · a model can handle in a single request. Ranges from 32K to 2M in 2026.
Basic
Context windows in 2026: Gemini 2.5 Pro offers 2M tokens, GPT-5 offers 128K-400K depending on tier, Claude Sonnet 4 offers 200K, Llama 4 offers 10M (experimental). Bigger windows let the model consider more information at once · entire codebases, long documents, multi-hour conversations. Cost scales roughly linearly with tokens used, so big contexts get expensive fast.
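The linear cost scaling is easy to sketch. A minimal calculator, with illustrative prices only (not any provider's actual rates):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Linear per-token pricing; output is typically billed at a higher rate."""
    return (input_tokens / 1e6) * in_price_per_m + (output_tokens / 1e6) * out_price_per_m

# Filling a 2M-token window and generating a 10K-token reply
# at an assumed $1.25/M input, $10/M output:
print(round(request_cost(2_000_000, 10_000, 1.25, 10.0), 2))
```

At these placeholder rates the input dominates: $2.50 of the total comes from the window, only $0.10 from the reply.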
Deep
A context window includes the system prompt, user prompt, any retrieved documents (RAG), tool-call results, prior conversation turns, and the generated output. Models report context differently: effective context (the range over which quality is preserved) is often smaller than maximum context. "Needle in a haystack" tests measure retrieval quality at depth · most models degrade past 50-70% depth. The compute cost of attention is quadratic in sequence length, O(n²), but modern serving uses optimized attention (FlashAttention, ring attention) to push the practical cost lower. Pricing is typically linear per-token on input, with output billed at a multiplier of the input rate.
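Because every component above shares one window, long-context work is largely budget accounting. A minimal sketch, with invented token counts for illustration:

```python
# A request's context is the sum of every component; the request fails
# (or gets truncated) once the total exceeds the window.
budget = {
    "system_prompt": 1_500,
    "conversation_history": 40_000,
    "retrieved_documents": 80_000,   # RAG chunks
    "tool_results": 12_000,
    "reserved_for_output": 8_000,    # max tokens allowed for the reply
}
CONTEXT_WINDOW = 200_000

used = sum(budget.values())
assert used <= CONTEXT_WINDOW, "over budget: trim history or retrieval"
print(f"{used:,} of {CONTEXT_WINDOW:,} tokens ({used / CONTEXT_WINDOW:.0%})")
```

Note that output must be reserved up front: a prompt that fills the window leaves no room for the model to answer.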
Expert
Quadratic attention cost is mitigated through FlashAttention (a memory-efficient exact implementation), ring attention (distributed across devices), sliding-window attention (local only), and linear-attention variants. Effective context length (where the model actually uses information reliably) lags max context by 20-50%. Extended context often comes from position-extension techniques such as YaRN, LongRoPE, or position interpolation, which can degrade base capability. Frontier models trained from scratch with long context (Gemini 1.5 with ring attention) preserve quality at length better than position-extended models. KV cache size grows linearly with context · a 2M-token request on a GPT-4-class model needs 100GB+ of KV cache.
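The KV-cache arithmetic can be sketched directly. The model shape below is assumed (a GQA-style configuration roughly like Llama-3-70B), since frontier model internals are unpublished:

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    # One key and one value vector per layer, per KV head, per position;
    # the factor of 2 covers K and V, dtype_bytes=2 assumes fp16/bf16.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Assumed shape: 80 layers, 8 KV heads (grouped-query attention), head_dim 128.
gb = kv_cache_bytes(2_000_000, 80, 8, 128) / 1e9
print(f"{gb:.0f} GB for one 2M-token request")
```

Even with grouped-query attention cutting KV heads by an order of magnitude versus full multi-head attention, a single 2M-token request still demands hundreds of gigabytes · which is why long context strains provider margins.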
Depending on why you're here
- Quadratic attention cost · FlashAttention, ring attention mitigate
- Effective context < max context · measure with needle-in-haystack
- YaRN, LongRoPE are position-extension methods
- Budget tokens carefully · 2M context at $10/M = $20 per request
- RAG usually beats long context for large document sets
- Check effective context length before relying on deep retrieval
- Long context is a commoditizing feature · 1M+ is table stakes in 2026
- Infra cost of long context (KV cache, bandwidth) constrains provider margins
- Use-case dominance: agents + codebase understanding + legal docs
- How much text AI can read at once
- Bigger window = more document, more conversation, more code
- Costs more as the window grows
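The needle-in-haystack measurement mentioned above can be sketched as a prompt-construction harness. The model call is left as a stub, since the client API varies by provider; the needle text and filler are invented for illustration:

```python
def build_haystack(needle: str, filler: str, total_chars: int, depth: float) -> str:
    """Embed `needle` at fractional `depth` (0.0 = start, 1.0 = end) of filler text."""
    body = (filler * (total_chars // len(filler) + 1))[:total_chars]
    pos = int(total_chars * depth)
    return body[:pos] + "\n" + needle + "\n" + body[pos:]

NEEDLE = "The vault code is 7421."
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_haystack(NEEDLE, "The sky was grey over the harbor. ", 400_000, depth)
    # Send prompt + "What is the vault code?" to the model under test and
    # record whether the reply contains "7421"; accuracy vs. depth is the curve.
```

Running this across depths (and across total lengths up to the advertised window) maps out the effective context: the region where retrieval accuracy stays near 100%.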
Often confused with
Tokens are the units. Context window is the max count of tokens per request. 1M context window holds roughly 750K English words.
Context is within-request scope. Persistent memory (across requests) requires separate infrastructure · vector DB, session store.
Context windows hit diminishing returns past 200K for most workloads. 1M+ is for agents and codebase-scale retrieval, not chat.