Cache Pricing
A provider feature that gives you discounted input pricing for prompts the system has already processed · 50-90% savings.
Basic
Anthropic Prompt Caching: 90% off cached prefix reads. OpenAI Cached Input: 50% off. Google Gemini Caching: 75% off. The mechanism: if you send the same system prompt (or tool definitions, or retrieved docs) repeatedly, the provider keeps the KV cache warm and charges substantially less for those tokens. Changes the economics for multi-turn agents and batch workloads.
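The blended cost of a request follows directly from the discount and the fraction of the prompt that hits cache. A minimal sketch, using the discount figures above (the $3/MTok base rate is an assumption for illustration):

```python
# Effective input cost per million tokens when a fraction of the
# prompt is served from the provider's prefix cache.
# Discounts on cached tokens: Anthropic 90%, Google 75%, OpenAI 50%.
DISCOUNTS = {"anthropic": 0.90, "google": 0.75, "openai": 0.50}

def effective_rate(base_rate_per_mtok, provider, cache_hit_fraction):
    """Blend cached and uncached pricing for one request's input tokens."""
    d = DISCOUNTS[provider]
    cached = base_rate_per_mtok * (1 - d) * cache_hit_fraction
    uncached = base_rate_per_mtok * (1 - cache_hit_fraction)
    return cached + uncached

# A prompt that is 80% stable prefix, on a $3/MTok model (assumed rate):
print(round(effective_rate(3.0, "anthropic", 0.8), 2))  # 0.84
```

At an 80% hit fraction the effective Anthropic rate drops from $3.00 to $0.84 per MTok — a 72% reduction, consistent with the 60-80% range cited below.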
Deep
Cache pricing maps onto the prefill/decode split. The prefill work for a stable prefix (system prompt, tools) can be amortized across many requests. Providers expose this as a pricing tier · mark which prefix portion is stable, and subsequent requests hit cache at discounted rates. Caches expire after minutes of idle (varies by provider). Multi-turn chat workloads with stable system prompts see 60-80% total cost reduction when cache-aware.
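The amortization argument can be made concrete: the stable prefix is prefilled once (the cache write, typically at a surcharge) and then read from cache on subsequent requests within the TTL. A sketch with illustrative unit costs:

```python
# Average per-request cost of a stable prefix, amortized over n requests.
# write_cost: one-time prefill/cache-write cost (often carries a surcharge).
# read_cost: per-request cost of a cache hit (e.g. 0.1 for a 90% discount).
def amortized_prefix_cost(n_requests, write_cost, read_cost):
    """One write, then (n-1) discounted reads, averaged per request."""
    return (write_cost + (n_requests - 1) * read_cost) / n_requests

# Write at full price 1.0, reads at 0.1 (a 90%-discount tier):
print(amortized_prefix_cost(100, 1.0, 0.1))  # 0.109 — approaches 0.1
```

As n grows, the per-request cost converges to the discounted read rate, which is why high-volume, stable-prefix workloads capture nearly the full discount.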
Expert
Anthropic Prompt Caching: two cache tiers. Ephemeral = 5-min TTL, 90% discount on hit, 25% surcharge to write. An extended 1-hour TTL tier (beta) trades a higher write surcharge for longer persistence. OpenAI cached_input: automatic detection of identical prefixes (1,024-token minimum) within a 5-10 min window, 50% discount. Google context caching: explicit cache object with a configurable TTL, 75% discount. Cache hit rate depends on prompt stability · production systems should design for cache-hit with prefix stability guarantees.
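Anthropic's tier is the only one of the three that is opt-in per content block. A sketch of the request-body shape with a `cache_control` marker on the stable system prompt (the model id and prompt text are placeholders, not verified values):

```python
# Sketch of an Anthropic Messages API request body marking the stable
# system prompt as cacheable (ephemeral tier, ~5-min TTL). Everything
# up to and including the cache_control block becomes the cached prefix;
# the volatile user turn below it stays uncached.
request_body = {
    "model": "claude-sonnet-4",  # placeholder model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a support agent for ...",  # stable prefix
            "cache_control": {"type": "ephemeral"},     # cache boundary
        }
    ],
    "messages": [
        {"role": "user", "content": "volatile user input goes here"}
    ],
}
```

OpenAI needs no such marker (identical prefixes are detected automatically), and Google instead requires creating a named cache object up front and referencing it per request.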
Depending on why you're here
- KV cache persistence → prefill amortization
- Anthropic 90%, OpenAI 50%, Google 75% · different mechanisms
- TTLs vary from 5 minutes to hours by provider and tier
- Design your system prompt for stability · cache it
- Separate stable prefix from volatile user input
- 60-80% workload cost reduction is typical with cache-aware design
- Cache tiers are a margin lever · Anthropic's aggressive 90% discount is a strategic choice
- Multi-turn workloads favor cache-capable providers
- Cache hit rates are hidden unit economics most pricing comparisons ignore
- AI companies charge less when they can reuse work
- Why chatbots with stable personas are cheaper to run
- Not the same as RAG · this is about prompt reuse, not knowledge
Cache pricing is the closest thing to a free lunch in LLM economics. Ignore it and overpay by 50-90%.
On a chatbot that resends a 5,000-token system prompt across ~2,000 requests a month (10M prefix tokens), Claude prompt caching at Sonnet-class rates ($3/MTok input) saves roughly $27/month: $30 uncached versus about $3 in cached reads plus a one-time write surcharge. At scale it's material.
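A worked version of the chatbot example under explicit rate assumptions (Sonnet-class $3/MTok input, 90% read discount, 25% write surcharge; actual savings scale with the model's base rate):

```python
# Monthly savings when a 5,000-token system prompt is resent on every
# request. Assumed rates: $3/MTok input, cache reads at 10% of base,
# the first write at 125% of base (Anthropic-style ephemeral tier).
BASE = 3.00           # $/MTok, assumed
PREFIX_TOKENS = 5_000
REQUESTS = 2_000      # ~2,000 requests/month ≈ 10M prefix tokens

prefix_mtok = PREFIX_TOKENS * REQUESTS / 1e6     # 10.0 MTok total
write_mtok = PREFIX_TOKENS / 1e6                 # first request writes
uncached = prefix_mtok * BASE                    # $30.00 without caching
cached = (write_mtok * BASE * 1.25               # one cache write
          + (prefix_mtok - write_mtok) * BASE * 0.10)  # discounted reads
print(f"${uncached - cached:.2f}/month saved")   # $26.98/month saved
```

The write surcharge is a rounding error here; nearly all of the saving comes from the 90% read discount on the repeated prefix.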