
Cache Pricing

TL;DR

A provider feature that gives you discounted input pricing for prompts the system has already processed · 50-90% savings.

Level 1

Anthropic Prompt Caching: 90% off cached prefix reads. OpenAI Cached Input: 50% off. Google Gemini context caching: 75% off. The mechanism: if you send the same system prompt (or tool definitions, or retrieved docs) repeatedly, the provider keeps the KV cache warm and charges you massively less for that prefix. Changes the economics for multi-turn agents and batch workloads.
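A rough cost sketch of those headline discounts (the rates are the figures above; real pricing varies by model, tier, and write surcharges):

```python
# Input-token cost with and without a warm prefix cache.
# Discount rates are the headline figures above, simplified.
DISCOUNT = {"anthropic": 0.90, "openai": 0.50, "gemini": 0.75}

def input_cost(prefix_tokens, fresh_tokens, price_per_mtok, provider, cached=True):
    """Dollar cost of one request's input tokens."""
    rate = 1 - DISCOUNT[provider] if cached else 1.0
    return (prefix_tokens * rate + fresh_tokens) * price_per_mtok / 1e6

# 5,000-token system prompt + 200-token user turn at a $3/MTok input price:
cold = input_cost(5000, 200, 3.0, "anthropic", cached=False)  # $0.0156, full price
warm = input_cost(5000, 200, 3.0, "anthropic", cached=True)   # $0.0021, prefix at 10%
```

Same request, ~7x cheaper once the prefix is warm.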

Level 2

Cache pricing maps onto the prefill/decode split. The prefill work for a stable prefix (system prompt, tools) can be amortized across many requests. Providers expose this as a pricing tier · mark which prefix portion is stable, and subsequent requests hit cache at discounted rates. Caches expire after minutes of idle (varies by provider). Multi-turn chat workloads with stable system prompts see 60-80% total cost reduction when cache-aware.
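"Mark which prefix portion is stable" looks roughly like this in the shape of Anthropic's prompt-caching API (a `cache_control` marker on the system block). No network call here, just the request body; the prompt text and model name are illustrative:

```python
# Sketch: separate the stable prefix (cacheable) from volatile user input.
STABLE_SYSTEM_PROMPT = "You are a support agent for Acme Corp. ..."  # hypothetical

def build_request(user_message: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",  # illustrative model name
        "max_tokens": 1024,
        # Stable prefix: marked cacheable, billed at the discounted rate on hits.
        "system": [
            {
                "type": "text",
                "text": STABLE_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Volatile part: changes every request, always billed at full price.
        "messages": [{"role": "user", "content": user_message}],
    }
```

The design point: anything above the cache marker must be byte-identical across requests, so keep per-user or per-session data below it.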

Level 3

Anthropic Prompt Caching: two TTL tiers. Default = 5-min TTL, 90% discount on hit, 25% surcharge to write. Extended 1-hour TTL (beta) = higher write surcharge. OpenAI cached_input: automatic detection of identical prefixes in a 5-10 min window, 50% discount. Google context caching: explicit cache object, 75% discount. Cache hit rate depends on prompt stability · production systems should keep the prefix byte-stable so requests reliably hit cache.
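Using the 5-min-tier numbers above (90% off reads, 25% surcharge on writes), a quick expected-cost model shows when caching pays off at all:

```python
# Expected per-token cost multiplier for the cacheable prefix, given a
# cache hit rate: write (miss) = 1.25x base, read (hit) = 0.10x base.
def prefix_cost_multiplier(hit_rate: float) -> float:
    return (1 - hit_rate) * 1.25 + hit_rate * 0.10

# Break-even vs. no caching (1.00x): solve 1.25 - 1.15*h = 1.0
break_even = (1.25 - 1.0) / (1.25 - 0.10)

print(round(break_even, 3))                   # 0.217: cache pays above ~22% hit rate
print(round(prefix_cost_multiplier(0.9), 3))  # 0.215: ~78% cheaper at a 90% hit rate
```

So even a mediocre hit rate beats no caching; the write surcharge only hurts if the prefix changes on most requests.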

The takeaway for you
If you are a Researcher
  • KV cache persistence → prefill amortization
  • Anthropic 90%, OpenAI 50%, Google 75% · different mechanisms
  • TTL varies from 5 min to hours by provider and tier
If you are a Builder
  • Design your system prompt for stability · cache it
  • Separate stable prefix from volatile user input
  • 60-80% workload cost reduction is typical with cache-aware design
If you are an Investor
  • Cache tiers are a margin lever · Anthropic's aggressive 90% discount is strategic
  • Multi-turn workloads favor cache-capable providers
  • Cache hit rates are hidden unit economics most pricing comparisons ignore
If you are a Curious · Normie
  • AI companies charge less when they can reuse work
  • Why chatbots with stable personas are cheaper to run
  • Not the same as RAG · this is about prompt reuse, not knowledge
Gecko's take

Cache pricing is the closest thing to a free lunch in LLM economics. Ignore it and overpay by 50-90%.

The price of knowing this term

On a 10M-token/month chatbot with a 5,000-token system prompt, Claude prompt caching saves ~$40/month (90% off the repeated prefix). At scale it's material.
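The back-of-envelope behind a figure in that ballpark (the request count and the Opus-class $15/MTok input price are illustrative assumptions, not from the article):

```python
# Assumed: ~600 requests/month, each resending a 5,000-token system prompt,
# at a $15/MTok input price. Ignores the small one-time write surcharge.
requests = 600
prefix_tokens = 5_000
price_per_mtok = 15.0

full = requests * prefix_tokens / 1e6 * price_per_mtok  # $45.00 without caching
cached = full * 0.10                                    # reads billed at 10%
savings = full - cached                                 # ~$40.50/month saved
```

Scale the request count up 100x and the same arithmetic saves ~$4,000/month.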

Anthropic (90% discount, aggressive), OpenAI (50% auto), Google Gemini (75%). Most other providers don't offer explicit cache tiers yet.