
Cache Pricing

TL;DR

A provider feature that gives you discounted input pricing for prompts the system has already processed · 50-90% savings.

Level 1

Anthropic Prompt Caching: 90% off cached prefix reads. OpenAI Cached Input: 50% off. Google Gemini context caching: 75% off. The mechanism: if you send the same system prompt (or tool definitions, or retrieved docs) repeatedly, the provider keeps the KV cache warm and charges you massively less for that prefix. Changes the economics for multi-turn agents and batch workloads.
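A rough cost sketch of those headline discounts (the rates are the figures above; real pricing varies by model, tier, and write surcharges):

```python
# Input-token cost with and without a warm prefix cache.
# Discount rates are the headline figures above, simplified.
DISCOUNT = {"anthropic": 0.90, "openai": 0.50, "gemini": 0.75}

def input_cost(prefix_tokens, fresh_tokens, price_per_mtok, provider, cached=True):
    """Dollar cost of one request's input tokens."""
    rate = 1 - DISCOUNT[provider] if cached else 1.0
    return (prefix_tokens * rate + fresh_tokens) * price_per_mtok / 1e6

# 5,000-token system prompt + 200-token user turn at a $3/MTok input price:
cold = input_cost(5000, 200, 3.0, "anthropic", cached=False)  # $0.0156, full price
warm = input_cost(5000, 200, 3.0, "anthropic", cached=True)   # $0.0021, prefix at 10%
```

Same request, ~7x cheaper once the prefix is warm.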

Level 2

Cache pricing maps onto the prefill/decode split. The prefill work for a stable prefix (system prompt, tools) can be amortized across many requests. Providers expose this as a pricing tier · mark which prefix portion is stable, and subsequent requests hit cache at discounted rates. Caches expire after minutes of idle (varies by provider). Multi-turn chat workloads with stable system prompts see 60-80% total cost reduction when cache-aware.
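"Mark which prefix portion is stable" looks roughly like this in the shape of Anthropic's prompt-caching API (a `cache_control` marker on the system block). No network call here, just the request body; the prompt text and model name are illustrative:

```python
# Sketch: separate the stable prefix (cacheable) from volatile user input.
STABLE_SYSTEM_PROMPT = "You are a support agent for Acme Corp. ..."  # hypothetical

def build_request(user_message: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",  # illustrative model name
        "max_tokens": 1024,
        # Stable prefix: marked cacheable, billed at the discounted rate on hits.
        "system": [
            {
                "type": "text",
                "text": STABLE_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Volatile part: changes every request, always billed at full price.
        "messages": [{"role": "user", "content": user_message}],
    }
```

The design point: anything above the cache marker must be byte-identical across requests, so keep per-user or per-session data below it.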

Level 3

Anthropic Prompt Caching: two TTL tiers. Default = 5-min TTL, 90% discount on hit, 25% surcharge to write. Extended 1-hour TTL (beta) = higher write surcharge. OpenAI cached_input: automatic detection of identical prefixes in a 5-10 min window, 50% discount. Google context caching: explicit cache object, 75% discount. Cache hit rate depends on prompt stability · production systems should keep the prefix byte-stable so requests reliably hit cache.
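Using the 5-min-tier numbers above (90% off reads, 25% surcharge on writes), a quick expected-cost model shows when caching pays off at all:

```python
# Expected per-token cost multiplier for the cacheable prefix, given a
# cache hit rate: write (miss) = 1.25x base, read (hit) = 0.10x base.
def prefix_cost_multiplier(hit_rate: float) -> float:
    return (1 - hit_rate) * 1.25 + hit_rate * 0.10

# Break-even vs. no caching (1.00x): solve 1.25 - 1.15*h = 1.0
break_even = (1.25 - 1.0) / (1.25 - 0.10)

print(round(break_even, 3))                   # 0.217: cache pays above ~22% hit rate
print(round(prefix_cost_multiplier(0.9), 3))  # 0.215: ~78% cheaper at a 90% hit rate
```

So even a mediocre hit rate beats no caching; the write surcharge only hurts if the prefix changes on most requests.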

The takeaway for you
If you are a Researcher
  • KV cache persistence → prefill amortization
  • Anthropic 90%, OpenAI 50%, Google 75% · different mechanisms
  • TTL varies from 5 min to hours by provider and tier
If you are a Builder
  • Design your system prompt for stability · cache it
  • Separate stable prefix from volatile user input
  • 60-80% workload cost reduction is typical with cache-aware design
If you are an Investor
  • Cache tiers are a margin lever · Anthropic's aggressive 90% discount is strategic
  • Multi-turn workloads favor cache-capable providers
  • Cache hit rates are hidden unit economics most pricing comparisons ignore
If you are a Curious · Normie
  • AI companies charge less when they can reuse work
  • Why chatbots with stable personas are cheaper to run
  • Not the same as RAG · this is about prompt reuse, not knowledge
Gecko's take

Cache pricing is the closest thing to a free lunch in LLM economics. Ignore it and overpay by 50-90%.

The price of knowing this term

On a 10M-token/month chatbot with a 5,000-token system prompt, Claude prompt caching saves ~$40/month (90% off the repeated prefix). At scale it's material.
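The back-of-envelope behind a figure in that ballpark (the request count and the Opus-class $15/MTok input price are illustrative assumptions, not from the article):

```python
# Assumed: ~600 requests/month, each resending a 5,000-token system prompt,
# at a $15/MTok input price. Ignores the small one-time write surcharge.
requests = 600
prefix_tokens = 5_000
price_per_mtok = 15.0

full = requests * prefix_tokens / 1e6 * price_per_mtok  # $45.00 without caching
cached = full * 0.10                                    # reads billed at 10%
savings = full - cached                                 # ~$40.50/month saved
```

Scale the request count up 100x and the same arithmetic saves ~$4,000/month.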

Anthropic (90% discount, aggressive), OpenAI (50% auto), Google Gemini (75%). Most other providers don't offer explicit cache tiers yet.