
Cache Hit Rate

The percentage of input tokens served from a provider's prompt cache. Cached tokens are billed at a steep discount (as little as 10% of list price), so hit rate directly sets your effective input cost.


Level 1

When a prompt has a long system message or document that repeats across calls, providers cache it, and the cached portion is cheaper on subsequent calls: Anthropic charges 10% of the normal input price on cache hits, while OpenAI discounts cached tokens by 50%. Cache hit rate is the percentage of your input tokens served from the cache. At a 90% hit rate with 0.1× reads, a $3/M model becomes an effective ~$0.57/M model.
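The blended-price arithmetic above can be sketched directly. A minimal sketch, assuming Anthropic-style 0.1× cache reads and ignoring the one-time cache-write premium:

```python
def effective_input_price(list_price_per_m: float, hit_rate: float,
                          read_multiplier: float = 0.1) -> float:
    """Blended $/M input tokens: cached tokens billed at a fraction
    of list price, uncached tokens at full price."""
    return list_price_per_m * (hit_rate * read_multiplier + (1 - hit_rate))

# A $3/M model with a 90% hit rate at 0.1x reads:
print(round(effective_input_price(3.0, 0.90), 2))  # → 0.57
```

Set `read_multiplier=0.5` to model an OpenAI-style 50% discount instead.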

Level 2

Anthropic's caching has a 5-minute or 1-hour TTL: you pay a premium to write to the cache (1.25× list price for the 5-minute tier, 2× for 1-hour), then 0.1× for reads. OpenAI caches automatically, with a shorter effective lifetime and no write premium. Google Gemini's explicit context caching is priced by cached context size and storage duration. Hit rate depends on prompt design: keep static content at the front, split cacheable blocks from dynamic ones, and reuse the same prefix across sessions. Well-designed production AI apps typically see 60-95% cache hit rates.
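The write premium means caching only pays off when a prefix is reused. A quick break-even sketch with the 5-minute-tier multipliers above (1.25× write, 0.1× read):

```python
def cached_cost(n_calls: int, prefix_tokens_m: float, price: float,
                write_mult: float = 1.25, read_mult: float = 0.1) -> float:
    """Cost of a shared prefix over n calls: one cache write, then reads."""
    return prefix_tokens_m * price * (write_mult + read_mult * (n_calls - 1))

def uncached_cost(n_calls: int, prefix_tokens_m: float, price: float) -> float:
    """Same prefix billed at full list price on every call."""
    return prefix_tokens_m * price * n_calls

# A 10k-token prefix (0.01 M tokens) on a $3/M model:
for n in (1, 2, 3):
    print(n, cached_cost(n, 0.01, 3.0) < uncached_cost(n, 0.01, 3.0))
```

With these multipliers, caching is a net loss on a single call but pays for itself by the second reuse, which is why it rewards long-lived sessions within the TTL window.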

Level 3

Cache matching is prefix-exact in most providers: any change to a cached span invalidates everything downstream of it. Anthropic uses explicit ephemeral cache breakpoints (cache_control blocks); OpenAI's caching is fully automatic. Cost optimizations include ordering prompts so user-specific content comes last, sharing a common system-prompt prefix across users, and amortizing long document reads over many queries. Cache hit rate shows up on provider dashboards (OpenAI platform usage, Anthropic Console), and monitoring it is now table stakes for serious AI apps.

Why this matters now

April 2026: every major provider now ships aggressive prompt caching. Cache hit rate has become the primary pricing lever.

The takeaway for you
If you are a
Researcher
  • Cache hit rate = cached tokens / total input tokens
  • Anthropic: 0.1× read price · 1.25× write premium
  • OpenAI: automatic · 50% discount · no write premium
If you are a
Builder
  • Keep static content at prompt start · cache-friendly prefixes
  • Measure via provider dashboard
  • Aim for 60-95% hit rate in production apps
If you are a
Investor
  • Cache is a pricing lever providers use to reward frequent users
  • High cache-hit workloads (chatbots, agents) can see ~80% effective input-cost reduction
  • Favors providers with generous TTLs and auto-caching
If you are a
Curious Normie
  • A way to make AI cheaper when you reuse the same background info
  • Providers remember the repeated parts of your prompts and charge less for them
  • Can cut input costs by up to 10× for heavy users
Gecko's take

Cache hit rate is the single biggest pricing lever for LLM apps in 2026. Optimize for it before you optimize model choice.

Major providers expose cached-token counts in API responses (Anthropic: cache_read_input_tokens; OpenAI: usage.prompt_tokens_details.cached_tokens). Divide by total input tokens per call, then average over time.
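That measurement can be sketched over Anthropic-style usage dicts, where input_tokens counts only the uncached tokens and cached reads/writes are reported separately (field names follow Anthropic's response format):

```python
def cache_hit_rate(usages: list[dict]) -> float:
    """Aggregate hit rate: cached reads / all input tokens across calls."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    total = sum(u.get("input_tokens", 0)
                + u.get("cache_read_input_tokens", 0)
                + u.get("cache_creation_input_tokens", 0)
                for u in usages)
    return read / total if total else 0.0

calls = [
    {"input_tokens": 50, "cache_creation_input_tokens": 1000},  # first call: write
    {"input_tokens": 50, "cache_read_input_tokens": 1000},      # prefix reused
    {"input_tokens": 50, "cache_read_input_tokens": 1000},
]
print(round(cache_hit_rate(calls), 3))  # → 0.635
```

The first call drags the average down (it wrote the cache rather than hitting it), so longer sessions within the TTL push the aggregate rate toward the 60-95% band cited above.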