
Cache Hit Rate

The percentage of input tokens served from a provider's prompt cache. Cached tokens are billed at a steep discount (as little as 10% of list price), so hit rate directly sets your effective input cost.


Level 1

When a prompt has a long system message or document that repeats across calls, providers cache it, and the cached portion is cheaper on subsequent calls: Anthropic charges 10% of the normal input price on cache hits, while OpenAI discounts cached tokens by 50%. Cache hit rate is the percentage of your input tokens served from the cache. At a 90% hit rate with 0.1× reads, a $3/M model becomes an effective ~$0.57/M model.
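The blended-price arithmetic above can be sketched directly. A minimal sketch, assuming Anthropic-style 0.1× cache reads and ignoring the one-time cache-write premium:

```python
def effective_input_price(list_price_per_m: float, hit_rate: float,
                          read_multiplier: float = 0.1) -> float:
    """Blended $/M input tokens: cached tokens billed at a fraction
    of list price, uncached tokens at full price."""
    return list_price_per_m * (hit_rate * read_multiplier + (1 - hit_rate))

# A $3/M model with a 90% hit rate at 0.1x reads:
print(round(effective_input_price(3.0, 0.90), 2))  # → 0.57
```

Set `read_multiplier=0.5` to model an OpenAI-style 50% discount instead.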

Level 2

Anthropic's caching has a 5-minute or 1-hour TTL: you pay a premium to write to the cache (1.25× list price for the 5-minute tier, 2× for 1-hour), then 0.1× for reads. OpenAI caches automatically, with a shorter effective lifetime and no write premium. Google Gemini's explicit context caching is priced by cached context size and storage duration. Hit rate depends on prompt design: keep static content at the front, split cacheable blocks from dynamic ones, and reuse the same prefix across sessions. Well-designed production AI apps typically see 60-95% cache hit rates.
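The write premium means caching only pays off when a prefix is reused. A quick break-even sketch with the 5-minute-tier multipliers above (1.25× write, 0.1× read):

```python
def cached_cost(n_calls: int, prefix_tokens_m: float, price: float,
                write_mult: float = 1.25, read_mult: float = 0.1) -> float:
    """Cost of a shared prefix over n calls: one cache write, then reads."""
    return prefix_tokens_m * price * (write_mult + read_mult * (n_calls - 1))

def uncached_cost(n_calls: int, prefix_tokens_m: float, price: float) -> float:
    """Same prefix billed at full list price on every call."""
    return prefix_tokens_m * price * n_calls

# A 10k-token prefix (0.01 M tokens) on a $3/M model:
for n in (1, 2, 3):
    print(n, cached_cost(n, 0.01, 3.0) < uncached_cost(n, 0.01, 3.0))
```

With these multipliers, caching is a net loss on a single call but pays for itself by the second reuse, which is why it rewards long-lived sessions within the TTL window.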

Level 3

Cache matching is prefix-exact in most providers: any change to a cached span invalidates everything downstream of it. Anthropic uses explicit ephemeral cache breakpoints (cache_control blocks); OpenAI's caching is fully automatic. Cost optimizations include ordering prompts so user-specific content comes last, sharing a common system-prompt prefix across users, and amortizing long document reads over many queries. Cache hit rate shows up on provider dashboards (OpenAI platform usage, Anthropic Console), and monitoring it is now table stakes for serious AI apps.

Why this matters now

April 2026: every major provider now ships aggressive prompt caching. Cache hit rate has become the primary pricing lever.

The takeaway for you
If you are a
Researcher
  • Cache hit rate = cached tokens / total input tokens
  • Anthropic: 0.1× read price · 1.25× write premium
  • OpenAI: automatic · 50% discount · no write premium
If you are a
Builder
  • Keep static content at prompt start · cache-friendly prefixes
  • Measure via provider dashboard
  • Aim for 60-95% hit rate in production apps
If you are a
Investor
  • Cache is a pricing lever providers use to reward frequent users
  • High cache-hit workloads (chatbots, agents) can see ~80% effective input-cost reduction
  • Favors providers with generous TTLs and auto-caching
If you are a
Curious Normie
  • A way to make AI cheaper when you reuse the same background info
  • Providers remember the repeated parts of your prompts and charge less for them
  • Can cut input costs by up to 10× for heavy users
Gecko's take

Cache hit rate is the single biggest pricing lever for LLM apps in 2026. Optimize for it before you optimize model choice.

Major providers expose cached-token counts in API responses (Anthropic: cache_read_input_tokens; OpenAI: usage.prompt_tokens_details.cached_tokens). Divide by total input tokens per call, then average over time.
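That measurement can be sketched over Anthropic-style usage dicts, where input_tokens counts only the uncached tokens and cached reads/writes are reported separately (field names follow Anthropic's response format):

```python
def cache_hit_rate(usages: list[dict]) -> float:
    """Aggregate hit rate: cached reads / all input tokens across calls."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    total = sum(u.get("input_tokens", 0)
                + u.get("cache_read_input_tokens", 0)
                + u.get("cache_creation_input_tokens", 0)
                for u in usages)
    return read / total if total else 0.0

calls = [
    {"input_tokens": 50, "cache_creation_input_tokens": 1000},  # first call: write
    {"input_tokens": 50, "cache_read_input_tokens": 1000},      # prefix reused
    {"input_tokens": 50, "cache_read_input_tokens": 1000},
]
print(round(cache_hit_rate(calls), 3))  # → 0.635
```

The first call drags the average down (it wrote the cache rather than hitting it), so longer sessions within the TTL push the aggregate rate toward the 60-95% band cited above.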