Parag Darade

Cache Hit Rate Is the Cost Lever Your Team Is Probably Ignoring

I have watched teams spend a month on model selection benchmarks — GPT-4o versus Claude Sonnet 4.5 versus Gemini 2.5 Pro — then deploy with a prompt structure that breaks cache hits on every single request, paying three to five times more than they should for work the provider has already done. The model selection decision is worth something. The prompt structure decision is worth more. For any workload with repeated or agentic patterns, it is not close.

The mechanism is the KV cache. Every major LLM API — Anthropic, OpenAI, Google — reuses computation when a new request begins with tokens it has already processed. Anthropic charges cache reads at 10 percent of the standard input price. A clean cache hit is a 90 percent discount on those tokens. The catch is that cache hits only fire when the prefix — the exact sequence of tokens at the start of your request — matches what was previously cached. Move one token, change one value, and the cache restarts. This is why the position of dynamic content inside your prompt is the variable that determines whether any of this pays off.
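To make the prefix rule concrete, here is a toy sketch. The tokenizer is a crude stand-in and none of this is any provider's actual implementation, but the matching logic is the same: the reusable span ends at the first diverging token.

```python
BIG_STATIC_PROMPT = "You are an agent. " * 1000  # stands in for a ~20k-token system prompt

def tokenize(text: str) -> list[int]:
    # Toy word-level tokenizer; real providers use BPE, but prefix
    # matching behaves the same way.
    return [hash(word) for word in text.split()]

def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the common token prefix: the only span a cache can reuse."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Timestamp at the head: the reusable prefix ends almost immediately.
bad_a = tokenize("Time: 10:00:01 " + BIG_STATIC_PROMPT)
bad_b = tokenize("Time: 10:00:02 " + BIG_STATIC_PROMPT)
print(shared_prefix_len(bad_a, bad_b))   # 1: only "Time:" matches

# Timestamp at the tail: the entire static prompt stays reusable.
good_a = tokenize(BIG_STATIC_PROMPT + "Time: 10:00:01")
good_b = tokenize(BIG_STATIC_PROMPT + "Time: 10:00:02")
print(shared_prefix_len(good_a, good_b))  # 4001: everything up to the timestamp
```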

What happens when you get it wrong

ProjectDiscovery ships Neo, an agentic task runner built on Claude Opus 4.5. In early February 2025, their cache hit rate sat at 4.2 percent. Their system prompts ran to 2,500 lines of YAML — roughly 20,000 tokens per agent — but working memory, skills context, and runtime session identifiers were embedded inside that prefix. Every request began with a slightly different token sequence. The cache never fired.

The cost difference between a 3 percent cache rate and a 91 percent cache rate on identical workloads is not linear. ProjectDiscovery found a comparison task that ran 66.8 million input tokens at 3.2 percent cache utilization pre-optimization. Equivalent tasks post-optimization ran at 91.8 percent cache rates. The cost differential was roughly 60x for the same computation. Not 60 percent. Sixty times.

What they changed was the position of dynamic content. Working memory, runtime context, and session-specific data moved out of the cacheable prefix into the user message tail. The static YAML system prompt stayed put, with an explicit cache breakpoint marking it cacheable. One structural change shifted cache rates from 4.2 percent to 73.7 percent in two weeks. By March 2025, the rate was 84.3 percent, 9.8 billion tokens had been served from cache, and their overall token bill had dropped 59 percent. The implementation is not a new service or a new model — it is a reordering of content within requests you are already making.
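For reference, a minimal sketch of that shape against Anthropic's Messages API. The system prompt, session values, and model id here are placeholders, but cache_control is the real mechanism: Anthropic's documented way of marking an explicit breakpoint.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder: in Neo's case this would be the unchanging 2,500-line YAML.
STATIC_SYSTEM_PROMPT = "...static agent instructions, skills, policies..."

def run_task(task: str, session_id: str, working_memory: str):
    return client.messages.create(
        model="claude-opus-4-5",  # model id for illustration
        max_tokens=2048,
        system=[
            {
                "type": "text",
                "text": STATIC_SYSTEM_PROMPT,
                # Explicit cache breakpoint: everything up to and including
                # this block is eligible for caching.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[
            {
                "role": "user",
                # Dynamic, per-request state lives after the breakpoint,
                # so it can change freely without invalidating the prefix.
                "content": (
                    f"session: {session_id}\n"
                    f"working memory:\n{working_memory}\n\n"
                    f"task: {task}"
                ),
            }
        ],
    )
```

Per Anthropic's docs, the prefix is assembled tools first, then system, then messages, and a breakpoint covers everything before it, so a breakpoint on the system block also caches the tool definitions ahead of it.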

The research backs this up across providers

A January 2026 paper on prompt caching for agentic tasks tested caching strategies across 500-plus agent sessions on GPT-5.2, Claude Sonnet 4.5, Gemini 2.5 Pro, and GPT-4o, all with 10,000-token system prompts on PhD-level research tasks. Cost savings ranged from 41 to 80 percent. But the distribution matters: the best-performing strategy for every model was caching the system prompt only — not full context.

The counterintuitive finding was that caching too aggressively makes things worse. GPT-4o showed an 8.8 percent latency regression when full-context caching was on, because volatile tool results in the middle of the context created cache mismatches that added overhead instead of reducing it. The same model with system-prompt-only caching showed a 30.9 percent latency improvement. Claude Sonnet 4.5 with system-prompt-only caching achieved 78.5 percent cost savings and a 22.9 percent reduction in time-to-first-token. The lesson generalizes: a misplaced cache boundary costs you overhead without delivering savings, while a well-placed one compounds.

Three questions worth asking about your prompts today

First: where does your dynamic content live? If timestamps, user identifiers, session context, or per-request state are inside your system prompt — especially near the start — they are breaking cache hits on every request. Move them to the user message, as in the ProjectDiscovery sketch above.

Second: are your tool definitions ordered consistently? The arXiv paper found that dynamic tool sets — where tool lists change between requests — break cache hits because the prefix diverges. If you are adding or removing tools based on user context, that variation belongs at the tail, not the front.
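One way to hold that boundary, sketched with hypothetical schemas and a hypothetical helper: always send the full tool list in a fixed order, and state per-session availability in the user message.

```python
# Placeholder schemas; real definitions carry full input_schema objects.
TOOL_DEFINITIONS = [
    {"name": "run_scan", "description": "Run a scan", "input_schema": {"type": "object", "properties": {}}},
    {"name": "read_file", "description": "Read a file", "input_schema": {"type": "object", "properties": {}}},
]

# Fixed, sorted order: the tools block is byte-identical on every request.
ALL_TOOLS = sorted(TOOL_DEFINITIONS, key=lambda t: t["name"])

def build_request(task: str, disabled_tools: set[str]) -> dict:
    # Per-session availability is stated in the user message (the tail),
    # not by mutating the tools array (part of the cacheable prefix).
    note = (
        f"Tools unavailable this session: {', '.join(sorted(disabled_tools))}.\n\n"
        if disabled_tools
        else ""
    )
    return {
        "tools": ALL_TOOLS,
        "messages": [{"role": "user", "content": note + task}],
    }
```

The trade-off is that the model can still attempt a tool you have declared unavailable, so you reject those calls at execution time; what you buy is a byte-identical prefix.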

Third: what is your actual cache hit rate right now? Anthropic's API returns cache read token counts in the usage field of every response. If you are not logging cache_read_input_tokens alongside your standard token counts, you have no visibility into whether caching is doing anything at all. In my experience, teams logging this number for the first time are surprised by how low it is — and they usually find a misplaced dynamic value in the first fifteen minutes of investigation.
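A minimal logging sketch, assuming the official anthropic Python SDK. The field names are the documented usage fields; defining hit rate as reads over total input tokens is a reasonable convention, not an official metric.

```python
import logging

log = logging.getLogger("llm.cache")

def log_cache_usage(response) -> float:
    """Log cache metrics from an Anthropic Messages API response object."""
    u = response.usage
    read = getattr(u, "cache_read_input_tokens", 0) or 0
    written = getattr(u, "cache_creation_input_tokens", 0) or 0
    uncached = u.input_tokens  # tokens neither read from nor written to cache
    total = read + written + uncached
    hit_rate = read / total if total else 0.0
    log.info(
        "cache_read=%d cache_write=%d uncached=%d hit_rate=%.1f%%",
        read, written, uncached, 100 * hit_rate,
    )
    return hit_rate
```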

The pricing math closes quickly. Anthropic charges 1.25x the base input price for a cache write at the 5-minute TTL, and 0.10x for reads. A single cache hit saves 0.90x, which more than recovers the 0.25x write premium. A prompt written once and read nine more times runs at roughly 22 percent of the uncached cost. At any production volume above a few hundred requests per day, that breakeven happens on day one.
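The arithmetic, spelled out:

```python
def relative_cost(reads: int, write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Cost of one cache write plus `reads` subsequent hits on the same
    prefix, relative to serving all (reads + 1) requests uncached at 1.0x."""
    return (write_mult + reads * read_mult) / (reads + 1)

print(relative_cost(1))  # 0.675: one hit already beats uncached
print(relative_cost(9))  # 0.215: ten requests at roughly a fifth the cost
```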

The boring optimization wins here, as it usually does: audit where your static content ends and your dynamic content begins, draw that boundary explicitly with a cache breakpoint, and let the provider do the work it is already equipped to do.
