Parag Darade

The Prompt Caching Mistake That's Costing You 70% More Than You Need to Pay

Here is a specific claim: most teams building on Claude or GPT-4o are paying three to ten times more per token than they need to, and the reason is not model selection or request volume — it is prompt construction.

Prompt caching has been available on Anthropic's API since August 2024 and on OpenAI's since October 2024. The economics are not subtle: Anthropic cache reads cost $0.30 per million tokens versus $3.00 per million for fresh tokens, a 90 percent reduction. OpenAI's automatic caching runs at roughly 50 percent off. If your application sends any system prompt longer than a few hundred tokens — and most do — you are either capturing that discount or you are not, and the line between those two outcomes is exactly one architectural decision most teams have never made deliberately.

Why most applications do not actually capture the discount

Prefix caching works by hashing the leading bytes of your prompt against a stored version. If the leading bytes match, you pay the cache-read price. If they do not, you pay full price and the provider stores the new prefix for the next call.

The mistake I have seen repeatedly is teams interleaving dynamic content — the user's name, a session ID, a timestamp, the current date — into the beginning or middle of their system prompt. It looks like a small thing. A system prompt that opens with "Today is April 30, 2026. You are assisting {user_name}..." is entirely uncacheable, because the prefix changes with every request. You have spent fifteen thousand tokens of careful prompt engineering writing tool definitions, personas, and step-by-step instructions, and none of it is ever reused.

The fix is architectural, not subtle: static content at the front, dynamic content at the end. Everything that varies per-request belongs in the user turn or at the tail of the message array, never in the prefix.
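Here is a minimal sketch of the difference, written in OpenAI-style chat format; the prompt text, names, and dates are placeholders, and the same structure applies to Anthropic's Messages API.

```python
import datetime

SYSTEM_PROMPT = (
    "You are a support assistant for Acme Corp.\n"
    "<~15,000 tokens of persona, tool definitions, and step-by-step instructions>"
)

def build_messages_uncacheable(user_name: str, question: str) -> list[dict]:
    # Anti-pattern: the date and user name sit at the front of the system
    # prompt, so the hashed prefix changes on every request and none of the
    # expensive static instructions are ever served from cache.
    return [
        {"role": "system",
         "content": f"Today is {datetime.date.today():%B %d, %Y}. "
                    f"You are assisting {user_name}. " + SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]

def build_messages_cacheable(user_name: str, question: str) -> list[dict]:
    # Fix: the system prompt stays byte-identical on every call; everything
    # that varies per request moves to the tail, inside the user turn.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",
         "content": f"Today is {datetime.date.today():%B %d, %Y}. "
                    f"User: {user_name}\n\n{question}"},
    ]
```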

What that looks like in production

ProjectDiscovery, the security tooling company behind Nuclei, published a detailed post-mortem in early 2025 on how they cut LLM costs by 59 percent overall — reaching 70 percent in the most recent ten-day window of their measurement period. Their agentic system runs on Claude with system prompts exceeding 20,000 tokens per agent, and the average task runs 26 steps with 40 tool calls. At that scale, every prompt construction mistake compounds badly.

Their starting cache hit rate was under 8 percent. After deploying what they call the relocation trick — moving working memory and runtime context from the cacheable prefix to the message tail — their cache hit rate jumped to 74 percent overnight. One case in their logs shows a single task processing 67.5 million input tokens at a 91.8 percent cache hit rate. That is a different cost structure entirely, not a marginal improvement.

The implementation involved three explicit cache breakpoints: one at the end of the static system prompt with a one-hour TTL shared across users, one at the last static tool definition, and one at a sliding window of recent conversation history with a five-minute TTL. Stable template variables replaced actual values in the system prompt to maintain byte-identical prompts across different users. The architecture is not complex, but it requires thinking about your prompt as having structure — a static spine with dynamic arms — rather than as a single string you construct fresh at request time.
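A rough sketch of that breakpoint layout using Anthropic's explicit cache_control markers follows. The model name, tool, and prompt contents are placeholders, and the one-hour TTL assumes the extended cache TTL option is available on your API version; treat this as an illustration of where the breakpoints go, not as ProjectDiscovery's actual code.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholders standing in for the real multi-thousand-token static content.
STATIC_SYSTEM_PROMPT = "You are a scanning agent. <templated static instructions> ..."
older_turns: list[dict] = []          # sliding window of prior conversation turns
latest_user_message = "Scan example.com for exposed admin panels."

response = client.messages.create(
    model="claude-sonnet-4-5",        # illustrative model name
    max_tokens=1024,
    tools=[
        # ...earlier static tool definitions...
        {
            # Hypothetical tool; the marker goes on the LAST static tool definition.
            "name": "run_scan",
            "description": "Run a template scan against a target.",
            "input_schema": {"type": "object",
                             "properties": {"target": {"type": "string"}}},
            "cache_control": {"type": "ephemeral", "ttl": "1h"},   # breakpoint 2: static tools
        },
    ],
    system=[
        {
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,  # byte-identical across users (template variables only)
            "cache_control": {"type": "ephemeral", "ttl": "1h"},   # breakpoint 1: shared, 1-hour TTL
        },
    ],
    messages=[
        *older_turns,
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": latest_user_message,
                    "cache_control": {"type": "ephemeral"},        # breakpoint 3: 5-minute default TTL
                },
            ],
        },
    ],
)
```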

What the research confirms

A January 2026 paper from researchers at Tsinghua and Microsoft, "Don't Break the Cache," evaluated caching strategies across OpenAI, Anthropic, and Google over 500 agentic sessions using a multi-turn benchmark with 10,000-token system prompts and up to 50 tool calls per session. The measured cost reduction range was 41 to 80 percent depending on provider and strategy, with time-to-first-token improving by 13 to 31 percent as a secondary effect.

The paper's central finding matches what ProjectDiscovery discovered operationally: full-context caching can paradoxically increase latency because dynamic tool results appearing mid-context invalidate the cache entry. The winning strategy across every provider was placing dynamic content at the end of the prompt, never in the middle. Strategic cache block placement — not just enabling caching — is what separates teams at 10 percent hit rates from teams at 75 percent.

What to do if you are not measuring this yet

Add cache hit rate to your request logging today. Both Anthropic and OpenAI return cache usage metadata in their API responses. Anthropic provides cache_creation_input_tokens and cache_read_input_tokens directly in the usage object; OpenAI reports the cached portion of the prompt under prompt_tokens_details.cached_tokens. If you are not recording those fields, you cannot see whether you are paying full price on every call, which means you almost certainly are.
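A minimal sketch of what that logging can look like on the Anthropic side; the model name and prompts are placeholders, and the hit-rate formula assumes input_tokens covers only the uncached portion of the prompt, with cache reads and cache writes reported separately.

```python
import anthropic

client = anthropic.Anthropic()
SYSTEM_PROMPT = "<your static system prompt>"    # placeholder

response = client.messages.create(
    model="claude-sonnet-4-5",                   # illustrative model name
    max_tokens=512,
    system=[{"type": "text", "text": SYSTEM_PROMPT,
             "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "How do I rotate my API key?"}],
)

usage = response.usage
cache_read = usage.cache_read_input_tokens or 0        # tokens served from cache
cache_write = usage.cache_creation_input_tokens or 0   # tokens written to cache this call
uncached = usage.input_tokens                          # fresh, full-price input tokens
total_input = cache_read + cache_write + uncached
hit_rate = cache_read / total_input if total_input else 0.0

# Emit these as structured log fields or metrics so hit rate is visible per route.
print({"cache_read": cache_read, "cache_write": cache_write,
       "uncached_input": uncached, "cache_hit_rate": round(hit_rate, 3)})
```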

Once you are measuring, audit your system prompt construction. Look for any field that changes between requests — user ID, current date, session state — and move it to the tail. If you are using function calling, make sure your tool definitions are static and appear before any dynamic content. If you are building agentic systems with multi-turn history, mark your cache breakpoints explicitly on Anthropic, or structure your message array so that the first several turns are stable across sessions.
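One way to run that audit without eyeballing prompts by hand: serialize two consecutive requests for the same route and find the byte offset where they first diverge. This is a hypothetical helper, not a provider API, and JSON byte position is only a rough proxy for the tokenized prefix, but it points straight at the field that is breaking the cache.

```python
import json

def prefix_divergence(request_a: dict, request_b: dict) -> int:
    """Return the byte offset at which two serialized requests first differ.

    Audit heuristic only: if the divergence point lands inside your system
    prompt or tool definitions, that field is what is invalidating the cache.
    """
    a = json.dumps(request_a).encode()
    b = json.dumps(request_b).encode()
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))

# Example with two captured requests: a tiny offset means almost none of the
# prefix is reusable from one call to the next.
req_a = {"system": "Today is 2026-04-30. You are assisting Ada. ...", "messages": []}
req_b = {"system": "Today is 2026-05-01. You are assisting Bob. ...", "messages": []}
print(prefix_divergence(req_a, req_b))
```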

The numbers are real. At production scale — tens of millions of tokens per day — the difference between an 8 percent cache hit rate and a 75 percent cache hit rate is a monthly bill that looks like a different company. The engineering to get there is a few hours of work. The actual win was never about which model you picked.
