Prompt Caching Works. Your Prompt Assembly Code Does Not.
I have watched teams enable Anthropic's prompt caching, wait a billing cycle, and conclude that the advertised 90% discount on input tokens is marketing fiction. It is not. The discount is real — Anthropic charges $0.30 per million tokens for cache reads against $3.00 for fresh input, a genuine 10x difference. What is fiction is the assumption that flipping the flag is sufficient.
The failure mode is architectural. The default way engineers build LLM applications — dynamically assembling prompts from system instructions, retrieved context, conversation history, and user input — produces prompts that defeat the cache on every single call, regardless of what the documentation says.
What prefix invariance actually means
Anthropic's cache operates on prefix invariance. It checks the prompt from the beginning outward. The cached prefix must be byte-for-byte identical to a prior request. The moment any content changes, the cache misses for that position and everything that follows it.
This seems obvious until you look at how most production prompt assembly actually works. A typical chain: [system prompt] + [RAG chunks from this query] + [conversation history] + [user message]. The RAG chunks differ between requests — by definition — so the stable prefix ends where they begin. Everything downstream of them, including conversation history that could have been reused turn after turn, can never be cached, and if the breakpoint sits after the assembled context, the cache sees a novel prompt on every call. Dynamic content injected upstream of static content defeats caching for everything that follows, even though the system prompt is identical across every request.
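A sketch of the two orderings, in deliberately provider-agnostic Python — the function names and message shapes are illustrative, not any particular SDK:

```python
def assemble_cache_hostile(system_prompt, history, query, retrieve):
    # Retrieved chunks change with every query. Placing them ahead of the
    # conversation history means the stable prefix ends at the first chunk.
    context = "\n".join(retrieve(query))
    return [
        {"role": "system", "content": system_prompt + "\n\n" + context},
        *history,
        {"role": "user", "content": query},
    ]

def assemble_cache_friendly(system_prompt, history, query, retrieve):
    # Static content first, dynamic content last: the retrieved chunks ride
    # inside the final user message, after every cacheable block.
    context = "\n".join(retrieve(query))
    return [
        {"role": "system", "content": system_prompt},
        *history,
        {"role": "user", "content": f"{query}\n\nRetrieved context:\n{context}"},
    ]
```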
Anthropic requires a minimum of 1024 tokens in the cached block and supports up to four explicit breakpoints per prompt. These parameters are not the bottleneck. The bottleneck is content ordering.
From 7% to 85% in one deployment
ProjectDiscovery runs an AI security research platform built on agent swarms. Each task averages 26 steps and 40 tool calls, working from a system prompt that exceeds 2,500 lines of YAML — over 20,000 tokens per agent. The economics of caching a system prompt that size are not subtle: sent 100 times, it costs roughly $6.00 at fresh input pricing and $0.67 with caching. They had every incentive to get this right.
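For the curious, the arithmetic behind those two figures, assuming the pricing quoted above ($3.00/M fresh input, $3.75/M cache writes, $0.30/M cache reads) and a 20,000-token static prompt:

```python
tokens = 20_000   # static system prompt size
calls = 100

fresh  = calls * tokens * 3.00 / 1e6                                  # every call pays full input price
cached = tokens * 3.75 / 1e6 + (calls - 1) * tokens * 0.30 / 1e6      # one cache write, 99 cache reads

print(f"fresh:  ${fresh:.2f}")    # $6.00
print(f"cached: ${cached:.2f}")   # $0.67
```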
Their initial cache hit rate was 7%.
The diagnosis was prompt structure. Dynamic task context — the current scan target, task parameters, variable tool outputs — was being injected into the cacheable prefix before the static system prompt content. From the cache's perspective, every request opened with novel content. The 20,000-token system prompt that should have dominated the cached prefix was sitting downstream of tokens that changed on every call.
The fix was architectural, not technical: relocate all dynamic content from the cacheable prefix to the tail of the prompt, after the cache breakpoints, delivered as part of the user message rather than embedded in the system prompt. They also structured three explicit breakpoints — one for the static system prompt, one for the conversation sliding window, one for tool definitions. A single deployment on February 16 moved the hit rate from 7% to 73.7%. By March 23 it had reached 85%. The cost reduction was 59% overall and climbing toward 70% in the most recent measurement window.
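A minimal sketch of what that three-breakpoint layout looks like in the Messages API with the Python SDK. The model id, tool definition, and prompt contents are placeholders; the point is where the cache_control markers sit and where the dynamic context does not:

```python
import anthropic

client = anthropic.Anthropic()

STATIC_SYSTEM_PROMPT = open("agent_system_prompt.yaml").read()  # static, ~20k tokens

# Breakpoint on the last tool definition caches the whole (static) tool list.
tools = [
    {
        "name": "run_scan",  # hypothetical tool for illustration
        "description": "Run a security scan against a target.",
        "input_schema": {
            "type": "object",
            "properties": {"target": {"type": "string"}},
            "required": ["target"],
        },
        "cache_control": {"type": "ephemeral"},  # breakpoint: tool definitions
    },
]

# Conversation sliding window: stable across steps, so it gets its own breakpoint.
history = [
    {"role": "user", "content": "Start the scan."},
    {"role": "assistant", "content": [
        {"type": "text", "text": "Scanning.",
         "cache_control": {"type": "ephemeral"}},  # breakpoint: sliding window
    ]},
]

# Dynamic task context rides in the final user message, after every breakpoint.
dynamic_context = "Current target: example.com\nStep: 27"

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model id
    max_tokens=1024,
    tools=tools,
    system=[
        {"type": "text", "text": STATIC_SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}},  # breakpoint: static system prompt
    ],
    messages=history + [{"role": "user", "content": dynamic_context}],
)
```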
For their longest agentic tasks — one ran to 1,663 steps and 57.5 million input tokens — cache rates hit 92.9%. At that scale, the difference between a 7% and 93% cache rate on a single task is not rounding error. It is the difference between running the task economically and not running it at all.
The parallel request trap
There is a second structural failure mode that hits applications using parallel LLM calls for throughput — batch document analysis, concurrent summarization, fan-out agent patterns.
Thomson Reuters Labs published a specific breakdown of this problem. Their pipeline ingested a 30,000-token document and ran multiple analytical queries against it in parallel to reduce latency. Cache hit rate without modification: 4.2%.
The cause is a race condition in cache population. When two parallel requests arrive simultaneously against a prefix that has no existing cache entry, both trigger a cache write. The second write is redundant — you pay the $3.75/M write premium twice, and the second entry is wasted. Every subsequent request that arrives before any cache entry is established repeats this. In a burst of parallel calls, you can write the same prefix dozens of times and read it zero times in the same request window.
The fix is cache warming: a single synchronous call to establish the cache entry before the parallel batch is dispatched. The warming call costs 3.98 seconds of overhead. Against a session with three parallel queries, that overhead is roughly 5% of total session time. Against a session with twenty queries, under 1%. The cost comparison on their 30,000-token document with three questions: $0.34 without warming, $0.14 with it — 60% cheaper, from a wrapper function that fires one request before releasing the batch.
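A minimal sketch of that warming wrapper, assuming the shared document lives in the system block behind a single breakpoint. The helper names and model id are mine, not Thomson Reuters':

```python
import anthropic
from concurrent.futures import ThreadPoolExecutor

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-20250514"  # placeholder model id

def ask(document: str, question: str):
    """One analytical query against the shared 30,000-token document."""
    return client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=[{"type": "text", "text": document,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": question}],
    )

def warm_cache(document: str):
    """Single synchronous call whose only job is to pay the cache write once,
    so the parallel batch reads the cached prefix instead of racing to write it."""
    client.messages.create(
        model=MODEL,
        max_tokens=1,  # we only want the side effect of populating the cache
        system=[{"type": "text", "text": document,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": "ok"}],
    )

def analyze(document: str, questions: list[str]):
    warm_cache(document)                 # one request of latency overhead
    with ThreadPoolExecutor() as pool:   # fan out against the now-warm prefix
        return list(pool.map(lambda q: ask(document, q), questions))
```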
This failure produces correct results. Both code paths return valid completions. The only signal that something is wrong is a bill that is higher than it should be, which most teams attribute to volume rather than structure.
Where to look
Find every place in your codebase where a prompt is assembled. Identify which content is stable across requests and which is dynamic. If dynamic content appears before any cache breakpoint, you have a structural problem that no amount of breakpoint configuration will fix.
The three offenders I see most often: RAG chunks injected into the system prompt block rather than the user message, user-specific metadata prepended as a system prefix, and timestamp or request-ID fields inadvertently baked into the cacheable portion for debugging purposes.
Once you restructure for a stable prefix, add cache_control: {"type": "ephemeral"} at the end of the static block and watch the cache_read_input_tokens field in the response. If that field is zero on requests after the first, your prefix is still changing. The field will tell you immediately whether the fix held.
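A small check you can drop next to your existing calls, assuming the Python SDK, whose usage object exposes both cache counters on every Messages response:

```python
def report_cache_usage(response) -> None:
    """Telemetry after restructuring: on every request after the first,
    reads should dominate writes."""
    usage = response.usage
    print(f"cache_creation_input_tokens={usage.cache_creation_input_tokens} "
          f"cache_read_input_tokens={usage.cache_read_input_tokens}")
    if usage.cache_read_input_tokens == 0:
        print("warning: no cache read — the prefix changed, or the block is under the minimum")
```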
The savings that caching advertises are real. They are just gated behind understanding that the cache cannot compensate for a prompt assembly pattern that was never designed with prefix stability in mind.