
Parag Darade

The Prompt Tax Most LLM Teams Are Silently Paying

Anthropic shipped prompt caching in August 2024. Nearly two years later, Datadog's State of AI Engineering report found that only 28 percent of LLM API calls across their observed production deployments show cached-read tokens — despite the fact that 69 percent of all input tokens in those same deployments live in system prompts. The math is not subtle: most teams are sending the same fifty thousand tokens on every request and paying full rate for all of them.

This is not an obscure optimization from a recent release. Both Anthropic and OpenAI have had prompt caching available for over a year. OpenAI applies it automatically on GPT-4o calls longer than 1,024 tokens, at a 50 percent discount, requiring zero code changes. Anthropic's implementation requires marking your cache breakpoints explicitly, but the discount is steeper: cache reads on Claude Sonnet cost $0.30 per million tokens versus $3.00 per million for fresh input — a 90 percent reduction. The teams behind the other 72 percent of calls are not missing some edge-case optimization. They are missing the most obvious cost lever available.

Why System Prompts Are the Structural Problem

Most LLM applications I have seen share the same shape: a system prompt running anywhere from five hundred to fifty thousand tokens — instructions, persona text, policy constraints, tool definitions, few-shot examples — followed by a user message and sometimes retrieved context. The system prompt does not change between requests. It is identical for user one and user ten thousand.

This is exactly the workload prompt caching was built for. The model processes the system prompt once, writes the KV cache state to a fast-access store, and on every subsequent request within the cache window, reads from that state instead of recomputing from scratch. You pay the write cost once — 1.25x the normal input rate on Anthropic's five-minute TTL — and then ten percent of the normal rate on every read thereafter.
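To see how those multipliers translate into dollars, here is a rough cost model. It is a sketch, not anyone's real bill: the $3.00-per-million base rate, the 50,000-token prefix, and the 1,000-request window are all assumptions, and it prices only the static prefix, not the dynamic tokens or the output.

```python
# Rough cost model for the static prefix, cached vs. uncached.
# Multipliers per Anthropic's published pricing: cache write = 1.25x base input,
# cache read = 0.10x base input. Base rate and workload figures are assumptions.

BASE_RATE = 3.00 / 1_000_000   # $ per input token (Sonnet-class rate, assumed)
CACHE_WRITE_MULT = 1.25        # first request in the cache window
CACHE_READ_MULT = 0.10         # every later request in the window

def prefix_cost(static_tokens: int, requests: int, cached: bool) -> float:
    """Dollar cost of the static prefix alone across `requests` calls."""
    if not cached:
        return static_tokens * BASE_RATE * requests
    write = static_tokens * BASE_RATE * CACHE_WRITE_MULT
    reads = static_tokens * BASE_RATE * CACHE_READ_MULT * (requests - 1)
    return write + reads

if __name__ == "__main__":
    tokens, calls = 50_000, 1_000
    print(f"uncached: ${prefix_cost(tokens, calls, cached=False):,.2f}")  # ~$150
    print(f"cached:   ${prefix_cost(tokens, calls, cached=True):,.2f}")   # ~$15
```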

Du'An Lightfoot's YouTube analytics bot puts the economics in concrete terms. His system included 81,262 tokens of video metadata JSON in every request — paying $0.24 per call against Claude 3.5 Haiku's base rate. After caching, subsequent requests dropped to $0.024. The monthly bill went from $720 to $72. The only change was marking the cacheable prefix in the API request.

The Thomson Reuters Labs engineering team measured similar numbers on research paper analysis. A 30,000-token document with three parallel queries cost $0.34 per session on Claude 3.5 Sonnet without caching. With a cache-warmed prefix, the same workload ran at $0.14 — a 59 percent reduction — and subsequent queries ran 20 percent faster. For larger prompts the latency gains are more dramatic: Anthropic's own documentation shows a 100,000-token prompt dropping from 11.5 seconds to 2.4 seconds with caching, a 79 percent reduction in time-to-first-token.

Where This Actually Breaks

The implementation failure mode is not the API call. That part is two lines of JSON. The failure mode is prompt structure.

Caching works by prefix matching. Everything up to your first cache_control breakpoint must be byte-for-byte identical across requests for a cache hit to register. This means your cacheable content has to come first in the prompt. If you build prompts dynamically and prepend user-specific context before the system instructions — which is a common pattern when you want to personalize early — you get zero cache hits and no error to investigate. The system just silently processes fresh tokens on every call.

The correct structure is: static system instructions first, then cacheable reference material, then dynamic context, then the user query. The TR Labs team hit a related trap the hard way: they fanned out three document-analysis requests in parallel before any single call had written the cache, and ended up with a 4.2 percent cache hit rate and costs 60 percent higher than their fully uncached baseline. Each parallel thread had written its own redundant cache entry. Their fix was a single synchronous "warming" call before fanning out.
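Here is a minimal sketch of that structure against the Anthropic Messages API, including the warming call before the fan-out. The model ID, file name, and questions are placeholders, and it assumes the current anthropic Python SDK; the parts that matter are the ordering and the cache_control marker on the last static block.

```python
from concurrent.futures import ThreadPoolExecutor
import anthropic

client = anthropic.Anthropic()

# Static material goes first and carries the cache_control marker.
SYSTEM_BLOCKS = [
    {"type": "text", "text": "You are a research-paper analysis assistant..."},  # instructions
    {
        "type": "text",
        "text": open("paper.txt").read(),         # large cacheable reference document (placeholder)
        "cache_control": {"type": "ephemeral"},   # everything up to and including this block is the cached prefix
    },
]

def ask(question: str):
    return client.messages.create(
        model="claude-3-5-sonnet-20241022",       # placeholder model ID
        max_tokens=1024,
        system=SYSTEM_BLOCKS,
        messages=[{"role": "user", "content": question}],  # dynamic part comes last
    )

# 1. Warm the cache with one synchronous call so the prefix is written exactly once.
ask("Summarize the abstract in two sentences.")

# 2. Fan out in parallel; each request now reads the warmed prefix instead of rewriting it.
questions = [
    "What is the main contribution?",
    "List the datasets used.",
    "What are the limitations?",
]
with ThreadPoolExecutor(max_workers=3) as pool:
    answers = list(pool.map(ask, questions))
```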

The second trap is the cache TTL. Anthropic's default window is five minutes. If your workload has gaps longer than that — overnight batch jobs, infrequent API calls, anything without steady traffic throughout the day — you pay the write premium on every call and recover nothing. The one-hour TTL doubles the write cost to 2x the normal input rate, but that premium is recovered within the first two or three requests in the same window. Know your request distribution before picking a TTL.
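The two-or-three-request break-even falls straight out of the multipliers. A quick check, in units of "one uncached send of the prefix":

```python
# Break-even for the one-hour TTL, in units of "one uncached send of the prefix".
# Write = 2.0x base input, each read = 0.1x; uncached = 1.0x per request.
for n in range(1, 6):
    cached = 2.0 + 0.1 * (n - 1)
    uncached = 1.0 * n
    print(f"{n} requests in the window: cached={cached:.1f}, uncached={uncached:.1f}")
# Caching wins once 2.0 + 0.1*(n-1) < n, i.e. n > 1.9/0.9 (about 2.1),
# so from the third request in the same window onward.
```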

The Actual First Step

Before you add a re-ranker, upgrade your embedding model, or benchmark a new chunking strategy: open your provider dashboard, find your average input token count per call, and separate it into system tokens versus dynamic tokens. If the static portion is above two thousand tokens and your request volume is more than a few hundred calls per day, you are probably leaving 50 to 90 percent of your input token spend on the table.
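Putting a rough number on that is one line of arithmetic. A sketch, where every input is a placeholder you swap for the figures from your own dashboard:

```python
# Rough monthly estimate of what the static prefix costs today and what caching
# could recover. Every value here is a placeholder; substitute your own numbers.
static_tokens = 8_000          # tokens identical on every request (from your dashboard)
calls_per_day = 5_000
base_rate     = 3.00 / 1e6     # $ per input token for your model
read_discount = 0.90           # Anthropic cache reads cost ~10% of base

monthly_prefix_spend = static_tokens * base_rate * calls_per_day * 30
potential_savings = monthly_prefix_spend * read_discount  # ignores write premiums, so slightly optimistic
print(f"static-prefix spend: ${monthly_prefix_spend:,.0f}/mo, "
      f"potential savings: ~${potential_savings:,.0f}/mo")
```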

On OpenAI, caching is already happening automatically — check whether your prompt structure is prefix-stable enough to be hitting it. On Anthropic, add cache_control markers to your system prompt and reference content, run a day of traffic, and look at cache_read_input_tokens in the response metadata.
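On the Anthropic side, the response's usage block tells you whether the hit actually registered. A minimal check, assuming the Python SDK and reusing the ask() helper from the earlier sketch:

```python
# After a day of traffic with cache_control set, spot-check the usage metadata.
# A healthy steady-state request reports most tokens under cache_read_input_tokens
# and only the dynamic tail under input_tokens.
resp = ask("What are the limitations?")   # ask() defined in the earlier sketch
print("fresh input tokens:", resp.usage.input_tokens)
print("cache write tokens:", resp.usage.cache_creation_input_tokens)
print("cache read tokens: ", resp.usage.cache_read_input_tokens)
```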

The boring optimization is the one that ships in an afternoon, requires no new infrastructure, and cuts your monthly API bill in half. Most teams are still waiting for a reason to look at it.
