<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Parag Darade</title>
    <description>The latest articles on DEV Community by Parag Darade (@parag_d).</description>
    <link>https://dev.to/parag_d</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3903878%2F35b21846-9327-4a4c-8126-39079eee577e.png</url>
      <title>DEV Community: Parag Darade</title>
      <link>https://dev.to/parag_d</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/parag_d"/>
    <language>en</language>
    <item>
      <title>The Prompt Tax Most LLM Teams Are Silently Paying</title>
      <dc:creator>Parag Darade</dc:creator>
      <pubDate>Wed, 29 Apr 2026 19:57:42 +0000</pubDate>
      <link>https://dev.to/parag_d/the-prompt-tax-most-llm-teams-are-silently-paying-1nml</link>
      <guid>https://dev.to/parag_d/the-prompt-tax-most-llm-teams-are-silently-paying-1nml</guid>
      <description>&lt;h1&gt;The Prompt Tax Most LLM Teams Are Silently Paying&lt;/h1&gt;

&lt;p&gt;Anthropic shipped prompt caching in August 2024. Nearly two years later, &lt;a href="https://www.datadoghq.com/state-of-ai-engineering/" rel="noopener noreferrer"&gt;Datadog's State of AI Engineering report&lt;/a&gt; found that only 28 percent of LLM API calls across their observed production deployments show cached-read tokens — despite the fact that 69 percent of all input tokens in those same deployments live in system prompts. The math is not subtle: most teams are sending the same fifty thousand tokens on every request and paying full rate for all of them.&lt;/p&gt;

&lt;p&gt;This is not an obscure optimization from a recent release. Both Anthropic and OpenAI have had prompt caching available for over a year. OpenAI applies it automatically on GPT-4o calls longer than 1,024 tokens, at a 50 percent discount, requiring zero code changes. Anthropic's implementation requires marking your cache breakpoints explicitly, but the discount is steeper: &lt;a href="https://platform.claude.com/docs/en/build-with-claude/prompt-caching" rel="noopener noreferrer"&gt;cache reads on Claude Sonnet cost $0.30 per million tokens&lt;/a&gt; versus $3.00 per million for fresh input, a 90 percent reduction. The teams behind the other 72 percent of calls are not missing some edge-case optimization. They are missing the most obvious cost lever available.&lt;/p&gt;

&lt;h2&gt;Why System Prompts Are the Structural Problem&lt;/h2&gt;

&lt;p&gt;Most LLM applications I have seen share the same shape: a system prompt running anywhere from five hundred to fifty thousand tokens — instructions, persona text, policy constraints, tool definitions, few-shot examples — followed by a user message and sometimes retrieved context. The system prompt does not change between requests. It is identical for user one and user ten thousand.&lt;/p&gt;

&lt;p&gt;This is exactly the workload prompt caching was built for. The model processes the system prompt once, writes the KV cache state to a fast-access store, and on every subsequent request within the cache window, reads from that state instead of recomputing from scratch. You pay the write cost once — 1.25x the normal input rate on Anthropic's five-minute TTL — and then ten percent of the normal rate on every read thereafter.&lt;/p&gt;
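&lt;p&gt;For concreteness, here is a minimal sketch of what marking that breakpoint looks like with Anthropic's Python SDK. The model id and prompt contents are placeholders; the load-bearing part is the &lt;code&gt;cache_control&lt;/code&gt; marker on the static system block.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STATIC_SYSTEM_PROMPT = "You are a support agent..."  # the large block that never changes

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder; any caching-capable model works
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,
            # Everything up to and including this block gets cached.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Where is my order?"}],  # dynamic tail
)
&lt;/code&gt;&lt;/pre&gt;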

&lt;p&gt;&lt;a href="https://labeveryday.medium.com/prompt-caching-is-a-must-how-i-went-from-spending-720-to-72-monthly-on-api-costs-3086f3635d63" rel="noopener noreferrer"&gt;Du'An Lightfoot's YouTube analytics bot&lt;/a&gt; puts the economics in concrete terms. His system included 81,262 tokens of video metadata JSON in every request — paying $0.24 per call against Claude 3.5 Haiku's base rate. After caching, subsequent requests dropped to $0.024. The monthly bill went from $720 to $72. The only change was marking the cacheable prefix in the API request.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://medium.com/tr-labs-ml-engineering-blog/prompt-caching-the-secret-to-60-cost-reduction-in-llm-applications-6c792a0ac29b" rel="noopener noreferrer"&gt;Thomson Reuters Labs engineering team&lt;/a&gt; measured similar numbers on research paper analysis. A 30,000-token document with three parallel queries cost $0.34 per session on Claude 3.5 Sonnet without caching. With a cache-warmed prefix, the same workload ran at $0.14 — a 59 percent reduction — and subsequent queries ran 20 percent faster. For larger prompts the latency gains are more dramatic: Anthropic's own documentation shows a 100,000-token prompt dropping from 11.5 seconds to 2.4 seconds with caching, an 85 percent reduction in time-to-first-token.&lt;/p&gt;

&lt;h2&gt;Where This Actually Breaks&lt;/h2&gt;

&lt;p&gt;The implementation failure mode is not the API call. That part is two lines of JSON. The failure mode is prompt structure.&lt;/p&gt;

&lt;p&gt;Caching works by prefix matching. Everything up to your first &lt;code&gt;cache_control&lt;/code&gt; breakpoint must be byte-for-byte identical across requests for a cache hit to register. This means your cacheable content has to come &lt;em&gt;first&lt;/em&gt; in the prompt. If you build prompts dynamically and prepend user-specific context before the system instructions — which is a common pattern when you want to personalize early — you get zero cache hits and no error to investigate. The system just silently processes fresh tokens on every call.&lt;/p&gt;

&lt;p&gt;The correct structure is: static system instructions first, then cacheable reference material, then dynamic context, then the user query. Ordering is only half the problem; request timing is the other half, and the TR Labs team discovered it the hard way: they launched three document-analysis requests in parallel before any cache entry existed, and ended up with a 4.2 percent cache hit rate and costs 60 percent higher than their fully-uncached baseline. Each parallel thread had written its own redundant cache entry. Their fix was a single synchronous "warming" call before dispatching the batch.&lt;/p&gt;
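&lt;p&gt;As a sketch of that ordering (the function and argument names here are illustrative, not from any of the cited posts), the assembly looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def build_prompt(system_instructions, reference_docs, dynamic_context, user_query):
    """Assemble a cache-friendly Anthropic request body (illustrative sketch)."""
    return {
        "system": [
            # 1. Static instructions: byte-identical on every request.
            {"type": "text", "text": system_instructions},
            # 2. Cacheable reference material, with the breakpoint at its end.
            {
                "type": "text",
                "text": reference_docs,
                "cache_control": {"type": "ephemeral"},
            },
        ],
        "messages": [
            # 3 and 4. Dynamic context and the user query, after the cached prefix.
            {"role": "user", "content": f"{dynamic_context}\n\n{user_query}"}
        ],
    }
&lt;/code&gt;&lt;/pre&gt;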

&lt;p&gt;The second trap is the cache TTL. Anthropic's default window is five minutes. If your workload has gaps longer than that — overnight batch jobs, infrequent API calls, anything without steady traffic throughout the day — you pay the write premium on every call and recover nothing. The one-hour TTL doubles the write cost to 2x the normal input rate, but that premium is recovered within the first two or three requests in the same window. Know your request distribution before picking a TTL.&lt;/p&gt;
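&lt;p&gt;The break-even arithmetic is easy to sanity-check. This toy cost model assumes the Claude Sonnet rates quoted above and that every request after the first lands inside the cache window:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def input_cost(n_requests, prefix_tokens, base_per_m=3.00, write_mult=1.25, read_mult=0.10):
    """Dollar cost of one shared prefix across n requests, uncached vs. cached.

    write_mult is 1.25 for the five-minute TTL, 2.0 for the one-hour TTL.
    """
    unit = prefix_tokens / 1_000_000 * base_per_m
    uncached = n_requests * unit
    cached = unit * write_mult + (n_requests - 1) * unit * read_mult
    return uncached, cached

# One-hour TTL on a 20,000-token prefix: the 2x write premium is
# recovered by the third request inside the window.
for n in (1, 2, 3, 10):
    uncached, cached = input_cost(n, 20_000, write_mult=2.0)
    print(f"n={n}: uncached ${uncached:.3f}  cached ${cached:.3f}")
&lt;/code&gt;&lt;/pre&gt;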

&lt;h2&gt;The Actual First Step&lt;/h2&gt;

&lt;p&gt;Before you add a re-ranker, upgrade your embedding model, or benchmark a new chunking strategy: open your provider dashboard, find your average input token count per call, and separate it into system tokens versus dynamic tokens. If the static portion is above two thousand tokens and your request volume is more than a few hundred calls per day, you are probably leaving 50 to 90 percent of your input token spend on the table.&lt;/p&gt;

&lt;p&gt;On OpenAI, caching is already happening automatically — check whether your prompt structure is prefix-stable enough to be hitting it. On Anthropic, add &lt;code&gt;cache_control&lt;/code&gt; markers to your system prompt and reference content, run a day of traffic, and look at &lt;code&gt;cache_read_input_tokens&lt;/code&gt; in the response metadata.&lt;/p&gt;
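&lt;p&gt;The check itself is a few lines, assuming a &lt;code&gt;response&lt;/code&gt; object from a Messages API call like the sketch earlier in this post:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;usage = response.usage
print("fresh input tokens :", usage.input_tokens)
print("cache write tokens :", usage.cache_creation_input_tokens)
print("cache read tokens  :", usage.cache_read_input_tokens)

# Healthy steady state: cache_read_input_tokens dominates, and
# cache_creation_input_tokens is nonzero only on the first call
# in each cache window.
&lt;/code&gt;&lt;/pre&gt;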

&lt;p&gt;The boring optimization is the one that ships in an afternoon, requires no new infrastructure, and cuts your monthly API bill in half. Most teams are still waiting for a reason to look at it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Prompt Caching Works. Your Prompt Assembly Code Does Not.</title>
      <dc:creator>Parag Darade</dc:creator>
      <pubDate>Wed, 29 Apr 2026 18:31:37 +0000</pubDate>
      <link>https://dev.to/parag_d/prompt-caching-works-your-prompt-assembly-code-does-not-5edc</link>
      <guid>https://dev.to/parag_d/prompt-caching-works-your-prompt-assembly-code-does-not-5edc</guid>
      <description>&lt;h1&gt;Prompt Caching Works. Your Prompt Assembly Code Does Not.&lt;/h1&gt;

&lt;p&gt;I have watched teams enable Anthropic's prompt caching, wait a billing cycle, and conclude that the advertised 90% discount on input tokens is marketing fiction. It is not. The discount is real — Anthropic charges $0.30 per million tokens for cache reads against $3.00 for fresh input, a genuine 10x difference. What is fiction is the assumption that flipping the flag is sufficient.&lt;/p&gt;

&lt;p&gt;The failure mode is architectural. The default way engineers build LLM applications — dynamically assembling prompts from system instructions, retrieved context, conversation history, and user input — produces prompts that defeat the cache on every single call, regardless of what the documentation says.&lt;/p&gt;

&lt;h2&gt;What prefix invariance actually means&lt;/h2&gt;

&lt;p&gt;Anthropic's cache operates on prefix invariance. It checks the prompt from the beginning outward. The cached prefix must be byte-for-byte identical to a prior request. The moment any content changes, the cache misses for that position and everything that follows it.&lt;/p&gt;

&lt;p&gt;This seems obvious until you look at how most production prompt assembly actually works. A typical chain: &lt;code&gt;[system prompt] + [RAG chunks from this query] + [conversation history] + [user message]&lt;/code&gt;. If the RAG chunks differ between requests — which they do, by definition — then the cache never gets a stable prefix long enough to activate, even though the system prompt is identical across every request. The dynamic content is injected upstream of the static content, and the cache sees a novel prompt every time.&lt;/p&gt;
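&lt;p&gt;A sketch of the difference, with hypothetical helper and variable names. The broken version places the breakpoint after per-query content, so the prefix is never byte-identical twice; the fixed version moves the chunks into the user turn:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Broken: dynamic RAG chunks sit upstream of the breakpoint, so the
# cacheable prefix changes on every query and never registers a hit.
def build_request_broken(system_prompt, rag_chunks, user_message):
    return {
        "system": [
            {"type": "text", "text": "\n\n".join(rag_chunks)},  # changes per query
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            },
        ],
        "messages": [{"role": "user", "content": user_message}],
    }

# Fixed: only static content sits inside the breakpoint; the chunks
# ride in the user turn, downstream of the cached prefix.
def build_request(system_prompt, rag_chunks, user_message):
    return {
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            {"role": "user", "content": "\n\n".join(rag_chunks) + "\n\n" + user_message}
        ],
    }
&lt;/code&gt;&lt;/pre&gt;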

&lt;p&gt;Anthropic requires a minimum of 1,024 tokens in the cached block (2,048 on Haiku-class models) and supports up to four explicit breakpoints per prompt. These parameters are not the bottleneck. The bottleneck is content ordering.&lt;/p&gt;

&lt;h2&gt;From 7% to 85% in one deployment&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://projectdiscovery.io/blog/how-we-cut-llm-cost-with-prompt-caching" rel="noopener noreferrer"&gt;ProjectDiscovery&lt;/a&gt; runs an AI security research platform built on agent swarms. Each task averages 26 steps and 40 tool calls, working from a system prompt that exceeds 2,500 lines of YAML — over 20,000 tokens per agent. The economics of caching a system prompt that size are not subtle: sent 100 times, it costs roughly $6.00 at fresh input pricing and $0.67 with caching. They had every incentive to get this right.&lt;/p&gt;

&lt;p&gt;Their initial cache hit rate was 7%.&lt;/p&gt;

&lt;p&gt;The diagnosis was prompt structure. Dynamic task context — the current scan target, task parameters, variable tool outputs — was being injected into the cacheable prefix before the static system prompt content. From the cache's perspective, every request opened with novel content. The 20,000-token system prompt that should have dominated the cached prefix was sitting downstream of tokens that changed on every call.&lt;/p&gt;

&lt;p&gt;The fix was structural rather than a matter of tuning cache parameters: relocate all dynamic content from the cacheable prefix to the tail of the prompt, after the cache breakpoints, delivered as part of the user message rather than embedded in the system prompt. They also set three explicit breakpoints: one for the static system prompt, one for the conversation sliding window, one for tool definitions. A single deployment on February 16 moved the hit rate from 7% to 73.7%. By March 23 it had reached 85%. The cost reduction was 59% overall and climbing toward 70% in the most recent measurement window.&lt;/p&gt;
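&lt;p&gt;Translated into the Messages API, that layout looks roughly like the following. The tool and prompt contents are stand-ins, but the three &lt;code&gt;cache_control&lt;/code&gt; markers mirror the breakpoints described above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;STATIC_SYSTEM_PROMPT = "..."  # stand-in for the 20,000-token instruction block

tools = [
    {"name": "scan_target", "description": "...", "input_schema": {"type": "object"}},
    {
        "name": "write_report",
        "description": "...",
        "input_schema": {"type": "object"},
        # Breakpoint 1: end of the static tool definitions.
        "cache_control": {"type": "ephemeral"},
    },
]

system = [
    {
        "type": "text",
        "text": STATIC_SYSTEM_PROMPT,
        # Breakpoint 2: end of the static system prompt.
        "cache_control": {"type": "ephemeral"},
    }
]

messages = [
    # ...older turns of the sliding window go here...
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "Step 25 output.",
                # Breakpoint 3: end of the stable conversation window.
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    # Dynamic tail: the new turn, never part of the cached prefix.
    {"role": "user", "content": "Step 26 input: scan the next host."},
]
&lt;/code&gt;&lt;/pre&gt;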

&lt;p&gt;For their longest agentic tasks — one ran to 1,663 steps and 57.5 million input tokens — cache rates hit 92.9%. At that scale, the difference between a 7% and 93% cache rate on a single task is not rounding error. It is the difference between running the task economically and not running it at all.&lt;/p&gt;

&lt;h2&gt;The parallel request trap&lt;/h2&gt;

&lt;p&gt;There is a second structural failure mode that hits applications using parallel LLM calls for throughput — batch document analysis, concurrent summarization, fan-out agent patterns.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/tr-labs-ml-engineering-blog/prompt-caching-the-secret-to-60-cost-reduction-in-llm-applications-6c792a0ac29b" rel="noopener noreferrer"&gt;Thomson Reuters Labs published&lt;/a&gt; a specific breakdown of this problem. Their pipeline ingested a 30,000-token document and ran multiple analytical queries against it in parallel to reduce latency. Cache hit rate without modification: 4.2%.&lt;/p&gt;

&lt;p&gt;The cause is a race condition in cache population. When two parallel requests arrive simultaneously against a prefix that has no existing cache entry, both trigger a cache write. The second write is redundant — you pay the $3.75/M write premium twice, and the second entry is wasted. Every subsequent request that arrives before any cache entry is established repeats this. In a burst of parallel calls, you can write the same prefix dozens of times and read it zero times in the same request window.&lt;/p&gt;

&lt;p&gt;The fix is cache warming: a single synchronous call to establish the cache entry before the parallel batch is dispatched. The warming call costs 3.98 seconds of overhead. Against a session with three parallel queries, that overhead is roughly 5% of total session time. Against a session with twenty queries, under 1%. The cost comparison on their 30,000-token document with three questions: $0.34 without warming, $0.14 with it — 60% cheaper, from a wrapper function that fires one request before releasing the batch.&lt;/p&gt;
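&lt;p&gt;A minimal version of that wrapper, assuming every query shares one cache-marked document prefix. The model id and document are placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from concurrent.futures import ThreadPoolExecutor

import anthropic

client = anthropic.Anthropic()
DOCUMENT = "..."  # the 30,000-token document, identical across queries

def ask(question):
    """One query against the shared, cache-marked document prefix."""
    return client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": DOCUMENT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": question}],
    )

def answer_all(questions):
    """Warm the cache once, synchronously, then fan out in parallel."""
    answers = [ask(questions[0])]  # pays the write premium exactly once
    with ThreadPoolExecutor(max_workers=8) as pool:
        answers.extend(pool.map(ask, questions[1:]))
    return answers
&lt;/code&gt;&lt;/pre&gt;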

&lt;p&gt;This failure produces correct results. Both code paths return valid completions. The only signal that something is wrong is a bill that is higher than it should be, which most teams attribute to volume rather than structure.&lt;/p&gt;

&lt;h2&gt;Where to look&lt;/h2&gt;

&lt;p&gt;Find every place in your codebase where a prompt is assembled. Identify which content is stable across requests and which is dynamic. If dynamic content appears before any cache breakpoint, you have a structural problem that no amount of breakpoint configuration will fix.&lt;/p&gt;

&lt;p&gt;The three offenders I see most often: RAG chunks injected into the system prompt block rather than the user message, user-specific metadata prepended as a system prefix, and timestamp or request-ID fields inadvertently baked into the cacheable portion for debugging purposes.&lt;/p&gt;

&lt;p&gt;Once you restructure for a stable prefix, add &lt;code&gt;cache_control: {"type": "ephemeral"}&lt;/code&gt; at the end of the static block and watch the &lt;code&gt;cache_read_input_tokens&lt;/code&gt; field in the response. If that field is zero on requests after the first, your prefix is still changing. The field will tell you immediately whether the fix held.&lt;/p&gt;
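&lt;p&gt;If you want that watched continuously rather than spot-checked, a few lines of logging around the client call will do. The helper below is a sketch, not a library API:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def check_cache_health(response, request_index):
    """Warn when a request after the first reads nothing from cache (sketch)."""
    reads = response.usage.cache_read_input_tokens
    writes = response.usage.cache_creation_input_tokens
    if request_index and reads == 0:
        print(
            f"warning: request {request_index} wrote {writes} tokens and read 0; "
            "the prefix is still changing between calls"
        )
&lt;/code&gt;&lt;/pre&gt;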

&lt;p&gt;The savings that caching advertises are real. They are just gated behind understanding that the cache cannot compensate for a prompt assembly pattern that was never designed with prefix stability in mind.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
