<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Leonhail Paypa</title>
    <description>The latest articles on DEV Community by Leonhail Paypa (@leonhail).</description>
    <link>https://dev.to/leonhail</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3941295%2Fac39c52f-ace7-48e6-9634-5297abac1750.png</url>
      <title>DEV Community: Leonhail Paypa</title>
      <link>https://dev.to/leonhail</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/leonhail"/>
    <language>en</language>
    <item>
      <title>Why your Anthropic prompt caching probably isn't working (and the npm package I built to fix it)</title>
      <dc:creator>Leonhail Paypa</dc:creator>
      <pubDate>Wed, 20 May 2026 04:48:59 +0000</pubDate>
      <link>https://dev.to/leonhail/why-your-anthropic-prompt-caching-probably-isnt-working-and-the-npm-package-i-built-to-fix-it-42c</link>
      <guid>https://dev.to/leonhail/why-your-anthropic-prompt-caching-probably-isnt-working-and-the-npm-package-i-built-to-fix-it-42c</guid>
      <description>&lt;p&gt;I'm a solo developer with about five years of experience, mostly outside AI. The last few months I've been getting serious about it — reading docs, building small things with Claude, learning how it differs from the web APIs I'm used to.&lt;/p&gt;

&lt;p&gt;While I was setting up Anthropic prompt caching for a project, I got stuck on a question I couldn't easily answer: how do I know it's actually working? The docs explained the &lt;code&gt;cache_control&lt;/code&gt; API and the 90% discount on cached tokens. But the only way to verify a call had hit the cache was to manually parse &lt;code&gt;cache_read_input_tokens&lt;/code&gt; from the response usage on every request. Nobody seems to do this.&lt;/p&gt;

&lt;p&gt;That gap turned into my first published npm package, &lt;code&gt;prompt-cache-optimizer&lt;/code&gt;. This post covers what I learned about the four ways prompt caching silently fails, and what the package does to catch them.&lt;/p&gt;

&lt;h2&gt;What prompt caching is supposed to do&lt;/h2&gt;

&lt;p&gt;When you call &lt;code&gt;messages.create&lt;/code&gt; with a long, stable prefix (system prompt, tool definitions, retrieved documents), Anthropic lets you mark a &lt;code&gt;cache_control&lt;/code&gt; breakpoint. On the first call, that prefix gets written to the cache at ~1.25x the normal input rate. On any subsequent call within the cache TTL, the cached tokens are read back at &lt;strong&gt;10% of the input rate&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That's a 90% discount on whatever portion of your prompt is stable. For a chatbot that re-sends a 10K-token system prompt every turn, and whose bill is dominated by that prompt, this approaches the difference between a $5K monthly bill and a $500 one.&lt;br&gt;
The math is incredible. The execution is finicky.&lt;/p&gt;
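
&lt;p&gt;To make the arithmetic concrete, here is a minimal sketch of what the cached prefix costs relative to sending it uncached every time. It assumes the multipliers quoted above (1.25x to write, 0.1x to read) and that every call after the first lands inside the TTL, and it ignores uncached input and output tokens. The function name &lt;code&gt;prefixCostRatio&lt;/code&gt; is made up for this illustration.&lt;/p&gt;

```typescript
// Cost of the cached prefix across `calls` requests, as a fraction of the
// uncached cost. Assumes one cache write followed by (calls - 1) cache reads.
function prefixCostRatio(
  calls: number,
  writeMultiplier: number = 1.25, // assumed cache-write price vs. base input
  readMultiplier: number = 0.1, // assumed cache-read price vs. base input
): number {
  if (Math.floor(calls) !== calls || !(calls > 0)) {
    throw new RangeError("calls must be a positive integer");
  }
  const cached = writeMultiplier + (calls - 1) * readMultiplier;
  const uncached = calls; // every call pays the full input rate
  return cached / uncached;
}

for (const calls of [1, 2, 5, 1000]) {
  const ratio = prefixCostRatio(calls);
  console.log(`${calls} calls: ${(ratio * 100).toFixed(1)}% of uncached prefix cost`);
}
```

&lt;p&gt;Under these assumptions a single call costs 25% more than not caching, the cache already pays for itself on the second call, and five calls cost about a third of the uncached price for that prefix.&lt;/p&gt;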
&lt;h2&gt;The four ways prompt caching silently fails&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Misplaced breakpoints&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;cache_control&lt;/code&gt; markers cache everything up to and including the block they are attached to, in request order: tools, then system, then messages. Put the breakpoint in the wrong place and you cache the wrong things. Worse, the call still succeeds: Anthropic happily processes it, you get a normal response, and you just paid full price.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prefix drift across calls&lt;/strong&gt;&lt;br&gt;
The cache only hits if the cacheable prefix is byte-identical to what was cached. If you reorder your tools array between calls, shuffle retrieved documents, or insert a timestamp anywhere in your system prompt, the prefix changes, the cache misses, and you pay full price.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Worse, you also pay the 1.25x write cost to cache the new (now-different) prefix, which expires in 5 minutes if nothing else hits it. So you're paying more than you would without caching at all.&lt;/p&gt;
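
&lt;p&gt;One way to guard against drift is to build the prefix deterministically and compare it call-over-call. The sketch below is an illustration only, not how &lt;code&gt;prompt-cache-optimizer&lt;/code&gt; works internally: &lt;code&gt;ToolDef&lt;/code&gt;, &lt;code&gt;stableTools&lt;/code&gt;, &lt;code&gt;prefixKey&lt;/code&gt; and &lt;code&gt;makeDriftDetector&lt;/code&gt; are hypothetical names, and the serialized string is only a local proxy for whatever the API actually compares.&lt;/p&gt;

```typescript
// Minimal stand-ins for the parts of a request that form the cacheable prefix.
interface ToolDef {
  name: string;
  description: string;
}

interface PrefixParts {
  tools: ToolDef[];
  system: string;
}

// Return the tools in a deterministic order so that call sites which build
// the array in different orders still send the same prefix.
function stableTools(tools: ToolDef[]): ToolDef[] {
  return tools.slice().sort((a, b) => a.name.localeCompare(b.name));
}

// Serialize the prefix exactly as given (order-sensitive on purpose).
function prefixKey(parts: PrefixParts): string {
  const tools = parts.tools.map((t) => [t.name, t.description]);
  return JSON.stringify([tools, parts.system]);
}

// Remember the previous key and report whether the prefix changed.
function makeDriftDetector(): (parts: PrefixParts) => boolean {
  let previous: string | null = null;
  return (parts) => {
    const key = prefixKey(parts);
    const drifted = previous !== null ? previous !== key : false;
    previous = key;
    return drifted;
  };
}

const search: ToolDef = { name: "search", description: "Search the docs" };
const lookup: ToolDef = { name: "lookup", description: "Look up an order" };
const hasDrifted = makeDriftDetector();

// Same tools in a different source order, normalized before sending: no drift.
console.log(hasDrifted({ tools: stableTools([search, lookup]), system: "You are a support bot." })); // false
console.log(hasDrifted({ tools: stableTools([lookup, search]), system: "You are a support bot." })); // false
// A timestamp in the system prompt changes the prefix on every call.
console.log(hasDrifted({ tools: stableTools([lookup, search]), system: `Now: ${new Date().toISOString()}` })); // true
```

&lt;p&gt;The detector only tells you that something changed. Pair it with a log of the old and new keys if you need to find out which field moved.&lt;/p&gt;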

&lt;ol start="3"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TTL expiration&lt;/strong&gt;&lt;br&gt;
Anthropic recently dropped the default cache TTL from 1 hour to 5 minutes. A lot of setups that "had caching working" started silently regressing — calls that came in 6 minutes apart instead of 4 minutes started missing the cache. Nobody got an error. The bill just went up.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No measurement&lt;/strong&gt;&lt;br&gt;
The only way to verify any of the above is to parse &lt;code&gt;cache_read_input_tokens&lt;/code&gt; and &lt;code&gt;cache_creation_input_tokens&lt;/code&gt; from every single response, compute a hit rate, and compare against an expected baseline. Nobody does this. Most teams "set up caching" once, watch the first response come back with high cached tokens, and assume it works forever.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
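
&lt;p&gt;Doing that measurement by hand is not much code. Here is a minimal sketch that takes the &lt;code&gt;usage&lt;/code&gt; objects from a batch of responses and computes a hit rate. The two cache field names are the ones mentioned above. The &lt;code&gt;Usage&lt;/code&gt; interface is a local stand-in rather than the SDK's own type, and counting a call as a hit whenever it read any tokens from the cache is my assumption for this example.&lt;/p&gt;

```typescript
// Local stand-in for the usage object on a Messages API response.
interface Usage {
  input_tokens: number;
  cache_read_input_tokens?: number | null;
  cache_creation_input_tokens?: number | null;
}

interface CacheSummary {
  calls: number;
  hits: number;
  hitRate: number;
  readTokens: number;
  writeTokens: number;
}

// Aggregate cache reads and writes across responses.
function summarizeCacheUsage(usages: Usage[]): CacheSummary {
  let hits = 0;
  let readTokens = 0;
  let writeTokens = 0;
  for (const usage of usages) {
    const read = usage.cache_read_input_tokens ?? 0;
    const written = usage.cache_creation_input_tokens ?? 0;
    if (read > 0) hits += 1; // assumption: any cache read counts as a hit
    readTokens += read;
    writeTokens += written;
  }
  const calls = usages.length;
  const hitRate = calls === 0 ? 0 : hits / calls;
  return { calls, hits, hitRate, readTokens, writeTokens };
}

// Illustrative numbers: one cache write followed by four cache reads.
const summary = summarizeCacheUsage([
  { input_tokens: 312, cache_read_input_tokens: 0, cache_creation_input_tokens: 8420 },
  { input_tokens: 298, cache_read_input_tokens: 8420, cache_creation_input_tokens: 0 },
  { input_tokens: 305, cache_read_input_tokens: 8420, cache_creation_input_tokens: 0 },
  { input_tokens: 290, cache_read_input_tokens: 8420, cache_creation_input_tokens: 0 },
  { input_tokens: 301, cache_read_input_tokens: 8420, cache_creation_input_tokens: 0 },
]);
console.log(summary.hits, summary.hitRate); // 4 0.8
```

&lt;p&gt;A write with no read on anything but the first call is the signal to look for: it means the prefix changed or the previous entry expired.&lt;/p&gt;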

&lt;h2&gt;The wrapper I built&lt;/h2&gt;

&lt;p&gt;I shipped a small TypeScript package called &lt;code&gt;prompt-cache-optimizer&lt;/code&gt; that fixes the measurement problem and warns about the other three.&lt;/p&gt;

&lt;p&gt;It's a drop-in wrapper for &lt;code&gt;@anthropic-ai/sdk&lt;/code&gt;. Use it exactly like the SDK:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;CachedAnthropic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;placeBreakpoints&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;prompt-cache-optimizer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CachedAnthropic&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;warnIfHitRateBelow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;system&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;placeBreakpoints&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;longSystemPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;conversation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;after-system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;system&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cacheInfo&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// {&lt;/span&gt;
&lt;span class="c1"&gt;//   hit: true,&lt;/span&gt;
&lt;span class="c1"&gt;//   cachedTokens: 8420,&lt;/span&gt;
&lt;span class="c1"&gt;//   uncachedTokens: 312,&lt;/span&gt;
&lt;span class="c1"&gt;//   cacheWriteTokens: 0,&lt;/span&gt;
&lt;span class="c1"&gt;//   dollarsSaved: 0.024,&lt;/span&gt;
&lt;span class="c1"&gt;//   dollarsSpent: 0.001&lt;/span&gt;
&lt;span class="c1"&gt;// }&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every response gets a &lt;code&gt;cacheInfo&lt;/code&gt; field with the parsed numbers. The client also tracks aggregate stats:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="c1"&gt;// {&lt;/span&gt;
&lt;span class="c1"&gt;//   totalCalls: 142,&lt;/span&gt;
&lt;span class="c1"&gt;//   cacheHits: 124,&lt;/span&gt;
&lt;span class="c1"&gt;//   hitRate: 0.873,&lt;/span&gt;
&lt;span class="c1"&gt;//   totalCachedTokens: 1_240_000,&lt;/span&gt;
&lt;span class="c1"&gt;//   dollarsSaved: 3.72,&lt;/span&gt;
&lt;span class="c1"&gt;//   dollarsSpent: 1.41,&lt;/span&gt;
&lt;span class="c1"&gt;// }&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And when something looks wrong, it emits passive warnings instead of throwing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;cache-write-without-read&lt;/code&gt; → your cacheable prefix changed call-over-call (the silent failure mode)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;low-hit-rate&lt;/code&gt; → rolling cache hit rate dropped below your threshold&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;no-cache-control-found&lt;/code&gt; → you forgot to mark anything cacheable&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;unknown-model&lt;/code&gt; → pricing unknown, dollar accounting skipped&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Route them anywhere you like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CachedAnthropic&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="nx"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;onWarning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Real numbers&lt;/h2&gt;

&lt;p&gt;The included example processes 5 questions reusing a large system prompt. Here's the actual output:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tl00lam21bewpp9kvp0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tl00lam21bewpp9kvp0.png" alt=" " width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Five calls. The first writes to the cache, which costs about 25% more than an uncached call for the cached prefix. Calls 2-5 each hit the cache.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;80% hit rate&lt;/strong&gt; (4 hits, 1 miss — the first call always misses since that's when the cache gets written)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$0.017&lt;/strong&gt; saved on &lt;strong&gt;$0.020&lt;/strong&gt; spent&lt;/li&gt;
&lt;li&gt;Same workload without caching would have cost &lt;strong&gt;$0.037&lt;/strong&gt; — a &lt;strong&gt;46% reduction&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

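&lt;p&gt;At higher call volumes the proportions get even better, because the one-time write cost is amortized over more reads. A chatbot answering 1000 questions/day with a 10K-token system prompt can reach 70%+ cost reductions, provided consecutive calls land within the cache TTL.&lt;/p&gt;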

&lt;h2&gt;How big the install is&lt;/h2&gt;

&lt;p&gt;The package is ~50KB unpacked, has &lt;strong&gt;zero runtime dependencies&lt;/strong&gt;, and treats &lt;code&gt;@anthropic-ai/sdk&lt;/code&gt; as a peer dependency. It does not phone home, store payloads, or require an account.&lt;/p&gt;

&lt;h2&gt;Roadmap&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;v0.1&lt;/strong&gt; is intentionally focused on measurement and explicit helpers. Coming up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;v0.2&lt;/strong&gt; — auto-placement of &lt;code&gt;cache_control&lt;/code&gt; breakpoints based on observed prompt stability (no more manual &lt;code&gt;placeBreakpoints()&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.3&lt;/strong&gt; — safe message/tool reordering to maximize the stable prefix&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.4&lt;/strong&gt; — OpenAI and Gemini prompt caching support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v1.0&lt;/strong&gt; — persistent stats adapter, middleware mode&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Try it&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;prompt-cache-optimizer @anthropic-ai/sdk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;npm:&lt;/strong&gt; &lt;a href="https://www.npmjs.com/package/prompt-cache-optimizer" rel="noopener noreferrer"&gt;https://www.npmjs.com/package/prompt-cache-optimizer&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/leonhail-nell/prompt-cache-optimizer" rel="noopener noreferrer"&gt;https://github.com/leonhail-nell/prompt-cache-optimizer&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you find it useful, a GitHub star is the single biggest signal that helps other developers find it. If it saves you real money on your Anthropic bill, I'd love to hear about it — file an issue or DM me.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>typescript</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
