I'm a solo developer with about five years of experience, mostly outside AI. The last few months I've been getting serious about it — reading docs, building small things with Claude, learning how it differs from the web APIs I'm used to.
While I was setting up Anthropic prompt caching for a project, I got stuck on a question I couldn't easily answer: how do I know it's actually working? The docs explained the cache_control API and the 90% discount on cached tokens. But the only way to verify a call had hit the cache was to manually parse cache_read_input_tokens from the response usage on every request. Nobody seems to do this.
That gap turned into my first published npm package, prompt-cache-optimizer. This post is what I learned about the four ways prompt caching silently fails, and what the package does to catch them.
What prompt caching is supposed to do
When you call messages.create with a long, stable prefix (system prompt, tool definitions, retrieved documents), Anthropic lets you mark a cache_control breakpoint. On the first call, that prefix gets written to the cache at ~1.25x the normal input rate. On any subsequent call within the cache TTL, the cached tokens are read back at 10% of the input rate.
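Here is a minimal sketch of what marking a breakpoint can look like, written as a plain object in the shape of a Messages API request body so it runs without the SDK. `SYSTEM_PROMPT`, `buildRequest`, and the sample questions are made up for illustration. Note that a prompt this short would not actually be cached, since as far as I understand the docs, prefixes below a model-specific minimum length are skipped.

```typescript
// Hypothetical stand-in for a long, stable system prompt.
const SYSTEM_PROMPT = "You are a support assistant for Acme. Follow the policies below.";

// Builds a plain object shaped like a Messages API request body.
function buildRequest(question: string) {
  return {
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: [
      {
        type: "text" as const,
        text: SYSTEM_PROMPT,
        // Breakpoint: the prefix up to and including this block is cacheable.
        cache_control: { type: "ephemeral" as const },
      },
    ],
    // Anything after the breakpoint (the user's question) is billed normally.
    messages: [{ role: "user" as const, content: question }],
  };
}

const first = buildRequest("How do I reset my password?");
const second = buildRequest("Where can I download my invoice?");

// The cacheable part is identical across calls; only the messages differ.
console.log(JSON.stringify(first.system) === JSON.stringify(second.system)); // true
```

Keeping the stable content in one place and the per-call content strictly after the breakpoint is the whole trick; the rest of this post is about the ways that invariant quietly breaks.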
That's a 90% discount on whatever portion of your prompt is stable. For a chatbot that re-sends a 10K-token system prompt every turn, and where that prompt dominates the input, this can be close to the difference between a $5K monthly bill and a $500 one.
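To make the arithmetic concrete, here is a rough sketch of the input-cost comparison. The 1.25x and 10% multipliers come from the description above; the workload numbers (300 variable tokens, 100 calls) and all function names are assumptions chosen for illustration. Costs are expressed in base-rate token units, so multiply by your model's per-token input price to get dollars. Output tokens are ignored.

```typescript
// Multipliers taken from the text above: cache writes cost ~1.25x the base
// input rate, cache reads cost 10% of it.
const CACHE_WRITE_MULTIPLIER = 1.25;
const CACHE_READ_MULTIPLIER = 0.1;

interface Workload {
  prefixTokens: number;   // stable, cacheable part of the prompt
  variableTokens: number; // part that changes on every call
  calls: number;          // calls made while the cache stays warm
}

// Every call pays the full input rate for the whole prompt.
function inputCostWithoutCache(w: Workload): number {
  return w.calls * (w.prefixTokens + w.variableTokens);
}

// First call writes the prefix to the cache; later calls read it back.
function inputCostWithCache(w: Workload): number {
  if (w.calls === 0) return 0;
  const firstCall = w.prefixTokens * CACHE_WRITE_MULTIPLIER + w.variableTokens;
  const laterCalls =
    (w.calls - 1) * (w.prefixTokens * CACHE_READ_MULTIPLIER + w.variableTokens);
  return firstCall + laterCalls;
}

const workload: Workload = { prefixTokens: 10_000, variableTokens: 300, calls: 100 };
const without = inputCostWithoutCache(workload);
const withCache = inputCostWithCache(workload);
console.log(`input-cost reduction: ${((1 - withCache / without) * 100).toFixed(1)}%`);
```

Two things fall out of this model: the reduction only approaches 90% when the stable prefix dwarfs everything else, and a single isolated call costs more with caching than without, because of the write premium.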
The math is incredible. The execution is finicky.
The four ways prompt caching silently fails
Misplaced breakpoints
cache_control markers cache everything before them in the request. Put the breakpoint in the wrong place and you cache the wrong things. Worse, the call still succeeds: Anthropic happily processes it, you get a normal response, you just paid full price.
Prefix drift across calls
The cache only hits if the cacheable prefix is byte-identical to what was cached. If you reorder your tools array between calls, or shuffle retrieved documents, or insert a timestamp anywhere in your system prompt — the prefix is different, cache misses, you pay full price.
Worse, you also pay the 1.25x write cost to cache the new (now-different) prefix, which expires in 5 minutes if nothing else hits it. So you're paying more than you would without caching at all.
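A small sketch of how drift sneaks in, and one way to guard against it. This compares my own JSON serialization of the prefix, which is only a proxy for whatever the API compares internally, and `ToolDef`, `serializePrefix`, and `sortTools` are hypothetical helpers, not part of any SDK. In practice you might hash the serialized string and log the hash per call.

```typescript
interface ToolDef {
  name: string;
  description: string;
}

// Serializes the part of the request that sits before the breakpoint.
// If this string differs between two calls, treat the prefix as changed.
function serializePrefix(system: string, tools: ToolDef[]): string {
  return JSON.stringify({ tools, system });
}

// Sorts tools by name (plain code-unit comparison) so array order is deterministic.
function sortTools(tools: ToolDef[]): ToolDef[] {
  return [...tools].sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
}

const searchTool: ToolDef = { name: "search", description: "Search the knowledge base" };
const calculatorTool: ToolDef = { name: "calculator", description: "Evaluate arithmetic" };
const system = "You are a support assistant.";

// Same tools, different order: the serialized prefix drifts.
const driftFromOrder =
  serializePrefix(system, [searchTool, calculatorTool]) !==
  serializePrefix(system, [calculatorTool, searchTool]);

// Normalizing the order first removes that source of drift.
const stableAfterSort =
  serializePrefix(system, sortTools([searchTool, calculatorTool])) ===
  serializePrefix(system, sortTools([calculatorTool, searchTool]));

// A timestamp inside the system prompt changes the prefix on every call.
const driftFromTimestamp =
  serializePrefix(`${system} Now: 2025-01-01T00:00:00Z`, [searchTool]) !==
  serializePrefix(`${system} Now: 2025-01-01T00:01:00Z`, [searchTool]);

console.log({ driftFromOrder, stableAfterSort, driftFromTimestamp });
```

All three flags come out true. The practical takeaway is to normalize anything order-sensitive before sending it, and to move per-call values like timestamps to a position after the breakpoint.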
TTL expiration
Anthropic recently dropped the default cache TTL from 1 hour to 5 minutes. A lot of setups that "had caching working" started silently regressing: calls that came in 6 minutes apart instead of 4 minutes started missing the cache. Nobody got an error. The bill just went up.
No measurement
The only way to verify any of the above is to parse cache_read_input_tokens and cache_creation_input_tokens from every single response, compute a hit rate, and compare against an expected baseline. Nobody does this. Most teams "set up caching" once, watch the first response come back with high cached tokens, and assume it works forever.
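Doing this by hand is not much code, which makes it more surprising how rarely it happens. Below is a minimal sketch that works on the usage object alone, so it runs without a network call. The `Usage` interface only lists the fields discussed here, the cache fields are treated as optional to be defensive, and `summarizeUsage` and `hitRate` are illustrative names. The sample token counts are invented.

```typescript
// Subset of the response usage fields referred to in this post.
interface Usage {
  input_tokens: number; // tokens not read from or written to the cache
  cache_read_input_tokens?: number | null;
  cache_creation_input_tokens?: number | null;
}

interface CacheSummary {
  hit: boolean;
  cachedTokens: number;
  cacheWriteTokens: number;
  uncachedTokens: number;
}

// Turns one usage object into a hit/miss summary.
function summarizeUsage(usage: Usage): CacheSummary {
  const cachedTokens = usage.cache_read_input_tokens ?? 0;
  const cacheWriteTokens = usage.cache_creation_input_tokens ?? 0;
  return {
    hit: cachedTokens > 0,
    cachedTokens,
    cacheWriteTokens,
    uncachedTokens: usage.input_tokens,
  };
}

// Fraction of calls that read anything from the cache.
function hitRate(usages: Usage[]): number {
  if (usages.length === 0) return 0;
  const hits = usages.filter((u) => summarizeUsage(u).hit).length;
  return hits / usages.length;
}

// Invented sample: one cache write followed by four cache hits.
const usages: Usage[] = [
  { input_tokens: 312, cache_read_input_tokens: 0, cache_creation_input_tokens: 8420 },
  { input_tokens: 298, cache_read_input_tokens: 8420, cache_creation_input_tokens: 0 },
  { input_tokens: 305, cache_read_input_tokens: 8420, cache_creation_input_tokens: 0 },
  { input_tokens: 321, cache_read_input_tokens: 8420, cache_creation_input_tokens: 0 },
  { input_tokens: 290, cache_read_input_tokens: 8420, cache_creation_input_tokens: 0 },
];

console.log(hitRate(usages)); // 0.8
```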
The wrapper I built
I shipped a small TypeScript package called prompt-cache-optimizer that fixes the measurement problem and warns about the other three.
It's a drop-in wrapper for @anthropic-ai/sdk. Use it exactly like the SDK:
import { CachedAnthropic, placeBreakpoints } from "prompt-cache-optimizer";

const client = new CachedAnthropic({
  apiKey: process.env.ANTHROPIC_API_KEY!,
  warnIfHitRateBelow: 0.6,
});

const { system, messages } = placeBreakpoints({
  system: longSystemPrompt,
  messages: conversation,
  strategy: "after-system",
});

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system,
  messages,
});

console.log(response.cacheInfo);
// {
//   hit: true,
//   cachedTokens: 8420,
//   uncachedTokens: 312,
//   cacheWriteTokens: 0,
//   dollarsSaved: 0.024,
//   dollarsSpent: 0.001
// }
Every response gets a cacheInfo field with the parsed numbers. The client also tracks aggregate stats:
console.log(client.stats());
// {
//   totalCalls: 142,
//   cacheHits: 124,
//   hitRate: 0.873,
//   totalCachedTokens: 1_240_000,
//   dollarsSaved: 3.72,
//   dollarsSpent: 1.41,
// }
And when something looks wrong, it emits passive warnings instead of throwing:
- cache-write-without-read → your cacheable prefix changed call-over-call (the silent failure mode)
- low-hit-rate → rolling cache hit rate dropped below your threshold
- no-cache-control-found → you forgot to mark anything cacheable
- unknown-model → pricing unknown, dollar accounting skipped
Route them anywhere you like:
new CachedAnthropic({
  apiKey,
  onWarning: (event) => logger.warn(event),
});
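For intuition, here is roughly how two of those warnings could be derived from per-call token counts. This is a simplified sketch written for this post, not the package's actual rules: `CallUsage`, `detectWarnings`, the `windowSize` and `minSamples` parameters, and their defaults are all assumptions.

```typescript
type WarningKind = "cache-write-without-read" | "low-hit-rate";

interface CallUsage {
  cacheReadTokens: number;
  cacheWriteTokens: number;
}

// Inspects the call history (oldest first) and returns warnings for the latest call.
function detectWarnings(
  history: CallUsage[],
  warnIfHitRateBelow: number,
  windowSize = 20,
  minSamples = 5,
): WarningKind[] {
  const warnings: WarningKind[] = [];
  const latest = history[history.length - 1];
  if (!latest) return warnings;

  // A write with nothing read, on any call after the first, suggests the
  // prefix changed (or the previous cache entry expired).
  if (history.length > 1 && latest.cacheWriteTokens > 0 && latest.cacheReadTokens === 0) {
    warnings.push("cache-write-without-read");
  }

  // Rolling hit rate over the most recent calls, skipped until there is
  // enough data for the number to mean something.
  const recent = history.slice(-windowSize);
  if (recent.length >= minSamples) {
    const hits = recent.filter((call) => call.cacheReadTokens > 0).length;
    if (hits / recent.length < warnIfHitRateBelow) warnings.push("low-hit-rate");
  }
  return warnings;
}

const firstCall: CallUsage = { cacheReadTokens: 0, cacheWriteTokens: 8420 };
const cacheHit: CallUsage = { cacheReadTokens: 8420, cacheWriteTokens: 0 };
const rewrite: CallUsage = { cacheReadTokens: 0, cacheWriteTokens: 8420 };

// Logs an array containing only "cache-write-without-read".
console.log(detectWarnings([firstCall, cacheHit, cacheHit, rewrite], 0.6));
```

One caveat with this heuristic: token counts alone cannot tell prefix drift apart from TTL expiry, since both show up as a write with no read.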
Real numbers
The included example processes 5 questions reusing a large system prompt. Here's what the run showed:
Five calls. The first writes to cache (cost: a tiny bit more than uncached). Calls 2-5 each hit the cache.
- 80% hit rate (4 hits, 1 miss — the first call always misses since that's when the cache gets written)
- $0.017 saved on $0.020 spent
- Same workload without caching would have cost $0.037 — a 46% reduction
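As a quick sanity check on how those figures relate, the reduction is just the amount saved divided by what the uncached workload would have cost. The snippet below only re-derives the percentages from the numbers listed above; the variable names are mine.

```typescript
// Figures reported for the example run above.
const dollarsSaved = 0.017;
const dollarsSpent = 0.02;
const cacheHits = 4;
const totalCalls = 5;

// Cost of the same workload with no caching at all.
const costWithoutCaching = dollarsSaved + dollarsSpent;

const reduction = dollarsSaved / costWithoutCaching;
const observedHitRate = cacheHits / totalCalls;

console.log(`${(reduction * 100).toFixed(0)}% reduction`);      // 46% reduction
console.log(`${(observedHitRate * 100).toFixed(0)}% hit rate`); // 80% hit rate
```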
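At higher call volumes the proportions get even better, because the one-time cache write is spread over more hits. A chatbot answering 1000 questions/day with a 10K-token system prompt can plausibly reach 70%+ cost reductions, as long as calls keep arriving within the cache TTL.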
How big the install is
The package is ~50KB unpacked, has zero runtime dependencies, and treats @anthropic-ai/sdk as a peer dependency. It does not phone home, store payloads, or require an account.
Roadmap
v0.1 is intentionally focused on measurement and explicit helpers. Coming up:
- v0.2: auto-placement of cache_control breakpoints based on observed prompt stability (no more manual placeBreakpoints())
- v0.3: safe message/tool reordering to maximize the stable prefix
- v0.4: OpenAI and Gemini prompt caching support
- v1.0: persistent stats adapter, middleware mode
Try it
npm install prompt-cache-optimizer @anthropic-ai/sdk
- npm: https://www.npmjs.com/package/prompt-cache-optimizer
- GitHub: https://github.com/leonhail-nell/prompt-cache-optimizer
If you find it useful, a GitHub star is the single biggest signal that helps other developers find it. If it saves you real money on your Anthropic bill, I'd love to hear about it — file an issue or DM me.