Yegor Shustyk

Posted on May 23

Prompt Caching Cut My Claude Bill by 70% — Here's the Exact Setup

#claude #webdev #telegram

I run a Claude-powered Telegram bot in production. Last 14 days: 905 API calls, $7.62 total. That's $0.0084 per call against a system prompt that's about 6,000 tokens. Without prompt caching, the same workload would have cost me roughly $25.

The Anthropic docs cover prompt caching at the spec level, but the practical "how do I wire this into a real Node app that makes hundreds of calls per day" is scattered. Here's the exact setup that's actually running in production, plus the five gotchas that cost me a day each to figure out.

The problem

A typical Claude call from my bot looks like this:

System prompt: ~6,000 tokens. Big block of instructions: tone, response shape, framework lenses, formatting rules, language guidance, anti-pattern checklist.
Per-call dynamic context: ~500-2,000 tokens. User's memory card, recent entries, current message.
Reply: 200-800 tokens out.

Without caching, every call pays full price on those 6,000 system tokens. With ~900 calls in two weeks, that's 5.4M tokens just in system prompt repetition. At Claude Sonnet 4.5 input pricing ($3/MTok), that's $16+ on text the model has already seen.

The fix takes about 10 lines.

The setup

Anthropic's API accepts the system field as either a string (simple case) or an array of typed blocks (cache-aware case). To enable caching, you split the system field into static and dynamic pieces, and mark the static block with cache_control: { type: "ephemeral" }.

Here's the helper I use across every call:

// claude.js
export function withPromptCache(staticPrompt, dynamicSuffix = '') {
  const blocks = [
    { type: 'text', text: staticPrompt, cache_control: { type: 'ephemeral' } },
  ]
  if (dynamicSuffix) blocks.push({ type: 'text', text: dynamicSuffix })
  return blocks
}

And the call site:

const reply = await askClaude(
  withPromptCache(FREE_MESSAGE_PROMPT, userContextSuffix),
  userMessage,
  recentExchanges,
  { userId: user.id, callType: 'free', maxTokens: 2048 }
)

FREE_MESSAGE_PROMPT is the big static block. userContextSuffix is the small per-user dynamic part (memory card, recent entries). The dynamic part stays uncached — that's intentional and the right tradeoff.

Inside askClaude, the body sent to Anthropic is:

const body = JSON.stringify({
  model: 'claude-sonnet-4-5',
  max_tokens: 2048,
  system: systemPrompt,  // ← the array from withPromptCache()
  messages: [...contextMessages, { role: 'user', content: userMessage }],
})
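For completeness, here is a minimal sketch of how that body could be wrapped into a full request with Node's built-in fetch. The helper names buildMessagesRequest and sendMessagesRequest are assumptions for illustration and are not the bot's actual code; the endpoint and the x-api-key / anthropic-version headers are the standard ones for the Messages API.

```javascript
// Hypothetical helper: builds the fetch() arguments for a Messages API call.
// Name and shape are illustrative, not taken from the bot's codebase.
function buildMessagesRequest({ apiKey, systemBlocks, messages, maxTokens = 2048 }) {
  return {
    url: 'https://api.anthropic.com/v1/messages',
    init: {
      method: 'POST',
      headers: {
        'content-type': 'application/json',
        'x-api-key': apiKey,
        'anthropic-version': '2023-06-01',
      },
      body: JSON.stringify({
        model: 'claude-sonnet-4-5',
        max_tokens: maxTokens,
        system: systemBlocks, // array of text blocks; the static one carries cache_control
        messages,
      }),
    },
  }
}

// Sends the request and returns the parsed JSON, including data.usage.
async function sendMessagesRequest(request) {
  const res = await fetch(request.url, request.init)
  if (!res.ok) throw new Error(`Anthropic API error: ${res.status}`)
  return res.json()
}
```

Keeping the request builder separate from the network call makes it easy to unit-test that the cache_control marker actually survives into the serialized body.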

That's it for setup. Now the interesting part: tracking whether it actually works.

Reading the three token counters

When caching is active, Anthropic returns three input counters instead of one. You have to track all three or you'll never know if caching is doing anything.

const usage         = data.usage
const input         = usage.input_tokens                ?? 0
const cacheCreated  = usage.cache_creation_input_tokens ?? 0
const cacheRead     = usage.cache_read_input_tokens     ?? 0
const output        = usage.output_tokens               ?? 0

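// Prices in USD per token. Sonnet 4.5 list prices: $3/MTok input, $15/MTok output.
const INPUT_PRICE   = 3  / 1_000_000
const OUTPUT_PRICE  = 15 / 1_000_000

// Cost math: cache writes are billed at 1.25× the normal input price,
// cache reads at 0.10× (90% off).
const cost = input        * INPUT_PRICE
           + cacheCreated * INPUT_PRICE * 1.25
           + cacheRead    * INPUT_PRICE * 0.10
           + output       * OUTPUT_PRICE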

What you want to see in your logs: cacheRead should dwarf cacheCreated. The first call in a cache window writes (1.25×), every subsequent call within ~5 minutes reads (0.10×). If cacheCreated equals your static prompt size on every call and cacheRead stays at 0, the cache is never hitting.

I write all three counters to a token_usage table per call, so /admin can show effective spend and hit-rate over time.
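A minimal sketch of the hit-rate side of that, assuming each logged row carries the three input counters under the field names input, cacheCreated, and cacheRead. The row shape and the summarizeUsage name are hypothetical; the real token_usage schema may differ.

```javascript
// Hypothetical row shape: { input, cacheCreated, cacheRead }, one row per API call.
function summarizeUsage(rows) {
  let input = 0
  let cacheCreated = 0
  let cacheRead = 0
  let callsWithHit = 0

  for (const row of rows) {
    input += row.input
    cacheCreated += row.cacheCreated
    cacheRead += row.cacheRead
    if (row.cacheRead > 0) callsWithHit += 1
  }

  const totalInput = input + cacheCreated + cacheRead
  return {
    calls: rows.length,
    // Share of calls that read anything from the cache.
    callHitRate: rows.length ? callsWithHit / rows.length : 0,
    // Share of all input tokens that were billed at the cache-read rate.
    tokenHitRate: totalInput ? cacheRead / totalInput : 0,
  }
}
```

The call-level rate tells you how often the cache is warm; the token-level rate is closer to what actually moves the bill.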

The 5 gotchas

1. Minimum token threshold (silent failure mode)

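Anthropic requires the cached content to reach a minimum length. At the time of writing that was 1024 tokens for Sonnet/Opus and 2048 for Haiku; the minimum varies by model, so check the docs for the one you call. Below that, the cache_control field is silently ignored. No error, no warning. You'll just see cacheRead: 0 forever and wonder why.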

If you're caching a small system prompt, you have two options: pad it with relevant context until it crosses the threshold, or accept that caching doesn't apply at your scale.
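Because the failure is silent, the usage object is the only signal you get. A small sketch of a check you could run on the first couple of responses; diagnoseCaching is a hypothetical helper name, and it assumes the request really did include a cache_control block.

```javascript
// Classifies a response's usage object. Assumes cache_control was sent.
function diagnoseCaching(usage) {
  const created = usage.cache_creation_input_tokens ?? 0
  const read = usage.cache_read_input_tokens ?? 0

  if (read > 0) return 'hit'        // served from an existing cache entry
  if (created > 0) return 'write'   // cache entry was just created
  return 'not-cached'               // marker ignored, e.g. block below the minimum
}
```

Two identical calls a few seconds apart should produce 'write' then 'hit'. Getting 'not-cached' twice in a row is the symptom described above.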

2. The 5-minute TTL

The cache is ephemeral with a ~5-minute TTL, and every cache hit resets that clock, so steady traffic keeps an entry alive indefinitely. This matters more than people realize when planning where to apply caching:

Active chat sessions (user-bot back-and-forth) — every turn within the session hits the cache. Huge win.
Cron loops (e.g. nightly job that hits Claude per user) — if your loop processes one user every 10 seconds, the cache stays warm across the whole loop. Also a win.
Sparse one-off calls (one insight request per day per user) — these always miss. You'll pay the 1.25× cache-creation penalty for nothing. Skip caching here.
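To see how a given traffic pattern would fare, you can replay its call timestamps against the TTL. This is a rough sketch that assumes a single shared static block, a 5-minute TTL, and a refresh on every hit; simulateCache is a hypothetical name, and real behavior may differ at the edges.

```javascript
const TTL_MS = 5 * 60 * 1000

// Counts cache writes and reads for a sorted list of call timestamps (in ms),
// assuming every call (write or read) restarts the TTL clock.
function simulateCache(timestamps, ttlMs = TTL_MS) {
  let writes = 0
  let reads = 0
  let expiresAt = -Infinity

  for (const t of timestamps) {
    if (t < expiresAt) reads += 1
    else writes += 1
    expiresAt = t + ttlMs
  }
  return { writes, reads }
}

// Cron loop: 100 users, one call every 10 seconds.
const cronLoop = Array.from({ length: 100 }, (_, i) => i * 10_000)

// Sparse pattern: one call per day for 14 days.
const daily = Array.from({ length: 14 }, (_, i) => i * 24 * 60 * 60 * 1000)
```

Under these assumptions the cron loop produces one write and 99 reads, while the daily pattern produces 14 writes and no reads at all.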

3. Separate static from dynamic at the right line

Putting the wrong content in the cached block invalidates the cache constantly. The rule:

Cached block = bytes that are identical across the call pattern you're optimizing.

For my bot, that means the cached block contains the system prompt and nothing else. The user's memory card, recent entries, and current question all go into the dynamic suffix (unmarked, uncached). If I put the memory card into the cached block, each user would write their own cache entry and no call would ever reuse another user's.

4. Cache key is the exact content

The cache key is derived from the exact content of the prompt up to and including the cached block. Even one whitespace change kills the cache. This bites you if you do something like:

// BAD: interpolating a per-user value changes the cache key for each distinct value
const prompt = `${BASE_PROMPT}\n\nUser language: ${user.language}`

The user.language interpolation makes the "static" block per-user. Either move it to the dynamic suffix, or accept multiple cache entries (one per language).
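The first option, sketched with the helper from earlier (repeated here so the snippet runs on its own). BASE_PROMPT is a short placeholder standing in for the real static prompt, and systemFor is a hypothetical wrapper name.

```javascript
// Same helper as in the setup section.
function withPromptCache(staticPrompt, dynamicSuffix = '') {
  const blocks = [
    { type: 'text', text: staticPrompt, cache_control: { type: 'ephemeral' } },
  ]
  if (dynamicSuffix) blocks.push({ type: 'text', text: dynamicSuffix })
  return blocks
}

// Placeholder for the real ~6,000-token system prompt.
const BASE_PROMPT = 'You are a reflective journaling assistant...'

// GOOD: the cached block is byte-identical for every user;
// per-user values live in the uncached suffix.
function systemFor(user) {
  return withPromptCache(BASE_PROMPT, `User language: ${user.language}`)
}
```

With this shape, two users with different languages still share the same first block, so the second user's call can read the entry the first one wrote.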

5. Cache costs +25% the first time

The first call after a cache miss pays 1.25× normal input price to write the cache. If your traffic is too sparse to amortize this across enough reads, you're losing money.

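Rough rule: on paper, one write plus a single read already beats two uncached calls (1.25 + 0.10 = 1.35× versus 2×), but the savings only get interesting at around 3 hits per write, where they reach roughly 60%. If most of your writes never get read at all, just send the system field as a plain string and skip the wrapper.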
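The arithmetic behind that rule can be checked in a few lines. This sketch only models the static block, in units of "one uncached send", using the 1.25× write and 0.10× read multipliers from above; relativeCost is a hypothetical name.

```javascript
const WRITE_MULTIPLIER = 1.25
const READ_MULTIPLIER = 0.10

// Compares one cache write followed by `readsPerWrite` cache reads
// against sending the same block uncached the same number of times.
function relativeCost(readsPerWrite) {
  const cached = WRITE_MULTIPLIER + READ_MULTIPLIER * readsPerWrite
  const uncached = 1 + readsPerWrite
  return { cached, uncached, saving: 1 - cached / uncached }
}
```

With zero reads the cached path costs 25% more than not caching. With one read the saving is about 32%, with three reads about 61%, and with nine reads about 78%.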

Real numbers from the project

Last 14 days, broken down by call type (sorted by spend):

free (chat)                 109 calls  $1.92
memory_card_midweek          41 calls  $1.43
evening                      46 calls  $0.76
morning_ack                  46 calls  $0.55
morning                     159 calls  $0.55
evening_opener              193 calls  $0.53
memory_card                  19 calls  $0.49
weekly_summary               19 calls  $0.34
...
─────────────────────────────────────
TOTAL                       905 calls  $7.62

Average $0.0084 per call on Sonnet 4.5 at ~6k input + 500 output. Without caching, this would land at roughly $0.025/call — about 3× more. Across 905 calls, that's the difference between $7 and $25 for the same work.
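A back-of-the-envelope version of that comparison, using the round numbers above. It ignores the per-call dynamic context and assumes the Sonnet 4.5 list price of $15/MTok for output, which is not stated elsewhere in this post; perCallCost is a hypothetical name.

```javascript
// USD per token. Output price is an assumption based on list pricing.
const INPUT_PRICE = 3 / 1_000_000
const OUTPUT_PRICE = 15 / 1_000_000

// Estimated cost of one call, counting only the static block and the reply.
function perCallCost({ staticTokens, outputTokens, cacheHit }) {
  const staticMultiplier = cacheHit ? 0.10 : 1
  return staticTokens * INPUT_PRICE * staticMultiplier
       + outputTokens * OUTPUT_PRICE
}

const uncached = perCallCost({ staticTokens: 6000, outputTokens: 500, cacheHit: false })
const warm = perCallCost({ staticTokens: 6000, outputTokens: 500, cacheHit: true })
```

Under these assumptions the uncached call comes to about $0.0255 and a warm-cache call to about $0.0093, which is in the same range as the measured $0.0084 average.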

The win compounds as the bot scales. Because the static block is shared across all users, more traffic means more calls land inside a warm cache window, so cost should grow more slowly than the user count.

When NOT to use prompt caching

I want to be specific because the docs gloss over this:

Sparse, one-off calls where the cache usually expires before the next call arrives. You pay the 1.25× write penalty every time and never collect the read savings.
Per-user prompts where the "static" block is actually per-user. You'll write a fresh cache for every user; pay the penalty, get no benefit.
Below the token threshold (1024 Sonnet / 2048 Haiku). Caching silently doesn't apply. Don't bother wrapping in withPromptCache — just save the indirection.
During development when you're iterating on the prompt. Every prompt edit invalidates the cache, so the savings show up only after the prompt is stable.

What this powers

This caching setup runs a Telegram-based self-reflection bot called Wise Insights — daily morning and evening check-ins, weekly summaries, memory layer that learns user patterns over time. It's live at wise.synergize.digital if you want to see what's running on top of all this token plumbing.

Happy to share more of the architecture (Supabase + grammy + node-cron, plus how I handle the memory layer without vector embeddings) if there's interest — drop questions in the comments.

The main lesson: prompt caching is one of those features that looks like a 10% optimization and turns out to be a 70% one, but only if your traffic pattern fits. Measure the three counters, watch the hit rate, and don't wrap it where it won't help.

DEV Community