DEV Community

Mukunda Rao Katta
Mukunda Rao Katta

Posted on

Keep Your Anthropic Prompt Cache Alive With prompt-cache-warmer

The cache hit rate dropped to zero, every time

We had a long system prompt. Around 15,000 tokens. Routing rules, tool descriptions, output format requirements, persona notes. The kind of prompt that builds up over months as the product evolves.

We turned on Anthropic's prompt caching. Cache hit rate climbed above 90% during active hours. Input token costs dropped noticeably. Things looked good.

Then we started watching what happened during slow periods. Overnight. Weekends. The 20 minutes between a customer finishing a task and starting the next one.

The cache hit rate dropped to zero. Every time. Without fail.

The Anthropic prompt cache has a 5-minute TTL. If no request uses a cached prefix in 5 minutes, the cache expires. The next request pays full input token cost, which for a 15k-token system prompt adds up fast.

In practice this meant: we had great cache economics during active sessions, and terrible ones after any idle gap. Traffic bursts after idle periods, common in production, were the worst case. Every burst started cold.

The fix was not complicated, but it needed to be reliable. Keep the cache alive by sending minimal calls on an interval shorter than the TTL.

That is what prompt-cache-warmer does.


The shape of the fix

Install it:

pip install prompt-cache-warmer
Enter fullscreen mode Exit fullscreen mode

Basic usage:

from anthropic import Anthropic
from prompt_cache_warmer import CacheWarmer, WarmupConfig

client = Anthropic()

config = WarmupConfig(
    model="claude-sonnet-4-6",
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant. " + ("context " * 5000),
            "cache_control": {"type": "ephemeral"},
        }
    ],
    interval_seconds=240,  # every 4 minutes, inside the 5-minute TTL
)

warmer = CacheWarmer(client=client, config=config)
warmer.start()

# your agent or server does its thing here
# warmer keeps the cache alive in the background

warmer.stop()
Enter fullscreen mode Exit fullscreen mode

The warmer sends a minimal request on each tick: a short user message with max_tokens=1. The goal is not to get a useful response. The goal is to touch the cached prefix so the TTL resets.

With verification enabled:

config = WarmupConfig(
    model="claude-sonnet-4-6",
    system=[...],
    interval_seconds=240,
    verify=True,
)
Enter fullscreen mode Exit fullscreen mode

When verify=True, the warmer checks the response for cache_read_input_tokens > 0. If the value is zero, the warmup call did not hit the cache. This can happen if the cache had already expired before the interval fired, or if the cache_control breakpoints do not match what you expect.


What it does NOT do

  • It does not manage multiple independent cache scopes. One warmer instance handles one system prompt configuration.
  • It does not adjust the interval dynamically. You configure a fixed interval and it runs on that schedule.
  • It does not retry on failure. If a warmup call fails, the warmer logs the error and waits for the next interval.
  • It does not track cost. It sends real API calls that cost real tokens. For max_tokens=1 with a 15k-token cached system prompt, the cost per warmup call is small. But it adds up if you run many warmers or forget to stop one.

Inside the lib: the optional verify design

The verify step is the most interesting design choice in this library.

When you send a warmup call and get back a response, the response usage object from Anthropic looks like this:

response.usage.cache_read_input_tokens   # tokens served from cache
response.usage.cache_creation_input_tokens  # tokens written to cache
response.usage.input_tokens              # uncached tokens
Enter fullscreen mode Exit fullscreen mode

If cache_read_input_tokens > 0, the warmup hit the cache. The prefix was alive.

If cache_read_input_tokens == 0 and cache_creation_input_tokens > 0, the warmup actually re-created the cache entry. The TTL had expired. The warmup interval was too long.

Without verification, you trust that your interval is short enough. You do not know if the cache expired between your last warmup and this one.

With verification, you confirm each warmup actually hit. If it missed, you have actionable signal: shorten the interval, check your cache_control breakpoints, or investigate whether the cache was invalidated by a system prompt change.

The verification step is optional because it requires parsing the response and checking the usage object. For most cases, setting a 4-minute interval against a 5-minute TTL is conservative enough that you do not need the extra check. But for high-value prompts where a cold start is genuinely costly, verification gives you certainty.


When this is useful

Long system prompts. The prompt cache benefit scales with prompt length. Short prompts do not justify the complexity of a separate warming process. If your system prompt is over 1,024 tokens (the minimum for caching to apply), warming starts to make sense.

Agents with variable traffic. A 24/7 high-traffic service will naturally keep the cache warm on its own. Agents that serve bursts with idle gaps in between are the target case. Customer support bots. Scheduled batch processors. Developer tools people open and close.

Multi-tenant systems with shared prompts. If multiple tenants share a common base system prompt, one warmer keeps that prefix cached for everyone.

Production services where cost spikes matter. During a traffic burst after a cold cache, every concurrent request pays full input token cost for the system prompt. Warming prevents that spike.


When NOT to use this

Short system prompts. If your system prompt is a few hundred tokens, the cost difference between a cache hit and a miss is negligible. Warming is not worth it.

Continuous high-traffic systems. If your system is handling requests constantly, the cache stays warm on its own. Adding a warmer adds cost with no benefit.

Per-user system prompts. If every user has a unique system prompt, warming does not help. The cached prefix must be reused across requests to justify the approach.

Multi-region setups where you do not know which region serves each request. The Anthropic prompt cache is per-region. A warmer running in us-east-1 does not warm the cache in eu-west-1.


Install

pip install prompt-cache-warmer
Enter fullscreen mode Exit fullscreen mode

There is also a Rust port if you need it in a non-Python codebase: MukundaKatta/prompt-cache-warmer-rs

The Rust version exposes a WarmCall trait so you can plug in any HTTP client.


Siblings

These libraries compose naturally with prompt-cache-warmer:

Lib Boundary Repo
cachebench Measures cache hit rates so you can confirm warming is working MukundaKatta/cachebench
claude-cost Calculates cost for cache vs non-cache tokens, shows ROI of warming MukundaKatta/claude-cost
prompt-cache-key Generates stable cache scope hashes, helps you ensure breakpoints are consistent MukundaKatta/prompt-cache-key
prompt-template-version Tracks which prompt version is deployed, so you know what to warm after a prompt update MukundaKatta/prompt-template-version

The most useful pairing is cachebench alongside the warmer. Run both in staging, check that cache_read_input_tokens is consistently high, and only then enable in production.


What is next

The library has 17 tests covering the warmer lifecycle, verify logic, and interval behavior. The main gaps I would like to close:

Multiple breakpoint support. Right now the warmer warms one cache scope. If your system prompt has multiple cache_control breakpoints at different prefixes, you might want to warm all of them independently.

Backoff on verify failure. When verify detects a cache miss, the right response might be to immediately re-warm rather than wait for the next interval. A simple retry-on-miss mode would close that gap.

Metrics hook. Right now the warmer logs errors but does not expose a metrics callback. A simple on_tick(result) callback would let you wire it into your observability stack without coupling the library to any specific metrics system.

Graceful interval adjustment. If you are close to the TTL boundary and a warmup call takes 2 seconds to complete, your effective interval is shorter than configured. Accounting for call duration in the sleep calculation would make the timing tighter.

The library is small by design. Adding a warming heartbeat to an existing agent should take five minutes. That is the goal.


Part of the Hermes Agent Challenge. All libraries in this series are production-ready, independently tested, and available on PyPI or crates.io.

Top comments (0)