DEV Community

Mukunda Rao Katta
Mukunda Rao Katta

Posted on

Pre-Warm Your Anthropic Prompt Cache Before the Traffic Hits

Anthropic's prompt caching can cut input token costs by up to 90% for cached content. The cache TTL is 5 minutes. If your agent is idle for more than 5 minutes and then gets a burst of traffic, the first few requests pay full price while the cache warms up.

For predictable traffic patterns — morning spikes, scheduled batch runs, after-deployment verification — you can pre-warm the cache before users arrive.


The Shape of the Fix

from prompt_cache_warmer import CacheWarmer

warmer = CacheWarmer(
    client=anthropic_client,
    model="claude-sonnet-4-6",
)

# Define what to warm
warmer.add(
    system_prompt=LONG_SYSTEM_PROMPT,
    seed_messages=[{"role": "user", "content": "ping"}],
)

# Run before traffic arrives
result = warmer.warm()
print(f"Cache warmed: {result.cache_creation_tokens} tokens cached")
print(f"Estimated savings on first 100 requests: ${result.estimated_savings:.4f}")
Enter fullscreen mode Exit fullscreen mode

One call before your traffic spike. The cache is warm. Users get cache-hit latency from the first request.


What It Does NOT Do

prompt-cache-warmer does not maintain the cache indefinitely. It makes one warm-up call. The cache TTL is still 5 minutes. If you need persistent cache warmth, schedule the warmer to run every 4 minutes.

It does not cache tool definitions automatically. Tool schemas need to be included in the warm-up call if you want them cached. Pass them as part of seed_messages or via the tools parameter.

It does not work for providers other than Anthropic. Prompt caching is an Anthropic-specific feature. OpenAI has a different caching mechanism that does not require pre-warming.


Inside the Library

The warm-up call adds cache_control: {"type": "ephemeral"} to the system prompt and makes a minimal request:

def warm(self) -> WarmResult:
    response = self._client.messages.create(
        model=self._model,
        max_tokens=1,  # minimal response, we only want the cache write
        system=[{
            "type": "text",
            "text": self._system_prompt,
            "cache_control": {"type": "ephemeral"},
        }],
        messages=self._seed_messages,
    )

    return WarmResult(
        cache_creation_tokens=response.usage.cache_creation_input_tokens,
        cache_read_tokens=response.usage.cache_read_input_tokens,
        input_tokens=response.usage.input_tokens,
        output_tokens=response.usage.output_tokens,
    )
Enter fullscreen mode Exit fullscreen mode

max_tokens=1 minimizes the cost of the warm-up call itself. We only want the cache write side effect.

The estimated_savings calculation: cache_creation_tokens * (full_price - cache_price) * n_expected_requests. With claude-sonnet-4-6 at $3.00 per 1M input tokens and cache reads at $0.30 per 1M (90% discount), warming 50K tokens saves roughly $0.135 per 100 cache hits.

For Rust users: prompt-cache-warmer is also available as a crates.io crate at prompt-cache-warmer v0.1.0. The Rust version uses the reqwest HTTP client and exposes both sync and async APIs.


When to Use It

Use it for agents with large, stable system prompts (5K+ tokens). The cache benefit is proportional to the cached token count. For short system prompts, the warm-up call may not be worth the cost.

Use it before predictable traffic spikes. Morning startup for business-hours agents. Pre-deployment smoke tests. Scheduled batch jobs with a warm-up step.

The scheduling pattern:

import schedule
import time

warmer = CacheWarmer(client=client, model="claude-sonnet-4-6")
warmer.add(system_prompt=SYSTEM_PROMPT, seed_messages=SEED)

# Keep cache warm with 4-minute interval (TTL is 5 minutes)
schedule.every(4).minutes.do(warmer.warm)

while True:
    schedule.run_pending()
    time.sleep(1)
Enter fullscreen mode Exit fullscreen mode

Skip it for agents with dynamic system prompts that change per-user or per-request. Caching requires stable content. If your system prompt changes, each variant needs its own cache entry and pre-warming each variant is impractical.


Install

# Python
pip install git+https://github.com/MukundaKatta/prompt-cache-warmer

# Rust
cargo add prompt-cache-warmer
Enter fullscreen mode Exit fullscreen mode
from prompt_cache_warmer import CacheWarmer, WarmSchedule

# Scheduled warming
warmer = CacheWarmer(client=client, model="claude-sonnet-4-6")
warmer.add(
    system_prompt=SYSTEM_PROMPT,
    seed_messages=[{"role": "user", "content": "warmup"}],
    tools=TOOL_SCHEMAS,  # cache tool definitions too
)

schedule = WarmSchedule(warmer=warmer, interval_seconds=240)
schedule.start()  # runs in a background thread

# Your agent loop runs normally
# Cache stays warm automatically
Enter fullscreen mode Exit fullscreen mode

Sibling Libraries

Library What it solves
cachebench Measure actual cache hit rates and savings
agentfit Fit conversation into token budget for caching
prompt-cache-key Stable hashes for cache scope management
llm-cost-cap Pre-flight cost gate before each call
token-budget-pool Shared token/USD budget across concurrent agents

The caching stack: prompt-cache-warmer keeps the cache live, cachebench measures whether it is working, prompt-cache-key identifies stable cache boundaries, llm-cost-cap validates that the warm-up call itself does not blow your budget.


What's Next

Multi-model warming: warm the same system prompt against multiple models simultaneously. Useful when you load-balance across claude-sonnet-4-6 and claude-haiku-3-5 and want both caches warm.

Cache effectiveness monitoring: after each warm call, record cache_creation_tokens. After subsequent agent calls, record cache_read_tokens. Track the ratio over time. If it drops, the warm interval needs adjustment or the system prompt changed without updating the warm schedule.

The Rust crate already has async support. The Python version's async warm call is on the roadmap for use in async agent frameworks.


Built as part of the agent-stack family: composable Python primitives for production LLM agents.

Top comments (0)