Mukunda Rao Katta

Posted on May 25

prompt-cache-warmer-rs: Pre-Warm the Anthropic Prompt Cache from Rust

#hermeschallenge #ai #rust #agents

The first call is always expensive

You deploy an agent. System prompt is 3,000 tokens. You have a tool catalog with eight tools, each with a detailed description. Total context going into every call is about 4,500 tokens before the user even says anything.

Anthropic's prompt cache is supposed to help with this. Mark your system prompt with cache_control: {"type": "ephemeral"} and subsequent calls within the five-minute window reuse the cached tokens at a fraction of the cost. Input token price drops by 90%. Latency on cached reads is lower too.

But there is a catch. The cache only warms on the first call that sends those tokens with the cache breakpoints. If your agent just woke up, the first real user request pays full price and also takes the latency hit of priming the cache. That is fine for long-running sessions. It is annoying for short-lived Lambda functions, stateless API handlers, or any setup where you restart frequently and the first user pays the bill.

The cost asymmetry is real. On a 4,500-token system prompt and tools payload, the first call with a cache miss costs roughly 18x more in input tokens than a cache hit. For a busy agent that handles many requests per minute, that first-call cost barely registers. For a Lambda that cold-starts on each request, or a worker that scales down during off-hours and back up in the morning, that miss happens every time a new instance starts. If you have 50 workers and all of them spin up at 9am, you pay for 50 cache misses in the first few seconds of the day.

prompt-cache-warmer-rs sends a cheap throw-away call to prime the cache before your first real user request. By the time the user asks anything, the cache is warm.

Shape of the fix

The crate gives you a CacheWarmer that accepts your API key and exposes a warm method. You call it once during startup or in a background task.

use prompt_cache_warmer_rs::{CacheWarmer, WarmOptions};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let warmer = CacheWarmer::new(std::env::var("ANTHROPIC_API_KEY")?);

    warmer.warm(WarmOptions {
        model: "claude-sonnet-4-6",
        system: &SYSTEM_PROMPT,
        tools: &TOOLS,
    }).await?;

    // Your real request handler starts here.
    // First user call hits the warm cache.
    run_agent().await
}

The warm call sends a single-token user message: ".". Anthropic processes the system prompt and tools with the cache breakpoints in place. The actual model response is discarded. The cache window is now open.

For async startup patterns where you want to fire-and-forget:

let warmer = CacheWarmer::new(api_key.clone());
let options = WarmOptions {
    model: "claude-sonnet-4-6",
    system: &SYSTEM_PROMPT,
    tools: &TOOLS,
};

tokio::spawn(async move {
    if let Err(e) = warmer.warm(options).await {
        tracing::warn!("Cache warm failed: {}", e);
    }
});

// Start handling requests while the warm call runs.
serve_requests().await;

What it does not do

This crate does not manage the cache window. Anthropic's ephemeral cache expires after five minutes. If your agent is idle for six minutes and then gets a request, the first call after the idle period pays full price again. The crate does not track this window or re-warm automatically. That is a separate concern. You can pair it with a background task that re-warms on a timer if your traffic pattern has long idle gaps. The crate also does not do anything with the response from the warm call. It discards the model output entirely. The warm call is a side effect, not a query.

Inside the lib

The WarmCall trait is what makes the crate generic. The default implementation uses reqwest under the hood, but you can plug in any async HTTP caller that implements the trait. This matters for testing. In tests you replace the HTTP call with a mock that records what was sent and returns a minimal valid response. All 17 integration tests use this approach, which means no real API keys are needed to run the test suite.

pub trait WarmCall {
    async fn call(&self, request: WarmRequest<'_>) -> Result<(), WarmError>;
}

The WarmRequest struct carries the model, system, tools, and the single-token user message. It does not carry a conversation history. The whole point is that the warm call is stateless and cheap.

The cache breakpoint is injected by the crate. You pass plain strings for your system prompt and tools. The crate wraps them in the cache_control structure before sending. This keeps caller code clean. You do not have to remember to add the cache marker every time.

One design choice worth noting: the warm call always uses the same single-token user message. Some callers asked for a way to customize this. I kept it fixed because the content does not affect what gets cached. The cache breakpoint is on the system prompt and tools, not the user message. Making this configurable would add surface area without changing behavior.

When useful

Lambda functions and short-lived containers where the first cold request would otherwise miss the cache.
Agent servers that restart frequently during deploys or auto-scaling events.
Batch processors that spin up workers on demand and want the first item in each batch to hit the cache.
Development environments where you restart the server constantly and want cache hits from the first test call.
Any setup where your system prompt and tool catalog are static per deployment and you want to guarantee they are cached before traffic arrives.

When not useful

Long-running persistent processes where the first user request is infrequent enough that the warm call is wasted by the time it matters.
Dynamic system prompts that change per user or per session. The cache key includes the content, so if your system prompt varies, there is no single warm target.
Applications where the warm call's API cost is meaningful at your traffic volume. The warm call is cheap (single token output, cached input) but it is not free.
Situations where your Anthropic account is rate-limited and you cannot afford an extra call during startup.

Install

[dependencies]
prompt-cache-warmer-rs = "0.1"
tokio = { version = "1", features = ["full"] }

cargo add prompt-cache-warmer-rs

Siblings

Crate / Package	Language	What it does
prompt-cache-warmer	Python	Same warm-call pattern for Python agent servers
prompt-cache-key	Python	Stable hashes for Anthropic cache scope tracking
cachebench	Python	Measure cache hit rates across prompt versions
llm-message-hash-py	Python	Canonical hash of LLM request for cache key derivation
agentidemp-rs	Rust	Idempotency keys to prevent duplicate warm calls

What is next

The most-requested feature is automatic re-warm on a configurable interval. A KeepWarm wrapper that spawns a background task and re-fires the warm call every four minutes is planned for the next release. After that, I want to add a WarmStatus type that tracks whether the last warm call succeeded and when, so callers can decide whether to proceed or wait before serving real traffic.

Source is at github.com/MukundaKatta/prompt-cache-warmer-rs. Published at crates.io/crates/prompt-cache-warmer-rs. If you hit a use case not covered by the current API, open an issue. The WarmCall trait keeps the surface narrow on purpose, so feature requests that require it to grow deserve a close look before landing.

DEV Community