Mukunda Rao Katta

Posted on May 25

cachebench: stop finding out about prompt-cache regressions from the invoice

#hermeschallenge #ai #llm #agents

Prompt caching is the single highest-ROI feature shipping in LLM APIs right now. On Anthropic and OpenAI, tokens served from the cache are billed at a steep discount, so a healthy cache hit ratio cuts the cost of those input tokens by 50 to 90 percent, depending on provider and model. On a long system prompt with a large RAG context, that is the difference between a sustainable agent and one that quietly bankrupts itself.

There is one problem. Per-request hit ratio is invisible from the SDK. Misses are silent. A single deploy that appends a timestamp to a system prompt can halve your cache hit rate and double your bill, and the only place you find out is the monthly invoice.

I have shipped enough agents to make this mistake twice. cachebench is what I wrote so I never make it a third time.

The problem

Three things conspire against you.

First, the SDK does not surface hit ratio in a useful shape. The raw counts are there as fields on the response's usage object (cache_read_input_tokens and cache_creation_input_tokens on Anthropic, cached_tokens under prompt_tokens_details on OpenAI), but only if you know to look, and only after you have already paid for the call.
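For reference, the per-request ratio can be derived by hand from those counts. This is a minimal sketch, assuming the Anthropic Messages API usage field names (input_tokens, cache_read_input_tokens, cache_creation_input_tokens) and a plain dict instead of the SDK's response object. The hit_ratio helper is illustrative and is not part of cachebench.

```python
def hit_ratio(usage: dict) -> float:
    """Fraction of all input tokens on one response that were read from cache."""
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    total = read + written + uncached
    # Guard against an empty usage block rather than dividing by zero.
    return read / total if total else 0.0


# Hypothetical usage payload: 9,000 tokens read from cache, 1,000 uncached.
example = {
    "input_tokens": 1000,
    "cache_read_input_tokens": 9000,
    "cache_creation_input_tokens": 0,
}
print(hit_ratio(example))  # 0.9
```

Doing this on every call site is exactly the bookkeeping nobody keeps up, which is the gap a wrapper is meant to close.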

Second, the API has documented quirks. Anthropic's SDK silently misses about 40 percent on back-to-back requests at certain timing windows. OpenAI cache mechanics differ across models. Bedrock pricing has its own shape. None of this is in the response object.

Third, the most common cause of a regression is your own deploy. A timestamp in the system prompt. A reordered tool definition. A new template engine that re-serializes the prompt with different whitespace. Any of these silently invalidates the cache, and you find out from billing.
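A deploy-time guard can catch the timestamp case before it reaches production. The sketch below is an assumption about how such a check might look, not something cachebench ships: build_prompt_bad, build_prompt_good, and is_stable are hypothetical names, and the idea is simply to build the prompt at two different clock values and compare hashes.

```python
import hashlib
from datetime import datetime, timezone

BASE_PROMPT = "You are a support agent. Follow the policies below."


def build_prompt_bad(now: datetime) -> str:
    # The timestamp leads the prompt, so every request has a different prefix.
    return f"Current time: {now.isoformat()}\n{BASE_PROMPT}"


def build_prompt_good(now: datetime) -> str:
    # Volatile data is kept out of the cached prefix entirely.
    return BASE_PROMPT


def digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_stable(builder) -> bool:
    """True if the builder returns byte-identical text at two different times."""
    t1 = datetime(2025, 1, 1, tzinfo=timezone.utc)
    t2 = datetime(2025, 1, 2, tzinfo=timezone.utc)
    return digest(builder(t1)) == digest(builder(t2))


print(is_stable(build_prompt_bad))   # False
print(is_stable(build_prompt_good))  # True
```

Run as a unit test in CI, a check like this fails the build instead of the invoice. It does not catch tool reordering or whitespace changes between releases; those need a comparison against the previous deploy's hash.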

The shape of the fix

You wrap your client call. Every wrapped call records hit ratio, cost saved, and the prefix it tried to hit.

from anthropic import Anthropic
from cachebench import CacheTracker, Provider

client = Anthropic()
tracker = CacheTracker(provider=Provider.ANTHROPIC, miss_alert_threshold=0.6)
create = tracker.wrap(client.messages.create)

response = create(
    model="claude-sonnet-4-20250514",
    max_tokens=200,
    system=[{"type": "text", "text": "...", "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "Hello"}],
)

print(tracker.aggregate())
# {'calls': 1, 'hit_ratio': 0.94, 'cost_saved_usd': 0.012, ...}
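To make a number like cost_saved_usd concrete, here is a rough sketch of the arithmetic. It assumes Anthropic-style pricing at the time of writing, where cache reads are billed at about 0.1x the base input price and cache writes (5-minute TTL) at about 1.25x. The multipliers, the base price, and the cost_saved_usd function below are illustrative assumptions and may not match how cachebench computes its figure.

```python
def cost_saved_usd(
    usage: dict,
    base_price_per_mtok: float,
    read_multiplier: float = 0.10,   # assumed cache-read discount
    write_multiplier: float = 1.25,  # assumed cache-write premium
) -> float:
    """Input cost without caching minus input cost with caching, in USD."""
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    per_token = base_price_per_mtok / 1_000_000

    without_cache = (read + written + uncached) * per_token
    with_cache = (
        uncached + read * read_multiplier + written * write_multiplier
    ) * per_token
    return without_cache - with_cache


# Placeholder base price of 3.0 USD per million input tokens.
warm = {"input_tokens": 1000, "cache_read_input_tokens": 9000}
cold = {"input_tokens": 0, "cache_creation_input_tokens": 10000}

print(round(cost_saved_usd(warm, 3.0), 6))  # 0.0243
print(round(cost_saved_usd(cold, 3.0), 6))  # -0.0075
```

The negative value on the cold call is the point: a cache write costs more than an uncached read, so a prefix that keeps getting rewritten and never read is more expensive than not caching at all.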

When you want to know which prefix regressed:

for prefix_id, stats in tracker.by_prefix().items():
    if stats["hit_ratio"] < 0.5:
        print(f"REGRESSED: {prefix_id} {stats}")

When you want an alert whenever a request's hit ratio falls below the threshold:

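import os

import requests

# Incoming-webhook URL read from the environment (the variable name is illustrative).
SLACK_URL = os.environ["SLACK_WEBHOOK_URL"]

def to_slack(metrics):
    # The timeout keeps a slow webhook from stalling the agent.
    requests.post(SLACK_URL, json={
        "text": f"Cache regression: {metrics.prefix_id} ratio={metrics.hit_ratio:.2f}"
    }, timeout=5)

tracker = CacheTracker(provider=Provider.ANTHROPIC, on_miss_alert=to_slack)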

When you want a retry around the Anthropic eventual-consistency miss:

from cachebench import CachePolicy

tracker = CacheTracker(
    provider=Provider.ANTHROPIC,
    policy=CachePolicy.miss_aware(delay_ms=2000, max_retries=1),
)

The wrapper handles sync and async paths automatically. tracker.wrap detects coroutines and returns the matching shape.
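One common way to get that behavior is to branch on inspect.iscoroutinefunction. The sketch below shows the general technique only; the wrap function and its record callback are hypothetical and are not cachebench's actual implementation. It assumes the wrapped callable is declared with async def; a plain function that merely returns an awaitable would need a check on the return value instead.

```python
import asyncio
import functools
import inspect


def wrap(fn, record):
    """Return a wrapper with the same sync/async shape as fn."""
    if inspect.iscoroutinefunction(fn):
        @functools.wraps(fn)
        async def async_wrapper(*args, **kwargs):
            response = await fn(*args, **kwargs)
            record(response)  # observe the response, then pass it through
            return response
        return async_wrapper

    @functools.wraps(fn)
    def sync_wrapper(*args, **kwargs):
        response = fn(*args, **kwargs)
        record(response)
        return response
    return sync_wrapper


seen = []


def sync_call(x):
    return {"value": x}


async def async_call(x):
    return {"value": x}


wrapped_sync = wrap(sync_call, seen.append)
wrapped_async = wrap(async_call, seen.append)

wrapped_sync(1)
asyncio.run(wrapped_async(2))
print(seen)  # [{'value': 1}, {'value': 2}]
```

The caller keeps writing either create(...) or await create(...) exactly as before, which is what lets a wrapper drop into existing code without changes.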

What it does NOT do

It is not a proxy. It is not a server.
It is not a cache itself. It observes the provider's cache. It does not store responses.
It is not a billing dashboard. It exports metrics. The dashboard is your job.
It does not auto-inject cache breakpoints into your prompts. Different tool for that.

Inside the lib (one design choice worth showing)

Every wrapped call gets a prefix_id. The prefix_id is a stable hash of the cacheable prefix, computed before the call goes out. That is what powers per-prefix grouping.

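import hashlib
import json

def fingerprint(messages, system) -> str:
    # canonical serialization of the cacheable portion only
    # (_normalize and _cacheable_prefix are internal helpers, not shown here)
    payload = {
        "system": _normalize(system),
        "messages_prefix": _cacheable_prefix(messages),
    }
    # sorted keys + compact separators give byte-identical JSON for equal payloads
    return hashlib.sha256(
        json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    ).hexdigest()[:16]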

Two details in those few lines matter:

The first is sort_keys=True plus a compact separator. Different SDK versions serialize the same dict with different key ordering and whitespace. If the prefix_id were sensitive to that, two semantically identical prompts would look like different prefixes and the per-prefix view would be useless.

The second is _cacheable_prefix(messages). Only the portion of the messages list that is actually cacheable goes into the hash. The newest user message does not. If it did, every call would have a unique prefix_id and the per-prefix grouping would have one row per call.

The whole point of the fingerprint is to ask, "is this the same cacheable prefix as the last 1,000 calls, and if so, what fraction of them actually hit?" Getting that question right means getting the hash boundary right.
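Both properties are easy to check with a runnable version. In the sketch below, _normalize and _cacheable_prefix are simplified stand-ins written for illustration. The real helpers are not shown in this post, and the "everything except the final message" rule is an assumption, since the actual cacheable boundary depends on where the cache breakpoints sit.

```python
import hashlib
import json


def _normalize(system):
    # Stand-in: accept a plain string or a list of content blocks.
    if isinstance(system, str):
        return [{"type": "text", "text": system}]
    return list(system or [])


def _cacheable_prefix(messages):
    # Stand-in rule: everything except the newest message.
    return list(messages[:-1])


def fingerprint(messages, system) -> str:
    payload = {
        "system": _normalize(system),
        "messages_prefix": _cacheable_prefix(messages),
    }
    return hashlib.sha256(
        json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    ).hexdigest()[:16]


history = [
    {"role": "user", "content": "What is my balance?"},
    {"role": "assistant", "content": "It is 42 dollars."},
]
# Same history with dict keys in a different insertion order.
reordered = [{"content": m["content"], "role": m["role"]} for m in history]

a = fingerprint(history + [{"role": "user", "content": "Thanks"}], "You are a bank bot.")
b = fingerprint(reordered + [{"role": "user", "content": "And my limit?"}], "You are a bank bot.")
c = fingerprint(history + [{"role": "user", "content": "Thanks"}], "You are a bank bot. v2")

print(a == b)  # True: key order and the newest message do not affect the id
print(a == c)  # False: a changed system prompt is a different prefix
```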

When this is useful

You ship prompt changes often and want a fast signal when a change tanked the cache.
You run multiple system prompts (per-tenant, per-flow, per-experiment) and want to know which one is regressing.
You are on Anthropic and have been bitten by the eventual-consistency miss window.
You want a "cost saved this hour" number to put on a dashboard.
You run a fleet of agents with shared prefixes and want to confirm they really are sharing the cache.

When this is NOT what you want

You want a managed observability product with dashboards out of the box. Use Phoenix, Langfuse, or Helicone.
You want a smart caching layer that decides what to cache. Different problem space.
You only ever run one prompt with one system message and you have already eyeballed the cache token counts. The aggregate-of-one is not worth the wrapper.

Install

pip install cachebench

Repo: https://github.com/MukundaKatta/cachebench

Sibling libraries

Library	Role
bedrock-kit	Full AWS Bedrock client wrapper (throttle, JSON repair, cost ledger)
llmfleet	Pool concurrent calls into provider Batch APIs
agenttap	Wire-level prompt introspection
agenttrace	Cost + latency per run
claude-cost	Cache-aware cost calc for Anthropic

bedrock-kit and cachebench compose cleanly. Wrap a bedrock_kit.BedrockClient.invoke call with CacheTracker.wrap and you get bedrock-kit's client features plus cachebench's per-prefix regression alerting.

What's next

I want a small Streamlit dashboard that consumes the exported metrics directly, so a team can stand it up in five minutes without an observability vendor. I also want a per-prefix retention setting so very long-lived agent processes do not grow the in-memory store without bound.
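One possible shape for that retention setting is a least-recently-used cap on the number of tracked prefixes. The sketch below is speculative: BoundedPrefixStats and max_prefixes are hypothetical names, and nothing here reflects how cachebench stores its metrics today.

```python
from collections import OrderedDict


class BoundedPrefixStats:
    """Per-prefix hit counters that evict the least recently used prefix."""

    def __init__(self, max_prefixes: int = 1000):
        self.max_prefixes = max_prefixes
        self._stats = OrderedDict()

    def record(self, prefix_id: str, hit: bool) -> None:
        entry = self._stats.pop(prefix_id, None) or {"calls": 0, "hits": 0}
        entry["calls"] += 1
        entry["hits"] += int(hit)
        self._stats[prefix_id] = entry  # re-insert as most recently used
        while len(self._stats) > self.max_prefixes:
            self._stats.popitem(last=False)  # drop the least recently used

    def by_prefix(self) -> dict:
        return {
            pid: {**s, "hit_ratio": s["hits"] / s["calls"]}
            for pid, s in self._stats.items()
        }


store = BoundedPrefixStats(max_prefixes=2)
store.record("a", True)
store.record("b", False)
store.record("a", False)
store.record("c", True)  # evicts "b", the least recently used prefix

print(sorted(store.by_prefix()))          # ['a', 'c']
print(store.by_prefix()["a"]["hit_ratio"])  # 0.5
```

The trade-off is that an evicted prefix loses its history, so a cap that is too small would hide exactly the rarely used prefixes that tend to regress unnoticed.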

If you run prompts in production and you do not know your current cache hit ratio, that is the first number to put a name on. The bill is already paying for the misses.

DEV Community