Anthropic's prompt cache bills cache reads at 10% of the base input-token price, a 90% cost reduction on cached tokens. But the cache only hits when the prompt prefix (tools, then system prompt, then messages, up to a cache_control breakpoint) is byte-for-byte identical to a previous request's. If anything in that prefix changes (a trailing space, a different dict ordering, a timestamp injected at the wrong place), the cache misses and you pay full price for those tokens again.
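To make the dict-ordering point concrete, here is a small stdlib-only illustration. It is not part of prompt-cache-key, and the tool fields are made up. Two dicts that compare equal can still serialize to different JSON text, and therefore hash differently, when their keys were inserted in a different order:

```python
import hashlib
import json

# Same tool definition, keys inserted in a different order (illustrative fields)
tool_a = {"name": "search", "description": "Search the web", "input_schema": {"type": "object"}}
tool_b = {"input_schema": {"type": "object"}, "description": "Search the web", "name": "search"}


def sha(payload: object, **dump_kwargs) -> str:
    # SHA-256 of the JSON text that json.dumps produces for payload
    return hashlib.sha256(json.dumps(payload, **dump_kwargs).encode()).hexdigest()


print(tool_a == tool_b)                                            # True: dict equality ignores key order
print(sha(tool_a) == sha(tool_b))                                  # False: json.dumps keeps insertion order
print(sha(tool_a, sort_keys=True) == sha(tool_b, sort_keys=True))  # True: sorted keys give identical text
```

Whether a difference like this reaches the API depends on how your request gets serialized. The general point is that equality in Python does not guarantee identical bytes.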
You changed a prompt last week. Cache hit rates dropped from 85% to 12%. It took three days to figure out that a list of tools was being serialized in a different order each call.
prompt-cache-key computes a stable cache scope hash so you can compare what you are sending against what you sent before and find the byte that broke the cache.
The Shape of the Fix
```python
import structlog  # assumption: a structlog-style logger that accepts keyword fields
from prompt_cache_key import CacheKey

logger = structlog.get_logger()
key = CacheKey()

system_prompt = "You are a helpful assistant..."
tool_schemas = get_tool_schemas()  # hypothetical helper: your list of tool dicts, order may vary

# Compute the cache key for the cache-eligible prefix
ck = key.compute(
    system=system_prompt,
    tools=tool_schemas,
)
print(f"Cache scope hash: {ck.hex}")

# Log it with every request
logger.info("llm_call", cache_key=ck.hex, model="claude-sonnet-4-6")
# If the hash changes between calls that should share a cache, the prefix changed and the cache will miss
```
When the hash changes between two requests that should hit the same cache scope, you know something in the prefix changed. You can then diff the canonical representations to find the difference.
What It Does NOT Do
prompt-cache-key does not interact with Anthropic's API. It computes a local hash of your prompt content. The actual cache lookup happens on Anthropic's servers. You cannot query whether a specific hash is currently cached; only the API call itself will tell you via cache_read_input_tokens in the usage response.
It does not guarantee your prompt will be cached. Anthropic's cache requires cache_control: {"type": "ephemeral"} on the appropriate content blocks, and the cached prefix must also meet the model's minimum cacheable length. If either condition is not met, the cache will not activate regardless of hash stability.
It does not track cache hit rates automatically. The hash is for debugging and comparison. You get cache hit data from response.usage.cache_read_input_tokens in the API response.
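Because hit data comes from the response, you can use a small helper to turn one response's usage block into a hit ratio. This is a sketch and is not part of prompt-cache-key. It assumes the three usage fields Anthropic reports: input_tokens for the uncached tokens after the last cache breakpoint, plus cache_read_input_tokens and cache_creation_input_tokens. Together, these three fields make up the request's input. The numbers in the demo are made up.

```python
from types import SimpleNamespace


def cache_hit_ratio(usage) -> float:
    """Share of one request's input tokens that were served from the prompt cache."""
    # The cache fields can be None, so treat missing values as 0
    read = getattr(usage, "cache_read_input_tokens", None) or 0
    written = getattr(usage, "cache_creation_input_tokens", None) or 0
    uncached = getattr(usage, "input_tokens", None) or 0
    total = read + written + uncached
    return read / total if total else 0.0


# Stand-in for response.usage, with illustrative numbers
usage = SimpleNamespace(input_tokens=50, cache_read_input_tokens=1950, cache_creation_input_tokens=0)
print(f"{cache_hit_ratio(usage):.1%}")
```

In real code you would pass response.usage directly. Aggregating these ratios over many requests gives you the hit rate that the hash alone cannot tell you.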
Inside the Library
The cache key is computed from a canonical serialization of the cache-eligible content:
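```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class CacheKeyResult:
    hex: str        # first 16 hex chars of the digest, for logs
    full: str       # full SHA-256 hex digest
    canonical: str  # the exact string that was hashed, for diffing


class CacheKey:
    def compute(
        self,
        system: str | None = None,
        tools: list[dict] | None = None,
        messages: list[dict] | None = None,
    ) -> CacheKeyResult:
        parts = []
        if system:
            parts.append(("system", system))
        if tools:
            # Sort tools by name for stability
            stable_tools = sorted(tools, key=lambda t: t.get("name", ""))
            parts.append(("tools", json.dumps(stable_tools, sort_keys=True)))
        if messages:
            # Only include messages with cache_control set.
            # Note: Anthropic's cached prefix also covers earlier messages without
            # cache_control, so edits to those will not change this hash.
            cached_messages = [m for m in messages if self._has_cache_control(m)]
            if cached_messages:
                parts.append(("messages", json.dumps(cached_messages, sort_keys=True)))
        # parts is a list of (name, string) pairs, so sort_keys has no effect here
        canonical = json.dumps(parts, sort_keys=True)
        digest = hashlib.sha256(canonical.encode()).hexdigest()
        return CacheKeyResult(
            hex=digest[:16],
            full=digest,
            canonical=canonical,
        )

    def _has_cache_control(self, message: dict) -> bool:
        # String content has no blocks, so it can never carry cache_control
        for block in message.get("content", []):
            if isinstance(block, dict) and "cache_control" in block:
                return True
        return False
```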
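The key improvements over naive hashing:
- Tools are sorted by name before serialization, so insertion order does not change the hash
- sort_keys=True in json.dumps means dict field order does not change the hash
- Only the cache-eligible prefix (system + tools + messages that carry cache_control) is hashed, not the full message list

One caveat: the API receives your tool list in whatever order you pass it, and that order is part of the cached prefix. Because the hash normalizes tool order away, a stable hash does not rule out a miss caused by reordered tools. Sort the list you send by name as well.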
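To see which changes the key tolerates and which it catches, here is a standalone sketch that re-implements the system and tools steps from the listing above. It does not import prompt-cache-key, and the tool names are illustrative.

```python
import hashlib
import json


def canonical_hash(system: str, tools: list[dict]) -> str:
    # Mirrors the listing above: tools sorted by name, keys sorted, then SHA-256
    stable_tools = sorted(tools, key=lambda t: t.get("name", ""))
    parts = [("system", system), ("tools", json.dumps(stable_tools, sort_keys=True))]
    canonical = json.dumps(parts, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


weather = {"name": "get_weather", "input_schema": {"type": "object"}}
search = {"name": "search", "input_schema": {"type": "object"}}

h1 = canonical_hash("You are helpful.", [weather, search])
h2 = canonical_hash("You are helpful.", [search, weather])   # tool order swapped
h3 = canonical_hash("You are helpful. ", [weather, search])  # trailing space added

print(h1 == h2)  # True: tool order is normalized away
print(h1 == h3)  # False: the system text changed by one character

# The API still sees whatever order you pass, so send the sorted list too
tools_to_send = sorted([search, weather], key=lambda t: t["name"])
```

The last line covers what the hash cannot check. Sorting before sending keeps the bytes on the wire in the same order the key assumes.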
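The canonical string is preserved in CacheKeyResult.canonical so you can diff two keys when they disagree. It is a single JSON line, so expand it into one line per part (and per JSON line inside the tools and messages) before a line-based diff:

```python
import difflib
import json

def canonical_lines(canonical: str) -> list[str]:
    # canonical is one JSON line; expand each part so the diff is line by line
    lines = []
    for name, value in json.loads(canonical):
        lines.append(f"== {name} ==")
        if name in ("tools", "messages"):
            value = json.dumps(json.loads(value), indent=2, sort_keys=True)
        lines.extend(value.splitlines())
    return lines

# Debug a cache miss
key1 = cache_key.compute(system=old_system, tools=tools)
key2 = cache_key.compute(system=new_system, tools=tools)
if key1.hex != key2.hex:
    diff = difflib.unified_diff(
        canonical_lines(key1.canonical), canonical_lines(key2.canonical), lineterm=""
    )
    print("\n".join(diff))
```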
When to Use It
Use it when Anthropic prompt caching is part of your cost model. If you are relying on cache hits to keep costs down, you need to verify that your cache key is stable across requests.
Use it during development when you are adding new tool schemas, modifying the system prompt, or refactoring how you assemble the prompt. Hash stability is easy to break accidentally. Log the hash and watch for changes.
Use it as a deployment check. After a code deploy, compute the cache key from the new code and compare it to the previous deploy's key. A changed key means a cache miss wave on the first requests after deploy.
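For the deployment check, one option is to record each deploy's key in a file and compare against it on the next deploy. This is a minimal sketch. It assumes a hypothetical state file that persists between deploys, such as a CI artifact. The function name and file are illustrative and not part of prompt-cache-key.

```python
from pathlib import Path


def check_deploy_key(current_hex: str, state_file: Path) -> bool:
    """Compare this deploy's cache key with the one recorded by the previous deploy.

    Returns True if the key is unchanged or there is no previous record.
    Always records the current key for the next deploy.
    """
    previous = state_file.read_text().strip() if state_file.exists() else None
    state_file.write_text(current_hex + "\n")
    if previous is not None and previous != current_hex:
        print(f"cache key changed: {previous} -> {current_hex}; expect cache misses after deploy")
        return False
    return True
```

You would pass it the .hex of cache_key.compute(...) for the new build, then decide whether a changed key should fail the pipeline or only warn.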
Skip it if you are not using Anthropic prompt caching, or if your prompts are fully dynamic (e.g., per-user system prompts that are different for every request). A stable cache key only matters when you have stable cacheable content.
Install
```bash
pip install git+https://github.com/MukundaKatta/prompt-cache-key
# Or from PyPI
pip install prompt-cache-key
```
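A fuller integration that checks the key at startup and logs it next to Anthropic's cache usage counters on every call:

```python
from pathlib import Path

import anthropic
import structlog  # assumption: a structlog-style logger that accepts keyword fields
from prompt_cache_key import CacheKey

logger = structlog.get_logger()
anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
cache_key = CacheKey()

SYSTEM_PROMPT = Path("prompts/system.txt").read_text()
# load_tool_schemas() stands in for however you build your tool list.
# Sort once so the order you send matches the order the key hashes.
TOOL_SCHEMAS = sorted(load_tool_schemas(), key=lambda t: t["name"])

# Compute once at startup to verify stability
startup_key = cache_key.compute(system=SYSTEM_PROMPT, tools=TOOL_SCHEMAS)
logger.info("startup_cache_key", key=startup_key.hex)


def call_llm(messages: list[dict], run_id: str) -> anthropic.types.Message:
    # Verify key matches startup (guards against runtime mutations)
    current_key = cache_key.compute(system=SYSTEM_PROMPT, tools=TOOL_SCHEMAS)
    if current_key.hex != startup_key.hex:
        logger.warning("cache_key_changed", run_id=run_id, new_key=current_key.hex)
    response = anthropic_client.messages.create(
        model="claude-sonnet-4-6",
        system=[{
            "type": "text",
            "text": SYSTEM_PROMPT,
            # Breakpoint: tools + system are cached together as one prefix
            "cache_control": {"type": "ephemeral"},
        }],
        tools=TOOL_SCHEMAS,
        messages=messages,
        max_tokens=1024,
    )
    logger.info(
        "llm_call",
        run_id=run_id,
        cache_key=current_key.hex,
        cache_read_tokens=response.usage.cache_read_input_tokens,
        cache_write_tokens=response.usage.cache_creation_input_tokens,
    )
    return response
```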
Sibling Libraries
| Library | What it solves |
|---|---|
| prompt-cache-warmer | Pre-warm Anthropic's prompt cache with a max_tokens=1 call |
| cachebench | Benchmark client-side vs server-side cache performance |
| llm-cache-mem | Client-side in-process LRU cache for identical requests |
| llm-prompt-version | Version and hash prompt templates for traceability |
| agent-tool-spec-pack | Stable multi-provider tool schema generation |
The Anthropic caching stack: prompt-cache-key for debugging cache stability, prompt-cache-warmer for pre-warming after deploys, cachebench for measuring actual hit rates.
What's Next
Drift detector: CacheDriftDetector that logs the cache key on each request and emits a warning when consecutive requests have different keys. A configurable threshold (e.g., warn if the key changes more than once per 100 requests) would catch drift in production without alerting on every request.
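CacheDriftDetector is planned rather than published, so its API is unknown. The following is a hypothetical sketch of the windowed threshold described above. The class name comes from this paragraph, but the method and parameter names are made up for illustration.

```python
from collections import deque


class CacheDriftDetector:
    """Hypothetical sketch: flag drift when the cache key changes too often.

    Over the last `window` requests, counts how many times the key differed
    from the previous request's key, and flags drift once that count
    exceeds `max_changes`.
    """

    def __init__(self, window: int = 100, max_changes: int = 1) -> None:
        self.max_changes = max_changes
        self._changes: deque[bool] = deque(maxlen=window)  # True where the key changed
        self._last_key: str | None = None

    def observe(self, key_hex: str) -> bool:
        """Record one request's key; return True if drift exceeds the threshold."""
        changed = self._last_key is not None and key_hex != self._last_key
        self._last_key = key_hex
        self._changes.append(changed)
        return sum(self._changes) > self.max_changes
```

In call_llm, you would call observe(current_key.hex) on each request and log a warning only when it returns True. A single expected change, such as a deploy, stays quiet, while repeated flapping gets reported.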
Provider abstraction: similar hash computation for OpenAI (which caches repeated prompt prefixes automatically) and Google (Gemini context caching). The canonical key format can be extended to cover multiple providers.
Visualization tool: a CLI command that takes two JSONL log files and shows which requests had matching cache keys and which did not. Useful for post-mortem analysis of a cache miss wave after a deploy.
Built as part of the agent-stack family: composable Python primitives for production LLM agents.