DEV Community

Mukunda Rao Katta
Mukunda Rao Katta

Posted on

Stable Anthropic Cache Scope Hashes: Know Whether Your Prompt Will Hit the Cache

Anthropic's prompt cache gives you 90% cost reduction on cached tokens. But the cache only hits when the prompt prefix is byte-for-byte identical to a previous request. If anything in the system prompt changes — a trailing space, a different dict ordering, a timestamp injected at the wrong place — the cache misses and you pay full price.

You changed a prompt last week. Cache hit rates dropped from 85% to 12%. It took three days to figure out that a list of tools was being serialized in a different order each call.

prompt-cache-key computes a stable cache scope hash so you can compare what you are sending against what you sent before and find the byte that broke the cache.


The Shape of the Fix

from prompt_cache_key import CacheKey

key = CacheKey()

system_prompt = "You are a helpful assistant..."
tool_schemas = get_tool_schemas()  # dict, potentially unstable ordering

# Compute the cache key for the cache-eligible prefix
ck = key.compute(
    system=system_prompt,
    tools=tool_schemas,
)

print(f"Cache scope hash: {ck.hex}")

# Log it with every request
logger.info("llm_call", cache_key=ck.hex, model="claude-sonnet-4-6")

# If the hash changes between calls, the cache will miss
Enter fullscreen mode Exit fullscreen mode

When the hash changes between two requests that should hit the same cache scope, you know something in the prefix changed. You can then diff the canonical representations to find the difference.


What It Does NOT Do

prompt-cache-key does not interact with Anthropic's API. It computes a local hash of your prompt content. The actual cache lookup happens on Anthropic's servers. You cannot query whether a specific hash is currently cached; only the API call itself will tell you via cache_read_input_tokens in the usage response.

It does not guarantee your prompt will be cached. Anthropic's cache requires cache_control: {"type": "ephemeral"} on the appropriate content blocks. If you have not set those, the cache will not activate regardless of hash stability.

It does not track cache hit rates automatically. The hash is for debugging and comparison. You get cache hit data from response.usage.cache_read_input_tokens in the API response.


Inside the Library

The cache key is computed from a canonical serialization of the cache-eligible content:

import hashlib
import json

class CacheKey:
    def compute(
        self,
        system: str | None = None,
        tools: list[dict] | None = None,
        messages: list[dict] | None = None,
    ) -> "CacheKeyResult":
        parts = []

        if system:
            parts.append(("system", system))

        if tools:
            # Sort tools by name for stability
            stable_tools = sorted(tools, key=lambda t: t.get("name", ""))
            parts.append(("tools", json.dumps(stable_tools, sort_keys=True)))

        if messages:
            # Only include messages with cache_control set
            cached_messages = [
                m for m in messages
                if self._has_cache_control(m)
            ]
            if cached_messages:
                parts.append(("messages", json.dumps(cached_messages, sort_keys=True)))

        canonical = json.dumps(parts, sort_keys=True)
        digest = hashlib.sha256(canonical.encode()).hexdigest()

        return CacheKeyResult(
            hex=digest[:16],
            full=digest,
            canonical=canonical,
        )

    def _has_cache_control(self, message: dict) -> bool:
        for block in message.get("content", []):
            if isinstance(block, dict) and "cache_control" in block:
                return True
        return False
Enter fullscreen mode Exit fullscreen mode

The key improvements over naive hashing:

  1. Tools are sorted by name before serialization so insertion order does not matter
  2. sort_keys=True in json.dumps means dict field order does not matter
  3. Only the cache-eligible prefix (system + tools + cached messages) is hashed, not the full message list

The canonical string is preserved in CacheKeyResult.canonical so you can diff two canonical strings directly:

# Debug a cache miss
key1 = cache_key.compute(system=old_system, tools=tools)
key2 = cache_key.compute(system=new_system, tools=tools)

if key1.hex != key2.hex:
    import difflib
    diff = difflib.unified_diff(
        key1.canonical.splitlines(),
        key2.canonical.splitlines(),
        lineterm="",
    )
    print("\n".join(diff))
Enter fullscreen mode Exit fullscreen mode

When to Use It

Use it when Anthropic prompt caching is part of your cost model. If you are relying on cache hits to keep costs down, you need to verify that your cache key is stable across requests.

Use it during development when you are adding new tool schemas, modifying the system prompt, or refactoring how you assemble the prompt. Hash stability is easy to break accidentally. Log the hash and watch for changes.

Use it as a deployment check. After a code deploy, compute the cache key from the new code and compare it to the previous deploy's key. A changed key means a cache miss wave on the first requests after deploy.

Skip it if you are not using Anthropic prompt caching, or if your prompts are fully dynamic (e.g., per-user system prompts that are different for every request). A stable cache key only matters when you have stable cacheable content.


Install

pip install git+https://github.com/MukundaKatta/prompt-cache-key

# Or from PyPI
pip install prompt-cache-key
Enter fullscreen mode Exit fullscreen mode
from prompt_cache_key import CacheKey
from agent_step_log import StepLog

cache_key = CacheKey()

SYSTEM_PROMPT = open("prompts/system.txt").read()
TOOL_SCHEMAS = load_tool_schemas()

# Compute once at startup to verify stability
startup_key = cache_key.compute(system=SYSTEM_PROMPT, tools=TOOL_SCHEMAS)
logger.info("startup_cache_key", key=startup_key.hex)

def call_llm(messages: list[dict], run_id: str) -> dict:
    # Verify key matches startup (guards against runtime mutations)
    current_key = cache_key.compute(system=SYSTEM_PROMPT, tools=TOOL_SCHEMAS)
    if current_key.hex != startup_key.hex:
        logger.warning("cache_key_changed", new_key=current_key.hex)

    response = anthropic_client.messages.create(
        model="claude-sonnet-4-6",
        system=[{
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }],
        tools=TOOL_SCHEMAS,
        messages=messages,
        max_tokens=1024,
    )

    logger.info(
        "llm_call",
        cache_key=current_key.hex,
        cache_read_tokens=response.usage.cache_read_input_tokens,
        cache_write_tokens=response.usage.cache_creation_input_tokens,
    )

    return response
Enter fullscreen mode Exit fullscreen mode

Sibling Libraries

Library What it solves
prompt-cache-warmer Pre-warm Anthropic's prompt cache with a max_tokens=1 call
cachebench Benchmark client-side vs server-side cache performance
llm-cache-mem Client-side in-process LRU cache for identical requests
llm-prompt-version Version and hash prompt templates for traceability
agent-tool-spec-pack Stable multi-provider tool schema generation

The Anthropic caching stack: prompt-cache-key for debugging cache stability, prompt-cache-warmer for pre-warming after deploys, cachebench for measuring actual hit rates.


What's Next

Drift detector: CacheDriftDetector that logs the cache key on each request and emits a warning when consecutive requests have different keys. A configurable threshold (e.g., warn if the key changes more than once per 100 requests) would catch drift in production without alerting on every request.

Provider abstraction: similar hash computation for OpenAI (which has its own caching for system prefix) and Google (Gemini context caching). The canonical key format can be extended to cover multiple providers.

Visualization tool: a CLI command that takes two JSONL log files and shows which requests had matching cache keys and which did not. Useful for post-mortem analysis of a cache miss wave after a deploy.


Built as part of the agent-stack family: composable Python primitives for production LLM agents.

Top comments (0)