Gabriel Anhaia

Posted on May 23

Prompt Caching: What Belongs in the Cacheable Prefix, What Kills Hit Rate

#llm #prompt #ai #performance

You enabled prompt caching. You saw cache_read_input_tokens show up in the response. You declared victory and moved on.

Then the bill came in. Your cache hit rate is 38%. You assumed it was 90%. The model is fine. The SDK is fine. Five fields in your system prompt are quietly evicting the cache on every request.

Move them, and the hit rate jumps past 90% without touching the model, the SDK, or the prompt content. The trick is knowing which bytes belong before the breakpoint and which belong after.

How prompt caching actually works (in one paragraph)

The vendor hashes a prefix of your prompt. On the next request, if the bytes up to the cache breakpoint match exactly (byte for byte, including whitespace and key ordering), you pay roughly 10% of the input token cost for that prefix. Miss by one byte and you pay full freight, plus a write premium on Anthropic, because the changed prefix gets re-cached. Anthropic's TTL is 5 minutes from the last hit, refreshed each time it lands.

The implication people skip past: the prefix is shared across requests. Anything you put before the breakpoint had better be identical across requests, or the hash diverges and the cache dies. "Identical" is stricter than you think.
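To see how strict "identical" is, you can fingerprint a prefix locally. This is a minimal sketch: the vendor's real hashing scheme isn't public, so SHA-256 is only a stand-in, and prefix_fingerprint is a hypothetical helper, not part of any SDK.

```python
import hashlib
import json

def prefix_fingerprint(system: str, tools: list) -> str:
    """Hash the exact bytes of a prefix (local stand-in, not the vendor's scheme)."""
    payload = system + "\n" + json.dumps(tools)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

tools_a = [{"name": "get_order", "description": "Look up an order"}]
# Same data, different key order: json.dumps keeps insertion order.
tools_b = [{"description": "Look up an order", "name": "get_order"}]

baseline = prefix_fingerprint("You are a support assistant.", tools_a)
reordered = prefix_fingerprint("You are a support assistant.", tools_b)
trailing_space = prefix_fingerprint("You are a support assistant. ", tools_a)

print(baseline == reordered)       # False: key order changed the bytes
print(baseline == trailing_space)  # False: one extra space changed the bytes
```

Logging a fingerprint like this next to each request is a cheap way to count how many distinct prefixes you are really sending.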

The cacheable-prefix budget

A production prompt has roughly three layers:

1. System prompt: your role definition, format contract, refusal policy, examples.
2. Tool / function definitions: JSON Schema for every tool the model can call.
3. The first user turn (sometimes): RAG context, long documents.

Layers 1 and 2 are stable across requests for a given feature. They're the cacheable prefix. Layer 3 is per-request. The breakpoint sits between them.

Five categories of content sneak into layers 1 and 2 that shouldn't be there.

Five fields that quietly evict your cache

1. Timestamps

You added Current time: 2026-05-23T14:32:11Z to the system prompt because the model needs to reason about dates. Every request now has a unique prefix. Cache hit rate: zero.

The fix is mechanical. Move the timestamp to the user turn. The system prompt can say "The current UTC time will be provided in the user message under <now> tags" and you stop burning the prefix.

If the time only needs minute-level resolution and you genuinely want some caching, you can also round to the nearest 5 minutes. That's a tax on correctness for a benefit you'd get cleaner by moving the field.
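If you do go the rounding route, here is a rough sketch of what it could look like. floor_to_bucket is a hypothetical helper, it floors rather than rounds to nearest so the value never points into the future, and it assumes the bucket size divides 60 evenly.

```python
from datetime import datetime, timezone

def floor_to_bucket(ts: datetime, minutes: int = 5) -> str:
    """Floor a timestamp to the start of its N-minute bucket, as a UTC ISO string."""
    ts = ts.astimezone(timezone.utc)
    floored = ts.replace(
        minute=ts.minute - ts.minute % minutes, second=0, microsecond=0
    )
    return floored.strftime("%Y-%m-%dT%H:%M:%SZ")

stamp = datetime(2026, 5, 23, 14, 32, 11, tzinfo=timezone.utc)
print(floor_to_bucket(stamp))  # 2026-05-23T14:30:00Z
```

Every request inside the same 5-minute window now shares a prefix, at the price of a clock that can be almost 5 minutes stale.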

2. Unordered tool / function lists

This one is brutal because the code looks correct.

# this is wrong even though it runs every time
tools = [
    tool_def(name)
    for name in available_tools  # set, ordering not stable
]

available_tools is a set. Python sets have no guaranteed iteration order, and because string hashing is randomized per process, the order can change on every restart. A container restart on Kubernetes iterates in a different order, the JSON serialization of the tool list flips, and the cache hash flips with it. You're left wondering why your hit rate is 40% on Mondays and 70% on Fridays.

Sort. Always sort.

tools = [
    tool_def(name)
    for name in sorted(available_tools)
]

The same applies to dicts inside tool schemas: use json.dumps(schema, sort_keys=True) if you're inlining schema JSON. The cost of sort_keys=True is negligible. The cost of not having it is half your token spend.
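Both fixes fit in one small helper. This is a sketch under assumptions: canonical_tools is a hypothetical name, and it assumes every tool dict has a unique "name" key.

```python
import json

def canonical_tools(tools: list) -> str:
    """Serialize tool definitions with a stable list order and stable key order."""
    ordered = sorted(tools, key=lambda t: t["name"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))

first_boot = [
    {"name": "search_products", "description": "Search the catalog"},
    {"name": "get_order", "description": "Look up an order"},
]
# Same tools after a restart: list order and key order both differ.
second_boot = [
    {"description": "Look up an order", "name": "get_order"},
    {"description": "Search the catalog", "name": "search_products"},
]

print(canonical_tools(first_boot) == canonical_tools(second_boot))  # True
```

If you pass tool dicts straight to an SDK instead of inlining JSON, the list sort still applies; whether key order survives serialization depends on the client, so it is worth checking the bytes that actually go out.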

3. Locale and user-language headers

You read the user's Accept-Language and dropped it into the system prompt: "Reply in fr-FR formatting conventions." The same user later opens the app from a hotel Wi-Fi that fingerprints as en-US. Different prefix. Cache miss.

Locale is per-request data. It belongs in the user turn, or as a tool argument, or as a structured tag the model reads from the user message:

<user_context>
  <locale>fr-FR</locale>
  <timezone>Europe/Paris</timezone>
</user_context>

Cacheable prefix says "the user's locale and timezone will be in <user_context> tags". The actual values land after the breakpoint.

4. A/B variant IDs

The product team wants to test two prompt variants. Someone wrote this:

system = base_prompt + f"\n\nExperiment variant: {variant_id}"

Every variant ID is a different prefix. Two variants means two caches. Three variants and a shadow test means four caches. The hit rate per cache is fine. Aggregate hit rate is a fraction of what it should be because each cache only sees a slice of traffic.

If the variant string genuinely changes the prompt, accept that you have two caches and budget for the warmup. If it's metadata for logging that doesn't change behavior, log it in your application code and keep it out of the prompt entirely. Telemetry doesn't belong in the cache key.
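For the logging-only case, a minimal sketch of keeping the variant out of the prompt might look like this. The logger name, BASE_PROMPT, and build_system are illustrative assumptions, not taken from any real codebase.

```python
import json
import logging

logger = logging.getLogger("prompt_experiments")

BASE_PROMPT = "You are a support assistant for an e-commerce platform."

def build_system(variant_id: str) -> str:
    """Record the variant in application logs; return the same prompt for every variant."""
    logger.info(json.dumps({"event": "prompt_built", "variant": variant_id}))
    return BASE_PROMPT

print(build_system("control") == build_system("treatment_b"))  # True
```

The experiment ID still reaches your analytics, and every variant shares one warm cache.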

5. Model-name interpolation during shadow tests

The subtle one. Your wrapper builds prompts like this:

system = f"You are {model_name}, an AI assistant by Anthropic..."

It worked when you only called claude-sonnet-4-7. Then you started shadow-testing claude-opus-4-7 for 10% of traffic. The prompts diverge by exactly one substring. Two caches, neither hot.

Drop the model name from the prompt. The model doesn't need to introduce itself to itself.

The structure that holds 90%+

Here's what a clean Anthropic call looks like once the five fields are out of the prefix.

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a support assistant for an
e-commerce platform. You help users track orders, process
returns, and answer product questions.

The user's locale, timezone, and current UTC time will be
provided in the first user message inside <context> tags.

When you need to perform an action, call the appropriate
tool. Always confirm destructive actions before calling
the tool a second time."""

TOOLS = sorted(
    [
        {"name": "get_order", "input_schema": {...}},
        {"name": "process_return", "input_schema": {...}},
        {"name": "search_products", "input_schema": {...}},
    ],
    key=lambda t: t["name"],  # stable order, every request
)

def reply(user_message: str, ctx: dict) -> str:
    user_turn = (
        f"<context>"
        f"<locale>{ctx['locale']}</locale>"
        f"<timezone>{ctx['tz']}</timezone>"
        f"<now>{ctx['now_iso']}</now>"
        f"</context>\n\n"
        f"{user_message}"
    )

    response = client.messages.create(
        model="claude-sonnet-4-7",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # cache breakpoint: the API orders the prefix as
                # tools, then system, so this one marker caches
                # the tool definitions and the system prompt
                "cache_control": {"type": "ephemeral"},
            }
        ],
        tools=TOOLS,
        messages=[{"role": "user", "content": user_turn}],
    )
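    # content can also hold tool_use blocks, which have no
    # .text attribute, so collect only the text blocks
    return "".join(
        block.text for block in response.content if block.type == "text"
    )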

A few things to notice. The cache_control marker sits on the system block, and Anthropic's API treats that as the breakpoint. The API assembles the prefix in the order tools, system, messages, so the tool definitions come before the system block and land inside the cached region without a marker of their own. The volatile data (locale, timezone, time) is in the user turn, after the breakpoint, where it costs nothing extra to vary.

On a real call, the response object includes usage.cache_creation_input_tokens and usage.cache_read_input_tokens. The ratio of cache_read to cache_read + cache_creation + input_tokens is your hit rate. Log it. Track it. Alert on regressions.
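That ratio is a few lines of code. The sketch below takes a plain dict with the three usage fields named above; cache_hit_rate is a hypothetical helper, and converting the SDK's usage object into a dict is left to the caller.

```python
def cache_hit_rate(usage: dict) -> float:
    """cache_read / (cache_read + cache_creation + input_tokens) for one response."""
    read = usage.get("cache_read_input_tokens") or 0
    created = usage.get("cache_creation_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    total = read + created + uncached
    return read / total if total else 0.0

warm = {
    "cache_read_input_tokens": 1800,
    "cache_creation_input_tokens": 0,
    "input_tokens": 200,
}
cold = {
    "cache_read_input_tokens": 0,
    "cache_creation_input_tokens": 1800,
    "input_tokens": 200,
}

print(cache_hit_rate(warm))  # 0.9
print(cache_hit_rate(cold))  # 0.0
```

The `or 0` guards against fields that come back as None when no caching happened on a call.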

The audit script: diff the last 100 prompts

You can't fix what you don't measure. This short script reads your last 100 logged prompts and flags every line that changed between consecutive requests. The lines that churn are the ones breaking your cache.

import difflib
import json
import sys
from collections import Counter
from pathlib import Path

def load_recent(log_path: str, n: int = 100) -> list[str]:
    lines = Path(log_path).read_text().splitlines()
    rows = [json.loads(line) for line in lines[-n:]]
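    # we cache the system + tools prefix, so audit that.
    # No sort_keys here: re-sorting would hide the key-order
    # churn we're hunting. indent=2 puts each field on its own
    # line so the diff can point at the one that moved.
    return [
        r["system"] + "\n" + json.dumps(r["tools"], indent=2)
        for r in rows
    ]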

def churn_report(prompts: list[str]) -> Counter:
    churn = Counter()
    for prev, curr in zip(prompts, prompts[1:]):
        diff = difflib.unified_diff(
            prev.splitlines(), curr.splitlines(), n=0
        )
        for line in diff:
            if line.startswith(("+", "-")) and not line.startswith(("+++", "---")):
                # strip the +/- marker and bucket the line
                churn[line[1:].strip()[:80]] += 1
    return churn

if __name__ == "__main__":
    prompts = load_recent(sys.argv[1])
    for line, count in churn_report(prompts).most_common(20):
        print(f"{count:4d}  {line}")

Run it against your last 100 production prompts. The top of the output tells you exactly which lines are mutating between requests. Timestamps will be there. Locale strings will be there. Unsorted tool names will be there. Each one is a cache miss you're paying for, request after request.

A team I talked to ran this and discovered a request_id: <uuid> field someone added to the system prompt six months ago "for debugging". Every single request had a different prefix. Their hit rate was structurally zero.

TTL gotchas

Anthropic's prompt cache has a 5-minute TTL, refreshed on every hit. This sounds generous until you do the math on low-volume features.

If you get 1 request per minute, the cache stays warm. If you get 1 request per 6 minutes, the cache expires before the next hit lands. You pay full freight every single time even though your prefix never changed.
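The arithmetic is easy to simulate. This is a simplified model, assuming the TTL restarts on every write and every hit and ignoring request latency; simulate_hits is a hypothetical helper.

```python
def simulate_hits(arrivals: list, ttl: float = 300.0) -> list:
    """For request times in seconds, return True wherever the cache was still warm."""
    hits = []
    expires_at = None
    for t in arrivals:
        hits.append(expires_at is not None and t < expires_at)
        # A miss writes the cache and a hit refreshes it: both restart the TTL.
        expires_at = t + ttl
    return hits

every_minute = [i * 60 for i in range(10)]
every_six_minutes = [i * 360 for i in range(10)]

print(sum(simulate_hits(every_minute)))       # 9
print(sum(simulate_hits(every_six_minutes)))  # 0
```

Feeding it real request timestamps from your logs shows how many misses are TTL expiries rather than prefix churn.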

Three things follow:

For low-volume features, add a synthetic ping every 4 minutes that hits the same prefix. It's cheap and keeps the cache warm.
For batch workloads, send the warm-up request first, then fan out. Don't fire 100 parallel requests cold. The first one writes the cache, the other 99 race it and may miss.
Anthropic also offers a 1-hour TTL ("type": "ephemeral", "ttl": "1h"), billed at a higher cache-write price than the 5-minute default. For features where 5 minutes is too tight, the longer TTL can pay for itself fast.
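The batch advice can be sketched as a warm-up call followed by a thread pool. Here send is a hypothetical callable that performs one API request and returns only when the response is complete; nothing below is specific to any SDK.

```python
from concurrent.futures import ThreadPoolExecutor

def warm_then_fan_out(send, payloads: list, max_workers: int = 8) -> list:
    """Send the first payload alone so the cache is written, then send the rest in parallel."""
    if not payloads:
        return []
    first = send(payloads[0])  # blocks until done, so the prefix is cached
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        rest = list(pool.map(send, payloads[1:]))  # results keep input order
    return [first] + rest
```

The warm-up adds one request of latency to the batch, and in exchange the remaining requests have a cache entry to read instead of racing to write one.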

Cache breakpoint placement

You get up to four cache breakpoints per request on the Anthropic API. Most production features need exactly one, placed on the last block of the stable prefix. Anthropic assembles the prefix in the order tools, system, messages, so a marker on the final system block covers the tool definitions and the system prompt, and the per-request user turn stays outside it.

Two breakpoints make sense when you have a large, semi-stable RAG context that changes slowly (a tenant's knowledge base, say). One breakpoint on the system block, a second at the end of the RAG context, and you cache both layers independently.

More than two breakpoints is almost always wrong. The complexity of reasoning about which layer cached and which didn't outweighs the savings.
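For the two-breakpoint case, the request shape could look roughly like this. It is a sketch that only builds the keyword arguments: build_request is a hypothetical helper, and you would still pass model and max_tokens when calling client.messages.create.

```python
def build_request(system_prompt: str, tools: list, rag_context: str, question: str) -> dict:
    """Build request kwargs with one marker on the system block and one on the RAG block."""
    return {
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # breakpoint 1: tools + system prompt
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "tools": tools,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": rag_context,
                        # breakpoint 2: the slow-changing knowledge base
                        "cache_control": {"type": "ephemeral"},
                    },
                    # per-request text stays after both breakpoints
                    {"type": "text", "text": question},
                ],
            }
        ],
    }
```

When the knowledge base changes, only the second layer is rewritten; the system and tools layer keeps hitting.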

What to track

Three numbers as first-class metrics:

Cache hit rate: cache_read_input_tokens / (cache_read + cache_creation + input). Aim for ≥85% on stable production prompts.
Cache write rate: high write, low read means your prefix is mutating. Run the audit script.
Prefix length: token count of the cached block. If it's not stable across requests, something's leaking into the prefix.
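A rough sketch of computing all three over a window of logged usage records follows. window_metrics is a hypothetical helper, and it assumes each request's cached prefix length can be approximated as cache_read + cache_creation for that request.

```python
def window_metrics(usages: list) -> dict:
    """Aggregate hit rate, write rate, and prefix-length stability over many requests."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    written = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    uncached = sum(u.get("input_tokens", 0) for u in usages)
    total = read + written + uncached
    # Assumption: read + written on one request approximates its cached prefix size.
    prefix_lengths = {
        u.get("cache_read_input_tokens", 0) + u.get("cache_creation_input_tokens", 0)
        for u in usages
    }
    return {
        "hit_rate": read / total if total else 0.0,
        "write_rate": written / total if total else 0.0,
        "distinct_prefix_lengths": len(prefix_lengths),
    }

window = [
    {"cache_creation_input_tokens": 1800, "input_tokens": 200},  # cold write
    {"cache_read_input_tokens": 1800, "input_tokens": 200},      # warm read
]
print(window_metrics(window)["distinct_prefix_lengths"])  # 1
```

More than one distinct prefix length on a feature with a fixed prompt is a hint that something per-request has leaked in.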

Set the audit script to run weekly against logged prompts. Cache discipline rots: someone will eventually add a debug field to the system prompt, and you want to catch it the day it ships, not on the next invoice.

What's the weirdest thing you've found leaking into your cacheable prefix? Drop it in the comments. The uglier the better.

If this was useful

Caching is one of those things where the docs cover the mechanism and skip the failure modes. The Prompt Engineering Pocket Guide has a chapter on prompt anatomy that covers cacheable-prefix structure alongside the rest of the production-prompt patterns: refusal policies, format contracts, example placement. If this post landed, that's the same territory mapped out end to end.