Gabriel Anhaia

Anthropic Prompt Caching Saves 90% — Here's the One Caveat Nobody Mentions


A team I talked to last month flipped on Anthropic's prompt caching for their RAG endpoint. Sixty thousand tokens of system prompt, tool definitions, and pinned reference docs in front of every user message. The dashboards promised a flat 90% off the input bill. Synthetic load tests confirmed it. Production day one came back at a 1% discount, with only the timestamp differing between staging and prod.

Their system prompt opened with f"Today is {datetime.now()}. ...". Not a date: a microsecond-resolution timestamp, different on every request. Zero cache hits, all day. Anthropic's caching keys on the exact prefix you sent. A single byte of drift throws the whole hash away.

That is the caveat. The 90% number is real (Anthropic charges cache reads at 0.1× the base input price). The number on your invoice depends entirely on whether your prefix is byte-identical from one request to the next.

How the cache actually keys

Anthropic's docs are blunt about this. From the prompt-caching page:

Cache hits require 100% identical prompt segments, including all text and images up to and including the block marked with cache control.

You mark a block with cache_control. The server hashes everything from the start of the request up to and including that block. That hash is the cache key. The prefix is built in this fixed order: tools, then system, then messages. Change any byte of the tool definitions, you invalidate tools + system + messages. Change the system prompt, you invalidate system + messages. User-message edits invalidate from that message onward.

The lookback window is about twenty blocks: the server checks for a cache hit at up to twenty block boundaries before each breakpoint. If a conversation grows more than twenty blocks past your last cache write, the next request will not find the prior entry and you pay full price plus the write premium.

You can have up to four breakpoints per request. The point of multiple breakpoints is not "cache more." It is to cache sections that change at different frequencies. Tool definitions almost never change, system prompt rarely changes, the static document context maybe rotates daily, the conversation prefix changes every turn. One breakpoint per layer that changes at its own clock.
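
As a sketch of that layering (block shapes follow Anthropic's Messages API; the tool, doc placeholder, and prompt text here are illustrative), a request with a breakpoint per layer looks like this:

# Hypothetical four-layer request. Each cache_control marks the end of a
# layer that changes on its own clock; everything above a breakpoint is
# hashed into that breakpoint's cache key.
request = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "tools": [
        {
            "name": "search_kb",
            "description": "Lookup knowledge-base sections.",
            "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
            "cache_control": {"type": "ephemeral"},  # breakpoint 1: almost never changes
        },
    ],
    "system": [
        {
            "type": "text",
            "text": "You are a helpdesk assistant.",
            "cache_control": {"type": "ephemeral"},  # breakpoint 2: rarely changes
        },
        {
            "type": "text",
            "text": "<today's pinned reference docs>",
            "cache_control": {"type": "ephemeral"},  # breakpoint 3: rotates daily
        },
    ],
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "<conversation so far / latest question>",
                    "cache_control": {"type": "ephemeral"},  # breakpoint 4: moves every turn
                },
            ],
        },
    ],
}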

The three mistakes that kill the discount

Every time someone reports "caching is not working" the cause is one of these three.

A dynamic value in the system prompt. The classic is f"Today is {today}". If the variable comes from datetime.now().date(), every request after midnight UTC writes a fresh entry, which is survivable. If it comes from datetime.now(), you have a microsecond-resolution timestamp baked into the hash and the cache hit rate is functionally zero. Same trap with request IDs, user IDs, session IDs, and the "personalisation strings" someone added to the system prompt for A/B reasons.
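
A minimal before/after, assuming the model genuinely needs the date (names hypothetical):

from datetime import datetime, timezone

# BAD: the timestamp varies per request, so the hashed prefix varies per
# request and every call is a cache write at 1.25x, never a read.
system_bad = f"Today is {datetime.now(timezone.utc)}. You are a helpdesk assistant."

# GOOD: the cached system prompt is a byte-identical constant; the date
# rides in the user message, after the breakpoint, outside the prefix hash.
SYSTEM_STATIC = "You are a helpdesk assistant."
user_message = f"(Current date: {datetime.now(timezone.utc).date()})\n\nQuestion: ..."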

Non-deterministic tool definitions. If you build the tools list from a Python dict in a Python version older than 3.7, or from a set in any version, the order shuffles between runs. The hash sees a different prefix, you write a new cache entry, you pay 1.25× input tokens for the privilege, and the next process with a different ordering does the same. Always serialise tools in a fixed order and lock that order in a unit test.
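
One way to lock the order down (a sketch; the tool list and test are illustrative):

import hashlib
import json

RAW_TOOLS = [
    {
        "name": "search_kb",
        "description": "Lookup knowledge-base sections.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
]

def tools_fingerprint(raw_tools):
    # Sort the list by name and the keys inside each dict: the JSON bytes
    # are now identical across processes and deploys.
    canonical = sorted(raw_tools, key=lambda t: t["name"])
    return hashlib.sha256(json.dumps(canonical, sort_keys=True).encode()).hexdigest()

def test_fingerprint_survives_dict_reordering():
    # Rebuild each tool dict with reversed key insertion order; the
    # fingerprint must not move, or neither will your cache hit rate.
    reordered = [{k: t[k] for k in reversed(list(t))} for t in RAW_TOOLS]
    assert tools_fingerprint(reordered) == tools_fingerprint(RAW_TOOLS)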

A "small" string injected before the cache breakpoint. Someone adds f"User tier: {tier}" to the system block "for context." Two tier values in the population means two cache entries instead of one. Five tier values means five. The fix is moving the per-request data after the breakpoint, into the user message or a trailing system block: static first, variable after.

The pattern under all three: anything that varies by request, by process, or by clock has to live after the last breakpoint. Put it anywhere in or above the cached prefix and the discount is already gone. The hash is cumulative.

The pricing math nobody puts on a napkin

Four numbers from the official pricing page:

  • Base input tokens: 1.0× the model's input rate.
  • Cache read: 0.1× the input rate.
  • 5-minute cache write: 1.25× the input rate.
  • 1-hour cache write: 2.0× the input rate.

So a cache write costs 25% more than sending the same tokens uncached, and a cache hit costs 90% less. The write premium (0.25×) is smaller than the saving on a single hit (0.9×), so one read per write inside the 5-minute TTL already puts you ahead; at the 1-hour TTL the 1.0× premium needs two hits. If your traffic shape sends one request and then nothing for ten minutes, caching loses money. If it sends ten requests inside a minute against the same prefix, caching is the best money you spend that day.

The 1-hour TTL is the lever for batch jobs and bursty patterns. You pay double the base rate to write, but the entry survives an hour. That is the right setting for a context that gets hit a dozen times across a fifteen-minute conversation, then nothing, then another flurry forty minutes later.
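
The napkin, as code (multipliers from the table above; assumes all n requests land inside one TTL window, so there is a single write):

def input_cost(n_requests, write_multiplier=1.25):
    # Total input-token cost for n requests sharing one cached prefix,
    # as a multiple of the uncached single-request cost.
    if n_requests == 0:
        return 0.0
    return write_multiplier + (n_requests - 1) * 0.1  # one write, n-1 reads

for n in (1, 2, 3, 10):
    print(n, n * 1.0, input_cost(n, 1.25), input_cost(n, 2.0))

# n  uncached  5-min TTL  1-hour TTL
# 1     1.00       1.25       2.00   <- one-shot traffic loses either way
# 2     2.00       1.35       2.10   <- 5-minute TTL wins from the 2nd request
# 3     3.00       1.45       2.20   <- 1-hour TTL wins from the 3rd
# 10   10.00       2.15       2.90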

Reading the usage block

The response carries three token counters in usage. Spelled exactly:

  • cache_creation_input_tokens: tokens you wrote to the cache on this request. Billed at 1.25× or 2.0×.
  • cache_read_input_tokens: tokens served from cache. Billed at 0.1×.
  • input_tokens: non-cached, non-cache-write input tokens, typically the user message and anything after the last breakpoint. Billed at 1.0×.

Total input tokens for the request equals the sum of all three. If cache_read_input_tokens is zero on a request you expected to hit, the cache missed and you are paying full price. If cache_creation_input_tokens is non-zero on every request, your prefix is changing every request. When both are non-zero on the same request, you cached part and missed part. That usually means one breakpoint is stable and a later one is not.

Log all three. Aggregate hit rate as a single ratio:

hit_rate = cache_read / (cache_read + creation + input)

If that ratio is under 0.5 across a normal hour of traffic, your cache is broken even if the SDK call returned 200.
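
A sketch of that aggregation over logged usage records (the field names match the usage block above; last_hour stands in for however you collect them):

def cache_hit_rate(usages):
    # Cached tokens as a share of all input tokens across many requests.
    read = sum(u["cache_read_input_tokens"] for u in usages)
    written = sum(u["cache_creation_input_tokens"] for u in usages)
    fresh = sum(u["input_tokens"] for u in usages)
    total = read + written + fresh
    return read / total if total else 0.0

if cache_hit_rate(last_hour) < 0.5:
    alert("prompt cache is broken")  # hypothetical alerting hook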

A snippet that caches correctly

Static system prompt. Static tool definitions. Per-request data lives in the user message, after the breakpoint. Date and request ID are logged for tracing, never embedded in the cached prefix.

from anthropic import Anthropic
import logging

client = Anthropic()
log = logging.getLogger("rag")

SYSTEM_PROMPT = (
    "You are a helpdesk assistant for the SaaS product. "
    "Answer questions strictly from the provided context. "
    "If the context does not contain the answer, say so. "
    "Cite section IDs in square brackets, e.g. [S-12]."
)

TOOLS = sorted(
    [
        {
            "name": "search_kb",
            "description": "Lookup knowledge-base sections.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                },
                "required": ["query"],
            },
        },
    ],
    key=lambda t: t["name"],
)

def ask(question: str, kb_excerpt: str, request_id: str):
    log.info("rag.start", extra={"request_id": request_id})
    msg = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            },
        ],
        tools=TOOLS,
        messages=[
            {
                "role": "user",
                "content": (
                    f"Context:\n{kb_excerpt}\n\n"
                    f"Question: {question}"
                ),
            },
        ],
    )
    u = msg.usage
    log.info(
        "rag.usage",
        extra={
            "request_id": request_id,
            "cache_read": u.cache_read_input_tokens,
            "cache_write": u.cache_creation_input_tokens,
            "fresh_input": u.input_tokens,
            "output": u.output_tokens,
        },
    )
    return msg.content[0].text

The breakpoint sits on the last block of the static system prompt. The tool list is sorted by name on every call so the serialisation is stable. The per-request data (kb_excerpt, question, request_id) sits downstream of the breakpoint, so it cannot poison the prefix hash. A second breakpoint on the kb_excerpt block would pay off if the same context is reused for a second question in the same five-minute window. Skip it if it is not.
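
That variant splits the user content into blocks so the excerpt gets its own breakpoint (a sketch reusing the names above; only worth it when the same kb_excerpt recurs inside the TTL):

messages=[
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": f"Context:\n{kb_excerpt}",
                "cache_control": {"type": "ephemeral"},  # breakpoint 2: reused context
            },
            {
                "type": "text",
                "text": f"Question: {question}",  # fresh every request, after the hash
            },
        ],
    },
],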

The first call to this function will report cache_creation_input_tokens equal to the prompt size and cache_read_input_tokens of zero. Every call within the next five minutes against the same prefix will flip those numbers. That is the entire game.

Two checks before you ship

Before you turn caching on in production, run these two checks against staging traffic.

Check 1: hit rate. Replay an hour of real traffic and divide the summed cache_read_input_tokens by the summed total input tokens across all calls. If you are under 0.7, something in your prefix is varying. Find it before it costs you a month of "discount."

Check 2: prefix diff. Log the SHA-256 of system + tools + user[0:N] for each request and group by hash. If you see more than a handful of distinct hashes per minute, your prefix is not stable, and the fix is upstream of the API call: somewhere in your prompt-construction code, a per-request value is leaking into what should be a request-stable prefix.
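
A sketch of that check (hash exactly what you send above the last breakpoint; TOOLS, SYSTEM_PROMPT, log, and request_id are the names from the snippet above):

import hashlib
import json

def prefix_hash(tools, system, prefix_messages):
    # Serialise the would-be cached prefix deterministically and digest it.
    # Two requests that should share a cache entry must agree here.
    blob = json.dumps(
        {"tools": tools, "system": system, "messages": prefix_messages},
        sort_keys=True,
    )
    return hashlib.sha256(blob.encode()).hexdigest()[:12]

# inside ask(), alongside the usage log (no messages precede the breakpoint
# in that snippet, so the prefix message list is empty):
log.info(
    "rag.prefix",
    extra={"request_id": request_id, "prefix": prefix_hash(TOOLS, SYSTEM_PROMPT, [])},
)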

Hash your prefix, find the line that drifts, and delete it. That is what stands between your invoice and the 90% discount.


If this was useful

This is one of the patterns from my Prompt Engineering Pocket Guide. It is a small book about getting more out of LLMs without overpaying for the privilege. Chapters on prefix design, cache breakpoint placement, multi-tenant prompt layouts, and the operational habits that turn a paper discount into a real one. If you ship anything that calls Claude in a hot loop, this book is the long version of the math above.

Prompt Engineering Pocket Guide
