Empiric Infotech LLP

Posted on May 27

Prompt Caching with Claude: 6 Patterns We Use in Production (and the math behind them)

#architecture #claude #llm #performance

When we first turned on prompt caching for a client's support-agent backend, the monthly Anthropic bill dropped from around $4,800 to $1,310 in the next billing cycle. Same traffic, same model (Claude Sonnet 4.6), no quality regression. The only change was how we structured the request.

That gap, roughly 73%, is not unusual. Most teams leave it on the table because they treat caching as a checkbox instead of a design constraint. This post walks through the six patterns we now use across client projects, with code, real numbers, and the failure modes we hit before getting to them.

What prompt caching actually does

A quick refresher so the patterns make sense:

You mark content blocks (system text, tool definitions, message content) with cache_control: { type: "ephemeral" }.
The cached prefix lives in Anthropic's infra for ~5 minutes (default) or up to 1 hour with the longer TTL, and each cache hit refreshes the timer.
Cache writes cost 1.25x (5-min) or 2x (1-hour) the input token rate.
Cache reads cost 0.1x the input token rate.

The break-even is fast. A cached prefix costs 1.25x on the first request and 0.1x on each hit, so a 10,000-token system prompt that is reused just once within 5 minutes already costs 1.35x across the two requests instead of 2x. Every further hit adds only 0.1x.

That economic shape, expensive write then near-free reads, drives every pattern below.
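The break-even arithmetic can be sketched in a few lines. This is a minimal illustration, not billing code: relative_cost is a hypothetical helper, it only models the cached prefix, and it assumes the cache never expires between requests.

```python
def relative_cost(requests: int, write_multiplier: float = 1.25,
                  read_multiplier: float = 0.1) -> float:
    """Cost of sending the same prefix `requests` times with caching,
    as a fraction of the uncached cost.

    Assumption: the cache stays warm, so only the first request pays
    the write multiplier and every later request pays the read multiplier.
    """
    if requests < 1:
        raise ValueError("requests must be >= 1")
    cached = write_multiplier + (requests - 1) * read_multiplier
    uncached = float(requests)  # each uncached request costs 1.0x
    return cached / uncached


if __name__ == "__main__":
    for n in (1, 2, 3, 10):
        print(n, round(relative_cost(n), 3))
```

A result above 1.0 means caching cost more than sending the prefix uncached, which only happens here when the prefix is never reused.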

Pattern 1: Cache the boring, leave the fresh

The single biggest win is cutting your request into two halves: the stable half (system prompt, tool definitions, documentation, few-shot examples) and the volatile half (the user's actual message).

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # ~8k tokens of policy, tone, examples
            "cache_control": {"type": "ephemeral"},
        }
    ],
    tools=TOOL_DEFINITIONS_WITH_CACHE_CONTROL,
    messages=[
        {"role": "user", "content": user_query}
    ],
)

What we got wrong the first time: we put the user's message inside the cached block by accident, because we were templating the whole prompt as one string. The cache never hit. Treat the user input as a strict boundary. Nothing volatile crosses it.
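One way to enforce that boundary is to build the request in a single function that never interpolates user input into the stable text. The sketch below is an assumption about how you might structure it; build_request is a hypothetical helper, and it returns only the system and messages arguments so you can merge in model, max_tokens, and tools yourself.

```python
from typing import Any


def build_request(stable_system_prompt: str, user_query: str) -> dict[str, Any]:
    """Return the `system` and `messages` keyword arguments for a request.

    The stable text goes into one cached system block. The user query is
    only ever placed in `messages`, after the cache breakpoint.
    """
    return {
        "system": [
            {
                "type": "text",
                "text": stable_system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_query}],
    }


if __name__ == "__main__":
    first = build_request("You are a support agent.", "Where is my order?")
    second = build_request("You are a support agent.", "Cancel my plan.")
    # The cached part is identical across different user queries.
    print(first["system"] == second["system"])
```

Comparing the system value across two different queries, as the last lines do, is a cheap unit test that would have caught our templating mistake.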

Pattern 2: Order matters more than you think

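Cache hits are prefix-based, and the prefix is assembled in a fixed order: tools, then system, then messages. A block is read from cache only if everything before it plus its own content matches a prior request exactly. A change to the tool list therefore also invalidates the system and message caches behind it. So: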

Put the largest stable content first.
Put smaller stable content next.
Put volatile content last.

We had a project where the team kept rearranging tool definitions alphabetically when adding new tools. Every such deploy invalidated the cache, so the first requests afterwards paid the 1.25x write cost for the whole prefix again. Pin your tool order. Treat it like a database schema.
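A fingerprint of the stable prefix makes accidental reordering visible in CI or at deploy time. This is a sketch under assumptions: prefix_fingerprint is a hypothetical helper, and the hash is only a local proxy for what the API sees, not the cache key Anthropic computes.

```python
import hashlib
import json


def prefix_fingerprint(tools: list[dict], system_blocks: list[dict]) -> str:
    """Hash the stable prefix in the order it is sent: tools, then system.

    Dictionary key order and list order are both preserved, so the check
    is deliberately strict: any reordering changes the fingerprint.
    """
    payload = json.dumps(
        {"tools": tools, "system": system_blocks},
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    tools = [{"name": "search_orders"}, {"name": "refund_order"}]
    system = [{"type": "text", "text": "You are a support agent."}]
    before = prefix_fingerprint(tools, system)
    after = prefix_fingerprint(list(reversed(tools)), system)
    print(before == after)  # reordered tools give a different fingerprint
```

Storing the fingerprint from the last release and failing the build when it changes unexpectedly turns a silent cost regression into a reviewable diff.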

Pattern 3: Multiple cache breakpoints for layered staleness

Anthropic lets you set up to four cache breakpoints. Use them when different parts of your prompt invalidate at different rates.

A real example from a knowledge-base agent we shipped:

system=[
    {"type": "text", "text": COMPANY_POLICIES, "cache_control": {"type": "ephemeral"}},
    # Refreshed daily
    {"type": "text", "text": daily_kb_snapshot, "cache_control": {"type": "ephemeral"}},
    # Refreshed per user
    {"type": "text", "text": user_profile_context, "cache_control": {"type": "ephemeral"}},
]

When the daily KB rebuilds, only the second and third blocks invalidate. The policies stay cached. Without breakpoints, the entire prefix would invalidate together.
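The invalidation rule is easy to reason about as "count the unchanged blocks from the front". The helper below is a hypothetical illustration of that rule on plain strings; it does not call the API.

```python
def surviving_layers(previous: list[str], current: list[str]) -> int:
    """Count the leading blocks that are identical in both requests.

    Matching is prefix-based, so the first changed block ends the run:
    it and everything after it has to be written to the cache again.
    """
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


if __name__ == "__main__":
    monday = ["POLICIES", "KB snapshot (Mon)", "profile: alice"]
    tuesday = ["POLICIES", "KB snapshot (Tue)", "profile: alice"]
    # Only the policies block survives the daily KB rebuild, even though
    # the profile block itself did not change.
    print(surviving_layers(monday, tuesday))
```

This is also why the blocks are ordered from slowest-changing to fastest-changing: a volatile block placed first would leave nothing behind it cacheable.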

Pattern 4: The 5-minute TTL is a product decision, not just an infra one

The default 5-minute TTL works if your traffic is bursty enough that the cache rarely cools. For low-traffic apps, every request pays the write cost.

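The 1-hour TTL doubles your write cost but holds the prefix for an hour. You request it with "ttl": "1h" inside cache_control; when it first shipped it also needed a beta header, so check the current docs for your SDK version. The math for the cached prefix:

5-min TTL, 1 request every 6 minutes -> the cache has always expired, so every request pays the 1.25x write. Net: more expensive than no caching.
1-hour TTL, 1 request every 6 minutes -> first request pays 2x, next nine pay 0.1x. Net: (2 + 0.9) / 10 = ~29% of uncached cost, and lower over time because each hit refreshes the TTL.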

We default to 5-minute for chat workloads and 1-hour for cron-like or analytics agents. Pick based on your inter-request gap, not on the bigger-number-better instinct.
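To pick a TTL from your own traffic shape, a small simulation is usually enough. The sketch below assumes evenly spaced requests, models only the cached prefix, and assumes every cache hit resets the expiry clock; simulate_relative_cost is a hypothetical helper.

```python
def simulate_relative_cost(gap_minutes: float, ttl_minutes: float,
                           write_multiplier: float, requests: int,
                           read_multiplier: float = 0.1) -> float:
    """Average prefix cost per request, relative to uncached (1.0).

    Assumptions: requests arrive every `gap_minutes`, and both cache
    writes and cache reads restart the TTL.
    """
    if requests < 1:
        raise ValueError("requests must be >= 1")
    total = 0.0
    expires_at = None  # minute at which the cached prefix expires
    for i in range(requests):
        now = i * gap_minutes
        if expires_at is not None and now < expires_at:
            total += read_multiplier   # cache hit
        else:
            total += write_multiplier  # cold or expired: pay the write
        expires_at = now + ttl_minutes
    return total / requests


if __name__ == "__main__":
    five_min = simulate_relative_cost(6, 5, 1.25, requests=10)
    one_hour = simulate_relative_cost(6, 60, 2.0, requests=10)
    print(round(five_min, 2), round(one_hour, 2))
```

Feeding in the real gaps from your request logs instead of a fixed gap_minutes is a natural next step if your traffic is uneven.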

Pattern 5: Cache tool definitions, not just system prompts

Tool definitions count toward the cached prefix and they are usually long. A schema with 20 tools and detailed descriptions can be 6,000+ tokens. Marking the last tool block with cache_control extends the cache to cover every tool above it.

tools=[
    {"name": "search_orders", "description": "...", "input_schema": {...}},
    {"name": "refund_order", "description": "...", "input_schema": {...}},
    # ... 18 more
    {
        "name": "escalate_to_human",
        "description": "...",
        "input_schema": {...},
        "cache_control": {"type": "ephemeral"},
    },
]

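Putting the breakpoint on the last tool caches the entire tool block. Appending a tool is still a prefix change: the breakpoint moves to the new last entry, and because tools sit at the front of the prefix, the system and message caches behind them are rewritten too. Budget for up to one full cache write per tool change and confirm it in the usage fields.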
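Rather than hand-editing the marker every time the tool list changes, you can apply it in code. This is a minimal sketch; with_tool_breakpoint is a hypothetical helper that works on plain dictionaries and leaves the caller's list untouched.

```python
import copy


def with_tool_breakpoint(tools: list[dict]) -> list[dict]:
    """Return a copy of `tools` with cache_control on the final entry only.

    Stray markers on earlier tools are removed so that only one of the
    four available breakpoints is spent on the tool list.
    """
    marked = copy.deepcopy(tools)
    for tool in marked:
        tool.pop("cache_control", None)
    if marked:
        marked[-1]["cache_control"] = {"type": "ephemeral"}
    return marked


if __name__ == "__main__":
    tools = [
        {"name": "search_orders", "description": "Look up orders."},
        {"name": "escalate_to_human", "description": "Hand off to an agent."},
    ]
    for tool in with_tool_breakpoint(tools):
        print(tool["name"], "cache_control" in tool)
```

Keeping the source list free of markers also means the tool order in your repository stays the single thing reviewers need to watch.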

Pattern 6: Conversation history caching for long chats

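For multi-turn conversations, set cache_control on the second-to-last message, which is the most recent assistant turn. The marker goes on a content block inside the message, not on the message object itself. Every turn extends the cached prefix by one exchange.

messages = [
    *prior_messages[:-2],
    {
        "role": "assistant",
        # cache_control goes on a content block (assumes string content here)
        "content": [{"type": "text", "text": prior_messages[-2]["content"],
                     "cache_control": {"type": "ephemeral"}}],
    },
    prior_messages[-1],  # last user message, uncached
]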

This is where caching pays for itself fastest. A 20-turn support chat that would otherwise pay full input price for 15,000+ tokens per turn drops to ~150 tokens of full-price input per turn after the first; the rest of the history is billed as cache reads at 0.1x.

Watch out for: tool results in the conversation. Large tool outputs (a 5k-token search result) bloat the cached prefix. If you have noisy tools, summarize or truncate results before they enter history.
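Truncation can happen at the point where the tool output is turned into a tool_result block. The sketch below uses a character budget as a rough stand-in for tokens, which is an assumption; truncated_tool_result and the 4,000-character default are hypothetical choices, not values from our projects.

```python
def truncated_tool_result(tool_use_id: str, output: str,
                          max_chars: int = 4000) -> dict:
    """Build a tool_result content block whose text is capped at max_chars.

    A short marker records how much was dropped so the model can tell
    that the result is incomplete.
    """
    if len(output) > max_chars:
        omitted = len(output) - max_chars
        output = output[:max_chars] + f"\n[truncated {omitted} characters]"
    return {"type": "tool_result", "tool_use_id": tool_use_id, "content": output}


if __name__ == "__main__":
    block = truncated_tool_result("toolu_example", "x" * 10_000, max_chars=100)
    print(len(block["content"]))
```

Summarizing with a cheaper model is the heavier alternative; a hard cap like this is usually the first thing to try because it is deterministic and keeps the prefix stable.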

How to verify caching actually works

The response includes usage fields. Check them, do not assume:

print(response.usage)
# Usage(
#   input_tokens=42,
#   cache_creation_input_tokens=0,
#   cache_read_input_tokens=8421,
#   output_tokens=180
# )

cache_creation_input_tokens > 0 means you wrote to cache this request (1.25x or 2x cost).
cache_read_input_tokens > 0 means you hit cache (0.1x cost).
Both can be non-zero in the same request if part of the prefix matched and part was new.

We log these per request and chart cache hit rate against cost. A drop in hit rate usually points to an accidental change in the stable prefix, not a traffic shift.
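A token-weighted hit rate is one reasonable way to define that metric. The sketch below assumes you have already copied the usage fields of each response into plain dictionaries; cache_hit_rate is a hypothetical helper and the definition is ours, not an official one.

```python
def cache_hit_rate(usages: list[dict]) -> float:
    """Share of prompt tokens served from cache across a batch of requests.

    Each dict is expected to carry the usage fields shown above;
    missing fields are treated as zero.
    """
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    written = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    fresh = sum(u.get("input_tokens", 0) for u in usages)
    total = read + written + fresh
    return read / total if total else 0.0


if __name__ == "__main__":
    batch = [
        {"input_tokens": 42, "cache_creation_input_tokens": 8421,
         "cache_read_input_tokens": 0},   # cold request: cache write
        {"input_tokens": 42, "cache_creation_input_tokens": 0,
         "cache_read_input_tokens": 8421},  # warm request: cache read
    ]
    print(round(cache_hit_rate(batch), 3))
```

Weighting by tokens rather than by request count keeps a few small uncached requests from masking a miss on a large prefix.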

The pattern we did not include

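We considered recommending you cache user-specific data (like profile blobs) aggressively. We pulled it after a project where users would update their profile and the agent kept responding with stale facts for the next 5 minutes. The prompt cache only matches identical content, so an updated profile block would simply miss; staleness like this appears when the application keeps sending a frozen snapshot to protect the hit rate. The fix was obvious in hindsight: do not pin or cache anything the user can change in-session. The savings were not worth the surprise.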

Wrap-up

If you take one thing from this: caching is a structure decision, not a flag. Decide what is stable, sort it by staleness, set breakpoints, and verify with the usage fields. The bill drop is the easy part. Keeping the cache hit rate above 85% as your product evolves is the actual work.

Empiric Infotech is an AI and software studio of 75+ engineers based in Surat, India, with delivery across IST, EU, and US time zones. We ship Claude API, MCP, and agent workloads for product teams. If you want vetted Claude engineers on your stack inside 48 hours, see hire Claude developers.

DEV Community