DEV Community

GDS K S
Prompt Caching With the Claude API: A Practical Guide

I noticed a pattern looking at three months of Anthropic invoices. The same 8 KB system prompt was getting billed full price on every request. Same instructions, same tool definitions, same RAG context, charged again every turn. The fix takes about ten lines of code and cuts the input bill by roughly 90 percent on cached tokens. This guide is the version of that fix I wish I had bookmarked a year ago.

TL;DR

| Question | Short answer |
|---|---|
| What does it do? | Stores a prefix of your prompt server-side so later requests skip re-encoding it |
| How much does it save? | Cache reads cost 10 percent of the base input rate, so up to 90 percent off cached tokens |
| What does it cost to write? | First write costs 1.25x base input (5-minute TTL) or 2x (1-hour TTL) |
| When does it pay off? | Any prefix reused at least twice within the TTL window |
| Smallest cacheable chunk? | 1024 tokens for Sonnet and Opus, 2048 tokens for Haiku |
| Where do I put the marker? | On the last block of the chunk you want cached |

1. What prompt caching actually is

When you send a Message to the Claude API, Anthropic encodes your full prompt every time. System instructions, tool schemas, retrieved documents, prior turns, every part. Caching adds an opt-in: you mark a content block as a cache breakpoint, and Anthropic stores the encoded state of everything up to that point. The next request that starts with the same exact bytes reads from the cache instead of recomputing.

Three things to internalize:

The cache is a prefix cache. Order matters. Same system prompt, same tools, same first message in the array. If anything in the prefix differs by even one token, you get a cache miss and pay full input price.

Each breakpoint is ephemeral with a TTL of either 5 minutes (default) or 1 hour. The clock resets every time you read the cache, so a chatty conversation keeps its cache warm without re-paying the write cost.

You get up to four breakpoints per request. Smart layering matters: tools at the deepest level, then system, then static context, then the conversation tail. Each breakpoint extends the cached prefix one layer outward.
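A toy sketch of why exact bytes matter: model the cache as a dictionary keyed by a hash of everything up to the breakpoint. This illustrates the matching rule only; it is not Anthropic's implementation.

```python
import hashlib

# Toy prefix cache: the key is a hash of every byte up to the breakpoint.
cache: dict[str, bool] = {}

def prefix_key(tools: str, system: str) -> str:
    # Tools come before system in the prefix, so both feed the key, in order.
    return hashlib.sha256((tools + system).encode()).hexdigest()

def lookup(tools: str, system: str) -> str:
    key = prefix_key(tools, system)
    if key in cache:
        return "hit"
    cache[key] = True
    return "miss (write)"

print(lookup("tool-v1", "You are a reviewer."))   # miss (write)
print(lookup("tool-v1", "You are a reviewer."))   # hit
print(lookup("tool-v1", "You are a reviewer. "))  # trailing space: miss (write)
```

The third call differs by a single trailing space, and that is enough to produce a new key and a full-price miss.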

2. The cost math, with numbers

For Claude Sonnet 4.6 at the public rate of three dollars per million input tokens, here is what one mid-size system prompt costs across ten requests in a five-minute window.

| Scenario | Per request | 10 requests | Effective rate |
|---|---|---|---|
| No caching, 8 KB prefix (~2k tokens) | $0.0060 | $0.060 | $3.00/Mtok |
| 5m cache, first miss + 9 hits | $0.0075 first, $0.0006 after | $0.0129 | $0.65/Mtok |
| 1h cache, first miss + 9 hits | $0.0120 first, $0.0006 after | $0.0174 | $0.87/Mtok |

The 5-minute TTL pays for itself after the second request. The 1-hour TTL pays for itself after the third. Use 1-hour only when you know the prefix lives longer than 10 minutes between calls; otherwise stick with the default 5m and let the cache renew on each read.

Output tokens are not affected. Caching only changes the input side.

3. The smallest useful implementation

Take the static parts of your prompt and tag the last block with cache_control. Everything before that block gets cached; everything after stays dynamic.

TypeScript

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const SYSTEM_PROMPT = `You are a code reviewer for a TypeScript monorepo.
[... 1500 more tokens of style guide, examples, repo conventions ...]`;

async function review(diff: string) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: SYSTEM_PROMPT,
        cache_control: { type: "ephemeral" }, // 5m TTL by default
      },
    ],
    messages: [{ role: "user", content: diff }],
  });

  console.log({
    cache_creation: response.usage.cache_creation_input_tokens,
    cache_read: response.usage.cache_read_input_tokens,
    fresh_input: response.usage.input_tokens,
  });

  return response;
}
```

The first call returns a non-zero cache_creation_input_tokens and bills the 1.25x write multiplier on the system prompt. Every later call within 5 minutes returns a non-zero cache_read_input_tokens and bills the 0.1x read rate. The diff itself stays uncached because it sits after the breakpoint.

Python

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a code reviewer for a TypeScript monorepo.
[... 1500 more tokens of style guide, examples, repo conventions ...]"""

def review(diff: str):
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # 5m default
            }
        ],
        messages=[{"role": "user", "content": diff}],
    )

    print({
        "cache_creation": response.usage.cache_creation_input_tokens,
        "cache_read": response.usage.cache_read_input_tokens,
        "fresh_input": response.usage.input_tokens,
    })

    return response
```

For a 1-hour TTL, change the breakpoint to `{"type": "ephemeral", "ttl": "1h"}`. That is the only difference.

4. Layering breakpoints in a real request

The four-breakpoint cap is generous when you stack the layers right. Here is the pattern that has worked across every production prompt I have audited.

```
┌─────────────────────────────┐
│  tools (rarely change)      │  breakpoint 1
├─────────────────────────────┤
│  system prompt              │  breakpoint 2
├─────────────────────────────┤
│  static context / RAG docs  │  breakpoint 3
├─────────────────────────────┤
│  conversation history       │  breakpoint 4
├─────────────────────────────┤
│  current user turn          │  uncached
└─────────────────────────────┘
```

Each breakpoint extends the cached prefix down one layer. If the user adds a turn but the docs and tools and system stay the same, you pay full rate only on the new user turn and the model's response.
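Back-of-envelope numbers for that incremental turn, with made-up token counts per layer and the same Sonnet-class rates as section 2 (and ignoring, for simplicity, that the history layer itself grows each turn):

```python
# Per-turn cost with the layered layout: tokens behind a warm breakpoint
# bill at the 0.1x read rate; only the new turn bills at the full rate.
# Layer sizes below are invented for illustration.
BASE = 3.00 / 1_000_000  # dollars per input token

LAYERS = {"tools": 1500, "system": 2000, "rag_docs": 4000, "history": 3000}

def turn_cost(new_turn_tokens: int, cache_warm: bool) -> float:
    cached = sum(LAYERS.values())
    if cache_warm:
        return cached * BASE * 0.10 + new_turn_tokens * BASE
    # Cold turn: everything behind the breakpoints is written at 1.25x (5m TTL).
    return cached * BASE * 1.25 + new_turn_tokens * BASE

print(f"cold turn: ${turn_cost(500, cache_warm=False):.4f}")
print(f"warm turn: ${turn_cost(500, cache_warm=True):.4f}")
```

With these numbers the warm turn costs roughly a tenth of the cold one, which is the whole point of pushing the breakpoints as deep as the static layers allow.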

```typescript
const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 2048,
  tools: [
    { name: "search_docs", description: "...", input_schema: {/*...*/} },
    {
      name: "run_query",
      description: "...",
      input_schema: {/*...*/},
      cache_control: { type: "ephemeral" }, // breakpoint 1
    },
  ],
  system: [
    {
      type: "text",
      text: SYSTEM_PROMPT,
      cache_control: { type: "ephemeral" }, // breakpoint 2
    },
  ],
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: RETRIEVED_DOCS,
          cache_control: { type: "ephemeral" } }, // breakpoint 3
        { type: "text", text: priorTurnSummary,
          cache_control: { type: "ephemeral" } }, // breakpoint 4
        { type: "text", text: currentUserMessage }, // uncached
      ],
    },
  ],
});
```

A nuance worth knowing: if a request mixes 5-minute and 1-hour breakpoints, the usage object breaks the write down by TTL, reporting `cache_creation.ephemeral_5m_input_tokens` and `cache_creation.ephemeral_1h_input_tokens` separately, so you can see which tier each write landed in.

5. The failure modes

Caching is unforgiving about exact-prefix matching. The four ways I have seen it silently miss in production:

A timestamp slipped into the system prompt. Someone added Today is ${new Date().toISOString()} at the top. Every request now has a unique prefix. Every request misses cache. Move the timestamp to the user turn, or round it to the nearest hour.

Trailing whitespace. Two engineers concatenated the prompt with + "\n" in different places. The bytes differ. The cache misses.

Tool order changed. Tools are part of the prefix. If you sort them by frequency or load them from a Map (insertion order), make sure the order is stable across processes.

Below the size floor. Sonnet caches blocks of at least 1024 tokens; Haiku needs 2048. A 600-token system prompt will not cache at all, and you will see no cache_creation_input_tokens in the response. Either pad it or accept the full rate.

6. When to pick 1h over 5m

Decision rule that has held up:

| Use case | TTL |
|---|---|
| Interactive chat, agent loops | 5m |
| Background batch jobs every minute | 5m |
| Cron job every 15 minutes | 1h |
| Long-running review session, hourly digests | 1h |
| Single-shot one-off | no cache |

The 1-hour TTL costs 2x base on the write, so it needs two cache reads inside that hour to beat no caching at all, where the 5-minute write at 1.25x needs only one. If your traffic is spiky and unpredictable, 5m is the safer default: each read resets its clock anyway.

7. The honest tradeoffs

Four caveats nobody mentions in the marketing copy.

Caching shifts cost, it does not remove it. Output tokens still bill at the full rate. If your bill leans on output (long generated reports, code synthesis), caching helps less than the 90 percent headline suggests.

Cache hit rate is observable but not always predictable. Anthropic does not guarantee a write will be readable on the next request; in practice it always lands for me, but the docs are clear that the cache write runs on a best-effort basis. Build with the assumption that any individual request might miss.
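Since any request might miss, it is worth measuring the hit rate rather than assuming it. A minimal tracker built on the usage fields the API already returns; the field names match the Messages API, the aggregation class is my own sketch:

```python
# Aggregate cache statistics across responses. Feed record() each
# response's usage (as a dict); hit_rate is the share of input tokens
# that came from cache reads.
class CacheStats:
    def __init__(self) -> None:
        self.read = self.written = self.fresh = 0

    def record(self, usage: dict) -> None:
        self.read += usage.get("cache_read_input_tokens", 0) or 0
        self.written += usage.get("cache_creation_input_tokens", 0) or 0
        self.fresh += usage.get("input_tokens", 0) or 0

    @property
    def hit_rate(self) -> float:
        total = self.read + self.written + self.fresh
        return self.read / total if total else 0.0

stats = CacheStats()
stats.record({"cache_creation_input_tokens": 2000, "input_tokens": 300})
stats.record({"cache_read_input_tokens": 2000, "input_tokens": 250})
print(f"cache hit rate: {stats.hit_rate:.0%}")
```

A sustained drop in this number is usually one of the failure modes from section 5 sneaking into the prefix.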

Tool definitions count as part of the prefix. Adding a new tool invalidates the cache for every prompt that uses tools. Plan tool schema changes during low-traffic windows.

No public API exists for manually evicting a cache. You either let the TTL expire or change the prefix. This matters when you ship a system-prompt update and want to confirm the old version no longer lives in the cache.

8. The bottom line

Prompt caching is the highest-ROI single change you can make to a Claude API integration today. About ten extra lines of config, no architecture change, and a 90 percent cut on the largest line item in most invoices. The reason for the under-use is not technical; the docs treat caching as a feature flag, not as the default.

Set the breakpoint, log cache_read_input_tokens for a week, and watch what your bill does. If you are running tools, set four breakpoints, not one. If you are running interactive sessions, stick with the 5m TTL.

The cheaper the input gets, the more the architecture starts to look like the cache is the prompt and the prompt is the cache. That shift is worth thinking about before the next round of API price changes lands.


GDS K S · thegdsks.com · building Glincker · follow on X @thegdsks

The bill that caching hides is not the next one. It is the one you would have paid in twelve months if you never set the breakpoint.
