DEV Community

Sangmin Lee
Sangmin Lee

Posted on • Originally published at claudeguide.io

Prompt Caching: The 90% Discount Most Claude Developers Miss

Originally published at claudeguide.io/claude-prompt-caching-guide

Prompt Caching: The 90% Discount Most Claude Developers Miss

Prompt caching reduces cache-read token costs by up to 90% compared to standard input pricing. It works by storing a prefix of your prompt on Anthropic's servers so that subsequent requests reuse the cached version instead of re-processing the full text. For any application that sends the same system prompt or context repeatedly — agents, chatbots, document Q&A systems — caching is the single highest-ROI optimisation available. For the full cost break-even analysis, see Claude API Cost and Prompt Caching Break-Even.

Most developers don't use it because it requires a one-line change to their API call. This guide shows exactly what to add.


How prompt caching works

When you include "cache_control": {"type": "ephemeral"} in a message block, Anthropic stores that block's content in a temporary cache on their edge servers. The cache entry:

  • Lives for 5 minutes (TTL resets each time it's hit)
  • Is scoped to your API key — no other users share your cache
  • Requires a minimum size of 1,024 tokens (Haiku) or 2,048 tokens (Sonnet/Opus) to be worth caching

On the first request (cache write), you pay a premium: 1.25× the standard input price. On every subsequent request that hits the cache (cache read), you pay only 0.1× the standard input price — a 90% discount.


Pricing breakdown (April 2026)

For claude-sonnet-4-6:

Token type Price per 1M tokens
Standard input $3.00
Cache write $3.75 (1.25×)
Cache read $0.30 (0.1×)
Output $15.00

Break-even analysis: You recover the cache-write premium after just 2 cache reads on the same content. Any application that processes the same document more than twice per 5-minute window is leaving money on the table. Plug your own numbers into the Prompt Caching Break-Even Calculator to see the exact monthly savings for your traffic pattern.

For longer documents like a 50,000-token codebase or legal contract, the math becomes dramatic. At $3.00/M, reading that document 100 times costs $15.00. With caching, you pay $3.75 once and $0.30 × 99 times = $33.45... wait, that's higher? No — the break-even is 2 reads within the TTL window. See the real calculation:

Without caching (100 reads): 50,000 tokens × 100 × $3.00/M = $15.00
With caching (1 write + 99 reads): (50,000 × $3.75/M) + (50,000 × 99 × $0.30/M) = $0.1875 + $1.485 = $1.67

That's an 89% reduction for high-frequency document access.


When to use prompt caching

Prompt caching is most valuable when:

Scenario Cache target Expected savings
Chatbot with long system prompt System message 60–80%
Document Q&A (same doc, many questions) Document content 80–90%
Agent with fixed tool definitions Tool schemas 40–60%
Code review bot (same codebase) Repository context 85–95%
Multi-turn conversations Growing history 50–70%

It has minimal value for:

  • Single-turn, one-off completions
  • Short prompts under the minimum token threshold
  • Requests more than 5 minutes apart

Implementation: basic example

Add cache_control to any content block you want cached. Here's a chatbot with a large system prompt:


python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a senior Python engineer...
[... 5,000 tokens of instructions, examples, and context ...]"""

def chat(user_message: str, conversation_history: list) -

[→ Get the Cost Optimization Masterclass — $59](https://shoutfirst.gumroad.com/l/msjkda?utm_source=claudeguide&utm_medium=article&utm_campaign=claude-prompt-caching-guide)

*30-day money-back guarantee. Instant download.*
Enter fullscreen mode Exit fullscreen mode

Top comments (0)