If you are building production-grade AI applications, you already know the pain of LLM API bills. As your context grows—whether you are feeding Claude large codebases, legal documents, or long chat histories—the cost of input tokens scales linearly.
But it doesn't have to.
Anthropic's Prompt Caching feature allows you to cache frequently used context. Instead of paying full price to process the same system instructions or reference documents over and over, you can reuse them at a 90% discount on the cached input tokens, and Anthropic reports latency reductions of up to 85% for long prompts.
In this hands-on guide, you will learn how to implement prompt caching in Python, understand the pricing dynamics, and walk away with production-ready code.
How Prompt Caching Works (The TL;DR)
Normally, every time you send a request to Claude, the API processes your entire prompt from scratch.
With Prompt Caching, you define specific cache breakpoints in your prompt. When Claude processes a request containing a breakpoint, it saves that prefix of the prompt to a temporary cache (with a 5-minute Time-To-Live, refreshed on every hit).
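The refresh-on-hit behavior is easy to misread, so here is a toy model of it. This is only a local simulation of the TTL rule described above for a single unchanged prefix, not Anthropic's implementation; the function name `simulate_cache` and the example timestamps are made up for illustration.

```python
CACHE_TTL_SECONDS = 5 * 60  # the 5-minute TTL described above


def simulate_cache(request_times: list[float], ttl: float = CACHE_TTL_SECONDS) -> list[str]:
    """Label each request 'write' or 'hit' for one unchanged cached prefix."""
    outcomes = []
    expires_at = None  # no cache entry exists yet
    for now in sorted(request_times):
        if expires_at is not None and now < expires_at:
            outcomes.append("hit")
        else:
            outcomes.append("write")
        # Both a write and a hit (re)start the TTL window.
        expires_at = now + ttl
    return outcomes


# Request times in seconds since the first call (hypothetical values).
print(simulate_cache([0, 120, 400, 800]))
# ['write', 'hit', 'hit', 'write']
```

The request at 400 seconds still hits even though it arrives more than 5 minutes after the first write, because the hit at 120 seconds pushed the expiry out to 420 seconds.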
Here is the pricing breakdown for Claude 3.5 Sonnet:
- Base Input Tokens: $3.00 / million tokens
- Cache Write (Creation) Tokens: $3.75 / million tokens (a 25% premium to write to cache)
- Cache Read (Hit) Tokens: $0.30 / million tokens (a 90% saving)
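If you query a 50,000-token document 10 times, the document costs $0.15 in input tokens per request without caching. With caching, the first request costs about $0.19 (the cache write), and every later request within the TTL costs about $0.015.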
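To check the arithmetic yourself, here is a small sketch that plugs the listed prices into a cost function. The function names are illustrative, and the model is simplified: it counts only the document's tokens and assumes every follow-up request arrives before the cache expires.

```python
# Prices in USD per million tokens, from the Claude 3.5 Sonnet list above.
BASE_INPUT = 3.00
CACHE_WRITE = 3.75
CACHE_READ = 0.30


def input_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD of `tokens` input tokens at the given rate."""
    return tokens * price_per_million / 1_000_000


def total_cost(doc_tokens: int, requests: int, cached: bool) -> float:
    """Total document input cost across `requests` calls."""
    if requests <= 0:
        return 0.0
    if not cached:
        return requests * input_cost(doc_tokens, BASE_INPUT)
    # First call writes the cache; the rest read from it.
    first = input_cost(doc_tokens, CACHE_WRITE)
    rest = (requests - 1) * input_cost(doc_tokens, CACHE_READ)
    return first + rest


print(f"Uncached, 10 requests: ${total_cost(50_000, 10, cached=False):.4f}")
print(f"Cached, 10 requests:   ${total_cost(50_000, 10, cached=True):.4f}")
# Uncached, 10 requests: $1.5000
# Cached, 10 requests:   $0.3225
```

Note that a single cached request is slightly more expensive than an uncached one because of the write premium, so caching pays off from the second request onward.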
Implementation 1: Caching Large Reference Documents
To use prompt caching, the cached prefix must meet a minimum length. For Claude 3.5 Sonnet and Claude 3 Opus, the minimum cacheable prompt size is 1,024 tokens (roughly 750 words). For Claude 3 Haiku, it is 2,048 tokens. Shorter prompts are still processed normally; they are just not cached.
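Before sending a request, you can get a rough feel for whether a document clears the threshold. The sketch below uses only the rule of thumb from the paragraph above (1,024 tokens is roughly 750 words); it is a crude estimate, not a tokenizer, and the helper names are hypothetical. The authoritative numbers are the usage fields the API returns.

```python
# Rule of thumb from the text above: 1,024 tokens is roughly 750 words.
TOKENS_PER_WORD = 1024 / 750


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on whitespace-separated words."""
    return int(len(text.split()) * TOKENS_PER_WORD)


def likely_cacheable(text: str, min_tokens: int = 1024) -> bool:
    """True if the estimate meets the minimum for the model you are using."""
    return estimate_tokens(text) >= min_tokens


sample = "word " * 1000  # a stand-in for a 1,000-word document
print(estimate_tokens(sample))   # 1365
print(likely_cacheable(sample))  # True
```

Pass the model's own minimum as `min_tokens` when it differs from the default, and leave a safety margin, since real token counts vary a lot with code, punctuation, and non-English text.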
Here is how to cache a massive reference document using the Python SDK. We will use the cache_control parameter with the type ephemeral.
from anthropic import Anthropic

# Initialize the client (ensure ANTHROPIC_API_KEY is in your environment)
client = Anthropic()

# Let's simulate a large knowledge base (must be > 1024 tokens to cache)
large_knowledge_base = """
SYSTEM RULES AND KNOWLEDGE BASE (v4.2):
[Imagine 2,000 words of complex enterprise documentation, API specs, or code guidelines here...]
""" * 50


def query_with_cached_context(user_query: str):
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        temperature=0,
        system=[
            {
                "type": "text",
                "text": large_knowledge_base,
                # This flag tells Anthropic to cache everything up to this block
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[
            {"role": "user", "content": user_query}
        ]
    )

    # Inspect token usage to verify caching behavior
    usage = response.usage
    print(f"Response: {response.content[0].text[:100]}...")
    print(f"Input Tokens: {usage.input_tokens}")
    print(f"Cache Creation Tokens (Write): {getattr(usage, 'cache_creation_input_tokens', 0)}")
    print(f"Cache Read Tokens (Hit): {getattr(usage, 'cache_read_input_tokens', 0)}")
    print("-" * 40)


# First run: Cache Miss (Creates the cache)
print("--- Run 1 (Cache Miss / Creation) ---")
query_with_cached_context("What version of the system rules is this?")

# Second run: Cache Hit (Reads from cache)
print("--- Run 2 (Cache Hit) ---")
query_with_cached_context("Summarize the main rules in one sentence.")
When you run this, the first request will show a high number of cache_creation_input_tokens. The second request (executed within 5 minutes) will show those same tokens registered under cache_read_input_tokens, billing you at the 90% discounted rate.
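If you want to turn those raw counters into something easier to read in logs, a small helper like the one below can classify each response. It is a sketch: `summarize_usage` is a hypothetical name, and the `SimpleNamespace` objects stand in for a real `response.usage` so the example runs without an API key. It relies on `input_tokens` counting only the uncached part of the prompt, so the three fields are added to get the full prompt size.

```python
from types import SimpleNamespace


def summarize_usage(usage) -> dict:
    """Classify one response's cache behavior from its usage counters."""
    # `or 0` covers fields that are missing or returned as None.
    uncached = getattr(usage, "input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    total = uncached + written + read

    if read > 0:
        status = "hit"
    elif written > 0:
        status = "write"
    else:
        status = "none"

    return {
        "status": status,
        "total_input_tokens": total,
        "cached_fraction": read / total if total else 0.0,
    }


# Stand-ins for response.usage on the first and second runs (made-up numbers).
first_run = SimpleNamespace(input_tokens=12, cache_creation_input_tokens=1500, cache_read_input_tokens=0)
second_run = SimpleNamespace(input_tokens=14, cache_creation_input_tokens=0, cache_read_input_tokens=1500)

print(summarize_usage(first_run)["status"])   # write
print(summarize_usage(second_run)["status"])  # hit
```

A status of "none" on a request you expected to hit usually means the prefix changed, the cache expired, or the prefix was below the minimum size.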
Implementation 2: Multi-Turn Conversations
Prompt caching is incredibly powerful for chatbots. By caching the conversation history, you pay the full input price only for the tokens added since the last cached prefix; everything before it is billed at the cache-read rate.
To make this work, place your cache breakpoint at the end of the conversation history, on the last content block of the most recent message. Anthropic allows up to 4 cache breakpoints per request.
from anthropic import Anthropic

client = Anthropic()

# We start a conversation and cache the history as it grows.
# Note: this short demo history is below the 1,024-token minimum, so it will
# only start caching once the conversation grows past that size.
conversation_history = [
    {"role": "user", "content": "Let's write a python script step-by-step. First, write a function to fetch data from an API."},
    {"role": "assistant", "content": "Here is a basic function using `requests`:\n\n```python\nimport requests\ndef fetch_data(url):\n    return requests.get(url).json()\n```"},
    # We place the cache breakpoint on the most recent message to keep the history cached.
    # cache_control must sit on a content block, not on the message dict itself.
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Now, add error handling to that function.",
                "cache_control": {"type": "ephemeral"}
            }
        ]
    }
]

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=800,
    messages=conversation_history
)

print(f"Assistant: {response.content[0].text}")
print(f"Cache Hits: {getattr(response.usage, 'cache_read_input_tokens', 0)} tokens")
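In a real chat loop the breakpoint has to move forward on every turn. One way to handle that is a helper that strips old markers and tags only the newest message. This is a minimal sketch, not an official SDK utility: `with_cache_breakpoint` is a hypothetical name, and it assumes every message is a dict with a string or a list of content blocks. Because `cache_control` attaches to content blocks, string content is converted to block form first.

```python
import copy


def with_cache_breakpoint(history: list[dict]) -> list[dict]:
    """Return a copy of `history` with one cache breakpoint on the last block."""
    messages = copy.deepcopy(history)  # never mutate the caller's history
    for msg in messages:
        # Normalize plain-string content into a list of text blocks.
        if isinstance(msg["content"], str):
            msg["content"] = [{"type": "text", "text": msg["content"]}]
        # Drop markers left over from earlier turns.
        for block in msg["content"]:
            block.pop("cache_control", None)
    if messages and messages[-1]["content"]:
        messages[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return messages


history = [
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
    {"role": "user", "content": "Follow-up question"},
]
prepared = with_cache_breakpoint(history)
print(prepared[-1]["content"][-1]["cache_control"])  # {'type': 'ephemeral'}
```

You would pass `prepared` as `messages` and keep appending to the untouched `history`. Using a single moving breakpoint also keeps you well under the limit of 4 breakpoints per request.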
Practical Takeaways for Developers
To get the most out of prompt caching, structure your API calls with these rules in mind:
- Order Matters: The cache matches on an exact prefix of the prompt, read from top to bottom. If anything changes before a cache breakpoint, that breakpoint and every one after it will miss the cache. Always put your static content (system prompts, documentation, instructions) at the very beginning, and dynamic content (user queries) at the end.
- Mind the Limit: Caching only triggers if your prompt prefix meets the model's minimum token count (1,024 for Claude 3.5 Sonnet and Claude 3 Opus, 2,048 for Claude 3 Haiku). Small prompts will not benefit from caching.
- Track Your Metrics: Always log cache_read_input_tokens and cache_creation_input_tokens in your telemetry. This ensures you can audit your actual cost savings.
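A common way to break the ordering rule by accident is to interpolate something that changes per request (a timestamp, a user ID) into the system prompt. One way to catch this is to log a fingerprint of the supposedly static prefix next to your cache metrics. The sketch below is a local diagnostic only; it says nothing about how Anthropic derives cache keys, and `prefix_fingerprint` and the sample prompts are hypothetical.

```python
import hashlib
import json


def prefix_fingerprint(system_blocks: list[dict]) -> str:
    """Short, stable hash of the static prefix, for logging and comparison."""
    canonical = json.dumps(system_blocks, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def make_system(text: str) -> list[dict]:
    """Build a one-block system prompt with a cache breakpoint."""
    return [{"type": "text", "text": text, "cache_control": {"type": "ephemeral"}}]


# Static prefix: identical on every request, so it can be reused from cache.
stable_a = make_system("You are a support agent. Follow the rules below.")
stable_b = make_system("You are a support agent. Follow the rules below.")

# Anti-pattern: a per-request value inside the prefix changes it every time.
drifting = make_system("Time: 12:00:01. You are a support agent. Follow the rules below.")

print(prefix_fingerprint(stable_a) == prefix_fingerprint(stable_b))  # True
print(prefix_fingerprint(stable_a) == prefix_fingerprint(drifting))  # False
```

If the fingerprint differs between two requests that you expected to share a cache entry, move the changing value out of the prefix and into the final user message.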
What will you build?
With up to a 90% reduction in input cost on large, repeated contexts, you can now build features that were previously cost-prohibitive: codebase-wide QA bots, real-time document synthesizers, and highly contextual agents.
Are you already using prompt caching in your projects? Share your cost-saving wins or ask your integration questions in the comments below!