# Prompt Caching with Claude: Cut API Costs by 90% on Repeated Context
If you're sending the same large context (system prompt, documents, tool definitions) with every API call, you're paying full price each time. Prompt caching stores that context once and bills subsequent reads at roughly a tenth of the normal input-token rate.
## How It Works

Normal API call: full input tokens billed on every request.

With caching:

- First call: input tokens billed at a small premium (cache writes cost 1.25x the base input rate)
- Subsequent calls: only a cache read fee, roughly 10% of the base input rate (90% cheaper than full tokens)
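To see where the ~90% figure comes from, it helps to run the arithmetic. A minimal sketch, assuming Opus-class pricing of $15 per million input tokens (the dollar rate is an illustrative assumption; the 1.25x write and 0.1x read multipliers are the standard caching multipliers):

```python
# Illustrative cost arithmetic for a 50k-token cached prefix.
# ASSUMPTION: $15 per million input tokens (Opus-class pricing).
BASE = 15.00 / 1_000_000   # $ per input token
WRITE = BASE * 1.25        # cache-write rate (25% premium over base)
READ = BASE * 0.10         # cache-read rate (10% of base)

prefix = 50_000  # tokens in the cached system prompt

first_call = prefix * WRITE   # cache miss: prefix is written to the cache
later_call = prefix * READ    # cache hit: prefix is read from the cache

print(f'first call:  ${first_call:.4f}')   # $0.9375
print(f'later calls: ${later_call:.4f}')   # $0.0750
print(f'savings vs uncached: {1 - later_call / (prefix * BASE):.0%}')  # 90%
```

The first call costs slightly more than an uncached call (the write premium), but every call after that pays a tenth of the base rate for the same prefix.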
## Marking Content for Caching

```python
import anthropic

client = anthropic.Anthropic()

# Large document you send on every request (~50k tokens)
with open('large_codebase_context.txt') as f:
    SYSTEM_DOCS = f.read()

response = client.messages.create(
    model='claude-opus-4-6',
    max_tokens=1024,
    system=[
        {
            'type': 'text',
            'text': SYSTEM_DOCS,
            'cache_control': {'type': 'ephemeral'}  # Cache everything up to here
        }
    ],
    messages=[{'role': 'user', 'content': 'What does the auth module do?'}]
)
```
## Checking Cache Usage

```python
usage = response.usage
print(f'Input tokens: {usage.input_tokens}')
print(f'Cache write tokens: {usage.cache_creation_input_tokens}')
print(f'Cache read tokens: {usage.cache_read_input_tokens}')

# First call: cache_creation_input_tokens > 0
# Subsequent calls: cache_read_input_tokens > 0, cost ~10% of normal
```
## Multi-Turn Conversation Caching

```python
# Cache the growing conversation history by marking the latest
# assistant message as the cache breakpoint.
messages = []

def chat(user_message: str) -> str:
    messages.append({'role': 'user', 'content': user_message})

    # Copy the list AND the message dict we modify: list.copy() alone is
    # shallow, so tagging the shared dict would also mutate `messages`
    # and leave stale cache_control markers accumulating across turns.
    cached_messages = list(messages)
    if len(cached_messages) >= 2:
        last_assistant_idx = next(
            i for i in range(len(cached_messages) - 1, -1, -1)
            if cached_messages[i]['role'] == 'assistant'
        )
        last = cached_messages[last_assistant_idx]
        if isinstance(last['content'], str):
            cached_messages[last_assistant_idx] = {
                **last,
                'content': [{
                    'type': 'text',
                    'text': last['content'],
                    'cache_control': {'type': 'ephemeral'}
                }]
            }

    response = client.messages.create(
        model='claude-opus-4-6',
        max_tokens=1024,
        messages=cached_messages
    )
    reply = response.content[0].text
    messages.append({'role': 'assistant', 'content': reply})
    return reply
```
## Cache TTL and Invalidation

- Cache TTL: 5 minutes of inactivity (the clock resets on each cache hit)
- Invalidation: automatic; any change to the cached prefix creates a new cache entry
- Cache breakpoints: up to 4 per request, and the minimum cacheable prefix is 1024 tokens on most models
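The four-breakpoint limit matters once you cache several independent prefixes: tool definitions, the system prompt, and conversation history each get their own marker. A sketch of how such a payload might be laid out (the `search_docs` tool is a made-up example, and only the request body is built here, not sent):

```python
# Build a request payload with multiple cache breakpoints (max 4 per request).
# The prefix is cached in order: tools, then system, then messages.
tools = [{
    'name': 'search_docs',  # hypothetical tool for illustration
    'description': 'Search the project documentation.',
    'input_schema': {'type': 'object', 'properties': {'query': {'type': 'string'}}},
    'cache_control': {'type': 'ephemeral'},  # breakpoint 1: after tool defs
}]

system = [{
    'type': 'text',
    'text': 'You are a code-review assistant for this repository.',
    'cache_control': {'type': 'ephemeral'},  # breakpoint 2: after system prompt
}]

messages = [
    {'role': 'user', 'content': 'Review the auth module.'},
    {'role': 'assistant', 'content': [{
        'type': 'text',
        'text': 'The auth module looks solid overall.',
        'cache_control': {'type': 'ephemeral'},  # breakpoint 3: history so far
    }]},
    {'role': 'user', 'content': 'What about session handling?'},
]

def count_breakpoints(tools, system, messages):
    """Count cache_control markers across the whole request payload."""
    blocks = tools + system + [
        b for m in messages if isinstance(m['content'], list) for b in m['content']
    ]
    return sum('cache_control' in b for b in blocks)

assert count_breakpoints(tools, system, messages) <= 4  # 3 used, 1 spare
```

Leaving one breakpoint spare gives room to add a fourth marker later (for example, on a large document block) without restructuring the request.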
## Cost Comparison

| Scenario | Without Cache | With Cache |
|---|---|---|
| 100 calls, 50k-token system prompt | 5M input tokens billed | 50k write tokens + 4.95M read tokens |
| Relative cost | 100% | ~11% (writes bill at 1.25x base, reads at 0.1x) |
## Best Candidates for Caching
- Large system prompts with tool definitions
- Codebase context for code review
- Document analysis (contracts, technical specs)
- Long conversation history
Building AI features? The AI SaaS Starter Kit ships with Claude API routes pre-configured including prompt caching, streaming, and token tracking. $99 at whoffagents.com.