Prompt Caching for AI Coding: The $0 Optimization That Saves 60% on Every Call
If you're making multiple AI coding calls against the same codebase in a session — and you're NOT using prompt caching — you're burning money for no reason.
Prompt caching is supported by Anthropic, OpenAI, Google, and most providers. It's free to enable. It cuts input token costs by 60-90%.
Here's exactly how to implement it across every major provider.
How Prompt Caching Works (30-Second Version)
Without caching:
Call 1: Send 50K context + 500 token prompt → Pay for 50,500 tokens
Call 2: Send 50K context + 800 token prompt → Pay for 50,800 tokens
Call 3: Send 50K context + 300 token prompt → Pay for 50,300 tokens
Total input tokens billed: 151,600
With caching:
Call 1: Send 50K context (cached) + 500 prompt → Pay 50,500 (cache write)
Call 2: Cache hit on 50K + 800 prompt → Pay 5,800 (90% discount)
Call 3: Cache hit on 50K + 300 prompt → Pay 5,300 (90% discount)
Total input tokens billed: 61,600 (59% savings)
The more calls you make in a session, the more you save. By call 10, the average input cost per call has dropped by roughly 80%.
Anthropic (Claude) — Prompt Caching
Anthropic's caching is the most straightforward. Add cache_control to any message block.
```python
import anthropic

client = anthropic.Anthropic()

# Your static project context — this gets cached
project_context = open("AGENTS.md").read() + "\n" + open("src/main.ts").read()

def coding_call(prompt: str):
    return client.messages.create(
        model="claude-sonnet-4.6",
        max_tokens=4096,
        system=[{
            "type": "text",
            "text": project_context,
            "cache_control": {"type": "ephemeral"}  # This enables caching
        }],
        messages=[{
            "role": "user",
            "content": prompt
        }]
    )

# First call: cache miss (normal price + small write fee)
r1 = coding_call("Find bugs in the auth handler")

# Second call: cache HIT (90% cheaper on the 50K context)
r2 = coding_call("Write tests for the auth handler")

# Third call: still cached
r3 = coding_call("Add error logging to the auth handler")
```
Key rules for Anthropic caching:
- Cache lives for 5 minutes (resets on each hit)
- Minimum cacheable prompt length: 1,024 tokens (2,048 on Haiku models)
- The cached text must be EXACTLY the same — byte-for-byte
- Cache write costs 25% more; cache read costs 90% less
Cost math for a typical session:
10 calls, 50K context each:
Without caching: 10 × 50K = 500K input tokens
With caching: 50K × 1.25 (write) + 9 × 5K (reads) = 107.5K effective tokens
Savings: 78%
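You don't have to take the savings on faith: Anthropic reports `cache_creation_input_tokens` and `cache_read_input_tokens` in each response's `usage` block. A tiny helper like this (a sketch; `cache_status` is a hypothetical name) classifies every call so you can log writes vs hits:

```python
def cache_status(usage: dict) -> str:
    """Classify an Anthropic usage block as a cache write, hit, or miss.

    Pass the response's usage as a dict, e.g. response.usage.model_dump().
    """
    if usage.get("cache_read_input_tokens", 0) > 0:
        return "hit"    # context was served from cache at ~90% discount
    if usage.get("cache_creation_input_tokens", 0) > 0:
        return "write"  # context was written to cache (25% surcharge)
    return "miss"       # nothing cached — check your cache_control blocks
```

If every call in a tight session reports `"write"` or `"miss"`, something (usually a byte-level difference in the context) is busting the cache.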
OpenAI (GPT) — Automatic Caching
OpenAI caches automatically — no code changes needed. But there are tricks to maximize hit rates.
```python
from openai import OpenAI

client = OpenAI()

# The KEY: keep your system message and prefix IDENTICAL across calls
system_message = """You are a senior developer working on this project.
Project context:
""" + open("AGENTS.md").read() + """
Source files:
""" + open("src/main.ts").read()

def coding_call(prompt: str):
    return client.chat.completions.create(
        model="gpt-5.4",
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": prompt}
        ]
    )

# OpenAI automatically caches the system message prefix.
# Subsequent calls with the same prefix get a 50% discount on the cached tokens.
```
OpenAI caching rules:
- Automatic for prompts > 1,024 tokens
- 50% discount on cached input tokens (not 90% like Anthropic)
- Cache persists for 5-10 minutes of inactivity
- The prefix must match exactly — even whitespace matters
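Because OpenAI's caching is implicit, the only way to know it's working is to inspect `usage.prompt_tokens_details.cached_tokens` on each response. A small helper (a sketch; pass in `response.usage.model_dump()` or an equivalent dict) turns that into a hit rate you can log:

```python
def openai_cache_report(usage: dict) -> dict:
    """Summarize prompt-cache usage from a chat completion's usage payload."""
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    total = usage.get("prompt_tokens", 0)
    return {
        "cached_tokens": cached,
        "uncached_tokens": total - cached,
        "hit_rate": round(cached / total, 3) if total else 0.0,
    }
```

A hit rate near zero on the second call of a session means your prefix isn't byte-identical.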
Maximizing OpenAI cache hits:
```python
# ❌ BAD: the changing assistant turn means later calls share no stable prefix
messages = [
    {"role": "system", "content": f"Context: {context}"},
    {"role": "user", "content": "Fix bug #142"},
    {"role": "assistant", "content": previous_response},  # This changes!
    {"role": "user", "content": "Now add tests"}
]

# ✅ GOOD: Static prefix, dynamic suffix
messages = [
    {"role": "system", "content": context},         # STATIC — gets cached
    {"role": "user", "content": combined_prompt}    # All dynamic content here
]
```
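One way to implement the static-prefix pattern is to fold all prior turns into the current user message so the system prompt never changes. This is a sketch (`fold_history` is a hypothetical helper, and the transcript format is an assumption):

```python
def fold_history(context: str, history: list, prompt: str) -> list:
    """Keep the system prefix byte-identical; fold prior turns into the user turn.

    history: list of (role, text) tuples from earlier in the session.
    """
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    user = (f"Conversation so far:\n{transcript}\n\nCurrent request: {prompt}"
            if transcript else prompt)
    return [
        {"role": "system", "content": context},  # STATIC prefix, always cacheable
        {"role": "user", "content": user},
    ]
```

The tradeoff: you give up caching of the growing turn history itself, so this pays off mainly when the shared context dwarfs the conversation.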
Google (Gemini) — Explicit Context Caching
Google's caching is the most powerful for large contexts — it supports up to 2M tokens cached.
```python
import google.generativeai as genai

# Create a cached context (persists for hours, not minutes)
cache = genai.caching.CachedContent.create(
    model="gemini-2.5-pro",
    display_name="my-project-context",
    contents=[{
        "role": "user",
        "parts": [{"text": project_context}]  # Your full codebase context
    }],
    ttl="3600s"  # Cache for 1 hour
)

# Use the cached context
model = genai.GenerativeModel.from_cached_content(cache)

# Every call now uses cached context — massive savings
r1 = model.generate_content("Find security issues")
r2 = model.generate_content("Optimize the database queries")
r3 = model.generate_content("Write integration tests")

# Clean up when done
cache.delete()
```
Google caching advantages:
- Longest TTL (hours vs minutes)
- Largest cacheable context (2M tokens)
- Cheapest cached reads (75% discount)
- You control the cache lifecycle explicitly
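One caveat the list above hides: unlike Anthropic and OpenAI, Google also bills for cache *storage* by token-hour, so a long TTL isn't automatically free. A rough breakeven sketch (the prices below are placeholder assumptions, not current list prices; check Google's pricing page):

```python
def cache_pays_off(context_tokens: int, calls: int, hours: float,
                   input_price: float = 3.5e-6,
                   cached_discount: float = 0.75,
                   storage_price_per_token_hour: float = 1.0e-6) -> bool:
    """Do cached-read savings beat the storage fee for a planned session?

    All prices are per-token placeholders — substitute current rates.
    """
    savings = calls * context_tokens * input_price * cached_discount
    storage_cost = context_tokens * hours * storage_price_per_token_hour
    return savings > storage_cost
```

The pattern this makes visible: explicit caching wins when you make many calls against a big context inside the TTL, and loses when you cache a large context "just in case" and barely use it.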
The Universal Caching Pattern
Regardless of provider, this pattern maximizes cache hits:
```python
from pathlib import Path


class CachedCodingSession:
    """Provider-agnostic cached coding session"""

    def __init__(self, project_path: str, provider: str = "anthropic"):
        self.provider = provider
        self.context = self._build_context(project_path)
        self.call_count = 0
        self.estimated_savings = 0.0

    def _build_context(self, path: str) -> str:
        """Build static context ONCE at session start"""
        parts = []
        # Always include AGENTS.md
        agents_md = Path(path) / "AGENTS.md"
        if agents_md.exists():
            parts.append(f"# Project Context\n{agents_md.read_text()}")
        # Include relevant source files
        for f in self._get_relevant_files(path):
            parts.append(f"# {f.name}\n```\n{f.read_text()}\n```")
        return "\n\n".join(parts)

    def call(self, prompt: str) -> str:
        """Make a cached call"""
        self.call_count += 1
        if self.call_count > 1:
            # Estimate savings from a cache hit (~4 chars/token, ~$3/1M input tokens)
            context_tokens = len(self.context) // 4
            self.estimated_savings += context_tokens * 0.000003
        return self._provider_call(prompt)

    def stats(self):
        hit_rate = ((self.call_count - 1) / self.call_count * 100
                    if self.call_count else 0.0)
        return {
            "calls": self.call_count,
            "estimated_savings": f"${self.estimated_savings:.2f}",
            "cache_hit_rate": f"{hit_rate:.0f}%"
        }

    # _get_relevant_files and _provider_call are project/provider specific —
    # wire them up to the per-provider snippets shown earlier.
```
Common Mistakes That Break Caching
1. Changing the context between calls
```python
# ❌ Adding timestamps breaks the cache
system = f"Context as of {datetime.now()}:\n{context}"

# ✅ Static context, timestamp in user message if needed
system = context
user = f"[{datetime.now()}] Fix the auth bug"
```
2. Reordering files in context
```python
# ❌ Random file order = different hash = cache miss
files = os.listdir("src/")  # Order varies by filesystem

# ✅ Sort files deterministically
files = sorted(os.listdir("src/"))
```
3. Not batching calls
```python
# ❌ Long gaps between calls = cache expires mid-session
for file in files:
    review(file)  # 5 min gap between files = cache expired

# ✅ Batch calls in a tight loop
reviews = [review(file) for file in files]  # All hit the cache
```
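All three mistakes share one symptom: the bytes you believe are static aren't. A cheap guard is to fingerprint the context and log it with every call; if the fingerprint changes mid-session, you've guaranteed a cache miss (`context_fingerprint` is an illustrative helper, not any provider's API):

```python
import hashlib

def context_fingerprint(context: str) -> str:
    """Stable 12-hex-char fingerprint of the static context.

    Log this alongside each call: any change between calls means the
    provider sees a different prefix and the cache cannot hit.
    """
    return hashlib.sha256(context.encode("utf-8")).hexdigest()[:12]
```

This catches timestamps, reordered files, and trailing-whitespace drift alike, since all of them change the hash.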
Real-World Savings
Here's what caching saves on a typical day of AI-assisted development:
Session: 4 hours, 35 AI calls, 80K average context
Without caching:
35 × 80K = 2.8M input tokens
Cost: ~$8.40 (at $3/1M tokens)
With caching:
80K × 1.25 (1 write) + 34 × 8K (reads)
= 100K + 272K = 372K effective tokens
Cost: ~$1.12
Daily savings: $7.28 (87%)
Monthly savings: ~$218
That's real money for zero code quality tradeoff.
Quick-Start Checklist
- ☐ Identify your static context (AGENTS.md, core source files)
- ☐ Build context ONCE at session start, not per call
- ☐ Add `cache_control` (Anthropic) or use static prefixes (OpenAI)
- ☐ Sort files deterministically in context
- ☐ Remove timestamps and dynamic content from cached portions
- ☐ Batch calls to keep cache warm
- ☐ Track cache hit rates and adjust
Full Caching Optimization Kit
The complete prompt caching configs, multi-provider implementations, and cost tracking dashboards are in the AI Dev Toolkit — 264 production frameworks including caching patterns for every major AI provider.
Are you using prompt caching? What's your cache hit rate? Share your setup in the comments.