dohko
Prompt Caching for AI Coding: The $0 Optimization That Saves 60% on Every Call


If you're making multiple AI coding calls against the same codebase in a session — and you're NOT using prompt caching — you're burning money for no reason.

Prompt caching is supported by Anthropic, OpenAI, Google, and most providers. It's free to enable. It cuts input token costs by 60-90%.

Here's exactly how to implement it across every major provider.


How Prompt Caching Works (30-Second Version)

Without caching:

Call 1: Send 50K context + 500 token prompt  → Pay for 50,500 tokens
Call 2: Send 50K context + 800 token prompt  → Pay for 50,800 tokens
Call 3: Send 50K context + 300 token prompt  → Pay for 50,300 tokens
Total input tokens billed: 151,600

With caching:

Call 1: Send 50K context (cached) + 500 prompt → Pay 50,500 (cache write)
Call 2: Cache hit on 50K + 800 prompt          → Pay 5,800 (90% discount)
Call 3: Cache hit on 50K + 300 prompt          → Pay 5,300 (90% discount)
Total input tokens billed: 61,600  (59% savings)

The more calls you make in a session, the more you save. By call 10, the average cost per call has dropped by nearly 80%.
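The arithmetic above generalizes to a small helper. This is a sketch using Anthropic-style multipliers as defaults (1.25× cache write, 90% read discount); pass write_premium=1.0 to reproduce the simplified numbers above, which omit the write premium.

```python
def billed_tokens(context: int, prompts: list[int], cached: bool,
                  write_premium: float = 1.25,
                  read_discount: float = 0.10) -> int:
    """Estimate billed input tokens for a session sharing one static context.

    Defaults are Anthropic-style multipliers; other providers differ."""
    if not cached:
        return sum(context + p for p in prompts)
    # First call writes the cache (at a premium), every later call reads it
    total = int(context * write_premium) + prompts[0]
    for p in prompts[1:]:
        total += int(context * read_discount) + p
    return total
```

For the three-call example above: 151,600 tokens uncached vs 61,600 with the simplified (no-premium) caching model.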


Anthropic (Claude) — Prompt Caching

Anthropic's caching is the most straightforward: add cache_control to any system or content block.

import anthropic

client = anthropic.Anthropic()

# Your static project context — this gets cached
project_context = open("AGENTS.md").read() + "\n" + open("src/main.ts").read()

def coding_call(prompt: str):
    return client.messages.create(
        model="claude-sonnet-4.6",
        max_tokens=4096,
        system=[{
            "type": "text",
            "text": project_context,
            "cache_control": {"type": "ephemeral"}  # This enables caching
        }],
        messages=[{
            "role": "user",
            "content": prompt
        }]
    )

# First call: cache miss (normal price + small write fee)
r1 = coding_call("Find bugs in the auth handler")

# Second call: cache HIT (90% cheaper on the 50K context)
r2 = coding_call("Write tests for the auth handler")

# Third call: still cached
r3 = coding_call("Add error logging to the auth handler")

Key rules for Anthropic caching:

  • Cache lives for 5 minutes (resets on each hit)
  • Minimum cacheable prefix: 1,024 tokens on most models (2,048 on Haiku)
  • The cached text must be EXACTLY the same — byte-for-byte
  • Cache write costs 25% more; cache read costs 90% less

Cost math for a typical session:

10 calls, 50K context each:

Without caching:  10 × 50K = 500K input tokens
With caching:     50K × 1.25 (write) + 9 × 5K (reads) = 107.5K effective tokens
Savings: 78%

OpenAI (GPT) — Automatic Caching

OpenAI caches automatically — no code changes needed. But there are tricks to maximize hit rates.

from openai import OpenAI

client = OpenAI()

# The KEY: keep your system message and prefix IDENTICAL across calls
system_message = """You are a senior developer working on this project.

Project context:
""" + open("AGENTS.md").read() + """

Source files:
""" + open("src/main.ts").read()

def coding_call(prompt: str):
    return client.chat.completions.create(
        model="gpt-5.4",
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": prompt}
        ]
    )

# OpenAI automatically caches the system message prefix
# Subsequent calls with the same prefix get 50% discount on input tokens

OpenAI caching rules:

  • Automatic for prompts > 1,024 tokens
  • 50% discount on cached input tokens (not 90% like Anthropic)
  • Cache persists for 5-10 minutes of inactivity
  • The prefix must match exactly — even whitespace matters

Maximizing OpenAI cache hits:

# ❌ BAD: Different prefix each time
messages = [
    {"role": "system", "content": f"Context: {context}"},
    {"role": "user", "content": "Fix bug #142"},
    {"role": "assistant", "content": previous_response},  # This changes!
    {"role": "user", "content": "Now add tests"}
]

# ✅ GOOD: Static prefix, dynamic suffix
messages = [
    {"role": "system", "content": context},        # STATIC — gets cached
    {"role": "user", "content": combined_prompt}    # All dynamic content here
]
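Since even whitespace breaks the match, it's worth asserting prefix stability in code rather than trusting yourself. A minimal guard (PrefixGuard is a hypothetical helper name, not a library API):

```python
import hashlib

def prefix_fingerprint(prefix: str) -> str:
    """Stable hash of the static prefix; if it changes, the next call misses."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:16]

class PrefixGuard:
    """Fail fast if the 'static' prefix drifts between calls."""

    def __init__(self, prefix: str):
        self.expected = prefix_fingerprint(prefix)

    def check(self, prefix: str) -> None:
        actual = prefix_fingerprint(prefix)
        if actual != self.expected:
            raise ValueError(
                f"prefix changed ({self.expected} -> {actual}): cache miss ahead")
```

Call guard.check(system_message) before each request; a single stray space will raise instead of silently costing you the full input price.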

Google (Gemini) — Explicit Context Caching

Google's caching is the most powerful for large contexts — it supports up to 2M tokens cached.

import datetime

import google.generativeai as genai

# Create a cached context (persists for hours, not minutes)
cache = genai.caching.CachedContent.create(
    model="gemini-2.5-pro",
    display_name="my-project-context",
    contents=[{
        "role": "user",
        "parts": [{"text": project_context}]  # Your full codebase context
    }],
    ttl=datetime.timedelta(hours=1)  # Cache for 1 hour
)

# Use the cached context
model = genai.GenerativeModel.from_cached_content(cache)

# Every call now uses cached context — massive savings
r1 = model.generate_content("Find security issues")
r2 = model.generate_content("Optimize the database queries")
r3 = model.generate_content("Write integration tests")

# Cleanup when done
cache.delete()

Google caching advantages:

  • Longest TTL (hours vs minutes)
  • Largest cacheable context (2M tokens)
  • Cheapest cached reads (75% discount)
  • You control the cache lifecycle explicitly
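One caveat: Google bills a per-token-hour storage fee for explicit caches, so caching only pays off above a certain call volume. A back-of-envelope check — all three default prices below are illustrative placeholders, not authoritative Gemini pricing:

```python
def explicit_cache_saves_money(context_tokens: int, calls: int, hours: float,
                               input_price: float = 1.25,   # assumed $/1M input
                               cached_price: float = 0.31,  # assumed $/1M cached reads
                               storage_price: float = 4.50  # assumed $/1M tokens/hour
                               ) -> bool:
    """True if an explicit cache beats resending the context every call.

    With cache: one full-price write, discounted reads, plus storage rent.
    Without:    full input price on every call."""
    m = context_tokens / 1_000_000
    without_cache = calls * m * input_price
    with_cache = (m * input_price                 # initial cache write
                  + calls * m * cached_price      # discounted reads
                  + m * storage_price * hours)    # storage rent
    return with_cache < without_cache
```

Under these placeholder prices, a 1M-token context cached for an hour pays off at ten calls but not at one — check current pricing before relying on the threshold.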

The Universal Caching Pattern

Regardless of provider, this pattern maximizes cache hits:

from pathlib import Path

class CachedCodingSession:
    """Provider-agnostic cached coding session.

    _get_relevant_files and _provider_call are provider-specific stubs."""

    def __init__(self, project_path: str, provider: str = "anthropic"):
        self.provider = provider
        self.context = self._build_context(project_path)
        self.call_count = 0
        self.estimated_savings = 0.0

    def _build_context(self, path: str) -> str:
        """Build static context ONCE at session start"""
        parts = []

        # Always include AGENTS.md
        agents_md = Path(path) / "AGENTS.md"
        if agents_md.exists():
            parts.append(f"# Project Context\n{agents_md.read_text()}")

        # Include relevant source files
        for f in self._get_relevant_files(path):
            parts.append(f"# {f.name}\n```\n{f.read_text()}\n```")

        return "\n\n".join(parts)

    def call(self, prompt: str) -> str:
        """Make a cached call"""
        self.call_count += 1

        if self.call_count > 1:
            # Estimate savings from a cache hit: ~90% off $3/1M input tokens
            context_tokens = len(self.context) // 4  # rough chars-to-tokens estimate
            self.estimated_savings += context_tokens * 3e-6 * 0.9

        return self._provider_call(prompt)

    def stats(self):
        hit_rate = (self.call_count - 1) / self.call_count * 100 if self.call_count else 0
        return {
            "calls": self.call_count,
            "estimated_savings": f"${self.estimated_savings:.2f}",
            "cache_hit_rate": f"{hit_rate:.0f}%"
        }

Common Mistakes That Break Caching

1. Changing the context between calls

# ❌ Adding timestamps breaks the cache
system = f"Context as of {datetime.now()}:\n{context}"

# ✅ Static context, timestamp in user message if needed
system = context
user = f"[{datetime.now()}] Fix the auth bug"

2. Reordering files in context

# ❌ Random file order = different hash = cache miss
files = os.listdir("src/")  # Order varies by filesystem

# ✅ Sort files deterministically
files = sorted(os.listdir("src/"))

3. Not batching calls

# ❌ One call per file = cache expires between sessions
for file in files:
    review(file)  # 5 min gap between files = cache expired

# ✅ Batch calls in a tight loop
reviews = [review(file) for file in files]  # All hit the cache
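If you can't batch everything, you can at least know when the cache has gone cold. A provider-agnostic sketch — the 5-minute, resets-on-hit TTL is an assumption; check your provider's current figure:

```python
import time

class CacheWarmth:
    """Track whether the provider-side prompt cache is likely still warm.

    Assumes a TTL that resets on each hit, as Anthropic and OpenAI
    roughly behave (on the order of 5 minutes)."""

    def __init__(self, ttl_seconds: float = 300.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable for testing
        self.last_call = None

    def record_call(self) -> None:
        self.last_call = self.clock()

    def likely_warm(self) -> bool:
        if self.last_call is None:
            return False
        return self.clock() - self.last_call < self.ttl
```

Call record_call() after each request; if likely_warm() is False before the next one, expect a fresh cache write rather than a discounted read.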

Real-World Savings

Here's what caching saves on a typical day of AI-assisted development:

Session: 4 hours, 35 AI calls, 80K average context

Without caching:
  35 × 80K = 2.8M input tokens
  Cost: ~$8.40 (at $3/1M tokens)

With caching:
  80K × 1.25 (1 write) + 34 × 8K (reads)
  = 100K + 272K = 372K effective tokens
  Cost: ~$1.12

Daily savings: $7.28 (87%)
Monthly savings: ~$218

That's real money, with no tradeoff in output quality.


Quick-Start Checklist

  1. ☐ Identify your static context (AGENTS.md, core source files)
  2. ☐ Build context ONCE at session start, not per call
  3. ☐ Add cache_control (Anthropic) or use static prefixes (OpenAI)
  4. ☐ Sort files deterministically in context
  5. ☐ Remove timestamps and dynamic content from cached portions
  6. ☐ Batch calls to keep cache warm
  7. ☐ Track cache hit rates and adjust

Full Caching Optimization Kit

The complete prompt caching configs, multi-provider implementations, and cost tracking dashboards are in the AI Dev Toolkit — 264 production frameworks including caching patterns for every major AI provider.


Are you using prompt caching? What's your cache hit rate? Share your setup in the comments.
