dohko
Prompt Caching for AI Coding: The $0 Optimization That Saves 60% on Every Call


If you're making multiple AI coding calls against the same codebase in a session — and you're NOT using prompt caching — you're burning money for no reason.

Prompt caching is supported by Anthropic, OpenAI, Google, and most providers. It's free to enable. It cuts input token costs by 60-90%.

Here's exactly how to implement it across every major provider.


How Prompt Caching Works (30-Second Version)

Without caching:

Call 1: Send 50K context + 500 token prompt  → Pay for 50,500 tokens
Call 2: Send 50K context + 800 token prompt  → Pay for 50,800 tokens
Call 3: Send 50K context + 300 token prompt  → Pay for 50,300 tokens
Total input tokens billed: 151,600

With caching:

Call 1: Send 50K context (cached) + 500 prompt → Pay 50,500 (cache write)
Call 2: Cache hit on 50K + 800 prompt          → Pay 5,800 (90% discount)
Call 3: Cache hit on 50K + 300 prompt          → Pay 5,300 (90% discount)
Total input tokens billed: 61,600  (59% savings)

The more calls you make in a session, the more you save. By call 10, the average cost per call has dropped by nearly 80%.
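The arithmetic above generalizes to a small helper. This is a sketch using Anthropic-style multipliers as defaults (1.25× cache write, 90% read discount); pass write_premium=1.0 to reproduce the simplified numbers above, which omit the write premium.

```python
def billed_tokens(context: int, prompts: list[int], cached: bool,
                  write_premium: float = 1.25,
                  read_discount: float = 0.10) -> int:
    """Estimate billed input tokens for a session sharing one static context.

    Defaults are Anthropic-style multipliers; other providers differ."""
    if not cached:
        return sum(context + p for p in prompts)
    # First call writes the cache (at a premium), every later call reads it
    total = int(context * write_premium) + prompts[0]
    for p in prompts[1:]:
        total += int(context * read_discount) + p
    return total
```

For the three-call example above: 151,600 tokens uncached vs 61,600 with the simplified (no-premium) caching model.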


Anthropic (Claude) — Prompt Caching

Anthropic's caching is the most straightforward: add cache_control to any system or content block.

import anthropic

client = anthropic.Anthropic()

# Your static project context — this gets cached
project_context = open("AGENTS.md").read() + "\n" + open("src/main.ts").read()

def coding_call(prompt: str):
    return client.messages.create(
        model="claude-sonnet-4.6",
        max_tokens=4096,
        system=[{
            "type": "text",
            "text": project_context,
            "cache_control": {"type": "ephemeral"}  # This enables caching
        }],
        messages=[{
            "role": "user",
            "content": prompt
        }]
    )

# First call: cache miss (normal price + small write fee)
r1 = coding_call("Find bugs in the auth handler")

# Second call: cache HIT (90% cheaper on the 50K context)
r2 = coding_call("Write tests for the auth handler")

# Third call: still cached
r3 = coding_call("Add error logging to the auth handler")

Key rules for Anthropic caching:

  • Cache lives for 5 minutes (resets on each hit)
  • Minimum cacheable prefix: 1,024 tokens on most models (2,048 on Haiku)
  • The cached text must be EXACTLY the same — byte-for-byte
  • Cache write costs 25% more; cache read costs 90% less

Cost math for a typical session:

10 calls, 50K context each:

Without caching:  10 × 50K = 500K input tokens
With caching:     50K × 1.25 (write) + 9 × 5K (reads) = 107.5K effective tokens
Savings: 78%

OpenAI (GPT) — Automatic Caching

OpenAI caches automatically — no code changes needed. But there are tricks to maximize hit rates.

from openai import OpenAI

client = OpenAI()

# The KEY: keep your system message and prefix IDENTICAL across calls
system_message = """You are a senior developer working on this project.

Project context:
""" + open("AGENTS.md").read() + """

Source files:
""" + open("src/main.ts").read()

def coding_call(prompt: str):
    return client.chat.completions.create(
        model="gpt-5.4",
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": prompt}
        ]
    )

# OpenAI automatically caches the system message prefix
# Subsequent calls with the same prefix get 50% discount on input tokens

OpenAI caching rules:

  • Automatic for prompts > 1,024 tokens
  • 50% discount on cached input tokens (not 90% like Anthropic)
  • Cache persists for 5-10 minutes of inactivity
  • The prefix must match exactly — even whitespace matters

Maximizing OpenAI cache hits:

# ❌ BAD: Different prefix each time
messages = [
    {"role": "system", "content": f"Context: {context}"},
    {"role": "user", "content": "Fix bug #142"},
    {"role": "assistant", "content": previous_response},  # This changes!
    {"role": "user", "content": "Now add tests"}
]

# ✅ GOOD: Static prefix, dynamic suffix
messages = [
    {"role": "system", "content": context},        # STATIC — gets cached
    {"role": "user", "content": combined_prompt}    # All dynamic content here
]
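Since even whitespace breaks the match, it's worth asserting prefix stability in code rather than trusting yourself. A minimal guard (PrefixGuard is a hypothetical helper name, not a library API):

```python
import hashlib

def prefix_fingerprint(prefix: str) -> str:
    """Stable hash of the static prefix; if it changes, the next call misses."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:16]

class PrefixGuard:
    """Fail fast if the 'static' prefix drifts between calls."""

    def __init__(self, prefix: str):
        self.expected = prefix_fingerprint(prefix)

    def check(self, prefix: str) -> None:
        actual = prefix_fingerprint(prefix)
        if actual != self.expected:
            raise ValueError(
                f"prefix changed ({self.expected} -> {actual}): cache miss ahead")
```

Call guard.check(system_message) before each request; a single stray space will raise instead of silently costing you the full input price.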

Google (Gemini) — Explicit Context Caching

Google's caching is the most powerful for large contexts — it supports up to 2M tokens cached.

import datetime

import google.generativeai as genai

# Create a cached context (persists for hours, not minutes)
cache = genai.caching.CachedContent.create(
    model="gemini-2.5-pro",
    display_name="my-project-context",
    contents=[{
        "role": "user",
        "parts": [{"text": project_context}]  # Your full codebase context
    }],
    ttl=datetime.timedelta(hours=1)  # Cache for 1 hour
)

# Use the cached context
model = genai.GenerativeModel.from_cached_content(cache)

# Every call now uses cached context — massive savings
r1 = model.generate_content("Find security issues")
r2 = model.generate_content("Optimize the database queries")
r3 = model.generate_content("Write integration tests")

# Cleanup when done
cache.delete()

Google caching advantages:

  • Longest TTL (hours vs minutes)
  • Largest cacheable context (2M tokens)
  • Cheapest cached reads (75% discount)
  • You control the cache lifecycle explicitly
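One caveat: Google bills a per-token-hour storage fee for explicit caches, so caching only pays off above a certain call volume. A back-of-envelope check — all three default prices below are illustrative placeholders, not authoritative Gemini pricing:

```python
def explicit_cache_saves_money(context_tokens: int, calls: int, hours: float,
                               input_price: float = 1.25,   # assumed $/1M input
                               cached_price: float = 0.31,  # assumed $/1M cached reads
                               storage_price: float = 4.50  # assumed $/1M tokens/hour
                               ) -> bool:
    """True if an explicit cache beats resending the context every call.

    With cache: one full-price write, discounted reads, plus storage rent.
    Without:    full input price on every call."""
    m = context_tokens / 1_000_000
    without_cache = calls * m * input_price
    with_cache = (m * input_price                 # initial cache write
                  + calls * m * cached_price      # discounted reads
                  + m * storage_price * hours)    # storage rent
    return with_cache < without_cache
```

Under these placeholder prices, a 1M-token context cached for an hour pays off at ten calls but not at one — check current pricing before relying on the threshold.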

The Universal Caching Pattern

Regardless of provider, this pattern maximizes cache hits:

from pathlib import Path

class CachedCodingSession:
    """Provider-agnostic cached coding session.

    _get_relevant_files and _provider_call are provider-specific stubs."""

    def __init__(self, project_path: str, provider: str = "anthropic"):
        self.provider = provider
        self.context = self._build_context(project_path)
        self.call_count = 0
        self.estimated_savings = 0.0

    def _build_context(self, path: str) -> str:
        """Build static context ONCE at session start"""
        parts = []

        # Always include AGENTS.md
        agents_md = Path(path) / "AGENTS.md"
        if agents_md.exists():
            parts.append(f"# Project Context\n{agents_md.read_text()}")

        # Include relevant source files
        for f in self._get_relevant_files(path):
            parts.append(f"# {f.name}\n```\n{f.read_text()}\n```")

        return "\n\n".join(parts)

    def call(self, prompt: str) -> str:
        """Make a cached call"""
        self.call_count += 1

        if self.call_count > 1:
            # Estimate savings from a cache hit: ~90% off $3/1M input tokens
            context_tokens = len(self.context) // 4  # rough chars-to-tokens estimate
            self.estimated_savings += context_tokens * 3e-6 * 0.9

        return self._provider_call(prompt)

    def stats(self):
        hit_rate = (self.call_count - 1) / self.call_count * 100 if self.call_count else 0
        return {
            "calls": self.call_count,
            "estimated_savings": f"${self.estimated_savings:.2f}",
            "cache_hit_rate": f"{hit_rate:.0f}%"
        }

Common Mistakes That Break Caching

1. Changing the context between calls

# ❌ Adding timestamps breaks the cache
system = f"Context as of {datetime.now()}:\n{context}"

# ✅ Static context, timestamp in user message if needed
system = context
user = f"[{datetime.now()}] Fix the auth bug"

2. Reordering files in context

# ❌ Random file order = different hash = cache miss
files = os.listdir("src/")  # Order varies by filesystem

# ✅ Sort files deterministically
files = sorted(os.listdir("src/"))

3. Not batching calls

# ❌ One call per file = cache expires between sessions
for file in files:
    review(file)  # 5 min gap between files = cache expired

# ✅ Batch calls in a tight loop
reviews = [review(file) for file in files]  # All hit the cache
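If you can't batch everything, you can at least know when the cache has gone cold. A provider-agnostic sketch — the 5-minute, resets-on-hit TTL is an assumption; check your provider's current figure:

```python
import time

class CacheWarmth:
    """Track whether the provider-side prompt cache is likely still warm.

    Assumes a TTL that resets on each hit, as Anthropic and OpenAI
    roughly behave (on the order of 5 minutes)."""

    def __init__(self, ttl_seconds: float = 300.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable for testing
        self.last_call = None

    def record_call(self) -> None:
        self.last_call = self.clock()

    def likely_warm(self) -> bool:
        if self.last_call is None:
            return False
        return self.clock() - self.last_call < self.ttl
```

Call record_call() after each request; if likely_warm() is False before the next one, expect a fresh cache write rather than a discounted read.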

Real-World Savings

Here's what caching saves on a typical day of AI-assisted development:

Session: 4 hours, 35 AI calls, 80K average context

Without caching:
  35 × 80K = 2.8M input tokens
  Cost: ~$8.40 (at $3/1M tokens)

With caching:
  80K × 1.25 (1 write) + 34 × 8K (reads)
  = 100K + 272K = 372K effective tokens
  Cost: ~$1.12

Daily savings: $7.28 (87%)
Monthly savings: ~$218

That's real money, with no tradeoff in output quality.


Quick-Start Checklist

  1. ☐ Identify your static context (AGENTS.md, core source files)
  2. ☐ Build context ONCE at session start, not per call
  3. ☐ Add cache_control (Anthropic) or use static prefixes (OpenAI)
  4. ☐ Sort files deterministically in context
  5. ☐ Remove timestamps and dynamic content from cached portions
  6. ☐ Batch calls to keep cache warm
  7. ☐ Track cache hit rates and adjust

Full Caching Optimization Kit

The complete prompt caching configs, multi-provider implementations, and cost tracking dashboards are in the AI Dev Toolkit — 264 production frameworks including caching patterns for every major AI provider.


Are you using prompt caching? What's your cache hit rate? Share your setup in the comments.
