Atlas Whoff
Prompt Caching with Claude: Cut API Costs by 90% on Repeated Context

If you're sending the same large context (system prompt, documents, tool definitions) with every API call, you're paying full price each time. Prompt caching stores that context once and bills subsequent reads at roughly a tenth of the normal input-token price.

How It Works

Normal API call: full input tokens billed every request.

With caching:

  • First call: input tokens billed + small cache write fee
  • Subsequent calls: only a cache read fee (90% cheaper than full tokens)

Marking Content for Caching

import anthropic

client = anthropic.Anthropic()

# Large document you send on every request (~50k tokens)
with open('large_codebase_context.txt') as f:
    SYSTEM_DOCS = f.read()

response = client.messages.create(
    model='claude-opus-4-6',
    max_tokens=1024,
    system=[
        {
            'type': 'text',
            'text': SYSTEM_DOCS,
            'cache_control': {'type': 'ephemeral'}  # Cache this
        }
    ],
    messages=[{'role': 'user', 'content': 'What does the auth module do?'}]
)
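The cache_control marker has to sit on a structured text block, which is easy to forget when your system prompt is a plain string. A tiny helper keeps the structure right — `cached_system` is a name of my own, not an SDK function:

```python
def cached_system(text: str) -> list[dict]:
    """Wrap a system prompt in the block structure the API expects,
    with a cache_control marker on the final (only) block."""
    return [{
        'type': 'text',
        'text': text,
        'cache_control': {'type': 'ephemeral'},
    }]
```

The call above then becomes `system=cached_system(SYSTEM_DOCS)`.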

Checking Cache Usage

usage = response.usage
print(f'Input tokens:       {usage.input_tokens}')
print(f'Cache write tokens: {usage.cache_creation_input_tokens}')
print(f'Cache read tokens:  {usage.cache_read_input_tokens}')

# First call: cache_creation_input_tokens > 0
# Subsequent calls: cache_read_input_tokens > 0, cost ~10% of normal

Multi-Turn Conversation Caching

# Cache the conversation history as it grows
messages = []

def chat(user_message: str) -> str:
    messages.append({'role': 'user', 'content': user_message})

    # Mark the most recent assistant message for caching.
    # Copy the dict we modify so the original history is untouched
    # (a plain list copy shares the message dicts, so mutating one
    # in place would leak cache_control markers into `messages`).
    cached_messages = list(messages)
    if len(cached_messages) >= 2:
        last_assistant_idx = next(
            (i for i in range(len(cached_messages) - 1, -1, -1)
             if cached_messages[i]['role'] == 'assistant'),
            None
        )
        if last_assistant_idx is not None and isinstance(
                cached_messages[last_assistant_idx]['content'], str):
            marked = dict(cached_messages[last_assistant_idx])
            marked['content'] = [{
                'type': 'text',
                'text': marked['content'],
                'cache_control': {'type': 'ephemeral'}
            }]
            cached_messages[last_assistant_idx] = marked

    response = client.messages.create(
        model='claude-opus-4-6',
        max_tokens=1024,
        messages=cached_messages
    )
    reply = response.content[0].text
    messages.append({'role': 'assistant', 'content': reply})
    return reply

Cache TTL and Invalidation

  • Cache TTL: 5 minutes of inactivity (resets on each hit)
  • Invalidation: automatic — change the content and it creates a new cache entry
  • Cache breakpoints: up to 4 per request
  • Minimum cacheable prefix: about 1,024 tokens on most models (shorter blocks aren't cached)

Cost Comparison

Scenario                             Without Cache       With Cache
100 calls, 50k-token system prompt   5M tokens billed    50k write + 4.95M read tokens
Relative cost                        100%                ~15%
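The input-token arithmetic behind the table can be sketched directly, assuming the commonly cited multipliers of 1.25x for cache writes and 0.1x for cache reads. On those assumptions the input side alone lands near 11%; the table's ~15% leaves headroom for output tokens and rounding:

```python
BASE_RATE = 1.0      # relative price per normal input token
WRITE_MULT = 1.25    # assumed cache-write premium
READ_MULT = 0.10     # assumed cache-read discount

calls, prompt = 100, 50_000

without_cache = calls * prompt * BASE_RATE
with_cache = (prompt * WRITE_MULT                    # first call writes the cache
              + (calls - 1) * prompt * READ_MULT)    # 99 calls read it

print(f'{with_cache / without_cache:.0%}')  # prints 11%
```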

Best Candidates for Caching

  • Large system prompts with tool definitions
  • Codebase context for code review
  • Document analysis (contracts, technical specs)
  • Long conversation history
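Tool definitions from the first bullet are a natural fit: they rarely change between calls, and marking the last tool in the list caches the whole tool block as a prefix, the same way a marker on a system block caches everything before it. A sketch, with `mark_last_for_cache` as a hypothetical helper:

```python
def mark_last_for_cache(tools: list[dict]) -> list[dict]:
    """Copy the tool list and put a cache_control marker on the final
    tool, so everything up to and including it becomes a cached prefix."""
    marked = [dict(t) for t in tools]
    if marked:
        marked[-1]['cache_control'] = {'type': 'ephemeral'}
    return marked

cached_tools = mark_last_for_cache([
    {
        'name': 'search_code',
        'description': 'Search the codebase for a symbol or string.',
        'input_schema': {
            'type': 'object',
            'properties': {'query': {'type': 'string'}},
            'required': ['query'],
        },
    },
])
```

Pass `tools=cached_tools` to `client.messages.create` alongside a cached system block; both count toward the 4-breakpoint limit.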

Building AI features? The AI SaaS Starter Kit ships with Claude API routes pre-configured including prompt caching, streaming, and token tracking. $99 at whoffagents.com.
