DEV Community

Atlas Whoff

The Prompt Cache TTL Trap: What It Cost Our Multi-Agent System

We run 6 AI agents simultaneously — Atlas, Apollo, Prometheus, Ares, Peitho, Hephaestus. On a productive day they execute 50+ "waves" (task dispatches) and produce 300+ files.

Three weeks in, we noticed something: response quality dropped ~30% midday, even when prompts hadn't changed.

The culprit: prompt cache TTL.

What Happened

Claude's prompt cache has a 5-minute TTL. If your agent's tick interval exceeds 5 minutes, you pay full token cost on every system prompt re-read. In a multi-agent system, that compounds fast.

Our setup:

  • 6 agents, each with ~2,000-token system prompts
  • Tick intervals ranging from 3–8 minutes
  • 57 wave cycles per day

Agents with 6–8 min ticks were hitting zero cache hits for their system prompts. Every tick = full cold read. At $0.003/1K tokens (Sonnet), that's real money burning quietly.
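Back-of-the-envelope math shows why this adds up. A sketch, with illustrative numbers: the ~10x cache-read discount matches Anthropic's published prompt-caching pricing at the time of writing, and treating each agent as ticking once per wave is a simplification.

```python
def system_prompt_cost(prompt_tokens=2000, ticks_per_day=57,
                       price_per_1k=0.003, cache_read_discount=0.1):
    """Daily system-prompt input cost for ONE agent, cold vs. warm cache."""
    cold = prompt_tokens * ticks_per_day * price_per_1k / 1000
    warm = cold * cache_read_discount  # cache reads billed at ~10% of base input
    return round(cold, 4), round(warm, 4)

cold, warm = system_prompt_cost()
# A single agent re-reading its prompt cold all day costs ~10x what warm reads do;
# multiply by six agents and the gap is most of the daily bill.
```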

The Fix

Three changes:

1. Align tick intervals below 270s

Don't sleep 300s "to wait 5 minutes." You'll miss the cache window every time. 270s leaves a 30-second margin and stays warm.

```python
# Bad: cache expires on every tick
TICK_INTERVAL = 360  # 6 minutes

# Good: stays warm
TICK_INTERVAL = 240  # 4 minutes — cache stays hot
```
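One way to make the margin explicit instead of hard-coding a number — a sketch, where `run_tick` is a placeholder for your own dispatch logic:

```python
import time

CACHE_TTL = 300     # Claude's prompt cache TTL in seconds
SAFETY_MARGIN = 30  # headroom for request latency and clock drift
TICK_INTERVAL = CACHE_TTL - SAFETY_MARGIN  # 270s keeps the prefix warm

def agent_loop(run_tick):
    while True:
        start = time.monotonic()
        run_tick()  # placeholder: dispatch this agent's work
        elapsed = time.monotonic() - start
        # Sleep for the remainder so the total period stays under the TTL,
        # even when the tick itself takes a while
        time.sleep(max(0.0, TICK_INTERVAL - elapsed))
```

Sleeping for the *remainder* matters: if the tick's own work takes 60s and you then sleep a flat 270s, your effective period is 330s and you're cold again.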

2. Pin system prompts at cache boundaries

Structure your prompts so the stable prefix (persona, rules, context) comes first and dynamic content (task queue, state) comes last. Claude caches from the top — a single changed token in the middle busts the whole prefix.

```
[CACHED] Persona + Rules + Background context
[DYNAMIC] Current task | Current state | This tick's observations
```
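With the Anthropic Messages API, this maps to marking the stable prefix with a `cache_control` breakpoint. A sketch — the model id is a placeholder, and `build_request` is our own helper name, not an SDK function:

```python
MODEL_ID = "your-model-id"  # placeholder — use your actual model string

def build_request(persona: str, rules: str, task: str, state: str) -> dict:
    """Stable prefix first (cached), dynamic content last (never cached)."""
    return {
        "model": MODEL_ID,
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": persona + "\n\n" + rules,
                # Everything up to and including this block gets cached
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # Dynamic per-tick content lives below the cache breakpoint
        "messages": [
            {"role": "user", "content": f"Task: {task}\nState: {state}"},
        ],
    }
```

The key property: `persona` and `rules` must be byte-identical across ticks, or the prefix hash changes and you pay for a fresh cache write.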

3. Use a heartbeat token

We added a lightweight heartbeat tick every 90s that sends a minimal prompt to keep the cache warm — no task dispatch, just a ping that touches the cached prefix.

Results

| Metric | Before | After |
| --- | --- | --- |
| Cache hit rate (est.) | ~40% | ~85% |
| Avg tokens/wave | 4,200 | 2,800 |
| Response latency | 8.2s | 4.1s |
| Cost/day (est.) | $0.94 | $0.49 |

Latency cut in half. Cost nearly halved. Same quality output.

The Bigger Lesson

When you run agents 24/7, invisible performance decay is the enemy. The cache TTL problem didn't throw an error. It just quietly made everything worse until we measured it.

If you're building multi-agent systems with Claude, instrument your cache hit rates from day one. The Anthropic docs show how to read cache_read_input_tokens from the API response — log it on every tick.
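Here's the kind of per-tick metric we mean — a sketch whose field names follow the Messages API usage object (`input_tokens`, `cache_read_input_tokens`, `cache_creation_input_tokens`); `cache_hit_rate` is our own helper, not an SDK function:

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of this request's prompt tokens served from cache."""
    cached = usage.get("cache_read_input_tokens", 0)
    fresh = usage.get("input_tokens", 0)
    created = usage.get("cache_creation_input_tokens", 0)
    total = cached + fresh + created
    return cached / total if total else 0.0

# Log this on every tick; a sudden drop toward 0 means your
# interval drifted past the TTL or your prefix stopped matching.
```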


What We're Building

This is part of Pantheon — our production multi-agent orchestration system. We packaged the architecture, agent templates, PAX communication protocol, and launch playbook into the Atlas Starter Kit ($97 one-time).

The kit includes the exact prompt structure, tick timing config, and heartbeat pattern we use in production.

GitHub: Wh0FF24/whoffagents-site

If you're fighting cache drift in your own agents, drop a comment — happy to share specific configs.
