Scaling from 4 to 8 Claude Code subagents pushed my error rate from 0.8% to 4.3%. The bottleneck wasn't the model.
The culprit was a stateful MCP tool called analytics_query that held pagination cursors, mid-aggregation values, and filter chains in instance memory between calls. Cloudflare Workers routes each request to whichever instance is available at the PoP, with no guarantee you land on the same one twice. At 4 subagents, requests happened to land on the same instance often enough that sessions stayed sticky by accident. At 8, requests spread across more instances and context misses grew much faster than the subagent count: doubling the subagents more than quintupled the error rate (0.8% to 4.3%). The error looked like this:
Error: Tool call failed — session context not found
session_id: "sess_7f3a9b"
worker_instance: "worker-11"
expected_instance: "worker-04"
The session ID existed. The worker didn't match. State was gone.
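To make the failure concrete, here is a minimal sketch of the pattern, not the actual analytics_query code. The SessionContext shape, the WorkerInstance class, and the nextPage method are hypothetical names made up for illustration. Each instance keeps its own private map, so a follow-up call that lands on a different instance finds nothing.

```typescript
// Hypothetical shape of the context the tool kept between calls.
interface SessionContext {
  cursor: number;      // pagination cursor
  partialSum: number;  // mid-aggregation value
  filters: string[];   // filter chain
}

// Stand-in for one worker instance. Its map lives in that instance's memory
// only; nothing shares it with other instances.
class WorkerInstance {
  private sessions = new Map<string, SessionContext>();
  private name: string;

  constructor(name: string) {
    this.name = name;
  }

  startSession(sessionId: string): void {
    this.sessions.set(sessionId, { cursor: 0, partialSum: 0, filters: [] });
  }

  // Follow-up call: succeeds only if this same instance created the session.
  nextPage(sessionId: string, pageTotal: number): SessionContext {
    const ctx = this.sessions.get(sessionId);
    if (ctx === undefined) {
      throw new Error(
        `session context not found (session_id: ${sessionId}, worker_instance: ${this.name})`
      );
    }
    ctx.cursor += 1;
    ctx.partialSum += pageTotal;
    return ctx;
  }
}

const worker04 = new WorkerInstance("worker-04");
const worker11 = new WorkerInstance("worker-11");

worker04.startSession("sess_7f3a9b");
// Same instance as the one that started the session: the context is there.
const sameInstance = worker04.nextPage("sess_7f3a9b", 10);

// Different instance: the session ID is valid, but the state never lived here.
let missMessage = "";
try {
  worker11.nextPage("sess_7f3a9b", 10);
} catch (err) {
  missMessage = (err as Error).message;
}
```

The session ID is perfectly valid in both calls. The only thing that changes is which instance's memory gets consulted.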
I ran two fixes side by side. KV-based session storage (serialize the whole context, read at call start, write at call end) solved the routing problem but created a new one: at 8 concurrent subagents, KV writes multiplied to ~16x my estimate. Under load, p99 latency jumped from 180ms to 620ms per tool call, and the write cost alone crossed $150/month at my volume.
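The read-at-start, write-at-end pattern looks roughly like the sketch below. KVLike is a stand-in interface covering only the get and put calls the pattern needs, CountingKV is an in-memory fake so the example runs outside Workers, and runToolCall and the session: key prefix are hypothetical. The point it illustrates is that every tool call pays for a full read and a full write of the serialized context, no matter how little changed.

```typescript
// Stand-in for the two KV operations this pattern relies on.
interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string): Promise<void>;
}

// In-memory fake that counts operations, so the write volume is visible.
class CountingKV implements KVLike {
  private data = new Map<string, string>();
  reads = 0;
  writes = 0;

  async get(key: string): Promise<string | null> {
    this.reads += 1;
    const value = this.data.get(key);
    return value === undefined ? null : value;
  }

  async put(key: string, value: string): Promise<void> {
    this.writes += 1;
    this.data.set(key, value);
  }
}

// Hypothetical context shape, as in the failure description above.
interface SessionContext {
  cursor: number;
  partialSum: number;
  filters: string[];
}

// One tool call: load the whole context, mutate it, store the whole context.
async function runToolCall(
  kv: KVLike,
  sessionId: string,
  pageTotal: number
): Promise<SessionContext> {
  const key = `session:${sessionId}`;
  const raw = await kv.get(key);
  const ctx: SessionContext =
    raw !== null ? JSON.parse(raw) : { cursor: 0, partialSum: 0, filters: [] };

  ctx.cursor += 1;
  ctx.partialSum += pageTotal;

  // Unconditional write at the end of every call.
  await kv.put(key, JSON.stringify(ctx));
  return ctx;
}
```

Any instance can now serve any call, which fixes the routing problem. The cost is one write per tool call per subagent, which is where the write multiplication comes from.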
Durable Objects solved it cleanly. Route by session ID and you always hit the same DO instance — session affinity handled at the platform level, not in my code. Same load, p99 dropped to 38ms. Monthly cost settled around $40–60.
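Routing by session ID comes down to deriving the object ID from the session name. On the real platform that is idFromName followed by get on the Durable Object namespace binding. The sketch below uses stand-in interfaces (SessionNamespaceLike, SessionStubLike) and a fake namespace so it runs outside Workers, and stubForSession is a hypothetical helper. The stand-ins use a plain string where the platform uses an ID object.

```typescript
// Stand-in for a handle that addresses one Durable Object instance.
interface SessionStubLike {
  readonly instanceId: string;
}

// Stand-in for the namespace binding: a name maps deterministically to an ID,
// and an ID maps to a handle on the one instance behind it.
interface SessionNamespaceLike {
  idFromName(name: string): string;
  get(id: string): SessionStubLike;
}

// Fake namespace: one instance per ID, created on first use.
class FakeNamespace implements SessionNamespaceLike {
  private instances = new Map<string, SessionStubLike>();

  idFromName(name: string): string {
    return `do:${name}`;
  }

  get(id: string): SessionStubLike {
    let stub = this.instances.get(id);
    if (stub === undefined) {
      stub = { instanceId: id };
      this.instances.set(id, stub);
    }
    return stub;
  }
}

// Hypothetical helper: every call for a session resolves to the same target,
// so there is no affinity logic in the tool code itself.
function stubForSession(
  ns: SessionNamespaceLike,
  sessionId: string
): SessionStubLike {
  return ns.get(ns.idFromName(sessionId));
}

const sessionNamespace = new FakeNamespace();
const firstCall = stubForSession(sessionNamespace, "sess_7f3a9b");
const secondCall = stubForSession(sessionNamespace, "sess_7f3a9b");
const otherSession = stubForSession(sessionNamespace, "sess_other");
```

In the fake, the two calls for the same session return the identical object. On the real platform the stubs are separate handles, but they address the same instance.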
The tradeoff nobody mentions upfront: DO instances get evicted on idle, and when that happens the in-memory state silently vanishes. The agent has no idea and keeps going. That failure mode is quieter and scarier than KV latency spikes, which at least show up in dashboards immediately.
What I landed on after six months: DO memory for active sessions, DO Storage checkpoints at the end of each tool call (~$10/month extra), and KV only as a routing index — read-heavy, nearly free. Three layers, but each one has a distinct failure mode you can actually isolate.
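Here is a rough sketch of the first two layers, with the KV routing index left out. StorageLike is a stand-in for the get and put calls on Durable Object storage, FakeStorage is an in-memory fake, and SessionObjectSketch, handleToolCall, and the checkpoint key are hypothetical names. The idea is that memory serves the hot path, every tool call ends with a checkpoint, and an instance that wakes up with empty memory rehydrates from the last checkpoint instead of silently starting over.

```typescript
// Hypothetical context shape.
interface SessionContext {
  cursor: number;
  partialSum: number;
  filters: string[];
}

// Stand-in for the durable storage attached to one object.
interface StorageLike {
  get(key: string): Promise<SessionContext | undefined>;
  put(key: string, value: SessionContext): Promise<void>;
}

// In-memory fake. Values are stored as JSON so callers never share references
// with the stored copy.
class FakeStorage implements StorageLike {
  private data = new Map<string, string>();

  async get(key: string): Promise<SessionContext | undefined> {
    const raw = this.data.get(key);
    return raw === undefined ? undefined : JSON.parse(raw);
  }

  async put(key: string, value: SessionContext): Promise<void> {
    this.data.set(key, JSON.stringify(value));
  }
}

class SessionObjectSketch {
  // Lost whenever the instance is evicted.
  private memory: SessionContext | null = null;
  private storage: StorageLike;

  constructor(storage: StorageLike) {
    this.storage = storage;
  }

  async handleToolCall(pageTotal: number): Promise<SessionContext> {
    let ctx = this.memory;
    if (ctx === null) {
      // Cold start or post-eviction: rehydrate from the last checkpoint.
      const saved = await this.storage.get("checkpoint");
      ctx = saved !== undefined ? saved : { cursor: 0, partialSum: 0, filters: [] };
      this.memory = ctx;
    }

    ctx.cursor += 1;
    ctx.partialSum += pageTotal;

    // Checkpoint at the end of every tool call.
    await this.storage.put("checkpoint", ctx);
    return ctx;
  }
}
```

Eviction can be simulated by constructing a new SessionObjectSketch over the same storage: memory starts empty, but the next call picks up from the checkpoint rather than from zero.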
The 6-subagent mark was my inflection point. Below it, you might not see this problem at all. Above it, the session collision math gets ugly fast.
I wrote up the full breakdown — including the checkpoint timing problem (DO idle eviction is less predictable than the docs suggest) and what happens when multiple subagents hit the same session simultaneously — over on riversealab.com.