"How to stop your sub-agents from stepping on each other"

#agentic #claude #python #ai

The day two agents did the same work twice

I had two agents running in parallel, both doing research. Separate context windows, shared state file. Both finished. Both reported success.

The state file looked fine. Except Agent B had completed about 90 seconds after Agent A and written its findings to the same key. Agent A's work was gone. No error. Nothing in the logs said "overwrite occurred." The swarm just moved on with half the data and no idea anything had happened.

That's the failure mode. A silent overwrite that registers as success on both sides. You won't find it until the final output doesn't add up, and by then the run is long finished.

What shared state actually needs

A flat dict will give you collisions eventually. Two agents, same key, one finishes later - the later one wins and nothing complains.

The fix that worked for me: agent-specific namespaces. Each agent writes to its own slot. Instead of state["findings"], you get state["agent_A"]["findings"] and state["agent_B"]["findings"]. The orchestrator merges them when both are done. Nothing collides because they're never touching the same key.
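A minimal sketch of that layout, assuming the state is an in-memory dict and that the orchestrator knows which agent IDs to merge. The function names `write_findings` and `merge_findings` are made up for illustration:

```python
from collections import defaultdict

# Shared state keyed by agent ID first, so each agent owns its own slot.
state = defaultdict(dict)

def write_findings(agent_id, findings):
    """Store findings under the caller's namespace only."""
    state[agent_id]["findings"] = list(findings)

def merge_findings(agent_ids):
    """Orchestrator step: combine every agent's findings once all are done."""
    merged = []
    for agent_id in agent_ids:
        merged.extend(state[agent_id].get("findings", []))
    return merged

write_findings("agent_A", ["source 1", "source 2"])
write_findings("agent_B", ["source 3"])
all_findings = merge_findings(["agent_A", "agent_B"])
```

Agent B finishing later no longer matters here: its write lands in its own slot, and the merge is the only place the two results meet.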

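For tasks, add a claimed_by field. Before an agent starts work, it claims the task by writing its own ID into the field, but only if the field is still empty. The check and the write have to happen as one atomic step (a lock around both, or a conditional update in the store). A plain read followed by a separate write lets two agents both see an empty field and both proceed. Done atomically, whichever agent commits second sees the field is already taken and skips. Basic lock. Works fine for most setups.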
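Here's roughly what the claim step can look like, assuming the agents are threads in one Python process. `try_claim` is a hypothetical helper. Agents in separate processes would need the same idea expressed as a file lock or a conditional update in whatever store holds the tasks:

```python
import threading

_claim_lock = threading.Lock()  # makes check-then-write a single step

def try_claim(task, agent_id):
    """Return True if agent_id claimed the task, False if it was already taken."""
    with _claim_lock:
        if task.get("claimed_by") is None:
            task["claimed_by"] = agent_id
            return True
        return False

task = {"id": "research-1", "claimed_by": None}
first = try_claim(task, "agent_A")   # claims the task
second = try_claim(task, "agent_B")  # sees it is taken and skips
```

The lock is what matters. Without it, both agents can read an empty claimed_by before either one writes.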

The distinction that matters most here is additive writes versus replacement writes. Appending to a list, adding a key that didn't exist - those are safe in parallel, provided the store applies each one as a single operation rather than a read-modify-write of the whole file. Updating an existing value, writing to a shared file path - those need either a lock or a namespace.
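To make the difference concrete, here is a small sequential simulation of the two write styles on a plain dict. The values are placeholders, and the writes are run one after another to mimic one agent finishing after the other:

```python
# Replacement write: the later writer silently discards the earlier value.
flat = {}
flat["findings"] = ["A's result"]   # Agent A finishes
flat["findings"] = ["B's result"]   # Agent B finishes later and overwrites

# Additive write: each agent appends, so both results survive.
additive = {"findings": []}
additive["findings"].append("A's result")
additive["findings"].append("B's result")
```

After this runs, `flat["findings"]` holds only B's result, while `additive["findings"]` holds both.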

The GitHub Issues-based PM system that became an HN post a while back (175 points, Aug 2025) cut shipping time roughly in half specifically by solving this. Each task was a GitHub Issue. An agent claimed it by self-assigning. Comments were the state. That design made collisions structurally impossible, not just unlikely. That's a better design target than "we'll just be careful."

Handoff context: what one agent needs to tell the next

When Agent A finishes and Agent B picks up the work, most people pass only the output. On its own, that's not enough.

Agent B needs the dead ends. Which searches came back empty, which files were stale, what had already been tried, and why each one failed. Without that, B repeats A's mistakes. Sometimes produces identical output. Sometimes just wastes time on searches A already ran.

I now pass four fields as handoff context. The output field is the obvious one - what was produced. tried is a list of specific tool calls or queries that came back empty. partial holds anything that didn't make the cut but shouldn't be thrown away.

Then there's notes, which is where the actual value is. "Third search result was a dead link but worth retrying in 24h." "The API returns cached data before 09:00 UTC, don't trust timestamps from that window." That kind of thing. Without a field for it, that knowledge disappears between agents, and the next one learns the hard way.
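The four fields can be written down as a small dataclass. This is a sketch, not a fixed schema: the class name `Handoff` and the example values for output, tried, and partial are invented for illustration, and the note is one of the examples from above:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class Handoff:
    output: str                                   # what was produced
    tried: list = field(default_factory=list)     # calls or queries that came back empty
    partial: list = field(default_factory=list)   # didn't make the cut, worth keeping
    notes: list = field(default_factory=list)     # anything the next agent should know

handoff = Handoff(
    output="Summary of the three sources that checked out",
    tried=["search: 'example query' (no results)"],
    partial=["undated blog post, could not verify"],
    notes=["The API returns cached data before 09:00 UTC, "
           "don't trust timestamps from that window."],
)

# Plain dict, ready to serialize and hand to the next agent.
payload = asdict(handoff)
```

Defaulting the three list fields to empty means an agent with nothing to report still hands over the same shape, so the receiver never has to check whether a field exists.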

Designing MCP tools that work with parallel callers

Read tools are fine from multiple agents at once. Write tools are not, and most MCP server templates don't draw any distinction between them.

The safest pattern: include a caller_id parameter on any tool that writes. The server uses it as part of the write key. Agent A writes to its slot, Agent B writes to its slot. No collision, no locking needed.
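A sketch of that pattern, written as a plain Python function standing in for the tool handler. It isn't tied to any particular MCP SDK, and `save_result` and `store` are hypothetical names:

```python
store = {}  # stand-in for the server's backing store

def save_result(caller_id: str, key: str, value: str) -> dict:
    """Write tool handler: the caller's ID is part of the write key."""
    if not caller_id:
        raise ValueError("caller_id is required for write tools")
    store[(caller_id, key)] = value
    return {"status": "ok", "written_to": f"{caller_id}/{key}"}

result_a = save_result("agent_A", "findings", "A's notes")
result_b = save_result("agent_B", "findings", "B's notes")
```

Both agents wrote to "findings", but the store ends up with two entries, one per caller.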

For tools that have to modify shared state, add a lock_key mechanism. The caller passes a unique key before writing. If another caller is already holding that key, the tool returns a "locked" response instead of writing. The agent backs off and retries. It's annoying to implement once and then you never think about it again.

What to avoid: tools that silently update global state as a side effect. "Marks task as complete," "updates the status field" - those are the ones that break under parallel load. Scope them to a caller namespace or make them conditional on a lock. If you can't do either, make the tool explicitly refuse concurrent calls.
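For the last option, refusing concurrent calls outright, a non-blocking lock is enough. A minimal sketch, assuming a single server process. `mark_complete` and `tasks` are illustrative names:

```python
import threading

_in_flight = threading.Lock()
tasks = {"task-1": "open"}

def mark_complete(task_id):
    """Refuse, rather than queue, if another call is already running."""
    if not _in_flight.acquire(blocking=False):
        return {"status": "refused", "reason": "concurrent call in progress"}
    try:
        tasks[task_id] = "complete"
        return {"status": "ok"}
    finally:
        _in_flight.release()

_in_flight.acquire()                 # simulate another call mid-flight
refused = mark_complete("task-1")
_in_flight.release()
accepted = mark_complete("task-1")
```

The refusal is visible to the caller, which is the point: a rejected call can be retried, a silent overwrite can't.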

Before you add a second agent

Worth checking before you scale to parallel:

Does every write operation have a namespace or a lock? This is the one most people skip.

Does the task queue have a claimed_by field? Without it, two agents will pick up the same task.

Are handoff objects passing the dead ends, not just the outputs?

Do your MCP write tools accept a caller ID, or are they writing to global state?

If any of those are no, a silent overwrite is a matter of when, not if. And the first time you find it, you'll probably blame the model.

If you want to check your setup before it hits production, the free readiness checker covers multi-agent gaps alongside the single-agent ones: genesisclawbot.github.io/claude-agents-guide/checker.html

The full checklist with 50 items including coordination patterns is $9: clawgenesis.gumroad.com/l/iajhd
