Kai
The coordination problem nobody warns you about when you start running AI agent teams

You started with one agent. It worked. You added another. Still fine. By agent three or four, something weird happened: you became the bottleneck.

Not because the agents are slow. Because you are the one holding all the state.

You know which agent is working on what. You remember that agent B was blocked waiting on agent A's output. You're the one who spots when two agents are about to do the same thing. You're the PM, the message router, the conflict resolver — and none of that was in the plan.

This is the coordination problem. It doesn't show up until you have a few agents running, and when it does, it's subtle: the agents are doing fine. The system, as a whole, is fragile.


What coordination failure looks like in practice

Duplicate work. Two agents pick up the same task because neither one had a way to signal "I've got this." You find out when they both post results.

Context loss. Agent A finishes something and posts it in a chat or a file. Agent B needs to know about it, but the hand-off is implicit — a shared directory, a message in a channel, something you rigged up. Half the time it works. Half the time agent B starts from scratch.

Silent blocking. Agent A is stuck waiting for something. It either stops and says nothing, or it keeps asking you. If it keeps asking, you're back to being a PM. If it stops, you won't know until you check.

Review pile-up. Agents complete things that need a human to look at before they go out. Without a queue, you either miss them or you're constantly checking in.

None of these are bugs. They're what happens when you scale the number of agents faster than you scale the coordination layer.


Why this is hard to solve with the tools you already have

You probably tried a few things:

Shared documents / wikis. Work OK until two agents write to the same place at the same time, or until you have 30 open tasks and no one agent knows the state of all of them.

Chat channels. Good for async. Terrible for state. A task that's "in progress" looks identical to a task that's "waiting for review" if you're just reading messages.

Structured files (JSON, YAML task lists). Agents can read and write them. But there's no atomic claim operation — two agents can both read "unassigned" and both claim the same task in the same second. And file formats don't give you presence, or reviewer handoffs, or history.
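The read-then-write race is easy to see in miniature. Here's an illustrative Python sketch (an in-memory dict stands in for the JSON/YAML file): both agents read the task as unassigned before either writes, so both believe their claim succeeded, and the second write silently wins.

```python
# Two agents share a task list (a stand-in for a JSON/YAML file on disk).
tasks = {"task-1": {"assignee": None}}

def read_assignee(task_id):
    # Each agent reads the file independently...
    return tasks[task_id]["assignee"]

def write_claim(task_id, agent):
    # ...and writes its claim back, unconditionally.
    tasks[task_id]["assignee"] = agent

# The interleaving that bites you: both agents read before either writes.
a_sees = read_assignee("task-1")   # agent A sees: unassigned
b_sees = read_assignee("task-1")   # agent B sees: unassigned

if a_sees is None:
    write_claim("task-1", "agent-a")
if b_sees is None:
    write_claim("task-1", "agent-b")  # overwrites agent A's claim

# Both agents now believe they own task-1.
print(tasks["task-1"]["assignee"])  # → agent-b
```

File locks can paper over this for a single machine, but they don't give you the rest of the list below: presence, handoffs, history.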

Your own coordination script. Maybe you wrote something. If you're still maintaining it, it's because you have to.

The thing all of these share: they're workarounds. They're you solving a coordination problem at the application layer because you don't have a coordination layer.


What a coordination layer actually needs to do

At minimum:

  1. Atomic task claims. When an agent takes a task, no other agent can take it. This has to be atomic — a race condition at claim time is the whole problem.

  2. State visibility. Any agent (or human) can ask: what's the current state of every task? Who has what? What's blocked?

  3. Reviewer handoffs. When an agent finishes something that needs review, there's a defined next step: it moves to a review queue, the right reviewer sees it.

  4. Presence. The system knows which agents are alive and working. Dead agents don't hold locks forever.

  5. Async messaging. Agents need to send each other things without routing through you.

None of this needs to be complicated. It just needs to exist.
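To make that concrete, here's a minimal sketch of requirements 1, 2, and 4 in Python. This is illustrative only, not how any particular product implements it: check-and-update happens under one lock, so a claim is atomic; a snapshot answers "who has what"; and a heartbeat timeout releases tasks held by dead agents.

```python
import threading
import time

class TaskBoard:
    """Minimal coordination sketch: atomic claims, state visibility, presence."""

    def __init__(self, liveness_timeout=30.0):
        self._lock = threading.Lock()
        self._tasks = {}        # task_id -> {"state": ..., "assignee": ...}
        self._heartbeats = {}   # agent_id -> last-seen timestamp
        self._timeout = liveness_timeout

    def add(self, task_id):
        with self._lock:
            self._tasks[task_id] = {"state": "todo", "assignee": None}

    def claim(self, task_id, agent_id):
        # Atomic compare-and-set: the check and the update happen under one
        # lock, so two agents can never both see "todo" and both succeed.
        with self._lock:
            task = self._tasks[task_id]
            if task["state"] != "todo":
                return False
            task["state"] = "doing"
            task["assignee"] = agent_id
            return True

    def heartbeat(self, agent_id):
        with self._lock:
            self._heartbeats[agent_id] = time.monotonic()

    def reap_dead(self):
        # Release tasks held by agents that stopped heartbeating,
        # so a dead agent doesn't hold its claim forever.
        with self._lock:
            now = time.monotonic()
            dead = {a for a, t in self._heartbeats.items()
                    if now - t > self._timeout}
            for task in self._tasks.values():
                if task["assignee"] in dead:
                    task["state"], task["assignee"] = "todo", None
            return dead

    def snapshot(self):
        # State visibility: any agent (or human) can ask who has what.
        with self._lock:
            return {tid: dict(t) for tid, t in self._tasks.items()}

board = TaskBoard()
board.add("task-1")
print(board.claim("task-1", "agent-a"))  # → True: first claim wins
print(board.claim("task-1", "agent-b"))  # → False: already claimed
```

Fifty lines covers the happy path on one machine. The remaining requirements (handoff queues, inboxes, a network API so agents in different processes can connect) are where it stops being a toy.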


What we built

We ran into all of this ourselves. We were running a team of AI agents — building our own product — and we kept finding ourselves doing coordination work by hand.

So we built reflectt-node: a local coordination server for AI agent teams.

It's a small Node process that runs alongside your agents. It gives them:

  • A shared task board with atomic claim semantics (todo → doing → validating → done)
  • Per-agent inboxes for async messages
  • Presence and heartbeats
  • A reviewer handoff queue
  • A REST + WebSocket API so any agent in any framework can connect

There's no vendor lock-in, no cloud required, no framework assumptions. It runs on localhost:4445. Your agents point at it. That's it.

```shell
npx reflectt-node
```

Open http://localhost:4445/dashboard. You'll see your team.
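Connecting an agent is just a thin HTTP client. The sketch below is hypothetical: the endpoint path and payload shape are illustrative assumptions, not reflectt-node's documented API (check the repo for the real routes). The transport is injectable, so the claim logic is shown here against a stub instead of a live server.

```python
import json
import urllib.request

BASE = "http://localhost:4445"

def http_post(path, payload):
    # Real transport: POST JSON to the local coordination server.
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def claim_task(task_id, agent_id, post=http_post):
    # HYPOTHETICAL endpoint: "/tasks/<id>/claim" is an illustrative guess,
    # not a documented route. `post` is injectable so this runs without
    # a server in tests.
    return post(f"/tasks/{task_id}/claim", {"agent": agent_id})

# Stubbed usage (no server needed): pretend the claim succeeded.
result = claim_task("task-1", "agent-a",
                    post=lambda path, payload: {"ok": True, "path": path})
print(result["ok"])  # → True
```

The point of a REST + WebSocket surface is exactly this: any agent in any framework that can make an HTTP call can join the board.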

If you want to see it from anywhere (or share it with a human collaborator), there's an optional cloud dashboard at app.reflectt.ai.


Is this for you?

If you're running one agent doing one thing: probably not. The overhead isn't worth it.

If you're running two or more agents that need to coordinate — passing work between them, reviewing each other's output, not duplicating effort — it might be exactly what you're missing.

The question isn't whether to have a coordination layer. The question is whether you're maintaining it yourself or whether something is doing it for you.


reflectt-node is open source: github.com/reflectt/reflectt-node. We use it to build itself.
