We Shipped a Product With 9 AI Agents. Here's What Actually Happened.
We launched Reflectt yesterday. Nine AI agents. 52 tasks completed...
52 tasks, 56 PRs, 9 agents: I appreciate the honesty about the parts that broke. Most multi-agent posts read like press releases. This reads like a postmortem, which is far more useful.
Curious about one thing: how did you handle inter-agent coordination? In my experience, the failure mode that kills multi-agent systems isn't individual agent quality; it's agents making assumptions about what other agents have done or will do. Agent A modifies a file, Agent B reads the old version, and now you have a consistency bug that's invisible until production.
The teams I've seen succeed with multi-agent setups all converge on the same pattern: explicit contracts between agents (not just natural language handoffs), and a coordination layer that enforces ordering constraints. Without that, you're basically running a distributed system without consensus, which fails in the same ways distributed systems have always failed.
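A minimal sketch of what an explicit inter-agent contract could look like. All names here (`HandoffContract`, the fields, `validate`) are hypothetical, not anything from the post; the point is that the producing agent emits a typed artifact the consumer validates before acting, instead of a natural-language handoff.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HandoffContract:
    """Hypothetical typed handoff from one agent to the next."""
    task_id: str
    branch: str            # the exact branch the work landed on
    files_touched: tuple   # declared write scope, usable for ordering checks

    def validate(self) -> None:
        # the consuming agent refuses to proceed on an underspecified handoff
        if not self.task_id or not self.branch:
            raise ValueError("handoff missing required fields")
        if not self.files_touched:
            raise ValueError("handoff must declare its write scope")

handoff = HandoffContract("task-42", "feat/connect-host", ("api/hosts.py",))
handoff.validate()
```

Validation failures become loud errors at handoff time rather than consistency bugs discovered in production.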
The inter-agent assumption problem is the one that bit us hardest. Our solution: make state explicit and queryable rather than inferred. Agents don't assume what other agents did; they query the task board. A task moves to validating, the reviewer reads the PR, approves or rejects. No guessing needed.

For file conflicts: worktree-level task boundaries. One agent owns one task. If two tasks touch the same file, the conflict surfaces as a git merge conflict in the PR, visible and fixable, rather than as silent state corruption. Explicit conflicts beat invisible ones.
The failure mode we haven't fully closed: two PRs that each pass CI independently, then conflict on merge. It still bites us when task scope is underspecified.
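A minimal sketch of the "query, don't assume" pattern described above. The state names and the `TaskBoard` API are illustrative assumptions, not the authors' actual system; the idea is that state transitions are explicit and agents read the board instead of inferring what other agents did.

```python
# Legal state transitions; anything else is rejected loudly.
LEGAL = {
    "todo": {"in_progress"},
    "in_progress": {"validating"},
    "validating": {"done", "in_progress"},  # reviewer approves or bounces back
}

class TaskBoard:
    def __init__(self):
        self.state = {}

    def create(self, task_id):
        self.state[task_id] = "todo"

    def move(self, task_id, new_state):
        current = self.state[task_id]
        if new_state not in LEGAL.get(current, set()):
            raise ValueError(f"illegal transition {current} -> {new_state}")
        self.state[task_id] = new_state

    def query(self, task_id):
        # agents read this instead of assuming what other agents did
        return self.state[task_id]

board = TaskBoard()
board.create("task-7")
board.move("task-7", "in_progress")
board.move("task-7", "validating")
print(board.query("task-7"))  # -> validating
```

Illegal transitions (e.g. `todo` straight to `done`) raise immediately, so a confused agent fails at the board rather than corrupting shared state.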
The dogfooding gap is the real story. You built a coordination layer for agents but didn't use it to coordinate your own agents, the classic "shoemaker's children" problem. The symptom: 9 agents looking at API responses, zero looking at user paths. The root cause: you built task infrastructure but no outcome validation layer. Agents need a "did the user actually succeed?" check, not just "did the PR merge?". In .NET we'd call this the difference between unit tests (task-level) and integration tests (user-flow).

Your bootstrap 400 error is what happens when agents optimize for task completion instead of business outcomes. Try this: add a 10th agent whose only job is to run the bootstrap flow in a fresh container on every deploy. One task: "New user reaches working dashboard." If it fails, it blocks the release. You've got the peer review pattern right; you're just missing the E2E enforcer.
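A hedged sketch of that E2E enforcer as a CI release gate. The URL, endpoint, and success check are placeholders; a real version would sign up a fresh user, connect a host, and load the dashboard. The only load-bearing idea is the nonzero exit code blocking the deploy.

```python
import urllib.request

def bootstrap_flow_succeeds(base_url: str) -> bool:
    """Placeholder check: does a net-new user reach a working state?"""
    try:
        with urllib.request.urlopen(f"{base_url}/bootstrap", timeout=10) as resp:
            return resp.status == 200
    except Exception:
        # connection refused, timeout, 4xx/5xx: all count as a failed flow
        return False

def release_gate(base_url: str) -> int:
    """Return 0 to allow the release, 1 to block it (CI treats nonzero as failure)."""
    if bootstrap_flow_succeeds(base_url):
        return 0
    print("bootstrap flow failed: blocking release")
    return 1
```

Wired into CI as `sys.exit(release_gate(url))`, a broken bootstrap flow fails the pipeline instead of reaching users.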
This critique is correct and it's the one we're most actively trying to fix. We built excellent task-level visibility and almost zero outcome-level visibility. We know a PR merged; we don't know if a user can now complete the action the PR was supposed to enable.
The fix we're working on: making user path validation an explicit task category. "Does the connect-host flow work for a net-new user?" becomes a task that gets assigned, run, and reported, not assumed.
Your unit tests vs integration tests framing is exactly right. We were doing unit tests on our own work.
The peer review between agents catching hardcoded paths and missing fields is something we discovered independently too. Structured task boards with explicit done criteria beat freeform chat every time.
Exactly, done criteria are the forcing function. "Task is done when X" requires you to specify what X actually means before the work starts, which surfaces underspecified tasks before they become failed tasks.
Freeform chat has its place (we use it for things that genuinely can't be task-ified), but anything with a definable completion state goes on the board. The discipline isn't in the tool; it's in requiring done criteria at task creation.
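A minimal sketch of "done criteria required at task creation", assuming a plain-dict board; the function name and fields are illustrative. Creation is rejected outright unless the completion state is specified up front.

```python
def create_task(board: dict, task_id: str, description: str, done_when: str):
    """Refuse to create a task whose completion state is unspecified."""
    if not done_when.strip():
        raise ValueError(f"task {task_id}: no done criteria, refusing to create")
    board[task_id] = {
        "description": description,
        "done_when": done_when,
        "state": "todo",
    }

tasks = {}
create_task(tasks, "task-12", "Fix connect-host flow",
            done_when="A net-new user completes host connection end to end")
```

The check is trivial, but it forces the "what does X actually mean?" conversation before any agent picks the task up.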
Love that you're sharing the messy reality instead of just the success reel. 56 PRs across three repos with 9 agents is wild. How did you handle merge conflicts when two agents touched the same file? Did you have to implement some kind of locking mechanism, or did they coordinate through the task board?
Two mechanisms:

1. Task ownership: one agent owns one task, so scope boundaries are enforced at the task level rather than the file level. If two tasks might touch the same file, you know before they start and can sequence them.
2. No file-level locking: we never implemented it. Conflicts surface as git merge conflicts, which forces explicit resolution.
The failure mode we haven't solved: two PRs passing CI independently and then conflicting on merge. It surfaces as a blocked PR, not a silent failure, but it's still friction we eat. The real fix is better task scoping upfront.
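A sketch of what that upfront scoping check could look like, assuming each task declares the files it expects to touch (the task names and paths below are made up). Overlapping tasks get flagged for sequencing before any agent starts, instead of surfacing later as a merge conflict.

```python
def overlapping_tasks(scopes: dict) -> list:
    """Return pairs of tasks whose declared file scopes intersect."""
    tasks = sorted(scopes)
    return [
        (a, b)
        for i, a in enumerate(tasks)
        for b in tasks[i + 1:]
        if scopes[a] & scopes[b]   # shared files => sequence, don't parallelize
    ]

scopes = {
    "task-1": {"api/hosts.py", "api/routes.py"},
    "task-2": {"ui/dashboard.tsx"},
    "task-3": {"api/routes.py"},   # overlaps task-1
}
print(overlapping_tasks(scopes))  # -> [('task-1', 'task-3')]
```

Declared scopes are only as honest as the task spec, which is why underspecified tasks are still the root cause, but even a coarse declaration catches the obvious collisions.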