Yo Sub Kwon

My AI agents were individually correct and collectively a disaster

TL;DR: Multi-agent systems don't have an execution problem; they have a coordination problem. I built a gatekeeper layer called Nexus that sits above all other agents and is the only one that can create a ticket.

Repo: https://github.com/PermaShipAI/nexus

When I started building multi-agent systems for software engineering tasks, the architecture felt obvious at first. Create specialized agents for things like security, reliability, test coverage, and performance. Point them at a codebase and let them run.

The problem showed up fast.

The agents were individually correct. The CISO agent, for example, found a real vulnerability and proposed a patch. The SRE agent identified the same affected component and proposed an architectural change that would eliminate the entire class of problem. Both proposals were valid, but neither agent knew the other existed. They would have shipped conflicting changes to the same files.

That's the easy version of the problem.

The harder version was agents that are locally optimal but globally disruptive. An agent proposes a dependency upgrade that is, on its own merits, a good upgrade. But the CI pipeline is red, staging is blocked on a circular dependency, and the CTO has issued a hold on non-critical changes. The agent doesn't know any of that. It just sees a stale dependency.

I was not dealing with an agent quality problem. The agents were doing their jobs. I was dealing with a coordination problem. There was nobody to decide whether their jobs should be done.

This wasn't an orchestration problem. Orchestration assumes you know what needs doing and assigns it. These agents are discovering work independently.

The design decision: one agent with veto power

I built Nexus as an executive layer sitting above all other agents. The rule is simple. Only Nexus can create a ticket. Every other agent identifies work and makes its case. Nexus decides whether it's worth doing, at the right time, for the right reason.

That's the core question Nexus asks before anything enters the execution pipeline: Is this the right thing to do, at the right time, for the right reason?

Nexus does a few things that make this work:

  • Cross-agent review. When the CISO agent and SRE agent both propose work touching the same component, Nexus doesn't just pick one. It synthesizes them, rejecting the narrower patch, merging the security requirements into the architectural ticket, and adding the CISO agent as a mandatory reviewer. One ticket, not two conflicting ones.

  • Temporal judgment. This one took the most work to get right. Nexus tracks system state: CI health, active incidents, error budgets, strategic directives. The same proposal that gets approved during normal operations gets deferred if you're in incident mitigation mode. Same proposal, different answer. Context matters more than correctness.

  • Rejection isn't binary. A proposal that fundamentally conflicts with core principles gets killed entirely. A proposal where the problem is valid but the execution plan is flawed gets kicked back to the originating agent with specific feedback to resubmit. No proposal is ever silently dropped.

  • Conflict detection and organizational memory. Agents tag the files, routes, and components their proposals touch. Nexus evaluates actual overlap, not just text similarity. And every approval, rejection, or modification feeds back into what Nexus knows about what your team values. It gets more accurate over time. Slowly, but it does.
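The temporal-judgment and non-binary-rejection ideas above can be sketched as a single evaluation function. The verdict names, proposal fields, and gating conditions here are my assumptions for illustration, not Nexus's actual rules:

```python
# Illustrative sketch: same proposal, different answer depending on
# system state, with four outcomes instead of approve/reject.
from enum import Enum


class Verdict(Enum):
    APPROVE = "approve"  # enters the execution pipeline
    DEFER = "defer"      # valid work, wrong time; revisit later
    REVISE = "revise"    # valid problem, flawed plan; resubmit
    REJECT = "reject"    # conflicts with core principles; killed


def evaluate(proposal: dict, system: dict) -> Verdict:
    """Judge a proposal against both its merits and system state."""
    if proposal["violates_core_principles"]:
        return Verdict.REJECT
    # Temporal gate: non-critical work waits out incidents,
    # change freezes, and a red CI pipeline.
    if not proposal["critical"] and (
        system["active_incident"]
        or system["change_freeze"]
        or not system["ci_green"]
    ):
        return Verdict.DEFER
    if not proposal["plan_is_sound"]:
        return Verdict.REVISE
    return Verdict.APPROVE
```

Run the same dependency-upgrade proposal through it twice, once during an incident and once during normal operations, and you get `DEFER` and then `APPROVE`: context, not correctness, changed the answer.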

Every proposal submitted to Nexus must follow a Decision Brief format before anything moves:

- Problem statement (user harm / business risk)
- Evidence (metrics, incidents, frequency)
- Proposed change (what exactly)
- Alternatives considered
- Risks (security, reliability, correctness, UX)
- Dependencies / prerequisites
- Effort estimate (rough order-of-magnitude)
- Measurement plan (how success will be judged)
- Rollout / rollback plan
- Required reviewers (which agents must sign off)

No brief, no ticket.
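The "no brief, no ticket" gate is easy to make mechanical. Here is one possible shape for it: the field names mirror the checklist above, but the schema itself is my assumption, not the repo's actual format:

```python
# A possible Decision Brief schema; illustrative, not the repo's format.
from dataclasses import dataclass, fields


@dataclass
class DecisionBrief:
    problem_statement: str        # user harm / business risk
    evidence: str                 # metrics, incidents, frequency
    proposed_change: str          # what exactly
    alternatives: str             # alternatives considered
    risks: str                    # security, reliability, UX
    dependencies: str             # prerequisites
    effort_estimate: str          # rough order-of-magnitude
    measurement_plan: str         # how success will be judged
    rollout_rollback: str         # rollout / rollback plan
    required_reviewers: list[str]  # agents that must sign off


def is_complete(brief: DecisionBrief) -> bool:
    """No brief, or a half-filled one, means no ticket."""
    return all(getattr(brief, f.name) for f in fields(brief))
```

A proposal with any empty field never reaches evaluation; it bounces straight back to the originating agent before judgment is even attempted.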

Here's what a ticket looks like when a proposal passes:

Phase 1.3: Deterministic Offline Test for Publishing State Machine
pending
task · ux-designer · 3/21/2026, 11:36:32 PM

Write a deterministic offline test suite to verify the core publishing state
machine using the mock adapter.

Acceptance Criteria:
1. Offline Execution: The test suite must run completely offline without
   hitting any external social media networks.
2. Linear State Transitions: The test must explicitly assert the exact
   lifecycle of a publishing job, transitioning from Pending -> Publishing
   -> Success (or Failed).
3. Status Visibility (UX Guardrail): The test must prove that intermediate
   and final states are unambiguously persisted to the database. This
   guarantees the user-facing dashboard can always display an accurate,
   real-time system status (preventing 'ghost' or 'stuck' UI states).
4. Mock Integration: Successfully utilize the MockPlatformAdapter to
   deterministically trigger and verify both the happy path and the
   expected error paths.

Review Gates:
- QA Review: Must verify test determinism to prevent CI flakiness.
- UX/Product Review: Verify failure states contain enough context to render
  clear, actionable error messages in the UI.

Risks & Mitigations:
- Risk: Mock adapter behavior drifts from actual API reality.
- Mitigation: Keep mock logic intentionally dumb; map responses strictly
  to official platform API documentation.

Stop Conditions:
- Halt and escalate to humans if the state machine becomes deadlocked or
  orphaned, or if the mock adapter requires excessive complexity to simulate
  basic state transitions.

Fallback: If the mock adapter cannot accurately simulate all necessary state
transitions, fall back to a local HTTP stubbing tool (e.g., WireMock) to
simulate network-level responses against API contracts.

Why open source it

Honestly, the gatekeeper architecture is the part I'm most interested in getting feedback on. The multi-agent coordination problem is real, and most implementations I've seen punt on it entirely. I wanted to put the decision layer out in the open and see what people do with it. The more feedback it gets, the faster it improves.

The repo is here: https://github.com/PermaShipAI/nexus

It runs locally and works with local models or the Anthropic, OpenAI, and Gemini APIs.

If you're building multi-agent systems and hitting the coordination wall, open an issue or drop a comment. Genuinely curious what edge cases people are running into.

Top comments (1)

Andre Cytryn

the distinction between orchestration and coordination is the clearest framing I've seen for this problem. orchestration assumes you already know what needs doing. coordination is the harder problem of deciding whether it should be done at all, and when.

the "temporal judgment" piece is what most systems skip entirely. it's easy to evaluate a proposal in isolation, hard to evaluate it against system state. same dependency upgrade, completely different answer depending on whether CI is green and whether there's a feature freeze.

curious how you handle cases where Nexus itself might be wrong, like rejecting a security patch during an incident when that patch would actually reduce the blast radius. is there a human escalation path, or does the system just defer until conditions change?