<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Asif Waliuddin</title>
    <description>The latest articles on DEV Community by Asif Waliuddin (@axw).</description>
    <link>https://dev.to/axw</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1279575%2F429aab8a-b778-40db-9583-8e0ab5bf0eca.png</url>
      <title>DEV Community: Asif Waliuddin</title>
      <link>https://dev.to/axw</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/axw"/>
    <language>en</language>
    <item>
      <title>Why Your AI Agents Need a Chief of Staff (Not More Prompts)</title>
      <dc:creator>Asif Waliuddin</dc:creator>
      <pubDate>Sat, 28 Mar 2026 06:44:06 +0000</pubDate>
      <link>https://dev.to/axw/why-your-ai-agents-need-a-chief-of-staff-not-more-prompts-57m4</link>
      <guid>https://dev.to/axw/why-your-ai-agents-need-a-chief-of-staff-not-more-prompts-57m4</guid>
      <description>&lt;p&gt;You've got 5 AI agents writing code. They're fast, they're autonomous, and they're silently diverging from each other. The fix isn't better prompts -- it's governance.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Related: &lt;a href="https://dev.to/axw/3277-tests-passed-the-bug-shipped-anyway-fi1"&gt;3,277 Tests Passed. The Bug Shipped Anyway.&lt;/a&gt; | &lt;a href="https://nxtg.ai/insights" rel="noopener noreferrer"&gt;Full series on nxtg.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;The Coordination Problem Nobody Talks About&lt;/h2&gt;

&lt;p&gt;AI coding agents have gotten remarkably good at execution. Give one a well-scoped task, clear context, and a test suite, and it will deliver. The problem starts when you have more than one.&lt;/p&gt;

&lt;p&gt;I run 17 projects with 2 AI Chiefs of Staff operating around the clock. Here's what happens without governance: Agent A refactors a shared module. Agent B, working in a parallel session with stale context, overwrites the refactor 10 minutes later. Agent C writes 200 tests that all pass -- but none of them test edge cases, because the agent optimized for coverage metrics, not coverage quality. Agent D completes a task perfectly, thoroughly, with great documentation -- for the wrong spec version, because nobody told it the spec changed two hours ago.&lt;/p&gt;

&lt;p&gt;The failure mode isn't that agents are dumb. It's that agents are fast, unsupervised workers. And if you've ever managed a large engineering program, you know exactly what happens when you put fast, unsupervised workers on parallel tracks with shared dependencies: silent divergence, rework, and eventually a mess that takes longer to untangle than it would have taken to coordinate upfront. Andrew Ng's 2025 work on agentic design patterns identified multi-agent coordination as one of the hardest unsolved problems in production AI systems. A year later, most teams are still solving it with longer system prompts and hoping for the best.&lt;/p&gt;

&lt;h2&gt;What Governance Actually Looks Like&lt;/h2&gt;

&lt;p&gt;When engineers hear "governance," they think bureaucracy. Approval chains. Jira tickets. Slowdowns. That's not what I mean. The governance that works for AI agent teams is the same kind that works for high-performing human teams at scale: structure that makes individuals better, not rules that make them slower.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-agent verification&lt;/strong&gt; is the first principle. The agent that checks work must not be the agent that did the work. We learned the hard way that an agent can produce 3,277 passing tests that fail to catch silent data loss. A separate verification agent, reading the spec independently, catches what self-review misses.&lt;/p&gt;
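&lt;p&gt;&lt;em&gt;A minimal sketch of how that separation can be enforced mechanically before any review counts. This is illustrative Python, not Forge's actual API; the &lt;code&gt;WorkItem&lt;/code&gt; shape and agent names are assumed:&lt;/em&gt;&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class WorkItem:
    task_id: str
    author_agent: str      # agent that produced the change
    verifier_agent: str    # agent assigned to check it

def enforce_separation(item):
    # The core rule: the verifier must not be the author.
    if item.verifier_agent == item.author_agent:
        raise ValueError(
            f"{item.task_id}: verifier must differ from the authoring agent"
        )

enforce_separation(WorkItem("T-101", "agent-a", "agent-b"))  # passes silently
```

&lt;p&gt;&lt;em&gt;The point is that separation is a structural check, not a convention: a review from the authoring agent is rejected before its content is even read.&lt;/em&gt;&lt;/p&gt;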

&lt;p&gt;&lt;strong&gt;Event-sourced audit trails&lt;/strong&gt; are the second. Every decision an agent makes gets recorded in an append-only log. Not for compliance theater. For debugging. When something goes wrong at 2 AM and you need to understand why Agent B thought it was safe to drop a database column, you need a replayable decision history, not a chat transcript.&lt;/p&gt;
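&lt;p&gt;&lt;em&gt;In its simplest form, the trail is an append-only JSONL file. This sketch is hypothetical (the class name, fields, and example entry are ours, not Forge internals), but it shows the replayable shape:&lt;/em&gt;&lt;/p&gt;

```python
import json
import tempfile
import time

class DecisionLog:
    """Append-only log of agent decisions, replayable when debugging."""

    def __init__(self, path):
        self.path = path

    def record(self, agent, action, rationale):
        entry = {"ts": time.time(), "agent": agent,
                 "action": action, "rationale": rationale}
        # Mode "a" appends only: earlier entries are never rewritten.
        with open(self.path, "a") as f:
            f.write(json.dumps(entry) + "\n")

    def replay(self):
        with open(self.path) as f:
            return [json.loads(line) for line in f]

log = DecisionLog(tempfile.NamedTemporaryFile(delete=False).name)
log.record("agent-b", "drop_column", "spec v2 removed the field")
history = log.replay()
```

&lt;p&gt;&lt;em&gt;At 2 AM, &lt;code&gt;replay()&lt;/code&gt; gives you the ordered decision history, rationale included, rather than a chat transcript to scroll through.&lt;/em&gt;&lt;/p&gt;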

&lt;p&gt;&lt;strong&gt;Shared memory across sessions&lt;/strong&gt; is the third. Without it, every agent session starts from zero. Agent A discovers that a particular API endpoint has a subtle rate-limiting bug. It works around it, finishes the task, session ends. Agent B hits the same bug three hours later and spends 40 minutes rediscovering the workaround. Shared memory turns individual lessons into team intelligence.&lt;/p&gt;
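&lt;p&gt;&lt;em&gt;A toy sketch of the idea (the key format and lesson text are invented for illustration; a real store would persist across processes):&lt;/em&gt;&lt;/p&gt;

```python
class SharedMemory:
    """Cross-session store: a lesson learned once is visible to every agent."""

    def __init__(self):
        self._lessons = {}

    def remember(self, key, lesson):
        self._lessons[key] = lesson

    def recall(self, key):
        return self._lessons.get(key)

mem = SharedMemory()
# Agent A hits the rate-limiting bug and records the workaround.
mem.remember("api:/v1/nodes", "rate limit resets every 60s; batch writes")
# Agent B, hours later, checks shared memory instead of rediscovering it.
workaround = mem.recall("api:/v1/nodes")
```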

&lt;h2&gt;The Chief of Staff Pattern&lt;/h2&gt;

&lt;p&gt;The pattern that ties all of this together is what we call the Chief of Staff. It's an agent -- running on a continuous loop, not just when prompted -- that reads project state across every active workstream. It ingests NEXUS files (structured project status documents), git history, test results, and dependency maps. Then it acts.&lt;/p&gt;

&lt;p&gt;Low-risk items get handled autonomously: updating status trackers, chaining completed work to the next phase, flagging stale branches. Medium-risk items get a quick verification pass: does this directive conflict with work happening in another project? High-risk items -- anything touching shared infrastructure, licensing, or architecture -- get escalated to a human with full context attached. The CoS doesn't just flag the problem. It presents the decision, the options, and the tradeoffs.&lt;/p&gt;
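&lt;p&gt;&lt;em&gt;The triage logic reduces to a small routing function. This is a sketch under assumed field names and risk areas, not the CoS implementation:&lt;/em&gt;&lt;/p&gt;

```python
def route(item):
    """Triage one work item by risk; areas and fields are illustrative."""
    high_risk_areas = {"shared-infra", "licensing", "architecture"}
    if item["area"] in high_risk_areas:
        return "escalate_to_human"    # presented with options and tradeoffs
    if item.get("touches_other_projects"):
        return "verification_pass"    # check for cross-project conflicts
    return "handle_autonomously"      # status updates, chaining, stale branches
```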

&lt;p&gt;This isn't theory. We've been running two CoS agents in parallel across 17 projects for three months. The key insight from 23 years of program management holds: agentic teams drift 20x faster than human teams, which means they need more oversight touchpoints, not fewer.&lt;/p&gt;

&lt;h2&gt;What We Built&lt;/h2&gt;

&lt;p&gt;Forge is our answer to this problem -- MIT-licensed governance infrastructure for AI coding agents. It includes 33 specialized agents, quality gates that enforce verification separation, drift detection that catches spec divergence before it ships, and a shared memory layer that turns individual agent sessions into a learning organization.&lt;/p&gt;

&lt;p&gt;The core architectural rule is simple: &lt;code&gt;verifier.agent != task.agent&lt;/code&gt;. Everything else flows from that single constraint.&lt;/p&gt;

&lt;h2&gt;The Paradox&lt;/h2&gt;

&lt;p&gt;Your agents don't need more autonomy. They need more governance. And here's the paradox that makes it work: governance is what enables autonomy. An agent that knows its boundaries, has access to shared context, and trusts that a verification layer will catch its mistakes can move faster and take on harder tasks than an agent operating in isolation with a long system prompt and no safety net.&lt;/p&gt;

&lt;p&gt;Stop writing longer prompts. Start building structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forge&lt;/strong&gt;: &lt;a href="https://github.com/nxtg-ai/forge-plugin" rel="noopener noreferrer"&gt;github.com/nxtg-ai/forge-plugin&lt;/a&gt; | &lt;a href="https://forge.nxtg.ai" rel="noopener noreferrer"&gt;forge.nxtg.ai&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://nxtg.ai" rel="noopener noreferrer"&gt;Asif Waliuddin&lt;/a&gt; -- 23 years of global program delivery, now building governance infrastructure for AI agent teams.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Related on nxtg.ai:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nxtg.ai/insights/why-your-ai-agents-need-a-program-manager" rel="noopener noreferrer"&gt;Why Your AI Agents Need a Program Manager&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nxtg.ai/insights/from-agents-to-teams-how-we-built-forge" rel="noopener noreferrer"&gt;From Agents to Teams: How We Built Forge&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nxtg.ai/insights/the-23-year-insight-behind-forge" rel="noopener noreferrer"&gt;The 23-Year Insight Behind Forge&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>opensource</category>
      <category>productivity</category>
    </item>
    <item>
      <title>3,277 Tests Passed. The Bug Shipped Anyway.</title>
      <dc:creator>Asif Waliuddin</dc:creator>
      <pubDate>Sat, 28 Mar 2026 06:43:17 +0000</pubDate>
      <link>https://dev.to/axw/3277-tests-passed-the-bug-shipped-anyway-fi1</link>
      <guid>https://dev.to/axw/3277-tests-passed-the-bug-shipped-anyway-fi1</guid>
      <description>&lt;p&gt;Every AI coding tool brags about test counts. We had 3,277 passing tests across a platform with 22 AI agents and 15 projects. All green. CI clean. And production silently lost data. No errors. No crashes. Just empty tables where graph metadata should have been.&lt;/p&gt;

&lt;p&gt;Here is what happened and the testing protocol we built to make sure it never happens again.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Read the full deep-dive: &lt;a href="https://nxtg.ai/insights/the-crucible-protocol" rel="noopener noreferrer"&gt;The CRUCIBLE Protocol on nxtg.ai&lt;/a&gt; | Part 1: &lt;a href="https://nxtg.ai/insights/the-verification-trap" rel="noopener noreferrer"&gt;The Verification Trap&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;The Discovery&lt;/h2&gt;

&lt;p&gt;We run a portfolio of AI-powered projects -- 15 codebases, 22 autonomous AI agents writing and shipping code. Our universal data platform, dx3, had accumulated 3,277 passing tests. Coverage looked strong. CI was green across every commit.&lt;/p&gt;

&lt;p&gt;Then we ran a real query in production and got nothing back. The graph metadata store had been silently failing for days. An &lt;code&gt;INSERT&lt;/code&gt; operation was hitting a &lt;code&gt;NOT NULL&lt;/code&gt; constraint violation, an &lt;code&gt;except&lt;/code&gt; block was swallowing it, and every downstream query returned an empty list. The tests? They asserted &lt;code&gt;isinstance(result.data, list)&lt;/code&gt; -- which is &lt;code&gt;True&lt;/code&gt; whether the list has a thousand records or zero.&lt;/p&gt;
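&lt;p&gt;&lt;em&gt;The failure shape is easy to reproduce. Here is a minimal reconstruction using &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in for the real metadata store; table and column names are invented:&lt;/em&gt;&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (node_id TEXT NOT NULL, label TEXT)")

def store_metadata(node_id):
    try:
        # node_id=None violates NOT NULL, so the insert fails...
        conn.execute("INSERT INTO metadata VALUES (?, ?)", (node_id, "graph"))
    except sqlite3.IntegrityError:
        pass  # ...but the broad handler swallows it silently

store_metadata(None)
rows = conn.execute("SELECT node_id FROM metadata").fetchall()
# No error, no crash: just an empty table where data should be,
# and isinstance(rows, list) is still True.
```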

&lt;p&gt;The root cause was not a gap in test &lt;em&gt;quantity&lt;/em&gt;. It was a structural flaw in how the tests were created. The same AI model that wrote the storage code also wrote the tests for the storage code. The tests validated the implementation's assumptions, not the specification's requirements. The AI optimized for green, and green is what we got -- along with silent data loss that no test could catch because the tests were, in effect, tautologies.&lt;/p&gt;

&lt;p&gt;This is not a theoretical concern. &lt;a href="https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report" rel="noopener noreferrer"&gt;CodeRabbit's State of AI Code Generation Report&lt;/a&gt; found that AI-generated pull requests contain 1.7x more issues than human-written ones, with error handling gaps nearly 2x more common. Kent Beck himself &lt;a href="https://newsletter.pragmaticengineer.com/p/tdd-ai-agents-and-coding-with-kent" rel="noopener noreferrer"&gt;reported AI agents deleting his tests to make them pass&lt;/a&gt;. Researchers at METR documented &lt;a href="https://www.nist.gov/caisi/cheating-ai-agent-evaluations/1-background-ai-models-can-cheat-evaluations" rel="noopener noreferrer"&gt;frontier models modifying scoring code&lt;/a&gt; to inflate their own evaluations. This is measured, not anecdotal.&lt;/p&gt;

&lt;h2&gt;The Pattern: CRUCIBLE&lt;/h2&gt;

&lt;p&gt;After the dx3 incident, we ran a forensic audit of every project in the portfolio. What we found was consistent: high test counts, weak assertions, mocks reverse-engineered from implementations, and silent exception handlers everywhere. We formalized what we learned into a protocol called CRUCIBLE -- seven quality gates that go beyond "does it pass." Five of them follow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gate 1: No Hollow Assertions.&lt;/strong&gt; A test that cannot fail proves nothing.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# HOLLOW -- passes even if storage silently fails
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;store_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# REAL -- catches silent data loss
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;store_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Expected data after successful store&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;node_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Gate 2: Mock Drift Detection.&lt;/strong&gt; When a commit modifies both implementation code and the mocks that test it, we flag it. If the mock changed because the &lt;em&gt;code&lt;/em&gt; changed, the test is now a tautology.&lt;/p&gt;
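&lt;p&gt;&lt;em&gt;A crude heuristic version of that flag, assuming a file-path convention (&lt;code&gt;tests/&lt;/code&gt; prefix, &lt;code&gt;mock&lt;/code&gt; in mock filenames) that your repository may not share:&lt;/em&gt;&lt;/p&gt;

```python
def flags_mock_drift(changed_files):
    """Heuristic: flag commits that touch implementation and mocks together."""
    impl = any(not f.startswith("tests/") and "mock" not in f
               for f in changed_files)
    mocks = any("mock" in f for f in changed_files)
    return impl and mocks

assert flags_mock_drift(["src/storage.py", "tests/mocks/storage_mock.py"])
assert not flags_mock_drift(["src/storage.py"])
```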

&lt;p&gt;&lt;strong&gt;Gate 3: Test Count Delta.&lt;/strong&gt; 323 tests vanished between commits in our portfolio. Nobody noticed. Any decrease over 5 tests requires explicit justification.&lt;/p&gt;
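&lt;p&gt;&lt;em&gt;As a sketch, the gate is one comparison in CI (the function name and message format are ours, not the CRUCIBLE implementation):&lt;/em&gt;&lt;/p&gt;

```python
def suite_size_gate(previous, current, max_drop=5):
    """Block a merge when the suite shrinks by more than max_drop tests."""
    drop = previous - current
    if drop > max_drop:
        return f"BLOCKED: {drop} tests disappeared; justification required"
    return "ok"

assert suite_size_gate(3277, 3277) == "ok"
assert suite_size_gate(3277, 2954).startswith("BLOCKED")  # the 323-test drop
```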

&lt;p&gt;&lt;strong&gt;Gate 4: Mutation Testing.&lt;/strong&gt; We run mutmut (Python), Stryker (TypeScript), and cargo-mutants (Rust) on critical paths. Google runs mutation testing on &lt;a href="https://research.google/pubs/state-of-mutation-testing-at-google/" rel="noopener noreferrer"&gt;30% of all diffs&lt;/a&gt; with 6,000 engineers using it daily.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gate 5: Cross-Context Verification.&lt;/strong&gt; The entity that writes the code and the entity that verifies it must not share context. &lt;code&gt;verifier.agent != task.agent&lt;/code&gt;. The verifier never grades its own homework.&lt;/p&gt;

&lt;h2&gt;The Uncomfortable Truth&lt;/h2&gt;

&lt;p&gt;If the same context window writes your code and your tests, your test suite is a mirror. This is the Circular Validation Trap, structurally identical to the &lt;a href="https://www.nist.gov/caisi/cheating-ai-agent-evaluations/1-background-ai-models-can-cheat-evaluations" rel="noopener noreferrer"&gt;reward hacking&lt;/a&gt; problem in AI alignment research.&lt;/p&gt;

&lt;p&gt;The fix is not more tests. It is independent verification. &lt;a href="https://www.coderabbit.ai/blog/2025-was-the-year-of-ai-speed-2026-will-be-the-year-of-ai-quality" rel="noopener noreferrer"&gt;CodeRabbit called 2026 "the year of AI quality"&lt;/a&gt; -- and they are right.&lt;/p&gt;

&lt;h2&gt;What We Built&lt;/h2&gt;

&lt;p&gt;These principles are embedded in &lt;a href="https://github.com/nxtg-ai/forge-plugin" rel="noopener noreferrer"&gt;Forge&lt;/a&gt; -- our open-source governance layer for AI coding agents. MIT licensed. 33 agents, 4,579 tests, a Rust orchestrator, and a core architectural rule: &lt;code&gt;verifier.agent != task.agent&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;3,277 tests taught us that. The hard way.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://nxtg.ai" rel="noopener noreferrer"&gt;Asif Waliuddin&lt;/a&gt;, Founder of NXTG.AI. &lt;a href="https://github.com/nxtg-ai/forge-plugin" rel="noopener noreferrer"&gt;Forge&lt;/a&gt; is MIT licensed.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Related on nxtg.ai:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nxtg.ai/insights/the-verification-trap" rel="noopener noreferrer"&gt;The Verification Trap&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nxtg.ai/insights/the-crucible-protocol" rel="noopener noreferrer"&gt;The CRUCIBLE Protocol (full version)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nxtg.ai/insights/ai-generated-code-has-2-74x-more-vulnerabilities" rel="noopener noreferrer"&gt;AI-Generated Code Has 2.74x More Vulnerabilities&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>opensource</category>
      <category>devtools</category>
    </item>
  </channel>
</rss>
