Most people use Claude Code to write functions. I built a system where 13 specialized AI agents coordinate, challenge each other, and collectively make better decisions than any single agent could alone.
This isn't a weekend prototype. It runs daily. It has 141 tests across two shipped npm packages. It once scored a business opportunity 88/100 — without running a single web search. That failure, and six others like it, shaped every design decision you'll read below.
Here's the full architecture: routing engine, adversarial verification, lifecycle hooks, memory system, and the specific failures that made each one necessary.
Why Multi-Agent?
Single-agent AI has a ceiling, and it's lower than most people think.
Ask one agent to research a market, build the product, AND evaluate whether it's worth building — you get confirmation bias baked into every step. The agent that researched will defend its findings. The agent that built will justify its architecture. The agent that evaluated will anchor on work already done.
I hit this wall repeatedly. My single-agent setup would produce 2,000 words of reasoning explaining why a strategy was brilliant, then I'd spend 10 minutes on Google and find three competitors doing it better. The reasoning was airtight. The premises were wrong.
Multi-agent orchestration solves this through separation of concerns:
Specialists handle what they're good at
Adversaries exist solely to find flaws
A coordinator synthesizes without getting attached to any one perspective
The result: better decisions, caught earlier, with a paper trail of why.
The Architecture
User (CEO) → "What" — business intent
↓
Lead Agent (CTO) → "How" — all technical decisions
↓
┌──────────────┬──────────────┬───────────────┬──────────────────────┐
│    Skill?    │    Solo?     │   Subagent?   │     Agent Team?      │
└──────┬───────┴──────┬───────┴───────┬───────┴──────────┬───────────┘
       ↓              ↓               ↓                  ↓
  Skills /cmd    Solo Execute   Static Agents      Dynamic Teams
  (19 skills)     (< 2 min)     (1-3 agents)     (custom composition
                                                 + mandatory adversary)
The critical design decision: the lead agent never codes directly on complex tasks. It classifies the request, composes the right team, delegates, and synthesizes. The coordinator coordinates. The specialists specialize.
This sounds obvious until you build it. My first instinct was having the lead agent "help out" when it knew the answer. That creates a god-agent that subtly biases team output because it already has an opinion before the specialists even start. Forcing strict delegation eliminated an entire class of coordination bugs.
The Routing Engine
Every request hits a four-level decision tree before any work begins:
→ Existing skill handles it? → DELEGATE (19 skills)
→ Trivial (< 2 min, well-defined)? → SOLO (lead executes directly)
→ Moderate (2-5 steps, single domain)? → SPAWN 1-3 specialist agents
→ Complex (5+ steps OR cross-domain)? → AGENT TEAM (dynamic + mandatory DA)
Routing matters because agent overhead is real. Spawning a 3-agent team to rename a variable wastes tokens, time, and context window. Running solo on a strategic decision is dangerous — you're back to single-agent confirmation bias.
The router's job is proportional response. Match the complexity of the tool to the complexity of the task.
One rule I learned the hard way: when in doubt, route UP, not down. Treating a complex task as moderate is far more costly than treating a moderate task as complex. Over-routing wastes tokens. Under-routing wastes decisions.
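The decision tree above can be sketched as a small classifier. This is an illustrative reconstruction, not the actual router — the type, field names, and any thresholds beyond those stated in the text are assumptions:

```typescript
// Hypothetical sketch of the four-level router. Field names and the
// TaskProfile shape are illustrative, not the real implementation.
type Route = "SKILL" | "SOLO" | "SUBAGENT" | "AGENT_TEAM";

interface TaskProfile {
  matchesSkill: boolean;    // maps cleanly onto one of the 19 skills
  estimatedMinutes: number; // rough effort estimate
  steps: number;            // how many distinct steps the task needs
  crossDomain: boolean;     // spans more than one domain
}

function routeTask(t: TaskProfile): Route {
  if (t.matchesSkill) return "SKILL";                     // fast path, zero overhead
  // "When in doubt, route UP": complexity checks come before the solo shortcut.
  if (t.steps >= 5 || t.crossDomain) return "AGENT_TEAM"; // + mandatory DA
  if (t.steps >= 2) return "SUBAGENT";                    // 1-3 specialists
  if (t.estimatedMinutes < 2) return "SOLO";              // trivial, well-defined
  return "SUBAGENT";                                      // ambiguous → route up
}
```

Note that the ambiguous fall-through routes up to a specialist rather than down to solo execution — the "route UP, not down" rule encoded structurally.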
The 13 Static Agents
Each agent has a defined role, model tier, toolset, and memory scope:
| Agent | Function | Model |
|---|---|---|
| Devil's Advocate | Find flaws, score proposals, kill bad ideas | Opus |
| Strategic Advisor | High-level strategy, market positioning | Opus |
| Opponent Modeler | Game theory, competitive analysis | Opus |
| Researcher | Web research, data gathering, market analysis | Sonnet |
| Verifier | Quality gates, completion validation | Sonnet |
| Test Runner | Execute and validate test suites | Sonnet |
| Debugger | Root cause analysis, error diagnosis | Sonnet |
| Code Reviewer | Architecture review, anti-patterns, code quality | Sonnet |
| Security Auditor | Vulnerability detection, dependency risks | Sonnet |
| Session Distiller | Compress session learnings for future context | Sonnet |
| Upgrade Analyst | Dependency analysis, breaking change detection | Sonnet |
| PR Reviewer | Pull request quality, merge readiness | Sonnet |
| RLM Processor | Recursive reasoning, iterative refinement | Sonnet |
Notice the model distribution: only 3 agents run on Opus (the reasoning-heavy ones), and the rest use Sonnet. This wasn't the original design — I'll explain why in the Failures section.
Static agents handle solo specialist delegation. For complex tasks, the system composes dynamic teams — fresh agents with custom system prompts tailored to the specific problem. No two teams look the same, because no two complex problems have the same shape.
The Part That Changed Everything: Adversarial Verification
This is the section I wish someone had written before I learned it the hard way.
Early in the project, I had the system evaluate a business opportunity. The DA agent ran its protocol, synthesized findings, and delivered a score: 88/100. Strong proceed. Compelling reasoning. Specific recommendations.
It hadn't run a single web search.
The 88 was sophisticated-sounding analysis with a confidence number attached. No competitor research. No market validation. No external data of any kind. Just... vibes with decimal points.
I almost shipped a strategy based on that score. That near-miss became the most important design constraint in the entire system.
4 Tiers of Adversarial Depth
L3: MULTI-ADVERSARIAL ──── Devil's Advocate + Contrarian + Pre-Mortem
For: Strategic, irreversible decisions
L2: FULL DA PROTOCOL ───── 5-phase: Claims → Verify → Belief Gap → Pre-Mortem → Score
For: Complex tasks, new builds, significant resource commitments
L1: QUICK CHALLENGE ────── 3 adversarial questions before output delivery
For: Moderate tasks, recommendations, estimates
L0: SELF-CHECK ─────────── 3 assumption checks before ANY output (always active)
For: Everything — no exceptions, no override
L0: The Foundation (Always Active)
L0 is embedded in every agent's system prompt. Before delivering any output, every agent silently runs three checks:
What assumption am I making that I haven't verified?
What's the strongest argument against my conclusion?
What would I be wrong about if challenged by a domain expert?
This costs almost nothing — three questions before each response, no external calls. But it catches unverified assumptions at the source, before they propagate through multi-agent handoffs where they become much harder to trace.
L2: Where It Gets Interesting
The Devil's Advocate agent runs a 5-phase protocol:
Phase 1: Claim Extraction — Identify every falsifiable claim in the proposal. Not opinions, not framing — specific claims that can be tested against reality.
Phase 2: Adversarial Verification — Search for contradicting evidence. Here's the key constraint: 60%+ of searches must seek disconfirmation. The default LLM behavior is to search for supporting evidence. You have to explicitly force the opposite. The search queries aren't "why X is a good idea" but "why X fails," "X competitors better than," "problems with X approach."
Phase 3: Belief Gap Analysis — What does the team wish were true vs. what is true? This catches motivated reasoning — the gap between the conclusion you want and the evidence you have.
Phase 4: Pre-Mortem — "It's 6 months later and this failed. Why?" Generate 5 independent failure scenarios. This reframes evaluation from "will this work?" (optimism bias) to "how could this fail?" (much more productive).
Phase 5: Scoring — 0-100 weighted rubric across evidence quality, market fit, execution feasibility, and risk factors. Score 70+ to proceed, 50-69 to refine, below 50 to kill.
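Phase 5's rubric and thresholds can be sketched as a small scorer. The 70/50 cut-offs come from the text; the specific weights are hypothetical, since the article doesn't publish them:

```typescript
// Illustrative Phase 5 scorer. Thresholds (70 proceed / 50 refine) are from
// the protocol; the weights are my assumption, not the actual rubric.
interface Rubric {
  evidence: number;    // evidence quality, 0-100
  marketFit: number;   // 0-100
  feasibility: number; // execution feasibility, 0-100
  risk: number;        // risk factors (higher = safer), 0-100
}

type Verdict = "proceed" | "refine" | "kill";

function scoreProposal(r: Rubric): { score: number; verdict: Verdict } {
  // Hypothetical weighting that happens to favor evidence quality.
  const score =
    0.35 * r.evidence + 0.25 * r.marketFit + 0.25 * r.feasibility + 0.15 * r.risk;
  const verdict: Verdict = score >= 70 ? "proceed" : score >= 50 ? "refine" : "kill";
  return { score: Math.round(score), verdict };
}
```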
The Double-DA Rule
After the 88/100 incident, I added a safeguard: any score above 80 automatically triggers a second, independent DA run. The second evaluator has no access to the first's findings, reasoning, or score.
The reconciliation logic:
If both scores are within 10 points → higher score stands
If the gap is larger than 10 → the lower score wins
The reasoning: if two independent evaluations diverge by more than 10 points, the optimistic run missed something the skeptical run caught. Defaulting to the lower score builds in a systematic pessimism bias for high-confidence assessments — which is exactly where overconfidence is most dangerous.
This rule has killed three initiatives that would have wasted months of development time.
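The reconciliation rule reduces to a few lines. A minimal sketch (the function name is mine; remember the rule only fires when the first score exceeds 80):

```typescript
// Double-DA reconciliation: within 10 points the runs agree, so the higher
// score stands; diverging by more than 10, the optimistic run likely missed
// something, so the lower, skeptical score wins.
function reconcileScores(first: number, second: number): number {
  const gap = Math.abs(first - second);
  return gap <= 10 ? Math.max(first, second) : Math.min(first, second);
}
```

So an 88 followed by an independent 74 resolves to 74, while an 82 followed by a 78 resolves to 82.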
7 Mandatory Research Gates
The DA can't score a proposal until it passes all seven:
- At least 3 web searches executed (no armchair analysis)
- At least 1 search explicitly seeking contradicting evidence
- Competitor/alternative analysis included (minimum 2 alternatives)
- Market size claim backed by external source (not LLM reasoning)
- Technical feasibility verified against real constraints
- "Who else has tried this?" check completed
- First-principles cost estimate included

If any gate fails, the DA cannot issue a score. It must report which gates failed and what information is missing. This makes "confident ignorance" structurally impossible.
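The gate check can be modeled as a pure function that returns the failing gates — a non-empty list means no score may be issued. Field and function names here are illustrative:

```typescript
// Hedged sketch of the 7 research gates. The evidence shape is hypothetical;
// the thresholds match the gates listed above.
interface ResearchEvidence {
  searchesRun: number;
  disconfirmingSearches: number;
  alternativesAnalyzed: number;
  marketSizeSourced: boolean;     // external source, not LLM reasoning
  feasibilityVerified: boolean;
  priorArtChecked: boolean;       // "who else has tried this?"
  costEstimateIncluded: boolean;  // first-principles estimate
}

function failedGates(e: ResearchEvidence): string[] {
  const gates: [string, boolean][] = [
    ["3+ web searches executed", e.searchesRun >= 3],
    ["1+ search seeking contradicting evidence", e.disconfirmingSearches >= 1],
    ["2+ alternatives analyzed", e.alternativesAnalyzed >= 2],
    ["market size claim externally sourced", e.marketSizeSourced],
    ["technical feasibility verified", e.feasibilityVerified],
    ["prior-art check completed", e.priorArtChecked],
    ["first-principles cost estimate included", e.costEstimateIncluded],
  ];
  // Only the failing gates are reported back, along with what's missing.
  return gates.filter(([, passed]) => !passed).map(([name]) => name);
}
```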
Lifecycle Hooks: Automated Quality Enforcement
Claude Code supports lifecycle hooks — shell scripts that trigger on specific events. I use 10 of them to enforce quality gates that no agent can bypass. The most important:

| Hook | Trigger | Purpose |
|---|---|---|
| session-start.sh | Session begins | Load previous context + memory |
| pre-compact.sh | Before context compression | Save session state before data loss |
| session-end.sh | Session ends | Persist learnings, distill session |
| verify-before-complete.sh | Before task completion | Block premature completion |
| teammate-idle-check.sh | Agent goes idle | Force DA verdict delivery |
| task-completed-gate.sh | Task marked done | Log metrics, update pipeline |
| validate-search-quality.sh | After web searches | Enforce research depth minimums |
The two most important hooks solve specific failure modes I hit repeatedly:
verify-before-complete.sh blocks any task from being marked complete until the verifier agent has signed off. Without this, agents declare victory the moment code compiles. "It works" and "it's correct" are different statements — this hook enforces the distinction.
teammate-idle-check.sh catches a subtler problem: the Devil's Advocate going idle before delivering a verdict. In multi-agent teams, the DA reads other agents' outputs and then... does nothing. It "participated" without actually challenging anything. This hook detects when the DA hasn't delivered a written verdict and forces one before the team can proceed.
These hooks are the immune system. They don't make the system smarter — they prevent specific, known failure modes from recurring.
Skills: The Fast Path
19 skills act as composable workflows invoked via slash commands. Each is self-contained with its own execution logic:
/tdd → Test-driven development (failing test → minimal code → refactor)
/auto-orchestrate → Classify task complexity, compose optimal agent team
/devils-advocate → Full L2 adversarial protocol
/research → Web research with verification gates
/council-of-winners → Elite strategy: power plays, asymmetric upside identification
/prospect-scan → Company AI maturity assessment (10-point rubric)
/commit-push-pr → Git workflow: branch → commit → push → PR
/self-audit → Full-spectrum system health check
/produce-deliverable → End-to-end client deliverable pipeline
/distribute → Generate platform-specific content from any session
Skills are the routing engine's fast path. When a request maps cleanly to an existing skill, there's zero routing overhead — the lead agent recognizes the pattern and delegates directly. The router checks for skill matches first, before evaluating whether to spawn agents.
After 3+ successful uses of the same workflow pattern, the system identifies candidates for new skills — turning repeated multi-step processes into one-command invocations.
The Memory System
Multi-agent systems have a context problem. Each agent starts fresh. But institutional knowledge — past decisions, known failure modes, project context — needs to persist without bloating every session's context window.
Three-layer solution:
Layer 1: Auto-loaded (every session, ~4KB budget)
Core behavior rules, priority stack, active operations summary, agent descriptions (for routing decisions). This loads on every session start via the session-start.sh hook.
The budget matters. I enforce hard limits:
Core config: < 80 lines
Memory summary: < 50 lines
Total auto-load: < 4KB
Without these limits, memory files grow unbounded. A 20KB auto-load eats 15% of your context window before you've typed a single prompt. That's not a cleanup task — it's an architectural constraint.
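Those limits are easy to enforce mechanically. A hedged sketch of a budget guard — the function name and structure are mine, the limits are the ones above:

```typescript
// Illustrative auto-load budget guard. The limits come from the article;
// everything else (names, return shape) is an assumption.
const LIMITS = { coreConfigLines: 80, memorySummaryLines: 50, totalBytes: 4096 };

function checkAutoLoadBudget(coreConfig: string, memorySummary: string): string[] {
  const violations: string[] = [];
  if (coreConfig.split("\n").length > LIMITS.coreConfigLines)
    violations.push("core config exceeds 80 lines");
  if (memorySummary.split("\n").length > LIMITS.memorySummaryLines)
    violations.push("memory summary exceeds 50 lines");
  // Measure actual bytes, not characters, so multi-byte UTF-8 counts correctly.
  if (new TextEncoder().encode(coreConfig + memorySummary).length > LIMITS.totalBytes)
    violations.push("total auto-load exceeds 4KB");
  return violations;
}
```

A check like this can run inside session-start.sh and refuse to load memory that has drifted over budget.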
Layer 2: On-demand (loaded when relevant)
46 reference documents covering orchestration patterns, adversarial depth protocols, model selection guides, strategic positioning, and past operation analyses. Only loaded when the current task requires that specific knowledge.
The lead agent's routing decision includes identifying which reference docs the spawned agents will need. A code review task loads the architecture standards. A strategic decision loads the competitive analysis and failure library. A research task loads the verification protocols.
Layer 3: Persistent (across sessions)
Per-agent isolated memory directories plus session distillation files — compressed learnings from previous sessions generated by the Session Distiller agent. The session-end.sh hook triggers distillation automatically: what was decided, what failed, what should inform future sessions.
This creates a feedback loop: sessions produce learnings → learnings load into future sessions → future sessions build on past context without re-deriving it.
What I've Shipped With This System
The architecture runs in production and has produced real, published artifacts:
Cost Guardian (@bifrostlabs/cost-guardian on npm) — Real-time token cost tracking for Claude Code sessions. Tracks spend per agent, per session, with budget alerts and cost breakdowns. 62 tests.
Claude Shield (@bifrostlabs/claude-shield on npm) — Security lifecycle hooks that block destructive commands before they execute. Pattern matching against known dangerous operations with configurable severity levels. 79 tests.
Both packages built using TDD (the /tdd skill), verified by the adversarial system, and published through the /commit-push-pr automated workflow.
The 7 Failures That Shaped the Architecture
I'm sharing these in detail because the architecture only makes sense in light of the problems it solved. Every guardrail exists because something specific broke.
Failure 1: The 88/100 Score With Zero Research
What happened: DA evaluated a business opportunity. Produced 2,000 words of analysis. Scored 88/100. Had not executed a single web search. No competitor data. No market validation.
Root cause: The DA protocol was reasoning-only. It could construct sophisticated arguments entirely from the LLM's training data and pattern matching. No requirement for external evidence.
Fix: 7 mandatory research gates. 60%+ of searches must seek contradicting evidence. No score can be issued until all gates pass.
Failure 2: Panic-Pivoting on New Information
What happened: Competitive intelligence arrived showing a well-funded competitor in the space. The system immediately recommended abandoning the entire strategy — not adjusting the approach, but wholesale pivot.
Root cause: No distinction between "this new info invalidates our thesis" and "this new info invalidates specific tactics." The system treated all threatening information as existential.
Fix: Anti-pivot rule. New intel triggers a thesis-vs-tactics triage: Does this contradict why we're doing this (thesis) or how we're doing it (tactics)? Thesis invalidation requires L3 review. Tactics adjustment requires L1. Most "pivots" are actually tactical adjustments that don't require strategic rethinking.
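The triage itself is deliberately simple. A minimal sketch, assuming a single upstream judgment about what the intel contradicts:

```typescript
// Thesis-vs-tactics triage from the anti-pivot rule. The function name and
// boolean input are illustrative simplifications.
type ReviewTier = "L3" | "L1";

function triageIntel(contradictsThesis: boolean): ReviewTier {
  // Intel attacking *why* we're doing this → full multi-adversarial review.
  // Intel attacking only *how* → quick challenge, tactical adjustment.
  return contradictsThesis ? "L3" : "L1";
}
```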
Failure 3: Building the Zero-Revenue Component
What happened: The system had three workstreams: free tools, content distribution, and a revenue-generating service platform. 100% of execution went to free tools. Weeks of development, zero progress on anything that would generate income.
Root cause: The router treated all "build" tasks equally. It didn't distinguish between building a free open-source tool and building the paid service that sustains the business. Both looked like "coding tasks" to the routing engine.
Fix: Revenue reality gate in the DA protocol. Before any workstream gets resources, the DA asks: "Does this directly lead to revenue within 90 days? If not, what's the explicit theory for how it converts to revenue later?"
Failure 4: Burning 50% of the Weekly Token Budget in One Session
What happened: Spawned 5 Opus-tier agents for research tasks. Each running full context, full reasoning, full analysis. The session cost more than the previous week combined.
Root cause: The model selection guide existed as a reference doc but wasn't enforced. Agents defaulted to the most capable model because nothing stopped them.
Fix: Hard constraints: maximum 3 parallel agents, maximum 1 Opus agent per team, research agents always use Sonnet. These aren't guidelines — they're enforced by the routing engine.
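Those constraints can be expressed as a validator the router runs before spawning anything. A sketch under assumed type names:

```typescript
// Hard spawn constraints from Failure 4. The AgentSpec shape and role strings
// are hypothetical; the three limits are the ones stated above.
interface AgentSpec {
  role: string;
  model: "opus" | "sonnet";
}

function validateTeam(team: AgentSpec[]): string[] {
  const errors: string[] = [];
  if (team.length > 3) errors.push("max 3 parallel agents");
  if (team.filter((a) => a.model === "opus").length > 1)
    errors.push("max 1 Opus agent per team");
  if (team.some((a) => a.role === "researcher" && a.model !== "sonnet"))
    errors.push("research agents must use Sonnet");
  return errors; // empty array → team may be spawned
}
```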
Failure 5: Strategy Oscillation
What happened: Six strategic directions in four months. Each one researched, architected, partially built, then abandoned when the next "better" idea emerged. Zero revenue from any of them.
Root cause: No commitment mechanism. Every new analysis could trigger a full strategic pivot. The system was optimized for evaluating strategies, not for executing them.
Fix: Strategy Lock — a config file that requires explicit CEO override to change strategic direction. The DA can recommend adjustments, but wholesale pivots require human intervention. The lock has held for the current strategy. Override count: 0.
Failure 6: Absence as Evidence
What happened: System searched for competitors on six platforms. Found nothing. Concluded: "zero competition, massive opportunity." Every platform actually had multiple established competitors — the searches just used the wrong queries.
Root cause: Treating "I didn't find it" as "it doesn't exist." The system didn't distinguish between exhaustive search and unsuccessful search.
Fix: DA Failure Library with pattern matching. "Absence ≠ evidence" is now a named pattern. When a search returns zero results, the system flags it for manual verification and tries alternative search queries before drawing conclusions.
Failure 7: Survivor Bias in Success Stories
What happened: System researched solo consulting success stories to validate the business model. Found dozens. Concluded the model was highly viable.
Root cause: It only found success stories because failures don't write blog posts. The actual solo consulting failure rate is ~80% within the first year.
Fix: DA protocol now requires searching for failure rates alongside success stories. Any market validation must include base rate data, not just examples of people who made it.
Each failure became a permanent entry in the DA Failure Library — a pattern-matching system that checks new proposals against past mistakes. The system doesn't just learn from failures in the abstract; it maintains a structured database of exactly how it failed and checks whether new proposals exhibit the same patterns.
Key Lessons
Separate evaluation from execution. The agent that builds something will defend it. A separate adversary catches what the builder won't see. This isn't just good practice — it's the single highest-leverage architectural decision I made.
Enforce verification, don't suggest it. Having a DA protocol didn't help when it was optional. The verify-before-complete hook, the mandatory research gates, the teammate-idle-check — these work because they're structural, not cultural. An agent can't skip them.
Compose teams dynamically. Every complex task is different. Composing fresh agent teams with task-specific system prompts outperforms recycling the same agent template. The overhead of writing a custom prompt is trivial compared to the cost of a misfit team.
Context discipline is architecture, not cleanup. Without size limits on auto-loaded memory, context bloat degrades everything — reasoning quality, response speed, token cost. The 4KB budget and the 80-line config limit are design decisions, not afterthoughts.
Build the failure feedback loop. The Double-DA rule, the verify-before-complete hook, the DA Failure Library — each exists because the system failed in a specific, observable way. The meta-skill isn't building agents. It's building the system that turns agent failures into agent guardrails.
What's Next
I'm open-sourcing components of this architecture and documenting the patterns that transfer to any multi-agent system. The core principles aren't Claude-specific:
Route by complexity, not by habit
Enforce adversarial checking at every tier
Compose teams dynamically, not from templates
Make your failures into your guardrails
If you're building multi-agent systems — on Claude Code, LangChain, CrewAI, or anything else — I'd genuinely like to hear what coordination problems you've hit and how you solved them.
Built with Claude Code. 13 agents, 19 skills, 10 lifecycle hooks, 141 tests, 7 documented failures. Running in production at @bifrostlabs on GitHub and npm.