How We Built a Multi-Agent AI Documentation System (And What We Learned)

Muhammad Adnan Sultan — Mon, 25 May 2026 05:36:06 +0000

Last quarter at Zeppelin Labs, we shipped Orchestrator-15 — a multi-agent documentation generation platform that takes a codebase or idea spec and produces production-grade technical documentation using coordinated AI agents.

This post covers the architecture, the mistakes, and the specific patterns that made multi-agent coordination actually work in production. Not a tutorial — a war story.

Why Multi-Agent, Not Just One Big Prompt?

The naive approach to AI documentation generation is one giant prompt: "here's my codebase, write the docs."

It fails for the same reason you wouldn't ask one person to simultaneously be a technical writer, an API analyst, a diagram designer, and an editor. Context windows are finite. Tasks have different optimization targets. And a single agent trying to do everything produces mediocre output across the board.

The multi-agent approach assigns specialized roles:

Analyzer Agent — reads the codebase structure, identifies modules, maps dependencies
Writer Agent — takes structured analysis output and produces prose documentation
Formatter Agent — applies templates, ensures consistency, handles cross-references
Reviewer Agent — checks completeness, flags gaps, scores output quality

Each agent is good at one thing. The orchestrator coordinates them in sequence, and sometimes in parallel.

The Architecture

Input (codebase / spec)
        │
        ▼
┌──────────────────┐
│  Orchestrator    │  ← decides task graph, manages state
└──────┬───────────┘
       │
   ┌───┴────────────────────────────┐
   │                                │
   ▼                                ▼
Analyzer Agent                 Context Builder
(GPT-4o, low temp)            (builds shared memory)
   │
   ▼
Writer Agent(s)          ← spawned per module, run in parallel
(Claude 3.5, temp 0.7)
   │
   ▼
Formatter Agent
(structured output)
   │
   ▼
Reviewer Agent           ← gates output quality
(GPT-4o, strict prompt)
   │
   ▼
Final Documentation

The key design decision: shared memory over message passing. Each agent reads from and writes to a shared context object rather than receiving inputs directly from the previous agent. This lets the Reviewer Agent access the Analyzer's raw output without it being filtered through the Writer — which turned out to be critical for catching documentation that technically read well but missed important implementation details.
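To make the shared-memory idea concrete, here is a minimal sketch, assuming a simple in-memory store keyed by module and author. The `SharedContext` class, its `write`/`read` methods, and the `AgentName` union are illustrative names, not the actual Orchestrator-15 types.

```typescript
// Hypothetical sketch: names and shapes are assumptions, not Orchestrator-15's API.
type AgentName = 'analyzer' | 'writer' | 'formatter' | 'reviewer';

interface ContextEntry {
  moduleId: string;
  author: AgentName;
  payload: unknown;
}

class SharedContext {
  private entries: ContextEntry[] = [];

  // Each agent appends its output here instead of handing it to the next agent.
  write(moduleId: string, author: AgentName, payload: unknown): void {
    this.entries.push({ moduleId, author, payload });
  }

  // Any agent can read the latest output of any other agent for a module.
  read(moduleId: string, author: AgentName): unknown {
    for (let i = this.entries.length - 1; i >= 0; i--) {
      const entry = this.entries[i];
      if (entry.moduleId === moduleId && entry.author === author) {
        return entry.payload;
      }
    }
    return undefined;
  }
}

const ctx = new SharedContext();
ctx.write('auth', 'analyzer', { exports: ['login', 'logout'] });
ctx.write('auth', 'writer', 'The auth module exposes login().');

// The reviewer compares the draft against the analyzer's unfiltered output.
const raw = ctx.read('auth', 'analyzer') as { exports: string[] };
const draft = ctx.read('auth', 'writer') as string;
const missing = raw.exports.filter(name => draft.indexOf(name) === -1);
```

In this toy run, `missing` holds the exported names the draft never mentions, which is the kind of gap a reviewer can only spot when it sees the analysis directly.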

The State Machine

Each document module moves through states:

type ModuleState =
  | 'pending'
  | 'analyzing'
  | 'writing'
  | 'formatting'
  | 'reviewing'
  | 'approved'
  | 'failed';

interface DocumentModule {
  id: string;
  name: string;
  state: ModuleState;
  analyzerOutput?: AnalysisResult;
  draft?: string;
  formattedDraft?: string;
  reviewScore?: number;
  reviewFeedback?: string;
  retryCount: number;
}

Modules that fail the Reviewer Agent's quality gate (score < 0.75 on our rubric) get re-queued to the Writer Agent with the review feedback included in the prompt. We cap retries at 3 before flagging for human review.
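The gate itself can be sketched as a small pure function. The 0.75 threshold and the cap of 3 retries come from the description above; the function name, the trimmed-down module shape, and the choice to map "flagged for human review" onto the `failed` state are assumptions made for illustration.

```typescript
// Hypothetical sketch of the quality gate. Threshold and retry cap are from
// the text; names and the use of 'failed' for human review are assumptions.
type GateState = 'writing' | 'reviewing' | 'approved' | 'failed';

interface ReviewedModule {
  id: string;
  state: GateState;
  reviewScore: number;
  reviewFeedback?: string;
  retryCount: number;
}

const APPROVAL_THRESHOLD = 0.75;
const MAX_RETRIES = 3;

function applyQualityGate(m: ReviewedModule): ReviewedModule {
  if (m.reviewScore >= APPROVAL_THRESHOLD) {
    return { ...m, state: 'approved' };
  }
  if (m.retryCount >= MAX_RETRIES) {
    // Retries exhausted: stop looping and hand the module to a human.
    return { ...m, state: 'failed' };
  }
  // Back to the writer; reviewFeedback stays on the module for the next prompt.
  return { ...m, state: 'writing', retryCount: m.retryCount + 1 };
}
```

Keeping the gate free of side effects makes the retry policy easy to unit test separately from the agents.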

This retry loop was the single biggest quality improvement we made. When we approved first-pass Writer output directly, the documentation was grammatically fine but structurally shallow. With the reviewer feedback loop, output quality jumped substantially, especially for complex modules.

Parallelism: Where It Works and Where It Breaks

Writer Agents can run in parallel — each module is independent. We spawn up to 8 concurrent Writer Agents using Promise.allSettled:

async function writeModulesInParallel(
  modules: DocumentModule[],
  context: SharedContext
): Promise<DocumentModule[]> {
  const chunks = chunkArray(modules, 8); // max 8 concurrent
  const results: DocumentModule[] = [];

  for (const chunk of chunks) {
    const settled = await Promise.allSettled(
      chunk.map(module => writerAgent.process(module, context))
    );

    // allSettled preserves input order, so settled[i] belongs to chunk[i]
    settled.forEach((result, i) => {
      if (result.status === 'fulfilled') {
        results.push(result.value);
      } else {
        // mark failed; the orchestrator will retry it
        results.push(markFailed(chunk[i]));
      }
    });
  }

  return results;
}
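One property of fixed batches is that each batch waits for its slowest module before the next batch starts. As a possible alternative, not what the code above does, here is a sketch of a small worker pool that keeps up to `limit` tasks in flight. The `mapWithConcurrency` helper is a hypothetical name.

```typescript
// Hypothetical alternative to fixed batches: a small worker pool keeps up to
// `limit` tasks in flight and starts the next one as soon as a lane frees up.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>
): Promise<PromiseSettledResult<R>[]> {
  const results: PromiseSettledResult<R>[] = new Array(items.length);
  let next = 0;

  async function runLane(): Promise<void> {
    while (next < items.length) {
      // Claim an index; no await sits between the check and the increment,
      // so two lanes can never claim the same item.
      const index = next++;
      try {
        results[index] = { status: 'fulfilled', value: await worker(items[index]) };
      } catch (reason) {
        results[index] = { status: 'rejected', reason };
      }
    }
  }

  const lanes: Promise<void>[] = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) {
    lanes.push(runLane());
  }
  await Promise.all(lanes);
  return results; // same order as `items`, same shape as Promise.allSettled
}
```

Because the result has the same shape and ordering as `Promise.allSettled`, the fulfilled/rejected handling from the batched version would carry over unchanged.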

What doesn't parallelize well: anything that needs global consistency. The Formatter Agent must run sequentially because it maintains a cross-reference map — if two formatter instances run concurrently they produce conflicting internal link structures. We tried distributed locking on the reference map. It was brittle. Sequential formatting was the right call.
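A rough sketch of why a single sequential pass sidesteps the conflict: one function owns the cross-reference map, assigns anchors in a fixed order, and only then resolves links. The `[[module-id]]` reference syntax, the `Draft` shape, and the slug rules are assumptions for illustration.

```typescript
// Hypothetical sketch: a single formatter pass owns the cross-reference map,
// so anchors are assigned in one deterministic order.
interface Draft {
  id: string;
  title: string;
  body: string;
}

function slugify(title: string): string {
  return title.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-|-$/g, '');
}

function formatSequentially(drafts: Draft[]): { anchors: Map<string, string>; pages: string[] } {
  const anchors = new Map<string, string>();
  const used = new Map<string, number>();

  // Pass 1: assign an anchor per module, one at a time, adding a numeric
  // suffix when two titles produce the same slug.
  for (const d of drafts) {
    const base = slugify(d.title);
    const count = (used.get(base) || 0) + 1;
    used.set(base, count);
    anchors.set(d.id, count === 1 ? base : base + '-' + count);
  }

  // Pass 2: resolve [[module-id]] references against the finished map.
  // Unknown ids are left untouched so a reviewer can flag them.
  const pages = drafts.map(d =>
    d.body.replace(/\[\[([\w-]+)\]\]/g, (match: string, id: string) => {
      const anchor = anchors.get(id);
      return anchor !== undefined ? '#' + anchor : match;
    })
  );
  return { anchors, pages };
}
```

With two concurrent formatters, the suffix assigned in pass 1 would depend on which instance saw a title first, which is the kind of conflicting link structure described above.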

Prompt Architecture: The Part Nobody Talks About

The agents are only as good as their prompts. Our production prompts have four sections:

1. Role definition — what this agent is, what it optimizes for, what it explicitly ignores

2. Input schema — structured description of what the agent receives

3. Output schema — strict JSON format the agent must produce

4. Failure modes — explicit instructions for what to do when input is ambiguous, incomplete, or contradictory

We added the failure-mode section only after the system was already in production. Agents without it hallucinated confidently when given ambiguous input. Agents with explicit failure-mode instructions instead returned structured { "status": "needs_clarification", "question": "..." } responses that the orchestrator could handle gracefully.
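As a sketch of how the four sections and the clarification reply might fit together: the section names follow the list above and the `needs_clarification` shape comes from the text, while the `ok` variant, the `## ` headings in the prompt, and the function names are assumptions.

```typescript
// Hypothetical sketch: only the four section names and the
// needs_clarification shape come from the text; the rest is assumed.
interface PromptSpec {
  role: string;
  inputSchema: string;
  outputSchema: string;
  failureModes: string;
}

function buildPrompt(spec: PromptSpec): string {
  return [
    '## Role\n' + spec.role,
    '## Input schema\n' + spec.inputSchema,
    '## Output schema\n' + spec.outputSchema,
    '## Failure modes\n' + spec.failureModes,
  ].join('\n\n');
}

type AgentReply =
  | { status: 'ok'; content: string }
  | { status: 'needs_clarification'; question: string };

// Parse the raw model text and reject anything outside the agreed shapes.
function parseReply(raw: string): AgentReply {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    throw new Error('Reply is not valid JSON');
  }
  if (typeof data !== 'object' || data === null) {
    throw new Error('Reply is not an object');
  }
  const obj = data as Record<string, unknown>;
  const status = obj['status'];
  const content = obj['content'];
  const question = obj['question'];
  if (status === 'ok' && typeof content === 'string') {
    return { status: 'ok', content };
  }
  if (status === 'needs_clarification' && typeof question === 'string') {
    return { status: 'needs_clarification', question };
  }
  throw new Error('Unexpected reply shape');
}
```

An orchestrator could then switch on `reply.status` and route clarification questions to a queue instead of passing a guessed answer downstream.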

The GitHub Copilot SDK Integration

Orchestrator-15 uses the GitHub Copilot SDK for the Analyzer Agent specifically — the SDK's code-understanding capabilities are significantly stronger than general LLM prompting for structural code analysis. It can identify:

Public API surfaces vs. internal implementation details
Dependency graphs between modules
Comment density and existing documentation coverage
Test coverage as a proxy for module stability

The Analyzer feeds this structured analysis to the Writer Agent, which dramatically reduces hallucinated API signatures, one of the most common failures in pure-LLM documentation generation.

What We'd Do Differently

Use structured outputs from the start. We started with free-form text outputs and added JSON schemas later. Every agent refactor was painful because downstream agents had built implicit assumptions about output format. Define your schemas before writing a single agent prompt.

Build the reviewer first. We built it last. If we'd built the quality rubric and reviewer first, we would have caught bad writer prompt patterns on day 1 instead of in week 4.

Token budgets per agent. Without explicit token limits per agent, the Writer Agent would occasionally produce exhaustive output for simple modules and thin output for complex ones. Calibrating per-module token budgets based on the Analyzer's complexity score (lines of code, dependency count) significantly improved consistency.
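A budget calculation along these lines could look like the sketch below. The two inputs (lines of code, dependency count) are the ones named above; the weights, the bounds, and the `tokenBudget` name are placeholder assumptions, not the values used in Orchestrator-15.

```typescript
// Hypothetical sketch: weights and bounds are illustrative placeholders.
interface Complexity {
  linesOfCode: number;
  dependencyCount: number;
}

const MIN_TOKENS = 400;
const MAX_TOKENS = 4000;
const TOKENS_PER_LINE = 2;
const TOKENS_PER_DEPENDENCY = 150;

function tokenBudget(c: Complexity): number {
  const raw =
    MIN_TOKENS +
    c.linesOfCode * TOKENS_PER_LINE +
    c.dependencyCount * TOKENS_PER_DEPENDENCY;
  // Clamp so trivial modules still get a floor and huge ones stay bounded.
  return Math.min(MAX_TOKENS, Math.max(MIN_TOKENS, Math.round(raw)));
}
```

The clamp is the part that addresses the inconsistency: simple modules cannot balloon past the ceiling, and complex ones get more room than the floor.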

The Repo

Orchestrator-15 is open source. You can find it on the Zeppelin Labs GitHub. We're actively developing it — issues and PRs welcome.

If you're building multi-agent systems and want to compare notes, drop a comment below or reach out through zeppelinlabs.digital.

Built at Zeppelin Labs — a software development studio building SaaS products, AI systems, and automation platforms from Islamabad, Pakistan.

DEV Community: Muhammad Adnan Sultan