Muhammad Adnan Sultan

Posted on May 25

How We Built a Multi-Agent AI Documentation System (And What We Learned)

#ai #python #machinelearning #career

Last quarter at Zeppelin Labs, we shipped Orchestrator-15 — a multi-agent documentation generation platform that takes a codebase or idea spec and produces production-grade technical documentation using coordinated AI agents.

This post covers the architecture, the mistakes, and the specific patterns that made multi-agent coordination actually work in production. Not a tutorial — a war story.

Why Multi-Agent, Not Just One Big Prompt?

The naive approach to AI documentation generation is one giant prompt: "here's my codebase, write the docs."

It fails for the same reason you wouldn't ask one person to simultaneously be a technical writer, an API analyst, a diagram designer, and an editor. Context windows are finite. Tasks have different optimization targets. And a single agent trying to do everything produces mediocre output across the board.

The multi-agent approach assigns specialized roles:

Analyzer Agent — reads the codebase structure, identifies modules, maps dependencies
Writer Agent — takes structured analysis output and produces prose documentation
Formatter Agent — applies templates, ensures consistency, handles cross-references
Reviewer Agent — checks completeness, flags gaps, scores output quality

Each agent is good at one thing. The orchestrator coordinates them in sequence, and sometimes in parallel.

The Architecture

Input (codebase / spec)
        │
        ▼
┌──────────────────┐
│  Orchestrator    │  ← decides task graph, manages state
└──────┬───────────┘
       │
   ┌───┴────────────────────────────┐
   │                                │
   ▼                                ▼
Analyzer Agent                 Context Builder
(GPT-4o, low temp)            (builds shared memory)
   │
   ▼
Writer Agent(s)          ← spawned per module, run in parallel
(Claude 3.5, temp 0.7)
   │
   ▼
Formatter Agent
(structured output)
   │
   ▼
Reviewer Agent           ← gates output quality
(GPT-4o, strict prompt)
   │
   ▼
Final Documentation

The key design decision: shared memory over message passing. Each agent reads from and writes to a shared context object rather than receiving inputs directly from the previous agent. This lets the Reviewer Agent access the Analyzer's raw output without it being filtered through the Writer — which turned out to be critical for catching documentation that technically read well but missed important implementation details.
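A minimal sketch of the shared-memory idea, assuming an in-memory store keyed by module id and agent name. The class shape, method names, and the `auth` sample data are illustrative assumptions, not the Orchestrator-15 implementation. It shows a reviewer-side check reading the analyzer's raw output directly instead of a version filtered through the writer's draft.

```typescript
// Hypothetical shared context: every agent writes under its own name,
// and any agent can read any other agent's raw output.
type AgentName = 'analyzer' | 'writer' | 'formatter' | 'reviewer';

class SharedContext {
  private entries = new Map<string, unknown>();

  private key(moduleId: string, agent: AgentName): string {
    return `${moduleId}:${agent}`;
  }

  write(moduleId: string, agent: AgentName, output: unknown): void {
    this.entries.set(this.key(moduleId, agent), output);
  }

  read<T>(moduleId: string, agent: AgentName): T | undefined {
    return this.entries.get(this.key(moduleId, agent)) as T | undefined;
  }
}

// Sample data (made up for illustration).
const ctx = new SharedContext();
ctx.write('auth', 'analyzer', { publicApi: ['login', 'logout'] });
ctx.write('auth', 'writer', 'The auth module exposes login().');

// Reviewer-side check: which analyzed API names never appear in the draft?
const analysis = ctx.read<{ publicApi: string[] }>('auth', 'analyzer');
const draft = ctx.read<string>('auth', 'writer') ?? '';
const undocumented = (analysis?.publicApi ?? []).filter(fn => !draft.includes(fn));
```

With message passing, the reviewer would only see the draft and could not notice that `logout` was dropped.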

The State Machine

Each document module moves through states:

type ModuleState =
  | 'pending'
  | 'analyzing'
  | 'writing'
  | 'formatting'
  | 'reviewing'
  | 'approved'
  | 'failed';

interface DocumentModule {
  id: string;
  name: string;
  state: ModuleState;
  analyzerOutput?: AnalysisResult;
  draft?: string;
  formattedDraft?: string;
  reviewScore?: number;
  reviewFeedback?: string;
  retryCount: number;
}
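The state list implies which moves are legal. A minimal sketch of an explicit transition table follows; the table itself is an assumption inferred from the pipeline order (including the loop from `reviewing` back to `writing`), and `canTransition` and `transition` are hypothetical helpers.

```typescript
type ModuleState =
  | 'pending' | 'analyzing' | 'writing' | 'formatting'
  | 'reviewing' | 'approved' | 'failed';

// Assumed transition table: reviewing can loop back to writing (retry),
// and any in-flight state can fail. Terminal states have no exits here.
const transitions: Record<ModuleState, ModuleState[]> = {
  pending: ['analyzing'],
  analyzing: ['writing', 'failed'],
  writing: ['formatting', 'failed'],
  formatting: ['reviewing', 'failed'],
  reviewing: ['approved', 'writing', 'failed'],
  approved: [],
  failed: [],
};

function canTransition(from: ModuleState, to: ModuleState): boolean {
  return transitions[from].includes(to);
}

// Returns the new state, or throws if the move is not in the table.
function transition(from: ModuleState, to: ModuleState): ModuleState {
  if (!canTransition(from, to)) {
    throw new Error(`Illegal transition: ${from} -> ${to}`);
  }
  return to;
}
```

Centralizing the table means a bug that tries to approve a module straight from `pending` fails loudly instead of silently corrupting state.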

Modules that fail the Reviewer Agent's quality gate (score < 0.75 on our rubric) get re-queued to the Writer Agent with the review feedback included in the prompt. We cap retries at 3 before flagging for human review.

This retry loop was the single biggest quality improvement we made. When we approved first-pass writer output directly, the documentation was grammatically fine but structurally shallow. With the reviewer feedback loop, output quality jumped substantially, especially for complex modules.
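A minimal sketch of the quality gate and retry loop, using the 0.75 threshold and the cap of 3 retries described above. The `write` and `review` callbacks are hypothetical stand-ins for the Writer and Reviewer agents, and reading "3 retries" as one initial attempt plus three more is an assumption.

```typescript
interface Review {
  score: number;
  feedback: string;
}

interface RetryResult {
  status: 'approved' | 'needs_human_review';
  draft: string;
  attempts: number;
}

const QUALITY_THRESHOLD = 0.75; // a score below this fails the gate
const MAX_RETRIES = 3;          // retries allowed after the first attempt

async function writeWithReview(
  write: (feedback?: string) => Promise<string>,
  review: (draft: string) => Promise<Review>
): Promise<RetryResult> {
  let feedback: string | undefined;
  let draft = '';
  // One initial attempt plus up to MAX_RETRIES retries.
  for (let attempt = 1; attempt <= MAX_RETRIES + 1; attempt++) {
    draft = await write(feedback);
    const result = await review(draft);
    if (result.score >= QUALITY_THRESHOLD) {
      return { status: 'approved', draft, attempts: attempt };
    }
    // Feed the reviewer's critique into the next writer prompt.
    feedback = result.feedback;
  }
  return { status: 'needs_human_review', draft, attempts: MAX_RETRIES + 1 };
}
```

The last draft is returned even on failure so a human reviewer starts from the best available attempt rather than from nothing.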

Parallelism: Where It Works and Where It Breaks

Writer Agents can run in parallel — each module is independent. We spawn up to 8 concurrent Writer Agents using Promise.allSettled:

async function writeModulesInParallel(
  modules: DocumentModule[],
  context: SharedContext
): Promise<DocumentModule[]> {
  const chunks = chunkArray(modules, 8); // max 8 concurrent
  const results: DocumentModule[] = [];

  for (const chunk of chunks) {
    const settled = await Promise.allSettled(
      chunk.map(module => writerAgent.process(module, context))
    );

    // allSettled preserves input order, so index i maps back to chunk[i]
    settled.forEach((result, i) => {
      if (result.status === 'fulfilled') {
        results.push(result.value);
      } else {
        // mark failed; the orchestrator will retry it
        results.push(markFailed(chunk[i]));
      }
    });
  }

  return results;
}
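The function above relies on two helpers it does not show. A minimal sketch of what they might look like follows; both bodies are assumptions, and in particular whether `markFailed` should also touch `retryCount` is left to the orchestrator here.

```typescript
// Hypothetical helper: split an array into consecutive batches of at most `size` items.
function chunkArray<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error('size must be at least 1');
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Hypothetical helper: return a copy of the module in the 'failed' state,
// leaving the original object untouched.
function markFailed<M extends { state: string }>(mod: M): M {
  return { ...mod, state: 'failed' };
}
```

One consequence of batching this way: each batch waits for its slowest writer before the next batch starts, so "up to 8 concurrent" is a ceiling rather than a steady state.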

What doesn't parallelize well: anything that needs global consistency. The Formatter Agent must run sequentially because it maintains a cross-reference map — if two formatter instances run concurrently they produce conflicting internal link structures. We tried distributed locking on the reference map. It was brittle. Sequential formatting was the right call.
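To make the consistency problem concrete, here is a minimal sketch of sequential formatting around a single cross-reference map. The `CrossReferenceMap` class, the slug rules, and the `formatOne` callback (standing in for the Formatter Agent call) are all illustrative assumptions.

```typescript
interface Draft { title: string; body: string; }
interface Formatted { anchor: string; body: string; }

// Hypothetical cross-reference map: section title -> unique anchor slug.
class CrossReferenceMap {
  private anchors = new Map<string, string>();
  private used = new Set<string>();

  register(title: string): string {
    const existing = this.anchors.get(title);
    if (existing !== undefined) return existing;
    const base = title
      .toLowerCase()
      .replace(/[^a-z0-9]+/g, '-')
      .replace(/^-+|-+$/g, '');
    // On a collision, append -2, -3, ... in registration order.
    let slug = base;
    let n = 2;
    while (this.used.has(slug)) slug = `${base}-${n++}`;
    this.used.add(slug);
    this.anchors.set(title, slug);
    return slug;
  }
}

// Format strictly one module after another so each call sees every
// anchor registered before it.
async function formatSequentially(
  drafts: Draft[],
  formatOne: (draft: Draft, anchor: string) => Promise<string>
): Promise<Formatted[]> {
  const refs = new CrossReferenceMap();
  const formatted: Formatted[] = [];
  for (const draft of drafts) {
    const anchor = refs.register(draft.title);
    formatted.push({ anchor, body: await formatOne(draft, anchor) });
  }
  return formatted;
}
```

Because collision suffixes depend on registration order, two formatter instances working from separate copies of the map could hand the same anchor to different sections, which is the kind of conflict described above.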

Prompt Architecture: The Part Nobody Talks About

The agents are only as good as their prompts. Our production prompts have four sections:

1. Role definition — what this agent is, what it optimizes for, what it explicitly ignores

2. Input schema — structured description of what the agent receives

3. Output schema — strict JSON format the agent must produce

4. Failure modes — explicit instructions for what to do when input is ambiguous, incomplete, or contradictory

We added the failure mode section only after the system was in production. Agents without it hallucinated confidently when given ambiguous input. Agents with explicit failure mode instructions instead returned structured { "status": "needs_clarification", "question": "..." } responses that the orchestrator could handle gracefully.
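A minimal sketch of how an orchestrator might validate and route these responses. Only the `needs_clarification` shape comes from the text above; the `ok` variant, its `output` field, and both function names are assumptions for illustration.

```typescript
// Assumed response envelope: a finished result or a clarification request.
type AgentResponse =
  | { status: 'ok'; output: string }
  | { status: 'needs_clarification'; question: string };

// Parse raw model text and reject anything that does not match the schema.
function parseAgentResponse(raw: string): AgentResponse {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch (err) {
    throw new Error('Agent returned non-JSON output');
  }
  if (typeof data !== 'object' || data === null) {
    throw new Error('Agent response must be a JSON object');
  }
  const obj = data as Record<string, unknown>;
  if (obj['status'] === 'ok' && typeof obj['output'] === 'string') {
    return { status: 'ok', output: obj['output'] };
  }
  if (obj['status'] === 'needs_clarification' && typeof obj['question'] === 'string') {
    return { status: 'needs_clarification', question: obj['question'] };
  }
  throw new Error('Agent response does not match the expected schema');
}

// The discriminated union lets the compiler check that every status is handled.
function routeResponse(response: AgentResponse): string {
  switch (response.status) {
    case 'ok':
      return `accept: ${response.output}`;
    case 'needs_clarification':
      return `escalate: ${response.question}`;
  }
}
```

Validating at the boundary keeps a malformed response from leaking into downstream agents as if it were documentation.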

The GitHub Copilot SDK Integration

Orchestrator-15 uses the GitHub Copilot SDK for the Analyzer Agent specifically — the SDK's code-understanding capabilities are significantly stronger than general LLM prompting for structural code analysis. It can identify:

Public API surfaces vs. internal implementation details
Dependency graphs between modules
Comment density and existing documentation coverage
Test coverage as a proxy for module stability

The Analyzer feeds this structured analysis to the Writer Agent, which dramatically reduces hallucinated API signatures, one of the most common failures in pure-LLM documentation generation.

What We'd Do Differently

Use structured outputs from the start. We started with free-form text outputs and added JSON schemas later. Every agent refactor was painful because downstream agents had built implicit assumptions about output format. Define your schemas before writing a single agent prompt.

Build the reviewer first. We built it last. If we'd built the quality rubric and reviewer first, we would have caught bad writer prompt patterns on day 1 instead of in week 4.

Token budgets per agent. Without explicit token limits per agent, the Writer Agent would occasionally produce exhaustive output for simple modules and thin output for complex ones. Calibrating per-module token budgets based on the Analyzer's complexity score (lines of code, dependency count) significantly improved consistency.
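A minimal sketch of a per-module budget derived from the two complexity signals mentioned above. The linear formula and every constant in it are illustrative assumptions, not tuned values from Orchestrator-15.

```typescript
interface Complexity {
  linesOfCode: number;
  dependencyCount: number;
}

// All constants below are made up for illustration.
const MIN_TOKENS = 300;            // floor so tiny modules still get a usable doc
const MAX_TOKENS = 4000;           // ceiling so huge modules cannot run away
const TOKENS_PER_LINE = 2;
const TOKENS_PER_DEPENDENCY = 100;

// Scale the budget with module size and coupling, then clamp it.
function tokenBudget(c: Complexity): number {
  const raw =
    MIN_TOKENS +
    c.linesOfCode * TOKENS_PER_LINE +
    c.dependencyCount * TOKENS_PER_DEPENDENCY;
  return Math.min(MAX_TOKENS, Math.max(MIN_TOKENS, Math.round(raw)));
}

const smallModule = tokenBudget({ linesOfCode: 50, dependencyCount: 1 });
const largeModule = tokenBudget({ linesOfCode: 5000, dependencyCount: 20 });
```

Here `smallModule` is 500 and `largeModule` is clamped to 4000, so the budget tracks complexity without letting either extreme dominate.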

The Repo

Orchestrator-15 is open source. You can find it on the Zeppelin Labs GitHub. We're actively developing it — issues and PRs welcome.

If you're building multi-agent systems and want to compare notes, drop a comment below or reach out through zeppelinlabs.digital.

Built at Zeppelin Labs — a software development studio building SaaS products, AI systems, and automation platforms from Islamabad, Pakistan.

Top comments (3)

Bitwise Stash • May 25

A good demonstration of skills and a well-aligned project

Muhammad Adnan Sultan • May 25

Thank you. My company and I try our best to build good content for the community.

Harjot Singh • May 31

Docs are a great first multi-agent use case because the task decomposes so cleanly into specialist roles - one agent extracts structure from the code, one drafts prose, one checks accuracy against the source, one enforces style. That natural division is exactly where multi-agent shines vs one model trying to hold the whole job. The "what we learned" is the valuable part though, because doc generation surfaces the universal multi-agent lessons fast: handoff format between agents, and the killer one for docs specifically - keeping the output truthful to the actual code rather than plausibly-wrong.

That accuracy-verification step is where doc agents live or die: confidently-wrong documentation is worse than none, because people trust it. A verify-against-source gate (does this doc claim match what the code actually does) is the difference between helpful and dangerous. That gate-the-output discipline is core to how I build Moonshift (a multi-agent pipeline that ships a prompt to a deployed SaaS) - generated artifacts get verified against ground truth, not trusted raw. Really useful writeup. What was the hardest lesson - the inter-agent handoffs, or keeping the docs accurate to the code? The accuracy problem seems the nastier one for docs.