<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Muhammad Adnan Sultan</title>
    <description>The latest articles on DEV Community by Muhammad Adnan Sultan (@madnansultan).</description>
    <link>https://dev.to/madnansultan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1528425%2F174374ee-2b54-4c5e-8423-cf5fb4d085e9.webp</url>
      <title>DEV Community: Muhammad Adnan Sultan</title>
      <link>https://dev.to/madnansultan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/madnansultan"/>
    <language>en</language>
    <item>
      <title>How We Built a Multi-Agent AI Documentation System (And What We Learned)</title>
      <dc:creator>Muhammad Adnan Sultan</dc:creator>
      <pubDate>Mon, 25 May 2026 05:36:06 +0000</pubDate>
      <link>https://dev.to/madnansultan/how-we-built-a-multi-agent-ai-documentation-system-and-what-we-learned-3mce</link>
      <guid>https://dev.to/madnansultan/how-we-built-a-multi-agent-ai-documentation-system-and-what-we-learned-3mce</guid>
      <description>&lt;p&gt;Last quarter at &lt;a href="https://zeppelinlabs.digital" rel="noopener noreferrer"&gt;Zeppelin Labs&lt;/a&gt;, we shipped Orchestrator-15 — a multi-agent documentation generation platform that takes a codebase or idea spec and produces production-grade technical documentation using coordinated AI agents.&lt;/p&gt;

&lt;p&gt;This post covers the architecture, the mistakes, and the specific patterns that made multi-agent coordination actually work in production. Not a tutorial — a war story.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Multi-Agent, Not Just One Big Prompt?
&lt;/h2&gt;

&lt;p&gt;The naive approach to AI documentation generation is one giant prompt: &lt;em&gt;"here's my codebase, write the docs."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It fails for the same reason you wouldn't ask one person to simultaneously be a technical writer, an API analyst, a diagram designer, and an editor. Context windows are finite. Tasks have different optimization targets. And a single agent trying to do everything produces mediocre output across the board.&lt;/p&gt;

&lt;p&gt;The multi-agent approach assigns specialized roles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Analyzer Agent&lt;/strong&gt; — reads the codebase structure, identifies modules, maps dependencies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writer Agent&lt;/strong&gt; — takes structured analysis output and produces prose documentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Formatter Agent&lt;/strong&gt; — applies templates, ensures consistency, handles cross-references&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reviewer Agent&lt;/strong&gt; — checks completeness, flags gaps, scores output quality
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each agent is good at one thing. The orchestrator coordinates them in sequence, and sometimes in parallel.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input (codebase / spec)
        │
        ▼
┌──────────────────┐
│  Orchestrator    │  ← decides task graph, manages state
└──────┬───────────┘
       │
   ┌───┴────────────────────────────┐
   │                                │
   ▼                                ▼
Analyzer Agent                 Context Builder
(GPT-4o, low temp)            (builds shared memory)
   │
   ▼
Writer Agent(s)          ← spawned per module, run in parallel
(Claude 3.5, temp 0.7)
   │
   ▼
Formatter Agent
(structured output)
   │
   ▼
Reviewer Agent           ← gates output quality
(GPT-4o, strict prompt)
   │
   ▼
Final Documentation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key design decision: &lt;strong&gt;shared memory over message passing.&lt;/strong&gt; Each agent reads from and writes to a shared context object rather than receiving inputs directly from the previous agent. This lets the Reviewer Agent access the Analyzer's raw output without it being filtered through the Writer — which turned out to be critical for catching documentation that technically read well but missed important implementation details.&lt;/p&gt;
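
&lt;p&gt;To make the shared-memory idea concrete, here is a minimal sketch of a context object that every agent reads from and writes to. The post's code references a &lt;code&gt;SharedContext&lt;/code&gt; type but never shows it, so the fields, methods, and sample strings below are assumptions made for illustration, not the Orchestrator-15 implementation.&lt;/p&gt;

```typescript
// Hypothetical shape of the shared context; none of these fields or methods
// are taken from the Orchestrator-15 source.
interface ModuleRecord {
  analyzerOutput?: string; // raw Analyzer result, stored unmodified
  draft?: string;          // Writer output
}

class SharedContext {
  private records: { [moduleId: string]: ModuleRecord } = {};

  // Merge a partial update into the record for one module.
  write(moduleId: string, update: ModuleRecord): void {
    this.records[moduleId] = { ...this.records[moduleId], ...update };
  }

  // Any agent can read any field, regardless of which agent wrote it.
  read(moduleId: string): ModuleRecord {
    return this.records[moduleId] ?? {};
  }
}

const context = new SharedContext();
context.write("auth", { analyzerOutput: "exports: login(), logout()" });
context.write("auth", { draft: "The auth module handles sign-in." });

// The Reviewer sees the Analyzer's raw output next to the draft,
// so it can notice that logout() is never mentioned in the prose.
const record = context.read("auth");
const mentionsLogout = (record.draft ?? "").includes("logout");
console.log(mentionsLogout); // prints false
```

&lt;p&gt;With pure message passing, the Reviewer would only receive the draft and would have no way to run a check like this one.&lt;/p&gt;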




&lt;h2&gt;
  
  
  The State Machine
&lt;/h2&gt;

&lt;p&gt;Each document module moves through states:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;ModuleState&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pending&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;analyzing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;writing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;formatting&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;reviewing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;approved&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;failed&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;DocumentModule&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ModuleState&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;analyzerOutput&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;AnalysisResult&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;draft&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;formattedDraft&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;reviewScore&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;reviewFeedback&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;retryCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Modules that fail the Reviewer Agent's quality gate (score &amp;lt; 0.75 on our rubric) get re-queued to the Writer Agent with the review feedback included in the prompt. We cap retries at 3 before flagging for human review.&lt;/p&gt;
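
&lt;p&gt;The quality gate described above could be sketched roughly as follows. The 0.75 threshold and the cap of 3 retries come from the post; the function names, the &lt;code&gt;needs_human_review&lt;/code&gt; state, and the choice to treat a &lt;code&gt;retryCount&lt;/code&gt; of 3 or more as exhausted are assumptions for this example.&lt;/p&gt;

```typescript
// Values stated in the post.
const APPROVAL_THRESHOLD = 0.75;
const MAX_RETRIES = 3;

// 'approved' and 'writing' match the post's ModuleState;
// 'needs_human_review' is a hypothetical name for the escalation path.
type NextState = 'approved' | 'writing' | 'needs_human_review';

interface ReviewedModule {
  reviewScore: number;
  reviewFeedback: string;
  retryCount: number;
}

// Decide where a module goes after the Reviewer Agent has scored it.
function nextStateAfterReview(mod: ReviewedModule): NextState {
  if (mod.reviewScore >= APPROVAL_THRESHOLD) {
    return 'approved';
  }
  if (mod.retryCount >= MAX_RETRIES) {
    return 'needs_human_review';
  }
  return 'writing'; // re-queue to the Writer Agent
}

// Append the reviewer's feedback to the Writer prompt for the next attempt.
function buildRetryPrompt(basePrompt: string, mod: ReviewedModule): string {
  return `${basePrompt}\n\nReviewer feedback from the previous attempt:\n${mod.reviewFeedback}`;
}

const reviewed = { reviewScore: 0.6, reviewFeedback: "Missing error cases.", retryCount: 1 };
console.log(nextStateAfterReview(reviewed)); // prints writing
```

&lt;p&gt;Keeping the decision in one pure function makes the threshold and retry cap easy to unit test without calling any model.&lt;/p&gt;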

&lt;p&gt;This retry loop was the single biggest quality improvement we made. When we approved first-pass Writer output directly, the documentation was grammatically fine but structurally shallow. With the reviewer feedback loop, output quality improved substantially, especially for complex modules.&lt;/p&gt;




&lt;h2&gt;
  
  
  Parallelism: Where It Works and Where It Breaks
&lt;/h2&gt;

&lt;p&gt;Writer Agents can run in parallel — each module is independent. We spawn up to 8 concurrent Writer Agents using &lt;code&gt;Promise.allSettled&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;writeModulesInParallel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;modules&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;DocumentModule&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
  &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;SharedContext&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;DocumentModule&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;chunkArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;modules&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// max 8 concurrent&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="na"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;DocumentModule&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;settled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;allSettled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kr"&gt;module&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;writerAgent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kr"&gt;module&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;settled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fulfilled&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// mark failed, will retry with orchestrator&lt;/span&gt;
        &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;markFailed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;settled&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;indexOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)]));&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What doesn't parallelize well:&lt;/strong&gt; anything that needs global consistency. The Formatter Agent must run sequentially because it maintains a cross-reference map — if two formatter instances run concurrently they produce conflicting internal link structures. We tried distributed locking on the reference map. It was brittle. Sequential formatting was the right call.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prompt Architecture: The Part Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;The agents are only as good as their prompts. Our production prompts have four sections:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Role definition&lt;/strong&gt; — what this agent is, what it optimizes for, what it explicitly ignores&lt;br&gt;&lt;br&gt;
&lt;strong&gt;2. Input schema&lt;/strong&gt; — structured description of what the agent receives&lt;br&gt;&lt;br&gt;
&lt;strong&gt;3. Output schema&lt;/strong&gt; — strict JSON format the agent must produce&lt;br&gt;&lt;br&gt;
&lt;strong&gt;4. Failure modes&lt;/strong&gt; — explicit instructions for what to do when input is ambiguous, incomplete, or contradictory&lt;/p&gt;

&lt;p&gt;We added the failure-mode section only after the system was already in production. Agents without it hallucinated confidently when given ambiguous input. Agents with explicit failure-mode instructions instead returned structured &lt;code&gt;{ "status": "needs_clarification", "question": "..." }&lt;/code&gt; responses that the orchestrator could handle gracefully.&lt;/p&gt;
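
&lt;p&gt;One way an orchestrator might consume such responses is with a discriminated union. Only the &lt;code&gt;needs_clarification&lt;/code&gt; shape appears in the post; the &lt;code&gt;ok&lt;/code&gt; and &lt;code&gt;invalid&lt;/code&gt; variants, the function names, and the next-step strings are assumptions in this sketch.&lt;/p&gt;

```typescript
// Only the needs_clarification variant is taken from the post.
// The 'ok' and 'invalid' variants are hypothetical.
type AgentResponse =
  | { status: 'ok'; output: string }
  | { status: 'needs_clarification'; question: string }
  | { status: 'invalid'; reason: string };

// Parse raw agent output defensively instead of trusting its shape.
function parseAgentResponse(raw: string): AgentResponse {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { status: 'invalid', reason: 'Response was not valid JSON.' };
  }
  if (typeof parsed !== 'object' || parsed === null) {
    return { status: 'invalid', reason: 'Response was not a JSON object.' };
  }
  const fields = parsed as { [key: string]: unknown };
  const question = fields.question;
  const output = fields.output;
  if (fields.status === 'needs_clarification') {
    if (typeof question === 'string') {
      return { status: 'needs_clarification', question };
    }
  }
  if (fields.status === 'ok') {
    if (typeof output === 'string') {
      return { status: 'ok', output };
    }
  }
  return { status: 'invalid', reason: 'Response did not match the expected schema.' };
}

// How an orchestrator might branch on the parsed result.
function describeNextStep(response: AgentResponse): string {
  switch (response.status) {
    case 'ok':
      return 'continue pipeline';
    case 'needs_clarification':
      return `pause module and ask: ${response.question}`;
    case 'invalid':
      return `retry agent call (${response.reason})`;
  }
}

const reply = parseAgentResponse(
  '{ "status": "needs_clarification", "question": "Which auth flow is current?" }'
);
console.log(describeNextStep(reply));
```

&lt;p&gt;Because every branch is a tagged variant, the compiler flags any orchestrator code that forgets to handle one of the cases.&lt;/p&gt;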




&lt;h2&gt;
  
  
  The GitHub Copilot SDK Integration
&lt;/h2&gt;

&lt;p&gt;Orchestrator-15 uses the GitHub Copilot SDK for the Analyzer Agent specifically — the SDK's code-understanding capabilities are significantly stronger than general LLM prompting for structural code analysis. It can identify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public API surfaces vs. internal implementation details&lt;/li&gt;
&lt;li&gt;Dependency graphs between modules&lt;/li&gt;
&lt;li&gt;Comment density and existing documentation coverage&lt;/li&gt;
&lt;li&gt;Test coverage as a proxy for module stability
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Analyzer feeds this structured analysis to the Writer Agent, which dramatically reduces hallucinated API signatures, one of the most common failures in pure-LLM documentation generation.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use structured outputs from the start.&lt;/strong&gt; We started with free-form text outputs and added JSON schemas later. Every agent refactor was painful because downstream agents had built implicit assumptions about output format. Define your schemas before writing a single agent prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build the reviewer first.&lt;/strong&gt; We built it last. If we'd built the quality rubric and reviewer first, we would have caught bad writer prompt patterns on day 1 instead of in week 4.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token budgets per agent.&lt;/strong&gt; Without explicit token limits per agent, the Writer Agent would occasionally produce exhaustive output for simple modules and thin output for complex ones. Calibrating per-module token budgets based on the Analyzer's complexity score (lines of code, dependency count) significantly improved consistency.&lt;/p&gt;
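
&lt;p&gt;A per-module budget of this kind might be computed along the following lines. The post names the inputs (lines of code and dependency count) but gives no formula or numbers, so the linear weighting and every constant below are purely illustrative assumptions.&lt;/p&gt;

```typescript
// Inputs named in the post; field names are hypothetical.
interface Complexity {
  linesOfCode: number;
  dependencyCount: number;
}

// Illustrative constants only; the post does not state real values.
const MIN_TOKENS = 300;
const MAX_TOKENS = 4000;
const TOKENS_PER_LINE = 2;
const TOKENS_PER_DEPENDENCY = 150;

// Scale the Writer Agent's output budget with module complexity,
// clamped so simple modules stay short and large ones stay bounded.
function tokenBudget(c: Complexity): number {
  const raw =
    MIN_TOKENS +
    c.linesOfCode * TOKENS_PER_LINE +
    c.dependencyCount * TOKENS_PER_DEPENDENCY;
  return Math.max(MIN_TOKENS, Math.min(MAX_TOKENS, Math.round(raw)));
}

console.log(tokenBudget({ linesOfCode: 40, dependencyCount: 1 }));    // prints 530
console.log(tokenBudget({ linesOfCode: 2500, dependencyCount: 12 })); // prints 4000
```

&lt;p&gt;The clamp matters as much as the weighting: the lower bound keeps trivial modules from getting empty docs, and the upper bound stops a single large module from consuming the whole run's budget.&lt;/p&gt;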




&lt;h2&gt;
  
  
  The Repo
&lt;/h2&gt;

&lt;p&gt;Orchestrator-15 is open source. You can find it on the &lt;a href="https://github.com/zeppelin-labs" rel="noopener noreferrer"&gt;Zeppelin Labs GitHub&lt;/a&gt;. We're actively developing it — issues and PRs welcome.&lt;/p&gt;

&lt;p&gt;If you're building multi-agent systems and want to compare notes, drop a comment below or reach out through &lt;a href="https://zeppelinlabs.digital" rel="noopener noreferrer"&gt;zeppelinlabs.digital&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built at &lt;a href="https://zeppelinlabs.digital" rel="noopener noreferrer"&gt;Zeppelin Labs&lt;/a&gt; — a software development studio building SaaS products, AI systems, and automation platforms from Islamabad, Pakistan.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>career</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
