DEV Community

Alex

Open-source multi-agent pipeline: 61K Python, 12 agents, 5 quality gates...

I spent the last month building an open-source (MIT) pipeline that takes a plain-language idea and runs it through 12 specialized agents — analyst, PM, architect, design critic, developer, QA, security, DevOps, marketing, and more — with 5 quality gates, a strict state machine with recovery, and an AI Director that autonomously manages the whole thing.
Think Bolt.new or Lovable, but self-hosted, MIT licensed, with quality gates that actually prevent the model from shipping broken stubs.
The interesting part isn't the LLM calls. Here's what broke in production.

  1. LLM failover creates consistency problems
    I have 6+ providers (DeepSeek, Anthropic, OpenAI, Ollama, Groq, etc.) with automatic health-check failover every 60s. The footgun: DeepSeek and Claude write different code. Same prompt, wildly different output structure. If the router switches providers mid-pipeline, the architect output (Claude) won't match what the developer agent (DeepSeek) expects.
    Solution: task-level pinning. Heavy tasks (architect, developer) stay locked to the primary provider. Light tasks (marketing copy, naming) can fall back freely. I also added a model capability matrix check before routing — otherwise you get an architect running on a 7B local model producing garbage.
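
The routing logic can be sketched roughly like this (a toy sketch, assuming a simple provider-dict shape; `HEAVY_TASKS`, `CAPABILITIES`, and `pick_provider` are illustrative names, not the repo's actual API):

```python
# Hypothetical sketch of task-level provider pinning plus a capability check.
HEAVY_TASKS = {"architect", "developer"}   # pinned to the primary provider
CAPABILITIES = {                           # minimum model size (B params) per task
    "architect": 70, "developer": 30, "marketing": 7, "naming": 7,
}

def pick_provider(task: str, primary: dict, fallbacks: list[dict]) -> dict:
    """Heavy tasks stay on the primary; light tasks may fail over freely."""
    candidates = [primary] if task in HEAVY_TASKS else [primary, *fallbacks]
    for p in candidates:
        if p["healthy"] and p["model_size_b"] >= CAPABILITIES.get(task, 7):
            return p
    # Better to stall a heavy task than to hand it to an incapable model.
    raise RuntimeError(f"no capable healthy provider for task {task!r}")

primary = {"name": "claude", "healthy": False, "model_size_b": 200}
ollama = {"name": "ollama-7b", "healthy": True, "model_size_b": 7}

print(pick_provider("naming", primary, [ollama])["name"])  # → ollama-7b
```

The key design choice: failing over is opt-in per task type, so an unhealthy primary never silently downgrades the architect.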

  2. State machines need to survive the model being wrong
    11 states, 34 valid transitions, JSON + SQLite dual persistence. Sounds solid until the model writes a corrupted artifact that crashes the state machine on the next task load.
    Had to add:

  • Recovery fallback: if the JSON parse fails, restore from the SQLite snapshot

  • Stranded product recovery: products stuck in pm_quality_fail because the model hallucinated a non-existent file path

  • Async save with timeout guards, so a slow disk write doesn't block the pipeline
    The lesson: your state machine needs to survive both a wrong model AND a corrupted disk. Not theoretical — happened in production.
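
The recovery fallback can be sketched like this (a minimal sketch, assuming a single `snapshots` table; the table name and `load_state` signature are illustrative):

```python
# Hypothetical sketch of the JSON-first / SQLite-fallback state load.
import json
import os
import sqlite3
import tempfile

def load_state(json_path: str, db: sqlite3.Connection) -> dict:
    """Prefer the JSON artifact; fall back to the latest SQLite snapshot."""
    try:
        with open(json_path) as f:
            return json.load(f)
    except (OSError, json.JSONDecodeError):
        row = db.execute(
            "SELECT payload FROM snapshots ORDER BY id DESC LIMIT 1"
        ).fetchone()
        if row is None:
            raise RuntimeError("no JSON and no snapshot: unrecoverable")
        return json.loads(row[0])

# Demo: a corrupted JSON artifact plus a good snapshot.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE snapshots (id INTEGER PRIMARY KEY, payload TEXT)")
db.execute("INSERT INTO snapshots (payload) VALUES (?)",
           (json.dumps({"state": "pm_review"}),))

fd, path = tempfile.mkstemp(suffix=".json")
with os.fdopen(fd, "w") as f:
    f.write("{ not valid json")          # the model wrote a broken artifact

state = load_state(path, db)             # falls back to the snapshot
os.remove(path)
print(state["state"])                    # → pm_review
```

Catching `OSError` alongside `JSONDecodeError` matters: a corrupted disk and a corrupted artifact should hit the same fallback path.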

  3. The Director AI feedback loop problem
    The Director runs a 6-phase autonomous cycle: route chat → analyze metrics → generate decisions → apply actions → rank what to build next → log.
    The footgun: feedback loops. Director generates a decision → applies it → next cycle reads its own output → generates another decision based on that → infinite loop. Had to add noop detection that breaks the cycle when decisions become empty.
    The chat classification is also tricky. The Director classifies owner messages as new_idea, product_feedback, or general_directive via LLM. If it misclassifies "fix the login page" as new_idea, you get a duplicate product instead of a bug fix. I added an orphan feedback heuristic: if a message mentions a product name that doesn't exist yet, route to new_idea; otherwise link to the existing product.
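
The noop cutoff can be sketched as follows (illustrative names throughout; the real Director phases are richer than this loop):

```python
# Hypothetical sketch of noop detection breaking the Director feedback loop.
def run_director_cycle(generate_decisions, apply, max_cycles: int = 10) -> int:
    """Run cycles until the model stops producing actionable decisions."""
    for cycle in range(max_cycles):
        decisions = generate_decisions()
        actionable = [d for d in decisions if d.get("action") not in (None, "noop")]
        if not actionable:        # the Director is now reacting to its own output
            return cycle          # break the loop instead of spinning forever
        for d in actionable:
            apply(d)
    return max_cycles             # hard cap as a second line of defense

# Demo: a fake Director that runs dry after one real decision.
queue = [[{"action": "reprioritize"}], [{"action": "noop"}]]
applied = []
cycles = run_director_cycle(
    generate_decisions=lambda: queue.pop(0) if queue else [],
    apply=applied.append,
)
print(cycles, len(applied))   # → 1 1
```

The `max_cycles` cap is there because "decisions become empty" is a heuristic: a model that keeps emitting trivially different decisions would otherwise loop forever.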

  4. Quality gates — what I wish I'd built first
    | Gate | What it checks |
    |------|---------------|
    | Demo quality | 12 checkpoints: contrast, CTA, broken links, spec coverage |
    | Browser E2E | Playwright crawl (desktop + mobile), JS errors, 404s |
    | Visual QA | 9 heuristics: contrast ratio, CSS vars, empty states, nav |
    | Security | AST scan: eval(), innerHTML, exposed tokens, hardcoded secrets |
    | Methodology | Domain packs: fintech, ecomm, healthcare, etc |

Real example: visual QA flagged a white-on-white CTA button — the model generated `color: white` on `background: white`, assuming a dark theme that wasn't applied. The gate caught it and sent it back to the developer with the exact CSS selector. Fixed next cycle.
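
The contrast heuristic behind that catch can be sketched with the standard WCAG formulas (a minimal sketch; `check_cta` and the 4.5 threshold are illustrative, though 4.5:1 is the WCAG AA minimum for normal text):

```python
# Hypothetical sketch of the visual-QA contrast check (WCAG 2.x formulas).
def luminance(rgb: tuple[int, int, int]) -> float:
    """Relative luminance of an sRGB color, per WCAG 2.x."""
    def chan(c: int) -> float:
        c /= 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (chan(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg) -> float:
    l1, l2 = sorted((luminance(fg), luminance(bg)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

def check_cta(selector: str, fg, bg, minimum: float = 4.5):
    """Return a developer-facing finding, or None if the contrast is fine."""
    ratio = contrast_ratio(fg, bg)
    if ratio < minimum:
        # The finding carries the exact selector so the fix is mechanical.
        return f"{selector}: contrast {ratio:.2f} < {minimum} (fg={fg}, bg={bg})"
    return None

print(check_cta(".cta-button", (255, 255, 255), (255, 255, 255)))
```

White on white comes out at exactly 1.0, the worst possible ratio, so it trips the gate immediately.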

  5. Preview fidelity is pure web engineering
    When AI-generated code runs in a sandboxed iframe, every web platform quirk is amplified: relative URLs break, the `<base>` URL is missing, CSP blocks inline styles, `target="_top"` kills navigation. Had to write a dedicated URL rewriter that: injects a `<base>` tag pointing to the correct sandbox route, rewrites root-absolute `/` links to relative ones, adds permissive CSP headers, and strips `target="_top"`. Not AI work. But without it, the preview is broken and users blame you, not the LLM.
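
The rewriter's core moves can be sketched like this (a regex-based toy sketch; a real implementation would use an HTML parser, and the server-side CSP header piece is omitted):

```python
# Hypothetical sketch of the sandbox URL rewriter.
import re

def rewrite_for_sandbox(html: str, sandbox_base: str) -> str:
    # 1. Strip target="_top" so navigation stays inside the iframe.
    html = re.sub(r'\starget="_top"', "", html)
    # 2. Rewrite root-absolute links (/x) to relative ones, leaving
    #    protocol-relative URLs (//cdn...) untouched.
    html = re.sub(r'(href|src)="/(?!/)', r'\1="', html)
    # 3. Inject a <base> tag pointing at the sandbox route.
    return html.replace("<head>", f'<head><base href="{sandbox_base}">', 1)

page = '<html><head></head><body><a href="/about" target="_top">About</a></body></html>'
print(rewrite_for_sandbox(page, "/preview/p42/"))
```

Order matters here: the `<base>` injection must come last, or the absolute-to-relative pass would mangle the `href` it just inserted.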
  • 61,503 Python LOC, 22,997 TypeScript/TSX LOC
  • 12 specialized agents, 5 quality gates
  • 11 pipeline states, 34 valid transitions
  • 6+ LLM providers with auto-failover
  • 72 test files, MIT licensed

Repo: github.com/alexar76/aicom — FastAPI + Next.js + Docker Compose, self-hosted, MIT, BYO API keys.
