Open-source multi-agent pipeline: 61K Python, 12 agents, 5 quality gates...

#agents #opensource #python #showdev

I spent the last month building an open-source (MIT) pipeline that takes a plain-language idea and runs it through 12 specialized agents — analyst, PM, architect, design critic, developer, QA, security, DevOps, marketing, and more — with 5 quality gates, a strict state machine with recovery, and an AI Director that autonomously manages the whole thing.
Think Bolt.new or Lovable, but self-hosted, MIT licensed, with quality gates that actually prevent the model from shipping broken stubs.
The interesting part isn't the LLM calls. Here's what broke in production.

LLM failover creates consistency problems
I have 6+ providers (DeepSeek, Anthropic, OpenAI, Ollama, Groq, etc.) with health checks every 60s and automatic failover when one goes down. The footgun: DeepSeek and Claude write different code. Same prompt, wildly different output structure. If the router switches providers mid-pipeline, the architect output (Claude) won't match what the developer agent (DeepSeek) expects.
Solution: task-level pinning. Heavy tasks (architect, developer) stay locked to the primary provider. Light tasks (marketing copy, naming) can fall back freely. I also added a model capability matrix check before routing — otherwise you get an architect running on a 7B local model producing garbage.
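
A minimal sketch of how task-level pinning plus a capability check could look. Every name here (`Provider`, `HEAVY_TASKS`, `REQUIRED_TIER`, `pick_provider`) is hypothetical and the tier numbers are illustrative; the real router in the repo may be organized differently. It also assumes a pinned task should fail loudly rather than switch providers when the primary is down.

```python
from dataclasses import dataclass

# Hypothetical names throughout: the post does not show the real router API.
HEAVY_TASKS = {"architect", "developer"}  # pinned to the primary provider
# Minimum capability tier each task needs (illustrative values).
REQUIRED_TIER = {"architect": 3, "developer": 3, "marketing": 1, "naming": 1}


@dataclass(frozen=True)
class Provider:
    name: str
    tier: int            # rough capability score; 1 = small local model
    healthy: bool = True  # result of the latest health check


class NoProviderAvailable(RuntimeError):
    pass


def pick_provider(task: str, primary: Provider, fallbacks: list[Provider]) -> Provider:
    """Return the provider for `task`, honoring pinning and capability checks."""
    needed = REQUIRED_TIER.get(task, 1)
    if task in HEAVY_TASKS:
        # Pinned: never silently switch providers mid-pipeline.
        if primary.healthy and primary.tier >= needed:
            return primary
        raise NoProviderAvailable(f"{task} is pinned to {primary.name}, which is unavailable")
    # Light tasks may fall back, but only to a model that is capable enough.
    for candidate in [primary, *fallbacks]:
        if candidate.healthy and candidate.tier >= needed:
            return candidate
    raise NoProviderAvailable(f"no healthy provider meets tier {needed} for {task}")
```

Raising on a pinned task lets the state machine park the product and retry later, instead of mixing two providers' output styles in one run.
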

State machines need to survive the model being wrong
11 states, 34 valid transitions, JSON + SQLite dual persistence. Sounds solid until the model writes a corrupted artifact that crashes the state machine on the next task load.
Had to add:
Recovery fallback: if JSON parse fails, restore from SQLite snapshot
Stranded product recovery: products stuck in pm_quality_fail because the model hallucinated a non-existent file path
Async save with timeout guards so a slow disk write doesn't block the pipeline
The lesson: your state machine needs to survive both a wrong model AND a corrupted disk. Not theoretical — happened in production.
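
The recovery fallback could be sketched roughly like this. The `snapshots` table layout, the `status` key, and the `load_state` name are assumptions made for illustration, not the project's actual schema.

```python
import json
import sqlite3
from pathlib import Path


def load_state(json_path: Path, db: sqlite3.Connection, product_id: str) -> dict:
    """Load pipeline state from JSON, falling back to the latest SQLite snapshot."""
    try:
        state = json.loads(json_path.read_text(encoding="utf-8"))
        # Assumed minimal sanity check: a usable state is a dict with a "status".
        if isinstance(state, dict) and "status" in state:
            return state
    except (OSError, ValueError):
        pass  # missing, unreadable, or corrupted file: fall through to the snapshot

    # Hypothetical table: snapshots(id INTEGER PRIMARY KEY, product_id TEXT, payload TEXT)
    row = db.execute(
        "SELECT payload FROM snapshots WHERE product_id = ? ORDER BY id DESC LIMIT 1",
        (product_id,),
    ).fetchone()
    if row is None:
        raise RuntimeError(f"no usable state for {product_id}")
    state = json.loads(row[0])
    # Repair the JSON copy so the next load takes the fast path again.
    json_path.write_text(json.dumps(state), encoding="utf-8")
    return state
```

Catching `ValueError` covers both `json.JSONDecodeError` and decoding errors from a half-written file.
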

The Director AI feedback loop problem
The Director runs a 6-phase autonomous cycle: route chat → analyze metrics → generate decisions → apply actions → rank what to build next → log.
The footgun: feedback loops. Director generates a decision → applies it → next cycle reads its own output → generates another decision based on that → infinite loop. Had to add noop detection that breaks the cycle when decisions become empty.
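
A rough sketch of noop detection, under stated assumptions: `run_director`, `generate`, and `apply` are hypothetical names, and the sketch goes one step beyond the post by also stopping when a batch of decisions repeats one already applied, not only when the batch is empty.

```python
import hashlib
import json
from typing import Callable

Decision = dict  # e.g. {"action": "rerun_gate", "product": "demo"} (illustrative shape)


def fingerprint(decisions: list[Decision]) -> str:
    """Stable hash of a decision batch, independent of key order."""
    canonical = json.dumps(decisions, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_director(
    generate: Callable[[list[Decision]], list[Decision]],
    apply: Callable[[list[Decision]], None],
    max_cycles: int = 10,
) -> int:
    """Run cycles until the Director has nothing new to do; return cycles applied."""
    seen: set[str] = set()
    previous: list[Decision] = []
    applied = 0
    for _ in range(max_cycles):  # hard cap as a last line of defense
        decisions = generate(previous)
        if not decisions:
            break  # noop: nothing to apply
        key = fingerprint(decisions)
        if key in seen:
            break  # the Director is reacting to its own earlier output
        seen.add(key)
        apply(decisions)
        applied += 1
        previous = decisions
    return applied
```

The `max_cycles` cap is a cheap backstop for loops where each batch differs slightly and the fingerprint check never fires.
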
The chat classification is also tricky. The Director classifies owner messages as new_idea, product_feedback, or general_directive via LLM. If it misclassifies "fix the login page" as new_idea, you get a duplicate product instead of a bug fix. I added an orphan feedback heuristic: if a message mentions a product name that doesn't exist yet, route to new_idea; otherwise link to the existing product.
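
One way to read that heuristic, as a sketch: a message that names an existing product is linked to it as feedback, and feedback that matches no known product is treated as orphaned and routed to `new_idea`. The function names and the plain name-matching are assumptions; the real check may be fuzzier.

```python
import re
from typing import Optional


def find_product(message: str, known_products: list[str]) -> Optional[str]:
    """Return the first known product whose name appears as a whole word."""
    text = message.lower()
    for name in known_products:
        if re.search(rf"\b{re.escape(name.lower())}\b", text):
            return name
    return None


def route_message(
    message: str, llm_label: str, known_products: list[str]
) -> tuple[str, Optional[str]]:
    """Correct the LLM's label using the product registry.

    Returns (label, linked_product_or_None).
    """
    if llm_label == "general_directive":
        return llm_label, None
    product = find_product(message, known_products)
    if product is not None:
        # Names an existing product: treat as feedback, never as a duplicate idea.
        return "product_feedback", product
    # Orphan feedback: no existing product to attach it to.
    return "new_idea", None
```

Note that a message like "fix the login page" names no product at all, so this check alone would still route it to `new_idea`; it only helps when the owner mentions the product.
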

Quality gates: what I wish I'd built first
| Gate | What it checks |
|------|---------------|
| Demo quality | 12 checkpoints: contrast, CTA, broken links, spec coverage |
| Browser E2E | Playwright crawl (desktop + mobile), JS errors, 404s |
| Visual QA | 9 heuristics: contrast ratio, CSS vars, empty states, nav |
| Security | AST scan: eval(), innerHTML, exposed tokens, hardcoded secrets |
| Methodology | Domain packs: fintech, ecomm, healthcare, etc |
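
To make the security row concrete, here is a small sketch of an AST scan using Python's standard `ast` module. It only covers generated Python (checks like `innerHTML` would need a JavaScript parser), and `scan_source`, `DANGEROUS_CALLS`, and `SECRET_HINTS` are illustrative names rather than the gate's real rules.

```python
import ast

DANGEROUS_CALLS = {"eval", "exec"}
SECRET_HINTS = ("token", "secret", "password", "api_key")


def scan_source(source: str) -> list[str]:
    """Return findings for dangerous calls and likely hardcoded secrets."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Direct calls such as eval(...) or exec(...)
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in DANGEROUS_CALLS
        ):
            findings.append(f"line {node.lineno}: call to {node.func.id}()")
        # Non-empty string literal assigned to a secret-looking variable name
        if (
            isinstance(node, ast.Assign)
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, str)
            and node.value.value
        ):
            for target in node.targets:
                if isinstance(target, ast.Name) and any(
                    hint in target.id.lower() for hint in SECRET_HINTS
                ):
                    findings.append(
                        f"line {node.lineno}: hardcoded value assigned to {target.id}"
                    )
    return findings
```

Because `ast.walk` does not visit nodes in line order, sort the findings before showing them to the developer agent.
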

Real example: visual QA flagged a white-on-white CTA button — the model generated color: white on background: white assuming a dark theme that wasn't applied. The gate caught it, sent it back to the developer with the exact CSS selector. Fixed next cycle.
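
A contrast check like the one that caught this can be sketched with the WCAG 2.x contrast-ratio formula. This version assumes colors have already been resolved to 6-digit hex (named colors such as `white` would need a lookup first), and the function names are illustrative.

```python
def relative_luminance(hex_color: str) -> float:
    """WCAG 2.x relative luminance of a 6-digit hex color like '#1a2b3c'."""
    hex_color = hex_color.lstrip("#")
    channels = [int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Convert gamma-encoded sRGB channels to linear light.
    linear = [
        c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
        for c in channels
    ]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground: str, background: str) -> float:
    """Contrast ratio from 1.0 (identical colors) up to 21.0 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(foreground: str, background: str, threshold: float = 4.5) -> bool:
    """WCAG AA asks for at least 4.5:1 for normal-size text."""
    return contrast_ratio(foreground, background) >= threshold
```

White on white gives a ratio of 1.0, far below the 4.5 threshold, so a gate built on this fails the button without needing a screenshot.
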

Preview fidelity is pure web engineering

When AI-generated code runs in a sandbox iframe, every web platform quirk is amplified: relative URLs break, `<base href>` is missing, CSP blocks inline styles, and `target="_top"` kills navigation. I had to write a dedicated URL rewriter that injects a `<base>` tag pointing to the correct sandbox route, rewrites absolute `/` links to relative ones, adds permissive CSP headers, and strips `target="_top"`. Not AI work. But without it, the preview is broken and users blame you, not the LLM.

61,503 Python LOC, 22,997 TypeScript/TSX LOC
12 specialized agents, 5 quality gates
11 pipeline states, 34 valid transitions
6+ LLM providers with auto-failover
72 test files, MIT licensed

Repo: github.com/alexar76/aicom — FastAPI + Next.js + Docker Compose, self-hosted, MIT, BYO API keys.

Top comments (2)

Harjot Singh • Jun 1

the way you tackled the consistency problems with LLM failover is really insightful. it's crucial to ensure that output remains reliable across different providers. moonshift could help streamline your deployments - you can get a full next.js + postgres + auth app up in about 7 min, and you own the code on your github. if you're curious, I'd be happy to offer a free run to give it a spin.

Alex • Jun 1

Thanks, Harjot, glad the failover section worked for you :)
the hardest part wasn't the failover mechanism itself, but keeping the output semantically stable when a request seamlessly transitions to another provider mid-run. I normalized everything into a provider-agnostic schema before it reaches the state machine, plus deterministic retries on gate failures, so retrying with a different model doesn't break the pipeline.

and it's funny you mention Moonshift. I looked into it, and we're clearly working on systems in the same class: you've got 14 agents in a 10-stage workflow with a human-approval gate, while I've got 12 agents + 5 quality gates with an AI Director. I couldn't find how Moonshift handles this part, but my solution is multi-provider, which is exactly why fault tolerance matters so much.
the main difference is positioning: this project is fully open-source and self-hosted, while Moonshift bundles the entire "launch day" (deployment, DNS, launch copy, hero images) into a managed SaaS — frankly, the part most developers skip.

so it's less a direct comparison and more two approaches to the same idea. thanks for the free-run offer, I'll definitely give it a try. curious how you organize the deploy and launch-kit phases. 👍