I spent the last month building an open-source (MIT) pipeline that takes a plain-language idea and runs it through 12 specialized agents — analyst, PM, architect, design critic, developer, QA, security, DevOps, marketing, and more — with 5 quality gates, a strict state machine with recovery, and an AI Director that autonomously manages the whole thing.
Think Bolt.new or Lovable, but self-hosted, MIT licensed, with quality gates that actually prevent the model from shipping broken stubs.
The interesting part isn't the LLM calls. Here's what broke in production.
LLM failover creates consistency problems
I have 6+ providers (DeepSeek, Anthropic, OpenAI, Ollama, Groq, etc.) with automatic health-check failover every 60s. The footgun: DeepSeek and Claude write different code. Same prompt, wildly different output structure. If the router switches providers mid-pipeline, the architect output (Claude) won't match what the developer agent (DeepSeek) expects.
Solution: task-level pinning. Heavy tasks (architect, developer) stay locked to the primary provider. Light tasks (marketing copy, naming) can fall back freely. I also added a model capability matrix check before routing; otherwise you get an architect running on a 7B local model producing garbage.
State machines need to survive the model being wrong
11 states, 34 valid transitions, JSON + SQLite dual persistence. Sounds solid until the model writes a corrupted artifact that crashes the state machine on the next task load.
Had to add:
- Recovery fallback: if JSON parse fails, restore from SQLite snapshot
- Stranded product recovery: products stuck in `pm_quality_fail` because the model hallucinated a non-existent file path
- Async save with timeout guards so a slow disk write doesn't block the pipeline
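The recovery fallback from the first item can be sketched roughly like this. It is a minimal illustration, not the project's actual code: the `snapshots` table, the function names, and the write order (snapshot first, then JSON) are all assumptions made for the example.

```python
import json
import sqlite3
from pathlib import Path


def init_snapshots(db: sqlite3.Connection) -> None:
    # Hypothetical schema: one latest snapshot per product.
    db.execute(
        "CREATE TABLE IF NOT EXISTS snapshots ("
        "product_id TEXT PRIMARY KEY, state_json TEXT NOT NULL)"
    )


def save_state(json_path: Path, db: sqlite3.Connection,
               product_id: str, state: dict) -> None:
    payload = json.dumps(state)
    # Commit the SQLite snapshot first, so a crash or bad write
    # on the JSON file still leaves something to restore from.
    db.execute(
        "INSERT OR REPLACE INTO snapshots (product_id, state_json) VALUES (?, ?)",
        (product_id, payload),
    )
    db.commit()
    json_path.write_text(payload, encoding="utf-8")


def load_state(json_path: Path, db: sqlite3.Connection, product_id: str) -> dict:
    try:
        return json.loads(json_path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing, unreadable, or corrupted JSON (JSONDecodeError is a
        # ValueError): fall back to the last SQLite snapshot.
        row = db.execute(
            "SELECT state_json FROM snapshots WHERE product_id = ?",
            (product_id,),
        ).fetchone()
        if row is None:
            raise RuntimeError(f"no recoverable state for {product_id}")
        return json.loads(row[0])
```

The point of the ordering is that the two stores never fail in the same way at the same moment: a truncated JSON file is caught by the parse error, and the snapshot row is only replaced inside a committed transaction.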
The lesson: your state machine needs to survive both a wrong model AND a corrupted disk. Not theoretical: it happened in production.
The Director AI feedback loop problem
The Director runs a 6-phase autonomous cycle: route chat → analyze metrics → generate decisions → apply actions → rank what to build next → log.
The footgun: feedback loops. Director generates a decision → applies it → next cycle reads its own output → generates another decision based on that → infinite loop. Had to add noop detection that breaks the cycle when decisions become empty.
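A rough sketch of that guard, under stated assumptions: `run_director`, the decision dicts, and the `max_cycles` cap are hypothetical names for illustration. The fingerprint-based dedup is also my addition; it is one way decisions can "become empty" when the model keeps re-proposing what it already applied.

```python
import json
from typing import Callable

Decision = dict
State = dict


def run_director(
    state: State,
    generate: Callable[[State], list[Decision]],
    apply: Callable[[State, Decision], State],
    max_cycles: int = 10,
) -> tuple[State, int]:
    """Run decide/apply cycles until there is nothing new to do.

    Returns the final state and the number of cycles that applied work.
    """
    seen: set[str] = set()
    for cycle in range(max_cycles):
        decisions = generate(state)
        # Assumed dedup: a decision identical to one already applied
        # counts as a noop, so the Director can't feed on its own output.
        fresh = [d for d in decisions
                 if json.dumps(d, sort_keys=True) not in seen]
        if not fresh:
            return state, cycle  # noop detected: break the loop
        for d in fresh:
            seen.add(json.dumps(d, sort_keys=True))
            state = apply(state, d)
    # Hard cap as a second line of defense if decisions never repeat exactly.
    return state, max_cycles
```

With a generator that keeps returning the same decision, this applies it once and exits on the next cycle instead of looping.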
The chat classification is also tricky. The Director classifies owner messages as `new_idea`, `product_feedback`, or `general_directive` via LLM. If it misclassifies "fix the login page" as `new_idea`, you get a duplicate product instead of a bug fix. I added an orphan feedback heuristic: if a message mentions a product name that doesn't exist yet, route to `new_idea`; otherwise link to the existing product.
Quality gates: what I wish I'd built first
| Gate | What it checks |
|------|---------------|
| Demo quality | 12 checkpoints: contrast, CTA, broken links, spec coverage |
| Browser E2E | Playwright crawl (desktop + mobile), JS errors, 404s |
| Visual QA | 9 heuristics: contrast ratio, CSS vars, empty states, nav |
| Security | AST scan: eval(), innerHTML, exposed tokens, hardcoded secrets |
| Methodology | Domain packs: fintech, ecomm, healthcare, etc |
Real example: visual QA flagged a white-on-white CTA button. The model generated `color: white` on `background: white`, assuming a dark theme that wasn't applied. The gate caught it and sent it back to the developer with the exact CSS selector. Fixed next cycle.
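For a sense of what a contrast heuristic like this involves, here is a minimal sketch using the WCAG 2.x relative luminance and contrast ratio formulas. The function names, the 6-digit hex input, the `.cta-button` selector, and the 4.5:1 threshold (WCAG AA for normal text) are assumptions for illustration, not the gate's real implementation.

```python
from typing import Optional


def _linear(channel_8bit: int) -> float:
    # sRGB channel to linear light, per the WCAG 2.x definition.
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    # Expects a 6-digit hex color such as "#1a2b3c".
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(fg: str, bg: str) -> float:
    # Lighter color goes in the numerator, so the ratio is always >= 1.
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


def check_contrast(selector: str, fg: str, bg: str,
                   minimum: float = 4.5) -> Optional[str]:
    """Return a finding string for the developer agent, or None if OK."""
    ratio = contrast_ratio(fg, bg)
    if ratio < minimum:
        return f"{selector}: contrast {ratio:.2f}:1 is below {minimum}:1"
    return None


# White text on a white background fails; white on near-black passes.
finding = check_contrast(".cta-button", "#ffffff", "#ffffff")
```

Returning the selector alongside the measured ratio is what makes the finding actionable: the developer agent gets a specific element to fix rather than "contrast is bad".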
Preview fidelity is pure web engineering
When AI-generated code runs in a sandbox iframe, every web platform quirk amplifies: relative URLs break, `<base>` is missing, CSP blocks inline styles, `target="_top"` kills navigation. Had to write a dedicated URL rewriter that injects `<base>` pointing to the correct sandbox route, rewrites absolute `/` links to relative, adds permissive CSP headers, and strips `target="_top"`. Not AI work. But without it, the preview is broken and users blame you, not the LLM.
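A minimal sketch of such a rewriter, with heavy caveats: it uses regexes instead of a real HTML parser, and `rewrite_preview_html`, `preview_headers`, the `/sandbox/p1/` route, and the CSP value are all hypothetical choices for the example. Generated markup is messy enough that a production version would likely need a proper parser.

```python
import re
from html import escape


def rewrite_preview_html(doc: str, sandbox_base: str) -> str:
    """Patch generated HTML so it behaves inside a sandbox iframe."""
    # 1. Strip target="_top" so links don't try to navigate the parent frame.
    doc = re.sub(r'''\s+target\s*=\s*(["'])_top\1''', "", doc,
                 flags=re.IGNORECASE)
    # 2. Rewrite root-absolute URLs ("/about") to relative ("./about") so
    #    they resolve against <base>. Protocol-relative "//host" is left alone.
    doc = re.sub(r'''\b(href|src)\s*=\s*(["'])/(?!/)''', r"\1=\2./", doc,
                 flags=re.IGNORECASE)
    # 3. Inject <base href> right after <head> unless one already exists.
    if not re.search(r"<base\b", doc, flags=re.IGNORECASE):
        base_tag = f'<base href="{escape(sandbox_base, quote=True)}">'
        doc, count = re.subn(
            r"(<head\b[^>]*>)",
            lambda m: m.group(1) + base_tag,
            doc, count=1, flags=re.IGNORECASE,
        )
        if count == 0:
            doc = base_tag + doc  # no <head>: prepend as a best effort
    return doc


def preview_headers() -> dict:
    # Assumed policy: permissive enough for inline styles in generated code.
    return {
        "Content-Security-Policy":
            "default-src 'self'; style-src 'self' 'unsafe-inline'"
    }
```

Usage would look something like `rewrite_preview_html(raw_html, "/sandbox/p1/")` on the response body, with `preview_headers()` merged into the response headers for the sandbox route.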
- 61,503 Python LOC, 22,997 TypeScript/TSX LOC
- 12 specialized agents, 5 quality gates
- 11 pipeline states, 34 valid transitions
- 6+ LLM providers with auto-failover
- 72 test files, MIT licensed
Repo: github.com/alexar76/aicom — FastAPI + Next.js + Docker Compose, self-hosted, MIT, BYO API keys.

Top comments (2)
the way you tackled the consistency problems with LLM failover is really insightful. it's crucial to ensure that output remains reliable across different providers. moonshift could help streamline your deployments - you can get a full next.js + postgres + auth app up in about 7 min, and you own the code on your github. if you're curious, I'd be happy to offer a free run to give it a spin.
Thanks, Harjot, glad the failover section worked :)
the hardest part wasn't the failover mechanism itself, but keeping the output semantically stable when a request seamlessly transitions to another provider mid-run. I normalized everything into a provider-agnostic schema before it reaches the state machine, plus deterministic retries on gate failures, so retrying with a different model doesn't break the pipeline.
and it's funny you mention Moonshift — I looked into it, and we're clearly working on systems in the same class: you've got 14 agents in a 10-stage workflow with a human-approval gate, while I've got 12 agents + 5 quality gates with an AI Director. I couldn't find how Moonshift handles this part, but my solution is multi-provider, which is exactly why fault tolerance matters so much
the main difference is positioning: this project is fully open-source and self-hosted, while Moonshift bundles the entire "launch day" (deployment, DNS, launch copy, hero images) into a managed SaaS — frankly, the part most developers skip.
so it's less a direct comparison and more two approaches to the same idea. thanks for the free-run offer, I'll definitely give it a try. curious how you organize the deploy and launch-kit phases. 👍