I spent the last month building an open-source (MIT) pipeline that takes a plain-language idea and runs it through 12 specialized agents — analyst, PM, architect, design critic, developer, QA, security, DevOps, marketing, and more — with 5 quality gates, a strict state machine with recovery, and an AI Director that autonomously manages the whole thing.
Think Bolt.new or Lovable, but self-hosted, MIT licensed, with quality gates that actually prevent the model from shipping broken stubs.
The interesting part isn't the LLM calls. Here's what broke in production.
LLM failover creates consistency problems
I have 6+ providers (DeepSeek, Anthropic, OpenAI, Ollama, Groq, etc.) with automatic health-check failover every 60s. The footgun: DeepSeek and Claude write different code. Same prompt, wildly different output structure. If the router switches providers mid-pipeline, the architect output (Claude) won't match what the developer agent (DeepSeek) expects.
Solution: task-level pinning. Heavy tasks (architect, developer) stay locked to the primary provider. Light tasks (marketing copy, naming) can fall back freely. I also added a model capability matrix check before routing — otherwise you get an architect running on a 7B local model producing garbage.
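The routing logic reduces to something like the sketch below (the task list, capability strings, and matrix here are illustrative, not the actual repo code):

```python
# Sketch of task-level pinning plus a capability check before routing.
# Task names, capability strings, and the matrix are illustrative.
from dataclasses import dataclass, field

@dataclass
class Provider:
    name: str
    healthy: bool = True
    capabilities: set[str] = field(default_factory=set)  # e.g. {"code", "long_context"}

PINNED_TASKS = {"architect", "developer"}  # heavy tasks: no silent provider switch
REQUIRED_CAPS = {"architect": {"code", "long_context"}, "developer": {"code"}}

def route(task: str, primary: Provider, fallbacks: list[Provider]) -> Provider:
    required = REQUIRED_CAPS.get(task, set())
    if task in PINNED_TASKS:
        # Pinned tasks fail loudly rather than drifting to another model family mid-pipeline.
        if primary.healthy and required <= primary.capabilities:
            return primary
        raise RuntimeError(f"primary provider unavailable for pinned task {task!r}")
    # Light tasks (marketing copy, naming) fall back to any healthy, capable provider.
    for provider in (primary, *fallbacks):
        if provider.healthy and required <= provider.capabilities:
            return provider
    raise RuntimeError(f"no capable provider for task {task!r}")
```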
State machines need to survive the model being wrong
11 states, 34 valid transitions, JSON + SQLite dual persistence. Sounds solid until the model writes a corrupted artifact that crashes the state machine on the next task load.
Had to add:
- Recovery fallback: if JSON parse fails, restore from the SQLite snapshot (sketch below)
- Stranded product recovery: products stuck in `pm_quality_fail` because the model hallucinated a non-existent file path
- Async saves with timeout guards, so a slow disk write doesn't block the pipeline
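The JSON-first, SQLite-fallback load path looks roughly like this; table and file names are assumptions for illustration, not the repo's actual schema:

```python
# Minimal sketch of the JSON-first / SQLite-fallback load path.
# Paths, table, and column names are assumptions, not the repo's actual schema.
import json
import sqlite3
from pathlib import Path

def load_product_state(product_id: str, json_dir: Path, db_path: Path) -> dict:
    json_path = json_dir / f"{product_id}.json"
    try:
        return json.loads(json_path.read_text())
    except (OSError, json.JSONDecodeError):
        # The model (or a partial write) corrupted the JSON artifact:
        # fall back to the last good SQLite snapshot.
        with sqlite3.connect(db_path) as conn:
            row = conn.execute(
                "SELECT state_json FROM product_snapshots "
                "WHERE product_id = ? ORDER BY created_at DESC LIMIT 1",
                (product_id,),
            ).fetchone()
        if row is None:
            raise RuntimeError(f"no recoverable state for product {product_id!r}")
        state = json.loads(row[0])
        # Re-materialize the JSON file so the next load takes the fast path.
        json_path.write_text(json.dumps(state, indent=2))
        return state
```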
The lesson: your state machine needs to survive both a wrong model AND a corrupted disk. Not theoretical — happened in production.
The Director AI feedback loop problem
The Director runs a 6-phase autonomous cycle: route chat → analyze metrics → generate decisions → apply actions → rank what to build next → log.
The footgun: feedback loops. Director generates a decision → applies it → next cycle reads its own output → generates another decision based on that → infinite loop. Had to add noop detection that breaks the cycle when decisions become empty.
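Roughly, the cycle with no-op detection looks like this; the phase functions are injected as callables purely for illustration and don't match the repo's API:

```python
# Sketch of the Director's 6-phase cycle with no-op detection.
# Phase callables and names are illustrative, not the repo's API.
from typing import Callable

def run_director(
    route_chat: Callable[[], None],
    analyze_metrics: Callable[[], dict],
    generate_decisions: Callable[[dict], list[dict]],
    apply_actions: Callable[[list[dict]], None],
    rank_backlog: Callable[[], None],
    log_cycle: Callable[[str], None],
    max_idle_cycles: int = 2,
) -> None:
    idle = 0
    while True:
        route_chat()                                        # phase 1: route owner chat
        decisions = generate_decisions(analyze_metrics())   # phases 2-3
        if not decisions:
            # No-op detection: the Director is only re-reading its own output,
            # so decisions dried up. Break instead of looping forever.
            idle += 1
            if idle >= max_idle_cycles:
                log_cycle("no-op cycle limit reached; stopping")
                return
            continue
        idle = 0
        apply_actions(decisions)                            # phase 4
        rank_backlog()                                      # phase 5
        log_cycle(f"applied {len(decisions)} decisions")    # phase 6
```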
The chat classification is also tricky. The Director classifies owner messages as `new_idea`, `product_feedback`, or `general_directive` via LLM. If it misclassifies "fix the login page" as `new_idea`, you get a duplicate product instead of a bug fix. I added an orphan feedback heuristic: if a message mentions a product name that doesn't exist yet, route it to `new_idea`; otherwise link it to the existing product.
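A rough sketch of that heuristic, layered on top of the LLM label (function and field names are hypothetical):

```python
# Sketch of the orphan-feedback heuristic layered on top of the LLM label.
# Field names and the return shape are hypothetical.
import re

def route_owner_message(text: str, llm_label: str, known_products: dict[str, str]) -> dict:
    """known_products maps lowercase product names to product ids."""
    lowered = text.lower()
    mentioned = [
        pid for name, pid in known_products.items()
        if re.search(rf"\b{re.escape(name)}\b", lowered)
    ]
    if mentioned:
        # Message references an existing product: link it as feedback,
        # even if the LLM said new_idea (avoids duplicate products).
        return {"type": "product_feedback", "product_id": mentioned[0]}
    if llm_label == "product_feedback":
        # Feedback naming a product we don't have yet ("orphan feedback"):
        # promote it to a new idea instead of dropping it.
        return {"type": "new_idea"}
    return {"type": llm_label}
```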
Quality gates — what I wish I'd built first
| Gate | What it checks |
|------|---------------|
| Demo quality | 12 checkpoints: contrast, CTA, broken links, spec coverage |
| Browser E2E | Playwright crawl (desktop + mobile), JS errors, 404s |
| Visual QA | 9 heuristics: contrast ratio, CSS vars, empty states, nav |
| Security | AST scan: eval(), innerHTML, exposed tokens, hardcoded secrets |
| Methodology | Domain packs: fintech, ecommerce, healthcare, etc. |
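To give a flavor of the security gate, here's a simplified stand-in for its checks: a Python `ast` walk for eval()/exec() plus a regex for hardcoded secrets. The real gate also covers JS patterns like innerHTML; the patterns below are examples only:

```python
# Simplified stand-in for the security gate's checks on generated Python:
# an ast walk for eval()/exec() plus a regex for hardcoded secrets.
import ast
import re

SECRET_RE = re.compile(
    r"(api[_-]?key|secret|token)\s*[:=]\s*['\"][A-Za-z0-9_\-]{16,}['\"]", re.I
)

def scan_python(source: str) -> list[str]:
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in {"eval", "exec"}):
            findings.append(f"line {node.lineno}: use of {node.func.id}()")
    for m in SECRET_RE.finditer(source):
        line = source.count("\n", 0, m.start()) + 1
        findings.append(f"line {line}: possible hardcoded secret")
    return findings
```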
Real example: visual QA flagged a white-on-white CTA button — the model generated `color: white` on `background: white`, assuming a dark theme that wasn't applied. The gate caught it, sent it back to the developer with the exact CSS selector. Fixed next cycle.
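The underlying heuristic is just the WCAG contrast-ratio math; a white-on-white pair scores 1.0, far below the 4.5:1 AA threshold. A minimal sketch (threshold and color handling simplified):

```python
# WCAG contrast-ratio math behind the check; color parsing is simplified.
def _linear(channel: int) -> float:
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def luminance(rgb: tuple[int, int, int]) -> float:
    r, g, b = (_linear(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> float:
    lighter, darker = sorted((luminance(fg), luminance(bg)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)

# White-on-white CTA: ratio 1.0, far below the 4.5:1 WCAG AA threshold.
assert contrast_ratio((255, 255, 255), (255, 255, 255)) < 4.5
```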
Preview fidelity is pure web engineering
When AI-generated code runs in a sandbox iframe, every web platform quirk amplifies: relative URLs break, the `<base>` tag is missing, CSP blocks inline styles, `target="_top"` kills navigation. Had to write a dedicated URL rewriter that injects a `<base href>` pointing to the correct sandbox route, rewrites absolute `/` links to relative, adds permissive CSP headers, and strips `target="_top"`. Not AI work. But without it, the preview is broken and users blame you, not the LLM.
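A minimal sketch of that rewriter, assuming BeautifulSoup for the HTML pass; function and route names are illustrative, and the permissive CSP headers are set on the HTTP response, not in this function:

```python
# Sketch of the sandbox URL rewriter, assuming BeautifulSoup; route handling
# and tag coverage are simplified. Permissive CSP headers are added to the
# HTTP response elsewhere, not here.
from bs4 import BeautifulSoup

def rewrite_for_sandbox(html: str, sandbox_base: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    head = soup.head
    if head is None:
        head = soup.new_tag("head")
        (soup.html or soup).insert(0, head)
    # Inject <base href> so relative URLs resolve against the sandbox route.
    if head.find("base") is None:
        head.insert(0, soup.new_tag("base", href=sandbox_base))
    # Root-absolute links would escape the sandbox; rewrite them to relative.
    for tag, attr in (("a", "href"), ("img", "src"), ("link", "href"), ("script", "src")):
        for el in soup.find_all(tag):
            value = el.get(attr)
            if value and value.startswith("/") and not value.startswith("//"):
                el[attr] = value.lstrip("/")
    # target="_top" breaks out of the iframe and kills navigation.
    for el in soup.find_all(attrs={"target": "_top"}):
        del el["target"]
    return str(soup)
```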
- 61,503 Python LOC, 22,997 TypeScript/TSX LOC
- 12 specialized agents, 5 quality gates
- 11 pipeline states, 34 valid transitions
- 6+ LLM providers with auto-failover
- 72 test files, MIT licensed
Repo: github.com/alexar76/aicom — FastAPI + Next.js + Docker Compose, self-hosted, MIT, BYO API keys.
