Harjot Singh

Posted on Jun 1

How we built a 14-agent pipeline that ships a deployed app + launch assets in ~7 minutes

#devtools

Most AI app builders stop at "deployed." You prompt, you get a repo, maybe a preview URL, and then the actual work starts: wiring a domain, writing the landing copy, cutting screenshots, drafting the launch thread. We wanted the pipeline to stop at "launched" instead, so we built one. This is how it works under the hood, including the parts that broke.

The product is Moonshift. One prompt triggers 14 specialized agents across 10 phases. Average run is ~7 minutes and ~$3 in API spend, with a hard $5 ceiling that aborts the run. Everything ships to your Vercel, your GitHub, your database. This post is the engineering, not the pitch.

The core problem: parallel agents drift

The naive version of "many agents build an app" falls apart fast. If a backend agent and a frontend agent both work from a vague English spec, they invent incompatible contracts. The backend returns { user_id }, the frontend reads userId, and you find out at runtime in production.

Our fix is a planner that emits a JSON contract first. Before any code is written, one agent produces a typed contract: routes, request/response shapes, table schemas, env vars, page list. That contract is the single source of truth. Backend, frontend, database, and test agents all build against it in parallel instead of against prose.
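To make "typed contract" concrete, here is a minimal sketch of what such an artifact could look like. The field names (`routes`, `tables`, `envVars`, `pages`) and the `isAppContract` guard are assumptions chosen to mirror the list above, not Moonshift's actual schema.

```typescript
// Hypothetical contract shape; names are illustrative only.
type FieldType = "string" | "number" | "boolean";
type Shape = Record<string, FieldType>;

interface RouteContract {
  method: "GET" | "POST" | "PUT" | "DELETE";
  path: string;
  request: Shape;
  response: Shape;
}

interface AppContract {
  routes: RouteContract[];
  tables: Record<string, Shape>;
  envVars: string[];
  pages: string[];
}

// An example of what a planner might emit for a tiny app.
const exampleContract: AppContract = {
  routes: [
    {
      method: "GET",
      path: "/api/me",
      request: {},
      response: { userId: "string", email: "string" },
    },
  ],
  tables: { users: { id: "string", email: "string" } },
  envVars: ["DATABASE_URL"],
  pages: ["/", "/dashboard"],
};

// Shallow structural check on parsed planner output. It only verifies
// the top-level keys; a real pipeline would likely validate every field.
function isAppContract(value: unknown): value is AppContract {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    Array.isArray(v.routes) &&
    typeof v.tables === "object" &&
    v.tables !== null &&
    Array.isArray(v.envVars) &&
    Array.isArray(v.pages)
  );
}
```

Because the contract is plain data, every downstream agent can receive the same JSON, and a model reply that comes back as prose fails the guard before any code is generated.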

Then a contract-validator agent runs after the parallel build and diffs the actual code against the contract. When the frontend's fetch shape doesn't match the backend's handler, the validator doesn't just flag it. It patches the mismatch. This one agent removed the largest single class of "looks done, 500s on click" failures we had.
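The diffing half of that step can be sketched as a comparison between the shape the contract promises and the shape the code actually uses. This is only an illustration: `diffShape` and `Drift` are hypothetical names, and extracting a shape from generated code (and patching it afterwards) is assumed to happen elsewhere.

```typescript
type FieldType = "string" | "number" | "boolean";
type Shape = Record<string, FieldType>;

interface Drift {
  field: string;
  kind: "missing" | "unexpected" | "type-mismatch";
}

// Compare the shape found in the code against the contract's shape.
function diffShape(expected: Shape, actual: Shape): Drift[] {
  const drift: Drift[] = [];
  for (const [field, type] of Object.entries(expected)) {
    if (!(field in actual)) {
      drift.push({ field, kind: "missing" });
    } else if (actual[field] !== type) {
      drift.push({ field, kind: "type-mismatch" });
    }
  }
  for (const field of Object.keys(actual)) {
    if (!(field in expected)) {
      drift.push({ field, kind: "unexpected" });
    }
  }
  return drift;
}

// The { user_id } vs userId mismatch described earlier:
const contractResponse: Shape = { userId: "string" };
const backendResponse: Shape = { user_id: "string" };
const drift = diffShape(contractResponse, backendResponse);
// drift is [{ field: "userId", kind: "missing" },
//           { field: "user_id", kind: "unexpected" }]
```

A missing/unexpected pair like this is the kind of signal a validator could hand to a patching step, since the contract already says which name is correct.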

The 10 phases

1. Plan - generate the JSON contract.
2. Scaffold - lay down the framework skeleton (Next.js, config, deny-globs that protect files agents shouldn't touch).
3. Backend - API routes and server logic against the contract.
4. Frontend - pages and components against the same contract.
5. Database - schema + migrations (Drizzle + Turso).
6. Validate - contract-validator reconciles the output of phases 3-5, which were built in parallel, and auto-fixes drift.
7. Test + fix - generated tests run; a fixer loop addresses failures.
8. Deploy - ships to your Vercel via your token.
9. Audit - security and a11y passes on the live deployment.
10. Market + publish - a marketer agent drafts X and LinkedIn launch posts in your voice, image-gen produces hero images, and a publisher gates everything behind your one-tap approval.

The interesting phases are 6, 7, and 10. Everyone has a code-gen step. Almost nobody has a reconcile step, a real fixer loop, or a phase whose only job is launch assets.
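The ordering constraint (sequential phases, with the build phases fanned out in parallel before validation) can be sketched in a few lines. `Agent`, `Stage`, `runStages`, and `stubAgent` are hypothetical names for illustration; the real orchestrator is presumably far more involved.

```typescript
interface Agent {
  name: string;
  run: () => Promise<void>;
}

// Agents in the same stage run concurrently; stages run in order.
type Stage = Agent[];

async function runStages(stages: Stage[]): Promise<void> {
  for (const stage of stages) {
    // Promise.all rejects as soon as any agent rejects,
    // so a failed stage prevents later stages from starting.
    await Promise.all(stage.map((agent) => agent.run()));
  }
}

// Stand-in for a real agent: waits, then records that it finished.
function stubAgent(name: string, log: string[], delayMs = 0): Agent {
  return {
    name,
    run: async () => {
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      log.push(name);
    },
  };
}

function buildStages(log: string[]): Stage[] {
  return [
    [stubAgent("plan", log)],
    [stubAgent("scaffold", log)],
    [
      stubAgent("backend", log, 15),
      stubAgent("frontend", log, 5),
      stubAgent("database", log, 1),
    ],
    [stubAgent("validate", log)],
  ];
}
```

Calling `runStages(buildStages(log))` always records `plan` and `scaffold` first and `validate` last, while the three build agents finish in whatever order their work completes.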

Reliability: not every failure is equal

Long multi-agent runs fail in boring ways: a rate limit, a flaky deploy, a model that returns prose where you asked for JSON. If you retry all of them the same way, you either give up too early on transient errors or burn money death-looping on deterministic ones.

We run a failure classifier that buckets every failure into transient, deterministic, or permanent:

transient (429s, network blips, stream idle) - retry with backoff.
deterministic (a test that fails the same way every time) - hand to a fixer agent, don't blindly retry the same call.
permanent (bad auth, missing token) - stop and surface it. No point spending more.
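A minimal sketch of such a classifier, assuming failures arrive as a tagged union. The `Failure` variants, the choice of 401/403 as "bad auth", and the backoff constants are illustrative assumptions; only the three buckets and their handling come from the description above.

```typescript
type FailureClass = "transient" | "deterministic" | "permanent";

// Hypothetical failure representation.
type Failure =
  | { kind: "http"; status: number }
  | { kind: "network" }
  | { kind: "stream-idle" }
  | { kind: "test"; testName: string; message: string }
  | { kind: "missing-token"; name: string };

function classify(failure: Failure): FailureClass {
  switch (failure.kind) {
    case "http":
      // Assumption: 401/403 stand in for "bad auth".
      if (failure.status === 401 || failure.status === 403) return "permanent";
      if (failure.status === 429) return "transient";
      // Assumption: other HTTP errors will repeat if retried unchanged.
      return "deterministic";
    case "network":
    case "stream-idle":
      return "transient";
    case "test":
      return "deterministic";
    case "missing-token":
      return "permanent";
  }
}

type Action = "retry-with-backoff" | "hand-to-fixer" | "abort";

function nextAction(cls: FailureClass): Action {
  if (cls === "transient") return "retry-with-backoff";
  if (cls === "deterministic") return "hand-to-fixer";
  return "abort";
}

// Exponential backoff with a cap; attempt is zero-based.
function backoffMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

The useful property is that the retry decision becomes a pure function of the failure, so it can be unit-tested without running a single agent.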

Retries are capped on three axes at once: per-phase, per-agent, and a global per-run ceiling. The global cap is what keeps a single bad run from quietly turning into a $40 bill. Combined with the hard $5 abort, the worst case is bounded and visible instead of a surprise invoice.
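One way to enforce three caps at once is a single counter object that every retry must pass through. This is a sketch under assumed names (`RetryBudget`, `RetryCaps`, `tryConsume`); the actual limits are not stated in the post, so the numbers in the usage example are placeholders.

```typescript
interface RetryCaps {
  perPhase: number;
  perAgent: number;
  perRun: number;
}

class RetryBudget {
  private readonly caps: RetryCaps;
  private readonly byPhase = new Map<string, number>();
  private readonly byAgent = new Map<string, number>();
  private runTotal = 0;

  constructor(caps: RetryCaps) {
    this.caps = caps;
  }

  // Records a retry and returns true only if all three caps allow it.
  tryConsume(phaseId: string, agentId: string): boolean {
    const phaseCount = this.byPhase.get(phaseId) ?? 0;
    const agentCount = this.byAgent.get(agentId) ?? 0;
    if (
      phaseCount >= this.caps.perPhase ||
      agentCount >= this.caps.perAgent ||
      this.runTotal >= this.caps.perRun
    ) {
      return false;
    }
    this.byPhase.set(phaseId, phaseCount + 1);
    this.byAgent.set(agentId, agentCount + 1);
    this.runTotal += 1;
    return true;
  }
}

// Placeholder limits, for illustration only.
const retries = new RetryBudget({ perPhase: 2, perAgent: 3, perRun: 4 });
```

Checking all three counters before incrementing any of them keeps a refused retry from being charged against the other two axes.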

A subtle one we hit: a long LLM stream can go idle without erroring (the upstream connection gets severed but the socket never closes). A naive loop waits forever. We added an idle watchdog in the agent loop so a silent stall is treated as a transient failure and retried, instead of hanging the whole run.
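A watchdog of this kind can be sketched as a wrapper around any async-iterable stream: each read races against a timer, and a timeout surfaces as an error the classifier can treat as transient. `withIdleWatchdog` and `StreamIdleError` are hypothetical names, and this is not necessarily how Moonshift's agent loop implements it.

```typescript
class StreamIdleError extends Error {
  constructor(idleMs: number) {
    super(`no chunk received for ${idleMs} ms`);
    this.name = "StreamIdleError";
  }
}

// Re-yields chunks from `source`, but throws StreamIdleError if the
// gap before the next chunk exceeds `idleMs`.
async function* withIdleWatchdog<T>(
  source: AsyncIterable<T>,
  idleMs: number,
): AsyncGenerator<T, void, undefined> {
  const iterator = source[Symbol.asyncIterator]();
  try {
    while (true) {
      let timer: ReturnType<typeof setTimeout> | undefined;
      const idle = new Promise<never>((_resolve, reject) => {
        timer = setTimeout(() => reject(new StreamIdleError(idleMs)), idleMs);
      });
      // Whichever settles first wins; the timer is cleared either way.
      const result = await Promise.race([iterator.next(), idle]).finally(() =>
        clearTimeout(timer),
      );
      if (result.done) return;
      yield result.value;
    }
  } finally {
    // Best-effort cleanup. Not awaited, because a stalled source
    // may never settle this promise.
    iterator.return?.()?.catch(() => {});
  }
}
```

The timer restarts on every chunk, so a slow but steady stream is left alone and only a silent gap longer than `idleMs` trips the watchdog.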

Design constraints that shaped everything

Your infra, zero lock-in. Code lands in your GitHub, the app deploys to your Vercel, the database is yours. Cancel the subscription and you keep a working product. This forced the deployer to operate purely through user-supplied tokens, which is more work than deploying to our own infra but is the entire point.
The publisher physically cannot post without you. Social publishing is gated per post, per platform, behind an explicit human tap. Autonomy ends at the point where it would speak as you in public. Generation is automatic; publishing is not.
Hard cost ceiling. $5/run, enforced mid-run, not reconciled after. Agents check remaining budget before expensive calls.
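The "check before you spend" rule for the cost ceiling can be sketched as a small budget object shared by all agents in a run. `RunBudget`, `assertAffordable`, and `record` are hypothetical names, and how a worst-case cost estimate is produced is left as an assumption. Amounts are in integer cents to avoid floating-point drift.

```typescript
class BudgetExceededError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "BudgetExceededError";
  }
}

class RunBudget {
  private spentCents = 0;
  private readonly ceilingCents: number;

  constructor(ceilingCents: number) {
    this.ceilingCents = ceilingCents;
  }

  remainingCents(): number {
    return Math.max(0, this.ceilingCents - this.spentCents);
  }

  // Call before an expensive request with a worst-case estimate.
  assertAffordable(estimatedCents: number): void {
    if (estimatedCents > this.remainingCents()) {
      throw new BudgetExceededError(
        `estimated ${estimatedCents}c exceeds remaining ${this.remainingCents()}c`,
      );
    }
  }

  // Call after the request with the actual billed amount.
  record(actualCents: number): void {
    this.spentCents += actualCents;
  }
}

// A $5.00 ceiling, matching the figure above.
const budget = new RunBudget(500);
budget.assertAffordable(300); // fine: 500c remaining
budget.record(300);
// budget.assertAffordable(250) would now throw: only 200c remain.
```

Throwing before the call, rather than reconciling afterwards, is what makes the ceiling a mid-run abort instead of a post-hoc report.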

Stack

Next.js for web, Drizzle + Turso (libSQL) for data, Playwright for browser automation in the marketing/audit phases. The orchestrator is a separate runtime from the web app, spawned per run from source so a fix ships without a full rebuild.

Honest lessons

A typed contract beats a smarter prompt. We spent weeks trying to make agents "just agree." Making them agree on a machine-checkable artifact was the actual fix.
A reconcile phase is worth more than a better code-gen model. Catching drift after the fact, cheaply, beat every attempt to prevent it perfectly up front.
Classify failures before you retry them. Uniform retry is how multi-agent systems burn money and still fail.
Bound the blast radius in money, not just time. A per-run dollar cap is the single most important guardrail in an autonomous pipeline that calls paid APIs in a loop.

If you want to see the output, the first run is free with your own API key at moonshift.io. Happy to answer architecture questions in the comments.

Top comments (4)

Nimesh Kulkarni • Jun 1

That is solid high level stuff.
Your project is awesome, but I looked at some of the preview websites listed on the landing page.. bro, it is still slop. The architecture is great, but real clients are more interested in ONE SHOT UI. For that you can use Skills; there are some, like:

Ui ux pro
Design.md (not skill but design inspiration)

So the LLM can do interesting generation.

What say?

Theo Valmis • Jun 2

14 agents is a lot of coordination surface. The number worth tracking over time is failure rate at each handoff: compounding even a 5% per-stage error rate across 14 stages produces a roughly 50% end-to-end failure rate. Curious how you're handling state when an intermediate agent fails.

xulingfeng • Jun 1

Contract-validator is the real insight here — most multi-agent pipelines skip the reconcile step and just hope. We hit the same drift problem with parallel Hermes subagents and ended up with a similar pattern (a "glue agent" that reconciles outputs against a shared spec before merging). How many fixer iterations does the test+fix phase typically need before converging?

Bob Oner • Jun 1

Thanks — this is a very sharp way to frame it.

For this version, I’m thinking about it purely as a browser-side navigation and rendering problem, not as model context control. The script does not remove server-side history, does not decide what ChatGPT sees, and does not call the API. It only reduces the visible surface area in the local browser so a long conversation is easier to scan and review.

But I agree with the deeper analogy: the rendered DOM and the model context window have a similar “too much stale information hurts usefulness” shape. In the browser, that shows up as scrolling, visual noise, and sometimes rendering cost. In the model context, it shows up as attention being spent on old or less relevant turns.

That distinction is important to me: trim what you render is a UI/navigation lever; bound what you send is an API/context-management lever. I deliberately kept this project on the first side because I wanted the privacy boundary to stay simple and reviewable.

If I ever explore the second side, I would treat it as a separate API-based project, probably around explicit context selection, summaries, and “what should actually be sent to the model” rather than DOM manipulation.
