Nilofer 🚀

Posted on May 14

SPEC-TO-SHIP: A Multi-Agent Pipeline That Turns Feature Ideas Into Production Code

#ai #multiagent #opensource #machinelearning

Getting a feature from spec to production involves many steps: architecture decisions, task planning, implementation, testing, and code review. In a real engineering team, these are handled by different people with different specializations. Most AI coding tools collapse all of that into a single step and ask one model to do everything.

SPEC-TO-SHIP takes a different approach. It orchestrates five specialized AI agents (Architect, Planner, Engineer, QA, and Reviewer) within a single Node.js process to simulate a complete startup engineering team workflow. Raw feature ideas go in. Committed, tested, reviewed code comes out.

The Five Agents

The pipeline follows a sequential flow where each agent's output informs the next, with a tight loop between Engineering and QA. Each agent has a defined role, a specific output format, and a clear handoff point - so no single agent is asked to do more than it is designed for.

ArchitectAgent - Senior Software Architect: The first agent in the pipeline. Takes the raw feature idea and generates a comprehensive technical specification covering Overview, Goals, API Contracts, Data Models, and Security sections. Output is a Markdown spec file that every downstream agent works from. Model: google/gemini-2.0-flash-001 via OpenRouter.

PlannerAgent - Staff Engineering Manager: Receives the spec from the Architect and breaks it into actionable, dependency-aware development tasks. Output is a JSON array of tasks with topological ordering and acceptance criteria - so the Engineer knows exactly what to build and in what order.
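
The dependency-aware ordering described for the Planner can be illustrated with a small topological sort. This is a minimal sketch, not the project's code: the `PlannedTask` shape and its field names (`id`, `dependsOn`, `acceptanceCriteria`) are assumptions, and the real tasks.json schema may differ.

```typescript
// Hypothetical task shape; the real tasks.json schema may differ.
interface PlannedTask {
  id: string;
  title: string;
  dependsOn: string[]; // ids of tasks that must be completed first
  acceptanceCriteria: string[];
}

// Kahn's algorithm: returns the tasks ordered so that every task comes
// after all of its dependencies. Throws on unknown ids or cycles.
function topologicalOrder(tasks: PlannedTask[]): PlannedTask[] {
  const byId = new Map<string, PlannedTask>();
  for (const t of tasks) byId.set(t.id, t);

  const remaining = new Map<string, number>(); // unmet dependency count per task
  const dependents = new Map<string, string[]>(); // dependency id -> tasks waiting on it
  for (const t of tasks) {
    remaining.set(t.id, t.dependsOn.length);
    for (const dep of t.dependsOn) {
      if (!byId.has(dep)) {
        throw new Error(`Task ${t.id} depends on unknown task ${dep}`);
      }
      const waiting = dependents.get(dep) ?? [];
      waiting.push(t.id);
      dependents.set(dep, waiting);
    }
  }

  // Start with tasks that have no dependencies at all.
  const ready = tasks.filter((t) => t.dependsOn.length === 0).map((t) => t.id);
  const ordered: PlannedTask[] = [];
  while (ready.length > 0) {
    const id = ready.shift() as string;
    ordered.push(byId.get(id) as PlannedTask);
    for (const next of dependents.get(id) ?? []) {
      const left = (remaining.get(next) ?? 0) - 1;
      remaining.set(next, left);
      if (left === 0) ready.push(next);
    }
  }

  if (ordered.length !== tasks.length) {
    throw new Error("Dependency cycle detected");
  }
  return ordered;
}
```

A check like this is also a cheap way to validate Planner output before handing it to the Engineer: a cycle or a reference to a missing task surfaces as an error instead of a stalled run.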

EngineerAgent - Principal Software Engineer: Takes each task from the Planner and implements production-grade TypeScript code for it. Output is source files with proper typing, error handling, and JSDoc. This is where the actual code gets written.

QAAgent - Senior QA Engineer: Receives the Engineer's output and writes exhaustive Vitest test suites for each task. Output is test files covering acceptance criteria and edge cases. The tight loop between Engineer and QA means the implementation is always tested before the Reviewer sees it.

ReviewerAgent - Principal Engineer Reviewer: The final stage. Audits everything the previous agents produced for security, performance, and correctness. Output is a score from 0 to 100 and an approval status that tells you whether the output is ready to ship.

Quality and Resilience

The pipeline is built for production reliability, not just happy-path execution. Several resilience patterns are built in at the infrastructure level so agent failures do not cascade into full pipeline failures:

Strict TypeScript - No any types allowed anywhere in the generated code.

Exponential Backoff - Retries on 429/529 errors at 1s, 2s, 4s, 8s, and 16s intervals. Rate limit hits do not kill the pipeline.

JSON Robustness - When an agent returns malformed JSON, the pipeline automatically retries with explicit instructions to fix the format rather than failing immediately.

Timeout - A hard 20-minute limit per pipeline run prevents runaway executions.
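
The backoff schedule listed above (1s, 2s, 4s, 8s, 16s on 429/529) can be sketched as follows. This is an illustration under assumptions, not the project's implementation: `withBackoff`, `backoffDelays`, and `isRetryable` are hypothetical names, and the sketch assumes the HTTP client attaches a numeric `status` property to the errors it throws.

```typescript
// Delay schedule: baseMs * 2^attempt, i.e. 1000, 2000, 4000, 8000, 16000 ms.
function backoffDelays(maxRetries = 5, baseMs = 1000): number[] {
  return Array.from({ length: maxRetries }, (_, attempt) => baseMs * 2 ** attempt);
}

// Assumption: thrown errors carry a numeric `status` (HTTP status code).
function isRetryable(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const status = (err as { status?: unknown }).status;
  return status === 429 || status === 529;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Runs `call`, retrying on 429/529 with the schedule above.
// Any other error, or a failure after the last delay, is rethrown.
async function withBackoff<T>(
  call: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  const delays = backoffDelays();
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const delay = delays[attempt];
      if (!isRetryable(err) || delay === undefined) throw err;
      await sleep(delay);
    }
  }
}
```

The `sleep` parameter is injectable only so the retry logic can be tested without waiting; in normal use the default timer-based sleep applies.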

Getting Started

Prerequisites

Node.js 20+
OpenRouter API Key

Installation
Clone the repository, then install dependencies:

npm install

Configure the environment. The only required variable is your OpenRouter API key - everything else has sensible defaults:

cp .env.example .env
# Edit .env and add your OPENROUTER_API_KEY

Configuration

The system uses envalid for robust configuration management:

OPENROUTER_API_KEY - required for LLM access
DEFAULT_MODEL - model ID used for LLM calls, default: google/gemini-2.0-flash-001
PORT - API server port, default: 3000
DB_PATH - SQLite database path, default: spec-to-ship.db
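
To make the validation and defaults above concrete, here is a dependency-free sketch of the behaviour the list implies. The project itself uses envalid for this; `loadConfig` and `AppConfig` are hypothetical names, and the port-range check is an assumption rather than something the article states.

```typescript
interface AppConfig {
  OPENROUTER_API_KEY: string;
  DEFAULT_MODEL: string;
  PORT: number;
  DB_PATH: string;
}

// Validates an environment-like record and fills in the documented defaults.
// Throws if the required API key is missing or PORT is not a valid port.
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const apiKey = env["OPENROUTER_API_KEY"];
  if (!apiKey) {
    throw new Error("OPENROUTER_API_KEY is required");
  }

  const rawPort = env["PORT"] ?? "3000";
  const port = Number(rawPort);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`PORT must be an integer between 1 and 65535, got "${rawPort}"`);
  }

  return {
    OPENROUTER_API_KEY: apiKey,
    DEFAULT_MODEL: env["DEFAULT_MODEL"] ?? "google/gemini-2.0-flash-001",
    PORT: port,
    DB_PATH: env["DB_PATH"] ?? "spec-to-ship.db",
  };
}
```

In a Node.js process this would be called once at startup with `process.env`, so a missing key fails fast before any agent runs.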

Usage

Terminal UI
Run the interactive CLI to start a new pipeline:

npm run start

The CLI uses Ink to show live status updates and stream tokens as each agent works through its stage, so each agent's output appears as it is generated rather than after the full run completes.

Industrial Dashboard
Start the API server:

npm run dev

Then open dashboard/index.html in your browser. The dashboard features an Industrial Command Center aesthetic with Dark Charcoal and Amber Glow styling and uses Server-Sent Events for real-time observability of the pipeline as it runs. This gives a visual view of the same pipeline that the CLI runs, useful for sharing progress with others or monitoring longer runs.
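
For readers unfamiliar with Server-Sent Events, the wire format is simple enough to sketch. The event shape below (`agent`, `type`, `payload`) is an assumption for illustration; the fields the real pipeline emits may differ.

```typescript
// Hypothetical event shape; the real pipeline's event fields may differ.
interface PipelineEvent {
  agent: string;
  type: string;
  payload: unknown;
}

// Serialises one event in the text/event-stream wire format:
// an "event:" line naming the event type, a "data:" line with the body,
// and a blank line that terminates the event.
function formatSseEvent(event: PipelineEvent): string {
  // JSON.stringify without an indent argument emits no raw newlines,
  // so the body always fits on a single "data:" line.
  const data = JSON.stringify({ agent: event.agent, payload: event.payload });
  return `event: ${event.type}\ndata: ${data}\n\n`;
}
```

A server would write these strings to a response sent with `Content-Type: text/event-stream`, and a browser page can subscribe with the standard `EventSource` API and `addEventListener` for each event type.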

Output Structure

Every pipeline run writes its artifacts to ./output/{runId}/. Each file maps directly to one agent's output, so you can inspect any stage independently:

spec.md : Architectural specification from the Architect agent. The source of truth every downstream agent works from.

tasks.json : Task breakdown from the Planner agent. The dependency-ordered list of what gets built.

src/ : Implementation code from the Engineer agent.

tests/ : Vitest tests from the QA agent.

review.md : Final review report from the Reviewer agent, including the score and approval status.

meta.json : Token usage, cost, and timing for the full run.

pipeline.log : NDJSON event log of the entire pipeline execution.

How I Built This Using NEO

This project was built using NEO. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.

The idea was a multi-agent pipeline that mirrors how a real engineering team works: each role specialized, each handoff structured, and the whole thing running autonomously from a feature idea through to reviewed, committed code. The requirements included five distinct agent roles with clear responsibilities, a sequential handoff structure with a QA loop, production-grade TypeScript output, real-time observability via both a CLI and a web dashboard, and resilience patterns like exponential backoff and JSON retry logic.

NEO built the full system: the five agent implementations with their respective prompts and output schemas, the pipeline orchestration layer coordinating sequential handoffs, the Ink-based CLI with real-time token streaming, the Node.js API server with SSE for dashboard observability, the Industrial Command Center dashboard in HTML, the SQLite-backed database, the artifact output structure, and the envalid configuration layer.

How You Can Use and Extend This With NEO

Use it to go from idea to working code in a single command.
Write a feature description, run the pipeline, and get a complete implementation with architecture docs, TypeScript source, Vitest tests, and a reviewer score without manually coordinating any of the steps. The five-agent structure ensures each stage is handled by a role optimised for that specific task.

Use the reviewer score as a quality gate.
The ReviewerAgent scores every run from 0 to 100 across security, performance, and correctness. Teams can use this score as a threshold before accepting generated code - only promoting runs that clear a minimum score into the codebase.
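
A score threshold like this takes only a few lines to enforce. The sketch below assumes a hypothetical `ReviewResult` shape; the article states only that the Reviewer produces a 0 to 100 score and an approval status, not how they are stored.

```typescript
// Hypothetical result shape: only the 0-100 score and the approval
// status are described in the article; the field names are assumptions.
interface ReviewResult {
  score: number;
  approved: boolean;
}

// Returns true only when the Reviewer approved the run AND the score
// meets the team's minimum. Rejects scores outside the 0-100 range.
function passesQualityGate(review: ReviewResult, minScore: number): boolean {
  if (review.score < 0 || review.score > 100) {
    throw new RangeError(`Score out of range: ${review.score}`);
  }
  return review.approved && review.score >= minScore;
}
```

Requiring both conditions is a design choice: a high score on a run the Reviewer did not approve still fails the gate.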

Use the NDJSON event log for pipeline observability.
Every run writes a structured pipeline.log in NDJSON format. This can be parsed by any log processing tool to track pipeline performance, token costs, and approval rates across runs over time.
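
As a rough illustration of how little tooling NDJSON needs, the sketch below parses log text and counts events per agent. The `agent` field is an assumption about the event schema, and both function names are hypothetical.

```typescript
// Parses NDJSON text: one JSON value per non-empty line.
function parseNdjson(text: string): unknown[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as unknown);
}

// Counts events per agent. Assumes each event may carry a string `agent`
// field (hypothetical); events without one are skipped.
function countByAgent(events: unknown[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const event of events) {
    if (typeof event !== "object" || event === null) continue;
    const agent = (event as { agent?: unknown }).agent;
    if (typeof agent !== "string") continue;
    counts[agent] = (counts[agent] ?? 0) + 1;
  }
  return counts;
}
```

In practice the input text would come from reading a run's pipeline.log, for example with `readFileSync` from `node:fs`.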

Extend it with additional agent roles.
The five-agent structure is sequential and modular. A new agent that receives the previous stage's output and produces its own artifact can be added without restructuring the existing pipeline - the handoff pattern is already established for each stage.
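
The handoff pattern can be sketched as a list of stages that each read the accumulated artifacts and add their own. This is a simplified model under assumptions: `Stage`, `Artifacts`, `runPipeline`, and the `DocsAgent` example are all hypothetical, the stage is stubbed with no LLM call, and the Engineer/QA loop is left out.

```typescript
// Hypothetical contract: artifacts are keyed by file name (e.g. "spec.md").
type Artifacts = Record<string, string>;

interface Stage {
  name: string;
  // Receives everything produced so far and returns it with its own artifact added.
  run(artifacts: Artifacts): Promise<Artifacts>;
}

// Runs the stages strictly in order, passing each stage's output to the next.
async function runPipeline(stages: Stage[], idea: string): Promise<Artifacts> {
  let artifacts: Artifacts = { idea };
  for (const stage of stages) {
    artifacts = await stage.run(artifacts);
  }
  return artifacts;
}

// Example of a new stage that could be appended after the existing ones.
// Stubbed for illustration: a real agent would call the LLM here.
const docsStage: Stage = {
  name: "DocsAgent",
  async run(artifacts) {
    const spec = artifacts["spec.md"] ?? "";
    return { ...artifacts, "docs.md": `Docs derived from spec (${spec.length} chars)` };
  },
};
```

Under this model, adding a role means appending one object to the stage list; no existing stage needs to know the new one exists.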

Final Notes

SPEC-TO-SHIP compresses the gap between a feature idea and production-ready code by distributing the work across five specialized agents, each focused on what it does best. Architecture, planning, implementation, testing, and review are all coordinated automatically, with structured handoffs and resilience built in at every stage.

The code is at https://github.com/dakshjain-1616/Spec-To-Ship
You can also build with NEO in your IDE using the VS Code extension or Cursor.
You can use NEO MCP with Claude Code: https://heyneo.com/claude-code

Top comments (2)

Whatsonyourmind • May 19

The five-agent sequential structure is a smart simulation of real teams; one extension that pays off quickly once you start running this at any scale:

Make per-agent model selection dynamic, not config-fixed. Right now DEFAULT_MODEL=google/gemini-2.0-flash-001 runs every agent on the same tier. That's overkill on QA (writing Vitest from a passing spec is mechanical) and under-spec on Architect (API contracts + data models + security in one prompt is where the run quality is decided). Per-agent overrides via env give the obvious linear improvement (ARCHITECT_MODEL=anthropic/claude-opus-4, QA_MODEL=meta-llama/llama-3.3-8b), but the bigger win is per-task dynamic routing:

Architect on a high-complexity feature → strong reasoning model.
Architect on a small bug fix → mid-tier.
Engineer for boilerplate scaffold → small model.
Engineer for state-machine logic → strong model.

A task_complexity_estimator step before each agent (token count of prior artifacts + feature flags from the spec) feeding a routing decision lets you spend 3-5x less per pipeline run on simple tasks while keeping the hard ones on strong models. The QA/Engineer tight loop is where this compounds — if QA needs 3 rounds it's a hard task, escalate Engineer to a stronger model for round 4+.

Two other things from running similar pipelines:

The Engineer↔QA loop is unbounded. A hard cap on retries (3 rounds, then escalate to human with the failure diff) prevents the runaway-tokens case. The 20-min global timeout is too coarse.
Pre-run cost estimate. meta.json captures cost post-hoc. A pre-run estimate per stage (input tokens × model rate + max output × model rate, summed across agents) lets you decide whether to start a large run. Especially useful when the Reviewer score gate means a run could fail at 90% completion with full cost spent.

For the routing decision specifically (which model for which agent for which task), I packaged UCB1/Thompson Sampling/LinUCB and a few static-policy variants as an MCP server in case it's useful as a drop-in: github.com/Whatsonyourmind/oraclaw. Different scope from SPEC-TO-SHIP but the picker primitive is the same shape.

Harjot Singh • May 31

Spec-to-ship as a multi-agent pipeline is a great way to frame the real opportunity, because the win isn't a single clever code-gen agent, it's the deterministic pipeline that carries a feature idea through defined stages (spec, plan, implement, test, review) with structure between them. The spec is doing the heavy lifting here, a clear, machine-usable spec is what keeps the downstream agents on track, because most agent coding failures trace back to ambiguous intent, so investing in the spec stage pays off everywhere after it. The stages I'd guard hardest are the verification ones: a test/review agent that actually checks the implementation against the spec is what turns plausible code into shippable code, and the irreversible step (merge/deploy) is where a human gate belongs, the pipeline does the breadth, the human owns the ship decision. The thing that makes this production rather than demo is what happens at the seams when a stage produces something wrong, does the next stage catch it (validation, the review agent) or does the bad output flow to production. Deterministic stages, a strong spec, verification between them, and a gate on ship. That structure-the-pipeline-and-verify-each-stage instinct is core to how I think about Moonshift. In your pipeline, does the review/test stage gate progression (must pass to continue), or run advisory alongside?