Martin Sikora

Posted on • Originally published at forge.reumbra.com

I ran 4 AI agents on my backlog and went for coffee

A few weeks ago I picked 4 features from my backlog, typed one command, and walked away from my desk.

I made coffee. I did laundry. I checked on the build about 40 minutes later.

Three features were done - branches created, tests written, code implemented, PRs ready for review. The fourth was still in progress, working through a tricky edge case in the payment integration.

No merge conflicts. No "which file did agent 2 break?" debugging. Each agent worked in its own isolated worktree, on its own branch, completely unaware of the others.

This is what I spent the last year building. And it almost didn't happen.

Three months before the coffee moment

Rewind to early 2025. I'm building a B2B platform - NestJS monorepo, TypeScript, the usual stack. I use Claude Code for everything. And every single session starts the same way.

"Here's how my guards work. Here's the interceptor chain. Here's why the repository pattern matters. No, don't put business logic in the controller."

Twenty minutes. Every. Session.

I wrote a CLAUDE.md file. Helped for a week. Then my codebase evolved and the document didn't. The AI would read stale rules and generate code that didn't match the current architecture.

So I built a tool that audits your repo automatically - framework, layers, naming conventions, database patterns, auth setup - and generates a .claude/ directory with rules derived from your actual code. Not a template. Your code, your patterns, your conventions.

The first time I ran it, the AI went from "explain your guards again" to just... following the patterns. No more 20-minute warmups.
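To make the audit idea concrete, here is the kind of check such a tool might start with. This is an illustrative sketch, not Forge's actual implementation; the marker table and function name are my own.

```python
import json

# Hypothetical marker table: which dependency implies which framework.
FRAMEWORK_MARKERS = {
    "@nestjs/core": "nestjs",
    "next": "nextjs",
    "express": "express",
}

def detect_frameworks(package_json_text: str) -> list[str]:
    """Infer frameworks from a package.json's dependency lists."""
    pkg = json.loads(package_json_text)
    deps = {**pkg.get("dependencies", {}), **pkg.get("devDependencies", {})}
    return sorted({FRAMEWORK_MARKERS[d] for d in deps if d in FRAMEWORK_MARKERS})
```

The real audit goes further - layers, naming, database patterns, auth - but the principle is the same: derive rules from what's actually in the repo, not from a template.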

That was the first piece of what became Forge DevKit.

Then I caught my AI lying about tests

With architecture rules in place, I asked Claude to add a payment webhook handler. Code looked great. Tests passed. 100% coverage.

Then I read the tests.

Every assertion was testing mock data. The coverage number was real. The verification was fake. The AI had optimized for "green checkmark" instead of "proves the webhook actually updates order status."

I added a rule: tests must trace back to acceptance criteria (AC) - the actual requirements, not the code. If an AC says "webhook updates order status," there must be a test that verifies that exact thing.
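A minimal sketch of what AC-to-test traceability can look like - the decorator and registry here are my own illustration, not Forge's format:

```python
# Each test declares which acceptance criterion it verifies; a coverage
# check then flags any criterion with no test behind it.
ACCEPTANCE_CRITERIA = {
    "AC-1": "webhook updates order status",
    "AC-2": "invalid signature is rejected",
}

TEST_REGISTRY: dict[str, list[str]] = {}

def covers(ac_id: str):
    """Decorator: tag a test with the AC it proves."""
    def wrap(fn):
        TEST_REGISTRY.setdefault(ac_id, []).append(fn.__name__)
        return fn
    return wrap

@covers("AC-1")
def test_webhook_marks_order_paid():
    ...  # real assertion against the order, not against a mock

def uncovered_criteria() -> list[str]:
    return [ac for ac in ACCEPTANCE_CRITERIA if ac not in TEST_REGISTRY]
```

Here `uncovered_criteria()` would report `AC-2` - exactly the kind of gap the rule is meant to block.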

The AI started arguing. Not literally, but it generated explanations: "The type system already covers this." "This is an implementation detail." Each one sounded perfectly reasonable.

It was rationalizing. Producing smart-sounding arguments for why it should skip work.

I cataloged these patterns. Found 50+ distinct rationalizations over three months. Built detection for them. When the AI tries to skip a test mapped to a requirement, it gets blocked. It can't argue its way out.
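Detection can start as something as blunt as phrase matching. The patterns below are illustrative examples, not the actual catalog:

```python
# A few example rationalization patterns; a real catalog would be larger
# and more nuanced than plain substring matching.
RATIONALIZATION_PATTERNS = [
    "the type system already covers",
    "implementation detail",
    "covered by integration tests",
    "trivial enough to verify manually",
]

def flag_rationalizations(explanation: str) -> list[str]:
    """Return the skip-work patterns found in an AI's explanation."""
    text = explanation.lower()
    return [p for p in RATIONALIZATION_PATTERNS if p in text]
```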

The first time it caught a rationalization I was about to approve myself - that's when I knew this was more than a side project.

From rules to a pipeline to Agent Teams

Rules weren't enough. I needed a full process with gates that block bad output at every step.

So I built a 5-phase TDD pipeline. When you run /forge:dev "Add payment webhook":

  1. Gate - creates a branch, loads acceptance criteria, links your task tracker
  2. Test (RED) - generates tests from requirements, not from code
  3. Implement (GREEN) - writes code to make those tests pass
  4. Verify - type check, lint, quality patterns, requirement coverage check
  5. Close - creates PR, updates tracker

The pipeline runs off specs I already wrote - acceptance criteria, architecture rules, quality gates. It doesn't ask me "should I write this test?" It reads the requirements, generates tests, writes code, and only stops when information is genuinely missing. Not hand-holding - actual autonomy with guardrails.
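The gating logic itself is simple in principle. Here's a toy sketch of the idea - phase names mirror the list above, but the bodies and data shapes are invented for illustration:

```python
class GateFailed(Exception):
    """Raised by a phase when its gate check does not pass."""

def run_pipeline(phases, context):
    """Run (name, fn) phases in order; stop at the first failed gate."""
    completed = []
    for name, phase in phases:
        try:
            phase(context)
        except GateFailed as exc:
            return {"completed": completed, "blocked_at": name, "reason": str(exc)}
        completed.append(name)
    return {"completed": completed, "blocked_at": None}

def verify(ctx):
    # Requirement-coverage check: any AC without a test blocks the pipeline.
    if ctx.get("uncovered_acs"):
        raise GateFailed(f"ACs without tests: {ctx['uncovered_acs']}")

phases = [
    ("gate", lambda ctx: None),
    ("test", lambda ctx: None),
    ("implement", lambda ctx: None),
    ("verify", verify),
    ("close", lambda ctx: None),
]
```

A run with an uncovered AC stops at `verify` and never reaches `close` - the PR simply doesn't get created.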

Once I had a pipeline that reliably produced shippable code from a single command... the next question was obvious.

What if I ran multiple pipelines at the same time?

That's the autopilot module. You point it at your backlog, it analyzes which features can run in parallel without file conflicts, groups them into waves, and spawns an AI agent for each feature - each in its own isolated git worktree with its own branch and its own ports.

The agents don't know about each other. They can't step on each other's code. When they're done, each one has a clean PR ready for your review.
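The core of the scheduling problem is simpler than it sounds: features that touch disjoint file sets can run together. A greedy sketch of wave packing - my own simplification, since real conflict analysis also has to weigh risky partial overlaps and dependency chains:

```python
def plan_waves(features: dict[str, set[str]]) -> list[list[str]]:
    """Pack features into waves so no two features in the same
    wave touch any of the same files."""
    waves: list[tuple[list[str], set[str]]] = []
    for name, files in features.items():
        for names, used in waves:
            if not (files & used):  # no shared files -> safe to parallelize
                names.append(name)
                used |= files
                break
        else:  # conflicts with every existing wave -> open a new one
            waves.append(([name], set(files)))
    return [names for names, _ in waves]
```

Two features editing the same model file land in different waves; everything else parallelizes freely.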

That's how I ended up drinking coffee while 4 agents worked my backlog.

Here's what that actually looks like - a real autopilot session, 9 batches, 39 features:

Autopilot: 39 features, 9 batches, 20 agents, 0 conflicts (model: claude-sonnet-4-5-20250514)
You

/forge:autopilot

Agent

Autopilot - Batch Autonomous Development

Reading backlog and product artifacts...

BACKLOG.md: 39 tasks across P0-P3
PRODUCT-BACKLOG.md: 5 epics, 43 acceptance criteria
Architecture: Python/FastAPI backend + React/TanStack frontend

Starting Phase 0: Triage.

Agent

Step 0.1: Feature Collection

| Priority | Features | Tags |
|----------|----------|------|
| P0 (critical) | 4 | backend, frontend |
| P1 (high) | 15 | backend, frontend, fullstack |
| P2 (medium) | 12 | backend, frontend |
| P3 (low) | 8 | backend |

39 features collected. Launching classification + conflict analysis in parallel.

Agent

Step 0.2 + 0.3: Running in parallel

Spawned 2 analysis agents:
- backlog-analyst - classifying all 39 features by complexity, risk, design needs
- conflict-detector - analyzing file overlaps across all feature pairs

741 file-pair combinations being checked.

Agent

backlog-analyst complete:

| Complexity | Count | Examples |
|------------|-------|----------|
| Low | 11 | seed data, CRUD endpoints, simple UI |
| Medium | 18 | chat markdown, theme presets, tracker sync |
| High | 10 | workflow engine, auth system, virtual scroll |

Design artifacts needed: 3 features (theme engine, workflow advanced, auth).

Agent

conflict-detector complete:

Conflict Matrix Summary

  • 682 SAFE pairs (no shared files)
  • 53 RISKY pairs (shared files, manageable)
  • 6 BLOCKING pairs (same core module, must be sequential)

Key blocking chains:
1. Core models -> Seed -> API endpoints (foundation)
2. Design tokens -> Theme engine (frontend)
3. Workflow core -> Workflow templates -> Marketplace (backend)

These blocking chains define the batch ordering.

Agent

Step 0.5: Batch Planning

Based on conflict graph + dependency chains:

| Batch | Theme | Features | Strategy | Dependencies |
|-------|-------|----------|----------|--------------|
| 001 | Core domain models | 4 | 2 parallel | none |
| 002 | Seed data pipeline | 3 | 1 sequential | batch 001 |
| 003 | API endpoints v1 | 5 | 3 parallel | batch 002 |
| 004 | Domain nodes + Tracker | 3 | 2 parallel | batch 002 |
| 005 | Frontend foundations (P0) | 4 | 2 parallel | batch 003 |
| 006 | Frontend polish + design | 7 | 2 waves (3+1) | batch 005 |
| 007 | Workflow engine | 3 | 1 sequential | batch 003 |
| 008 | Agent tools | 5 | 2 parallel | batch 003 |
| 009 | Workflow advanced | 5 | 2 waves (2+1) | batch 007 |

9 batches, 39 features, 20 teammate instances planned.

Batches 003/004 can run in parallel. Batches 007/008 can run in parallel.

Agent

Step 0.7: Autonomy Gate

| Check | Status |
|-------|--------|
| All features have AC | 36/39 (3 inferred from code) |
| Conflict resolution plan | all BLOCKING pairs in same teammate |
| Worktree isolation | each teammate gets own branch |
| Rollback strategy | each batch independently revertible |

Autonomy: full-auto approved.

Generating teammate briefs for all 9 batches... done.

Starting execution. You can walk away now.

You

going for coffee

Agent

Batch 001: Core Domain Models

Spawning 2 teammates in isolated worktrees.

| Teammate | Features | Model |
|----------|----------|-------|
| A | User + Organization models | sonnet |
| B | Crew + Node models | sonnet |

You

<teammate-message teammate_id="teammate-a" color="blue" summary="User + Org models: 2 models, 2 migrations, membership system">
Done. User + Organization with membership roles. 2 Alembic migrations.
</teammate-message>

You

<teammate-message teammate_id="teammate-b" color="green" summary="Crew + Node models: categories, knowledge packs, 2 migrations">
Done. Crew with roles, Node with categories + knowledge packs. 2 migrations.
</teammate-message>

Agent

Batch 001 complete. 4 features, 15 min, 0 conflicts. 42 tests passing. Starting 002.

Agent

Batch 002: Seed Data Pipeline

1 teammate, sequential chain: categories -> knowledge packs -> node definitions.

You

<teammate-message teammate_id="teammate-a" color="blue" summary="Seed pipeline: 8 categories, 12 knowledge packs, 28 nodes">
Done. Full seed: 8 categories, 12 knowledge packs, 28 node definitions.
</teammate-message>

Agent

Batch 002 complete. 3 features, 18 min. GET /api/nodes/ returns 28 nodes. 67 tests. Starting 003 + 004 in parallel.

Agent

Batch 003 + 004: Running in Parallel

Batch 003 - API Endpoints: 3 teammates (Crews+Nodes CRUD, Users+Orgs CRUD, Orchestrator)
Batch 004 - Domain Nodes: 2 teammates (Product+QA nodes, Tracker integration)

5 teammates working simultaneously across 2 batches.

You

<teammate-message teammate_id="batch003-c" color="yellow" summary="Orchestrator: run/stop/status, WebSocket streaming">
Batch 003 Teammate C done. Orchestrator endpoints + WebSocket real-time streaming.
</teammate-message>

You

<teammate-message teammate_id="batch004-b" color="green" summary="Tracker: config model, GitHub/Linear/ClickUp sync">
Batch 004 Teammate B done. TrackerConfig + external sync (GitHub, Linear, ClickUp).
</teammate-message>

You

<teammate-message teammate_id="batch003-a" color="blue" summary="Crew + Node CRUD with filtering, pagination, bulk ops">
Batch 003 Teammate A done. Full CRUD with filtering, pagination, bulk operations.
</teammate-message>

You

<teammate-message teammate_id="batch003-b" color="green" summary="User + Org CRUD, membership, invites">
Batch 003 Teammate B done. User + Org management with roles and invite system.
</teammate-message>

You

<teammate-message teammate_id="batch004-a" color="blue" summary="Product + QA nodes: 9 definitions, 5 workflows">
Batch 004 Teammate A done. 5 Product nodes + 4 QA nodes with workflows.
</teammate-message>

Agent

Batch 003 + 004 complete. 8 features, 23 min, 5 teammates parallel, 0 conflicts. 128 tests. Starting 005.

Agent

Batch 005: Frontend Foundations (P0)

2 teammates: app shell+routing+auth (opus) | chat interface+messages (sonnet).

You

<teammate-message teammate_id="teammate-a" color="blue" summary="App shell: routing, auth flow, sidebar, responsive, dark mode base">
Done. TanStack Router, auth flow, sidebar nav, responsive layout, dark mode.
</teammate-message>

You

<teammate-message teammate_id="teammate-b" color="green" summary="Chat: message list, rich input, crew panel, WebSocket updates">
Done. Chat interface with message list, command input, crew panel, real-time WebSocket.
</teammate-message>

Agent

Batch 005 complete. 4 features, 30 min. Frontend live. 168 tests. Starting 006 (2-wave).

Agent

Batch 006: Frontend Polish + Design System (2 waves)

Wave 1: 3 parallel
| Teammate | Features |
|----------|----------|
| A (opus) | Design System: type scale -> colors -> variants |
| B | Chat markdown + virtual scroll |
| C | Animated crew statuses |

Wave 2 (after tokens land): Theme engine + 6 presets

You

<teammate-message teammate_id="teammate-c" color="yellow" summary="Animated statuses: 4 states, reduced-motion, 220 tests">
Done. 4 animation states, prefers-reduced-motion support. 220 tests.
</teammate-message>

You

<teammate-message teammate_id="teammate-a" color="blue" summary="Design system: rem types, semantic colors, variants - 74 files migrated">
Done. Type scale (32 files), color tokens (24 files), component variants (18 components). Full foundation.
</teammate-message>

You

<teammate-message teammate_id="teammate-b" color="green" summary="Markdown rendering + virtual scroll for 10K messages">
Done. Syntax highlighting + virtual scroll handling 10K messages.
</teammate-message>

Agent

Wave 1 complete. Design tokens in main. Launching Wave 2: Theme Engine.

You

<teammate-message teammate_id="teammate-d" color="red" summary="Theme engine + 6 presets, WCAG AA verified">
Done. Runtime theme switching, 6 presets (Light/Dark/Ocean/Forest/Sunset/Midnight). All WCAG AA verified.
</teammate-message>

Agent

Batch 006 complete. 7 features, 2 waves, 4 teammates, 129 files, 0 conflicts. 248 tests. Starting 007 + 008 in parallel.

Agent

Batch 007 + 008: Running in Parallel

Batch 007 - Workflow Engine: 1 teammate, sequential (core -> steps -> conditionals)
Batch 008 - Agent Tools: 2 teammates (registry chain | HTTP+search)

3 teammates across 2 batches.

You

<teammate-message teammate_id="batch008-b" color="green" summary="HTTP tool + Catalog search complete">
Batch 008 Teammate B done. HTTP tool (retry/auth) + Catalog full-text search.
</teammate-message>

You

<teammate-message teammate_id="batch008-a" color="blue" summary="ToolRegistry + Function calling + Permissions - full chain">
Batch 008 Teammate A done. ToolRegistry (7 tools), Function Calling (streaming), Permissions (sandbox).
</teammate-message>

You

<teammate-message teammate_id="batch007-a" color="blue" summary="Workflow engine: DAG execution, parallel steps, conditionals, retry">
Batch 007 done. DAG execution, parallel steps, conditional branching, retry policies, history with rollback.
</teammate-message>

Agent

Batch 007 + 008 complete. 8 features, 25 min, 3 teammates, 0 conflicts. 296 tests. Starting final batch 009.

Agent

Batch 009: Workflow Advanced (final batch, 2 waves)

Wave 1: Templates + Versioning + Analytics (2 parallel)
Wave 2: Marketplace + Import/Export (depends on templates)

You

<teammate-message teammate_id="teammate-b" color="green" summary="Execution analytics: timing, success rates, bottleneck detection">
Done. Per-step timing, success/failure rates, bottleneck detection.
</teammate-message>

You

<teammate-message teammate_id="teammate-a" color="blue" summary="Templates + versioning with diff view and rollback">
Done. Template system + semantic versioning with diff view.
</teammate-message>

Agent

Wave 1 complete. Launching Wave 2: Marketplace.

You

<teammate-message teammate_id="teammate-c" color="yellow" summary="Marketplace: publish, discover, install, rate workflows">
Done. Publish workflows as templates, category browsing, one-click install, community ratings.
</teammate-message>

Agent

ALL 9 BATCHES COMPLETE

| Batch | Theme | Features | Teammates | Time | Conflicts |
|-------|-------|----------|-----------|------|-----------|
| 001 | Core models | 4 | 2 parallel | 15m | 0 |
| 002 | Seed pipeline | 3 | 1 sequential | 18m | 0 |
| 003 | API endpoints | 5 | 3 parallel | 18m | 0 |
| 004 | Domain nodes | 3 | 2 parallel | 20m | 0 |
| 005 | Frontend P0 | 4 | 2 parallel | 30m | 0 |
| 006 | Frontend polish | 7 | 4 (2 waves) | 48m | 0 |
| 007 | Workflow engine | 3 | 1 sequential | 25m | 0 |
| 008 | Agent tools | 5 | 2 parallel | 24m | 0 |
| 009 | Workflow advanced | 5 | 3 (2 waves) | 37m | 0 |

Final Numbers

  • 39 features shipped
  • 9 batches, 4 with multi-wave execution
  • 20 teammate instances spawned total
  • 312 tests, all passing
  • 0 merge conflicts across entire run
  • 0 manual interventions
  • Total autonomous time: ~4 hours
  • Human input: 1 command to start, 1 "going for coffee"

Welcome back. Your project has a backend, frontend, design system, theme engine, workflow engine, and agent tools. All tested.

You

whats next

Agent

Backlog clear through P3. Remaining:

P4 - Infrastructure (13 features)

  • Auth + RBAC + Multi-tenant (high, blocks prod)
  • CI/CD pipeline (medium)
  • Monitoring + Logging (medium)
  • Performance optimization (low)

Ready when you are.

The decision that felt wrong

Early on I made a risky call: make the tool disposable.

After Forge runs setup and generates your .claude/ artifacts - architecture rules, quality patterns, dev-skills - you can uninstall Forge entirely. The generated files work on their own. No plugin required.

No vendor lock-in. No runtime dependency. You pay once, you own the output forever.

Why would I make it easy for people to stop using my product? Because forcing dependency is exactly what I hate about most dev tools. And the architecture audit alone - the thing that kills the 20-minute warmup - is the Starter tier. EUR 29, one-time. Not a subscription.

The autopilot (the coffee moment) is the Complete tier at EUR 149. But you don't start there. You start by fixing context rot, then add the test pipeline, then grow into Agent Teams when you're ready.

14-day money-back guarantee on all tiers.

Works with Claude Code, Cursor, and any AI agent that reads project files.


Try the interactive demo (no signup): forge.reumbra.com/docs/interactive-guide

I'm launching Forge on Product Hunt tomorrow. I'll be writing more about how each piece works in this series - architecture audit, rationalization detection, the TDD pipeline, and Agent Teams.

What's the worst thing your AI agent has tried to get away with?
