A few weeks ago I picked 4 features from my backlog, typed one command, and walked away from my desk.
I made coffee. I did laundry. I checked on the build about 40 minutes later.
Three features were done - branches created, tests written, code implemented, PRs ready for review. The fourth was still in progress, working through a tricky edge case in the payment integration.
No merge conflicts. No "which file did agent 2 break?" debugging. Each agent worked in its own isolated worktree, on its own branch, completely unaware of the others.
This is what I spent the last year building. And it almost didn't happen.
Three months before the coffee moment
Rewind to early 2025. I'm building a B2B platform - NestJS monorepo, TypeScript, the usual stack. I use Claude Code for everything. And every single session starts the same way.
"Here's how my guards work. Here's the interceptor chain. Here's why the repository pattern matters. No, don't put business logic in the controller."
Twenty minutes. Every. Session.
I wrote a CLAUDE.md file. Helped for a week. Then my codebase evolved and the document didn't. The AI would read stale rules and generate code that didn't match the current architecture.
So I built a tool that audits your repo automatically - framework, layers, naming conventions, database patterns, auth setup - and generates a .claude/ directory with rules derived from your actual code. Not a template. Your code, your patterns, your conventions.
The first time I ran it, the AI went from "explain your guards again" to just... following the patterns. No more 20-minute warmups.
That was the first piece of what became Forge DevKit.
Then I caught my AI lying about tests
With architecture rules in place, I asked Claude to add a payment webhook handler. Code looked great. Tests passed. 100% coverage.
Then I read the tests.
Every assertion was testing mock data. The coverage number was real. The verification was fake. The AI had optimized for "green checkmark" instead of "proves the webhook actually updates order status."
I added a rule: tests must trace back to acceptance criteria (AC) - the actual requirements, not the code. If an AC says "webhook updates order status," there must be a test that verifies that exact thing.
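To make the idea concrete, here is a minimal, hypothetical sketch of what an AC-traceable test looks like. The order store, handler, and AC label are invented for illustration; the point is that the assertion exercises real state the requirement cares about, not a mock.

```typescript
// Hypothetical sketch: an in-memory order store plus a webhook handler,
// with a test traced to the acceptance criterion
// "webhook updates order status" instead of asserting on mock data.

type OrderStatus = "pending" | "paid" | "failed";

interface Order {
  id: string;
  status: OrderStatus;
}

class OrderStore {
  private orders = new Map<string, Order>();
  add(order: Order): void { this.orders.set(order.id, order); }
  get(id: string): Order | undefined { return this.orders.get(id); }
}

// The handler under test: applies a payment webhook event to the store.
function handlePaymentWebhook(
  store: OrderStore,
  event: { orderId: string; outcome: "succeeded" | "failed" }
): void {
  const order = store.get(event.orderId);
  if (!order) throw new Error(`Unknown order ${event.orderId}`);
  order.status = event.outcome === "succeeded" ? "paid" : "failed";
}

// AC-2: "webhook updates order status"
// This test goes through the real store: if the handler stops writing
// the new status, the assertion fails. A mock-only test would not.
const store = new OrderStore();
store.add({ id: "ord_1", status: "pending" });
handlePaymentWebhook(store, { orderId: "ord_1", outcome: "succeeded" });
if (store.get("ord_1")?.status !== "paid") {
  throw new Error("AC-2 violated: status was not updated to paid");
}
```

The mock-data version of this test would stub the store, assert against the stub, and pass forever, which is exactly the failure mode described above.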
The AI started arguing. Not literally, but it generated explanations: "The type system already covers this." "This is an implementation detail." Each one sounded perfectly reasonable.
It was rationalizing. Producing smart-sounding arguments for why it should skip work.
I cataloged these patterns. Found 50+ distinct rationalizations over three months. Built detection for them. When the AI tries to skip a test mapped to a requirement, it gets blocked. It can't argue its way out.
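The shape of that detection can be sketched in a few lines. The patterns, names, and wording below are illustrative placeholders, not Forge's actual catalog; the design point is that matching a known rationalization blocks the skip outright, and anything unrecognized escalates to a human rather than being auto-approved.

```typescript
// Hypothetical sketch: block an agent's attempt to skip a requirement-mapped
// test when its justification matches a known rationalization pattern.

const RATIONALIZATION_PATTERNS: RegExp[] = [
  /type system already (covers|guarantees)/i,
  /implementation detail/i,
  /trivial to verify manually/i,
  /covered by (an )?existing test/i,
];

interface SkipRequest {
  acceptanceCriterion: string; // e.g. "AC-2: webhook updates order status"
  justification: string;       // the agent's explanation for skipping
}

function reviewSkipRequest(req: SkipRequest): { allowed: boolean; reason: string } {
  for (const pattern of RATIONALIZATION_PATTERNS) {
    if (pattern.test(req.justification)) {
      return {
        allowed: false,
        reason: `Blocked: justification matches known rationalization ${pattern}`,
      };
    }
  }
  // Unrecognized justifications are never auto-approved: a test mapped to
  // an AC can only be skipped with explicit human sign-off.
  return { allowed: false, reason: "Escalate: mapped AC requires human sign-off to skip" };
}
```

Note the default branch: the detector never grants permission on its own, so a novel rationalization buys the agent an escalation, not a pass.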
The first time it caught a rationalization I was about to approve myself - that's when I knew this was more than a side project.
From rules to a pipeline to Agent Teams
Rules weren't enough. I needed a full process with gates that block bad output at every step.
So I built a 5-phase TDD pipeline. When you run /forge:dev "Add payment webhook":
- Gate - creates a branch, loads acceptance criteria, links your task tracker
- Test (RED) - generates tests from requirements, not from code
- Implement (GREEN) - writes code to make those tests pass
- Verify - type check, lint, quality patterns, requirement coverage check
- Close - creates PR, updates tracker
The pipeline runs off specs I already wrote - acceptance criteria, architecture rules, quality gates. It doesn't ask me "should I write this test?" It reads the requirements, generates tests, writes code, and only stops when information is genuinely missing. Not hand-holding - actual autonomy with guardrails.
Once I had a pipeline that reliably produced shippable code from a single command... the next question was obvious.
What if I ran multiple pipelines at the same time?
That's the autopilot module. You point it at your backlog, it analyzes which features can run in parallel without file conflicts, groups them into waves, and spawns an AI agent for each feature - each in its own isolated git worktree with its own branch and its own ports.
The agents don't know about each other. They can't step on each other's code. When they're done, each one has a clean PR ready for your review.
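The conflict analysis can be sketched as a greedy grouping: a feature joins the first wave whose members touch none of the same files, otherwise it opens a new wave. This is a simplified illustration under the assumption that each feature's file set can be predicted up front; it is not Forge's actual algorithm.

```typescript
// Hypothetical sketch: group features into waves so that no two features
// in the same wave touch the same files.

interface Feature {
  id: string;
  files: Set<string>; // files the feature is predicted to modify
}

function overlaps(a: Set<string>, b: Set<string>): boolean {
  for (const file of a) if (b.has(file)) return true;
  return false;
}

function groupIntoWaves(features: Feature[]): Feature[][] {
  const waves: Feature[][] = [];
  for (const feature of features) {
    // Greedy: place the feature in the first wave with no file overlap.
    const wave = waves.find(w => w.every(f => !overlaps(f.files, feature.files)));
    if (wave) wave.push(feature);
    else waves.push([feature]);
  }
  return waves;
}
```

Each feature in a wave then gets its own isolation, e.g. `git worktree add ../feat-x -b feat-x` plus a disjoint port range, so parallel agents physically cannot edit the same checkout.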
That's how I ended up drinking coffee while 4 agents worked my backlog.
Here's what that actually looks like: a real autopilot session, 9 batches, 39 features.
The decision that felt wrong
Early on I made a risky call: make the tool disposable.
After Forge runs setup and generates your .claude/ artifacts - architecture rules, quality patterns, dev-skills - you can uninstall Forge entirely. The generated files work on their own. No plugin required.
No vendor lock-in. No runtime dependency. You pay once, you own the output forever.
Why would I make it easy for people to stop using my product? Because forcing dependency is exactly what I hate about most dev tools. And the architecture audit alone - the thing that kills the 20-minute warmup - is the Starter tier. EUR 29, one-time. Not a subscription.
The autopilot (the coffee moment) is the Complete tier at EUR 149. But you don't start there. You start by fixing context rot, then add the test pipeline, then grow into Agent Teams when you're ready.
14-day money-back guarantee on all tiers.
Works with Claude Code, Cursor, and any AI agent that reads project files.
Try the interactive demo (no signup): forge.reumbra.com/docs/interactive-guide
I'm launching Forge on Product Hunt tomorrow. I'll be writing more about how each piece works in this series - architecture audit, rationalization detection, the TDD pipeline, and Agent Teams.
What's the worst thing your AI agent has tried to get away with?