A few weeks ago I picked 4 features from my backlog, typed one command, and walked away from my desk.
I made coffee. I did laundry. I checked on the build about 40 minutes later.
Three features were done - branches created, tests written, code implemented, PRs ready for review. The fourth was still in progress, working through a tricky edge case in the payment integration.
No merge conflicts. No "which file did agent 2 break?" debugging. Each agent worked in its own isolated worktree, on its own branch, completely unaware of the others.
This is what I spent the last year building. And it almost didn't happen.
Three months before the coffee moment
Rewind to early 2025. I'm building a B2B platform - NestJS monorepo, TypeScript, the usual stack. I use Claude Code for everything. And every single session starts the same way.
"Here's how my guards work. Here's the interceptor chain. Here's why the repository pattern matters. No, don't put business logic in the controller."
Twenty minutes. Every. Session.
I wrote a CLAUDE.md file. Helped for a week. Then my codebase evolved and the document didn't. The AI would read stale rules and generate code that didn't match the current architecture.
So I built a tool that audits your repo automatically - framework, layers, naming conventions, database patterns, auth setup - and generates a .claude/ directory with rules derived from your actual code. Not a template. Your code, your patterns, your conventions.
The first time I ran it, the AI went from "explain your guards again" to just... following the patterns. No more 20-minute warmups.
That was the first piece of what became Forge DevKit.
Then I caught my AI lying about tests
With architecture rules in place, I asked Claude to add a payment webhook handler. Code looked great. Tests passed. 100% coverage.
Then I read the tests.
Every assertion was testing mock data. The coverage number was real. The verification was fake. The AI had optimized for "green checkmark" instead of "proves the webhook actually updates order status."
I added a rule: tests must trace back to acceptance criteria (AC) - the actual requirements, not the code. If an AC says "webhook updates order status," there must be a test that verifies that exact thing.
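To make the idea concrete, here is a minimal, hypothetical sketch of what an AC-traceable test looks like. The order store, handler, and AC label are invented for illustration; the point is that the assertion exercises real state the requirement cares about, not a mock.

```typescript
// Hypothetical sketch: an in-memory order store plus a webhook handler,
// with a test traced to the acceptance criterion
// "webhook updates order status" instead of asserting on mock data.

type OrderStatus = "pending" | "paid" | "failed";

interface Order {
  id: string;
  status: OrderStatus;
}

class OrderStore {
  private orders = new Map<string, Order>();
  add(order: Order): void { this.orders.set(order.id, order); }
  get(id: string): Order | undefined { return this.orders.get(id); }
}

// The handler under test: applies a payment webhook event to the store.
function handlePaymentWebhook(
  store: OrderStore,
  event: { orderId: string; outcome: "succeeded" | "failed" }
): void {
  const order = store.get(event.orderId);
  if (!order) throw new Error(`Unknown order ${event.orderId}`);
  order.status = event.outcome === "succeeded" ? "paid" : "failed";
}

// AC-2: "webhook updates order status"
// This test goes through the real store: if the handler stops writing
// the new status, the assertion fails. A mock-only test would not.
const store = new OrderStore();
store.add({ id: "ord_1", status: "pending" });
handlePaymentWebhook(store, { orderId: "ord_1", outcome: "succeeded" });
if (store.get("ord_1")?.status !== "paid") {
  throw new Error("AC-2 violated: status was not updated to paid");
}
```

The mock-data version of this test would stub the store, assert against the stub, and pass forever, which is exactly the failure mode described above.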
The AI started arguing. Not literally, but it generated explanations: "The type system already covers this." "This is an implementation detail." Each one sounded perfectly reasonable.
It was rationalizing. Producing smart-sounding arguments for why it should skip work.
I cataloged these patterns. Found 50+ distinct rationalizations over three months. Built detection for them. When the AI tries to skip a test mapped to a requirement, it gets blocked. It can't argue its way out.
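The shape of that detection can be sketched in a few lines. The patterns, names, and wording below are illustrative placeholders, not Forge's actual catalog; the design point is that matching a known rationalization blocks the skip outright, and anything unrecognized escalates to a human rather than being auto-approved.

```typescript
// Hypothetical sketch: block an agent's attempt to skip a requirement-mapped
// test when its justification matches a known rationalization pattern.

const RATIONALIZATION_PATTERNS: RegExp[] = [
  /type system already (covers|guarantees)/i,
  /implementation detail/i,
  /trivial to verify manually/i,
  /covered by (an )?existing test/i,
];

interface SkipRequest {
  acceptanceCriterion: string; // e.g. "AC-2: webhook updates order status"
  justification: string;       // the agent's explanation for skipping
}

function reviewSkipRequest(req: SkipRequest): { allowed: boolean; reason: string } {
  for (const pattern of RATIONALIZATION_PATTERNS) {
    if (pattern.test(req.justification)) {
      return {
        allowed: false,
        reason: `Blocked: justification matches known rationalization ${pattern}`,
      };
    }
  }
  // Unrecognized justifications are never auto-approved: a test mapped to
  // an AC can only be skipped with explicit human sign-off.
  return { allowed: false, reason: "Escalate: mapped AC requires human sign-off to skip" };
}
```

Note the default branch: the detector never grants permission on its own, so a novel rationalization buys the agent an escalation, not a pass.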
The first time it caught a rationalization I was about to approve myself - that's when I knew this was more than a side project.
From rules to a pipeline to Agent Teams
Rules weren't enough. I needed a full process with gates that block bad output at every step.
So I built a 5-phase TDD pipeline. When you run /forge:dev "Add payment webhook":
- Gate - creates a branch, loads acceptance criteria, links your task tracker
- Test (RED) - generates tests from requirements, not from code
- Implement (GREEN) - writes code to make those tests pass
- Verify - type check, lint, quality patterns, requirement coverage check
- Close - creates PR, updates tracker
The pipeline runs off specs I already wrote - acceptance criteria, architecture rules, quality gates. It doesn't ask me "should I write this test?" It reads the requirements, generates tests, writes code, and only stops when information is genuinely missing. Not hand-holding - actual autonomy with guardrails.
Once I had a pipeline that reliably produced shippable code from a single command... the next question was obvious.
What if I ran multiple pipelines at the same time?
That's the autopilot module. You point it at your backlog, it analyzes which features can run in parallel without file conflicts, groups them into waves, and spawns an AI agent for each feature - each in its own isolated git worktree with its own branch and its own ports.
The agents don't know about each other. They can't step on each other's code. When they're done, each one has a clean PR ready for your review.
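The conflict analysis can be sketched as a greedy grouping: a feature joins the first wave whose members touch none of the same files, otherwise it opens a new wave. This is a simplified illustration under the assumption that each feature's file set can be predicted up front; it is not Forge's actual algorithm.

```typescript
// Hypothetical sketch: group features into waves so that no two features
// in the same wave touch the same files.

interface Feature {
  id: string;
  files: Set<string>; // files the feature is predicted to modify
}

function overlaps(a: Set<string>, b: Set<string>): boolean {
  for (const file of a) if (b.has(file)) return true;
  return false;
}

function groupIntoWaves(features: Feature[]): Feature[][] {
  const waves: Feature[][] = [];
  for (const feature of features) {
    // Greedy: place the feature in the first wave with no file overlap.
    const wave = waves.find(w => w.every(f => !overlaps(f.files, feature.files)));
    if (wave) wave.push(feature);
    else waves.push([feature]);
  }
  return waves;
}
```

Each feature in a wave then gets its own isolation, e.g. `git worktree add ../feat-x -b feat-x` plus a disjoint port range, so parallel agents physically cannot edit the same checkout.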
That's how I ended up drinking coffee while 4 agents worked my backlog.
Here's what that actually looks like: a real autopilot session, 9 batches, 39 features.
The decision that felt wrong
Early on I made a risky call: make the tool disposable.
After Forge runs setup and generates your .claude/ artifacts - architecture rules, quality patterns, dev-skills - you can uninstall Forge entirely. The generated files work on their own. No plugin required.
No vendor lock-in. No runtime dependency. You pay once, you own the output forever.
Why would I make it easy for people to stop using my product? Because forcing dependency is exactly what I hate about most dev tools. And the architecture audit alone - the thing that kills the 20-minute warmup - is the Starter tier. EUR 29, one-time. Not a subscription.
The autopilot (the coffee moment) is the Complete tier at EUR 149. But you don't start there. You start by fixing context rot, then add the test pipeline, then grow into Agent Teams when you're ready.
14-day money-back guarantee on all tiers.
Works with Claude Code, Cursor, and any AI agent that reads project files.
Try the interactive demo (no signup): forge.reumbra.com/docs/interactive-guide
I'm launching Forge on Product Hunt tomorrow. I'll be writing more about how each piece works in this series - architecture audit, rationalization detection, the TDD pipeline, and Agent Teams.
What's the worst thing your AI agent has tried to get away with?