Three weeks ago I set a goal: build an AI system that could autonomously create content, fix its own bugs, and run business operations while I sleep.
Here's what actually happened — and the metrics that prove it works.
The Setup: 3 AIs, 1 Goal
My "AI Office" runs three models in parallel:
- Claude (Executor) — the only one with browser access, bash, file system
- ChatGPT Plus (Architect) — strategy, analysis, planning
- Gemini Pro (Auditor) — verification, criticism, finding blind spots
Every task goes through a structured debate. No single AI decides alone. The system only acts when all three agree.
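A minimal sketch of that "all three must agree" gate, in case it helps to see it as code. The `Verdict` class, the agent labels, and `should_act` are hypothetical names for illustration, not the system's actual code:

```python
from dataclasses import dataclass

# Hypothetical labels for the three models described above
REQUIRED_AGENTS = {"claude", "chatgpt", "gemini"}


@dataclass
class Verdict:
    agent: str      # which model cast this vote
    approve: bool   # True if the model signs off on the proposed action
    note: str = ""  # optional objection or comment


def should_act(verdicts):
    """Return True only if every required agent voted and all approved."""
    by_agent = {v.agent: v for v in verdicts}
    if set(by_agent) != REQUIRED_AGENTS:
        return False  # a missing or unknown voter blocks the action
    return all(v.approve for v in by_agent.values())


votes = [
    Verdict("claude", True),
    Verdict("chatgpt", True),
    Verdict("gemini", False, "untested code path"),
]
print(should_act(votes))  # False: one objection is enough to block
```

Treating a missing vote as a "no" is a design assumption here: it keeps a crashed or timed-out agent from silently turning a three-way check into a two-way one.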
Real Numbers After 7 Days
| Metric | Value |
|---|---|
| Tasks completed autonomously | 20 |
| Tasks failed | 4 |
| Tasks blocked (need human) | 5 |
| Task success rate | 83% |
| Average debate quality | 83.3/100 |
| Best debate score | 95/100 |
| Articles published | 8 |
| Gumroad products live | 3 |
The 83% success rate (20 of the 24 tasks that either completed or failed; the 5 blocked tasks are not counted) surprised me. I expected 60%.
What "Autonomy" Actually Means
Most AI agents claim autonomy but call home the moment anything unexpected happens.
Mine has a different rule: if a task requires a human, emit ATTESA_UMANA: (Italian for "waiting for human") in the fusion output. That's it. Everything else gets executed.
The 5 "waiting for human" tasks? All external actions I can't automate without getting banned — Reddit posts, GitHub account creation, platform-specific setups. Technical autonomy is at ~100% for internal operations.
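One way such a marker could be routed, as a rough sketch. The article does not show the fusion output format, so this assumes one instruction per line with `ATTESA_UMANA:` as a line prefix; `split_fusion_output` and the sample lines are hypothetical:

```python
HUMAN_MARKER = "ATTESA_UMANA:"


def split_fusion_output(fusion_output):
    """Separate lines flagged for a human from lines the executor may run.

    Assumes one instruction per line (an assumption, not the real format).
    """
    needs_human, executable = [], []
    for raw in fusion_output.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(HUMAN_MARKER):
            # Keep only the description after the marker
            needs_human.append(line[len(HUMAN_MARKER):].strip())
        else:
            executable.append(line)
    return needs_human, executable


# Made-up fusion output for illustration
sample_output = """
step: regenerate article draft
ATTESA_UMANA: create the Reddit post manually
step: run link checker
"""
waiting, runnable = split_fusion_output(sample_output)
print(waiting)   # ['create the Reddit post manually']
print(runnable)  # ['step: regenerate article draft', 'step: run link checker']
```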
The Debate Protocol That Changed Everything
Here's what most multi-agent systems get wrong: they let agents talk to each other in sequence. The first agent biases all the others.
My system runs Round 1 in parallel — no agent sees the others' responses. Only from Round 2 do they read each other's output and cross-examine.
The result: genuine disagreement. Claude proposed one thing in Round 1, GPT demolished it with data in Round 2, and the final solution was better than either starting position.
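The two-round shape can be sketched with `asyncio`. This is an illustration under assumptions, not the real orchestration (which runs on n8n): `ask` is a stand-in that echoes its prompt instead of calling a model, and `run_debate` is a hypothetical name.

```python
import asyncio


async def ask(agent, prompt):
    """Stand-in for a real model call; it only echoes the prompt."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"[{agent}] {prompt}"


async def run_debate(task, agents=("claude", "chatgpt", "gemini")):
    # Round 1: every agent gets the same prompt and sees no other answer.
    round1 = await asyncio.gather(*(ask(a, task) for a in agents))
    first = dict(zip(agents, round1))

    # Round 2: each agent reads the others' Round 1 answers and critiques them.
    async def cross_examine(agent):
        others = "\n".join(f"{a}: {r}" for a, r in first.items() if a != agent)
        return await ask(agent, f"Critique these positions:\n{others}")

    round2 = await asyncio.gather(*(cross_examine(a) for a in agents))
    return first, dict(zip(agents, round2))


if __name__ == "__main__":
    r1, r2 = asyncio.run(run_debate("Why is the executor failing?"))
    print(r2["claude"])
```

The point of the structure is that nothing in Round 1 takes another agent's answer as input, so no position can anchor the others before they have committed to their own.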
Example from a real debate (quality score 95/100):
- Claude: "executor success rate is 52.7% — broken system"
- GPT: "breakdown by type — BASH 91.7%, FIX 45%, VERIFICA 48.7% — the issue is specifically string matching drift"
- Gemini: "confirmed, but the real problem is the system can't self-diagnose — missing /health endpoint"
- Final action: fixed the health endpoint AND the string matching bug. Both in one session.
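The breakdown GPT asked for is a group-by over the executor log. A small sketch, assuming the log can be reduced to `(command_type, succeeded)` pairs; the function name and the toy records are made up and are not the real log:

```python
from collections import defaultdict


def success_by_type(records):
    """records: iterable of (command_type, succeeded) pairs -> {type: rate}."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for command_type, succeeded in records:
        totals[command_type] += 1
        wins[command_type] += bool(succeeded)
    return {t: wins[t] / totals[t] for t in totals}


# Toy data for illustration only
sample = [
    ("BASH", True), ("BASH", True), ("BASH", True),
    ("FIX", True), ("FIX", False), ("FIX", False),
    ("VERIFICA", True), ("VERIFICA", False),
]
overall = sum(ok for _, ok in sample) / len(sample)
rates = success_by_type(sample)
print(f"overall: {overall:.1%}")
for command_type, rate in rates.items():
    print(f"{command_type}: {rate:.1%}")
```

In the toy data the aggregate looks mediocre while one command type is perfectly healthy, which is the same pattern the debate surfaced: the single number hides where the failures cluster.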
The Bug That Kept Failing (And How We Fixed It)
The executor's FIX command had a 45% success rate. Embarrassing.
Root cause: the debate produces CERCA=<exact_text> (CERCA is Italian for "search") for string replacement. But if an earlier FIX in the same batch has already modified the file, the next CERCA= string may no longer match. Text drift.
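To make the drift concrete, here is a tiny reproduction. `apply_fix` is a hypothetical exact-match replacer, and the two fixes are invented; the only thing taken from the system is the idea that both search strings were written against the original file:

```python
def apply_fix(content, old_text, new_text):
    """Exact-match replacement; returns (new_content, applied)."""
    if old_text not in content:
        return content, False
    return content.replace(old_text, new_text, 1), True


source = "def total(items):\n    return sum(items)\n"

# Both fixes were authored against the ORIGINAL file contents.
fix_1 = ("def total(items):", "def total(items, start=0):")
fix_2 = (
    "def total(items):\n    return sum(items)",
    "def total(items):\n    return sum(items, 0)",
)

source, ok_1 = apply_fix(source, *fix_1)
source, ok_2 = apply_fix(source, *fix_2)
print(ok_1, ok_2)  # True False: fix_1 changed the line fix_2 searches for
```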
Fix (implemented autonomously, no human involved):
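```python
import re

def _norm(s):
    """Collapse runs of spaces/tabs into one space and trim both ends."""
    return re.sub(r'[ \t]+', ' ', s.strip())

# `content` is the file text and `old_text` the CERCA= string from the debate
if old_text not in content:
    # Fuzzy fallback: normalize whitespace and retry
    old_norm = _norm(old_text)
    window = len(old_text.splitlines())  # number of lines the search text spans
    lines = content.splitlines(keepends=True)
    for i in range(len(lines) - window + 1):
        chunk = "".join(lines[i:i + window])
        if _norm(chunk) == old_norm:
            old_text = chunk  # found the actual text in file
            break
```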
The system found its own bug, debated the fix, implemented it, and validated with py_compile. Zero human involvement.
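For reference, a `py_compile` check can look like the sketch below. `compiles_cleanly` is a hypothetical helper, not the executor's actual validation step, and it assumes the patched source is available as a string:

```python
import os
import py_compile
import tempfile


def compiles_cleanly(source_code):
    """Write source to a temp file and byte-compile it; True if it compiles."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate.py")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(source_code)
        try:
            # doraise=True makes compile errors raise instead of printing
            py_compile.compile(
                path, cfile=os.path.join(tmp, "candidate.pyc"), doraise=True
            )
        except py_compile.PyCompileError:
            return False
    return True


print(compiles_cleanly("x = 1\n"))      # True
print(compiles_cleanly("def f(:\n"))    # False
```

Note that `py_compile` only confirms the file parses and compiles to bytecode. It does not run the code, so it catches a broken patch but not a patch with the wrong behavior.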
The n8n Foundation
The whole system runs on n8n for workflow orchestration — it's the connective tissue between agents, databases, and external services. If you want to replicate this architecture, you need n8n running reliably first.
I wrote a complete self-hosting guide covering Docker setup, Caddy reverse proxy, HTTPS, and cost optimization with GPT-4o-mini instead of GPT-4o (roughly 90% cheaper, with the same quality on structured tasks in my experience).
📋 Self-Host n8n for $5/month — Complete Cheat Sheet — everything you need to get n8n production-ready, including the exact docker-compose.yml, Caddy config, and environment variables I use. $4.99, instant download.
What's Next
The goal is zero human intervention for revenue-generating tasks. Current blockers:
- Dev.to automation (working on OpenClaw integration)
- Reddit posting (ToS risk, exploring alternatives)
- WhatsApp setup for urgent human-in-the-loop decisions
Revenue so far: $12.98 from 2 sales. Distribution is the bottleneck, not the product. This article is step one of fixing that.
Follow @automatewithai for weekly updates on building autonomous AI systems.
All automation resources → delucafab.gumroad.com
Tags: ai, automation, python, selfhosted, n8n