DEV Community

fabrizio de luca

How I Built a Self-Improving AI Office That Works 24/7 (Without Me)

Three weeks ago I set a goal: build an AI system that could autonomously create content, fix its own bugs, and run business operations while I sleep.

Here's what actually happened — and the metrics that prove it works.

The Setup: 3 AIs, 1 Goal

My "AI Office" runs three models in parallel:

  • Claude (Executor) — the only one with browser access, bash, file system
  • ChatGPT Plus (Architect) — strategy, analysis, planning
  • Gemini Pro (Auditor) — verification, criticism, finding blind spots

Every task goes through a structured debate. No single AI decides alone. The system only acts when all three agree.
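The "act only when all three agree" rule can be sketched as a small consensus gate. This is a minimal illustration, not the system's actual code: the `Verdict` type, the `unanimous` function, and the lowercase agent names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    agent: str      # hypothetical agent id, e.g. "claude", "chatgpt", "gemini"
    approve: bool   # whether this agent accepts the proposed action
    reason: str = ""


def unanimous(verdicts: list[Verdict], required: set[str]) -> bool:
    """Return True only if every required agent voted and all of them approved."""
    votes = {v.agent: v.approve for v in verdicts}
    return required.issubset(votes) and all(votes[a] for a in required)


REQUIRED = {"claude", "chatgpt", "gemini"}
verdicts = [
    Verdict("claude", True),
    Verdict("chatgpt", True),
    Verdict("gemini", False, "no health check"),
]
print(unanimous(verdicts, REQUIRED))  # False
```

A missing vote counts as disagreement here, which is the conservative choice for a system that executes commands on its own.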

Real Numbers After 7 Days

Metric                          Value
Tasks completed autonomously    20
Tasks failed                    4
Tasks blocked (need human)      5
Task success rate               83%
Average debate quality          83.3/100
Best debate score               95/100
Articles published              8
Gumroad products live           3

The 83% success rate surprised me. I expected 60%.
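The 83% figure is consistent with dividing completed tasks by completed plus failed, leaving the human-blocked tasks out of the denominator. That reading is inferred from the numbers above rather than stated anywhere, so the arithmetic below is a sanity check under that assumption.

```python
completed, failed, blocked = 20, 4, 5

# Success rate over tasks that actually ran (blocked tasks excluded)
rate_executed = completed / (completed + failed)

# Stricter variant: count human-blocked tasks in the denominator too
rate_all = completed / (completed + failed + blocked)

print(f"{rate_executed:.1%}")  # 83.3%
print(f"{rate_all:.1%}")       # 69.0%
```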

What "Autonomy" Actually Means

Most AI agents claim autonomy but call home the moment anything unexpected happens.

Mine has a different rule: if a task requires a human, the agents emit an ATTESA_UMANA: marker (Italian for "waiting for human") in the fusion output. That's it. Everything else gets executed.
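A minimal sketch of how an executor could check for that marker before acting. The exact format is an assumption: here the marker sits at the start of a line and is followed by a free-text reason. The `human_blockers` function and the `BASH=` line in the sample are hypothetical.

```python
import re

# Assumed format: a line starting with "ATTESA_UMANA:" followed by a free-text reason
MARKER = re.compile(r"^[ \t]*ATTESA_UMANA:[ \t]*(.*)$", re.MULTILINE)


def human_blockers(fusion_output: str) -> list[str]:
    """Return the reason on every ATTESA_UMANA: line (an empty list means nothing is blocked)."""
    return [m.group(1).strip() for m in MARKER.finditer(fusion_output)]


# Hypothetical fusion output, for illustration only
sample = "BASH=systemctl restart executor\nATTESA_UMANA: create GitHub account\n"
print(human_blockers(sample))  # ['create GitHub account']
```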

The 5 "waiting for human" tasks? All external actions I can't automate without getting banned — Reddit posts, GitHub account creation, platform-specific setups. Technical autonomy is at ~100% for internal operations.

The Debate Protocol That Changed Everything

Here's what most multi-agent systems get wrong: they let agents talk to each other in sequence. The first agent biases all the others.

My system runs Round 1 in parallel — no agent sees the others' responses. Only from Round 2 do they read each other's output and cross-examine.
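The two-round structure can be sketched with `asyncio`. The `ask` function is a stand-in for a real model call, and the agent names and prompts are hypothetical; only the shape (independent first round, cross-examination second) comes from the description above.

```python
import asyncio


async def ask(agent: str, prompt: str) -> str:
    """Stand-in for a real model call; swap in an actual API client here."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"[{agent}] answer to: {prompt[:40]}"


async def debate(task: str, agents: list[str]) -> dict[str, str]:
    # Round 1: all agents answer concurrently and see only the task
    round1 = await asyncio.gather(*(ask(a, task) for a in agents))
    first = dict(zip(agents, round1))

    # Round 2: each agent receives the others' Round 1 answers to cross-examine
    async def rebut(agent: str) -> str:
        others = "\n".join(ans for name, ans in first.items() if name != agent)
        return await ask(agent, f"Critique these answers:\n{others}\nTask: {task}")

    round2 = await asyncio.gather(*(rebut(a) for a in agents))
    return dict(zip(agents, round2))


result = asyncio.run(debate("Why is the executor failing?", ["claude", "chatgpt", "gemini"]))
print(result["gemini"])
```

Because Round 1 prompts contain only the task, no agent can anchor on another's answer before committing to its own.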

The result: genuine disagreement. Claude proposed one thing in Round 1, GPT demolished it with data in Round 2, and the final solution was better than either starting position.

Example from a real debate (quality score 95/100):

  • Claude: "executor success rate is 52.7% — broken system"
  • GPT: "breakdown by type — BASH 91.7%, FIX 45%, VERIFICA 48.7% — the issue is specifically string matching drift"
  • Gemini: "confirmed, but the real problem is the system can't self-diagnose — missing /health endpoint"
  • Final action: fixed the health endpoint AND the string matching bug. Both in one session.
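GPT's move in that debate was to split one aggregate rate by command type. A rough sketch of that breakdown follows. The log is made-up illustrative data, not the system's real numbers, and `success_by_type` is a hypothetical helper.

```python
from collections import defaultdict


def success_by_type(log: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each command type to its share of successful runs."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # type -> [ok, total]
    for kind, ok in log:
        totals[kind][0] += int(ok)
        totals[kind][1] += 1
    return {kind: ok / total for kind, (ok, total) in totals.items()}


# Illustrative log only: 10 BASH runs (9 ok) and 5 FIX runs (2 ok)
log = [("BASH", True)] * 9 + [("BASH", False)] + [("FIX", True)] * 2 + [("FIX", False)] * 3

overall = sum(ok for _, ok in log) / len(log)
by_type = success_by_type(log)
print(f"overall {overall:.0%}")                      # overall 73%
print({k: f"{v:.0%}" for k, v in by_type.items()})   # {'BASH': '90%', 'FIX': '40%'}
```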

The Bug That Kept Failing (And How We Fixed It)

The executor's FIX command had a 45% success rate. Embarrassing.

Root cause: the debate produces CERCA=<exact_text> (CERCA is Italian for "search") as the target of a string replacement. But if a previous FIX in the same batch has already modified the file, the next CERCA= string may no longer match the file's contents. Text drift.
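The drift is easy to reproduce in isolation. In this sketch, `apply_batch` and the sample data are hypothetical. Both search strings were written against the original file, and the first edit changes only spacing, which is the case a whitespace-normalizing retry targets.

```python
def apply_batch(content: str, fixes: list[tuple[str, str]]) -> tuple[str, list[bool]]:
    """Apply (search, replacement) pairs in order and report which ones matched."""
    matched = []
    for search, replacement in fixes:
        hit = search in content
        matched.append(hit)
        if hit:
            content = content.replace(search, replacement, 1)
    return content, matched


source = "x   = 1\nyy  = 2\n"
fixes = [
    ("x   = 1", "x = 1"),                     # FIX 1 tidies the spacing on line 1
    ("x   = 1\nyy  = 2", "x = 1\nyy = 3"),    # FIX 2 still expects the old spacing
]
final, matched = apply_batch(source, fixes)
print(matched)  # [True, False]
```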

Fix (implemented autonomously, no human involved):

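import re

def norm(s):
    # Collapse runs of spaces/tabs and trim both ends; newlines inside are kept
    return re.sub(r'[ \t]+', ' ', s.strip())

if old_text not in content:
    # Fuzzy fallback: slide a window of the same line count over the file
    # and compare whitespace-normalized text
    old_norm = norm(old_text)
    window = len(old_text.splitlines())
    lines = content.splitlines(keepends=True)
    for i in range(len(lines)):
        chunk = "".join(lines[i:i + window])
        if norm(chunk) == old_norm:
            # Use the file's actual text; drop the trailing line ending if the
            # search text had none, so the replacement does not swallow it
            old_text = chunk if old_text.endswith("\n") else chunk.rstrip("\r\n")
            break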

The system found its own bug, debated the fix, implemented it, and validated with py_compile. Zero human involvement.
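For reference, a syntax check with the standard library's `py_compile` could look like the sketch below. The `compiles` wrapper and the temporary files are hypothetical; how the system actually invokes `py_compile` isn't shown here.

```python
import py_compile
import tempfile
from pathlib import Path


def compiles(path: str) -> bool:
    """Return True if the file at `path` is syntactically valid Python."""
    try:
        py_compile.compile(path, doraise=True)
        return True
    except py_compile.PyCompileError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.py"
    bad = Path(tmp) / "bad.py"
    good.write_text("x = 1\n")
    bad.write_text("def broken(:\n")
    results = (compiles(str(good)), compiles(str(bad)))

print(results)  # (True, False)
```

Note that `py_compile` only confirms the file parses and compiles to bytecode. It does not run the code, so a patch that is syntactically valid but logically wrong would still pass.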

The n8n Foundation

The whole system runs on n8n for workflow orchestration — it's the connective tissue between agents, databases, and external services. If you want to replicate this architecture, you need n8n running reliably first.

I built a complete self-hosting guide covering Docker setup, Caddy reverse proxy, HTTPS, and cost optimization with GPT-4o-mini instead of GPT-4o (90% cost reduction, same quality for structured tasks).

📋 Self-Host n8n for $5/month — Complete Cheat Sheet — everything you need to get n8n production-ready, including the exact docker-compose.yml, Caddy config, and environment variables I use. $4.99, instant download.

What's Next

The goal is zero human intervention for revenue-generating tasks. Current blockers:

  • Dev.to automation (working on OpenClaw integration)
  • Reddit posting (ToS risk, exploring alternatives)
  • WhatsApp setup for urgent human-in-the-loop decisions

Revenue so far: $12.98 from 2 sales. Distribution is the bottleneck, not the product. This article is step one of fixing that.


Follow @automatewithai for weekly updates on building autonomous AI systems.

All automation resources → delucafab.gumroad.com


Tags: ai, automation, python, selfhosted, n8n
