Altiora

Posted on Mar 27

How I Built a Self-Improving AI Office That Works 24/7 (Without Me)

#ai #automation #python #selfhosted

Three weeks ago I set a goal: build an AI system that could autonomously create content, fix its own bugs, and run business operations while I sleep.

Here's what actually happened — and the metrics that prove it works.

The Setup: 3 AIs, 1 Goal

My "AI Office" runs three models in parallel:

Claude (Executor) — the only one with browser access, bash, file system
ChatGPT Plus (Architect) — strategy, analysis, planning
Gemini Pro (Auditor) — verification, criticism, finding blind spots

Every task goes through a structured debate. No single AI decides alone. The system only acts when all three agree.
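The "all three agree" rule can be pictured as a small consensus gate. This is only a sketch under assumptions: the `Verdict` type, the `unanimous` function, and the agent names are hypothetical and not taken from the actual system.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    agent: str       # hypothetical agent id, e.g. "claude", "gpt", "gemini"
    approve: bool    # whether this agent accepts the proposed action
    note: str = ""   # optional reasoning attached to the vote


def unanimous(verdicts, required=("claude", "gpt", "gemini")):
    """Return True only if every required agent voted and approved."""
    by_agent = {v.agent: v for v in verdicts}
    return all(a in by_agent and by_agent[a].approve for a in required)
```

A missing vote counts as a "no" here, so the system stays idle rather than acting on a partial debate.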

Real Numbers After 7 Days

Metric	Value
Tasks completed autonomously	20
Tasks failed	4
Tasks blocked (need human)	5
Task success rate	83% (20 of 24 attempted)
Average debate quality	83.3/100
Best debate score	95/100
Articles published	8
Gumroad products live	3

The 83% success rate surprised me. I expected 60%.
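The 83% figure matches the table if blocked tasks are left out of the denominator. That reading is an assumption based on the numbers above; the variable names in this quick check are illustrative.

```python
completed, failed, blocked = 20, 4, 5

attempted = completed + failed  # blocked tasks never ran, so they are excluded
success_rate = completed / attempted
share_of_all = completed / (completed + failed + blocked)

print(f"{success_rate:.1%} of attempted tasks")  # 83.3% of attempted tasks
print(f"{share_of_all:.1%} of all queued tasks")  # 69.0% of all queued tasks
```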

What "Autonomy" Actually Means

Most AI agents claim autonomy but call home the moment anything unexpected happens.

Mine follows a different rule: if a task requires a human, the system emits ATTESA_UMANA: (Italian for "waiting for human") in the fusion output. That's it. Everything else gets executed.

The 5 "waiting for human" tasks? All are external actions I can't automate without risking a ban: Reddit posts, GitHub account creation, platform-specific setups. Technical autonomy is at ~100% for internal operations.
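A minimal sketch of how that marker rule might be applied. It assumes the fusion output is plain text with one action per line, which the article does not specify; the `route` function is hypothetical.

```python
HUMAN_MARKER = "ATTESA_UMANA:"


def route(fusion_output: str):
    """Split fusion output into (actions to execute, items waiting for a human).

    Assumes one action per line; the real output format may differ.
    """
    execute, waiting = [], []
    for line in fusion_output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(HUMAN_MARKER):
            waiting.append(line[len(HUMAN_MARKER):].strip())
        else:
            execute.append(line)
    return execute, waiting
```

For example, `route("run tests\nATTESA_UMANA: post to Reddit")` returns `(["run tests"], ["post to Reddit"])`.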

The Debate Protocol That Changed Everything

Here's what most multi-agent systems get wrong: they let agents talk to each other in sequence. The first agent biases all the others.

My system runs Round 1 in parallel — no agent sees the others' responses. Only from Round 2 do they read each other's output and cross-examine.

The result: genuine disagreement. Claude proposed one thing in Round 1, GPT demolished it with data in Round 2, and the final solution was better than either starting position.
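The two-round structure can be sketched with asyncio. The `ask` function is a hypothetical stand-in that just echoes its prompt; a real version would call each model's API, and the agent names are placeholders.

```python
import asyncio


async def ask(agent: str, prompt: str) -> str:
    """Hypothetical stand-in for a real model call; echoes the prompt."""
    await asyncio.sleep(0)  # placeholder for network latency
    return f"[{agent}] {prompt}"


async def debate(task: str, agents=("claude", "gpt", "gemini")):
    # Round 1: every agent sees only the task, and all calls run concurrently.
    answers = await asyncio.gather(*(ask(a, task) for a in agents))
    first = dict(zip(agents, answers))

    # Round 2: each agent reads the OTHER agents' Round 1 answers and critiques them.
    async def cross_examine(agent: str) -> str:
        others = "\n".join(f"{a}: {r}" for a, r in first.items() if a != agent)
        return await ask(agent, f"Critique these positions on '{task}':\n{others}")

    critiques = await asyncio.gather(*(cross_examine(a) for a in agents))
    return first, dict(zip(agents, critiques))
```

Calling `asyncio.run(debate("fix the executor"))` returns both rounds. The point of the design is that nothing from Round 1 is shared until every Round 1 answer is in.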

Example from a real debate (quality score 95/100):

Claude: "executor success rate is 52.7% — broken system"
GPT: "breakdown by type — BASH 91.7%, FIX 45%, VERIFICA 48.7% — the issue is specifically string matching drift"
Gemini: "confirmed, but the real problem is the system can't self-diagnose — missing /health endpoint"
Final action: fixed the health endpoint AND the string matching bug. Both in one session.

The Bug That Kept Failing (And How We Fixed It)

The executor's FIX command had a 45% success rate. Embarrassing.

Root cause: the debate emits CERCA=<exact_text> (CERCA is Italian for "search") as the string to replace. But if an earlier FIX in the same batch has already modified the file, the next CERCA= string no longer matches. Text drift.
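To make the drift concrete, here is a toy reproduction. The file content and the two fixes are invented for illustration; only the failure pattern comes from the description above.

```python
content = "def run():\n    timeout = 10\n    retries = 3\n"

# Two fixes planned against the ORIGINAL file, as (search, replacement) pairs.
fixes = [
    ("timeout = 10", "timeout = 30"),
    ("    timeout = 10\n    retries = 3", "    timeout = 10\n    retries = 5"),
]

results = []
for search, replacement in fixes:
    if search in content:
        content = content.replace(search, replacement, 1)
        results.append("applied")
    else:
        # The first fix already rewrote part of this search text.
        results.append("no match")

print(results)  # ['applied', 'no match']
```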

Fix (implemented autonomously, no human involved):

import re

def _norm(s):
    # Collapse runs of spaces/tabs so spacing differences don't block a match
    return re.sub(r'[ \t]+', ' ', s.strip())

if old_text not in content:
    # Fuzzy fallback: slide a window of the same line count over the file
    old_norm = _norm(old_text)
    n = len(old_text.splitlines())
    lines = content.splitlines(keepends=True)
    for i in range(len(lines) - n + 1):
        chunk = "".join(lines[i:i + n])
        if _norm(chunk) == old_norm:
            old_text = chunk  # found the actual text in file
            break

The system found its own bug, debated the fix, implemented it, and validated with py_compile. Zero human involvement.
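The py_compile validation step might look like the following. The `compiles` helper is hypothetical; `py_compile.compile` with `doraise=True` is the standard-library call, and it only checks that the file byte-compiles, not that it behaves correctly.

```python
import os
import py_compile
import tempfile


def compiles(source: str) -> bool:
    """Return True if the given Python source byte-compiles without a syntax error."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate.py")
        with open(path, "w", encoding="utf-8") as f:
            f.write(source)
        try:
            # doraise=True turns compile failures into PyCompileError
            py_compile.compile(path, cfile=path + "c", doraise=True)
            return True
        except py_compile.PyCompileError:
            return False
```

A patched file that fails this check can be rolled back before it is ever run.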

The n8n Foundation

The whole system runs on n8n for workflow orchestration — it's the connective tissue between agents, databases, and external services. If you want to replicate this architecture, you need n8n running reliably first.

I built a complete self-hosting guide covering Docker setup, Caddy reverse proxy, HTTPS, and cost optimization with GPT-4o-mini instead of GPT-4o (90% cost reduction, same quality for structured tasks).

📋 Self-Host n8n for $5/month — Complete Cheat Sheet — everything you need to get n8n production-ready, including the exact docker-compose.yml, Caddy config, and environment variables I use. $4.99, instant download.

What's Next

The goal is zero human intervention for revenue-generating tasks. Current blockers:

Dev.to automation (working on OpenClaw integration)
Reddit posting (ToS risk, exploring alternatives)
WhatsApp setup for urgent human-in-the-loop decisions

Revenue so far: $12.98 from 2 sales. Distribution is the bottleneck, not the product. This article is step one of fixing that.

Follow @automatewithai for weekly updates on building autonomous AI systems.

All automation resources → delucafab.gumroad.com

Tags: ai, automation, python, selfhosted, n8n

DEV Community