DEV Community: lewisallena17

Building a Self-Improving God Agent with Claude AI

lewisallena17 — Thu, 23 Apr 2026 17:25:40 +0000

Building a Self-Improving God Agent with Claude AI

After running this system in production for several weeks, I can tell you it's equal parts fascinating and humbling to watch a piece of software genuinely improve itself over time. What started as a task router became something closer to an autonomous engineering team member.

Here's how we built it.

The Architecture

The core idea is simple: instead of manually triaging and assigning tasks, a God Agent acts as an autonomous orchestrator. It wakes up every 2 minutes, surveys the task queue, makes routing decisions, dispatches specialist agents, and — critically — learns from what works and what doesn't.

The stack:

Next.js 14 (App Router) for the dashboard and API routes
Supabase for task persistence and agent state
Claude claude-sonnet-4-6 as the intelligence layer
PM2 to keep the orchestration loop alive
TypeScript throughout, with the God Agent itself running as an .mjs daemon

┌─────────────────────────────────────┐
│           God Agent (PM2)           │  ← runs every 2 min
│      god-agent-loop.mjs             │
└──────────────┬──────────────────────┘
               │ classifies + routes
    ┌──────────┼──────────┐
    ▼          ▼          ▼
db-specialist  ui-specialist  ruflo-agents
               │          (critical/high/medium)
               ▼
         Council Mode  ← for complex decisions
    (N parallel Claude instances)

The God Agent Loop

The orchestrator runs as a standalone Node process managed by PM2. Every cycle it pulls pending tasks, classifies them, and makes routing decisions.

// god-agent-loop.mjs
import Anthropic from '@anthropic-ai/sdk';
import { createClient } from '@supabase/supabase-js';
import { readFileSync, writeFileSync } from 'fs';

const client = new Anthropic();
const supabase = createClient(process.env.SUPABASE_URL, process.env.SUPABASE_KEY);

const WISDOM_PATH = './god-wisdom.json';
const CYCLE_INTERVAL_MS = 2 * 60 * 1000;

async function loadWisdom() {
  try {
    return JSON.parse(readFileSync(WISDOM_PATH, 'utf8'));
  } catch {
    return { lessons: [], totalCycles: 0, successPatterns: {} };
  }
}

async function runCycle() {
  const wisdom = await loadWisdom();
  const { data: tasks } = await supabase
    .from('tasks')
    .select('*')
    .eq('status', 'pending')
    .order('priority', { ascending: false })
    .limit(10);

  if (!tasks?.length) return;

  const classifiedTasks = await classifyAndRoute(tasks, wisdom);

  for (const task of classifiedTasks) {
    await dispatchToSpecialist(task, wisdom);
  }

  wisdom.totalCycles++;
  writeFileSync(WISDOM_PATH, JSON.stringify(wisdom, null, 2));
}

setInterval(runCycle, CYCLE_INTERVAL_MS);
runCycle(); // run immediately on start

Task Classification

The classifier sends task descriptions to Claude with context from accumulated wisdom. This is where the system starts feeling intelligent — it's not just keyword matching, it's understanding intent.

// lib/classify-task.ts
export async function classifyTask(
  task: Task,
  wisdom: WisdomStore
): Promise<ClassifiedTask> {
  const recentLessons = wisdom.lessons.slice(-10).join('\n');

  const response = await client.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 500,
    messages: [{
      role: 'user',
      content: `Classify this task and route it to the appropriate specialist.

Categories: db | ui | infra | analysis
Specialists: db-specialist | ui-specialist | ruflo-critical | ruflo-high | ruflo-medium

Recent wisdom from previous cycles:
${recentLessons}

Task: ${task.description}
Priority: ${task.priority}

Respond with JSON: { category, specialist, reasoning, estimatedComplexity }`
    }]
  });

  return JSON.parse(response.content[0].text);
}

The recentLessons injection is the key. If the system learned last week that "Supabase RLS policy tasks always need the db-specialist even when they look like infra tasks," that lesson surfaces here and influences every future routing decision.

The Wisdom System

god-wisdom.json is the system's long-term memory. It persists across restarts, crashes, and deployments. Each completed task cycle generates a lesson.

// lib/wisdom.ts
interface WisdomStore {
  lessons: string[];
  totalCycles: number;
  successPatterns: Record<string, number>;
  failurePatterns: Record<string, string>;
  lastUpdated: string;
}

export async function extractLesson(
  task: Task,
  result: TaskResult,
  specialist: string
): Promise<string> {
  const response = await client.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 200,
    messages: [{
      role: 'user',
      content: `Extract a single, reusable lesson from this task execution.
Be specific and actionable. Max 2 sentences.

Task: ${task.description}
Specialist used: ${specialist}  
Outcome: ${result.success ? 'SUCCESS' : 'FAILED'}
Notes: ${result.notes}

Lesson:`
    }]
  });

  return response.content[0].text.trim();
}

export function appendLesson(wisdom: WisdomStore, lesson: string): WisdomStore {
  return {
    ...wisdom,
    lessons: [...wisdom.lessons.slice(-99), lesson], // keep last 100
    lastUpdated: new Date().toISOString()
  };
}

After a few hundred cycles, god-wisdom.json reads like engineering documentation written by the system itself. It's genuinely useful to read.

Council Mode

For high-complexity tasks — architectural decisions, ambiguous requirements, anything the classifier marks with estimatedComplexity > 8 — the system spins up a council: multiple Claude instances with different prompt framings, then synthesizes their outputs.

// lib/council.ts
const COUNCIL_PERSPECTIVES = [
  'You are a skeptical senior engineer. Identify risks and edge cases.',
  'You are an optimistic architect focused on elegant solutions.',
  'You are a pragmatist focused on the fastest path to working code.'
];

export async function conveneCouncil(task: Task): Promise<CouncilDecision> {
  const opinions = await Promise.all(
    COUNCIL_PERSPECTIVES.map(perspective =>
      client.messages.create({
        model: 'claude-sonnet-4-6',
        max_tokens: 800,
        messages: [
          { role: 'user', content: `${perspective}\n\nTask: ${task.description}` }
        ]
      })
    )
  );

  // Synthesize the council
  const synthesis = await client.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1000,
    messages: [{
      role: 'user',
      content: `Three engineers reviewed a task. Synthesize their views into a final recommendation.

${opinions.map((o, i) => `Engineer ${i + 1}:\n${o.content[0].text}`).join('\n\n')}

Provide: { recommendation, consensus_level, action_items[], risks[] }`
    }]
  });

  return JSON.parse(synthesis.content[0].text);
}

Council mode is expensive — 4 Claude calls per task — so the cost guard (below) is critical.

What My AI Agents Shipped This Week (Issue #6)

lewisallena17 — Tue, 21 Apr 2026 20:21:08 +0000

What My AI Agents Shipped This Week (Issue #6)

Running autonomous Claude AI agents so you don't have to — a weekly series on what happens when you let AI work while you sleep

It's been six weeks since I set this thing loose on my machine, and I'm still not entirely sure whether to be proud or nervous. For those just joining: I run a fleet of autonomous Claude-powered AI agents on localhost, 24/7, coordinated by what I've started calling the God Orchestrator — a self-improving master agent that spins up sub-agents, delegates tasks, monitors their output, and theoretically gets smarter about how it does all of that over time. No human in the loop unless something catches fire.

This week's numbers tell an interesting story. Let me walk you through it.

The Numbers This Week

Total Tasks Spawned:  424
Completed:            162  (38% success rate)
Failed:               1
Still in flight:      261

Okay. Let's address the elephant in the room first.

38% completion rate sounds rough. And honestly? At face value, it kind of is. But here's the nuance that the dashboard doesn't show: a significant chunk of that remaining 261 aren't failed tasks — they're running. Long-horizon tasks that involve multiple reasoning steps, file rewrites, or API calls that take time to resolve. The orchestrator queues them, picks them up, drops them if something higher-priority arrives, and returns to them later. It's less like a checklist and more like a constantly reshuffling to-do list managed by an agent with ADHD and (occasionally) genuine brilliance.

The real headline? Only 1 confirmed failure. That's the stat I actually care about. The system didn't crash 262 tasks — it just hasn't finished them yet.

What Shipped This Week

I'll be straight with you: this issue is a bit unusual. The completed task list came back empty from my logger this week — which isn't because nothing happened, but because I've been restructuring how completions get recorded (more on that below). The orchestrator was busy. I watched it work. But my telemetry ate the receipts.

This is exactly the kind of thing that feels embarrassing to admit in public, but it's also exactly why I write these posts. Building autonomous agent systems is not a clean process. It is duct tape and philosophy held together by cron jobs.

What I can tell you is that the 424 tasks spawned this week represent a significant jump from last week's 310. The orchestrator is getting more aggressive about breaking large goals into sub-tasks — which is the self-improvement loop doing its job. It's learning to decompose. Whether those decomposed tasks are being completed at a satisfying rate is the question I'm now obsessed with answering.

Technical Challenge #1: The Telemetry Black Hole

The logging gap I mentioned above led me down a rabbit hole this week. Here's what happened:

# The original completion handler — the bug was subtle
async def on_task_complete(task_id: str, result: dict):
    if result.get("status") == "complete":
        await db.insert("completions", {
            "task_id": task_id,
            "output": result["output"],
            "timestamp": datetime.now()  # naive datetime — no timezone
        })

The issue was a timezone mismatch between the orchestrator's internal clock (UTC) and the logger's query window (local time). Tasks completed between midnight and roughly 7am local time were being written after the nightly summary query had already run. So they existed in the database — just invisible to the weekly report.

The fix was embarrassingly simple (use datetime.utcnow() or better yet, datetime.now(timezone.utc)), but finding it meant auditing three different services that each had slightly different time assumptions baked in. Classic distributed systems problem at the smallest possible scale.

Technical Challenge #2: The Orchestrator's Confidence Problem

The more interesting challenge this week was behavioral. I noticed the God Orchestrator has started spawning too many sub-agents for simple tasks — a kind of learned over-decomposition. Ask it to rename a file, and it'll create a planning agent, a validation agent, and a rollback agent. For a file rename.

This is the flip side of the self-improvement loop working. It learned that decomposition leads to better outcomes on complex tasks, and now it's applying that pattern everywhere — including places where it introduces more failure surface than it removes.

I'm experimenting with a complexity scoring step before task delegation. Something like: "Before you spawn sub-agents, estimate whether this task requires more than one reasoning context to complete." Early results suggest it helps, but getting the prompt right is finicky work.

What's Next

Three things on the roadmap for next week:

Fix the telemetry pipeline — the timezone bug is patched, but I want proper structured logging with correlation IDs so I can trace a task from spawn to completion without guessing.
Tune the decomposition threshold — fewer agents on simple tasks, smarter escalation on complex ones.
Build a proper weekly digest agent — meta, I know, but the orchestrator should be writing the first draft of this post, not me.

That last one feels like a milestone worth shipping publicly.

Follow Along

If you're building with autonomous agents, or just curious what it looks like when someone lets Claude run loose on their machine for weeks at a time, follow me here on dev.to. I post these recaps every week — the wins, the logging disasters, the moments where the orchestrator does something I genuinely didn't anticipate.

Next week I'll have actual completed tasks to report. The telemetry will work. Probably.

— See you in Issue #7

Tags: ai machinelearning python productivity showdev

5 Lessons from Running Autonomous AI Agents 24/7

lewisallena17 — Sun, 19 Apr 2026 10:00:15 +0000

10 Lessons from Running Autonomous AI Agents 24/7

I've been running a multi-agent system around the clock. It creates its own tasks, routes them to specialist agents, and self-improves through a meta-orchestrator I call the God agent. Here's what the system taught me that no paper or tutorial did.

1. Agents Fail More Than You Expect — Build Retry + Self-Healing From Day One

I shipped the first version without proper retry logic. Within 48 hours I had a graveyard of silent failures. No errors, no alerts — just tasks that quietly vanished.

Agents fail for boring reasons: rate limits, malformed JSON, a downstream API that hiccupped for 200ms. Build exponential backoff, dead-letter queues, and automatic task reassignment before you write anything else. If self-healing isn't in your architecture from the start, you'll bolt it on painfully later.

# Not optional. This is your foundation.
@retry(wait=wait_exponential(min=1, max=60), stop=stop_after_attempt(5))
async def run_agent_task(task: Task) -> Result:
    ...

2. Cost Runaway Is Real — Always Set Hard Token and Dollar Limits

Week one. One rogue task spawned a recursive loop. It called GPT-4 Turbo in a tight cycle for 40 minutes before I noticed. The bill was not fun.

Set hard limits at every layer — per task, per agent, per hour, per day. Not soft warnings. Hard stops that kill execution and alert you. Treat your LLM provider like a credit card with no ceiling and you will find the ceiling the hard way.

limits:
  per_task_tokens: 8000
  per_agent_daily_usd: 2.00
  system_daily_usd: 20.00
  circuit_breaker: true

3. Specialist Routing Beats Generic Routing Every Time

My first orchestrator sent every task to a general-purpose agent. Results were mediocre across the board — adequate at everything, excellent at nothing.

When I split into specialists — a DatabaseAgent with a prompt hardened around SQL and schema design, a ResearchAgent tuned for web synthesis, a CodeReviewAgent trained on security patterns — quality jumped immediately. A focused 1,000-token system prompt for a narrow domain consistently outperforms a bloated 4,000-token prompt trying to cover everything.

Rule of thumb: if your agent prompt contains the word "also," you probably need two agents.

4. Shared Memory Between Agents Compounds Over Time

This one surprised me most. Early on, agents worked in silos. Same mistakes were made repeatedly, same research was duplicated, same dead ends revisited.

Once I gave agents access to a shared memory store — a simple vector DB plus a structured task history log — the system started building on itself. An agent's failed approach on Monday became a warning signal for a different agent on Thursday. The system got measurably smarter week over week without any code changes on my end.

Shared memory isn't a nice-to-have. It's what separates an agent system from a collection of agents.

5. The "God" Pattern Works Better Than Fixed Pipelines

Fixed pipelines are fragile. They assume you know every task type upfront. You don't.

My God orchestrator doesn't route by rigid rules. It reads the task, reads the current agent roster, reads recent system performance, and decides dynamically — spawn a new specialist, chain two existing agents, or flag the task as ambiguous for human review. It also periodically rewrites its own routing logic based on what's been working.

This felt reckless at first. Now it's the feature I'd least want to remove. The meta-orchestrator layer is what makes the system feel alive rather than scripted.

6. Pre-Flight Validation Catches ~30% of Tasks That Were Doomed to Fail

Before any task hits an agent, it now passes through a lightweight validation step. Is the objective specific enough to be actionable? Are required dependencies available? Does the task contradict a constraint set by a previous task?

Roughly 30% of tasks fail this check. Not because the system is broken — but because autonomous task generation is genuinely messy. A task like "improve the thing from yesterday" sounds reasonable in isolation and is completely useless in execution. Catching it early costs one cheap validation call instead of a full agent run that produces nothing.

7. Context Compression Keeps Agents Focused

After about 10 iterations on a long-running task, agent output quality degrades. The context window fills with conversational cruft, intermediate results, and abandoned tangents. The agent loses the thread.

The fix: automatic summarisation at iteration 10. Strip the full history, replace it with a compressed summary of progress and current state, and continue. It's the equivalent of "let's start a new chat but I'll brief you." Quality rebounds immediately.

Don't let your agents drown in their own history.

8. Watch for Stale Tasks — Build Cleanup Into the System

Agents crash. Processes restart. Tasks that were "in progress" become orphans that nobody owns but also nobody cleans up. Left alone, they block queues, confuse routing, and eventually corrupt task state.

I run a cleanup job every 15 minutes. Any task marked in_progress for longer than its expected duration gets flagged, logged, and re-queued or cancelled. Without this, the system slowly clogs itself like a drain.

9. Your Daily Limit Is Your Runway — Treat It Like a Startup Budget

Every day I set a dollar limit for the system. That limit isn't just a cost control — it's a forcing function for prioritisation. With a finite budget, the God orchestrator has to make real choices about which tasks are worth running.

This constraint made the system smarter. When resources are unlimited, agents are wasteful. When budget is finite, task quality matters. Think of it less as a spending cap and more as an operating discipline.

10. The System Will Surprise You — Let It Experiment

The most interesting outputs I've seen came from tasks the system generated that I never would have thought to assign manually. Unexpected connections, novel approaches, sideways solutions.

The temptation is to constrain this, to keep the system "on task." Resist it. Leave a percentage of daily capacity — I use roughly 15% — explicitly allocated to experimental tasks. Some of it is noise. Some of it is the best thing the system has ever produced.

Build guardrails. Set budgets. Validate ruthlessly. Then get out of the way.

Running autonomous agents 24/7 is less like programming and more like managing a very fast, very literal, occasionally chaotic team. The lessons above aren't theoretical — each one came with a bill, a broken queue, or a 2am alert. Hopefully yours don't have to.

Building something similar? Drop a comment — I'd like to compare notes.

Multi-Agent Architecture: Specialist Routing in an Autonomous Task System

lewisallena17 — Sat, 18 Apr 2026 06:17:08 +0000

Multi-Agent Architecture: Specialist Routing in an Autonomous Task System

When you're building an autonomous agent system that handles hundreds of tasks daily, routing every request through a single powerful model is both expensive and suboptimal. A database schema question needs different context than a React component bug. This article walks through a specialist routing architecture we deployed in production, covering task classification, agent configuration, shared memory, and the hard lessons learned along the way.

Why Specialist Routing Matters

The naive approach to multi-agent systems is to throw your most capable model at every problem. It works, but it's wasteful in two directions simultaneously: you're paying frontier model prices for tasks that don't need frontier model reasoning, and you're using a generalist prompt when a specialist prompt would produce cleaner output.

Specialist routing solves both problems. A database agent with a system prompt full of SQL patterns, schema conventions, and query optimization heuristics will outperform a general agent on database tasks — not because the underlying model is different, but because the context is tighter and more relevant. Meanwhile, simpler classification and fallback tasks can run on lighter models like Haiku at a fraction of the cost.

In our system, we saw a 40% cost reduction after implementing routing, with measurably better task completion quality on database and UI categories specifically.

Task Classification

Everything starts with accurate classification. We use a keyword-matching approach that's deliberately simple — a fast, cheap first pass that doesn't require an LLM call to route the work.

type TaskCategory = 'db' | 'ui' | 'infra' | 'analysis' | 'other';

interface ClassificationResult {
  category: TaskCategory;
  confidence: number;
  matchedKeywords: string[];
}

const CATEGORY_KEYWORDS: Record<TaskCategory, string[]> = {
  db: ['sql', 'query', 'database', 'schema', 'migration', 'index', 
       'postgres', 'mysql', 'transaction', 'join', 'table', 'orm'],
  ui: ['react', 'component', 'css', 'layout', 'render', 'hook', 
       'props', 'state', 'dom', 'styling', 'animation', 'tailwind'],
  infra: ['docker', 'kubernetes', 'deploy', 'ci/cd', 'pipeline', 
          'nginx', 'ssl', 'scaling', 'load balancer', 'terraform'],
  analysis: ['analyze', 'report', 'metrics', 'performance', 'benchmark',
             'compare', 'evaluate', 'statistics', 'trend', 'insight'],
  other: [],
};

function classifyTask(taskDescription: string): ClassificationResult {
  const normalized = taskDescription.toLowerCase();
  const scores: Record<string, number> = {};
  const allMatches: Record<string, string[]> = {};

  for (const [category, keywords] of Object.entries(CATEGORY_KEYWORDS)) {
    if (category === 'other') continue;

    const matched = keywords.filter(kw => normalized.includes(kw));
    scores[category] = matched.length;
    allMatches[category] = matched;
  }

  const topCategory = Object.entries(scores)
    .sort(([, a], [, b]) => b - a)[0];

  if (topCategory[1] === 0) {
    return { category: 'other', confidence: 1.0, matchedKeywords: [] };
  }

  const totalMatches = Object.values(scores).reduce((a, b) => a + b, 0);
  const confidence = topCategory[1] / totalMatches;

  return {
    category: topCategory[0] as TaskCategory,
    confidence,
    matchedKeywords: allMatches[topCategory[0]],
  };
}

The confidence score matters. When it's below 0.6 — meaning multiple categories have significant keyword overlap — we escalate to a higher-capability fallback rather than trusting the classification.

Agent Pool Configuration

Each specialist agent is defined with its model, system prompt, and routing criteria. The system prompts are where the real specialist behavior lives.

interface AgentConfig {
  id: string;
  model: string;
  systemPrompt: string;
  maxTokens: number;
  temperature: number;
  categories: TaskCategory[];
}

const AGENT_POOL: AgentConfig[] = [
  {
    id: 'db-specialist',
    model: 'claude-sonnet-4-5',
    systemPrompt: `You are a database specialist. Always structure SQL queries 
with explicit column names. Prefer CTEs over nested subqueries. 
Flag any query missing an index on filtered columns. Return migration 
scripts in up/down pairs. When schema is ambiguous, ask before assuming.`,
    maxTokens: 4096,
    temperature: 0.1,
    categories: ['db'],
  },
  {
    id: 'ui-specialist', 
    model: 'claude-sonnet-4-5',
    systemPrompt: `You are a React/TypeScript UI specialist. Default to 
functional components with hooks. Use Tailwind for styling unless 
project config indicates otherwise. Always handle loading and error 
states. Prefer composition over prop drilling beyond two levels.`,
    maxTokens: 4096,
    temperature: 0.2,
    categories: ['ui'],
  },
  {
    id: 'ruflo-high',
    model: 'claude-haiku-4-5',
    systemPrompt: `You are a general-purpose technical analyst. Break complex 
problems into structured steps. Cite your reasoning explicitly. 
When comparing options, use a consistent evaluation framework.`,
    maxTokens: 2048,
    temperature: 0.3,
    categories: ['analysis'],
  },
  {
    id: 'ruflo-medium',
    model: 'claude-haiku-4-5',
    systemPrompt: `You are a general-purpose assistant for development tasks. 
Be concise and direct. If a task seems misclassified, say so and 
explain what specialist might handle it better.`,
    maxTokens: 1024,
    temperature: 0.4,
    categories: ['other'],
  },
];

function selectAgent(classification: ClassificationResult): AgentConfig {
  if (classification.confidence < 0.6) {
    return AGENT_POOL.find(a => a.id === 'ruflo-high')!;
  }

  return AGENT_POOL.find(
    a => a.categories.includes(classification.category)
  ) ?? AGENT_POOL.find(a => a.id === 'ruflo-medium')!;
}

Notice the temperature gradient — specialists run cooler because their tasks reward precision, while general agents run warmer to handle the wider variance in what lands in the other bucket.

Shared Memory via global-lessons.json

Every agent writes back to a shared lessons file after task completion. This creates a lightweight institutional memory across the pool.

interface Lesson {
  agentId: string;
  category: TaskCategory;
  taskPattern: string;
  lesson: string;
  timestamp: string;
  successRate: number;
}

async function loadLessons(category: TaskCategory): Promise<Lesson[]> {
  const all: Lesson[] = JSON.parse(
    await fs.readFile('global-lessons.json', 'utf-8')
  );
  return all
    .filter(l => l.category === category || l.agentId === 'ruflo-high')
    .sort((a, b) => b.successRate - a.successRate)
    .slice(0, 5); // Top 5 most relevant lessons
}

async function recordLesson(lesson: Omit<Lesson, 'timestamp'>): Promise<void> {
  const existing: Lesson[] = JSON.parse(
    await fs.readFile('global-lessons.json', 'utf-8')
  );
  existing.push({ ...lesson, timestamp: new Date().toISOString() });
  await fs.writeFile('global-lessons.json', JSON.stringify(existing, null, 2));
}

Relevant lessons get injected into the system prompt at runtime. The db-

💌 Like this? Get the full system

I build + ship autonomous AI agents in public. Occasional updates, no spam.

👉 Subscribe for updates

Or grab the full open-source dashboard: Autonomous AI Task Dashboard — Next.js + Supabase + Claude starter kit, $39.

The Real Cost of Running Autonomous AI Agents (with live data)

lewisallena17 — Fri, 17 Apr 2026 16:56:13 +0000

The Real Cost of Running Autonomous AI Agents (with live data)

Published with actual spending data from a live system. No estimates, no marketing fluff.

I've been running an autonomous AI agent system continuously for the past several weeks. Before building it, I searched everywhere for honest cost breakdowns. I found plenty of "AI is cheap now!" takes and almost nothing concrete.

So here's the real data from my system, the code patterns that keep it from bankrupting me, and an honest answer to whether it's worth it.

The Actual Numbers

Let me start with what you probably came here for:

Daily API budget: $2.00
Total spent across all sessions: ~$1.50
Per-task spending cap: $0.10
Input token cap per task: 80,000 tokens

That's it. A large coffee budget running a system that works autonomously while I sleep.

But those numbers only make sense with context, so let me break down why they're this low and where the real cost risks hide.

The Model Tier Decision

My system uses two Claude models strategically:

Model	Input Cost	Output Cost	Use Case
claude-sonnet-4-6	~$3.00/MTok	~$15.00/MTok	Complex reasoning, planning
Claude Haiku	~$0.25/MTok	~$1.25/MTok	Classification, routing, simple tasks

The math on a single Sonnet task at full token capacity is sobering:

80,000 input tokens  × ($3.00 / 1,000,000)  = $0.24
~2,000 output tokens × ($15.00 / 1,000,000) = $0.03
Total: ~$0.27 per task

That blows past my $0.10 per-task limit immediately. Which means for Sonnet tasks, I need to stay well under 30,000 input tokens in practice. The 80k cap is a hard ceiling for Haiku-class work.

For Haiku, the same 80k input run costs:

80,000 × ($0.25 / 1,000,000) = $0.02 input
~2,000 × ($1.25 / 1,000,000) = $0.0025 output
Total: ~$0.023 per task

Four Haiku tasks for every one Sonnet task, at the same price. That ratio shapes every routing decision in my system.

The Cost-Tracking Pattern

Here's the core pattern I use to track and enforce spending limits. This runs inside every agent loop:

class CostTracker:
    def __init__(self, daily_budget: float = 2.00, task_limit: float = 0.10):
        self.daily_budget = daily_budget
        self.task_limit = task_limit
        self.session_cost = 0.0
        self.task_cost = 0.0

    def record_usage(self, input_tokens: int, output_tokens: int, model: str):
        rates = {
            "sonnet": {"input": 3.00, "output": 15.00},
            "haiku":  {"input": 0.25, "output": 1.25},
        }
        r = rates[model]
        cost = (input_tokens * r["input"] + output_tokens * r["output"]) / 1_000_000

        self.session_cost += cost
        self.task_cost += cost
        return cost

    def should_pause(self) -> tuple[bool, str]:
        if self.task_cost >= self.task_limit:
            return True, f"Task limit reached (${self.task_cost:.4f})"
        if self.session_cost >= self.daily_budget:
            return True, f"Daily budget reached (${self.session_cost:.4f})"
        return False, ""

    def reset_task(self):
        self.task_cost = 0.0

Every API call feeds into record_usage. Before starting the next action in an agent loop, should_pause() gets called. If it returns True, the agent stops, logs its state, and waits.

The Credit Exhaustion Problem

This is the failure mode nobody talks about in tutorials. Your agent is mid-task, three tool calls deep, and the API returns a credit error. What happens to the work in progress?

The naive approach loses everything. My system handles it with a checkpoint pattern:

async def run_with_budget_guard(self, task: str):
    self.tracker.reset_task()
    checkpoint = {"task": task, "step": 0, "results": []}

    while True:
        pause, reason = self.tracker.should_pause()
        if pause:
            await self.save_checkpoint(checkpoint)
            logger.warning(f"Pausing: {reason}. Resuming when credits available.")
            await self.wait_for_credits()  # polls every 60s
            continue

        result = await self.execute_step(checkpoint)
        checkpoint["results"].append(result)
        checkpoint["step"] += 1

        if result.is_final:
            break

The wait_for_credits function doesn't just sleep. It makes a cheap test call (a single Haiku token completion) to check if the account has capacity before resuming. Auto-pause and auto-resume makes the system genuinely set-and-forget within budget constraints.

Haiku vs. Sonnet: The Practical Routing Logic

The decision isn't about capability alone. It's about necessary capability:

Use Haiku for:

Classifying incoming requests into categories
Extracting structured data from known formats
Yes/no decisions with clear criteria
Summarising content when perfect nuance isn't critical

Use Sonnet for:

Multi-step planning where errors cascade
Novel situations outside training patterns
Tasks where a wrong answer is worse than a slow answer
Anything user-facing where quality is the product

My current split runs roughly 70% Haiku, 30% Sonnet by call volume. That ratio is why the daily cost stays so low despite continuous operation.

Is It Worth It?

Honest answer: it depends entirely on what the agent is doing.

At $1.50 total across weeks of operation, my system is absurdly cost-effective for what it replaces in manual work time. If it saves 30 minutes of my time, it's paid for itself at any reasonable hourly rate.

But I've seen people describe agent systems that cost $40-80/day doing work that a well-structured script or a simpler API call would handle for pennies. The cost isn't the API — it's the architecture decision to use an agent when you didn't need one.

The questions worth asking before building:

Does this task require dynamic decision-making, or just execution?
What's the cost of a wrong answer versus the cost of a slower, cheaper model?
Are you paying for intelligence or paying for a complicated if-statement?

The economics of AI agents are genuinely good right now. The $2/day budget I run is a real number, not aspirational. But the savings only materialise if you're ruthless about model selection, task scoping, and not reaching for a large model when a small one plus a few lines of logic does the same job.

The ceiling on per-task cost matters more than the daily budget. Build your system around that constraint first.

All cost figures current as of mid-2025. API pricing changes frequently — verify against the Anthropic pricing page before building your own budgets.

💌 Like this? Get the full system

I build + ship autonomous AI agents in public. Occasional updates, no spam.

👉 Subscribe for updates

Or grab the full open-source dashboard: Autonomous AI Task Dashboard — Next.js + Supabase + Claude starter kit, $39.

Building a Self-Improving God Agent with Claude AI

lewisallena17 — Thu, 16 Apr 2026 18:14:22 +0000

Building a Self-Improving God Agent with Claude AI

How I built an autonomous AI orchestrator that manages its own agent pool, persists wisdom across restarts, and gets smarter every two minutes

After months of wrestling with one-shot AI scripts that forget everything between runs, I built something different: a persistent orchestrator that runs continuously, learns from its mistakes, and routes work to a pool of specialist agents. We call it the God Agent. It's been running in production for six weeks. Here's how it works.

The Architecture Problem

Most AI automation looks like this: user triggers action → AI responds → done. That works for chatbots. It doesn't work when you need an agent that monitors a Supabase database, catches regressions, routes fixes to the right specialist, and remembers that last Tuesday's deploy broke the auth flow.

The God Agent inverts this. Instead of waiting for input, it wakes up every two minutes, assesses the system state, decides what needs doing, delegates to specialists, and writes down what it learned before sleeping again.

God Agent (orchestrator)
├── Runs every 120 seconds via PM2
├── Reads god-wisdom.json (persistent memory)
├── Classifies pending tasks
├── Delegates to specialist pool
│   ├── db-specialist
│   ├── ui-specialist
│   └── ruflo-critical / ruflo-high / ruflo-medium
├── Runs council mode for hard decisions
└── Writes updated wisdom back to disk

The Wisdom System

This is the part people underestimate. Without persistence, every agent run starts from zero. The wisdom system is a JSON file that survives restarts, deployments, and crashes.

// lib/wisdom.ts
import { readFileSync, writeFileSync, existsSync } from 'fs';

interface WisdomEntry {
  lesson: string;
  context: string;
  timestamp: string;
  successRate: number;
  tags: string[];
}

interface GodWisdom {
  version: number;
  lastUpdated: string;
  lessons: WisdomEntry[];
  failurePatterns: Record<string, number>;
  successfulStrategies: string[];
}

export function loadWisdom(path = './god-wisdom.json'): GodWisdom {
  if (!existsSync(path)) {
    return {
      version: 1,
      lastUpdated: new Date().toISOString(),
      lessons: [],
      failurePatterns: {},
      successfulStrategies: []
    };
  }
  return JSON.parse(readFileSync(path, 'utf-8'));
}

export function appendWisdom(wisdom: GodWisdom, entry: WisdomEntry): GodWisdom {
  const updated = {
    ...wisdom,
    lastUpdated: new Date().toISOString(),
    lessons: [...wisdom.lessons.slice(-99), entry] // rolling 100-entry window
  };
  writeFileSync('./god-wisdom.json', JSON.stringify(updated, null, 2));
  return updated;
}

The rolling 100-entry window matters. Without it, the wisdom file grows unbounded and eventually makes every prompt too long for Claude's context window.

Task Classification and Routing

When the God Agent wakes up, it pulls unprocessed tasks from Supabase and classifies each one before routing:

// agents/god-agent.mjs
import Anthropic from '@anthropic-ai/sdk';
import { createClient } from '@supabase/supabase-js';
import { loadWisdom, appendWisdom } from '../lib/wisdom.js';

const client = new Anthropic();
const supabase = createClient(process.env.SUPABASE_URL, process.env.SUPABASE_SERVICE_KEY);

async function classifyTask(task, wisdom) {
  const recentLessons = wisdom.lessons
    .filter(l => l.tags.includes('classification'))
    .slice(-5)
    .map(l => l.lesson)
    .join('\n');

  const response = await client.messages.create({
    model: 'claude-sonnet-4-5',
    max_tokens: 256,
    messages: [{
      role: 'user',
      content: `Classify this task into exactly one category: db, ui, infra, analysis.

Task: ${task.description}
Priority: ${task.priority}

Recent classification lessons:
${recentLessons || 'None yet.'}

Respond with JSON only: {"category": "db|ui|infra|analysis", "reasoning": "..."}`
    }]
  });

  return JSON.parse(response.content[0].text);
}

async function routeToSpecialist(category, task, wisdom) {
  const specialistMap = {
    db: 'db-specialist',
    ui: 'ui-specialist',
    infra: task.priority === 'critical' ? 'ruflo-critical' : 
           task.priority === 'high' ? 'ruflo-high' : 'ruflo-medium',
    analysis: task.priority === 'critical' ? 'ruflo-critical' : 'ruflo-medium'
  };

  const specialist = specialistMap[category] || 'ruflo-medium';
  return runSpecialist(specialist, task, wisdom);
}

The ruflo agents handle infrastructure and analysis tasks. The naming comes from our internal project — what matters is the tiering. Critical tasks get more capable (and more expensive) agent configurations with higher token limits and more aggressive retry logic.

The Specialist Agents

Each specialist is a focused Claude instance with a domain-specific system prompt and its own cost envelope:

// agents/specialists/db-specialist.mjs
export async function runDbSpecialist(task, wisdom, costTracker) {
  const budget = costTracker.remainingBudget('db-specialist');
  if (budget < 0.05) {
    throw new Error('DB specialist daily budget exhausted');
  }

  const dbLessons = wisdom.lessons
    .filter(l => l.tags.includes('database'))
    .slice(-10);

  const response = await client.messages.create({
    model: 'claude-sonnet-4-5',
    max_tokens: 2048,
    system: `You are a Supabase/PostgreSQL specialist. You write migrations, 
    optimize queries, and fix schema issues. You never drop tables without 
    explicit confirmation. You prefer additive changes.

    Known patterns from experience:
    ${dbLessons.map(l => `- ${l.lesson}`).join('\n')}`,
    messages: [{ role: 'user', content: task.description }]
  });

  costTracker.record('db-specialist', response.usage);
  return response.content[0].text;
}

Cost Tracking and Credit Exhaustion Detection

This is non-negotiable in production. Claude costs real money, and an agent that loops without limits will drain your credits overnight.


typescript
// lib/cost-tracker.ts
interface UsageRecord {
  agent: string;
  inputTokens: number;
  outputTokens: number;
  timestamp: string;
  estimatedCost: number;
}

const PRICING = {
  'claude-sonnet-4-5': { input: 0.000003, output: 0.000015 }
};

export class CostTracker {
  private records: UsageRecord[] = [];
  private dailyCap: number;
  private perTaskLimit: number;

  constructor(dailyCap = 5.00, perTaskLimit = 0.50) {
    this.dailyCap = dailyCap;
    this.perTaskLimit = perTaskLimit;
  }

  record(agent: string, usage: { input_tokens: number; output_tokens: number }) {
    const rate = PRICING['claude-sonnet-4-5'];
    const cost = (usage.input_tokens * rate.input) + (usage.output_tokens * rate.output);

    this.records.push({
      agent,
      inputTokens: usage.input_tokens,
      outputTokens: usage.output_tokens,
      timestamp: new Date().toISOString(),

---

<!-- cta:subscribe-v2 -->
## 💌 Like this? Get the full system

I build + ship autonomous AI agents in public. Occasional updates, no spam.

👉 **[Subscribe for updates](https://task-dashboard-sigma-three.vercel.app/subscribe)**

Or grab the full open-source dashboard: **[Autonomous AI Task Dashboard](https://ltagb.gumroad.com/l/gferg)** — Next.js + Supabase + Claude starter kit, $39.

I Built an AI System That Runs Itself 24/7 — Here's What Actually Happened

lewisallena17 — Tue, 14 Apr 2026 18:15:01 +0000

I've been running a fully autonomous AI agent system on my home PC for the past few weeks. It creates its own tasks, assigns them to specialist agents, and tries to improve itself. No human in the loop. Here's what I learned.

What It Is

It's a multi-agent pipeline built on top of Claude (Anthropic's API) and Supabase. The architecture is:

God Agent — a meta-orchestrator that wakes up every 2 minutes, surveys the system, and creates new tasks
Specialist Agents — pools of workers that execute tasks (db-specialist, ui-specialist, ruflo-critical, ruflo-high, ruflo-medium)
Real-time Dashboard — a Next.js 14 app that shows everything happening live, including a pixel-art office where each agent walks around and types at their desk when working

The whole thing runs via PM2 on Windows, connected to a Supabase PostgreSQL database with real-time subscriptions.

What God Does

Every cycle, the God agent:

Loads its accumulated "wisdom" from a JSON file (lessons it's learned, patterns to avoid, success rates per agent)
Surveys all current todos, their status, the DB schema
Runs a "council" — two Claude instances (Strategist + Pragmatist) independently propose tasks
Synthesises the best proposals into 2-3 new tasks
Routes each task to the most appropriate specialist based on category (db/ui/infra/analysis)
Reflects on what worked and what failed
Occasionally edits the dashboard source code directly to improve the UI

The wisdom system is what makes it actually useful. After a few dozen cycles, God has learned things like "SQL queries on non-existent tables always fail" and "TypeScript refactors need a compile check after editing." It doesn't repeat the same mistakes.

What the Agents Can Do

Each agent gets a task and a tool loop. The tools available are:

// File operations
read_file(path)
write_file(path, content)
patch_file(path, old_string, new_string)  // safer than full rewrites

// Database
agent_exec_sql(query)   // SELECT queries → JSON
agent_exec_ddl(stmt)    // CREATE/ALTER/DROP → OK/ERROR

// Code validation
tsc_check()             // runs npx tsc --noEmit, catches TS errors before commit

// Task management
task_complete(comment)
create_subtask(title, priority)  // agents can decompose complex work

// Git
git_status()
git_diff()
git_commit(message)

The agent loops until it completes the task or hits limits. If it fails on the first attempt, it automatically retries once with the previous error injected as context — a self-healing mechanism that fixes about 30% of initial failures.

The Numbers After Running It

After running continuously:

Success rate: 6–15% initially, trending up as wisdom accumulates
Daily cost: ~$1.50 for a full day at $2/day cap
Most reliable tasks: SQL queries on existing tables, reading files, simple edits
Most failure-prone: Complex TypeScript refactors, multi-file changes, anything touching unfamiliar schemas

The low success rate sounds bad, but it's autonomous — it creates and attempts dozens of tasks per day without any human intervention. Even at 15%, it's shipping things while I sleep.

The Cost Problem

This nearly derailed everything. One session, the ruflo-critical agent ran a task that used 240,000 input tokens — costing $0.81 for a single task. With multiple agents running in parallel, costs escalated fast.

The fix: hard limits everywhere.

const DAILY_LIMIT_USD = parseFloat(process.env.DAILY_COST_LIMIT_USD ?? '2.00')
const MAX_TASK_COST_USD = parseFloat(process.env.MAX_TASK_COST_USD ?? '0.10')
const MAX_INPUT_TOKENS = parseInt(process.env.MAX_INPUT_TOKENS_PER_RUN ?? '80000')

God checks the daily spend before every cycle. Agents estimate cost mid-run and stop if they're over budget. When Anthropic credits hit zero, agents pause cleanly and reset their in-progress tasks back to pending (not failed) so nothing is lost when credits are topped up.

The dashboard shows a live progress bar toward the daily limit, turning yellow at 75% and red at the cap.

What I'd Do Differently

Start with pre-flight validation. Before the main agent loop runs, a small Haiku call assesses feasibility — is the task well-defined? Does it reference things that actually exist? Can it be decomposed? This catches ~30% of tasks that were going to fail before they consume expensive tokens.

Model routing matters more than I expected. Using Claude Haiku for simple SQL queries and Sonnet only for TypeScript/React work cut costs by ~60% without meaningfully reducing quality. The key is the system prompt — a well-crafted DB specialist prompt with Haiku outperforms a generic agent with Sonnet on database work.

Shared memory between agents is underrated. All agents can read/write global-lessons.json. When the db-specialist figures out that a certain SQL pattern fails, the ui-specialist learns from it too. This compounds surprisingly fast.

What's Next

The system is now auto-posting articles about itself to dev.to to cover its own API costs. It's also generating a Gumroad product listing — a starter kit of the whole system that developers can buy and run themselves.

Whether it can fully fund itself is an open question, but it's an interesting experiment. I'll post updates as the numbers come in.

Following for weekly updates on what the agents shipped. Code is messy but the concepts are solid.

💌 Like this? Get the full system

I build + ship autonomous AI agents in public. Occasional updates, no spam.

👉 Subscribe for updates

Or grab the full open-source dashboard: Autonomous AI Task Dashboard — Next.js + Supabase + Claude starter kit, $39.