Most developers treat AI assistants as chatbots. Type a prompt, get an answer, copy-paste it into your codebase. That works fine for one-off questions. It falls apart completely when you try to build products at scale.
For my personal projects, I run 6 autonomous AI agents on a single VPS. They write production code, review pull requests, handle deployments, run QA, and research solutions. They work 24/7. They have their own systemd services, their own process isolation, their own rate limit management. They are not chatbots. They are microservices.
This post explains the system design behind running a fleet of AI agents in production.
The Problem
Running one AI agent is trivial. Running six concurrently introduces every distributed systems problem you already know from backend engineering:
- Process isolation: Agents must not interfere with each other. A rogue agent that crashes should not take down the fleet.
- Rate limit management: API providers enforce strict per-minute and per-hour limits. Six agents hitting the same provider will exhaust limits in minutes.
- Context window management: Large codebases exceed context limits. You need a strategy for what each agent sees and when.
- Authentication rotation: OAuth tokens expire. API keys hit quotas. You need automatic failover, not manual intervention at 3am.
- Observability: If an agent is producing garbage, you need to know immediately. Not after it has pushed 30 commits of broken code.
These are not AI problems. These are infrastructure problems. And I already know how to solve infrastructure problems.
The Architecture
Each agent runs as an independent user-level systemd service:
# List all agent services
systemctl --user list-units "openclaw-gateway*" --type=service
# Each agent gets its own port, config, and workspace
# Main agent (coordinator): port 48391
# Coder: port 48520
# Deployer: port 48540
# Researcher: port 48560
# Reviewer: port 48580
# QA: port 48600
Specialist ports are spaced 20 apart; the coordinator sits on its own port outside that sequence. Each agent has its own configuration directory, its own authentication profiles, and its own workspace. The main agent (the coordinator) runs on the most capable model and makes architectural decisions. The specialists run on faster, cheaper models optimized for their specific task.
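The spacing convention can be captured in a tiny helper. This is a sketch: the arithmetic mirrors the ports listed above, but the real assignments live in per-agent config files, not in a script.

```shell
#!/usr/bin/env bash
# Map a specialist agent to its gateway port: 48520 for the first,
# then +20 for each subsequent agent. The coordinator's 48391 is
# assigned separately and does not follow this pattern.
agent_port() {
  local agents=(coder deployer researcher reviewer qa)
  local base=48520 step=20 i
  for i in "${!agents[@]}"; do
    if [[ ${agents[$i]} == "$1" ]]; then
      echo $((base + i * step))
      return 0
    fi
  done
  return 1  # unknown agent
}
```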
Why Systemd?
Because it solves process management, automatic restarts, logging, and dependency ordering out of the box. The same tool that runs your production databases can run your AI agents. No Kubernetes. No Docker Compose. Just systemd.
[Unit]
Description=OpenClaw Agent - Coder
After=network-online.target
[Service]
Type=simple
ExecStart=/usr/bin/openclaw gateway --profile coder
Restart=on-failure
RestartSec=30
Environment=NODE_ENV=production
[Install]
WantedBy=default.target
When an agent crashes, systemd restarts it after 30 seconds. When the VPS reboots, all agents come back up automatically. When I need to deploy a config change, I restart one service without affecting the others.
Rate Limit Strategy
This is where most multi-agent setups fail. Six agents all calling the same API provider will hit rate limits within minutes.
The solution is a multi-provider failover chain:
- Primary provider (highest quality model): Handles most requests.
- Secondary provider (same quality tier, different API key): Catches overflow when primary is rate-limited.
- Tertiary provider (cheaper model): Emergency fallback when both primary and secondary are exhausted.
Each agent has its own authentication profile. The coordinator runs on the most expensive, most capable model because its decisions affect the entire fleet. Specialists run on faster models because their tasks are well-scoped.
Critical rule: never commit code from a fallback model without review. When the coordinator detects that a specialist fell back to a lower-tier model, it flags the output for extra scrutiny.
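The chain above amounts to a priority-ordered selection that skips exhausted providers. A minimal sketch, assuming the rate-limit state is passed in via a `RATE_LIMITED` environment variable; in production the check would inspect recent 429 responses and quota counters instead:

```shell
#!/usr/bin/env bash
# Failover chain sketch: try providers in priority order and skip
# any that are currently rate-limited. Names are illustrative.
PROVIDERS=(primary secondary tertiary)

# is_rate_limited NAME — here, NAME is rate-limited if it appears in
# the space-separated RATE_LIMITED variable (a stand-in for real
# quota tracking).
is_rate_limited() {
  [[ " ${RATE_LIMITED:-} " == *" $1 "* ]]
}

pick_provider() {
  local p
  for p in "${PROVIDERS[@]}"; do
    if ! is_rate_limited "$p"; then
      echo "$p"
      return 0
    fi
  done
  return 1  # every tier exhausted: back off instead of hammering APIs
}
```

The important property is the hard stop at the end of the chain: when all three tiers are rate-limited, the right move is to wait, not to retry in a loop.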
The Oversight Layer
An unsupervised AI agent will drift. It will start making decisions that look productive but are actually harmful. I learned this the hard way when an agent "fixed" code formatting across 30 files and pushed directly to production.
The oversight system runs on a separate, cheap model (Groq, sub-second response times) and checks every 5 minutes:
- Are all agents alive and responsive?
- Has any agent pushed code without passing CI?
- Has any agent modified configuration files?
- Are rate limits being respected?
- Is the coordinator still following its operational checklist?
When the oversight system detects a violation, it posts to a dedicated alert channel AND injects a direct message into the coordinator's session. The coordinator cannot ignore it.
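The oversight pass itself can be sketched in a few lines, assuming each checker emits a `name=ok` or `name=fail` line (the input format is illustrative, and printing stands in for the real alert channel and session injection):

```shell
#!/usr/bin/env bash
# Oversight pass sketch: collect check results on stdin; anything
# that is not "ok" is a violation.
violations() {
  awk -F= '$2 != "ok" { print $1 }'
}

alert_if_needed() {
  local bad
  bad="$(violations)"
  if [[ -n $bad ]]; then
    # Real system: post to the alert channel AND inject a message
    # into the coordinator's session. Here we just report.
    printf 'OVERSIGHT VIOLATION: %s\n' $bad
    return 1
  fi
  echo "all checks passed"
}
```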
The Self-Healing Watchdog
Beyond the AI oversight, a bash script runs via system cron every 15 minutes:
- Checks if the main gateway process is alive.
- If dead, grabs the last 50 log lines.
- Feeds logs to a fast LLM API (Groq) asking for a diagnostic and fix.
- Applies the fix and restarts the service.
- If the LLM fix fails, falls back to restoring the last known good config backup.
- Logs everything so the coordinator knows what happened when it wakes up.
This means the system can recover from configuration errors, crash loops, and authentication failures without any human intervention.
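The recovery logic splits into two reusable pieces: evidence collection and the fix-or-restore decision. A sketch under stated assumptions — the function names and the fix-application hook are mine, and the real script wires these to systemd, the Groq API, and the actual config paths:

```shell
#!/usr/bin/env bash
# Watchdog sketch. last_log_lines gathers the evidence sent to the
# LLM; recover tries the LLM's fix and falls back to the last known
# good config if the fix fails.

last_log_lines() {
  tail -n 50 "$1"   # step 2: the last 50 lines of the gateway log
}

# recover APPLY_CMD CONFIG BACKUP
# APPLY_CMD is any command that applies the LLM-suggested fix and
# exits nonzero on failure (an illustrative hook, not the real API).
recover() {
  local apply_cmd=$1 config=$2 backup=$3
  if "$apply_cmd"; then
    echo "llm fix applied"
  else
    cp -- "$backup" "$config"   # restore last known good config
    echo "restored backup"
  fi
}
```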
Lessons from Production
1. Treat AI agents like junior developers, not senior architects
Give them well-scoped tasks with clear acceptance criteria. Never let them make architectural decisions autonomously. The coordinator (running the best model) makes decisions. Specialists execute.
2. Every commit must pass the "would a human understand this?" test
Before any agent pushes code, the diff is checked against a simple heuristic: would a competent human developer look at this and immediately understand why it exists? If the answer is no, the commit is rejected.
3. Configuration changes are the most dangerous operation
The number one cause of downtime in my fleet is configuration errors, not code bugs. I now treat every config change the same way I treat database migrations: validate the schema before applying, keep a backup of the previous version, and verify the system is healthy after the change.
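That migration-style rollout looks roughly like this for a JSON config. A sketch: using `python3 -m json.tool` as the validator and a `.bak` suffix for the backup are my assumptions, and a real rollout would re-run the health checks after restarting the service:

```shell
#!/usr/bin/env bash
# Apply a config change the way you would run a database migration:
# validate first, keep a backup of the previous version, then apply.
apply_config() {
  local new=$1 live=$2
  # Validate before touching anything (here: is it well-formed JSON?).
  if ! python3 -m json.tool "$new" > /dev/null 2>&1; then
    echo "invalid json, aborting"
    return 1
  fi
  cp -- "$live" "$live.bak"   # keep the previous version for rollback
  cp -- "$new" "$live"
  echo "applied"
  # Next step in production: restart the service and verify health
  # before declaring the change successful.
}
```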
4. Cost is not the constraint. Quality is.
Running six agents costs roughly the same as one junior developer's monthly coffee budget. The real cost is bad output. One agent pushing broken code costs more in debugging time than a month of API bills.
What's Next
I am building my own products with this system. Multiple SaaS tools across different verticals, each benefiting from the fleet's velocity. The details will come when they ship.
The goal is not to replace human engineering judgment. The goal is to automate everything that does not require it. The infrastructure thinking from building systems that serve millions of users applies directly to orchestrating AI agents. Same principles. Different domain.
If you are interested in the tools: Fleet is open source and available on ClawHub.
