Originally published on AIdeazz — cross-posted here with canonical link.
Most discussions of multi-agent AI systems focus on architecture diagrams and theoretical capabilities. Let me show you what actually happens when you run production agents on Oracle's Always Free tier, manage them with systemd and PM2, and route between Groq and Claude APIs while keeping infrastructure costs at zero.
The Zero-Dollar Infrastructure Stack
Oracle's Always Free tier gives you 4 ARM cores and 24GB RAM split across compute instances. That's enough for a multi-agent AI system if you understand the constraints.
My current setup runs five distinct agents:
- WhatsApp customer service bot (Node.js, 120MB RAM)
- Telegram automation assistant (Python, 180MB RAM)
- Email classifier and router (Node.js, 90MB RAM)
- Document processor with OCR pipeline (Python, 400MB RAM)
- Orchestrator managing agent communication (Node.js, 150MB RAM)
Each agent runs as a systemd service, with PM2 handling process management inside each unit. The orchestrator coordinates through Redis (60MB) running on the same instance.
Here's the actual systemd unit file for the WhatsApp agent:
[Unit]
Description=WhatsApp Customer Agent
After=network.target redis.service
[Service]
Type=forking
User=agent-runner
Environment="NODE_ENV=production"
ExecStart=/usr/bin/pm2 start /opt/agents/whatsapp/ecosystem.config.js
ExecReload=/usr/bin/pm2 reload whatsapp-agent
ExecStop=/usr/bin/pm2 stop whatsapp-agent
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
PM2 handles memory limits, auto-restarts, and log rotation. When an agent hits its memory ceiling, PM2 restarts it before the OOM killer intervenes. This happens 2-3 times daily for the document processor during OCR peaks.
API Routing Economics and Failure Modes
The multi-agent AI system routes between Groq (Llama 3 70B) and Claude 3.5 Sonnet based on task complexity and cost. Groq's free tier covers most routine interactions. Claude handles complex reasoning when Groq's context window isn't sufficient.
Real routing logic from production:
def select_llm_provider(message_context):
    # Groq: 6,000 requests/day free tier
    if daily_groq_requests < 5800:  # 200 buffer
        if len(message_context) < 6000:  # Well within 8k window
            return "groq"
    # Claude: $3/million input tokens
    if requires_complex_reasoning(message_context):
        if monthly_claude_spend < budget_limit:
            return "claude"
    # Fallback: queue for later or return cached response
    return "queue"
Failure modes I've encountered:
- Groq rate limits hit during business hours (happens 2-3 times per week)
- Claude API timeouts on long contexts (1-2 times daily)
- Both providers down simultaneously (twice in six months)
The system maintains a 24-hour response cache and queues non-urgent requests when providers are unavailable. Critical messages trigger SMS alerts to my phone.
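A minimal sketch of that fallback path, assuming the redis-py client; the key scheme, queue name, and notify_sms helper are illustrative, not the production code:

import hashlib
import json

import redis

r = redis.Redis()

def cache_key(message):
    return "cache:" + hashlib.sha256(message.encode()).hexdigest()

def fallback_response(message, urgent=False):
    cached = r.get(cache_key(message))
    if cached:
        return cached.decode()
    if urgent:
        notify_sms(f"Providers down, urgent message waiting: {message[:80]}")  # hypothetical helper
        return None
    # Non-urgent and no cache hit: park it until a provider recovers
    r.rpush("queue:deferred", json.dumps({"message": message}))
    return None

def store_response(message, response):
    r.setex(cache_key(message), 86400, response)  # 24-hour TTL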
Memory Management Under Hard Constraints
With 24GB total RAM and multiple agents, memory leaks kill production fast. Each agent operates under strict memory budgets enforced by PM2:
module.exports = {
  apps: [{
    name: 'whatsapp-agent',
    script: './src/index.js',
    max_memory_restart: '120M',
    instances: 1,
    exec_mode: 'fork',
    autorestart: true,
    watch: false,
    error_file: '/var/log/agents/whatsapp-error.log',
    out_file: '/var/log/agents/whatsapp-out.log',
    log_date_format: 'YYYY-MM-DD HH:mm:ss Z'
  }]
};
The document processor is the memory hog. OCR operations can spike to 800MB for large PDFs. I run it in a separate cgroup with hard limits:
# /etc/systemd/system/document-processor.service.d/override.conf
[Service]
MemoryMax=500M
MemoryHigh=400M
When memory pressure hits, the kernel OOM killer targets the document processor first, preserving customer-facing agents. The processor queues documents to S3 (Oracle Object Storage free tier: 10GB) and retries after restart.
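Oracle Object Storage exposes an S3-compatible endpoint, which is why plain boto3 works for the queueing step. A sketch; the namespace, region, bucket name, and credential variables are placeholders:

import os

import boto3

# OCI's S3 compatibility endpoint; namespace and region are account-specific
s3 = boto3.client(
    "s3",
    endpoint_url="https://<namespace>.compat.objectstorage.<region>.oraclecloud.com",
    aws_access_key_id=os.environ["OCI_S3_KEY_ID"],
    aws_secret_access_key=os.environ["OCI_S3_SECRET"],
)

def queue_document(path):
    # Park the document in the bucket; the processor drains the
    # pending/ prefix again after it restarts
    key = f"pending/{os.path.basename(path)}"
    s3.upload_file(path, "agent-documents", key)
    return key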
Agent Communication Architecture
Agents communicate through Redis pub/sub channels. No fancy message queues — Redis on the same host eliminates network latency and stays within free tier limits.
Channel structure:
- agent:whatsapp:incoming - Raw messages from WhatsApp
- agent:telegram:incoming - Raw messages from Telegram
- orchestrator:classify - Messages needing classification
- orchestrator:route - Classified messages with routing decisions
- agent:{name}:process - Agent-specific processing queues
The orchestrator subscribes to all incoming channels, classifies intent, and publishes to appropriate processing queues. Agents acknowledge receipt within 5 seconds or messages return to queue.
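As a sketch of that loop (redis-py pub/sub; classify_intent and the handler field are simplified stand-ins for the real classifier):

import json

import redis

r = redis.Redis()
pubsub = r.pubsub()
pubsub.subscribe("agent:whatsapp:incoming", "agent:telegram:incoming")

for event in pubsub.listen():
    if event["type"] != "message":
        continue
    msg = json.loads(event["data"])
    # classify_intent: rules first, LLM for everything else (hypothetical helper)
    msg["classification"] = classify_intent(msg["content"]["text"])
    handler = msg["classification"]["handler"]  # e.g. "email", "document"
    r.publish(f"agent:{handler}:process", json.dumps(msg))

Worth noting: Redis pub/sub has no delivery guarantees or acknowledgements of its own, so the 5-second ack and requeue behavior lives in application code. That's what the retry_count and max_retries fields below exist for.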
Inter-agent message format:
{
  "id": "msg_1234567890",
  "source": "whatsapp",
  "timestamp": 1707345234567,
  "retry_count": 0,
  "max_retries": 3,
  "content": {
    "text": "I need to update my shipping address",
    "from": "+1234567890",
    "metadata": {}
  },
  "classification": {
    "intent": "address_update",
    "confidence": 0.94,
    "requires_auth": true
  }
}
Operational Reality: Monitoring and Debugging
PM2's built-in monitoring shows real-time memory and CPU per agent:
pm2 monit
But that's not enough for production. I pipe all agent logs to a single file and run a lightweight log analyzer:
# log_monitor.py - runs every 5 minutes via cron
import os
import re
from collections import Counter

LOG = '/var/log/agents/combined.log'
OFFSET = '/var/tmp/log_monitor.offset'

error_pattern = re.compile(r'ERROR|CRITICAL|Failed|Timeout')
stats = Counter()

# Resume from the previous run's offset so each run only counts new lines
offset = int(open(OFFSET).read() or 0) if os.path.exists(OFFSET) else 0
if offset > os.path.getsize(LOG):
    offset = 0  # log was rotated; start over

with open(LOG) as f:
    f.seek(offset)
    for line in f:
        match = error_pattern.search(line)
        if match:
            # Extract agent name (4th field in the log format) and error type
            agent = line.split()[3]
            stats[f"{agent}:{match.group()}"] += 1
    open(OFFSET, 'w').write(str(f.tell()))

# Alert if any counter exceeds threshold
for key, count in stats.items():
    if count > 10:  # 10 errors per 5-minute window
        send_telegram_alert(f"High error rate: {key} = {count}")
Debugging distributed issues across agents requires correlation IDs. Every incoming message gets a UUID that follows it through the entire system. When customers report issues, I grep logs for their message ID across all agents.
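A minimal version of that pattern, using only the standard library; the log format and field names are illustrative:

import logging
import uuid

# Every record carries the correlation ID, so one grep reconstructs
# a message's full path across all five agents
logging.basicConfig(format="%(asctime)s %(levelname)s %(agent)s %(msg_id)s %(message)s")
log = logging.getLogger("agents")

def tag_incoming(message):
    # Assign the ID once, at the edge; agents pass it along untouched
    message["id"] = f"msg_{uuid.uuid4().hex[:12]}"
    return message

log.error("classification timeout",
          extra={"agent": "orchestrator", "msg_id": "msg_4f7c21d0a9b3"})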
Scaling Constraints and Workarounds
The free tier's 4 ARM cores hit CPU limits before memory becomes an issue. During peak hours (10am-2pm Panama time), CPU usage hovers around 85%.
Optimization strategies that actually work:
- Move regex operations to compiled patterns (20% CPU reduction)
- Cache LLM responses for common questions (30% reduction in API calls)
- Batch similar requests to Groq (15% improvement in throughput)
- Pre-classify messages with simple rules before hitting LLMs (sketched below)
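That last one deserves a sketch. The rules table here is illustrative; the point is that compiled patterns checked in order let obvious intents skip the LLM call entirely:

import re

# (compiled pattern, intent) pairs, checked in order; rules are illustrative
RULES = [
    (re.compile(r"\b(track|tracking|where is my order)\b", re.I), "order_status"),
    (re.compile(r"\b(change|update)\b.{0,30}\baddress\b", re.I), "address_update"),
    (re.compile(r"\b(refund|money back)\b", re.I), "refund_request"),
]

def pre_classify(text):
    for pattern, intent in RULES:
        if pattern.search(text):
            return {"intent": intent, "confidence": 1.0, "source": "rules"}
    return None  # no match: fall through to the LLM router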
What doesn't work:
- Running multiple instances of the same agent (thrashing)
- Complex caching strategies (Redis memory overhead)
- Kubernetes on free tier (resource overhead kills you)
Production Incidents and Recovery
Real incidents from the past six months:
Incident 1: Document processor memory leak
- Cause: PDF library didn't release memory after processing
- Impact: Agent restarted every 30 minutes for a week
- Fix: Moved PDF processing to a child process that exits after each document (sketch below)
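A sketch of that pattern with multiprocessing; extract_text stands in for whatever the PDF library exposes:

from multiprocessing import Process, Queue

def _worker(path, out):
    # All PDF-library allocations happen inside the child process
    out.put(extract_text(path))  # extract_text: stand-in for the real library call

def process_pdf(path):
    out = Queue()
    p = Process(target=_worker, args=(path, out))
    p.start()
    result = out.get(timeout=120)  # bail out if the child wedges on a bad PDF
    p.join()
    # The child has exited, taking every leaked byte with it
    return result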
Incident 2: Redis maxed out memory
- Cause: Forgot to set TTL on cached responses
- Impact: All agents hung waiting for Redis
- Fix: Added 24-hour TTL to all keys, reduced cache size
Incident 3: Groq API change broke parsing
- Cause: Unannounced API response format change
- Impact: 6 hours of failed message processing
- Fix: Added response format validation and a fallback parser (sketch below)
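Groq's API follows the OpenAI chat-completion shape, so validation amounts to defensive dict access plus a last-resort scan. A sketch of that layer:

def parse_llm_response(raw):
    # Primary path: the response shape documented today
    try:
        return raw["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError):
        pass
    # Fallback parser: scan for any plausible text field before giving up
    for key in ("content", "text", "completion"):
        found = _find_key(raw, key)
        if isinstance(found, str) and found.strip():
            return found
    raise ValueError(f"Unrecognized response format: {str(raw)[:200]}")

def _find_key(obj, key):
    # Depth-first search through nested dicts and lists
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        obj = list(obj.values())
    if isinstance(obj, list):
        for item in obj:
            found = _find_key(item, key)
            if found is not None:
                return found
    return None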
Recovery procedures are automated via systemd:
#!/bin/bash
# /usr/local/bin/agent-recovery.sh
systemctl stop agent-orchestrator
redis-cli FLUSHDB
systemctl restart redis
sleep 5
systemctl start agent-orchestrator
systemctl restart agent-whatsapp agent-telegram agent-email agent-document
Cost Reality Check
"Zero infrastructure" doesn't mean zero costs:
- Claude API: $30-50/month for complex queries
- Groq: Free tier sufficient for 95% of requests
- Twilio (WhatsApp): $0.005 per message
- Domain and SSL: $15/year
- My time debugging at 2am: Priceless
Total monthly cost: $35-65 depending on Claude usage. The infrastructure genuinely costs $0, but API and messaging fees are unavoidable.
Future-Proofing Within Constraints
Oracle's free tier is generous but could change. I maintain Docker images for all agents and test monthly on a $5 DigitalOcean droplet. Full migration would take under 2 hours.
The multi-agent AI system architecture deliberately avoids lock-in:
- Agents communicate via Redis (portable)
- No Oracle-specific services except compute
- All data exports to S3-compatible storage
- Configuration in environment variables
The real constraint isn't technical — it's operational. Running production systems on free tier means you're the SRE, developer, and support team. Every optimization matters. Every byte of memory counts. Every CPU cycle has a purpose.
But it works. My agents handle 500+ customer interactions daily, process 50+ documents, and maintain 99.5% uptime. Not because the architecture is elegant, but because every component is tuned for the reality of free-tier constraints.