Elena Revicheva

Posted on • Originally published at aideazz.hashnode.dev

Running Multi-Agent AI Systems on $0 Infrastructure: A Production Reality Check

Originally published on AIdeazz — cross-posted here with canonical link.

I've been running multi-agent AI systems in production for months on exactly $0 in infrastructure costs. Not "almost free" or "under $5" — literally zero dollars per month. This isn't a proof of concept or a toy deployment. These are real systems handling customer interactions, document processing, and automated workflows 24/7.

Before you assume this is another "10x your startup with this one weird trick" post, let me be clear about what you're getting into. Running production systems on free-tier infrastructure means accepting hard constraints, building around specific failure modes, and making architectural decisions that most cloud architects would find uncomfortable. But if you're willing to work within these boundaries, you can operate sophisticated multi-agent AI systems without burning through runway on AWS bills.

The Oracle Always Free Reality

Oracle Cloud's Always Free tier gives you resources that AWS and GCP don't: 4 ARM64 cores, 24GB RAM, and 200GB storage that never expire. Not a 12-month trial — permanently free. I run my entire multi-agent infrastructure on a single Oracle compute instance in São Paulo.

The constraints shape everything:

  • Single region deployment means no geographic redundancy
  • ARM64 architecture requires careful dependency management
  • Limited egress bandwidth (10TB/month, but throttled)
  • No managed services — you're running raw compute

Here's my actual resource allocation:

Multi-Agent System (PM2)          : ~2GB RAM
Redis (state/queuing)             : ~500MB RAM  
PostgreSQL (conversation history) : ~1.5GB RAM
Nginx (webhook endpoints)         : ~200MB RAM
System overhead                   : ~1.5GB RAM
-------------------------------------------
Total sustained usage             : ~6GB/24GB

That leaves 18GB for burst processing, model response caching, and the occasional memory leak from a misbehaving agent. Yes, memory leaks happen in production Node.js agents; pretending otherwise is naive.
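To catch those leaks before PM2's restart limit does, a small heap watchdog inside each agent is enough. This is a minimal sketch, not the production code; the function name and thresholds are illustrative:

```javascript
// Minimal heap watchdog: report current memory and flag suspected leaks.
// The limit is illustrative; PM2's max_memory_restart is the real backstop.
function checkHeap(limitMb = 1024) {
  const { rss, heapUsed } = process.memoryUsage();
  const rssMb = Math.round(rss / 1024 / 1024);
  const heapMb = Math.round(heapUsed / 1024 / 1024);
  return { rssMb, heapMb, suspectLeak: rssMb > limitMb };
}

// In an agent, poll periodically and log the trend:
// setInterval(() => console.log(checkHeap()), 60_000);
```

Logging the trend matters more than any single reading: steady growth across restarts is what distinguishes a leak from normal burst usage.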

Agent Orchestration Without Kubernetes

Everyone assumes you need Kubernetes for multi-agent systems. You don't. My entire orchestration stack:

  • systemd for service management
  • PM2 for Node.js process supervision
  • Redis for inter-agent communication
  • PostgreSQL for state persistence

Here's a real systemd service definition for an agent:

[Unit]
Description=Customer Support Agent
After=network.target redis.service postgresql.service

[Service]
Type=forking
User=oracle
Environment="NODE_ENV=production"
Environment="AGENT_TYPE=support"
Environment="REDIS_URL=redis://localhost:6379"
ExecStart=/usr/bin/pm2 start /home/oracle/agents/support/index.js --name support-agent -i 2
ExecReload=/usr/bin/pm2 reload support-agent
ExecStop=/usr/bin/pm2 delete support-agent
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target

PM2 handles process clustering, automatic restarts, and log rotation. Each agent type runs 2-4 processes depending on load patterns I've observed. The support agent handles more concurrent sessions, so it gets more processes. The document processor is CPU-bound, so it runs fewer instances to avoid context switching overhead.

Inter-agent communication happens through Redis pub/sub:

// Simplified but real pattern from production
const crypto = require('crypto');

class AgentMessaging {
  constructor(agentId, redis) {
    this.agentId = agentId;
    this.redis = redis;
    // Dedicated connection: a subscribed Redis client can't issue other commands
    this.subscriber = redis.duplicate();
  }

  async subscribe(channel, handler) {
    await this.subscriber.subscribe(channel);
    this.subscriber.on('message', async (ch, message) => {
      if (ch !== channel) return;
      const data = JSON.parse(message);
      // Critical: track message processing to detect stuck agents
      await this.redis.setex(`processing:${data.messageId}`, 300, this.agentId);
      try {
        await handler(data);
      } finally {
        // Clear the marker even if the handler throws
        await this.redis.del(`processing:${data.messageId}`);
      }
    });
  }

  async publish(channel, data) {
    const message = {
      ...data,
      messageId: crypto.randomUUID(),
      timestamp: Date.now(),
      sourceAgent: this.agentId
    };
    await this.redis.publish(channel, JSON.stringify(message));
    return message.messageId;
  }
}

Groq/Claude Routing and Cost Management

The multi-agent AI system routes between Groq (free tier) and Claude (paid) based on request complexity. This isn't sophisticated — it's a simple heuristic based on conversation length and keyword detection:

function selectModel(context) {
  const history = context.conversationHistory;
  // Guard against a brand-new conversation with no messages yet
  const lastMessage = history[history.length - 1] || '';

  // Use Claude for complex scenarios
  if (history.length > 10) return 'claude-3-haiku';
  if (lastMessage.length > 500) return 'claude-3-haiku';
  if (containsComplexityMarkers(lastMessage)) return 'claude-3-haiku';

  // Default to Groq
  return 'llama-3.1-8b-instant';
}

function containsComplexityMarkers(text) {
  const markers = ['analyze', 'compare', 'explain why', 'how exactly', 'step by step'];
  return markers.some(marker => text.toLowerCase().includes(marker));
}

Groq's free tier gives you about 30 requests per minute. When you hit rate limits, the system queues requests or falls back to Claude. In practice, 80% of customer interactions stay on Groq, keeping API costs minimal.
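The queue-or-fallback decision can be sketched as a simple per-minute counter. The class name is hypothetical and this version is in-process; in a real deployment the counter would live in Redis so all PM2 workers share one budget:

```javascript
// Sketch: count Groq calls in the current minute and fall back to Claude
// once the free-tier budget (~30 req/min) is spent. In-process only;
// production would keep this counter in Redis, shared across workers.
class RateLimitedRouter {
  constructor(groqPerMinute = 30) {
    this.limit = groqPerMinute;
    this.windowStart = Date.now();
    this.count = 0;
  }

  route() {
    const now = Date.now();
    // Reset the budget at the start of each one-minute window
    if (now - this.windowStart >= 60_000) {
      this.windowStart = now;
      this.count = 0;
    }
    if (this.count < this.limit) {
      this.count++;
      return 'llama-3.1-8b-instant'; // stay on the free tier
    }
    return 'claude-3-haiku'; // budget spent: fall back to the paid model
  }
}
```

A fixed window like this slightly overshoots at window boundaries; a sliding window is stricter but this is close enough when the fallback is just a more expensive model rather than a hard failure.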

The real challenge isn't routing — it's handling model differences. Groq's Llama responds differently than Claude to the same prompts. Each agent maintains model-specific prompt templates:

const PROMPTS = {
  'llama-3.1-8b-instant': {
    customerSupport: `You are a support agent. Be concise and direct.

Current context:
{context}

Customer message: {message}

Respond naturally in 2-3 sentences maximum.`,
  },
  'claude-3-haiku': {
    customerSupport: `You are a support agent. Provide thorough, helpful responses.

Current context:
{context}

Customer message: {message}

Address their concern completely, including relevant details and next steps.`,
  }
};
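Filling the `{context}` and `{message}` slots in those templates is a one-liner worth getting right. A hypothetical helper (the name and behavior for unknown slots are my assumptions, not the article's production code):

```javascript
// Fill {slot} placeholders in a prompt template from a vars object.
// Unknown slots are left intact rather than replaced with "undefined",
// which makes missing variables easy to spot in logs.
function renderPrompt(template, vars) {
  return template.replace(/\{(\w+)\}/g, (match, key) =>
    key in vars ? String(vars[key]) : match
  );
}
```

Usage: `renderPrompt(PROMPTS['claude-3-haiku'].customerSupport, { context, message })` produces the final string sent to the model.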

Telegram and WhatsApp Integration Patterns

My multi-agent systems primarily interface through Telegram and WhatsApp. Telegram is straightforward — webhook-based, reliable, generous rate limits. WhatsApp Business API is a different beast.

For Telegram agents, each bot runs as a separate PM2 process:

// Assumes node-telegram-bot-api, an Express `app`, and a Redis client
// (`redis`) already in scope.
class TelegramAgent {
  constructor(token, agentType) {
    this.bot = new TelegramBot(token, { webHook: true });
    this.agentType = agentType;
    this.setupWebhook();
    this.initializeHandlers();
  }

  setupWebhook() {
    const webhookPath = `/webhook/${this.agentType}/${crypto.randomBytes(16).toString('hex')}`;
    this.bot.setWebHook(`https://agents.aideazz.xyz${webhookPath}`);

    // Nginx forwards to appropriate PM2 process
    app.post(webhookPath, (req, res) => {
      this.bot.processUpdate(req.body);
      res.sendStatus(200);
    });
  }

  async processMessage(msg) {
    const lockKey = `lock:telegram:${msg.chat.id}`;
    const locked = await redis.set(lockKey, '1', 'NX', 'EX', 30);

    if (!locked) {
      // Another agent instance is handling this chat
      return;
    }

    try {
      // Process through multi-agent pipeline
      await this.handleUserMessage(msg);
    } finally {
      await redis.del(lockKey);
    }
  }
}

WhatsApp requires more infrastructure — you need a verified business, approved message templates, and session management. I use a shared session pool across agents to stay within rate limits:

class WhatsAppSessionPool {
  constructor(maxConcurrent = 5) {
    this.sessions = new Map();
    this.queue = [];
    this.active = 0;
    this.maxConcurrent = maxConcurrent;
  }

  async acquire(phoneNumber) {
    while (this.active >= this.maxConcurrent) {
      await new Promise(resolve => this.queue.push(resolve));
    }

    this.active++;
    return {
      release: () => {
        this.active--;
        const next = this.queue.shift();
        if (next) next();
      }
    };
  }
}
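The pool only works if every `acquire()` is paired with a `release()`, even on errors, otherwise queued callers wait forever. A usage sketch (the wrapper name and phone number are illustrative):

```javascript
// Run a send operation under a pooled session slot, guaranteeing release.
async function sendWithPool(pool, phoneNumber, send) {
  const session = await pool.acquire(phoneNumber);
  try {
    return await send(phoneNumber);
  } finally {
    session.release(); // wakes the next queued caller even if send() threw
  }
}
```

The `try/finally` is the whole point: a thrown error from the WhatsApp API must not leak a slot.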

Failure Modes and Recovery Strategies

Running on free infrastructure means accepting specific failure modes:

Memory leaks accumulate. Node.js agents handling long-running conversations leak memory. PM2 memory limits and automatic restarts handle this:

// ecosystem.config.js
module.exports = {
  apps: [{
    name: 'support-agent',
    script: './index.js',
    instances: 2,
    max_memory_restart: '1G',
    error_file: '/home/oracle/logs/support-error.log',
    out_file: '/home/oracle/logs/support-out.log',
    merge_logs: true,
    autorestart: true,
    watch: false,
    max_restarts: 10,
    min_uptime: '10s',
  }]
};

Redis crashes under memory pressure. When Redis hits memory limits, it starts evicting keys. Critical data goes to PostgreSQL, ephemeral state stays in Redis with appropriate TTLs:

// Conversation state - persistent
await db.query(
  'INSERT INTO conversations (user_id, messages, metadata) VALUES ($1, $2, $3)',
  [userId, JSON.stringify(messages), metadata]
);

// Processing locks - ephemeral
await redis.setex(`processing:${jobId}`, 300, agentId);

// Model response cache - expendable
await redis.setex(
  `response:${promptHash}`, 
  3600, 
  JSON.stringify(modelResponse)
);

Network partitions happen. Oracle's network isn't AWS. I've seen 30-second network drops. Agents must handle this:

// Small helper assumed by the retry loop
const sleep = (ms) => new Promise(resolve => setTimeout(resolve, ms));

class ResilientAgent {
  async processWithRetry(fn, maxAttempts = 3) {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return await fn();
      } catch (error) {
        if (attempt === maxAttempts) throw error;

        // Network errors get exponential backoff: 2s, 4s, 8s...
        if (error.code === 'ECONNRESET' || error.code === 'ETIMEDOUT') {
          await sleep(Math.pow(2, attempt) * 1000);
          continue;
        }

        // Non-network errors fail immediately
        throw error;
      }
    }
  }
}

Single point of failure. One Oracle instance means downtime during maintenance. I handle this with:

  1. Maintenance windows during low-traffic hours (3-5 AM Panama time)
  2. Telegram broadcast to users about scheduled maintenance
  3. Request queuing in Redis that survives restarts
  4. PostgreSQL WAL archives uploaded to Oracle Object Storage (also free tier)
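The restart-surviving queue in point 3 can be sketched with Redis lists. The key names and helper functions here are illustrative, not the production ones: a job moves atomically from an inbound list to a processing list while it's handled, so a crash mid-job leaves it sitting in `queue:processing` for recovery instead of vanishing.

```javascript
// Restart-surviving queue sketch using Redis lists (LPUSH + RPOPLPUSH).
// Works with any client exposing lpush/rpoplpush/lrem (e.g. ioredis).

async function enqueueJob(redis, job) {
  await redis.lpush('queue:inbound', JSON.stringify(job));
}

async function takeJob(redis) {
  // Atomically move one job to the processing list; returns null if empty
  const raw = await redis.rpoplpush('queue:inbound', 'queue:processing');
  return raw ? JSON.parse(raw) : null;
}

async function completeJob(redis, job) {
  // Remove from the processing list only after successful handling
  await redis.lrem('queue:processing', 1, JSON.stringify(job));
}
```

On startup, a recovery pass can re-enqueue anything left in `queue:processing` from before the restart. (Redis 6.2+ prefers `LMOVE` over the older `RPOPLPUSH`, but both do the atomic move.)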

Performance Characteristics and Limits

Real numbers from production over the last 30 days:

  • Message throughput: ~50,000 messages processed
  • Average latency: 1.8 seconds (includes model inference)
  • P99 latency: 8.2 seconds (during Groq rate limit fallbacks)
  • Uptime: 99.3% (includes two planned maintenance windows)
  • Memory usage: Stable between 5.5-7GB
  • CPU usage: Averages 40%, spikes to 80% during document processing

The system handles approximately 20 concurrent conversations comfortably. Beyond that, response latency degrades noticeably. This isn't due to CPU or memory constraints — it's the single-threaded Redis becoming a bottleneck for inter-agent communication.

Storage fills up faster than expected. Log files, PostgreSQL WAL archives, and conversation history consume about 2GB/week. Automated cleanup runs weekly:

#!/bin/bash
# cleanup.sh - runs via cron
find /home/oracle/logs -name "*.log" -mtime +7 -delete
psql -c "DELETE FROM conversations WHERE created_at < NOW() - INTERVAL '30 days';"
redis-cli --scan --pattern "cache:*" | xargs -L 100 redis-cli DEL

Operational Reality Check

This setup requires constant attention. It's not set-and-forget infrastructure. My weekly maintenance routine:

  • Monitor memory growth patterns
  • Review error logs for new failure modes
  • Update rate limit logic based on Groq's unannounced changes
  • Tune PostgreSQL autovacuum settings
  • Check Oracle's abuse metrics (they do monitor free tier usage)

The multi-agent AI system works because I've accepted these constraints. Would I run a million-user platform this way? No. But for building AI products, validating ideas, and serving early customers without burning money on cloud infrastructure? It's perfectly viable.

The code snippets above are simplified versions of production code. Real implementations include extensive error handling, monitoring hooks, and gradual degradation strategies. But the patterns are what I actually use every day.

If you're considering this approach, ask yourself: Can you tolerate 99.3% uptime? Can you handle single-region deployment? Will your use case fit within Oracle's free tier network limits? If yes, you can run sophisticated multi-agent systems without spending on infrastructure.

The zero-dollar infrastructure forces architectural discipline. Every component must justify its resource usage. Every agent must handle failure gracefully. Every integration must work within hard limits. These constraints make the system more robust, not less.

— Elena Revicheva · AIdeazz · Portfolio
