Originally published on AIdeazz — cross-posted here with canonical link.
I run a multi-agent AI system handling real production workloads on Oracle Cloud's Always Free tier. Zero monthly infrastructure cost. This isn't theoretical — AIdeazz agents process thousands of messages daily across Telegram and WhatsApp, orchestrating between Groq, Claude, and local models. Here's the operational reality of extreme infrastructure constraints.
The Always Free Reality Check
Oracle gives you 4 ARM cores, 24GB RAM, and 200GB storage forever. That's it. No scaling. No bursting. When your agents hit capacity, they queue or drop requests.
My setup runs 6 concurrent agents on this single VM:
- 2 Telegram bots (customer support, lead qualification)
- 2 WhatsApp Business API agents
- 1 orchestrator managing model routing
- 1 analytics collector
Each agent runs as a systemd service wrapping a PM2-managed Node.js process. Memory allocation is brutal: six agents capped at ~3GB each consume 18GB of the 24GB, leaving roughly 6GB of headroom for the OS, SQLite, and the local model. One memory leak takes down the entire system.
The constraints force architectural decisions most multi-agent systems avoid. No Kubernetes. No microservices. No distributed tracing. Just Unix processes and careful resource management.
Agent Architecture Under Constraints
Traditional multi-agent AI system architectures assume elastic compute. Mine assumes the opposite. Every design decision optimizes for fixed resources.
Process isolation via systemd:
```ini
# /etc/systemd/system/agent-telegram-support.service
[Unit]
Description=Telegram Support Agent
After=network.target

[Service]
Type=simple
User=agent
WorkingDirectory=/opt/agents/telegram-support
ExecStart=/usr/bin/node --max-old-space-size=2048 index.js
Restart=on-failure
RestartSec=10
MemoryMax=3G
CPUQuota=50%

[Install]
WantedBy=multi-user.target
```
Each agent gets hard memory and CPU limits. When an agent exceeds its memory cap, systemd kills it and, via `Restart=on-failure`, brings it back ten seconds later. This controlled failure is better than a system-wide OOM.
Message queueing without infrastructure:
No Redis. No RabbitMQ. SQLite with write-ahead logging handles inter-agent communication:
```javascript
// Shared message bus using SQLite (better-sqlite3, synchronous API)
const Database = require('better-sqlite3');

class MessageBus {
  constructor(dbPath) {
    this.db = new Database(dbPath);
    this.db.pragma('journal_mode = WAL');   // readers don't block the writer
    this.db.pragma('busy_timeout = 5000');  // wait up to 5s on a locked db
  }

  publish(topic, message) {
    this.db.prepare(
      'INSERT INTO messages (topic, payload, created_at) VALUES (?, ?, ?)'
    ).run(topic, JSON.stringify(message), Date.now());
  }

  consume(topic, handler) {
    // Polling-based consumption; the draining flag stops overlapping ticks
    // from double-processing when a handler runs longer than the interval
    let draining = false;
    setInterval(async () => {
      if (draining) return;
      draining = true;
      try {
        const messages = this.db.prepare(
          'SELECT * FROM messages WHERE topic = ? AND processed = 0 LIMIT 10'
        ).all(topic);
        for (const msg of messages) {
          await handler(JSON.parse(msg.payload));
          this.db.prepare('UPDATE messages SET processed = 1 WHERE id = ?').run(msg.id);
        }
      } finally {
        draining = false;
      }
    }, 1000);
  }
}
```
This handles 10K messages/day without external dependencies. Not web-scale, but sufficient for SMB workloads.
Model Routing and Fallback Strategies
Running multiple AI models on zero budget means aggressive routing and caching. My orchestrator agent manages this complexity.
Cost-based routing logic:
```javascript
class ModelRouter {
  async route(prompt, context) {
    // Check cache first
    const cached = await this.cache.get(this.hashPrompt(prompt));
    if (cached && !context.requiresFresh) return cached;

    // Groq for simple queries (free tier: 30 req/min)
    if (this.isSimpleQuery(prompt) && this.groqQuota.available()) {
      try {
        return await this.groqComplete(prompt);
      } catch (e) {
        // Groq fails often under load; fall through to the next provider
      }
    }

    // Claude for complex queries (via API key)
    if (this.requiresReasoning(prompt) && this.claudeCredits > 0) {
      return await this.claudeComplete(prompt);
    }

    // Local Llama model as last resort
    return await this.localComplete(prompt);
  }
}
```
Groq's free tier is generous but unreliable. Rate limits hit randomly. Errors spike during peak hours. Claude API calls cost money, so they're reserved for high-value interactions. The local Llama 3.1 8B model runs on 2 CPU cores — slow but always available.
Cache hit rate determines viability. I maintain 85%+ through aggressive prompt normalization and semantic deduplication. Every cache miss costs either money (Claude) or latency (local model).
Operational Failure Modes
Zero-budget infrastructure fails in predictable ways. Here are the patterns I've learned to manage.
Memory pressure cascades:
Node.js garbage-collection pauses spike once heap usage passes roughly 80% of the `--max-old-space-size` cap. One agent's GC pause delays message processing. Delayed messages accumulate. Memory usage climbs. More GC pauses. The system spirals.
Solution: Proactive agent recycling. PM2 restarts each agent every 6 hours, staggered to maintain availability.
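One way to express that stagger is a PM2 ecosystem file with offset `cron_restart` schedules. The names, paths, and thresholds below are illustrative, not the actual AIdeazz config; `max_memory_restart` recycles a process before it hits systemd's hard 3G kill.

```javascript
// ecosystem.config.js (sketch): each agent restarts every 6 hours,
// offset by an hour so at most one agent is down at a time.
module.exports = {
  apps: [
    {
      name: 'telegram-support',
      script: '/opt/agents/telegram-support/index.js',
      cron_restart: '0 0,6,12,18 * * *',  // on the hour
      max_memory_restart: '2500M',        // soft limit under the 3G hard cap
    },
    {
      name: 'whatsapp-sales',
      script: '/opt/agents/whatsapp-sales/index.js',
      cron_restart: '0 1,7,13,19 * * *',  // offset by one hour
      max_memory_restart: '2500M',
    },
    // ...remaining agents, each offset a further hour
  ],
};
```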
Groq API degradation:
Free tier gets deprioritized during load. Response times jump from 200ms to 10+ seconds. Timeout handlers are critical:
```javascript
async groqCompleteWithTimeout(prompt, maxWait = 3000) {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), maxWait);
  try {
    const response = await fetch(this.groqEndpoint, {
      signal: controller.signal,
      // ... request config
    });
    return response;
  } catch (e) {
    if (e.name === 'AbortError') {
      this.metrics.increment('groq.timeout');
      throw new ModelTimeoutError();
    }
    throw e;
  } finally {
    clearTimeout(timeout);  // clear on both success and error paths
  }
}
```
SQLite lock contention:
Multiple agents writing to the same database creates lock timeouts. Write-ahead logging helps but isn't magic. I batch writes and use async queues:
```javascript
class BatchedWriter {
  constructor(db, batchSize = 100, flushInterval = 1000) {
    this.queue = [];
    this.db = db;
    this.batchSize = batchSize;
    this.insert = db.prepare('INSERT INTO events (data) VALUES (?)');  // prepare once
    this.insertMany = db.transaction((items) => {
      for (const item of items) {
        this.insert.run(JSON.stringify(item));
      }
    });
    setInterval(() => this.flush(), flushInterval);
  }

  write(data) {
    this.queue.push(data);
    if (this.queue.length >= this.batchSize) {
      this.flush();
    }
  }

  flush() {
    // Drain the whole queue; one transaction per batch keeps each lock short
    while (this.queue.length > 0) {
      this.insertMany(this.queue.splice(0, this.batchSize));
    }
  }
}
```
Monitoring on Zero Budget
No Datadog. No New Relic. Monitoring happens through systemd journals and custom SQLite tables.
Metrics collection:
```javascript
const Database = require('better-sqlite3');

class MetricsCollector {
  constructor(dbPath) {
    this.db = new Database(dbPath);
    this.buffer = new Map();
    // Flush metrics every 10 seconds
    setInterval(() => this.flush(), 10000);
  }

  increment(metric, value = 1) {
    const current = this.buffer.get(metric) || 0;
    this.buffer.set(metric, current + value);
  }

  flush() {
    const timestamp = Date.now();
    const stmt = this.db.prepare(
      'INSERT INTO metrics (metric, value, timestamp) VALUES (?, ?, ?)'
    );
    for (const [metric, value] of this.buffer.entries()) {
      stmt.run(metric, value, timestamp);
    }
    this.buffer.clear();
  }
}
```
Health checks via cron and systemd:
```bash
#!/bin/bash
# /opt/agents/health-check.sh
# Check each agent's health endpoint; restart the unit on any non-200

agents=("telegram-support:3001" "whatsapp-sales:3002" "orchestrator:3003")

for agent in "${agents[@]}"; do
  response=$(curl -s -o /dev/null -w "%{http_code}" "http://localhost:${agent#*:}/health")
  if [ "$response" != "200" ]; then
    systemctl restart "agent-${agent%:*}.service"
    echo "$(date): Restarted ${agent%:*}" >> /var/log/agent-restarts.log
  fi
done
```
Run this every minute via cron. Basic but effective.
Production Learnings
After 8 months running this multi-agent AI system in production:
What works:
- PM2 cluster mode with 1 worker per agent provides isolation without containers
- SQLite handles 50K events/day reliably with proper indexing
- Semantic caching reduces AI API calls by 85%+
- Groq free tier handles 70% of simple queries
- Local Llama models provide reliable fallback
What doesn't:
- Complex orchestration patterns (actor model, event sourcing) need real infrastructure
- Debugging distributed flows across agents is painful without proper tracing
- SQLite write locks become a bottleneck beyond 100 writes/second
- CPU-based local inference is too slow for real-time requirements
- No redundancy means 10-15 minutes downtime monthly for updates
Hard limits discovered:
- 6 concurrent agents maximum before context switching kills performance
- 3GB memory per Node.js process before GC pauses impact latency
- 1000 messages/minute aggregate throughput across all agents
- 30-second maximum processing time before Telegram/WhatsApp webhooks timeout
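The 30-second webhook ceiling is why the only viable pattern is to acknowledge instantly and do the slow model call out-of-band. A minimal sketch of that split, with illustrative function names (the post doesn't show this code; `onWebhook` would be called from the HTTP route handler, `drainOne` from a worker loop):

```javascript
// Ack-then-process: the webhook path only enqueues and returns 200;
// the worker loop does the multi-second model call separately.
const queue = [];

// Called from the HTTP route handler; must return in milliseconds.
function onWebhook(update) {
  queue.push(update);
  return { status: 200 };  // the platform sees an instant ack
}

// Called from a setInterval worker; may take seconds per update.
async function drainOne(process) {
  const update = queue.shift();
  if (update) await process(update);  // e.g. route to ModelRouter, send reply
  return queue.length;                // remaining backlog
}
```

Backlog length is also the natural load signal: when `queue.length` grows faster than the worker drains it, the system is over its 1000 messages/minute ceiling.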
This architecture serves 200+ daily active users across messaging platforms. Response times average 1.2 seconds for cached queries, 8 seconds for complex Claude routing. Not Silicon Valley scale, but viable for bootstrapped AI products.
The constraint of free infrastructure forces focus. Every component must justify its resource usage. Every optimization matters. There's elegance in building multi-agent systems that run forever on hardware you never pay for.