Merzouk Ayaden
I Ran 23 AI Agents 24/7 for 6 Months: Here's What Actually Broke (and How I Fixed It)

I've been building and running a production multi-agent AI system for the past 6 months. 23 specialized agents, all running 24/7 on a self-hosted VPS (n8n + Claude + DeepSeek + Gemini).

This is not a tutorial. This is the honest breakdown of what actually broke in production, and how I fixed each problem.

The Setup

  • Platform: Self-hosted n8n (Docker, VPS)
  • Agents: 23 specialized agents (trading, content, monitoring, outreach, research, operations)
  • Models: Claude Sonnet/Haiku, GPT-4o/mini, DeepSeek Chat, Gemini Flash
  • Infrastructure: Traefik reverse proxy, Redis, PostgreSQL, Slack for alerts

Problem 1: API Costs Exploded

What happened: All agents defaulted to GPT-4o. By month 2, costs hit $180/month.

Root cause: No model routing. Every query — whether it was "format this JSON" or "analyze this 5000-word document" — went to the most expensive model.

The fix: A query classification layer before every LLM call.

```javascript
// Complexity classifier (n8n Code node)
const query = $input.first().json.query;
const wordCount = query.split(' ').length;
const hasCode = /`{3}|function|class |import /.test(query);
const isComplex = hasCode || wordCount > 150;
const isMedium = wordCount > 50 && !isComplex;

if (isComplex) return [{ json: { model: 'claude-sonnet-4-5' } }];
if (isMedium) return [{ json: { model: 'claude-haiku-3-5' } }];
return [{ json: { model: 'deepseek-chat' } }];
```


Result: 78% of queries classified as simple → DeepSeek at $0.001/query. New monthly cost: $22/month.

Problem 2: No Fallback = Total Outages

What happened: DeepSeek had a 2-hour outage. All agents using it stopped completely.

Root cause: Single provider, no fallback chain.

The fix: Primary → Secondary → Tertiary chain for every agent.


```plaintext
DeepSeek (primary) → Gemini Flash (secondary) → Claude Haiku (tertiary)
```

One afternoon of work. 3 outages prevented since implementation.
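The chain itself is simple. Here's a minimal sketch of the pattern (the `provider` objects and `callWithFallback` name are illustrative, not my exact n8n nodes): try each provider in order, and only fail if every one of them fails.

```javascript
// Try each provider in order; return the first successful response.
// Only throw if the entire chain is exhausted.
async function callWithFallback(providers, prompt) {
  const errors = [];
  for (const provider of providers) {
    try {
      return await provider.call(prompt);
    } catch (err) {
      // Record the failure and fall through to the next provider
      errors.push(`${provider.name}: ${err.message}`);
    }
  }
  throw new Error(`All providers failed:\n${errors.join('\n')}`);
}
```

In n8n this maps to an error-output branch per HTTP Request node, but the control flow is the same.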

Problem 3: Agents Got Stuck in Infinite Loops

What happened: An agent failed on step 3 of 7, retried indefinitely, consumed 40k tokens in 20 minutes.

Root cause: No max-attempt counter, no exit condition for failed states.

The fix:

  • Max attempt counter per task (configurable per agent)
  • Dead letter queue: failed tasks after max attempts go to a review channel
  • Alert on Slack for any task hitting the dead letter queue
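The core of the fix fits in one function. A sketch, assuming a simple in-memory queue (my real setup persists the dead letter entries and posts them to Slack; `maxAttempts` is configured per agent):

```javascript
// Retry a task a bounded number of times; after the last failure,
// park it in the dead letter queue instead of retrying forever.
async function runWithRetry(task, handler, { maxAttempts = 3, deadLetterQueue = [] } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await handler(task);
    } catch (err) {
      if (attempt === maxAttempts) {
        deadLetterQueue.push({ task, attempts: attempt, lastError: err.message });
        return null; // give up: the task now waits for human review
      }
    }
  }
}
```

The key property: a failing task costs at most `maxAttempts` LLM calls, never 40k tokens.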

Problem 4: Memory Was Fragile

What happened: After a VPS restart, agents lost all context. They had to "re-learn" their current state from scratch.

Root cause: Relying on in-context memory only. No persistent state.

The fix: PostgreSQL for all agent state. Every agent checks its state table on startup.


```sql
CREATE TABLE agent_state (
  agent_id TEXT PRIMARY KEY,
  current_task JSONB,
  last_checkpoint TIMESTAMP,
  attempt_count INT DEFAULT 0
);
```

Agents now survive restarts cleanly.
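The startup logic is just "load checkpoint or start fresh." A sketch of that flow, with an in-memory `Map` standing in for PostgreSQL (the real version is an UPSERT and a SELECT against `agent_state`; function names are mine for illustration):

```javascript
// Persist the agent's current task and attempt count (mirrors an
// INSERT ... ON CONFLICT UPDATE against the agent_state table).
function saveCheckpoint(store, agentId, task, attemptCount = 0) {
  store.set(agentId, {
    current_task: task,
    last_checkpoint: new Date().toISOString(),
    attempt_count: attemptCount,
  });
}

// On startup, restore prior state if it exists; null means fresh start.
function loadCheckpoint(store, agentId) {
  return store.get(agentId) ?? null;
}
```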

Problem 5: Silent Failures Were Invisible

What happened: An agent stopped sending reports. Nobody noticed for 3 days.

Root cause: No monitoring. No alerts. No health checks.

The fix: A monitoring layer built into n8n:

  • Agent health check: cron every 5 minutes
  • Webhook availability check: every 10 minutes
  • Cost tracking: daily report with per-agent breakdown
  • Slack alert on any anomaly

First week after setup: Found 2 silent failures I had no idea existed.
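The health check boils down to one question: is any agent's last heartbeat older than the check interval plus a grace period? A sketch of that comparison (the 5-minute threshold matches my cron; the heartbeat field and function name are assumptions):

```javascript
// Return the IDs of agents whose last heartbeat is older than maxAgeMs.
// `now` is passed in explicitly so the check is deterministic and testable.
function findUnhealthy(agents, now, maxAgeMs = 5 * 60 * 1000) {
  return agents
    .filter((a) => now - new Date(a.lastHeartbeat).getTime() > maxAgeMs)
    .map((a) => a.agentId);
}
```

Anything this function returns goes straight to a Slack alert.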

What I'd Do Differently From Day 1

  1. Design for failure first — assume every API, agent, and connection will fail
  2. Add observability before adding agents — you can't fix what you can't see
  3. Use a model router from the start — the cost savings pay for the implementation time in week 1
  4. Define strict agent boundaries — scope creep in agent responsibilities creates unpredictable behavior
  5. Build the dead letter queue early — failed tasks need a destination, not an infinite retry loop

Current State

After fixing all 5 problems:

  • Cost: $22/month (down from $180)
  • Uptime: 99.3% over last 30 days
  • Agents: 58 running (scaled up from 23 once costs were under control)
  • Monitoring: Full visibility, Slack alerts on any anomaly

Building multi-agent systems in production is nothing like the tutorials. The hard part isn't connecting the nodes — it's making the system resilient when things go wrong (and they will).

What's been the hardest part of your production agent setup? Happy to go deeper on any of the solutions above.
