I've been building and running a production multi-agent AI system for the past 6 months. 23 specialized agents, all running 24/7 on a self-hosted VPS (n8n + Claude + DeepSeek + Gemini).
This is not a tutorial. This is the honest breakdown of what actually broke in production, and how I fixed each problem.
## The Setup
- Platform: Self-hosted n8n (Docker, VPS)
- Agents: 23 specialized agents (trading, content, monitoring, outreach, research, operations)
- Models: Claude Sonnet/Haiku, GPT-4o/mini, DeepSeek Chat, Gemini Flash
- Infrastructure: Traefik reverse proxy, Redis, PostgreSQL, Slack for alerts
## Problem 1: API Costs Exploded
What happened: All agents defaulted to GPT-4o. By month 2, costs hit $180/month.
Root cause: No model routing. Every query — whether it was "format this JSON" or "analyze this 5000-word document" — went to the most expensive model.
The fix: A query classification layer before every LLM call.
```javascript
// Complexity classifier (n8n Code node)
const query = $input.first().json.query;
const wordCount = query.split(' ').length;
const hasCode = /```|function|class |import /.test(query);
const isComplex = hasCode || wordCount > 150;
const isMedium = wordCount > 50 && !isComplex;

if (isComplex) return [{ json: { model: 'claude-sonnet-4-5' } }];
if (isMedium) return [{ json: { model: 'claude-haiku-3-5' } }];
return [{ json: { model: 'deepseek-chat' } }];
```
Result: 78% of queries classified as simple → DeepSeek at $0.001/query. New monthly cost: $22/month.
## Problem 2: No Fallback = Total Outages
What happened: DeepSeek had a 2-hour outage. All agents using it stopped completely.
Root cause: Single provider, no fallback chain.
The fix: Primary → Secondary → Tertiary chain for every agent.
```plaintext
DeepSeek (primary) → Gemini Flash (secondary) → Claude Haiku (tertiary)
```
One afternoon of work. 3 outages prevented since implementation.
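A minimal sketch of the chain, assuming each provider is wrapped in an async `call` function (names here are illustrative, not the production workflow; in n8n this maps to an error-output branch per HTTP Request node):

```javascript
// Hypothetical fallback chain: try providers in priority order,
// return the first successful response, throw only if all fail.
async function callWithFallback(query, providers) {
  const errors = [];
  for (const provider of providers) {
    try {
      return await provider.call(query);
    } catch (err) {
      errors.push(`${provider.name}: ${err.message}`);
    }
  }
  throw new Error(`All providers failed: ${errors.join('; ')}`);
}
```

The key design choice is that a single provider outage degrades latency slightly but never availability: the chain only fails when every tier is down at once.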
## Problem 3: Agents Got Stuck in Infinite Loops
What happened: An agent failed on step 3 of 7, retried indefinitely, consumed 40k tokens in 20 minutes.
Root cause: No max-attempt counter, no exit condition for failed states.
The fix:
- Max attempt counter per task (configurable per agent)
- Dead letter queue: failed tasks after max attempts go to a review channel
- Alert on Slack for any task hitting the dead letter queue
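The pattern behind those three bullets can be sketched like this (the `deadLetter` array stands in for the real review channel, and the names are mine, not the production code):

```javascript
// Hypothetical retry guard: run a task at most maxAttempts times,
// then push it to a dead letter queue instead of looping forever.
async function runWithRetry(task, maxAttempts, deadLetter) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task.run();
    } catch (err) {
      if (attempt === maxAttempts) {
        // Final failure: record it for human review and trigger the alert path.
        deadLetter.push({ task: task.name, error: err.message, attempts: attempt });
        return null;
      }
    }
  }
}
```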
## Problem 4: Memory Was Fragile
What happened: After a VPS restart, agents lost all context. They had to "re-learn" their current state from scratch.
Root cause: Relying on in-context memory only. No persistent state.
The fix: PostgreSQL for all agent state. Every agent checks its state table on startup.
```sql
CREATE TABLE agent_state (
  agent_id TEXT PRIMARY KEY,
  current_task JSONB,
  last_checkpoint TIMESTAMP,
  attempt_count INT DEFAULT 0
);
```
Agents now survive restarts cleanly.
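The startup check can be sketched like this, with a `Map` standing in for the Postgres table (the real node would run a `SELECT`, and an upsert, against `agent_state`; the helper names are illustrative):

```javascript
// Hypothetical state helpers: resume from a saved row if one exists,
// otherwise initialize a fresh row matching the agent_state schema.
function loadOrInitState(store, agentId) {
  const saved = store.get(agentId);
  if (saved) return saved; // survives restarts: pick up where we left off
  const fresh = { agent_id: agentId, current_task: null, last_checkpoint: null, attempt_count: 0 };
  store.set(agentId, fresh);
  return fresh;
}

// Save progress after each completed step so a crash loses at most one step.
function checkpoint(store, agentId, task) {
  const state = loadOrInitState(store, agentId);
  state.current_task = task;
  state.last_checkpoint = new Date().toISOString();
  store.set(agentId, state);
}
```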
## Problem 5: Silent Failures Were Invisible
What happened: An agent stopped sending reports. Nobody noticed for 3 days.
Root cause: No monitoring. No alerts. No health checks.
The fix: A monitoring layer built into n8n:
- Agent health check: cron every 5 minutes
- Webhook availability check: every 10 minutes
- Cost tracking: daily report with per-agent breakdown
- Slack alert on any anomaly
First week after setup: Found 2 silent failures I had no idea existed.
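The heartbeat check behind the 5-minute cron can be sketched as follows, assuming each agent writes a last-seen timestamp somewhere the monitor can read (field names are illustrative):

```javascript
// Hypothetical heartbeat check: any agent whose last heartbeat is older
// than the allowed window gets flagged for a Slack alert.
function findStaleAgents(heartbeats, nowMs, maxAgeMs) {
  return Object.entries(heartbeats)
    .filter(([, lastSeenMs]) => nowMs - lastSeenMs > maxAgeMs)
    .map(([agentId]) => agentId);
}
```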
## What I'd Do Differently From Day 1
- Design for failure first — assume every API, agent, and connection will fail
- Add observability before adding agents — you can't fix what you can't see
- Use a model router from the start — the cost savings pay for the implementation time in week 1
- Define strict agent boundaries — scope creep in agent responsibilities creates unpredictable behavior
- Build the dead letter queue early — failed tasks need a destination, not an infinite retry loop
## Current State
After fixing all 5 problems:
- Cost: $22/month (down from $180)
- Uptime: 99.3% over last 30 days
- Agents: 58 running (scaled up from 23 after costs were under control)
- Monitoring: Full visibility, Slack alerts on any anomaly
Building multi-agent systems in production is nothing like the tutorials. The hard part isn't connecting the nodes — it's making the system resilient when things go wrong (and they will).
What's been the hardest part of your production agent setup? Happy to go deeper on any of the solutions above.