
Arthur Palyan

From Crash Loops to Self-Healing Infrastructure

Tags: ai, infrastructure, devops, startup, mcp
Author: Roman Palyan (TeacherBot) - Levels of Self


We run 28 LLM-powered processes on a single $12/month VPS. Telegram bots, Instagram responders, web APIs, proxy layers, MCP servers, and a full governance system. Total monthly burn for the entire operation: $352.

This is not a demo. These are production agents serving real users, 24/7. And we nearly lost the whole thing to a crash loop we could not see.

This is the story of how we went from "everything is on fire but looks fine" to self-healing infrastructure that governs itself.

The Architecture: 28 Processes, 4GB of RAM

Our system runs on a VPS with 3,915MB of total RAM. Here is what shares that space:

Agent Bots (the family):

  • Lily (Telegram, Instagram, Web) - life coaching
  • Harry - book recommendations
  • Nick - fitness training
  • Spartak - translation
  • Kris - research and job hunting
  • Lou - content personalization and grants
  • Aram - legal assistance
  • Harout - real estate
  • Corona, Soriano - specialized bots

Infrastructure Layer:

  • max-proxy - LLM API routing
  • llm-bridge - inter-agent communication
  • bridge-ratelimit - API rate limiting
  • family-home - web dashboard
  • bots-app - unified bot platform

Governance Layer:

  • mcp-nervous-system - drift audit, kill switch, audit chain
  • mcp-ops-server - operational tooling
  • mcp-server - MCP protocol gateway
  • mcp-checkout - payment processing
  • auto-propagator - configuration sync

Average memory per online process: approximately 73MB. Total memory in use by the 23 online processes: around 1,689MB. That leaves about 2,100MB available for the OS, caches, and burst operations.

Every megabyte matters.

The Crash Loop That Looked Like Success

On March 12, 2026, our system status showed 23 processes online, CPU at 0% across the board, all health checks passing. By every standard metric, we were healthy.

We were not healthy.

Two processes - mcp-nervous-system and mcp-checkout - had accumulated 643 restarts between them. They were crash-looping: starting, running for a few seconds, crashing, and restarting. pm2 dutifully restarted them each time. The status showed "online" because at any given moment, the process was technically running.

This is the fundamental problem with restart-based recovery: it masks failures as uptime.

Why Traditional Monitoring Missed It

Here is what standard monitoring sees:

  • Process status: online (correct - it IS online, for a few seconds at a time)
  • CPU usage: 0% (correct - crash-restart cycles are too brief to register)
  • Memory: 60MB (correct - fresh processes start small)
  • HTTP health check: 200 OK (if the check hits during the brief "up" window)

Everything green. Everything broken.

The missing metric is restart velocity - how many times has this process restarted in a given window? A process with 324 restarts is not "online." It is in a crash loop wearing a green badge.
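The check is straightforward to script. Here is a minimal sketch in Python that flags crash loops from the JSON that `pm2 jlist` emits, assuming the standard shape of that output (each entry carries a `pm2_env.restart_time` counter and a `pm2_env.status` field). The threshold value is illustrative; a fuller version would track the delta in restarts over a time window rather than the lifetime total:

```python
import json
import subprocess

RESTART_ALERT_THRESHOLD = 5  # illustrative; our healthy bots stay at 0-2 restarts

def restart_velocity_report(jlist_json: str) -> list[dict]:
    """Flag processes whose restart count says 'crash loop' even when status says 'online'."""
    flagged = []
    for proc in json.loads(jlist_json):
        restarts = proc["pm2_env"]["restart_time"]
        status = proc["pm2_env"]["status"]
        if restarts >= RESTART_ALERT_THRESHOLD:
            flagged.append({"name": proc["name"], "status": status, "restarts": restarts})
    return flagged

# In production you would feed it the live process list:
#   report = restart_velocity_report(subprocess.check_output(["pm2", "jlist"], text=True))
```

A process that is genuinely online and a process that restarted 324 times both report `status: "online"`; only the restart counter separates them.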

Building the Self-Healing Layer

We solved this by building governance into the infrastructure itself, not as an external monitor but as a co-resident system that understands intent.

1. Drift Detection Over Status Checks

Instead of asking "is this process running?", drift detection asks "is this process behaving as expected?"

Expected behavior includes:

  • Restart count within normal range (0-2 for most bots)
  • Memory within budget (under 200MB per bot)
  • Uptime consistent with last known deploy time
  • Configuration matching the declared state

A process showing "online" with 324 restarts triggers a drift alert. A bot using 60MB that spikes to 200MB triggers a drift alert. A configuration file that changed without a logged governance action triggers a drift alert.
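The comparison of observed state against expected state can be sketched in a few lines. This is a simplified illustration, not our actual drift auditor; the threshold values come from the budgets above:

```python
from dataclasses import dataclass

@dataclass
class Expected:
    """Declared normal behavior for a process."""
    max_restarts: int = 2     # normal range for most bots is 0-2
    max_memory_mb: int = 200  # per-bot memory budget

def detect_drift(name: str, restarts: int, memory_mb: int,
                 expected: Expected = Expected()) -> list[str]:
    """Return drift alerts: deviations of observed behavior from expected behavior."""
    alerts = []
    if restarts > expected.max_restarts:
        alerts.append(f"{name}: {restarts} restarts exceeds normal range (0-{expected.max_restarts})")
    if memory_mb > expected.max_memory_mb:
        alerts.append(f"{name}: {memory_mb}MB exceeds {expected.max_memory_mb}MB budget")
    return alerts
```

The key design point is that `Expected` is declared up front, per process, rather than inferred from whatever the process happens to be doing.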

2. Memory Budgeting

With 4GB of RAM shared across 28 processes, memory governance is not optional. Here are our real production thresholds:

  • Per-bot ceiling: 200MB. Any bot exceeding this gets auto-restarted with a clean state.
  • System floor: 500MB available. When system available memory drops below this, we trigger a flush cycle - identify the highest-memory non-critical processes and restart them.
  • Average target: ~73MB per process. This gives us headroom for burst operations (LLM API calls, file processing) without hitting the system floor.

These are not theoretical limits. They run in production today. The system currently shows 1,689MB used across 23 processes with 2,103MB available - well within budget.
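The flush-cycle logic looks roughly like this. It is a sketch under the stated thresholds, with made-up process names; it assumes restarting a process reclaims approximately its current footprint, which is optimistic but close enough for selection purposes:

```python
SYSTEM_FLOOR_MB = 500    # minimum available memory before a flush cycle triggers
PER_BOT_CEILING_MB = 200 # any process above this is restarted regardless

def flush_candidates(procs: list[tuple[str, int, bool]], available_mb: int) -> list[str]:
    """procs: (name, memory_mb, is_critical). Returns names of processes to restart."""
    # Ceiling violations always get restarted with a clean state.
    to_restart = [name for name, mem, _ in procs if mem > PER_BOT_CEILING_MB]
    if available_mb >= SYSTEM_FLOOR_MB:
        return to_restart
    # Below the floor: reclaim memory from the fattest non-critical processes first.
    deficit = SYSTEM_FLOOR_MB - available_mb
    reclaimed = 0
    for name, mem, critical in sorted(procs, key=lambda p: -p[1]):
        if critical or name in to_restart:
            continue
        to_restart.append(name)
        reclaimed += mem  # approximation: a fresh restart frees most of this
        if reclaimed >= deficit:
            break
    return to_restart
```

Marking the governance processes themselves as critical keeps the flush cycle from eating its own nervous system.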

3. Thin Soul / Thick Soul Architecture

Not all agents need the same resources. We use a two-tier approach:

Thin soul agents run lightweight - minimal context, fast responses, low memory. These handle routine operations: translation, simple lookups, status checks. They stay under 65MB and restart cleanly.

Thick soul agents maintain rich context - conversation history, user preferences, session state. These are the coaching bots, the personalization engines, the research workers. They run at 75-90MB and need careful memory management.

The distinction matters for cost control. Every LLM API call costs money. A thin soul agent making a quick translation does not need a 4,000-token system prompt with full context. A thick soul agent doing life coaching needs that context to be effective.

By matching the soul size to the task, we keep our total LLM API costs under $300/month for 13+ active agents. That is roughly $23 per agent per month for full LLM-powered operation.
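In configuration terms, the tiering is just a small table mapping each agent to a soul profile. The numbers below are the budgets from this section; the agent-to-tier assignment is an illustrative subset, not our full roster:

```python
SOUL_TIERS = {
    "thin":  {"max_memory_mb": 65, "system_prompt_tokens": 500,  "keeps_history": False},
    "thick": {"max_memory_mb": 90, "system_prompt_tokens": 4000, "keeps_history": True},
}

AGENTS = {  # illustrative assignment
    "spartak": "thin",   # translation: quick, stateless
    "harry":   "thin",   # book lookups
    "lily":    "thick",  # life coaching: needs conversation history
    "kris":    "thick",  # research: needs session state
}

def prompt_budget(agent: str) -> int:
    """Token budget for the agent's system prompt, based on its soul tier."""
    return SOUL_TIERS[AGENTS[agent]]["system_prompt_tokens"]
```

The cost lever is the `system_prompt_tokens` field: a thin agent pays for 500 tokens of context per call instead of 4,000.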

4. Protected File Enforcement

Our system has 89 files marked UNTOUCHABLE - core bot logic, configuration files, governance rules. No automated process can modify them. Period.

A second tier of PROTECTED files (critical operational code) requires explicit human approval for any change. Every access attempt is logged, whether it succeeds or not.

This prevents a common failure mode in multi-agent systems: Agent A decides to "fix" a configuration file that Agent B depends on, breaking Agent B, which triggers Agent C's error handler, which overwrites its own config trying to recover. Cascade failure from a helpful agent.

Protected files break the cascade. No agent can start the chain.
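The enforcement gate reduces to a simple policy check that logs every attempt, allowed or denied. A minimal sketch, with hypothetical file paths standing in for the real protected set:

```python
UNTOUCHABLE = {"config/governance.json"}  # hypothetical paths; no automated change, ever
PROTECTED   = {"bots/lily/core.py"}       # changes require explicit human approval

audit_log: list[dict] = []

def request_write(path: str, actor: str, human_approved: bool = False) -> bool:
    """Gate every write to the filesystem; log the attempt regardless of outcome."""
    if path in UNTOUCHABLE:
        decision = "denied (untouchable)"
    elif path in PROTECTED and not human_approved:
        decision = "denied (needs human approval)"
    else:
        decision = "allowed"
    audit_log.append({"actor": actor, "path": path, "decision": decision})
    return decision == "allowed"
```

Note that the deny path still writes to the log: the access attempts you refused are often the most interesting entries when you reconstruct a cascade.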

5. Audit Chain

Every governance action gets logged to an append-only audit trail:

  • Process restarts (manual and automatic)
  • Drift detections and resolutions
  • Configuration changes
  • Kill switch activations
  • Memory threshold violations
  • Protected file access attempts

When something breaks - and in production, something always breaks - the audit chain tells you exactly what happened, when, and what triggered it. No guessing. No "well, I think someone might have changed..."
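An append-only trail becomes tamper-evident if each entry carries a hash of itself plus the previous entry's hash. This is a minimal sketch of that pattern, not the Nervous System's actual implementation:

```python
import hashlib
import json
import time

class AuditChain:
    """Append-only audit log where each entry is hash-chained to the previous one."""

    GENESIS = "0" * 64

    def __init__(self) -> None:
        self.entries: list[dict] = []
        self._last_hash = self.GENESIS

    def log(self, action: str, details: dict) -> None:
        entry = {"ts": time.time(), "action": action,
                 "details": details, "prev": self._last_hash}
        self._last_hash = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()).hexdigest()
        entry["hash"] = self._last_hash
        self.entries.append(entry)

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = self.GENESIS
        for e in self.entries:
            body = {k: e[k] for k in ("ts", "action", "details", "prev")}
            digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
            if e["prev"] != prev or digest != e["hash"]:
                return False
            prev = e["hash"]
        return True
```

With chaining, "someone might have changed the log" stops being a hypothesis you cannot test: `verify()` either passes or points at the first broken link.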

The Economics

Here is the full monthly cost breakdown:

  Item                           Cost
  VPS (4GB RAM, shared CPU)      $12
  LLM API (Anthropic Max plan)   $300
  Vercel (web hosting)           $20
  Calendly (scheduling)          $20
  Total                          $352/mo

For $352/month, we run:

  • 13+ active LLM-powered agents
  • Multi-platform presence (Telegram, Instagram, Web)
  • Full governance and audit infrastructure
  • Self-healing crash recovery
  • Rate limiting and API management

Compare this to the typical "enterprise AI" deployment: dedicated GPU instances, Kubernetes clusters, multiple monitoring SaaS subscriptions, dedicated DevOps team. Those run $5,000-$50,000/month for similar capability.

We are not saying our approach works for everyone. High-traffic applications need horizontal scaling. Latency-critical systems need dedicated compute. But for a startup building and validating LLM agents? $352/month buys you a lot of runway.

Lessons Learned

1. Restart counts are your most important metric. Not CPU, not memory, not latency. Restart count over time tells you whether your infrastructure is stable or just pretending to be.

2. Memory budgets are non-negotiable. Without hard limits, one misbehaving agent will consume all available RAM and take down every other process on the host. Set ceilings. Enforce them automatically.

3. Protected files prevent cascade failures. In multi-agent systems, the most dangerous agent is the helpful one. Lock down critical files so no agent can "fix" them without human approval.

4. Governance is not monitoring. Monitoring tells you what is happening. Governance tells you what should be happening and enforces the difference. Build governance first, monitoring second.

5. Start small, stay small. We could run on bigger hardware. We choose not to. Resource constraints force good architecture. When you have 4GB to share across 28 processes, you build efficient systems or you build nothing.

What Is Next

We are open-sourcing the governance layer as the Nervous System MCP server. It is already available on npm and GitHub:

  • GitHub: github.com/levelsofself/mcp-nervous-system
  • npm: npm install @levelsofself/mcp-nervous-system

If you are running multiple LLM agents in production - or planning to - you need a governance layer before you need another feature. Build the immune system before you build more organs.


Roman Palyan writes about production AI infrastructure at Levels of Self, a family-run startup where 12 family members each have their own LLM-powered agent. The whole system runs on one VPS because constraints breed innovation.
