
MFS CORP


We Cut Our AI Infrastructure by 60% (And Everything Got Better)

Today we gutted our entire AI infrastructure. What started as 10 containers and 5 VMs became 4 containers and 3 VMs. Here's why cutting was the best engineering decision we've made.

The Problem: Zombie Infrastructure

Over the past three weeks building MFS Corp — an AI-first company with autonomous agents — we accumulated cruft fast. Containers that never finished setup. VMs running services that migrated elsewhere. Cron jobs calling commands that didn't exist. Alert systems sending false positives hourly.

Sound familiar? This is what happens when you build fast without pruning.

Here's what we found during our audit:

  • 3 containers that never completed bootstrapping (still had their BOOTSTRAP.md files)
  • 21 system cron entries calling a script that used an invalid CLI command — every single one silently failing
  • 35,414 unprocessed message files in one agent's inbox (86MB of dead data)
  • 3 workflow automations sending false alerts every hour about "unreachable" services that were actually fine
  • 2 VMs consuming 80GB of RAM while running zero active services
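The dead-inbox number came from a simple walk over the message directory. A minimal sketch of that kind of count (the paths here are throwaway examples, not our real inbox layout):

```python
import tempfile
from pathlib import Path

def inbox_stats(inbox_dir):
    """Count message files and total bytes under an agent inbox directory."""
    files = [p for p in Path(inbox_dir).rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)

# Demo against a throwaway directory (the real inbox path is agent-specific).
with tempfile.TemporaryDirectory() as d:
    for i in range(3):
        (Path(d) / f"msg-{i}.json").write_text("{}")
    count, size = inbox_stats(d)
    print(f"{count} files, {size} bytes")  # 3 files, 6 bytes
```

Run that against every agent inbox and the 86MB of dead data surfaces immediately.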

We didn't have an infrastructure problem. We had an infrastructure debt problem.

The Audit Process

We built a systematic inventory script that SSH'd into every VM and container, checking:

  1. Is it running? (basic health check)
  2. Does it have active cron jobs? (is it doing work?)
  3. When was it last active? (recent memory files, session logs)
  4. What resources is it consuming? (RAM, disk, CPU)
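The decision logic behind those four questions can be sketched as a small classifier over the collected metrics. The field names and thresholds here are our own shorthand, not a real API:

```python
from dataclasses import dataclass

@dataclass
class HostReport:
    name: str
    running: bool
    active_cron_jobs: int
    days_since_activity: int
    ram_gb: float

def verdict(report: HostReport, stale_after_days: int = 7) -> str:
    """Classify a VM/container as keep or cut based on the four audit questions."""
    if not report.running:
        return "cut: not running"
    if report.active_cron_jobs == 0 and report.days_since_activity > stale_after_days:
        # Alive but idle: burning RAM with no scheduled or recent work.
        return f"cut: idle {report.days_since_activity}d, {report.ram_gb}GB wasted"
    return "keep"

print(verdict(HostReport("model-hub", True, 0, 21, 64.0)))
print(verdict(HostReport("morgan", True, 13, 0, 8.0)))
```

Gathering the inputs is the SSH loop; the verdict itself is this simple.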

The results were sobering. Only 4 out of 10 containers were doing meaningful work. Two VMs existed solely because we forgot to shut them down after migrating their services.

What We Cut

VMs Shut Down (80GB RAM Freed)

  • Model Hub VM (64GB) — Originally ran our local LLM inference. We migrated Ollama to the bare-metal Proxmox host weeks ago but never stopped this VM. It was consuming 64GB of RAM to run... nothing.
  • Experimental VM (16GB) — A Claude Desktop experiment that was never used after day one.

Containers Removed

  • Three agent containers that never completed their initial configuration
  • Workflow automation container (n8n) — all three workflows were sending false alerts, all deactivated

Cron Jobs Purged

  • 21 system cron entries using a non-existent CLI command
  • 4 standup crons sending messages to containers that no longer exist
  • Various broken maintenance scripts
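Finding silently failing cron entries is mechanical once you look for it. A sketch of the check we ran, handling the user-crontab format (system crontabs in `/etc/cron.d` add a user field after the schedule, which this simplified version ignores):

```python
import shutil

def broken_cron_lines(crontab_text):
    """Flag crontab lines whose command can't be found on PATH.

    These are the silent failures: cron runs them, they exit non-zero,
    and nothing surfaces unless you read local mail or syslog.
    """
    broken = []
    for line in crontab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" in line.split()[0]:
            continue  # skip blanks, comments, and env-var assignments
        fields = line.split(None, 6)
        if len(fields) < 6:
            continue
        command = fields[5]  # first token after the 5 schedule fields
        if shutil.which(command) is None:
            broken.append(line)
    return broken
```

Twenty-one of our entries pointed at a CLI command that no longer existed; every one showed up in this list.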

What We Kept (And Fixed)

The Lean Stack

| Component | Purpose | Status |
| --- | --- | --- |
| Morgan | Primary research hub, 13 automated research crons | ✅ Healthy |
| Strategy | Content creation, news analysis | ✅ Healthy |
| Crypto Bot | Telegram bot with real-time price/whale/sentiment alerts | ✅ Fixed |
| Search | SearXNG instance for agent web research | ✅ Running |

Key Fix: Ollama Endpoint Migration

Every container was configured to hit our old model server at a VM that no longer runs Ollama. We updated all configs to point to the bare-metal host where Ollama actually runs.

Before: Sentiment analysis failing, research crons getting timeouts
After: Everything resolved on first try
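The migration itself was a find-and-replace across config files. A hedged sketch of the approach (the hostnames are hypothetical placeholders; 11434 is Ollama's default port):

```python
from pathlib import Path

OLD = "http://model-hub:11434"      # dead VM endpoint (hostname hypothetical)
NEW = "http://proxmox-host:11434"   # bare-metal host (hostname hypothetical)

def migrate_endpoint(config_dir, old=OLD, new=NEW, dry_run=True):
    """Rewrite every config file still pointing at the old Ollama host.

    Returns the files that matched; with dry_run=True nothing is written,
    so you can review the blast radius before committing.
    """
    changed = []
    for path in Path(config_dir).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text()
        except (UnicodeDecodeError, PermissionError):
            continue  # skip binaries and unreadable files
        if old in text:
            changed.append(path)
            if not dry_run:
                path.write_text(text.replace(old, new))
    return changed
```

A dry run first shows exactly which configs carry the stale endpoint; in our case it was 6 files across 4 containers.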

Key Fix: Workflow False Alerts

Three n8n workflows were spamming our notification channel:

  • One checked Ollama health against the dead VM endpoint
  • One checked infrastructure but treated a normal auth response as "unreachable"
  • One sent "Articles fetched" alerts for routine operations

We deactivated all three directly in the PostgreSQL database and stopped the container. If we need workflow automation again, we'll rebuild with proper health check logic.
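The deactivation was one UPDATE statement. A sketch of the pattern, run here against an in-memory SQLite stand-in because the real data lives in PostgreSQL (n8n keeps workflows in a `workflow_entity` table with an `active` flag, though exact schema details vary by version):

```python
import sqlite3

# Stand-in for n8n's Postgres schema; real table/column names may differ by version.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workflow_entity (id INTEGER PRIMARY KEY, name TEXT, active INTEGER)"
)
conn.executemany(
    "INSERT INTO workflow_entity (name, active) VALUES (?, ?)",
    [("ollama-health", 1), ("infra-check", 1), ("articles-fetched", 1)],
)

# The actual fix: flip every workflow to inactive in one statement.
conn.execute("UPDATE workflow_entity SET active = 0")
conn.commit()

still_active = conn.execute(
    "SELECT COUNT(*) FROM workflow_entity WHERE active = 1"
).fetchone()[0]
print(still_active)  # 0
```

Restart the n8n container (or stop it, as we did) after editing the database directly, since n8n caches active workflows in memory.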

The Result

Before:

  • 5 VMs, 10 containers
  • ~150GB RAM allocated
  • 21 broken cron jobs
  • Hourly false alerts
  • 4 containers doing nothing

After:

  • 3 VMs, 4 containers
  • ~53GB RAM allocated (~65% reduction)
  • 0 broken cron jobs
  • 0 false alerts
  • Everything running has a clear purpose

Server load went from scattered across zombies to concentrated on actual work. The remaining agents respond faster because they're not competing for resources with dead processes.

Lessons for Anyone Running AI Infrastructure

1. Audit Ruthlessly, Audit Often

If a container hasn't done meaningful work in a week, it shouldn't exist. We're adding a weekly automated audit now.

2. Fix the Config, Not the Symptom

Our Ollama endpoint was wrong in 6 different config files. Chasing individual failures would have taken forever. Finding the root cause fixed everything at once.

3. Zombie Resources Are Expensive

Not just in compute — in cognitive overhead. Every extra component is something that can break, something you have to think about, something that generates noise in your monitoring.

4. Build a Kill Switch Mentality

If you can stop something and nothing breaks, it wasn't needed. We stopped 6 containers and 2 VMs. Nothing broke. That tells you everything.

What's Next

With lean infrastructure, we're shifting focus entirely to output:

  • Automated daily content pipeline (articles written by AI, published to DEV.to and Hashnode)
  • Cross-platform distribution (articles auto-syndicated with canonical URLs)
  • Revenue tracking (currently ~$50/month from DEV.to, targeting growth through volume and quality)

The infrastructure serves the mission now, not the other way around.


This is part of an ongoing series about building an AI-first company. All numbers and events described here happened today. No embellishment.
