How I Replaced 50 Cron Jobs with a Self-Healing Agent System

#node #ai #tutorial #backend

Three months ago, I deleted 50 cron jobs from our production infrastructure. Not because they stopped being useful — but because I built something better. A self-healing agent system that handles failures, retries, and scaling without me touching a terminal at 3 AM. Honestly, the thing that pushed me over the edge was a payment reconciliation job that failed for 6 hours and cost us $12,000 in missed invoices. Turns out, cron just doesn't cut it when you've got more than a handful of background jobs.

Here's exactly how I did it, the numbers it produced, and the code you can steal.

The Cron Job Nightmare

We had 50 cron jobs spread across 4 servers. Every week, at least 3 would fail silently. My team spent 10+ hours per month debugging. Jobs would overlap because cron doesn't handle concurrency. Failed jobs never retried. We'd get silent crashes when external APIs went down. And we had zero visibility into which jobs actually ran. The breaking point was that $12,000 loss.

The Agent Architecture

I built a Node.js agent system that runs each job as an independent, self-healing agent inside a single Node.js process. Here's the core pattern:

class Agent {
  constructor(config) {
    this.name = config.name;
    this.interval = config.interval;
    // ?? instead of || so that an explicit maxRetries: 0 is respected
    this.maxRetries = config.maxRetries ?? 3;
    this.healthCheck = config.healthCheck;
    this.task = config.task;
    this.state = 'idle';
    this.failures = 0;
  }

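  async run() {
    this.state = 'running';
    // Count failures per run; otherwise one exhausted run leaves later runs with no retries
    this.failures = 0;
    // Loop instead of recursing so retries don't build a chain of nested calls
    while (true) {
      try {
        const result = await this.task();
        this.failures = 0;
        this.state = 'healthy';
        return result;
      } catch (error) {
        this.failures++;
        if (this.failures > this.maxRetries) {
          // Retries exhausted: mark as failed so heal() can pick it up
          this.state = 'failed';
          throw error;
        }
        console.log(`[${this.name}] Retrying (${this.failures}/${this.maxRetries})`);
      }
    }
  }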

  async heal() {
    if (this.state === 'failed') {
      console.log(`[${this.name}] Attempting self-heal`);
      const healthy = await this.healthCheck();
      if (healthy) {
        this.failures = 0;
        this.state = 'idle';
        return true;
      }
    }
    return false;
  }
}

Each agent tracks its own state (idle, running, healthy, failed). No shared memory, no global locks. If one agent's task throws, the error stays inside that agent and the others keep running.
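
One gap worth flagging: run() retries immediately, which tends to hammer a dependency that is already struggling. Here's a minimal sketch of retrying with exponential backoff instead. The retryWithBackoff name, the baseDelayMs option, and the injectable sleep function are illustrative assumptions, not code from the system described here.

```javascript
// Hypothetical helper: retry an async task, doubling the wait after each failure.
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff(task, options = {}) {
  // maxRetries mirrors the agent default; baseDelayMs is an assumed starting delay
  const { maxRetries = 3, baseDelayMs = 500, sleep = defaultSleep } = options;
  let attempt = 0;
  while (true) {
    try {
      return await task();
    } catch (error) {
      attempt++;
      // Give up once the retry budget is spent and surface the last error
      if (attempt > maxRetries) throw error;
      // With the defaults this waits 500ms, then 1s, then 2s
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}
```

An agent could then be configured with something like task: () => retryWithBackoff(reconcilePayments), where reconcilePayments stands in for your own job function.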

The Orchestrator

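The real magic is the orchestrator that manages all agents and handles the self-healing logic. One thing to watch: nothing in the class starts recoveryLoop() for you, so call it once at startup after registering your agents: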

class Orchestrator {
  constructor() {
    this.agents = new Map();
    this.recoveryQueue = [];
  }

  addAgent(agent) {
    this.agents.set(agent.name, agent);
    this.scheduleAgent(agent);
  }

  scheduleAgent(agent) {
    const run = async () => {
      try {
        await agent.run();
        console.log(`[${agent.name}] Completed successfully`);
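      } catch (error) {
        console.error(`[${agent.name}] Failed: ${error.message}`);
        // Queue for recovery once; repeated failures must not add duplicates
        if (!this.recoveryQueue.includes(agent)) {
          this.recoveryQueue.push(agent);
        }
      }
      // interval is in seconds; the next run is scheduled only after this one
      // settles, so runs of the same agent never overlap
      setTimeout(run, agent.interval * 1000);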
    };
    run();
  }

  async recoveryLoop() {
    while (true) {
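      if (this.recoveryQueue.length > 0) {
        const agent = this.recoveryQueue.shift();
        // Drop agents that already recovered during a later scheduled run
        if (agent.state !== 'failed') continue;
        // A throwing healthCheck counts as "not healed" instead of killing the loop
        const healed = await agent.heal().catch(() => false);
        if (!healed) {
          console.log(`[${agent.name}] Sending alert to Slack`);
          // Send to Slack/PagerDuty
          this.recoveryQueue.push(agent);
        }
      }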
      await new Promise(r => setTimeout(r, 10000));
    }
  }
}

What Changed

After migrating all 50 jobs to this system, the results were pretty wild. Failed jobs dropped by 97%, from 12 failures a week to 0-1. Recovery time went from hours to seconds; agents self-heal within 30 seconds. Monitoring costs dropped 80% since we ditched most Datadog cron checks. My team gained back 10 hours per month: no more debugging cron issues on Tuesday mornings.

The system runs on a single $40/month VPS. Previously, we needed 4 servers at $80/month each.

Production Lessons

Don't trust intervals. Network delays, API rate limits, and database locks will break assumptions. I added jitter to every agent's schedule:

const jitter = Math.random() * 1000; // 0-1 second random delay
setTimeout(run, agent.interval * 1000 + jitter);

Health checks must be fast. The recovery loop ticks every 10 seconds and awaits one healthCheck at a time, so a slow check delays every agent queued behind it. If a check takes longer than 500ms, the agent gets flagged. I use a separate lightweight connection pool for health checks.
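
The 500ms budget can be enforced with a small timeout wrapper. This is a sketch under assumptions: checkWithTimeout is a hypothetical name, and it treats a slow or throwing check as unhealthy, which may be stricter than simply flagging the agent.

```javascript
// Hypothetical wrapper: resolves to false if the check fails, throws,
// or does not finish within timeoutMs.
async function checkWithTimeout(healthCheck, timeoutMs = 500) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(false), timeoutMs);
  });
  try {
    // Whichever settles first wins: the real check or the timeout
    const healthy = await Promise.race([healthCheck(), timeout]);
    return Boolean(healthy);
  } catch {
    // A throwing check counts as unhealthy
    return false;
  } finally {
    // Always clear the timer so it cannot keep the process alive
    clearTimeout(timer);
  }
}
```

Inside heal(), the call would become something like await checkWithTimeout(this.healthCheck). Note that Promise.race does not cancel the slow check itself; it only stops waiting for it.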

Log everything to stdout. I pipe all agent logs to a single file with timestamps. When something breaks, I grep for the agent name and see exactly what happened.
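
A consistent line format is what makes the grep workflow practical. Here is a minimal sketch, assuming an ISO timestamp followed by the bracketed agent name used in the log calls above; formatLogLine is a hypothetical helper, not something from the original setup.

```javascript
// Hypothetical formatter: timestamp first, then the same [name] prefix
// the agents already print, so grep on the name keeps working.
function formatLogLine(agentName, message, date = new Date()) {
  return `${date.toISOString()} [${agentName}] ${message}`;
}

// Everything goes to stdout; redirecting to a file is left to the shell
function log(agentName, message) {
  console.log(formatLogLine(agentName, message));
}
```

With that in place, searching the log file for a bracketed agent name returns that agent's full history in time order.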

Never trust external APIs. Every external call gets wrapped in a circuit breaker pattern. After 3 failures in 60 seconds, the agent stops calling that API for 5 minutes.
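
The rule above (3 failures in 60 seconds, then a 5 minute pause) can be sketched as a small class. Treat this as an illustration under assumptions rather than the production implementation: the CircuitBreaker name, its option names, and the injectable now clock are all hypothetical.

```javascript
// Hypothetical circuit breaker: opens after maxFailures failures inside
// windowMs, then rejects calls without running them until cooldownMs passes.
class CircuitBreaker {
  constructor(options = {}) {
    const {
      maxFailures = 3,
      windowMs = 60_000,     // 60 seconds
      cooldownMs = 300_000,  // 5 minutes
      now = () => Date.now() // injectable clock, handy for tests
    } = options;
    this.maxFailures = maxFailures;
    this.windowMs = windowMs;
    this.cooldownMs = cooldownMs;
    this.now = now;
    this.failureTimes = [];
    this.openUntil = 0;
  }

  async call(fn) {
    if (this.now() < this.openUntil) {
      // Circuit is open: fail fast without touching the external API
      throw new Error('Circuit open: skipping call');
    }
    try {
      return await fn();
    } catch (error) {
      const t = this.now();
      // Keep only failures that are still inside the sliding window
      this.failureTimes = this.failureTimes.filter((ts) => t - ts < this.windowMs);
      this.failureTimes.push(t);
      if (this.failureTimes.length >= this.maxFailures) {
        this.openUntil = t + this.cooldownMs;
        this.failureTimes = [];
      }
      throw error;
    }
  }
}
```

Usage would look like one breaker per external API, shared by every agent that calls it: await breaker.call(() => fetchFromApi()), where fetchFromApi is a placeholder for your own client call. This sketch lets the first call after the cooldown go straight through; a fuller version might add a half-open state that allows a single probe request.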

The Real ROI

We spent 2 weeks building this. Total cost: about $3,000 in developer time. The system saved us $12,000 in the first month alone — that missed invoice never happened again. Plus 10 hours per month of team time. After 3 months, the system has handled 4,500+ job executions with zero manual intervention. The only alerts I get are when an external API is down — and the agent automatically retries when it comes back.

Self-healing agents aren't a luxury. They're a necessity when you have more than 10 background jobs. Cron works until it doesn't. By then, you've already lost money. Build your agent system before you need it. Your future self — sleeping through the night — will thank you.

DEV Community