Кирилл

How I Replaced 50 Cron Jobs with a Self-Healing Agent System

Tags: nodejs, ai, tutorial, backend

When our SaaS blew up from a handful of customers to a few thousand, the cron farm that kept everything alive turned into a nightmare. I was juggling 50 separate cron entries across three servers, each one doing a tiny piece of business logic: sending reminder emails, cleaning up stale sessions, rebuilding search indexes, rotating logs—you name it.

The problems were obvious:

| Issue | Impact |
| --- | --- |
| **Drift** – different servers ran slightly different versions of the same script | 12 % of jobs failed silently each month |
| **Scaling** – adding a new job meant editing every crontab manually | 2 hours of coordination per release |
| **Visibility** – no central dashboard; we only discovered failures when a customer complained | Mean Time To Detect (MTTD) ≈ 4 h |
| **Cost** – three VMs just to host the cron daemons, plus the overhead of monitoring them | $350/month extra on a $1.2k bill |

So I ripped it all out and built a self-healing agent system in Node.js, with a tiny AI "coach" that watches the agents, restarts them, and learns from repeated failures. Below is the blueprint that took me from 50 brittle crons to a single resilient service that now runs ≈ 30 % faster and saves ~$300/month.


1. The Core Idea – One Agent, Many Tasks

Instead of dozens of independent cron processes, I wrote a single Node service that loads task definitions from a JSON file (or a DB) and schedules them with node-cron. Each task runs inside its own worker thread so a runaway job never blocks the rest.

```javascript
// scheduler.js – the heart of the system
const { Worker } = require('worker_threads');
const cron = require('node-cron');
const AgentCoach = require('./agentCoach');
const tasks = require('./tasks.json'); // [{name, schedule, script, enabled}]

function startTask(task) {
  const worker = new Worker(`./workers/${task.script}`, { workerData: task });
  worker.on('error', err => AgentCoach.reportFailure(task.name, err));
  worker.on('exit', code => {
    if (code !== 0) AgentCoach.reportFailure(task.name, new Error(`Exit code ${code}`));
  });
}

// schedule every enabled task (all times in UTC)
tasks.filter(t => t.enabled).forEach(task => {
  cron.schedule(task.schedule, () => startTask(task), { timezone: 'UTC' });
});
```

Why this works:

  • Isolation – Each job runs in its own thread, so a memory leak in one task can’t affect the others.
  • Observability – The main process only needs to listen for error/exit events, making failure detection trivial.

2. Adding a Self‑Healing Coach

The “coach” is a lightweight AI module that learns which tasks fail repeatedly and applies corrective actions automatically: restart, back‑off, or disable until a human steps in.

```javascript
// agentCoach.js
const fs = require('fs');
const path = './coach-state.json';
let state = {};

function loadState() {
  if (fs.existsSync(path)) state = JSON.parse(fs.readFileSync(path, 'utf8'));
}

function saveState() {
  fs.writeFileSync(path, JSON.stringify(state, null, 2));
}

function reportFailure(taskName, err) {
  const now = Date.now();
  const record = state[taskName] || { failures: [], backoff: 0 };
  record.failures.push(now);
  // keep the last 5 failures only
  record.failures = record.failures.slice(-5);

  // simple heuristic: 3 failures in under 10 minutes triggers a back-off
  if (record.failures.length >= 3 &&
      now - record.failures[0] < 10 * 60 * 1000) {
    record.backoff = Math.min(record.backoff + 1, 5); // cap at 5 minutes
    console.warn(`[Coach] Backing off ${taskName} for ${record.backoff} min`);
    setTimeout(() => {
      // ask the scheduler to re-enable the task after the back-off
      // (implementation omitted for brevity)
    }, record.backoff * 60 * 1000);
  }

  state[taskName] = record;
  saveState();
}

loadState();
module.exports = { reportFailure };
```

The coach writes its state to disk, so a restart of the scheduler doesn’t wipe the failure history. After a few weeks in production it knocked repeated failures down from 12 % to 2 % and sliced the average MTTR from 4 h to 15 min.
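The re-enable step that the snippet above leaves out could be filled in several ways. One minimal sketch (the `pauseTask`/`isPaused` names are mine, not from the post) is to track paused tasks in a Map and have the scheduler consult it before launching a worker:

```javascript
// Sketch: the scheduler checks isPaused(task.name) before startTask(task).
const paused = new Map(); // taskName -> timestamp when the pause expires

function pauseTask(taskName, minutes) {
  paused.set(taskName, Date.now() + minutes * 60 * 1000);
}

function isPaused(taskName, now = Date.now()) {
  const until = paused.get(taskName);
  if (until === undefined) return false;
  if (now >= until) {
    // back-off window has expired: re-enable the task
    paused.delete(taskName);
    return false;
  }
  return true;
}

module.exports = { pauseTask, isPaused };
```

This keeps the coach's persistent state (failure history) separate from the transient pause state, which can safely be lost on restart.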


3. Migrating Existing Crons

I didn’t yank the old crons out in one go. Instead I:

  1. Exported each script into workers/ and added a thin wrapper that accepts workerData.
  2. Created a task entry in tasks.json mirroring the original schedule.
  3. Disabled the old cron line and re‑loaded the scheduler (zero downtime thanks to PM2).

Turns out the migration of 50 jobs took 3 days instead of the weeks I’d feared.

A sample entry in tasks.json (the cron expression `0 */2 * * *` fires every 2 hours; note that plain JSON does not allow inline comments):

```json
{
  "name": "send-reminder-emails",
  "schedule": "0 */2 * * *",
  "script": "emailReminder.js",
  "enabled": true
}
```

4. Monitoring & Alerting

Because each worker reports back to the coach, I hooked the coach into Datadog with a simple metric push:

```javascript
// inside reportFailure(), using a DogStatsD client (e.g. hot-shots)
datadog.gauge('agent.task.failures', record.failures.length, { task: taskName });
```

The dashboard now shows real-time failure counts per task, and a Slack webhook pings us when a task has been backed off for more than 5 minutes. Since deployment, the average alert noise dropped from 30 alerts/day to 2.
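The Slack side can be a plain webhook POST. A minimal sketch (the message format and function names are my assumptions; `fetch` is the global built into Node 18+):

```javascript
// Build the Slack payload as a pure function, then POST it to the webhook.
function buildBackoffAlert(taskName, minutes) {
  return { text: `:warning: Task "${taskName}" has been backed off for ${minutes} min` };
}

async function sendBackoffAlert(webhookUrl, taskName, minutes) {
  await fetch(webhookUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildBackoffAlert(taskName, minutes)),
  });
}

module.exports = { buildBackoffAlert, sendBackoffAlert };
```

Separating payload construction from the HTTP call keeps the alert format testable without hitting the network.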


5. Results after 8 weeks in production

| Metric | Before | After |
| --- | --- | --- |
| Avg. job latency | 1.8 s | 1.2 s (≈ 30 % faster) |
| Failed runs / month | 60 | 12 |
| Ops time spent on cron maintenance | 8 h / month | 1 h (mostly adding new tasks) |
| Monthly cost (VMs + monitoring) | $1,200 | $900 (saved $300) |
| MTTR for broken jobs | 4 h | 15 min |

The biggest surprise was the developer velocity gain: adding a new background task is now just a PR that updates tasks.json and drops a new worker file. No more editing crontabs on three machines.


6. Lessons Learned

  • Keep the agent stateless – Only the coach holds state; the scheduler can be restarted anytime without losing schedule data.
  • Thread per job is cheap – On a 2‑core VM I could run 30 concurrent workers without choking the CPU; Node’s worker threads are lightweight enough for most I/O‑bound cron jobs.
  • Simple heuristics beat complex ML – A rule‑based back‑off (3 failures in 10 min) solved 95 % of the pain points; trying to train a full reinforcement‑learning model would have been overkill.
  • Expose metrics early – Hooking the coach into the existing observability stack saved countless debugging hours later.

Last Tuesday I added a brand‑new nightly report task by just dropping a file in workers/ and pushing a tiny JSON entry. It worked straight away, no crontab edits required.

If you still have a sprawling crontab, give this approach a shot: replace the brittle collection with a single Node service, isolated workers, and a tiny self‑healing loop. The upfront effort pays off quickly in reliability, cost, and sanity.

One‑line takeaway: Replace brittle crons with a self‑healing agent and watch failures drop 80 % while saving hundreds of dollars each month.
