Prodini Admin
How I Built a Self-Healing Node.js System That Fixes Production Bugs While I Sleep

So I had this problem. I run a couple of Node.js services and every few days something would break in production — a bad query, a null reference, some edge case nobody thought of. I'd find out from logs way too late, ssh in, figure out what happened, write a fix, test it, push it. Every. Time.

At some point I thought — what if the system could just... fix itself? Or at least get 90% of the way there and ask me to approve?

That's how LevAutoFix was born. It's an automated error detection and remediation system that watches production logs, classifies errors by severity, launches Claude Code in headless mode to generate fixes, and sends me a Telegram message to approve or skip. One tap from my phone and a PR gets created.

Here's how the whole thing works.


The Architecture — Two Processes

The system is split into two separate processes:

Watcher — runs on the production server. Its only job is watching log files, detecting errors, and relaying them.

Fixer — runs on my dev machine. Receives classified errors, manages the fix queue, runs Claude, handles Telegram interactions.

This separation is intentional. I don't want anything doing code generation or git operations anywhere near production. The watcher is lightweight and read-only.

┌───────────────────┐          ┌───────────────────────┐
│  PROD SERVER      │          │  DEV MACHINE          │
│                   │          │                       │
│  Log Watcher      │──relay──▶│  Fix Queue            │
│  Error Classifier │          │  Claude Code Headless │
│                   │          │  Telegram Bot         │
│  (read-only)      │          │  Git Worktrees        │
└───────────────────┘          └───────────────────────┘

Step 1: Error Detection & Fingerprinting

The watcher tails log files using a simple file watcher. When it spots an error, it generates a fingerprint — a hash of the error message + first 3 stack trace lines:

import crypto from 'node:crypto';

// stripTimestamp (defined elsewhere) removes timestamp prefixes so the same
// error logged at different times hashes identically.
export function generateFingerprint(message: string, stack: string): string {
  const cleanMessage = stripTimestamp(message);
  const stackLines = (stack || '')
    .split('\n')
    .slice(0, 3)
    .map(stripTimestamp)
    .join('\n');
  const input = `${cleanMessage}|${stackLines}`;
  return crypto.createHash('sha256').update(input).digest('hex').slice(0, 16);
}

The fingerprint lets us group duplicate errors. If Mongo goes down, you don't want 200 separate error events — you want one event that says "this happened 200 times in the last minute."


Step 2: The Settle Window

This was one of the most important design decisions. When an error comes in, we don't act on it immediately. We wait 30 seconds.

Why? Because errors come in cascades. One mongo timeout triggers 15 failed queries which trigger 30 API errors. If you act on the first one, you're fixing a symptom. Wait 30 seconds, group everything by fingerprint, and you see the real picture.

// settleTimers is a Map<fingerprint, timeout>; ErrorEvent is the Mongoose model.
settleTimers.set(
  fingerprint,
  setTimeout(async () => {
    const settledEvent = await ErrorEvent.findById(eventId);
    // Only act if nothing else changed the status while we were waiting.
    if (settledEvent && settledEvent.status === 'settling') {
      settledEvent.severity = classifySeverity(
        settledEvent.errorMessage,
        settledEvent.occurrenceCount
      );
      if (meetsMinSeverity(settledEvent.severity)) {
        settledEvent.status = 'detected';
        await settledEvent.save();
        onSettleCallback(settledEvent);
      }
    }
  }, config.detection.settleDelayMs)
);

Each new occurrence of the same error resets the timer. So if errors keep coming, we keep waiting. Only when things calm down do we classify and decide what to do.
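The reset itself is just a debounce keyed by fingerprint. A minimal sketch of that part, with names mirroring the snippet above and the delay parameterized for illustration:

```typescript
// Debounce keyed by fingerprint: every new occurrence cancels the pending
// timer and starts a fresh one, so we only act after a quiet window.
const settleTimers = new Map<string, ReturnType<typeof setTimeout>>();

function recordOccurrence(
  fingerprint: string,
  onSettle: (fp: string) => void,
  settleMs = 30_000
): void {
  const pending = settleTimers.get(fingerprint);
  if (pending) clearTimeout(pending); // errors still arriving — keep waiting
  settleTimers.set(
    fingerprint,
    setTimeout(() => {
      settleTimers.delete(fingerprint);
      onSettle(fingerprint); // quiet long enough — classify and decide
    }, settleMs)
  );
}
```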


Step 3: Severity Classification

The classifier is intentionally simple. No ML, no fancy scoring — just regex patterns and occurrence counts:

export type ErrorSeverity = 'critical' | 'high' | 'medium' | 'low';

const CRITICAL_PATTERNS = [
  /mongo.*(?:error|fail|refused|timeout)/i,
  /ECONNREFUSED/i,
  /jwt.*(?:secret|undefined|invalid)/i,
  /database.*(?:down|unavailable|timeout)/i,
];

export function classifySeverity(
  message: string,
  occurrenceCount: number
): ErrorSeverity {
  for (const pattern of CRITICAL_PATTERNS) {
    if (pattern.test(message)) return 'critical';
  }
  if (occurrenceCount >= 10) return 'high';
  if (occurrenceCount >= 3) return 'medium';
  return 'low';
}

Infrastructure errors (db down, auth broken) are always critical. Everything else escalates by count. Below medium gets ignored — if an error happened once, it's probably not worth an automated fix.

Important: the watcher classifies, the fixer trusts. When the watcher relays an error to the fixer, it sends the severity and occurrence count along with it. The fixer doesn't re-classify. This avoids a nasty bug where occurrences would need to accumulate twice.
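To make that concrete, here's an illustrative shape for the relayed event. The field names are my guesses, not the author's exact schema; the key point is that severity and count travel with the event, so the fixer never re-counts:

```typescript
type Severity = 'critical' | 'high' | 'medium' | 'low';

// Hypothetical relay payload: the watcher ships its verdict, the fixer
// takes it at face value.
interface RelayedError {
  fingerprint: string;      // 16-hex-char hash from generateFingerprint
  errorMessage: string;
  stack: string;
  severity: Severity;       // classified once, on the watcher
  occurrenceCount: number;  // already accumulated during the settle window
  lastSeenAt: string;       // ISO timestamp
}

const example: RelayedError = {
  fingerprint: 'a1b2c3d4e5f60708',
  errorMessage: 'MongoServerError: connection pool closed',
  stack: 'at MongoClient.connect (mongo-client.ts:88)',
  severity: 'critical',
  occurrenceCount: 14,
  lastSeenAt: new Date().toISOString(),
};
```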


Step 4: Git Worktree Isolation

This is the part that makes the whole thing safe. When a fix is approved, the system creates a git worktree:

/repo/.worktrees/hotfix/auto-<fingerprint>/

A worktree is a full working copy of the repo on a separate branch. Claude can read, write, edit, run tests — whatever it needs. If the fix is wrong, you delete the worktree and nothing happened. Main branch is never touched.

Early versions of this didn't have worktree isolation. Claude was committing to main at 3am. I woke up to some interesting git histories. Never again.
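The git mechanics behind this are plain `git worktree` commands. A self-contained demo in a throwaway repo (the fingerprint and identity settings are made up for the sketch):

```shell
set -e
# Demo in a throwaway repo so nothing real is touched.
cd "$(mktemp -d)"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
  commit -q --allow-empty -m "init"
git branch -M main

FP="a1b2c3d4e5f60708"                  # hypothetical fingerprint
BRANCH="hotfix/auto-$FP"
WT=".worktrees/hotfix/auto-$FP"

# Full working copy on its own branch; the AI only ever operates inside $WT.
git worktree add -b "$BRANCH" "$WT" main

# Fix rejected? Delete the sandbox — main was never touched.
git worktree remove --force "$WT"
git branch -D "$BRANCH"
```

If the fix is approved instead, you push `$BRANCH` and open the PR from it; either way the checkout on main stays clean.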


Step 5: Claude Code Headless

Here's where the AI comes in. The fixer launches the Claude Code CLI in headless mode with a scoped prompt:

Fix this production error in the codebase.
Error: MongoServerError: connection pool closed
Stack: at MongoClient.connect (mongo-client.ts:88)
Path: POST /api/products/list
Severity: CRITICAL

Claude gets access to the full repo through the worktree and a set of tools — Read, Write, Edit, Glob, Grep, Bash. It explores the codebase, traces the error, and writes a fix.

The scoping is crucial. Early on I tried just saying "fix my app" and the quality was terrible. Giving Claude the exact error, stack trace, and affected endpoint makes a huge difference. It knows exactly where to start looking.
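A sketch of how the fixer might assemble that invocation. The flag names follow the Claude Code CLI docs (`-p` for non-interactive mode, `--allowedTools` to scope tool access) but treat them as assumptions to verify against your installed version; `FixJob` and the helper names are invented here:

```typescript
interface FixJob {
  errorMessage: string;
  stack: string;
  path: string;
  severity: string;
}

// Mirrors the scoped prompt shown above: exact error, stack, endpoint.
function buildPrompt(job: FixJob): string {
  return [
    'Fix this production error in the codebase.',
    `Error: ${job.errorMessage}`,
    `Stack: ${job.stack}`,
    `Path: ${job.path}`,
    `Severity: ${job.severity.toUpperCase()}`,
  ].join('\n');
}

function buildClaudeInvocation(job: FixJob, worktreeDir: string) {
  return {
    cmd: 'claude',
    args: ['-p', buildPrompt(job), '--allowedTools', 'Read,Write,Edit,Glob,Grep,Bash'],
    cwd: worktreeDir, // run inside the worktree so edits never touch main
  };
}
```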

Honest results so far:

  • Critical infra errors (db connections, auth) — Claude fixes roughly 70-80% correctly
  • Logic bugs with clear stack traces — pretty solid
  • Vague errors without good stacks — hit or miss

Step 6: Telegram Approval

Every fix goes through human approval. The bot sends a message with the error details and two buttons:

🚨 New Error — Approve Fix?

🔴 Severity: CRITICAL
Service: PA
Error: MongoServerError: connection pool closed
Path: POST /api/products/list

[✅ Approve Fix]  [❌ Skip]

I tap Approve from my phone, the system creates a PR from the worktree branch. Skip and it gets shelved.
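Building that message with node-telegram-bot-api looks roughly like this. The `inline_keyboard`/`callback_data` shapes are the library's real API; the helper function and its field names are illustrative:

```typescript
interface InlineButton { text: string; callback_data: string; }

// Hypothetical helper: returns the text plus sendMessage options.
function buildApprovalMessage(
  eventId: string, severity: string, service: string, error: string, path: string
) {
  return {
    text: [
      '🚨 New Error — Approve Fix?',
      '',
      `🔴 Severity: ${severity}`,
      `Service: ${service}`,
      `Error: ${error}`,
      `Path: ${path}`,
    ].join('\n'),
    options: {
      reply_markup: {
        // Telegram caps callback_data at 64 bytes, so we send the event id
        // and look everything else up server-side.
        inline_keyboard: [[
          { text: '✅ Approve Fix', callback_data: `approve:${eventId}` },
          { text: '❌ Skip', callback_data: `skip:${eventId}` },
        ] as InlineButton[]],
      },
    },
  };
}
```

On the bot side, a `callback_query` handler splits `query.data` on `:` and either creates the PR from the worktree branch or shelves the event.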

I also built an interactive dashboard so I can check the overall system status without typing commands:

🏠 LevAutoFix Dashboard

[📋 Queue Status]  [🚨 Recent Errors]
[📊 System Status] [🔄 Refresh]

Each button edits the same message in-place with the requested view. No chat flooding. The errors view looks like:

🔴 [PA] MongoServerError: connection pool closed...
   🔧 fixing • 5m ago

🟠 [PA] jwt secret undefined - authentication broken...
   ⏳ detected • 12m ago

🟡 [GA] Cannot read property tenantId of undefined
   ✅ fixed • 2h ago
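Rendering a view like that is just string formatting; a sketch with types and field names invented for illustration:

```typescript
type Sev = 'critical' | 'high' | 'medium' | 'low';

// Severity dot emoji, matching the screenshot above.
const DOT: Record<Sev, string> = { critical: '🔴', high: '🟠', medium: '🟡', low: '⚪' };

interface ErrorRow {
  service: string;    // short service tag, e.g. 'PA'
  message: string;
  severity: Sev;
  statusLine: string; // e.g. '🔧 fixing • 5m ago'
}

function renderErrorsView(rows: ErrorRow[]): string {
  return rows
    .map(r => `${DOT[r.severity]} [${r.service}] ${r.message}\n   ${r.statusLine}`)
    .join('\n\n');
}
```

The in-place update then comes from node-telegram-bot-api's real `bot.editMessageText(text, { chat_id, message_id, reply_markup })`, which swaps the body of the existing dashboard message instead of sending a new one.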

The Stack

Nothing exotic:

  • TypeScript + Express — API server for both watcher and fixer
  • MongoDB — error events, fix queue, metadata
  • node-telegram-bot-api — bot with inline keyboards and callback handlers
  • Claude Code CLI — headless mode for automated code generation
  • Git worktrees — isolation for each fix attempt

What I Learned

The settle window is everything. Without it, cascade failures generate dozens of duplicate fix attempts. 30 seconds of patience saves hours of cleanup.

Scope the AI prompt tightly. Don't say "fix my app." Give it the exact error, the exact stack, the exact endpoint. The difference in fix quality is night and day.

Isolate with worktrees. Let the AI experiment freely in a sandbox. If it works, merge it. If it doesn't, delete it. Zero risk to your main branch.

Keep the watcher dumb. The production component should be as simple as possible. Tail logs, classify, relay. All the complex stuff happens on the dev machine.

Human in the loop matters. Auto-fixing without approval sounds cool until Claude "fixes" your auth middleware at 3am. The Telegram approval step takes 2 seconds and has saved me multiple times.


What's Next

  • Auto-merge for low-risk fixes (small diff + tests pass + no sensitive files touched)
  • Web dashboard for fix history analytics
  • Open sourcing the watcher component — it's generic enough to work with any Node.js app

If you're interested in trying something similar, the core idea is surprisingly simple: tail logs → fingerprint → settle → classify → worktree → AI → approve. Each piece is straightforward on its own. The magic is in how they connect.

The whole thing is free and open source — just needs a Claude subscription. Planning to put the repo on GitHub soon.

Happy to answer questions about any part of the setup.

Top comments (2)

Hamza KONTE

Self-healing systems are appealing but the prompt reliability problem you're touching on is the hard part — the LLM generates a patch, but the quality of that patch is a direct function of how well the diagnostic context is structured in the prompt.

If the error log + code context is fed as a blob of text, the model has to infer what kind of failure it is, what constraints apply (don't change the API surface, don't touch the auth layer), and what a valid fix looks like. That inference process is where "fixes" that break something else come from.

The structured approach — explicitly separating the error context (Input block), the constraints (don't modify X, Y, Z), and the expected output format (patch only, no refactoring) — makes the self-healing behavior much more predictable and auditable.

I built flompt (flompt.dev) to do exactly this structuring for any LLM prompt. For a self-healing system, a compiled structured prompt per error type would make the patch quality much more consistent. Free, open-source, and there's an MCP server if you want to integrate it into your pipeline: github.com/Nyrok/flompt.

Meek

The secret isn’t “AI fixing bugs.”

The secret is:
Controlled automation + isolation + human gating + rollback strategy.