Prodini Admin
How I Built a Self-Healing Node.js System That Fixes Production Bugs While I Sleep

So I had this problem. I run a couple of Node.js services and every few days something would break in production — a bad query, a null reference, some edge case nobody thought of. I'd find out from logs way too late, ssh in, figure out what happened, write a fix, test it, push it. Every. Time.

At some point I thought — what if the system could just... fix itself? Or at least get 90% of the way there and ask me to approve?

That's how LevAutoFix was born. It's an automated error detection and remediation system that watches production logs, classifies errors by severity, launches Claude Code in headless mode to generate fixes, and sends me a Telegram message to approve or skip. One tap from my phone and a PR gets created.

Here's how the whole thing works.


The Architecture — Two Processes

The system is split into two separate processes:

Watcher — runs on the production server. Its only job is watching log files, detecting errors, and relaying them.

Fixer — runs on my dev machine. Receives classified errors, manages the fix queue, runs Claude, handles Telegram interactions.

This separation is intentional. I don't want anything doing code generation or git operations anywhere near production. The watcher is lightweight and read-only.

┌───────────────────┐          ┌───────────────────────┐
│  PROD SERVER      │          │  DEV MACHINE          │
│                   │          │                       │
│  Log Watcher      │──relay──▶│  Fix Queue            │
│  Error Classifier │          │  Claude Code Headless │
│                   │          │  Telegram Bot         │
│  (read-only)      │          │  Git Worktrees        │
└───────────────────┘          └───────────────────────┘

Step 1: Error Detection & Fingerprinting

The watcher tails log files using a simple file watcher. When it spots an error, it generates a fingerprint — a hash of the error message + first 3 stack trace lines:

import crypto from 'node:crypto';

// stripTimestamp (defined elsewhere) removes timestamp prefixes so the same
// error logged at different times hashes identically.
export function generateFingerprint(message: string, stack: string): string {
  const cleanMessage = stripTimestamp(message);
  const stackLines = (stack || '')
    .split('\n')
    .slice(0, 3)
    .map(stripTimestamp)
    .join('\n');
  const input = `${cleanMessage}|${stackLines}`;
  return crypto.createHash('sha256').update(input).digest('hex').slice(0, 16);
}

The fingerprint lets us group duplicate errors. If Mongo goes down, you don't want 200 separate error events — you want one event that says "this happened 200 times in the last minute."


Step 2: The Settle Window

This was one of the most important design decisions. When an error comes in, we don't act on it immediately. We wait 30 seconds.

Why? Because errors come in cascades. One mongo timeout triggers 15 failed queries which trigger 30 API errors. If you act on the first one, you're fixing a symptom. Wait 30 seconds, group everything by fingerprint, and you see the real picture.

// settleTimers is a Map<fingerprint, timeout>; ErrorEvent is the Mongoose model.
settleTimers.set(
  fingerprint,
  setTimeout(async () => {
    const settledEvent = await ErrorEvent.findById(eventId);
    // Only act if nothing else changed the status while we were waiting.
    if (settledEvent && settledEvent.status === 'settling') {
      settledEvent.severity = classifySeverity(
        settledEvent.errorMessage,
        settledEvent.occurrenceCount
      );
      if (meetsMinSeverity(settledEvent.severity)) {
        settledEvent.status = 'detected';
        await settledEvent.save();
        onSettleCallback(settledEvent);
      }
    }
  }, config.detection.settleDelayMs)
);

Each new occurrence of the same error resets the timer. So if errors keep coming, we keep waiting. Only when things calm down do we classify and decide what to do.
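The reset itself is just a debounce keyed by fingerprint. A minimal sketch of that part, with names mirroring the snippet above and the delay parameterized for illustration:

```typescript
// Debounce keyed by fingerprint: every new occurrence cancels the pending
// timer and starts a fresh one, so we only act after a quiet window.
const settleTimers = new Map<string, ReturnType<typeof setTimeout>>();

function recordOccurrence(
  fingerprint: string,
  onSettle: (fp: string) => void,
  settleMs = 30_000
): void {
  const pending = settleTimers.get(fingerprint);
  if (pending) clearTimeout(pending); // errors still arriving — keep waiting
  settleTimers.set(
    fingerprint,
    setTimeout(() => {
      settleTimers.delete(fingerprint);
      onSettle(fingerprint); // quiet long enough — classify and decide
    }, settleMs)
  );
}
```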


Step 3: Severity Classification

The classifier is intentionally simple. No ML, no fancy scoring — just regex patterns and occurrence counts:

export type ErrorSeverity = 'critical' | 'high' | 'medium' | 'low';

const CRITICAL_PATTERNS = [
  /mongo.*(?:error|fail|refused|timeout)/i,
  /ECONNREFUSED/i,
  /jwt.*(?:secret|undefined|invalid)/i,
  /database.*(?:down|unavailable|timeout)/i,
];

export function classifySeverity(
  message: string,
  occurrenceCount: number
): ErrorSeverity {
  for (const pattern of CRITICAL_PATTERNS) {
    if (pattern.test(message)) return 'critical';
  }
  if (occurrenceCount >= 10) return 'high';
  if (occurrenceCount >= 3) return 'medium';
  return 'low';
}

Infrastructure errors (db down, auth broken) are always critical. Everything else escalates by count. Below medium gets ignored — if an error happened once, it's probably not worth an automated fix.

Important: the watcher classifies, the fixer trusts. When the watcher relays an error to the fixer, it sends the severity and occurrence count along with it. The fixer doesn't re-classify. This avoids a nasty bug where occurrences would need to accumulate twice.
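To make that concrete, here's an illustrative shape for the relayed event. The field names are my guesses, not the author's exact schema; the key point is that severity and count travel with the event, so the fixer never re-counts:

```typescript
type Severity = 'critical' | 'high' | 'medium' | 'low';

// Hypothetical relay payload: the watcher ships its verdict, the fixer
// takes it at face value.
interface RelayedError {
  fingerprint: string;      // 16-hex-char hash from generateFingerprint
  errorMessage: string;
  stack: string;
  severity: Severity;       // classified once, on the watcher
  occurrenceCount: number;  // already accumulated during the settle window
  lastSeenAt: string;       // ISO timestamp
}

const example: RelayedError = {
  fingerprint: 'a1b2c3d4e5f60708',
  errorMessage: 'MongoServerError: connection pool closed',
  stack: 'at MongoClient.connect (mongo-client.ts:88)',
  severity: 'critical',
  occurrenceCount: 14,
  lastSeenAt: new Date().toISOString(),
};
```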


Step 4: Git Worktree Isolation

This is the part that makes the whole thing safe. When a fix is approved, the system creates a git worktree:

/repo/.worktrees/hotfix/auto-<fingerprint>/

A worktree is a full working copy of the repo on a separate branch. Claude can read, write, edit, run tests — whatever it needs. If the fix is wrong, you delete the worktree and nothing happened. Main branch is never touched.

Early versions of this didn't have worktree isolation. Claude was committing to main at 3am. I woke up to some interesting git histories. Never again.
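The git mechanics behind this are plain `git worktree` commands. A self-contained demo in a throwaway repo (the fingerprint and identity settings are made up for the sketch):

```shell
set -e
# Demo in a throwaway repo so nothing real is touched.
cd "$(mktemp -d)"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
  commit -q --allow-empty -m "init"
git branch -M main

FP="a1b2c3d4e5f60708"                  # hypothetical fingerprint
BRANCH="hotfix/auto-$FP"
WT=".worktrees/hotfix/auto-$FP"

# Full working copy on its own branch; the AI only ever operates inside $WT.
git worktree add -b "$BRANCH" "$WT" main

# Fix rejected? Delete the sandbox — main was never touched.
git worktree remove --force "$WT"
git branch -D "$BRANCH"
```

If the fix is approved instead, you push `$BRANCH` and open the PR from it; either way the checkout on main stays clean.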


Step 5: Claude Code Headless

Here's where the AI comes in. The fixer launches the Claude Code CLI in headless mode with a scoped prompt:

Fix this production error in the codebase.
Error: MongoServerError: connection pool closed
Stack: at MongoClient.connect (mongo-client.ts:88)
Path: POST /api/products/list
Severity: CRITICAL

Claude gets access to the full repo through the worktree and a set of tools — Read, Write, Edit, Glob, Grep, Bash. It explores the codebase, traces the error, and writes a fix.

The scoping is crucial. Early on I tried just saying "fix my app" and the quality was terrible. Giving Claude the exact error, stack trace, and affected endpoint makes a huge difference. It knows exactly where to start looking.
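A sketch of how the fixer might assemble that invocation. The flag names follow the Claude Code CLI docs (`-p` for non-interactive mode, `--allowedTools` to scope tool access) but treat them as assumptions to verify against your installed version; `FixJob` and the helper names are invented here:

```typescript
interface FixJob {
  errorMessage: string;
  stack: string;
  path: string;
  severity: string;
}

// Mirrors the scoped prompt shown above: exact error, stack, endpoint.
function buildPrompt(job: FixJob): string {
  return [
    'Fix this production error in the codebase.',
    `Error: ${job.errorMessage}`,
    `Stack: ${job.stack}`,
    `Path: ${job.path}`,
    `Severity: ${job.severity.toUpperCase()}`,
  ].join('\n');
}

function buildClaudeInvocation(job: FixJob, worktreeDir: string) {
  return {
    cmd: 'claude',
    args: ['-p', buildPrompt(job), '--allowedTools', 'Read,Write,Edit,Glob,Grep,Bash'],
    cwd: worktreeDir, // run inside the worktree so edits never touch main
  };
}
```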

Honest results so far:

  • Critical infra errors (db connections, auth) — Claude fixes roughly 70-80% correctly
  • Logic bugs with clear stack traces — pretty solid
  • Vague errors without good stacks — hit or miss

Step 6: Telegram Approval

Every fix goes through human approval. The bot sends a message with the error details and two buttons:

🚨 New Error — Approve Fix?

🔴 Severity: CRITICAL
Service: PA
Error: MongoServerError: connection pool closed
Path: POST /api/products/list

[✅ Approve Fix]  [❌ Skip]

I tap Approve from my phone, the system creates a PR from the worktree branch. Skip and it gets shelved.
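Building that message with node-telegram-bot-api looks roughly like this. The `inline_keyboard`/`callback_data` shapes are the library's real API; the helper function and its field names are illustrative:

```typescript
interface InlineButton { text: string; callback_data: string; }

// Hypothetical helper: returns the text plus sendMessage options.
function buildApprovalMessage(
  eventId: string, severity: string, service: string, error: string, path: string
) {
  return {
    text: [
      '🚨 New Error — Approve Fix?',
      '',
      `🔴 Severity: ${severity}`,
      `Service: ${service}`,
      `Error: ${error}`,
      `Path: ${path}`,
    ].join('\n'),
    options: {
      reply_markup: {
        // Telegram caps callback_data at 64 bytes, so we send the event id
        // and look everything else up server-side.
        inline_keyboard: [[
          { text: '✅ Approve Fix', callback_data: `approve:${eventId}` },
          { text: '❌ Skip', callback_data: `skip:${eventId}` },
        ] as InlineButton[]],
      },
    },
  };
}
```

On the bot side, a `callback_query` handler splits `query.data` on `:` and either creates the PR from the worktree branch or shelves the event.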

I also built an interactive dashboard so I can check the overall system status without typing commands:

🏠 LevAutoFix Dashboard

[📋 Queue Status]  [🚨 Recent Errors]
[📊 System Status] [🔄 Refresh]

Each button edits the same message in-place with the requested view. No chat flooding. The errors view looks like:

🔴 [PA] MongoServerError: connection pool closed...
   🔧 fixing • 5m ago

🟠 [PA] jwt secret undefined - authentication broken...
   ⏳ detected • 12m ago

🟡 [GA] Cannot read property tenantId of undefined
   ✅ fixed • 2h ago
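Rendering a view like that is just string formatting; a sketch with types and field names invented for illustration:

```typescript
type Sev = 'critical' | 'high' | 'medium' | 'low';

// Severity dot emoji, matching the screenshot above.
const DOT: Record<Sev, string> = { critical: '🔴', high: '🟠', medium: '🟡', low: '⚪' };

interface ErrorRow {
  service: string;    // short service tag, e.g. 'PA'
  message: string;
  severity: Sev;
  statusLine: string; // e.g. '🔧 fixing • 5m ago'
}

function renderErrorsView(rows: ErrorRow[]): string {
  return rows
    .map(r => `${DOT[r.severity]} [${r.service}] ${r.message}\n   ${r.statusLine}`)
    .join('\n\n');
}
```

The in-place update then comes from node-telegram-bot-api's real `bot.editMessageText(text, { chat_id, message_id, reply_markup })`, which swaps the body of the existing dashboard message instead of sending a new one.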

The Stack

Nothing exotic:

  • TypeScript + Express — API server for both watcher and fixer
  • MongoDB — error events, fix queue, metadata
  • node-telegram-bot-api — bot with inline keyboards and callback handlers
  • Claude Code CLI — headless mode for automated code generation
  • Git worktrees — isolation for each fix attempt

What I Learned

The settle window is everything. Without it, cascade failures generate dozens of duplicate fix attempts. 30 seconds of patience saves hours of cleanup.

Scope the AI prompt tightly. Don't say "fix my app." Give it the exact error, the exact stack, the exact endpoint. The difference in fix quality is night and day.

Isolate with worktrees. Let the AI experiment freely in a sandbox. If it works, merge it. If it doesn't, delete it. Zero risk to your main branch.

Keep the watcher dumb. The production component should be as simple as possible. Tail logs, classify, relay. All the complex stuff happens on the dev machine.

Human in the loop matters. Auto-fixing without approval sounds cool until Claude "fixes" your auth middleware at 3am. The Telegram approval step takes 2 seconds and has saved me multiple times.


What's Next

  • Auto-merge for low-risk fixes (small diff + tests pass + no sensitive files touched)
  • Web dashboard for fix history analytics
  • Open sourcing the watcher component — it's generic enough to work with any Node.js app

If you're interested in trying something similar, the core idea is surprisingly simple: tail logs → fingerprint → settle → classify → worktree → AI → approve. Each piece is straightforward on its own. The magic is in how they connect.

The whole thing is free and open source — just needs a Claude subscription. Planning to put the repo on GitHub soon.

Happy to answer questions about any part of the setup.

Top comments (2)

Hamza KONTE

Self-healing systems are appealing but the prompt reliability problem you're touching on is the hard part — the LLM generates a patch, but the quality of that patch is a direct function of how well the diagnostic context is structured in the prompt.

If the error log + code context is fed as a blob of text, the model has to infer what kind of failure it is, what constraints apply (don't change the API surface, don't touch the auth layer), and what a valid fix looks like. That inference process is where "fixes" that break something else come from.

The structured approach — explicitly separating the error context (Input block), the constraints (don't modify X, Y, Z), and the expected output format (patch only, no refactoring) — makes the self-healing behavior much more predictable and auditable.

I built flompt (flompt.dev) to do exactly this structuring for any LLM prompt. For a self-healing system, a compiled structured prompt per error type would make the patch quality much more consistent. Free, open-source, and there's an MCP server if you want to integrate it into your pipeline: github.com/Nyrok/flompt.

Meek

The secret isn’t “AI fixing bugs.”

The secret is:
Controlled automation + isolation + human gating + rollback strategy.