How I Built a Self-Healing Node.js System That Fixes Production Bugs While I Sleep

Prodini Admin — Sun, 08 Mar 2026 08:25:22 +0000

So I had this problem. I run a couple of Node.js services, and every few days something would break in production: a bad query, a null reference, some edge case nobody thought of. I'd find out from the logs way too late, SSH in, figure out what happened, write a fix, test it, push it. Every. Time.

At some point I thought — what if the system could just... fix itself? Or at least get 90% of the way there and ask me to approve?

That's how LevAutoFix was born. It's an automated error detection and remediation system that watches production logs, classifies errors by severity, launches Claude Code in headless mode to generate fixes, and sends me a Telegram message to approve or skip. One tap from my phone and a PR gets created.

Here's how the whole thing works.

The Architecture — Two Processes

The system is split into two separate processes:

Watcher — runs on the production server. Its only job is watching log files, detecting errors, and relaying them.

Fixer — runs on my dev machine. Receives classified errors, manages the fix queue, runs Claude, handles Telegram interactions.

This separation is intentional. I don't want anything doing code generation or git operations anywhere near production. The watcher is lightweight and read-only.

┌─────────────────┐          ┌─────────────────────┐
│  PROD SERVER     │          │  DEV MACHINE         │
│                  │          │                       │
│  Log Watcher     │───relay──│  Fix Queue            │
│  Error Classifier│          │  Claude Code Headless │
│                  │          │  Telegram Bot         │
│  (read-only)     │          │  Git Worktrees        │
└─────────────────┘          └─────────────────────┘

Step 1: Error Detection & Fingerprinting

The watcher tails log files using a simple file watcher. When it spots an error, it generates a fingerprint — a hash of the error message + first 3 stack trace lines:

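import * as crypto from 'node:crypto';

// stripTimestamp is a project helper that isn't shown in this post.
export function generateFingerprint(message: string, stack: string): string {
  const cleanMessage = stripTimestamp(message);
  // Keep only the first 3 stack lines so the hash stays stable across call paths.
  const stackLines = (stack || '')
    .split('\n')
    .slice(0, 3)
    .map(stripTimestamp)
    .join('\n');
  const input = `${cleanMessage}|${stackLines}`;
  // 16 hex characters (64 bits) is plenty for grouping duplicates.
  return crypto.createHash('sha256').update(input).digest('hex').slice(0, 16);
}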

The fingerprint lets us group duplicate errors. If Mongo goes down, you don't want 200 separate error events. You want one event that says "this happened 200 times in the last minute."
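To make the grouping concrete, here's a minimal self-contained sketch. The stripTimestamp below is a hypothetical stand-in (the real helper isn't shown in this post) that assumes ISO-style timestamps, and countByFingerprint is an illustrative name, not part of LevAutoFix.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical stand-in for the post's stripTimestamp helper: removes
// ISO-8601-style timestamps such as 2026-03-08T08:25:22.123Z.
function stripTimestamp(line: string): string {
  return line
    .replace(/\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?/g, '')
    .trim();
}

// Same idea as generateFingerprint: message + first 3 stack lines, hashed.
function fingerprint(message: string, stack: string): string {
  const head = stack.split('\n').slice(0, 3).map(stripTimestamp).join('\n');
  return createHash('sha256')
    .update(`${stripTimestamp(message)}|${head}`)
    .digest('hex')
    .slice(0, 16);
}

// Collapse raw events into one counter per fingerprint.
function countByFingerprint(
  events: { message: string; stack: string }[]
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of events) {
    const fp = fingerprint(e.message, e.stack);
    counts.set(fp, (counts.get(fp) ?? 0) + 1);
  }
  return counts;
}

const stack = 'at MongoClient.connect (mongo-client.ts:88)';
const counts = countByFingerprint([
  { message: '2026-03-08T08:25:22Z MongoServerError: connection pool closed', stack },
  { message: '2026-03-08T08:25:23Z MongoServerError: connection pool closed', stack },
  { message: '2026-03-08T08:25:24Z TypeError: x is undefined', stack },
]);
console.log(counts.size); // 2: the two Mongo errors share one fingerprint
```

The two Mongo events differ only by timestamp, so they land in the same bucket with a count of 2.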

Step 2: The Settle Window

This was one of the most important design decisions. When an error comes in, we don't act on it immediately. We wait 30 seconds.

Why? Because errors come in cascades. One Mongo timeout triggers 15 failed queries, which trigger 30 API errors. If you act on the first one, you're fixing a symptom. Wait 30 seconds, group everything by fingerprint, and you see the real picture.

settleTimers.set(
  fingerprint,
  setTimeout(async () => {
    // Re-read the event: occurrenceCount may have grown during the settle window.
    const settledEvent = await ErrorEvent.findById(eventId);
    if (settledEvent && settledEvent.status === 'settling') {
      // Classify only now, using the final occurrence count.
      settledEvent.severity = classifySeverity(
        settledEvent.errorMessage,
        settledEvent.occurrenceCount
      );
      // Only events at or above the minimum severity move on to the fixer.
      if (meetsMinSeverity(settledEvent.severity)) {
        settledEvent.status = 'detected';
        await settledEvent.save();
        onSettleCallback(settledEvent);
      }
    }
  }, config.detection.settleDelayMs)
);

Each new occurrence of the same error resets the timer. So if errors keep coming, we keep waiting. Only when things calm down do we classify and decide what to do.
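The reset-on-every-occurrence part isn't visible in the snippet above, so here's a rough sketch of how it could look: a per-fingerprint debounce. SettleWindow and its method names are hypothetical, the counter lives in memory instead of MongoDB, and the demo uses a 20 ms delay rather than the real 30 seconds.

```typescript
// Hypothetical helper: one resettable timer per fingerprint.
class SettleWindow {
  private timers = new Map<string, ReturnType<typeof setTimeout>>();
  private counts = new Map<string, number>();
  private delayMs: number;
  private onSettle: (fingerprint: string, occurrences: number) => void;

  constructor(
    delayMs: number,
    onSettle: (fingerprint: string, occurrences: number) => void
  ) {
    this.delayMs = delayMs;
    this.onSettle = onSettle;
  }

  record(fingerprint: string): void {
    this.counts.set(fingerprint, (this.counts.get(fingerprint) ?? 0) + 1);
    // Cancel any pending timer so the window restarts from this occurrence.
    const pending = this.timers.get(fingerprint);
    if (pending !== undefined) clearTimeout(pending);
    this.timers.set(
      fingerprint,
      setTimeout(() => {
        const occurrences = this.counts.get(fingerprint) ?? 0;
        this.timers.delete(fingerprint);
        this.counts.delete(fingerprint);
        this.onSettle(fingerprint, occurrences);
      }, this.delayMs)
    );
  }

  pendingCount(): number {
    return this.timers.size;
  }
}

const settled: string[] = [];
const win = new SettleWindow(20, (fp, n) => {
  settled.push(`${fp} x${n}`);
});
// Three occurrences in quick succession: one timer, reset twice.
win.record('abc123');
win.record('abc123');
win.record('abc123');
```

Once things go quiet for the full delay, the callback fires exactly once with the total count, which is the number the classifier needs.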

Step 3: Severity Classification

The classifier is intentionally simple. No ML, no fancy scoring — just regex patterns and occurrence counts:

const CRITICAL_PATTERNS = [
  /mongo.*(?:error|fail|refused|timeout)/i,
  /ECONNREFUSED/i,
  /jwt.*(?:secret|undefined|invalid)/i,
  /database.*(?:down|unavailable|timeout)/i,
];

export function classifySeverity(
  message: string,
  occurrenceCount: number
): ErrorSeverity {
  for (const pattern of CRITICAL_PATTERNS) {
    if (pattern.test(message)) return 'critical';
  }
  if (occurrenceCount >= 10) return 'high';
  if (occurrenceCount >= 3) return 'medium';
  return 'low';
}

Infrastructure errors (DB down, auth broken) are always critical. Everything else escalates by count. Below medium gets ignored: if an error happened only once or twice, it's probably not worth an automated fix.
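The "below medium gets ignored" rule comes down to a rank comparison. A minimal sketch, assuming an ordering of low < medium < high < critical; the explicit min parameter is my addition for illustration, since the real meetsMinSeverity takes only the severity and presumably reads the threshold from config.

```typescript
type ErrorSeverity = 'low' | 'medium' | 'high' | 'critical';

// Assumed ordering, from least to most severe.
const SEVERITY_RANK: Record<ErrorSeverity, number> = {
  low: 0,
  medium: 1,
  high: 2,
  critical: 3,
};

// Hypothetical signature: the threshold is passed in here instead of
// being read from config.
function meetsMinSeverity(
  severity: ErrorSeverity,
  min: ErrorSeverity = 'medium'
): boolean {
  return SEVERITY_RANK[severity] >= SEVERITY_RANK[min];
}

const all: ErrorSeverity[] = ['low', 'medium', 'high', 'critical'];
const actionable = all.filter((s) => meetsMinSeverity(s));
console.log(actionable); // [ 'medium', 'high', 'critical' ]
```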

Important: the watcher classifies, the fixer trusts. When the watcher relays an error to the fixer, it sends the severity and occurrence count along with it. The fixer doesn't re-classify. This avoids a nasty bug where the fixer would start its own count from scratch and occurrences would need to accumulate a second time before the error reached the same severity.

Step 4: Git Worktree Isolation

This is the part that makes the whole thing safe. When a fix is approved, the system creates a git worktree:

/repo/.worktrees/hotfix/auto-<fingerprint>/

A worktree is a full working copy of the repo on a separate branch. Claude can read, write, edit, run tests — whatever it needs. If the fix is wrong, you delete the worktree and nothing happened. Main branch is never touched.
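For reference, here's a sketch of the git commands involved. The path layout follows the one above; the branch name hotfix/auto-<fingerprint>, the main base ref, and the planWorktree helper are assumptions for illustration. The function only builds argument lists, which you would pass to something like execFile('git', args).

```typescript
interface WorktreePlan {
  branch: string;
  path: string;
  addArgs: string[];
  removeArgs: string[];
}

// Hypothetical helper: builds the git arguments for one fix attempt.
function planWorktree(
  repoDir: string,
  fingerprint: string,
  baseRef = 'main'
): WorktreePlan {
  // Fingerprints are 16 hex chars, so reject anything else before it
  // ends up in a path or branch name.
  if (!/^[0-9a-f]{16}$/.test(fingerprint)) {
    throw new Error(`unexpected fingerprint: ${fingerprint}`);
  }
  const branch = `hotfix/auto-${fingerprint}`; // assumed branch naming
  const path = `${repoDir}/.worktrees/${branch}`;
  return {
    branch,
    path,
    // git -C <repo> worktree add -b <new-branch> <path> <commit-ish>
    addArgs: ['-C', repoDir, 'worktree', 'add', '-b', branch, path, baseRef],
    // git -C <repo> worktree remove --force <path>
    removeArgs: ['-C', repoDir, 'worktree', 'remove', '--force', path],
  };
}

const plan = planWorktree('/repo', '3f9a1c0b7d2e4a61');
console.log(plan.path); // /repo/.worktrees/hotfix/auto-3f9a1c0b7d2e4a61
```

One thing to keep in mind: removing a worktree leaves its branch behind, so a discarded fix also needs a git branch -D on the hotfix branch.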

Early versions of this didn't have worktree isolation. Claude was committing to main at 3am. I woke up to some interesting git histories. Never again.

Step 5: Claude Code Headless

Here's where the AI comes in. The fixer launches Claude Code CLI in headless mode with a scoped prompt:

Fix this production error in the codebase.
Error: MongoServerError: connection pool closed
Stack: at MongoClient.connect (mongo-client.ts:88)
Path: POST /api/products/list
Severity: CRITICAL

Claude gets access to the full repo through the worktree and a set of tools — Read, Write, Edit, Glob, Grep, Bash. It explores the codebase, traces the error, and writes a fix.

The scoping is crucial. Early on I tried just saying "fix my app" and the quality was terrible. Giving Claude the exact error, stack trace, and affected endpoint makes a huge difference. It knows exactly where to start looking.
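A scoped prompt like the one above is easy to assemble from the relayed error. This is a sketch, not the actual LevAutoFix code: ErrorReport, buildFixPrompt, and buildClaudeArgs are hypothetical names, and the -p and --allowedTools flags are my assumption about the CLI invocation, so check claude --help for your installed version.

```typescript
interface ErrorReport {
  message: string;
  stackTop: string;
  path: string;
  severity: 'low' | 'medium' | 'high' | 'critical';
}

// Builds the same five-line prompt shown above.
function buildFixPrompt(e: ErrorReport): string {
  return [
    'Fix this production error in the codebase.',
    `Error: ${e.message}`,
    `Stack: ${e.stackTop}`,
    `Path: ${e.path}`,
    `Severity: ${e.severity.toUpperCase()}`,
  ].join('\n');
}

// Assumed flags: -p for non-interactive (print) mode, --allowedTools
// to restrict the tool set to the ones listed in the post.
function buildClaudeArgs(prompt: string): string[] {
  return ['-p', prompt, '--allowedTools', 'Read,Write,Edit,Glob,Grep,Bash'];
}

const prompt = buildFixPrompt({
  message: 'MongoServerError: connection pool closed',
  stackTop: 'at MongoClient.connect (mongo-client.ts:88)',
  path: 'POST /api/products/list',
  severity: 'critical',
});
const claudeArgs = buildClaudeArgs(prompt);
console.log(prompt);
```

The args would then go to a child process started with the worktree directory as its cwd, so Claude only ever sees the isolated copy.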

Honest results so far:

Critical infra errors (DB connections, auth): Claude gets roughly 70-80% of them right
Logic bugs with clear stack traces: pretty solid
Vague errors without good stacks: hit or miss

Step 6: Telegram Approval

Every fix goes through human approval. The bot sends a message with the error details and two buttons:

🚨 New Error — Approve Fix?

🔴 Severity: CRITICAL
Service: PA
Error: MongoServerError: connection pool closed
Path: POST /api/products/list

[✅ Approve Fix]  [❌ Skip]

I tap Approve from my phone and the system creates a PR from the worktree branch. If I tap Skip, the error gets shelved.
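The two buttons map onto Telegram's inline keyboard format. A minimal sketch, assuming the fingerprint travels in callback_data with an action prefix; that encoding and the function names are hypothetical.

```typescript
type Action = 'approve' | 'skip';

// Builds the reply_markup object for the approval message.
// Telegram limits callback_data to 64 bytes; "approve:" plus a
// 16-char fingerprint is 24 bytes, so it fits comfortably.
function approvalKeyboard(fingerprint: string) {
  return {
    inline_keyboard: [
      [
        { text: '✅ Approve Fix', callback_data: `approve:${fingerprint}` },
        { text: '❌ Skip', callback_data: `skip:${fingerprint}` },
      ],
    ],
  };
}

// Parses callback_data back into an action, rejecting anything malformed.
function parseCallback(
  data: string
): { action: Action; fingerprint: string } | null {
  const m = /^(approve|skip):([0-9a-f]{16})$/.exec(data);
  return m ? { action: m[1] as Action, fingerprint: m[2] } : null;
}

const keyboard = approvalKeyboard('3f9a1c0b7d2e4a61');
const tapped = parseCallback(keyboard.inline_keyboard[0][0].callback_data);
console.log(tapped); // { action: 'approve', fingerprint: '3f9a1c0b7d2e4a61' }
```

With node-telegram-bot-api, the keyboard goes into the reply_markup option of sendMessage, and the taps arrive as callback_query events whose data field you feed to parseCallback.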

I also built an interactive dashboard so I can check the overall system status without typing commands:

🏠 LevAutoFix Dashboard

[📋 Queue Status]  [🚨 Recent Errors]
[📊 System Status] [🔄 Refresh]

Each button edits the same message in-place with the requested view. No chat flooding. The errors view looks like:

🔴 [PA] MongoServerError: connection pool closed...
   🔧 fixing • 5m ago

🟠 [PA] jwt secret undefined - authentication broken...
   ⏳ detected • 12m ago

🟡 [GA] Cannot read property tenantId of undefined
   ✅ fixed • 2h ago
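Rendering that view is mostly string formatting. Here's a rough sketch; the icon mappings are inferred from the sample above, and the type and function names are hypothetical.

```typescript
type Severity = 'critical' | 'high' | 'medium';
type Status = 'fixing' | 'detected' | 'fixed';

// Icon mappings inferred from the sample dashboard output.
const SEVERITY_ICON: Record<Severity, string> = {
  critical: '🔴',
  high: '🟠',
  medium: '🟡',
};
const STATUS_ICON: Record<Status, string> = {
  fixing: '🔧',
  detected: '⏳',
  fixed: '✅',
};

// Formats an elapsed time in ms as "5m ago", "2h ago", or "3d ago".
function formatAgo(elapsedMs: number): string {
  const minutes = Math.floor(elapsedMs / 60_000);
  if (minutes < 60) return `${minutes}m ago`;
  const hours = Math.floor(minutes / 60);
  return hours < 24 ? `${hours}h ago` : `${Math.floor(hours / 24)}d ago`;
}

interface ErrorRow {
  severity: Severity;
  service: string;
  message: string;
  status: Status;
  seenAt: number; // epoch ms
}

function renderErrorRow(e: ErrorRow, now: number): string {
  return (
    `${SEVERITY_ICON[e.severity]} [${e.service}] ${e.message}\n` +
    `   ${STATUS_ICON[e.status]} ${e.status} • ${formatAgo(now - e.seenAt)}`
  );
}

const now = Date.now();
const row = renderErrorRow(
  {
    severity: 'critical',
    service: 'PA',
    message: 'MongoServerError: connection pool closed...',
    status: 'fixing',
    seenAt: now - 5 * 60_000,
  },
  now
);
console.log(row);
```

To update the dashboard in place, the joined rows would go to the bot's editMessageText call along with the chat_id and message_id of the original dashboard message.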

The Stack

Nothing exotic:

TypeScript + Express: API server for both watcher and fixer
MongoDB: error events, fix queue, metadata
node-telegram-bot-api: bot with inline keyboards and callback handlers
Claude Code CLI: headless mode for automated code generation
Git worktrees: isolation for each fix attempt

What I Learned

The settle window is everything. Without it, cascade failures generate dozens of duplicate fix attempts. 30 seconds of patience saves hours of cleanup.

Scope the AI prompt tightly. Don't say "fix my app." Give it the exact error, the exact stack, the exact endpoint. The difference in fix quality is night and day.

Isolate with worktrees. Let the AI experiment freely in a sandbox. If it works, merge it. If it doesn't, delete it. Zero risk to your main branch.

Keep the watcher dumb. The production component should be as simple as possible. Tail logs, classify, relay. All the complex stuff happens on the dev machine.

Human in the loop matters. Auto-fixing without approval sounds cool until Claude "fixes" your auth middleware at 3am. The Telegram approval step takes 2 seconds and has saved me multiple times.

What's Next

Auto-merge for low-risk fixes (small diff + tests pass + no sensitive files touched)
Web dashboard for fix history analytics
Open sourcing the watcher component, since it's generic enough to work with any Node.js app
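The auto-merge idea from the list above could start as a simple predicate over the fix's diff. Nothing here exists yet: the 20-line limit, the sensitive-path patterns, and all names are placeholders I picked for illustration.

```typescript
interface FixSummary {
  changedFiles: string[];
  linesChanged: number;
  testsPassed: boolean;
}

// Hypothetical patterns for files that should always get a human review.
const SENSITIVE_PATTERNS = [/(^|\/)\.env/, /auth/i, /migrations?\//i];

// Small diff + tests pass + no sensitive files touched.
function isLowRisk(fix: FixSummary, maxLines = 20): boolean {
  const touchesSensitive = fix.changedFiles.some((file) =>
    SENSITIVE_PATTERNS.some((pattern) => pattern.test(file))
  );
  return fix.testsPassed && fix.linesChanged <= maxLines && !touchesSensitive;
}

console.log(
  isLowRisk({
    changedFiles: ['src/products/list.ts'],
    linesChanged: 6,
    testsPassed: true,
  })
); // true
console.log(
  isLowRisk({
    changedFiles: ['src/middleware/auth.ts'],
    linesChanged: 6,
    testsPassed: true,
  })
); // false: touches an auth file
```

Anything that fails the check would fall back to the Telegram approval flow.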

If you're interested in trying something similar, the core idea is surprisingly simple: tail logs → fingerprint → settle → classify → worktree → AI → approve. Each piece is straightforward on its own. The magic is in how they connect.

The whole thing will be free and open source (it just needs a Claude subscription). I'm planning to put the repo on GitHub soon.

Happy to answer questions about any part of the setup.

How We Built an AI Product Manager That Actually Learns Your Team's Templates

Prodini Admin — Sat, 28 Feb 2026 05:40:06 +0000

Writing PRDs shouldn't feel like punishment.

If you're a product manager, you know the drill: spend 4 hours writing a PRD, share it with the team, get told "that's not our format," then spend another hour reformatting.

Generic AI tools make it worse — they produce outputs that sound good but completely miss your team's conventions, terminology, and documentation standards.

The Problem with Generic AI for Product Management

We tried every AI assistant on the market. The results were consistently the same:

Generic structure that doesn't match our templates
Hallucinated edge cases that waste engineering time
No awareness of past decisions or product context
Constant re-explaining of how our team works

Our Approach: RAG + Integration-First Architecture

We built Prodini with a fundamentally different approach. Instead of prompt engineering, we use Retrieval-Augmented Generation (RAG) to ingest your actual documentation:

Connect your tools — Jira, Confluence, Figma, GitHub
Learn your templates — Prodini analyzes your existing PRDs, guidelines, and writing style
Generate in context — Every output is grounded in YOUR documentation

The result? PRDs that match your team's format from the first draft. No reformatting. No re-explaining.

Edge Case Detection — My Favorite Feature

Here's what keeps me up at night as a builder: edge cases that slip through planning and explode in production.

Prodini analyzes your requirements and automatically flags:

Missing user flow scenarios
Potential conflicts with existing features
Error states nobody thought about
Permission and access control gaps
Integration edge cases between connected systems

In our testing, it consistently catches issues that even senior PMs with 10+ years of experience miss.

The Technical Stack

For the curious:

RAG Pipeline — Ingests and indexes Jira tickets, Confluence pages, Figma designs, and GitHub repos per tenant
LLM Layer — Claude AI for generation with context-aware prompting
MCP Integration — Model Context Protocol for real-time Jira bi-directional sync
SSE Streaming — Real-time agentic chat with file attachment support
Multi-tenant — Complete data isolation per organization

Results

After rolling this out to 700+ product managers:

16x faster PRD creation (15 min vs 4+ hours)
94% edge case coverage detected automatically
Zero reformatting — matches your template from the first draft
Instant Q&A — "What changed last sprint?" answered in under 5 seconds

What's Next

We recently shipped Agentic Chat — an autonomous AI mode where Prodini doesn't just answer questions, it takes actions. Upload a file, ask it to analyze your competitor's PRD, or have it review your sprint plan for gaps.

We're also building based on direct user feedback. Our users literally tell our AI "I wish Prodini could..." and we build it within 5 days. That's our promise.

Try It Free

We're currently in free beta with 250 credits/month, full access to all features, and all integrations included.

Try Prodini →

I'd love to hear from other PMs — what's your biggest pain point with PRD writing? Drop a comment below.

DEV Community: Prodini Admin