We shipped node-loop-detective v1.1.0 with a feature we were proud of: slow async I/O tracking. It could monkey-patch a running Node.js process, track slow HTTP requests, DNS lookups, and TCP connections, then report them with caller stack traces. All without restarting the target process.
Then we opened 16 GitHub issues against our own code. Seven of them were bugs. Not edge cases. Not "nice to haves." Real bugs that could affect users in production.
This is the story of those 7 bugs, what caused them, and what we learned fixing them. If you build developer tools — especially ones that attach to production processes — these patterns will look familiar.
Bug #1: We Left Our Fingerprints Behind
The problem: When node-loop-detective disconnects from a target process, it should leave no trace. But our I/O tracking feature monkey-patched http.request, https.request, dns.lookup, and net.Socket.connect — and never restored the originals.
// What cleanup looked like before
cleanup: () => {
delete globalThis.__loopDetectiveIO;
// That's it. The patched functions stay forever.
}
After loop-detective disconnected, every HTTP request, every DNS lookup, every TCP connection in the target process was still running through our wrapper functions. The wrappers would check a threshold and find nothing to record (the buffer was gone), but the overhead of the function call, the Date.now(), the stack trace capture — all still happening. On every single I/O operation. Forever.
For a diagnostic tool that promises "zero impact on your running app," this was a serious violation of trust.
The fix: Store references to all original functions before patching, and restore them on cleanup.
const originals = {};
function patchHttp(modName) {
const mod = require(modName);
const origRequest = mod.request;
const origGet = mod.get;
originals[modName + '.request'] = { mod, key: 'request', fn: origRequest };
originals[modName + '.get'] = { mod, key: 'get', fn: origGet };
// ... apply patches
}
// On cleanup
cleanup: () => {
for (const entry of Object.values(originals)) {
entry.mod[entry.key] = entry.fn;
}
delete globalThis.__loopDetectiveIO;
}
Lesson: If you modify something, you own the responsibility of unmodifying it. "Best effort cleanup" is not cleanup.
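The restore pattern generalizes beyond this one library. A minimal sketch of the idea, with hypothetical names (`patchMethod` is not part of node-loop-detective): every patch returns its own undo function, and cleanup just runs them all.

```javascript
// Hypothetical helper: patch a method and return a function that restores it.
function patchMethod(obj, key, makeWrapper) {
  const original = obj[key];
  obj[key] = makeWrapper(original);
  // The restore function closes over the original reference.
  return function restore() {
    obj[key] = original;
  };
}

// Usage: collect restore functions as you patch, run them all on cleanup.
const target = { greet: (name) => `hello, ${name}` };
const restores = [];
restores.push(patchMethod(target, 'greet', (orig) => function (...args) {
  return orig.apply(this, args).toUpperCase(); // observe/modify around the original
}));

const patched = target.greet('world');  // wrapper active
restores.forEach((restore) => restore());
const restored = target.greet('world'); // original behavior back
```

Pairing each patch with its undo at creation time means cleanup cannot forget a patch: the set of restore functions is, by construction, exactly the set of modifications made.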
Bug #2: The Silent Death of Watch Mode
The problem: Watch mode (--watch) runs profiling cycles in a loop. Each cycle captures a CPU profile, analyzes it, and emits the results. The loop looked like this:
async _watchMode() {
const runCycle = async () => {
const profile = await this._captureProfile(this.config.duration);
const analysis = this.analyzer.analyzeProfile(profile);
this.emit('profile', analysis);
if (this._running) {
setTimeout(runCycle, 1000);
}
};
runCycle(); // No await. No .catch().
}
Two problems here. First, runCycle() returns a Promise, but nobody awaits it or catches its rejection. If _captureProfile throws (inspector disconnected, target process crashed, CDP timeout), the error vanishes into the void. Node.js might print an UnhandledPromiseRejection warning, but the user sees nothing useful.
Second, the setTimeout(runCycle, 1000) callback also doesn't handle rejections. If one cycle fails, the entire watch loop dies silently. The user sees the tool just... stop producing output. No error message. No exit. Just silence.
The fix: Wrap each cycle in try/catch, emit errors, and keep going.
const runCycle = async () => {
if (!this._running) return;
try {
const profile = await this._captureProfile(this.config.duration);
const analysis = this.analyzer.analyzeProfile(profile);
this.emit('profile', analysis);
} catch (err) {
this.emit('error', err);
}
if (this._running) {
setTimeout(runCycle, 1000);
}
};
await runCycle();
Lesson: Every async function that isn't awaited is a potential silent failure. In long-running loops, this is especially dangerous because the failure mode is "nothing happens" — the hardest kind of bug to notice.
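The fixed loop can be distilled into a reusable shape. This is an illustrative sketch, not the node-loop-detective API; `runCycleLoop` and its parameters are made-up names.

```javascript
// A loop runner that survives failing cycles: every rejection is routed to
// onError, and the next cycle is scheduled regardless.
function runCycleLoop(cycle, onError, { delayMs = 1000, isRunning } = {}) {
  const tick = async () => {
    if (!isRunning()) return;
    try {
      await cycle();            // any rejection lands in the catch below
    } catch (err) {
      onError(err);             // surfaced, not swallowed
    }
    if (isRunning()) setTimeout(tick, delayMs);
  };
  tick().catch(onError);        // nothing escapes as an unhandled rejection
}
```

The key property: a thrown error in one cycle changes what the user sees (an error event) but never whether the next cycle runs.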
Bug #3: The Double-Stop Race Condition
The problem: stop() could be called from two places simultaneously:
- The SIGINT handler: process.on('SIGINT', async () => { await detective.stop(); })
- The finally block in _singleRun(): finally { await this.stop(); }

If the user presses Ctrl+C during profiling, both paths fire. The first stop() starts disconnecting the inspector. The second stop() tries to disconnect it again while the first is still in progress. This could cause:
- Double disconnected event emission
- Attempting to send CDP cleanup commands on an already-closed WebSocket
- Unhandled errors from the second disconnect attempt
The fix: A simple _stopping guard flag.
async stop() {
if (this._stopping) return; // Already stopping
this._stopping = true;
this._running = false;
// ... cleanup logic
}
Lesson: Any method that performs async cleanup and can be triggered by external events (signals, callbacks, error handlers) must be idempotent. This is especially true for stop()/close()/destroy() methods.
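A boolean flag works, but caching the in-flight promise goes one step further: a second caller not only skips the duplicate work, it can still await completion of the first call. A minimal sketch with a hypothetical class (not the actual node-loop-detective implementation):

```javascript
// Idempotent async stop(): the first call does the work, concurrent callers
// await the same in-flight promise.
class Stoppable {
  constructor() {
    this._stopPromise = null;
    this.stopCount = 0; // for illustration: how many times cleanup ran
  }
  stop() {
    if (!this._stopPromise) {
      this._stopPromise = this._doStop(); // cleanup runs exactly once
    }
    return this._stopPromise;             // everyone awaits the same promise
  }
  async _doStop() {
    this.stopCount += 1;
    // ... real cleanup would go here: disconnect, restore patches, clear timers
  }
}
```

With this shape, `await detective.stop()` from a SIGINT handler and from a finally block both resolve when cleanup actually finishes, no matter which fired first.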
Bug #4: 30,000 Timers That Never Die
The problem: Every CDP command sent through the Inspector had a 30-second timeout:
async send(method, params = {}) {
const id = ++this._id;
return new Promise((resolve, reject) => {
this._callbacks.set(id, { resolve, reject });
this.ws.send(JSON.stringify({ id, method, params }));
setTimeout(() => {
if (this._callbacks.has(id)) {
this._callbacks.delete(id);
reject(new Error(`CDP command timeout: ${method}`));
}
}, 30000);
});
}
The timeout timer was created, but its handle was never stored. When a response arrived, the callback was resolved and removed from the map, but the timer kept ticking for up to 30 seconds. When it finally fired, it checked this._callbacks.has(id), found nothing, and did nothing. Harmless? Mostly. But:
- Each pending timer holds a reference to the closure, preventing garbage collection
- On disconnect(), _callbacks.clear() ran, but all the timers were still pending
- If disconnect() raced with a timeout, the Promise could end up never settling
Over a long profiling session with many CDP commands, this meant hundreds of orphaned timers.
The fix: Store the timer reference, clear it on response, and reject all pending callbacks on disconnect.
async send(method, params = {}) {
const id = ++this._id;
return new Promise((resolve, reject) => {
const timer = setTimeout(() => {
if (this._callbacks.has(id)) {
this._callbacks.delete(id);
reject(new Error(`CDP command timeout: ${method}`));
}
}, 30000);
this._callbacks.set(id, { resolve, reject, timer });
this.ws.send(JSON.stringify({ id, method, params }));
});
}
// On message received
const { resolve, reject, timer } = this._callbacks.get(msg.id);
clearTimeout(timer);
// On disconnect
for (const { reject, timer } of this._callbacks.values()) {
clearTimeout(timer);
try { reject(new Error('Inspector disconnected')); } catch {}
}
Lesson: Every setTimeout should have a corresponding clearTimeout. If you're creating timers in a loop or per-request, you need a cleanup strategy. This is a common source of memory leaks in long-running Node.js processes — ironic for a tool that diagnoses such problems.
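The same pair-every-timer-with-a-clear discipline can be packaged as a small utility. This is a general-purpose sketch of the pattern, not code from the library; `withTimeout` is a hypothetical name.

```javascript
// Race a promise against a timeout, and clear the timer no matter which side
// wins, so no orphaned setTimeout outlives the operation.
function withTimeout(promise, ms, message) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(message)), ms);
  });
  // finally runs on both resolve and reject, so the timer is always cleaned up.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

The `finally` is the whole point: without it, a fast response still leaves a live 30-second timer behind, which is exactly the leak described above.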
Bug #5: The Wrong Call Stack
The problem: The Analyzer builds call stacks for the top CPU-heavy functions. To do this, it needs to find the corresponding node in the V8 profile's call tree. The original code searched by matching function name, URL, and line number:
for (const node of nodeMap.values()) {
const cf = node.callFrame;
if (cf.functionName === fn.functionName &&
cf.url === fn.url &&
cf.lineNumber === fn.lineNumber - 1) {
targetNode = node;
break; // Takes the FIRST match
}
}
This breaks when:
- Two functions have the same name at the same line (minified code)
- A function is called from multiple call sites (same function, different positions in the call tree)
- Recursive functions appear multiple times in the tree
The break on first match means we might pick the wrong node, producing a call stack that traces back to the wrong call site.
The fix: The timings map already uses node IDs as keys. Pass the node ID through to the call stack builder instead of re-searching by name.
heavyFunctions.push({
nodeId, // Carry the V8 profile node ID
functionName: functionName || '(anonymous)',
// ...
});
// In _buildCallStacks
const targetNode = nodeMap.get(fn.nodeId); // Direct lookup, no searching
Lesson: When you have a unique identifier, use it. String matching is fragile. This is the same principle as using database primary keys instead of searching by name.
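A toy illustration of why the ID lookup is the right call. The data here is made up, but the ambiguity is exactly the minified-code case described above: two distinct profile nodes with identical name and line.

```javascript
// Two different call-tree nodes that look identical by name and line.
const nodeMap = new Map([
  [1, { callFrame: { functionName: 'handler', lineNumber: 10 } }],
  [2, { callFrame: { functionName: 'handler', lineNumber: 10 } }],
]);

// Name-based search cannot tell them apart; it always returns the first match.
function findByName(name, line) {
  for (const node of nodeMap.values()) {
    const cf = node.callFrame;
    if (cf.functionName === name && cf.lineNumber === line) return node;
  }
}

const byName = findByName('handler', 10); // first match, even if we meant node 2
const byId = nodeMap.get(2);              // exactly the node we meant, O(1)
```

The search loop is not just slower; on this input it is wrong half the time, and no amount of extra matching criteria fixes that when the frames are genuinely identical.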
Bug #6: The Stack Trace That Lies
The problem: When the lag detector fires, it captures a JavaScript stack trace:
const timer = setInterval(() => {
const lag = now - lastTime - interval;
if (lag > threshold) {
lags.push({ lag, timestamp: now, stack: captureStack() });
}
}, interval);
The intent was to show "what code was running when the event loop was blocked." But setInterval callbacks fire after the blocking code has finished. By the time our timer runs, the blocking function has returned, the call stack has unwound, and we're back in the event loop's timer phase.
The captured stack trace shows the setInterval callback's own stack — internal Node.js timer machinery — not the blocking code. After filtering out node:internal frames, the result is often empty or misleading.
The fix: This is a fundamental limitation of the setInterval approach, not a code bug. We can't fix it without a completely different detection mechanism (like Debugger.pause via CDP, which has its own tradeoffs). So we documented it clearly:
/**
* Note on stack traces (Issue #6): The setInterval callback fires AFTER
* blocking code has finished, so captureStack() captures the timer's own
* stack, not the blocking code's stack. The lag event stacks are best-effort
* context. For accurate blocking code identification, use the CPU profile
* analysis (heavyFunctions + callStacks) which is based on V8 sampling.
*/
Lesson: Sometimes the honest fix is documentation. If a feature has an inherent limitation, say so clearly instead of letting users discover it through confusion. The CPU profile analysis already provides accurate blocking code identification — we just needed to make it clear that lag event stacks serve a different (and less reliable) purpose.
Bug #7: The http.get That Wasn't
The problem: Our patched http.get reimplemented the function instead of wrapping it:
mod.get = function patchedGet(...args) {
const req = mod.request(...args); // Call our patched request
req.end(); // Manually end it
return req;
};
The real http.get has specific argument normalization logic. It handles (url, options, callback) with various overloads. Our version passed everything to mod.request and called req.end(), which could break if:
- A library passed arguments that http.get normalizes differently than http.request
- The callback was handled differently between the two functions
- req.end() was called at the wrong time relative to the callback setup
The fix: Wrap the original http.get directly instead of reimplementing it.
mod.get = function patchedGetWithTiming(...args) {
const startTime = Date.now();
const callerStack = captureCallerStack();
const req = origGet.apply(this, args); // Call the ORIGINAL http.get
req.on('response', (res) => {
const duration = Date.now() - startTime;
if (duration >= threshold) {
recordSlowOp({ /* ... */ });
}
});
return req;
};
Lesson: When monkey-patching, wrap the original function — don't reimplement it. You'll never perfectly replicate all the edge cases of the original. The safest patch is: call the original, observe the result, add your behavior around it.
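The "call the original, observe around it" rule can be expressed as a tiny generic wrapper. This is an illustrative sketch with a hypothetical `observe` helper, not the library's API.

```javascript
// Delegate every call to the original function and hang observation off the
// result; arguments, `this`, and the return value pass through untouched.
function observe(original, onCall) {
  return function observed(...args) {
    const start = Date.now();
    const result = original.apply(this, args); // the original does ALL the work
    onCall({ args, result, duration: Date.now() - start });
    return result;                             // behavior is unchanged
  };
}

const calls = [];
const add = (a, b) => a + b;
const trackedAdd = observe(add, (info) => calls.push(info));
const sum = trackedAdd(2, 3); // same result as calling add(2, 3) directly
```

Because the wrapper never inspects or normalizes the arguments itself, it inherits every overload and edge case of the original for free, which is precisely what the reimplemented http.get failed to do.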
What We Learned
Seven bugs in a tool that's supposed to help you find bugs. Humbling, but instructive. The patterns that emerged:
Clean up after yourself. If you modify state (monkey-patches, timers, event listeners), you must have a complete reversal path. "Best effort" cleanup is a code smell.
Every async path needs error handling. Especially in loops. Especially when triggered by setTimeout. The failure mode of an unhandled async error is silence, which is worse than a crash.
Cleanup methods must be idempotent. If stop() can be called from multiple code paths, it will be. Guard against double execution.
Every timer needs a cleanup path. setTimeout without clearTimeout is a resource leak. In long-running processes, these add up.
Use unique identifiers, not string matching. When you have an ID, use it. Name-based matching is fragile and will break on edge cases.
Document limitations honestly. Not every problem has a code fix. Sometimes the right answer is clear documentation that sets correct expectations.
Wrap, don't reimplement. When patching existing functions, delegate to the original. Your reimplementation will miss edge cases.
These aren't novel insights. They're the basics. But the basics are what break in production.
The fixes are live in v1.2.0.