You're profiling a Node.js process that's been crashing intermittently. You attach loop-detective, start a 60-second capture, and wait. Thirty seconds in, the process crashes again.
What should happen next?
Before v1.7.0, the answer was: nothing good. loop-detective would hang waiting for a CDP response that would never come, or exit with a cryptic WebSocket error. Any lag events and slow I/O data collected during those first 30 seconds? Gone.
This is the story of how we made loop-detective handle the worst-case scenario — the target process dying mid-profiling — and why the solution required changes across every layer of the tool.
## The Failure Mode
To understand the problem, you need to know how profiling works internally:
1. Connect to the inspector WebSocket
2. Send `Profiler.start`
3. Sleep for `<duration>` seconds ← target can exit here
4. Send `Profiler.stop` ← this never gets a response
5. Analyze and report
Step 3 is a `setTimeout` wrapped in a Promise. Step 4 is a CDP command sent over the WebSocket. If the target process exits during step 3, the WebSocket closes. When step 4 tries to send `Profiler.stop`, it throws "Inspector not connected." The error bubbles up, and the tool exits without reporting anything useful.
But the real loss is the data. During those 30 seconds before the crash, the lag detector was running. The I/O tracker was recording slow HTTP calls. That data exists in memory — it just never gets reported because the profiling pipeline assumes it will complete normally.
## The Fix: Three Layers

### Layer 1: Detect the Exit Immediately
The WebSocket `close` event fires when the target process exits. Previously, this just emitted a `disconnected` event. Now it also rejects all pending CDP callbacks and emits a `targetExit` event:
```javascript
// inspector.js
this.ws.on('close', () => {
  // Reject all pending callbacks — target is gone
  for (const { reject, timer } of this._callbacks.values()) {
    clearTimeout(timer);
    try { reject(new Error('Target process exited')); } catch {}
  }
  this._callbacks.clear();
  this.emit('disconnected');
});
```
This is critical. Without rejecting pending callbacks, any in-flight CDP command (like Profiler.stop) would hang until its 30-second timeout. That's 30 seconds of the user staring at a frozen terminal.
### Layer 2: Cancel the Sleep
The profiling duration is implemented as a sleep:
```javascript
await this._sleep(duration);
```
If the target exits 10 seconds into a 60-second profile, we don't want to wait the remaining 50 seconds. The sleep needs to be cancellable.
```javascript
_sleep(ms) {
  return new Promise((resolve, reject) => {
    this._sleepReject = reject;
    const timer = setTimeout(() => {
      this._sleepReject = null;
      resolve();
    }, ms);
    if (timer.unref) timer.unref();
  });
}
```
When the `disconnected` event fires, the `Detective` cancels the sleep:
```javascript
this.inspector.on('disconnected', () => {
  if (!this._stopping) {
    this._targetExited = true;
    this.emit('targetExit', {
      pid,
      message: `Target process (PID ${pid}) exited during profiling`,
    });
    // Cancel the sleep so _captureProfile can finish
    if (this._sleepReject) {
      this._sleepReject(new Error('Target process exited'));
      this._sleepReject = null;
    }
  }
});
```
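Pulled out of context, the cancellable-sleep pattern looks like this. This is a standalone sketch; `Sleeper` and `cancel` are illustrative names, not the tool's API:

```javascript
class Sleeper {
  _sleep(ms) {
    return new Promise((resolve, reject) => {
      this._sleepReject = reject; // stash reject so an outside event can cancel us
      const timer = setTimeout(() => {
        this._sleepReject = null;
        resolve();
      }, ms);
      if (timer.unref) timer.unref(); // don't keep the process alive for this timer
    });
  }

  // Called from the 'disconnected' handler: settle the sleep early.
  cancel(reason) {
    if (this._sleepReject) {
      this._sleepReject(new Error(reason));
      this._sleepReject = null;
    }
  }
}
```

Note that after a cancel, the unref'd timer may still fire and call `resolve()` on an already-rejected Promise; that's a harmless no-op, since a Promise can only settle once.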
### Layer 3: Salvage What You Can

`_captureProfile` now handles the case where the sleep is cancelled and the profiler can't be stopped:
```javascript
async _captureProfile(duration) {
  await this.inspector.send('Profiler.enable');
  await this.inspector.send('Profiler.setSamplingInterval', { interval: 100 });
  await this.inspector.send('Profiler.start');

  try {
    await this._sleep(duration);
  } catch {
    // Sleep was cancelled (target exited) — try to stop the profiler anyway
  }

  try {
    const { profile } = await this.inspector.send('Profiler.stop');
    await this.inspector.send('Profiler.disable');
    return profile;
  } catch {
    // Target exited — no profile data available
    return null;
  }
}
```
The callers check for `null`:

```javascript
const profile = await this._captureProfile(this.config.duration);
if (profile) {
  const analysis = this.analyzer.analyzeProfile(profile);
  this.emit('profile', analysis, profile);
}
// If null, targetExit event was already emitted
```
The key insight: even though we can't get the CPU profile (it's inside the dead process), the lag events and slow I/O events were already emitted in real-time during the profiling period. The user has already seen them scroll by in the terminal. They're not lost.
## What the User Sees
Before v1.7.0:

```
✔ Connected to Node.js process
Profiling for 60s with 50ms lag threshold...
⚠ Event loop lag: 245ms at 2025-03-15T14:23:45.123Z
🌐 Slow HTTP: 1820ms GET api.example.com/users → 200
<hangs for 30 seconds, then>
✖ Error: Inspector not connected
```
After v1.7.0:

```
✔ Connected to Node.js process
Profiling for 60s with 50ms lag threshold...
⚠ Event loop lag: 245ms at 2025-03-15T14:23:45.123Z
🌐 Slow HTTP: 1820ms GET api.example.com/users → 200
✖ Target process (PID 12345) exited during profiling
  Any lag or I/O events collected before the exit are shown above.
✔ Disconnected cleanly
```
No hang. Clear message. The partial data (lag event + slow HTTP call) is visible above the exit message.
## Exit Codes
We introduced a three-code system:
| Code | Meaning |
|---|---|
| 0 | Success — profiling completed normally |
| 1 | Error — connection failed, invalid arguments, etc. |
| 2 | Target exited — process died during profiling |
This matters for automation. A monitoring script can distinguish between "profiling worked" (0), "loop-detective itself had a problem" (1), and "the target crashed" (2):
```bash
loop-detective 12345 -d 60
EXIT_CODE=$?

if [ $EXIT_CODE -eq 2 ]; then
  echo "Target process crashed during profiling — check partial output"
  # Trigger incident response
elif [ $EXIT_CODE -eq 1 ]; then
  echo "loop-detective failed to connect"
elif [ $EXIT_CODE -eq 0 ]; then
  echo "Profiling completed successfully"
fi
```
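On the Node side, the mapping might be centralized like this. This is a sketch; the constant and function names are mine, not from the codebase:

```javascript
// Exit-code contract from the table above.
const ExitCode = Object.freeze({
  SUCCESS: 0,       // profiling completed normally
  ERROR: 1,         // loop-detective itself failed
  TARGET_EXITED: 2, // target process died during profiling
});

function exitCodeFor({ targetExited, error }) {
  if (targetExited) return ExitCode.TARGET_EXITED;
  if (error) return ExitCode.ERROR;
  return ExitCode.SUCCESS;
}
```

Setting `process.exitCode = exitCodeFor(state)` at shutdown, rather than calling `process.exit()` directly, lets Node flush any pending stdout before terminating.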
## The Irony of Profiling Crashing Processes
There's an irony here: the processes most likely to exit during profiling are the ones you most need to profile. They're crashing. They're running out of memory. They're hitting unhandled exceptions. You attached loop-detective precisely because something is wrong.
This means the "target exits during profiling" scenario isn't an edge case — it's a primary use case. If your tool can't handle it gracefully, it fails exactly when it's needed most.
The partial data is often enough. If you see three slow HTTP calls to the same endpoint right before the crash, that's a strong signal. If you see event loop lag spiking to 2 seconds, that tells you the process was CPU-bound before it died. You don't always need the full CPU profile to understand what happened.
## Watch Mode Behavior

In `--watch` mode, a target exit stops the watch loop instead of starting a new cycle:
```javascript
if (this._running && !this._targetExited) {
  setTimeout(runCycle, 1000);
}
```
There's no point retrying — the process is gone. The tool reports what it has and exits cleanly.
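A fuller sketch of the loop around that check; `profileOnce` is a hypothetical method name standing in for one capture/analyze/report cycle:

```javascript
function startWatch(detective, intervalMs = 1000) {
  const runCycle = async () => {
    await detective.profileOnce(); // one watch cycle
    // Reschedule only if we're still running AND the target is alive.
    if (detective._running && !detective._targetExited) {
      setTimeout(runCycle, intervalMs);
    }
  };
  runCycle();
}
```

Because the `_targetExited` flag is set by the `disconnected` handler, the loop winds down naturally on the next cycle boundary with no explicit "stop watching" call.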
## Programmatic API
For tool builders integrating loop-detective:
```javascript
const { Detective } = require('node-loop-detective');

const detective = new Detective({ pid: 12345, duration: 60000 });

// Collect events as they arrive
const events = { lags: [], slowIO: [] };
detective.on('lag', (data) => events.lags.push(data));
detective.on('slowIO', (data) => events.slowIO.push(data));

// Handle target exit
detective.on('targetExit', (data) => {
  console.log(data.message);
  console.log(`Collected ${events.lags.length} lag events and ` +
    `${events.slowIO.length} slow I/O events before exit`);
  // Send partial data to your monitoring system
  sendToMonitoring({ partial: true, ...events });
});

detective.on('profile', (analysis) => {
  // Full profile available — target didn't crash
  sendToMonitoring({ partial: false, analysis, ...events });
});

// Run inside an async function (or use ESM top-level await)
await detective.start();
```
## The Broader Lesson
Diagnostic tools need to be more resilient than the systems they diagnose. If your profiler crashes when the target crashes, you've lost the most valuable debugging moment. Every diagnostic tool that attaches to external processes should answer three questions:
- **What happens when the target exits?** Don't hang. Don't lose data. Report what you have.
- **How does the user know what happened?** Clear messages, distinct exit codes, structured output.
- **What data survives?** Real-time events (lag, I/O) survive because they were already emitted. Batch results (the CPU profile) may be lost. Design accordingly.
The fix in v1.7.0 is about 60 lines of code across three files. But it transforms the tool from one that fails at the worst possible moment into one that handles that moment gracefully.
```bash
npm install -g node-loop-detective@1.7.0
```