
Bill Tu

Why One Second Wasn't Enough: Adding Retry Logic to a Diagnostic Tool

For the first seven releases of node-loop-detective, connecting to a target process looked like this:

async _findInspectorPort() {
  await this._sleep(1000); // wait 1 second
  return 9229;
}

Send SIGUSR1. Wait one second. Try to connect. If it fails, give up.

This worked on our development machines. It worked in CI. It worked on lightly loaded staging servers. Then users started running it in production — on machines with 95% CPU utilization, inside containers with throttled resources, on servers handling 50,000 requests per second.

One second wasn't enough.

What Happens After SIGUSR1

When you send SIGUSR1 to a Node.js process, it triggers the V8 Inspector to start. This involves:

  1. The signal handler fires on the next event loop tick
  2. Node.js initializes the inspector agent
  3. A TCP server starts listening on port 9229
  4. The /json/list HTTP endpoint becomes available
  5. WebSocket connections are accepted

On an idle machine, this takes 10-50 milliseconds. On a machine where the event loop is already blocked (which is exactly when you're trying to diagnose it), step 1 might not happen for several seconds. The signal is queued, but the event loop has to process it.

This creates a paradox: the more you need the tool, the longer it takes to connect.

The Failure Mode

With a fixed 1-second wait, users on loaded systems saw:

✖ Error: Cannot connect to inspector at 127.0.0.1:9229.
  Is the Node.js inspector active? (connect ECONNREFUSED 127.0.0.1:9229)

The inspector was starting. It just wasn't ready yet. The user would run the command again, and it would work — because by then the inspector had finished initializing. But "run it again" is not an acceptable UX for a production diagnostic tool.

The Fix: Exponential Backoff

The solution is retry with exponential backoff. Instead of one attempt after a fixed delay, we make up to 5 attempts with increasing wait times:

| Attempt | Delay before attempt | Cumulative wait |
|--------:|---------------------:|----------------:|
| 1       | 500ms                | 500ms           |
| 2       | 1000ms               | 1.5s            |
| 3       | 2000ms               | 3.5s            |
| 4       | 4000ms               | 7.5s            |
| 5       | 4000ms               | 11.5s           |

The delay doubles each time, capped at 4 seconds. Total maximum wait before giving up: about 11.5 seconds.
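The schedule in the table falls out of a single formula. A quick standalone check of the numbers:

```javascript
// Recompute the backoff schedule: delay doubles from 500ms, capped at 4000ms.
const baseDelay = 500;
const maxDelay = 4000;
let cumulative = 0;
for (let attempt = 1; attempt <= 5; attempt++) {
  const delay = Math.min(baseDelay * 2 ** (attempt - 1), maxDelay);
  cumulative += delay;
  console.log(`attempt ${attempt}: delay ${delay}ms, cumulative ${cumulative}ms`);
}
// cumulative ends at 11500ms, i.e. ~11.5s
```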

async _connectWithRetry(host, port) {
  const maxRetries = this.config.inspectorPort ? 1 : 5;
  const baseDelay = 500;
  const maxDelay = 4000;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      this.inspector = new Inspector({ host, port });
      // ... set up event listeners
      await this.inspector.connect();
      return; // success
    } catch (err) {
      if (attempt === maxRetries) {
        throw new Error(
          `Failed to connect to inspector at ${host}:${port} ` +
          `after ${maxRetries} attempts. Last error: ${err.message}`
        );
      }
      // Clean up failed inspector
      if (this.inspector) {
        this.inspector.removeAllListeners();
        this.inspector = null;
      }
      const delay = Math.min(baseDelay * Math.pow(2, attempt - 1), maxDelay);
      this.emit('retry', { attempt, maxRetries, delay, error: err.message });
      await this._sleep(delay);
    }
  }
}
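The code above leans on a `_sleep` helper that the post doesn't show. It is presumably just a promise-wrapped `setTimeout`, something like:

```javascript
// Assumed implementation of _sleep (not shown in the original code):
// resolves after roughly ms milliseconds.
function _sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```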

Design Decisions

Why not just wait longer?

We could have changed _sleep(1000) to _sleep(5000). Problem solved, right?

No. On a fast machine, you'd wait 5 seconds every time for something that takes 50ms. The whole point of a diagnostic tool is speed — you're in the middle of an incident, every second counts. Exponential backoff gives you the best of both worlds: fast connection on healthy systems, patient retry on loaded ones.

On a typical machine, the first attempt at 500ms succeeds. The user never sees a retry. On a loaded machine, the second or third attempt usually succeeds. The user sees one or two retry messages and then gets connected.

Why 5 attempts?

Five attempts with our backoff schedule gives a maximum wait of ~11.5 seconds. This is long enough to handle even severely loaded systems (where the event loop might be blocked for 5-10 seconds), but short enough that a genuine failure (wrong PID, process already exited) doesn't leave the user waiting forever.

Why skip retry for --port?

When the user specifies --port, they're connecting to an inspector that's already open. There's no SIGUSR1, no startup delay. If the connection fails, it's because the inspector isn't there — retrying won't help.

const maxRetries = this.config.inspectorPort ? 1 : 5;

One attempt for --port. Five attempts for PID-based connections.

Why emit a retry event?

The CLI shows retry progress:

  Connecting to inspector... attempt 1/5 (retry in 500ms)
  Connecting to inspector... attempt 2/5 (retry in 1000ms)
✔ Connected to Node.js process

Without this feedback, the user would see nothing for up to 11 seconds. In an incident, silence is anxiety. The retry messages tell the user "I'm working on it, the system is just slow."

For the programmatic API, the retry event lets integrators log, alert, or implement their own timeout logic:

detective.on('retry', (data) => {
  logger.warn('Inspector connection retry', {
    attempt: data.attempt,
    maxRetries: data.maxRetries,
    delay: data.delay,
    error: data.error,
  });
});

The Cleanup Problem

There's a subtle issue with retry: each failed attempt creates an Inspector instance with event listeners. If we don't clean up, we accumulate orphaned listeners:

// Each attempt creates a new Inspector
this.inspector = new Inspector({ host, port });
this.inspector.on('disconnected', () => { ... });

// If connect() fails, we need to clean up before retrying
if (this.inspector) {
  this.inspector.removeAllListeners();
  this.inspector = null;
}

Without removeAllListeners(), the disconnected handler from attempt 1 would still be attached when attempt 2 creates a new Inspector. If the target later exits, both handlers would fire, causing duplicate targetExit events.

What Users See Now

On a fast machine (most cases):

✔ Connected to Node.js process
  Profiling for 10s with 50ms lag threshold...

No retry messages. Connection happens on the first attempt at 500ms — faster than the old fixed 1-second wait.

On a loaded machine:

  Connecting to inspector... attempt 1/5 (retry in 500ms)
  Connecting to inspector... attempt 2/5 (retry in 1000ms)
✔ Connected to Node.js process
  Profiling for 10s with 50ms lag threshold...

Two retries, then success. Total wait: about 1.5 seconds.

On a very loaded machine or wrong PID:

  Connecting to inspector... attempt 1/5 (retry in 500ms)
  Connecting to inspector... attempt 2/5 (retry in 1000ms)
  Connecting to inspector... attempt 3/5 (retry in 2000ms)
  Connecting to inspector... attempt 4/5 (retry in 4000ms)

✖ Error: Failed to connect to inspector at 127.0.0.1:9229 after 5 attempts.
  Last error: Cannot connect to inspector — Is the Node.js inspector active?

Clear feedback at every step. The final error message includes the attempt count and the specific failure reason.

The Broader Pattern

Retry with exponential backoff is one of the best-known patterns in distributed systems. But it's easy to forget when building CLI tools. We think of CLI tools as synchronous, immediate, local. "Run command, get result."

But a diagnostic tool that connects to another process over a network protocol (even localhost) is a distributed system. The target process is an independent actor with its own timing. The inspector startup is an asynchronous operation we don't control. Network connections can fail transiently.

The same principles apply:

  • Don't assume the first attempt will succeed. Especially under load.
  • Back off exponentially. Linear retry hammers a system that's already struggling.
  • Cap the backoff. Waiting 32 seconds between attempts is too long for an interactive tool.
  • Give feedback. The user needs to know the tool is working, not frozen.
  • Know when to stop. Five attempts is enough. If the inspector isn't up after 11 seconds, something else is wrong.
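These principles generalize beyond this one tool. A minimal reusable sketch of the pattern (not node-loop-detective's actual API; `retryWithBackoff` and its options are names invented for illustration):

```javascript
// Generic retry-with-backoff sketch: retries an async fn() with doubling,
// capped delays, reporting each retry through an onRetry callback.
async function retryWithBackoff(fn, {
  maxRetries = 5,
  baseDelay = 500,
  maxDelay = 4000,
  onRetry = () => {},
} = {}) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fn(attempt);              // don't assume attempt 1 succeeds
    } catch (err) {
      if (attempt === maxRetries) throw err; // know when to stop
      const delay = Math.min(baseDelay * 2 ** (attempt - 1), maxDelay); // capped backoff
      onRetry({ attempt, maxRetries, delay, error: err.message });      // give feedback
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Usage would look like `await retryWithBackoff(() => connectOnce(host, port), { onRetry: console.log })`, where `connectOnce` is whatever single-attempt operation you already have.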

Try It

npm install -g node-loop-detective@1.8.0

loop-detective <pid>

On most machines, you won't notice the change — it just connects slightly faster than before (500ms vs 1000ms). On loaded machines, it now works where it previously failed.

Source: github.com/iwtxokhtd83/node-loop-detective
