You're on call. Alerts fire. Your Node.js service is responding slowly — or not at all. You suspect event loop blocking, but the app is running in production. You can't redeploy with --inspect. You can't add require('blocked-at') to the source. You definitely can't restart it.
What do you do?
This is the problem that led me to build node-loop-detective — a diagnostic tool that attaches to a running Node.js process, detects event loop blocking and lag, and tells you exactly which function in which file is causing it. Zero code changes. Zero restarts.
In this article, I'll walk through the problem, the approach, and the internals of how it works.
The Event Loop Problem
Node.js runs JavaScript on a single thread. The event loop is the scheduler — it picks up callbacks, timers, I/O completions, and runs them one at a time. When one of those callbacks takes too long, everything else waits.
This is "event loop blocking," and it's one of the most common performance killers in Node.js applications. Common causes include:
- A `JSON.parse()` on a 50MB payload
- A regex with catastrophic backtracking
- A `readFileSync()` that someone left in a request handler
- A tight loop doing CPU-intensive computation
- Excessive garbage collection from rapid object allocation
The tricky part: these issues often only appear under production load. Your staging environment with 10 requests per second is fine. Production with 10,000 is not.
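If you want to feel the problem firsthand, here's a minimal, self-contained sketch (my own illustration, not part of the tool): a synchronous busy-loop delays an unrelated 10ms timer, exactly the way a blocked request handler delays every other callback.

```javascript
// A single synchronous busy-loop holds the event loop hostage.
function blockFor(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // Spin: nothing else can run on this thread.
  }
}

const scheduled = Date.now();
setTimeout(() => {
  const late = Date.now() - scheduled - 10;
  console.log(`10ms timer fired ~${late}ms late`);
}, 10);

blockFor(200); // the timer above cannot fire until this returns
```

On a typical run the timer reports being late by roughly the blocking duration, because it can only fire after `blockFor` returns.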
Existing Tools and Their Limitations
There are excellent tools for diagnosing event loop issues:
- clinic.js — generates flame graphs, event loop delay charts, and more
- 0x — produces beautiful flame graphs from CPU profiles
- blocked-at — detects blocking and reports the call stack
The catch? They all require you to start your application through them, or add code to your app. In production, that means a redeploy or restart. If the issue is intermittent, it might disappear by the time you restart.
What we need is a tool that can attach to an already-running process.
The Key Insight: SIGUSR1 + V8 Inspector
Node.js has a built-in escape hatch that most developers don't know about. If you send SIGUSR1 to a running Node.js process on Linux or macOS:
kill -SIGUSR1 <pid>
Node.js activates its V8 Inspector on port 9229. No restart. No code changes. The process keeps running normally, but now it's accepting Chrome DevTools Protocol (CDP) connections.
This is the foundation of node-loop-detective.
How node-loop-detective Works
The tool operates in five phases:
Phase 1: Activate the Inspector
// Send SIGUSR1 to activate the inspector
process.kill(targetPid, 'SIGUSR1');
After a brief pause, the target process opens an Inspector WebSocket endpoint. We discover the URL via the standard /json/list HTTP endpoint:
// GET http://127.0.0.1:9229/json/list
// Returns: [{ "webSocketDebuggerUrl": "ws://127.0.0.1:9229/..." }]
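That discovery step can be sketched like this (the function names here are my own, and it assumes Node 18+ for the global `fetch`):

```javascript
// Illustrative sketch, not the tool's API: find the Inspector
// WebSocket URL from the /json/list endpoint.
function pickDebuggerUrl(targets) {
  // /json/list returns one entry per debuggable target.
  const target = targets.find((t) => t.webSocketDebuggerUrl);
  if (!target) throw new Error('No debuggable target found');
  return target.webSocketDebuggerUrl;
}

async function discoverWebSocketUrl(host = '127.0.0.1', port = 9229) {
  const res = await fetch(`http://${host}:${port}/json/list`);
  return pickDebuggerUrl(await res.json());
}
```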
Phase 2: Connect via CDP
We establish a WebSocket connection and speak Chrome DevTools Protocol. This gives us access to the Runtime, Profiler, and other V8 debugging domains — the same ones Chrome DevTools uses.
const ws = new WebSocket(webSocketDebuggerUrl);
// Send CDP commands
ws.send(JSON.stringify({
id: 1,
method: 'Runtime.evaluate',
params: { expression: '1 + 1', returnByValue: true }
}));
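CDP is a request/response protocol over that socket: every request carries an `id`, and the matching response echoes it back. A minimal correlation layer (my own sketch, not the tool's internals) looks like this:

```javascript
// Transport-agnostic sketch: `send` is whatever delivers a JSON frame
// (e.g. ws.send), and onMessage is called for each incoming frame.
class CdpClient {
  constructor(sendRaw) {
    this.sendRaw = sendRaw;
    this.nextId = 1;
    this.pending = new Map(); // id -> { resolve, reject }
  }

  send(method, params = {}) {
    const id = this.nextId++;
    this.sendRaw(JSON.stringify({ id, method, params }));
    return new Promise((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
    });
  }

  onMessage(raw) {
    const msg = JSON.parse(raw);
    const waiter = this.pending.get(msg.id);
    if (!waiter) return; // CDP events carry no id; ignored in this sketch
    this.pending.delete(msg.id);
    if (msg.error) waiter.reject(new Error(msg.error.message));
    else waiter.resolve(msg.result);
  }
}
```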
Phase 3: Inject a Lag Detector
Here's where it gets interesting. We use Runtime.evaluate to inject a tiny piece of code into the target process:
// Injected into the target via Runtime.evaluate. Values shown here
// are illustrative defaults; the tool makes them configurable.
const interval = 50;       // expected gap between ticks (ms)
const threshold = 50;      // report lag above this (ms)
const lags = [];           // bounded buffer of recorded lag events
let lastTime = Date.now();

const timer = setInterval(() => {
  const now = Date.now();
  const lag = now - lastTime - interval;
  if (lag > threshold) {
    // Record the lag event with a stack trace
    // (captureStack is the tool's stack-capture helper)
    lags.push({ lag, timestamp: now, stack: captureStack() });
  }
  lastTime = now;
}, interval);
timer.unref(); // Don't prevent process exit
The principle is simple: setInterval should fire every interval milliseconds. If the actual gap is much larger, the event loop was blocked during that time. The difference is the lag.
The critical addition: when lag is detected, we capture the current JavaScript call stack. This means each lag event comes with the exact code location that was executing when the event loop was blocked.
Design considerations:
- `timer.unref()` ensures our timer doesn't keep the target process alive
- The event buffer is capped at 100 entries to prevent memory leaks
- Everything is cleaned up when we disconnect
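The bounded buffer from the second point can be as simple as this (an illustrative sketch, not the tool's exact code):

```javascript
// Keep at most `cap` events; drop the oldest when the buffer is full.
function pushBounded(buffer, event, cap = 100) {
  buffer.push(event);
  if (buffer.length > cap) buffer.shift();
}
```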
Phase 4: CPU Profiling
Simultaneously, we use the V8 Profiler via CDP to capture a CPU profile:
await cdp.send('Profiler.enable');
await cdp.send('Profiler.setSamplingInterval', { interval: 100 }); // microseconds
await cdp.send('Profiler.start');
// Wait for the profiling duration...
const { profile } = await cdp.send('Profiler.stop');
The V8 CPU Profile is a statistical sampler. Every ~100 microseconds, it records which function is currently executing. Over thousands of samples, this builds an accurate picture of where CPU time is being spent.
The profile data structure contains:
- nodes: a call tree where each node has a function name, file URL, and line number
- samples: an array of node IDs — one per sample tick
- timeDeltas: the time between each sample
Phase 5: Analysis
This is where raw data becomes actionable diagnostics. The analyzer processes the CPU profile to:
1. Calculate self time per function
For each sample, we attribute the time delta to the sampled function. This gives us "self time" — how much CPU time the function consumed directly (not including its children).
for (let i = 0; i < samples.length; i++) {
const nodeId = samples[i];
const delta = timeDeltas[i];
timings.set(nodeId, (timings.get(nodeId) || 0) + delta);
}
2. Rank heavy functions
Sort by self time to find the biggest CPU consumers. Each entry includes the function name, file path, line number, and percentage of total CPU time.
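A sketch of that ranking step, assuming the CDP `Profiler.Profile` shape described above (the function and field names I introduce here are my own):

```javascript
// Turn per-node self-time (from the loop above) into a ranked list.
function rankHotFunctions(profile, timings, topN = 10) {
  const total = [...timings.values()].reduce((a, b) => a + b, 0) || 1;
  const byId = new Map(profile.nodes.map((n) => [n.id, n]));
  return [...timings.entries()]
    .map(([id, selfTime]) => {
      const { callFrame } = byId.get(id);
      return {
        functionName: callFrame.functionName || '(anonymous)',
        url: callFrame.url,
        line: callFrame.lineNumber + 1, // CDP line numbers are 0-based
        selfTime,
        percent: (100 * selfTime) / total,
      };
    })
    .sort((a, b) => b.selfTime - a.selfTime)
    .slice(0, topN);
}
```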
3. Build call stacks
For the top blocking functions, we walk up the call tree to reconstruct the full call chain — from the entry point down to the blocking function. This answers "how did we get here?"
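Profile nodes list their `children` ids, so reconstructing a chain means inverting that into a parent map and walking upward. A sketch (names are mine):

```javascript
// Walk from a node up to the root, then reverse so the entry point
// comes first and the blocking function last.
function buildCallStack(profile, nodeId) {
  const parentOf = new Map();
  const byId = new Map();
  for (const node of profile.nodes) {
    byId.set(node.id, node);
    for (const child of node.children || []) parentOf.set(child, node.id);
  }
  const stack = [];
  for (let id = nodeId; id !== undefined; id = parentOf.get(id)) {
    stack.push(byId.get(id).callFrame.functionName || '(anonymous)');
  }
  return stack.reverse();
}
```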
4. Pattern detection
The analyzer matches against six common blocking patterns:
| Pattern | Detection Logic |
|---|---|
| `cpu-hog` | Single function consuming >50% of CPU |
| `json-heavy` | `JSON.parse`/`stringify` operations consuming >10% |
| `regex-heavy` | RegExp operations consuming >10% |
| `gc-pressure` | Garbage collection consuming >5% |
| `sync-io` | Functions from `node:fs` or names containing "Sync" |
| `crypto-heavy` | Crypto operations consuming >10% |
Each pattern comes with a severity level, a human-readable description, the code location, and a suggested fix.
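As an illustration of one such check, here's how the json-heavy rule from the table could be expressed. This is a sketch with the table's threshold and a simplified name match, not the tool's exact code:

```javascript
// Flag the json-heavy pattern when JSON.parse/stringify self-time
// exceeds 10% of the total profiled time.
function detectJsonHeavy(hotFunctions, totalTime) {
  const jsonTime = hotFunctions
    .filter((f) => /^JSON\.(parse|stringify)$/.test(f.functionName))
    .reduce((sum, f) => sum + f.selfTime, 0);
  const percent = (100 * jsonTime) / totalTime;
  if (percent <= 10) return null;
  return {
    pattern: 'json-heavy',
    severity: 'high',
    description: `JSON operations took ${jsonTime}ms (${percent.toFixed(1)}% of profile)`,
    suggestion: 'Consider streaming JSON parsing or processing smaller payloads',
  };
}
```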
What the Output Looks Like
✔ Connected to Node.js process
Profiling for 10s with 50ms lag threshold...
⚠ Event loop lag: 312ms at 2025-01-15T10:23:45.123Z
→ heavyComputation /app/server.js:42:1
→ handleRequest /app/routes.js:15:5
────────────────────────────────────────────────────────────
Event Loop Detective Report
────────────────────────────────────────────────────────────
Duration: 10023ms
Samples: 4521
Hot funcs: 12
Diagnosis
────────────────────────────────────────────────────────────
HIGH cpu-hog
Function "heavyComputation" consumed 62.3% of CPU time (6245ms)
at /app/server.js:42
→ Consider breaking this into smaller async chunks
or moving to a worker thread
Top CPU-Heavy Functions
────────────────────────────────────────────────────────────
1. heavyComputation
██████████████░░░░░░ 6245ms (62.3%)
/app/server.js:42:1
Call Stacks (top blockers)
────────────────────────────────────────────────────────────
▸ heavyComputation (6245ms)
│ handleConnection /app/server.js:10
│ processRequest /app/middleware.js:25
→ heavyComputation /app/server.js:42
⚠ Event Loop Lag Summary
Events: 3
Max: 312ms
Avg: 198ms
Lag by Code Location:
heavyComputation /app/server.js:42
3 events, total 594ms, max 312ms
✔ Disconnected cleanly
From this single output, you know:
- The event loop was blocked 3 times during the 10-second sample
- `heavyComputation` at `/app/server.js:42` is the culprit
- It consumed 62.3% of CPU time
- The call path is `handleConnection → processRequest → heavyComputation`
- The suggested fix is to break it into async chunks or use Worker Threads
Real-World Patterns I've Seen
The Hidden JSON.parse
A REST API was parsing request bodies manually instead of using a streaming parser. With small payloads, it was fine. When a client started sending 10MB JSON arrays, the event loop blocked for 800ms on every request.
loop-detective output:
HIGH json-heavy
JSON operations took 823ms (41% of profile)
→ Consider streaming JSON parsing or processing smaller payloads
The Regex Time Bomb
A URL validation regex worked perfectly for years — until someone submitted a URL that triggered catastrophic backtracking. The regex engine spent 2 seconds on a single test() call.
HIGH cpu-hog
Function "validateUrl" consumed 89% of CPU time
at /app/validators.js:15
The Forgotten readFileSync
A configuration reload function used readFileSync. It was called once at startup — no problem. Then someone added a feature that reloaded config on every request.
HIGH sync-io
Synchronous I/O detected: readFileSync (1200ms)
→ Replace synchronous file operations with async alternatives
Design Decisions
Why not use perf or eBPF?
They're powerful but platform-specific, require kernel support, and don't give you JavaScript-level function names and line numbers out of the box. The V8 Inspector approach works on any platform where Node.js runs and gives us rich JS-level diagnostics.
Why inject code instead of just profiling?
The CPU profile tells you which functions are heavy, but it doesn't directly tell you "the event loop was blocked for 312ms at this timestamp." The injected lag detector provides real-time, timestamped lag events with call stacks. Combined with the CPU profile, you get both the "what" and the "when."
Why not keep the Inspector open permanently?
The Inspector adds a small overhead and, more importantly, opens a debugging port. In production, you want to minimize the attack surface. loop-detective activates the Inspector, does its work, and disconnects — leaving the process in its original state.
Limitations
- Windows: SIGUSR1 is not available. The target must be started with `--inspect`.
- Sub-millisecond blocking: The V8 sampler may miss very brief blocking events.
- I/O bottlenecks: If the slowness is caused by slow database queries or network I/O (not CPU), the event loop isn't technically blocked; it's just waiting. This tool focuses on CPU-bound blocking.
- No flame graphs: The output is text-based. For visual flame graphs, use the `--json` output and feed it to a visualization tool.
Getting Started
npm install -g node-loop-detective
# Profile a running process for 10 seconds
loop-detective <pid>
# Longer profiling with lower threshold
loop-detective <pid> -d 30 -t 20
# Continuous monitoring
loop-detective <pid> --watch
# JSON output for automation
loop-detective <pid> --json
The source is on GitHub: github.com/iwtxokhtd83/node-loop-detective
Conclusion
Event loop blocking is one of those problems that's easy to understand but hard to diagnose in production. The traditional approach — restart with profiling tools attached — is disruptive and may not even reproduce the issue.
By leveraging Node.js's built-in Inspector protocol and SIGUSR1 signal, we can attach to a running process, profile it, and get actionable diagnostics without any disruption. The combination of real-time lag detection (with call stacks) and CPU profile analysis (with pattern matching) gives you both the symptoms and the root cause.
Next time your Node.js service is struggling and you need answers fast, try:
loop-detective $(pgrep -f "node app.js")
You might be surprised how quickly you find the culprit.