DEV Community

Bill Tu

Beyond the Event Loop: Tracking Slow I/O in Production Node.js

Your Node.js service is slow. You run a profiler. CPU usage is low. The event loop isn't blocked. Everything looks healthy.

But users are still waiting 3 seconds for a response.

This is the blind spot that most Node.js diagnostic tools have. They focus on what's blocking the event loop — CPU-bound work, synchronous I/O, regex backtracking. But in modern Node.js applications, the most common performance killer isn't CPU at all. It's waiting. Waiting for a database query. Waiting for an upstream API. Waiting for DNS resolution.

The event loop is free. It's just got nothing to do until the network comes back.

The Async I/O Blind Spot

Node.js is designed around non-blocking I/O. When your code does await fetch('https://api.example.com/users'), the event loop doesn't block. It moves on to handle other requests. When the response arrives, the callback fires.

This is great for concurrency. But it creates a diagnostic problem: traditional profiling tools see an idle event loop and report "everything is fine." Meanwhile, your P99 latency is through the roof because that upstream API is taking 2 seconds to respond.

CPU profilers can't see this. Event loop lag detectors can't see this. The time is spent waiting for the kernel to deliver network data, not executing JavaScript.

What If We Could See the Waiting?

This is the problem we set out to solve in node-loop-detective v1.1.0. The tool already detected event loop blocking via CPU profiling. But we kept hearing the same feedback: "The report says healthy, but my app is slow."

The answer was almost always slow I/O. A database connection taking 1.5 seconds to establish. An internal API responding in 800ms instead of 50ms. DNS resolution hanging for 2 seconds in a container with misconfigured resolv.conf.

So we added async I/O tracking. Here's how it works.

The Approach: Runtime Monkey-Patching via CDP

node-loop-detective already connects to a running Node.js process via the V8 Inspector Protocol (Chrome DevTools Protocol). It uses Runtime.evaluate to inject diagnostic code into the target process. We extended this mechanism to patch the core I/O modules.

The injected code monkey-patches three layers of the Node.js networking stack:

Layer 1: HTTP/HTTPS Requests

We wrap http.request and https.request — the foundation of virtually all outgoing HTTP traffic in Node.js. Whether your code uses axios, got, node-fetch, or the built-in http module, it all flows through these functions.

const origRequest = http.request;

http.request = function patchedRequest(...args) {
  const startTime = Date.now();
  const opts = typeof args[0] === 'string' ? { href: args[0] } : args[0];
  const target = opts.href || opts.hostname || opts.host;
  const method = (opts.method || 'GET').toUpperCase();

  // Capture the caller's stack trace now — it's gone by the time the response arrives
  const callerStack = new Error().stack.split('\n').slice(2, 6);

  const req = origRequest.apply(this, args);

  // `threshold` and `record` are provided by the injected diagnostic harness
  req.on('response', (res) => {
    const duration = Date.now() - startTime;
    if (duration >= threshold) {
      record({ type: 'http', method, target, statusCode: res.statusCode, duration, stack: callerStack });
    }
  });

  req.on('error', (err) => {
    const duration = Date.now() - startTime;
    if (duration >= threshold) {
      record({ type: 'http', method, target, error: err.message, duration, stack: callerStack });
    }
  });

  return req;
};

For each request, we capture:

  • HTTP method and target URL
  • Response status code (or error message if it failed)
  • Total duration from request initiation to first response byte
  • The caller's stack trace — so you know which code made the call

Layer 2: DNS Lookups

DNS resolution is the silent latency tax. Every new hostname your app connects to requires a DNS lookup. In most environments this takes 1-5ms. But in misconfigured containers, overloaded DNS servers, or when hitting rate limits, it can take seconds.

We patch dns.lookup, which is the function Node.js uses internally for all hostname resolution (including inside http.request):

const origLookup = dns.lookup;

dns.lookup = function patchedLookup(hostname, options, callback) {
  // options is optional: dns.lookup(hostname, callback) is also valid
  if (typeof options === 'function') {
    callback = options;
    options = {};
  }
  const startTime = Date.now();
  const callerStack = new Error().stack.split('\n').slice(2, 6);

  return origLookup.call(dns, hostname, options, function (err, address, family) {
    const duration = Date.now() - startTime;
    if (duration >= threshold) {
      record({ type: 'dns', target: hostname, duration, error: err?.message, stack: callerStack });
    }
    callback(err, address, family);
  });
};

Layer 3: TCP Connections

This is the broadest net we cast. Almost every database driver — MySQL, PostgreSQL, MongoDB, Redis — communicates over TCP sockets. By patching net.Socket.prototype.connect, we catch slow connections at the transport layer regardless of which database client library is being used.

const origConnect = net.Socket.prototype.connect;

net.Socket.prototype.connect = function patchedConnect(...args) {
  const startTime = Date.now();
  const opts = typeof args[0] === 'object' ? args[0] : { port: args[0], host: args[1] };
  const target = `${opts.host || '127.0.0.1'}:${opts.port}`;
  const callerStack = new Error().stack.split('\n').slice(2, 6);

  this.once('connect', () => {
    const duration = Date.now() - startTime;
    if (duration >= threshold) {
      record({ type: 'tcp', target, duration, stack: callerStack });
    }
  });

  return origConnect.apply(this, args);
};

This catches:

  • Slow MySQL/PostgreSQL connections (host:3306, host:5432)
  • Slow Redis connections (host:6379)
  • Slow MongoDB connections (host:27017)
  • Any other TCP-based service

What the Output Looks Like

During profiling, slow I/O events are printed in real time:

🌐 Slow HTTP: 2340ms GET api.example.com/users → 200
    at processRequest (/app/handlers.js:45:12)
    at handleRoute (/app/router.js:23:8)

🔌 Slow TCP: 1520ms db-server:3306
    at createConnection (/app/db.js:12:5)

🔍 Slow DNS: 890ms lookup internal-service.local
    at Agent.createConnection (/app/node_modules/http/agent.js:321:5)

At the end of the profiling session, the report includes an aggregated summary:

⚠ Slow Async I/O Summary
  Total slow ops: 8

  🌐 HTTP — 5 slow ops, avg 1800ms, max 2340ms
    GET api.example.com/users
      3 calls, total 5400ms, avg 1800ms, max 2340ms
      at processRequest (/app/handlers.js:45:12)
    POST api.example.com/orders
      2 calls, total 2400ms, avg 1200ms, max 1500ms
      at createOrder (/app/orders.js:78:5)

  🔌 TCP — 2 slow ops, avg 1200ms, max 1520ms
    db-server:3306
      2 calls, total 2400ms, avg 1200ms, max 1520ms
      at createConnection (/app/db.js:12:5)

  🔍 DNS — 1 slow ops, avg 890ms, max 890ms
    internal-service.local
      1 calls, total 890ms, max 890ms

Events are grouped by type, then by target, sorted by total duration. Each group shows call count, total/average/max duration, error count, and the originating stack trace.

Design Decisions

Why monkey-patching instead of async_hooks?

Node.js has async_hooks, which can track the lifecycle of every async operation. We considered it, but there are practical problems:

  1. async_hooks has significant performance overhead when tracking all operations
  2. It gives you async operation IDs but not the semantic meaning (you know "async operation #4521 took 2 seconds" but not "it was an HTTP GET to api.example.com")
  3. It doesn't capture the target URL, hostname, or port without additional patching anyway

Monkey-patching the specific modules gives us rich, semantic data (HTTP method, URL, status code, hostname, port) with minimal overhead since we only record operations that exceed the threshold.

Why capture stack traces at call time?

When a slow HTTP request completes 2 seconds later, the original call stack is long gone. The operation is asynchronous: by the time the response arrives, the stack frames that initiated the request have already unwound.

That's why we capture the stack trace at the moment the request is made, not when it completes. We store it alongside the start timestamp, and attach it to the event if the operation turns out to be slow.

This is a small overhead per I/O operation (capturing a stack trace costs ~0.1ms), but it's only stored in memory temporarily and only reported if the operation exceeds the threshold.

Why a threshold instead of tracking everything?

Recording every HTTP request and TCP connection would generate enormous amounts of data and add measurable overhead. The threshold (default 500ms, configurable via --io-threshold) ensures we only capture operations that are actually problematic.

In practice, if an HTTP request completes in 50ms, you don't care about it. You care about the one that took 2.3 seconds.

Buffer limits

The slow I/O event buffer is capped at 200 entries. If the target process is making thousands of slow requests, we keep the most recent 200. This prevents memory growth in the target process while still providing a representative sample.

Real-World Scenarios

The Connection Pool Exhaustion

A Node.js API was intermittently slow. CPU was fine. Event loop was fine. loop-detective's I/O tracking revealed:

🔌 TCP — 47 slow ops, avg 4200ms, max 10003ms
  db-server:5432
    47 calls, total 197400ms, avg 4200ms, max 10003ms

47 slow TCP connections to PostgreSQL in a 30-second window, some timing out at 10 seconds. The connection pool was exhausted, and new queries were waiting for a connection to become available. The fix: increase the pool size and add a query timeout.

The Forgotten Internal API

A microservice was calling an internal API on every request. The API was supposed to respond in 20ms. After a deployment of the upstream service, it started responding in 800ms. No alerts fired because 800ms wasn't considered an "error."

🌐 HTTP — 312 slow ops, avg 823ms, max 1205ms
  GET internal-auth.service:8080/verify
    312 calls, total 256776ms, avg 823ms, max 1205ms
    at verifyToken (/app/middleware/auth.js:15:22)

312 slow calls in 10 seconds. Every single request to the service was paying an 800ms tax for token verification. The fix: add response caching for the auth check.

The DNS Surprise

A containerized Node.js app had random latency spikes. The I/O tracker caught it:

🔍 DNS — 23 slow ops, avg 2100ms, max 5002ms
  payments-api.internal
    23 calls, total 48300ms, avg 2100ms, max 5002ms

DNS resolution for an internal hostname was taking 2-5 seconds. The container's /etc/resolv.conf was pointing to an overloaded DNS server. The fix: configure ndots:1 and use a local DNS cache.

Combining CPU and I/O Diagnostics

The real power comes from seeing both dimensions at once. When you run loop-detective against a slow process, you get:

  1. CPU profile — which functions are consuming CPU time
  2. Event loop lag — when and how long the event loop was blocked
  3. Slow I/O — which network operations are causing latency

If the CPU profile shows a cpu-hog pattern, your problem is CPU-bound. If the event loop is healthy but slow I/O shows 2-second database connections, your problem is I/O-bound. Often it's a combination — a CPU-heavy JSON parse triggered by a large response from a slow API.

# See everything at once
loop-detective <pid> -d 30 --io-threshold 200

# More sensitive I/O detection (use --no-io to disable I/O tracking entirely)
loop-detective <pid> -d 30 --io-threshold 100

Limitations and Future Work

This approach has known limitations:

  • fetch() is invisible. Node.js 18+ has a global fetch() backed by undici, which doesn't use http.request internally. We plan to add undici/fetch patching in a future release.
  • dns.promises.lookup is not patched. Only the callback-style dns.lookup is currently tracked.
  • Patches are not restored on cleanup. After loop-detective disconnects, the monkey-patches remain in the target process. They become effectively no-ops (the threshold filtering still runs, but the buffer is gone), but the original functions are not restored. This is tracked as issue #1.
  • Only measures time-to-first-byte for HTTP. The duration is from request initiation to the response event, not including body download time.

Try It

npm install -g node-loop-detective

# Profile with I/O tracking (enabled by default)
loop-detective <pid>

# Lower the I/O threshold for more sensitive detection
loop-detective <pid> --io-threshold 200

# Continuous monitoring
loop-detective <pid> --watch --io-threshold 500

# JSON output for integration with monitoring tools
loop-detective <pid> --json | jq '.slowIOEvents'

The source is on GitHub: github.com/iwtxokhtd83/node-loop-detective

Conclusion

Event loop blocking is only half the story. In production Node.js applications, slow async I/O — database queries, API calls, DNS lookups — is often the real bottleneck. But because these operations don't block the event loop, traditional profiling tools miss them entirely.

By monkey-patching the core networking modules at runtime via the V8 Inspector Protocol, we can track these operations with rich context: what was called, where it was called from, how long it took, and whether it failed. Combined with CPU profiling and event loop lag detection, this gives you a complete picture of why your Node.js application is slow.

The next time your profiler says "everything is fine" but your users disagree, check the I/O.
