Paradane

Posted on Jun 15

Debugging a Production Memory Leak: A Step-by-Step Walkthrough

#javascript #node #debugging #performance

The 3 AM Alert

It started with a PagerDuty alert at 3 AM. Our client's Node.js API server had been restarted by the process manager — again. This was the third OOM (Out of Memory) kill this week, and the team was running out of patience.

The application was an Express.js API handling about 2,000 requests per minute. Nothing crazy. But every 6-8 hours, memory usage would climb past the 512MB container limit, and Docker would unceremoniously kill the process.

Here's exactly how we found the leak, fixed it, and what we learned along the way.

Step 1: Confirm It's Actually a Leak

First rule of memory debugging: make sure you're not just under-provisioned. A growing heap doesn't always mean a leak — it could be legitimate cache growth or increased traffic.

We checked the metrics:

# Point-in-time container memory; we sampled this repeatedly over 24 hours
docker stats --no-stream

The pattern was unmistakable: steady, linear growth with no plateau. A healthy application's memory should level off after warm-up. Ours never did.

// Quick and dirty heap size check we added to a health endpoint
app.get('/health/debug', (req, res) => {
  const used = process.memoryUsage();
  res.json({
    heapUsed: `${Math.round(used.heapUsed / 1024 / 1024)} MB`,
    heapTotal: `${Math.round(used.heapTotal / 1024 / 1024)} MB`,
    external: `${Math.round(used.external / 1024 / 1024)} MB`,
    rss: `${Math.round(used.rss / 1024 / 1024)} MB`,
  });
});

Hitting this endpoint every 30 seconds confirmed it: heapUsed was climbing roughly 50MB per hour, and garbage collection never brought it back down.
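To put a number on "climbing" rather than eyeballing the endpoint, the samples can be fitted with a straight line. The following is a minimal sketch, not what we ran in production; the function name, the sample shape, and the 4-hour window are assumptions made for illustration.

```javascript
// Least-squares slope of heapUsed over time, in MB per hour.
// `samples` is an array of { t: epoch milliseconds, heapUsed: bytes }.
function heapGrowthMbPerHour(samples) {
  if (samples.length < 2) return 0;
  const n = samples.length;
  const xs = samples.map((s) => s.t / 3_600_000);          // hours
  const ys = samples.map((s) => s.heapUsed / 1024 / 1024); // MB
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (xs[i] - meanX) * (ys[i] - meanY);
    den += (xs[i] - meanX) ** 2;
  }
  return den === 0 ? 0 : num / den;
}

// Collect a sample every 30 seconds and keep the last 4 hours (480 samples).
const heapSamples = [];
const samplingTimer = setInterval(() => {
  heapSamples.push({ t: Date.now(), heapUsed: process.memoryUsage().heapUsed });
  if (heapSamples.length > 480) heapSamples.shift();
}, 30_000);
samplingTimer.unref(); // do not keep the process alive just for sampling
```

A slope that stays clearly positive across several hours of steady traffic points to a leak; a slope that drifts toward zero after warm-up suggests a cache filling up.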

Step 2: Take a Heap Snapshot

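Node.js can write heap snapshots through the built-in v8 module, and Chrome DevTools can load the resulting files. We added a snapshot endpoint (behind auth, obviously: v8.writeHeapSnapshot() is synchronous, so the server stops handling requests while the file is written):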

const v8 = require('v8');

// writeHeapSnapshot() is synchronous and returns the name of the file it wrote.
app.get('/admin/heap-snapshot', (req, res) => {
  const filename = v8.writeHeapSnapshot();
  res.download(filename);
});

We took two snapshots 30 minutes apart, downloaded them, and loaded them into Chrome DevTools (Memory tab → Load).
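If exposing an HTTP endpoint for this feels risky, a signal handler is one alternative. The sketch below is an assumption-laden variant, not the code we shipped: the helper names, the file naming scheme, and the choice of SIGUSR2 are all illustrative.

```javascript
const v8 = require('v8');
const path = require('path');
const os = require('os');

// Write a heap snapshot into `dir` with a timestamped name and return its path.
// v8.writeHeapSnapshot() is synchronous: the event loop is blocked while it runs.
function takeHeapSnapshot(dir = os.tmpdir()) {
  const stamp = new Date().toISOString().replace(/[:.]/g, '-');
  const file = path.join(dir, `heap-${process.pid}-${stamp}.heapsnapshot`);
  return v8.writeHeapSnapshot(file);
}

// Hypothetical wiring: trigger a snapshot with `kill -USR2 <pid>` instead of HTTP.
function installSnapshotSignal(signal = 'SIGUSR2') {
  process.on(signal, () => {
    const file = takeHeapSnapshot();
    console.log(`heap snapshot written to ${file}`);
  });
}
```

Calling installSnapshotSignal() once at startup is enough. Recent Node versions also accept a --heapsnapshot-signal=SIGUSR2 command-line flag that does much the same thing without any code. Either way, leave headroom: writing a snapshot temporarily needs a lot of extra memory, which matters in a container that is already close to its limit.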

Step 3: Compare Snapshots in Chrome DevTools

This is where the real debugging happens. In DevTools:

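1. Load the first snapshot
2. Load the second snapshot
3. Select the second snapshot and switch to "Comparison" view, with the first snapshot as the baseline
4. Sort by "# Delta" (net change in object count) or "Size Delta"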

The comparison view shows exactly what objects were allocated between the two snapshots and — crucially — what was NOT garbage collected.

Immediately, one thing jumped out: (closure) objects were dominating the delta. Thousands of closures, each holding references to request-scoped data.

Step 4: Trace the Closures

We drilled into the closure objects and followed the retaining paths. The DevTools "Retainers" view shows what's keeping each object alive. For our closures, the chain looked like:

closure → context → array → EventEmitter._events → ... → global

The EventEmitter._events was the smoking gun. Something was attaching event listeners that never got removed.

We searched the codebase for .on( and .addListener( calls. Found the culprit in our request logging middleware:

// THE BUG: middleware that attached a listener but never cleaned up
app.use((req, res, next) => {
  const requestId = uuid.v4();

  // This listener is attached per-request but never removed
  process.on('uncaughtException', (err) => {
    logger.error({ requestId, error: err.message });
  });

  res.on('finish', () => {
    logger.info({ requestId, status: res.statusCode });
  });

  next();
});

Every single request added a new uncaughtException listener to the global process object. Those listeners were closures capturing requestId and logger. After 100,000 requests, that's 100,000 listeners — each holding a closure with references that the GC couldn't touch because process is a global that never goes out of scope.
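The growth is easy to reproduce in isolation. Here is a minimal sketch that uses a plain EventEmitter as a stand-in for process, so it does not interfere with real uncaughtException handling; fakeProcess, leakyMiddleware, and simulateRequests are names invented for the demo.

```javascript
const { EventEmitter } = require('events');

// Stand-in for the global `process` object.
const fakeProcess = new EventEmitter();
fakeProcess.setMaxListeners(0); // silence the listener-limit warning, demo only

// Same shape as the buggy middleware: one new listener per request,
// each closing over that request's id.
function leakyMiddleware(requestId) {
  fakeProcess.on('uncaughtException', (err) => {
    console.error({ requestId, error: err.message });
  });
}

function simulateRequests(count) {
  for (let i = 0; i < count; i++) leakyMiddleware(`req-${i}`);
  return fakeProcess.listenerCount('uncaughtException');
}

console.log(simulateRequests(1000)); // listener count equals request count
```

On a real emitter with default settings, Node prints a MaxListenersExceededWarning to stderr once a single event has more than 10 listeners, so this kind of leak often announces itself in the logs long before the OOM kill.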

Step 5: The Fix

The fix was straightforward once we understood the problem:

// FIXED: no per-request listeners on process
app.use((req, res, next) => {
  const requestId = uuid.v4();

  // Option A: If you genuinely need per-request handling, remove the listener
  // when the response finishes. process.once() alone is not enough: it only
  // removes the listener after the event fires, and most requests never crash.
  // (You probably don't need this; uncaughtException handlers should be global.)

  // Option B: Move the handler to app startup, pass context differently
  // This is what we actually did:

  res.on('finish', () => {
    logger.info({ requestId, status: res.statusCode });
  });

  next();
});

// Single global handler at startup — no per-request listeners
process.on('uncaughtException', (err) => {
  logger.error({ error: err.message, stack: err.stack });
});
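For completeness, here is roughly what Option A would look like if a per-request listener on a long-lived emitter were really needed. This is a sketch under assumptions, not code from the client's application: listenForRequest is a hypothetical helper, and the demo uses plain emitters in place of process and res.

```javascript
const { EventEmitter } = require('events');

// Attach `listener` to a long-lived emitter for the lifetime of one response.
// `res` only needs to emit 'finish' and 'close', as http.ServerResponse does.
function listenForRequest(emitter, event, listener, res) {
  emitter.on(event, listener);
  const cleanup = () => {
    emitter.removeListener(event, listener);
    res.removeListener('finish', cleanup);
    res.removeListener('close', cleanup);
  };
  res.once('finish', cleanup);
  res.once('close', cleanup); // also covers connections that end early
}

// Demo with plain emitters standing in for `process` and `res`.
const longLived = new EventEmitter();
const fakeRes = new EventEmitter();
listenForRequest(longLived, 'uncaughtException', () => {}, fakeRes);
const during = longLived.listenerCount('uncaughtException');
fakeRes.emit('finish');
const after = longLived.listenerCount('uncaughtException');
console.log({ during, after });
```

The point of the pairing is that every on() has a matching removeListener() tied to an event that is certain to fire for each request.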

We also added a safeguard: a periodic check that warns if the listener count grows unexpectedly:

setInterval(() => {
  const listenerCount = process.listenerCount('uncaughtException');
  if (listenerCount > 5) {
    logger.warn({ listenerCount }, 'Possible listener leak detected');
  }
}, 60000);
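The same check can be widened to every event on an emitter instead of one hard-coded name. The helper below is a hypothetical generalization of the safeguard above; its name and the default threshold of 5 are assumptions.

```javascript
const { EventEmitter } = require('events');

// Return [{ event, count }] for every event on `emitter` that has more than
// `threshold` listeners, largest first.
function findCrowdedEvents(emitter, threshold = 5) {
  return emitter
    .eventNames()
    .map((event) => ({ event, count: emitter.listenerCount(event) }))
    .filter(({ count }) => count > threshold)
    .sort((a, b) => b.count - a.count);
}

// Example: inspect the real process object. On a healthy process this is
// normally an empty array.
console.log(findCrowdedEvents(process));
```

Because it takes any EventEmitter, the same function can watch other long-lived objects such as database clients or message-queue connections.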

Step 6: Verify the Fix

After deploying, we watched the heap for 48 hours:

Before fix:  ~80MB → ~480MB over 8 hours (OOM kill)
After fix:   ~80MB → ~120MB over 48 hours (stable, GC working)

The heap stabilized around 120MB with normal GC sawtooth patterns. No more 3 AM alerts.

What We Learned

Heap snapshots are your best friend. The comparison view in Chrome DevTools makes leaks obvious. Don't guess — take snapshots.
Event listeners on global objects are dangerous. process, global, and long-lived EventEmitter instances will keep your closures alive for as long as the listener stays registered. Always clean up with removeListener; once() only helps when the event is certain to fire.
Per-request resource allocation needs per-request cleanup. If you create something during a request, make sure it dies with the request. The res.on('finish') pattern is fine — res gets garbage collected. process does not.
Add listener count monitoring. A simple process.listenerCount('uncaughtException') check in your health endpoint can catch leaks before they take down production.

At Paradane, we've debugged memory leaks across dozens of client applications — Node.js, Python, PHP, you name it. The pattern is almost always the same: something global holding references to things that should be ephemeral. If you're dealing with a production memory leak and need a second pair of eyes, our custom software development team has seen it all.

Tools We Use for Memory Profiling

Chrome DevTools (for Node.js heap snapshots): free, and it attaches to Node's built-in inspector via --inspect
clinic.js: excellent for spotting event loop delays and memory issues
pm2 with --max-memory-restart: buys you time while you debug
Prometheus + Grafana: for long-term memory trend visualization

Have you debugged a production memory leak? What tools worked for you? Drop your war stories in the comments.

DEV Community