The 3 AM Alert
It started with a PagerDuty alert at 3 AM. Our client's Node.js API server had been restarted by the process manager — again. This was the third OOM (Out of Memory) kill this week, and the team was running out of patience.
The application was an Express.js API handling about 2,000 requests per minute. Nothing crazy. But every 6-8 hours, memory usage would climb past the 512MB container limit, and Docker would unceremoniously kill the process.
Here's exactly how we found the leak, fixed it, and what we learned along the way.
Step 1: Confirm It's Actually a Leak
First rule of memory debugging: make sure you're not just under-provisioned. A growing heap doesn't always mean a leak — it could be legitimate cache growth or increased traffic.
We checked the metrics:
# One-off sample of container memory; we repeated it over 24 hours to see the trend
docker stats --no-stream
The pattern was unmistakable: steady, linear growth with no plateau. A healthy application's memory should level off after warm-up. Ours never did.
// Quick and dirty heap size check we added to a health endpoint
app.get('/health/debug', (req, res) => {
  const used = process.memoryUsage();
  res.json({
    heapUsed: `${Math.round(used.heapUsed / 1024 / 1024)} MB`,
    heapTotal: `${Math.round(used.heapTotal / 1024 / 1024)} MB`,
    external: `${Math.round(used.external / 1024 / 1024)} MB`,
    rss: `${Math.round(used.rss / 1024 / 1024)} MB`,
  });
});
Hitting this endpoint every 30 seconds confirmed it: heapUsed was climbing ~50MB per hour, in line with the roughly 400MB of growth we saw over each 8-hour cycle, and garbage collection never brought it back down.
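Eyeballing the numbers works, but the growth rate can also be estimated from the samples. The following is a minimal sketch, not what we ran in production: heapGrowthMbPerHour and startHeapSampler are hypothetical helper names, and the 30-second interval and 2-hour window are assumptions chosen to match the polling described above.

```javascript
// Hypothetical helper: estimate heap growth in MB per hour from periodic
// samples, using the slope of an ordinary least-squares fit.
// samples: [{ t: epochMillis, heapUsed: bytes }, ...]
function heapGrowthMbPerHour(samples) {
  if (samples.length < 2) return 0;
  const n = samples.length;
  const meanT = samples.reduce((sum, p) => sum + p.t, 0) / n;
  const meanH = samples.reduce((sum, p) => sum + p.heapUsed, 0) / n;
  let num = 0;
  let den = 0;
  for (const p of samples) {
    num += (p.t - meanT) * (p.heapUsed - meanH);
    den += (p.t - meanT) ** 2;
  }
  if (den === 0) return 0; // all samples share one timestamp
  const bytesPerMs = num / den;
  return (bytesPerMs * 3600 * 1000) / (1024 * 1024);
}

// Hypothetical sampler: record heapUsed every intervalMs, keeping only the
// most recent maxSamples entries (240 samples at 30s is about 2 hours).
function startHeapSampler(intervalMs = 30000, maxSamples = 240) {
  const samples = [];
  const timer = setInterval(() => {
    samples.push({ t: Date.now(), heapUsed: process.memoryUsage().heapUsed });
    if (samples.length > maxSamples) samples.shift();
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for sampling
  return { samples, stop: () => clearInterval(timer) };
}
```

A slope that stays clearly positive across several hours of samples points to a leak; a slope that settles near zero after warm-up points to ordinary cache growth.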
Step 2: Take a Heap Snapshot
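Node.js has built-in heap snapshot support via the v8 module, and Chrome DevTools can read the resulting files. We added a snapshot endpoint (behind auth, obviously, which is omitted from the snippet here; snapshots are expensive because writing one blocks the event loop and can need roughly twice the current heap in memory):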
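const v8 = require('v8');
const fs = require('fs');
app.get('/admin/heap-snapshot', (req, res) => {
  // writeHeapSnapshot() is synchronous: it blocks the event loop until the
  // snapshot is on disk, then returns the generated filename
  const filename = v8.writeHeapSnapshot();
  // Remove the file after the download attempt so snapshots don't fill the disk
  res.download(filename, () => fs.unlink(filename, () => {}));
});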
We took two snapshots 30 minutes apart, downloaded them, and loaded them into Chrome DevTools (Memory tab → Load).
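If hitting an HTTP endpoint twice is inconvenient, the two snapshots can also be scheduled from inside the process. This is a sketch under assumptions: takeLabeledSnapshot is a hypothetical helper, and writing into the OS temp directory is just a default chosen for illustration.

```javascript
const v8 = require('v8');
const os = require('os');
const path = require('path');

// Hypothetical helper: write a heap snapshot with a readable label in its name.
// v8.writeHeapSnapshot() is synchronous and blocks the event loop while it runs.
function takeLabeledSnapshot(label, dir = os.tmpdir()) {
  // Make the timestamp filename-safe by replacing ':' and '.'
  const stamp = new Date().toISOString().replace(/[:.]/g, '-');
  const target = path.join(dir, `${label}-${stamp}.heapsnapshot`);
  return v8.writeHeapSnapshot(target); // returns the filename that was written
}
```

Calling takeLabeledSnapshot('baseline') right away and takeLabeledSnapshot('after-30m') from a setTimeout 30 minutes later would produce the same pair of files. Keep the memory overhead in mind on a container that is already close to its limit.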
Step 3: Compare Snapshots in Chrome DevTools
This is where the real debugging happens. In DevTools:
- Load the first snapshot
- Load the second snapshot
- Switch to "Comparison" view
- Sort by "# Delta" (net change in object count) or "Size Delta"
The comparison view shows exactly what objects were allocated between the two snapshots and — crucially — what was NOT garbage collected.
Immediately, one thing jumped out: (closure) objects were dominating the delta. Thousands of closures, each holding references to request-scoped data.
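The same kind of comparison can be roughly approximated in a script, which helps when the snapshots live on a server without a browser. The sketch below assumes the .heapsnapshot JSON layout that V8 currently writes (snapshot.meta.node_fields, a flat nodes array, and a strings table); that layout is not a stable API, and summarize, diffSnapshots, and loadSnapshot are hypothetical names. It only reports net growth per type and name, so it is coarser than the DevTools view.

```javascript
const fs = require('fs');

// Count objects and sum self sizes per "type:name" in a parsed snapshot.
// Assumes the V8 heap snapshot JSON layout; treat it as subject to change.
function summarize(snapshot) {
  const { node_fields: fields, node_types: types } = snapshot.snapshot.meta;
  const stride = fields.length; // each node occupies `stride` slots in nodes[]
  const typeIdx = fields.indexOf('type');
  const nameIdx = fields.indexOf('name');
  const sizeIdx = fields.indexOf('self_size');
  const summary = new Map();
  for (let i = 0; i < snapshot.nodes.length; i += stride) {
    const type = types[typeIdx][snapshot.nodes[i + typeIdx]];
    const name = snapshot.strings[snapshot.nodes[i + nameIdx]];
    const key = `${type}:${name}`;
    const entry = summary.get(key) || { count: 0, size: 0 };
    entry.count += 1;
    entry.size += snapshot.nodes[i + sizeIdx];
    summary.set(key, entry);
  }
  return summary;
}

// Return the keys that grew between two snapshots, largest size delta first.
function diffSnapshots(before, after) {
  const a = summarize(before);
  const b = summarize(after);
  const rows = [];
  for (const [key, cur] of b) {
    const prev = a.get(key) || { count: 0, size: 0 };
    const countDelta = cur.count - prev.count;
    const sizeDelta = cur.size - prev.size;
    if (countDelta > 0 || sizeDelta > 0) rows.push({ key, countDelta, sizeDelta });
  }
  return rows.sort((x, y) => y.sizeDelta - x.sizeDelta);
}

// Small snapshots only: very large files may not fit in a single string.
function loadSnapshot(file) {
  return JSON.parse(fs.readFileSync(file, 'utf8'));
}
```

With two files on disk, diffSnapshots(loadSnapshot(first), loadSnapshot(second)).slice(0, 20) gives a quick list of what grew. A leak like ours would be expected to show closure entries near the top.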
Step 4: Trace the Closures
We drilled into the closure objects and followed the retaining paths. The DevTools "Retainers" view shows what's keeping each object alive. For our closures, the chain looked like:
closure → context → array → EventEmitter._events → ... → global
The EventEmitter._events was the smoking gun. Something was attaching event listeners that never got removed.
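A runtime check can confirm which event is accumulating listeners before any grepping starts. The helper below is a hypothetical sketch (listenerReport is not part of Node); it relies only on the standard EventEmitter methods eventNames() and listenerCount().

```javascript
// Hypothetical helper: report how many listeners each event has on an
// emitter, busiest first. It also works for process, which is an EventEmitter.
function listenerReport(emitter) {
  return emitter
    .eventNames()
    .map((event) => ({ event, count: emitter.listenerCount(event) }))
    .sort((a, b) => b.count - a.count);
}
```

Logging listenerReport(process) from a debug endpoint a few minutes apart would show one event whose count keeps rising with traffic, if a listener leak is the cause.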
We searched the codebase for .on( and .addListener( calls. Found the culprit in our request logging middleware:
// THE BUG: middleware that attached a listener but never cleaned up
app.use((req, res, next) => {
  const requestId = uuid.v4();
  // This listener is attached per-request but never removed
  process.on('uncaughtException', (err) => {
    logger.error({ requestId, error: err.message });
  });
  res.on('finish', () => {
    logger.info({ requestId, status: res.statusCode });
  });
  next();
});
Every single request added a new uncaughtException listener to the global process object. Those listeners were closures capturing requestId and logger. After 100,000 requests, that's 100,000 listeners — each holding a closure with references that the GC couldn't touch because process is a global that never goes out of scope.
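The effect is easy to reproduce in isolation. This is a minimal sketch, assuming a plain EventEmitter as a stand-in for the global process object; handleRequestLeaky is a hypothetical function that mimics the buggy middleware, not our actual code.

```javascript
const { EventEmitter } = require('events');

// Stand-in for the global process object: it lives as long as the program.
const longLived = new EventEmitter();
longLived.setMaxListeners(0); // silence MaxListenersExceededWarning for this demo

// Mimics the buggy middleware: one new listener per simulated request,
// each closing over that request's id.
function handleRequestLeaky(requestId) {
  longLived.on('uncaughtException', (err) => {
    console.error({ requestId, error: err.message });
  });
}

for (let i = 0; i < 1000; i++) {
  handleRequestLeaky(`req-${i}`);
}

const leakedListeners = longLived.listenerCount('uncaughtException');
console.log(leakedListeners); // prints 1000: one listener retained per request
```

By default Node prints a MaxListenersExceededWarning once more than 10 listeners are registered for a single event on an emitter, so searching the logs for that warning is a cheap early check.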
Step 5: The Fix
The fix was straightforward once we understood the problem:
// FIXED: use once() or manage listener lifecycle
app.use((req, res, next) => {
  const requestId = uuid.v4();
  // Option A: Use process.once() if you genuinely need per-request handling
  // (but you probably don't — uncaughtException handlers should be global)
  // Option B: Move the handler to app startup, pass context differently
  // This is what we actually did:
  res.on('finish', () => {
    logger.info({ requestId, status: res.statusCode });
  });
  next();
});
// Single global handler at startup — no per-request listeners
process.on('uncaughtException', (err) => {
  logger.error({ error: err.message, stack: err.stack });
});
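One way to "pass context differently" is Node's AsyncLocalStorage, which carries per-request data through async calls without registering anything on a global emitter. This is a sketch of that idea rather than what we shipped: requestContext, withRequestContext, and currentRequestId are hypothetical names.

```javascript
const { AsyncLocalStorage } = require('async_hooks');

// Hypothetical request-context store; one instance is created at startup.
const requestContext = new AsyncLocalStorage();

// Run fn (for example Express's next) inside a store carrying the request id.
// run() returns whatever fn returns.
function withRequestContext(requestId, fn) {
  return requestContext.run({ requestId }, fn);
}

// Read the current request id from anywhere in that request's call chain.
// Returns undefined when called outside any request context.
function currentRequestId() {
  const store = requestContext.getStore();
  return store ? store.requestId : undefined;
}
```

In the middleware this would look like withRequestContext(uuid.v4(), next), and log calls could include currentRequestId(). Whether the store is still visible inside an uncaughtException handler depends on where the error was thrown, so verify that before relying on it.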
We also added a safeguard: a periodic check that warns if the listener count grows unexpectedly:
setInterval(() => {
  const listenerCount = process.listenerCount('uncaughtException');
  if (listenerCount > 5) {
    logger.warn({ listenerCount }, 'Possible listener leak detected');
  }
}, 60000);
Step 6: Verify the Fix
After deploying, we watched the heap for 48 hours:
Before fix: ~80MB → ~480MB over 8 hours (OOM kill)
After fix: ~80MB → ~120MB over 48 hours (stable, GC working)
The heap stabilized around 120MB with normal GC sawtooth patterns. No more 3 AM alerts.
What We Learned
- Heap snapshots are your best friend. The comparison view in Chrome DevTools makes leaks obvious. Don't guess: take snapshots.
- Event listeners on global objects are dangerous. process, global, and long-lived EventEmitter instances will keep your closures alive forever. Always clean up with removeListener or use once().
- Per-request resource allocation needs per-request cleanup. If you create something during a request, make sure it dies with the request. The res.on('finish') pattern is fine: res gets garbage collected. process does not.
- Add listener count monitoring. A simple process.listenerCount() check in your health endpoint can catch leaks before they take down production.
At Paradane, we've debugged memory leaks across dozens of client applications — Node.js, Python, PHP, you name it. The pattern is almost always the same: something global holding references to things that should be ephemeral. If you're dealing with a production memory leak and need a second pair of eyes, our custom software development team has seen it all.
Tools We Use for Memory Profiling
- Chrome DevTools (for Node.js heap snapshots): free, and Node.js can write the snapshots itself through the built-in v8 module
- clinic.js: excellent for spotting event loop delays and memory issues
- pm2 with --max-memory-restart: buys you time while you debug
- Prometheus + Grafana: for long-term memory trend visualization
Have you debugged a production memory leak? What tools worked for you? Drop your war stories in the comments.