The Slowest Incident to Diagnose
Memory leaks are sneaky. The service runs fine for hours. Then, slowly, it gets worse. Slower responses, more GC pauses, eventual OOM kills.
And when you look at the first 30 minutes of metrics, everything looks normal.
The Three Flavors of Memory Growth
1. True leaks
- Objects allocated but never freed
- Classic in C/C++, rare in Go/Java with GC
- Grows linearly forever until OOM
2. Unbounded caches
- Cache adds entries but never evicts
- Common in Node.js, Python, Go
- Grows until memory pressure triggers other issues
3. Memory fragmentation
- Heap is large but not usable
- Happens in long-running Java, Go, .NET services
- Not really a "leak" but behaves like one
All three cause the same symptom: memory grows over time. Treatment is different for each.
Detection Without Heap Dumps
Before you reach for pprof or heap dumps, the fastest diagnosis is graph-watching:
# Is memory growing linearly over the last 24 hours?
deriv(container_memory_working_set_bytes{service="api"}[24h]) > 0
# Is GC pause time increasing?
rate(jvm_gc_pause_seconds_sum[1h]) > 0.05
If memory is growing by ~500MB/day and GC pauses are increasing, you have a leak. Diagnosis complete.
The question is where.
Go Memory Profiling
Go makes this relatively easy:
import (
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof handlers
)

// In main():
go func() {
    http.ListenAndServe(":6060", nil)
}()
Then:
# Get a heap profile
go tool pprof http://localhost:6060/debug/pprof/heap
# In the pprof shell:
(pprof) top
(pprof) list suspiciousFunction
(pprof) web # generates an SVG call graph
Look for:
- Objects with high inuse_space
- Objects whose counts keep growing over time
- Unexpectedly large maps or slices
Key trick: take two heap profiles 1 hour apart and diff them:
go tool pprof -base heap1.pprof heap2.pprof
What shows up as "new" allocations in the diff is almost certainly your leak.
Java Memory Profiling
Java is harder because the JVM adds layers:
# Dump the heap
jmap -dump:format=b,file=heap.hprof <pid>
# Analyze with Eclipse MAT or JVisualVM
In MAT, look for:
- Leak Suspects report (automatic)
- Dominator tree (what's holding the most memory)
- GC roots path (what's preventing garbage collection)
Common Java culprits:
- Static collections (especially static Map)
- ThreadLocal values without cleanup
- Listeners/callbacks registered but never unregistered
- finalize() methods delaying collection
Node.js Memory Profiling
// Enable the inspector
node --inspect app.js
// Then in Chrome DevTools → Memory → Heap Snapshot
// Take 3 snapshots: baseline, after 10 min, after 20 min
// Compare to find retained objects
Common Node culprits:
- Event emitter listeners that accumulate
- Closures holding references to large objects
- Unbounded caches (remember, Node has no built-in LRU)
- Stream buffers not being drained
Python Memory Profiling
import tracemalloc

tracemalloc.start()

# ... run the leaky operation ...

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')

for stat in top_stats[:10]:
    print(stat)
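The two-profiles-an-hour-apart diff trick from the Go section works here too. A minimal sketch using tracemalloc's compare_to (the one-hour wait and top-10 cutoff are arbitrary choices for illustration):

import time
import tracemalloc

tracemalloc.start()

baseline = tracemalloc.take_snapshot()
time.sleep(3600)  # let the suspected leak accumulate for an hour
later = tracemalloc.take_snapshot()

# Show the allocation sites that grew the most between the two snapshots
for stat in later.compare_to(baseline, 'lineno')[:10]:
    print(stat)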
Or use memory_profiler:
from memory_profiler import profile

@profile
def suspect_function():
    ...  # the code under suspicion goes here
Common Python culprits:
- Global lists/dicts growing unbounded
- Reference cycles with __del__ methods
- C extensions leaking (hardest to find)
- Pandas DataFrames kept around too long
The Cache Leak Special Case
The most common "leak" isn't a leak at all. It's a cache without eviction.
# BAD: unbounded
cache = {}

def get_user(user_id):
    if user_id not in cache:
        cache[user_id] = fetch_from_db(user_id)
    return cache[user_id]

# GOOD: bounded LRU
from functools import lru_cache

@lru_cache(maxsize=10000)
def get_user(user_id):
    return fetch_from_db(user_id)
Always bound your caches. Always.
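lru_cache bounds the entry count but never expires stale data. If entries should also age out, a TTL cache does both; a sketch using the third-party cachetools package (the size and TTL values are illustrative, and fetch_from_db is the same placeholder as in the example above):

from cachetools import TTLCache, cached

# At most 10,000 entries, and each entry expires after 5 minutes
user_cache = TTLCache(maxsize=10000, ttl=300)

@cached(user_cache)
def get_user(user_id):
    return fetch_from_db(user_id)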
Fragmentation in Go
Go's garbage collector can leave the heap fragmented. You see:
- Runtime memory is high
- Heap profile shows low allocations
- runtime.GC() doesn't reduce usage much
Solution: tune GOGC or force memory release:
import "runtime/debug"
debug.SetGCPercent(20) // More aggressive GC
debug.FreeOSMemory() // Return memory to OS
The Long-Running Service Pattern
Services that run for weeks without restart accumulate cruft. Even without leaks.
We use this pattern:
deployment_policy:
  max_uptime: 7d
  restart_schedule: "rolling restart every 7 days"
Every pod gets restarted weekly during a quiet window. Eliminates slow memory growth as a class of problem.
This isn't defeat. It's acknowledging that long-running processes in any language eventually accumulate state you don't want.
The Diagnostic Checklist
When a service is suspected of leaking:
- Is memory growing linearly or logarithmically? (linear = real leak)
- Is GC frequency/duration increasing? (yes = real pressure)
- Are request rates growing proportionally? (yes = normal growth, not leak)
- Take heap profile, save baseline
- Wait 1 hour, take a second profile, and diff (see the sketch after this list)
- Look for unexpected high-count objects
- Trace back to allocation site
- Fix the leak, deploy, watch metrics for 24h
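To make the take-a-baseline-then-diff steps repeatable for a Go service exposing the pprof endpoint shown earlier, a small helper script can grab both profiles for you. A hypothetical sketch (the host, port, filenames, and one-hour wait are assumptions); diff the results afterwards with go tool pprof -base baseline.pprof later.pprof:

import time
import urllib.request

# pprof endpoint from the Go section above
PPROF_URL = "http://localhost:6060/debug/pprof/heap"

# Save a baseline heap profile, wait an hour, then save a second one
urllib.request.urlretrieve(PPROF_URL, "baseline.pprof")
time.sleep(3600)
urllib.request.urlretrieve(PPROF_URL, "later.pprof")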
Rinse and repeat. Memory leaks are annoying but systematically fixable.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com