The Slowest Incident to Diagnose
Memory leaks are sneaky. The service runs fine for hours. Then, slowly, it gets worse. Slower responses, more GC pauses, eventual OOM kills.
And when you look at the first 30 minutes of metrics, everything looks normal.
The Three Flavors of Memory Growth
1. True leaks
- Objects allocated but never freed
- Classic in C/C++, rare in Go/Java with GC
- Grows linearly forever until OOM
2. Unbounded caches
- Cache adds entries but never evicts
- Common in Node.js, Python, Go
- Grows until memory pressure triggers other issues
3. Memory fragmentation
- Heap is large but not usable
- Happens in long-running Java, Go, .NET services
- Not really a "leak" but behaves like one
All three cause the same symptom: memory grows over time. Treatment is different for each.
Detection Without Heap Dumps
Before you reach for pprof or heap dumps, the fastest diagnosis is graph-watching:
# Is memory growing linearly over the last 24 hours?
deriv(container_memory_working_set_bytes{service="api"}[24h]) > 0
# Is GC pause time increasing?
rate(jvm_gc_pause_seconds_sum[1h]) > 0.05
If memory is growing by ~500MB/day and GC pauses are increasing, you have a leak. Diagnosis complete.
The question is where.
Go Memory Profiling
Go makes this relatively easy:
import (
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof handlers
)

// In main():
go func() {
    http.ListenAndServe(":6060", nil)
}()
Then:
# Get a heap profile
go tool pprof http://localhost:6060/debug/pprof/heap
# In the pprof shell:
(pprof) top
(pprof) list suspiciousFunction
(pprof) web # generates an SVG call graph
Look for:
- Objects with high inuse_space
- Objects whose counts keep growing over time
- Unexpectedly large maps or slices
Key trick: take two heap profiles 1 hour apart and diff them:
go tool pprof -base heap1.pprof heap2.pprof
What shows up as "new" allocations in the diff is almost certainly your leak.
Java Memory Profiling
Java is harder because the JVM adds layers:
# Dump the heap
jmap -dump:format=b,file=heap.hprof <pid>
# Analyze with Eclipse MAT or JVisualVM
In MAT, look for:
- Leak Suspects report (automatic)
- Dominator tree (what's holding the most memory)
- GC roots path (what's preventing garbage collection)
Common Java culprits:
- Static collections (especially static Map)
- ThreadLocal values without cleanup
- Listeners/callbacks registered but never unregistered
- finalize() methods delaying collection
Node.js Memory Profiling
// Enable the inspector
node --inspect app.js
// Then in Chrome DevTools → Memory → Heap Snapshot
// Take 3 snapshots: baseline, after 10 min, after 20 min
// Compare to find retained objects
Common Node culprits:
- Event emitter listeners that accumulate
- Closures holding references to large objects
- Unbounded caches (remember, Node has no built-in LRU)
- Stream buffers not being drained
Python Memory Profiling
import tracemalloc

tracemalloc.start()

# ... run the leaky operation ...

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')

for stat in top_stats[:10]:
    print(stat)
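The two-profiles-an-hour-apart diff trick from the Go section works here too. A minimal sketch using tracemalloc's compare_to (the one-hour wait and top-10 cutoff are arbitrary choices for illustration):

import time
import tracemalloc

tracemalloc.start()

baseline = tracemalloc.take_snapshot()
time.sleep(3600)  # let the suspected leak accumulate for an hour
later = tracemalloc.take_snapshot()

# Show the allocation sites that grew the most between the two snapshots
for stat in later.compare_to(baseline, 'lineno')[:10]:
    print(stat)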
Or use memory_profiler:
from memory_profiler import profile

@profile
def suspect_function():
    ...  # the code under suspicion goes here
Common Python culprits:
- Global lists/dicts growing unbounded
- Reference cycles with __del__ methods
- C extensions leaking (hardest to find)
- Pandas DataFrames kept around too long
The Cache Leak Special Case
The most common "leak" isn't a leak at all. It's a cache without eviction.
# BAD: unbounded
cache = {}

def get_user(user_id):
    if user_id not in cache:
        cache[user_id] = fetch_from_db(user_id)
    return cache[user_id]

# GOOD: bounded LRU
from functools import lru_cache

@lru_cache(maxsize=10000)
def get_user(user_id):
    return fetch_from_db(user_id)
Always bound your caches. Always.
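lru_cache bounds the entry count but never expires stale data. If entries should also age out, a TTL cache does both; a sketch using the third-party cachetools package (the size and TTL values are illustrative, and fetch_from_db is the same placeholder as in the example above):

from cachetools import TTLCache, cached

# At most 10,000 entries, and each entry expires after 5 minutes
user_cache = TTLCache(maxsize=10000, ttl=300)

@cached(user_cache)
def get_user(user_id):
    return fetch_from_db(user_id)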
Fragmentation in Go
Go's garbage collector can leave the heap fragmented. You see:
- Runtime memory is high
- Heap profile shows low allocations
- runtime.GC() doesn't reduce usage much
Solution: tune GOGC or force memory release:
import "runtime/debug"
debug.SetGCPercent(20) // More aggressive GC
debug.FreeOSMemory() // Return memory to OS
The Long-Running Service Pattern
Services that run for weeks without restart accumulate cruft. Even without leaks.
We use this pattern:
deployment_policy:
  max_uptime: 7d
  restart_schedule: "rolling restart every 7 days"
Every pod gets restarted weekly during a quiet window. Eliminates slow memory growth as a class of problem.
This isn't defeat. It's acknowledging that long-running processes in any language eventually accumulate state you don't want.
The Diagnostic Checklist
When a service is suspected of leaking:
- Is memory growing linearly or logarithmically? (linear = real leak)
- Is GC frequency/duration increasing? (yes = real pressure)
- Are request rates growing proportionally? (yes = normal growth, not leak)
- Take heap profile, save baseline
- Wait 1 hour, take a second profile, and diff (see the sketch after this list)
- Look for unexpected high-count objects
- Trace back to allocation site
- Fix the leak, deploy, watch metrics for 24h
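To make the take-a-baseline-then-diff steps repeatable for a Go service exposing the pprof endpoint shown earlier, a small helper script can grab both profiles for you. A hypothetical sketch (the host, port, filenames, and one-hour wait are assumptions); diff the results afterwards with go tool pprof -base baseline.pprof later.pprof:

import time
import urllib.request

# pprof endpoint from the Go section above
PPROF_URL = "http://localhost:6060/debug/pprof/heap"

# Save a baseline heap profile, wait an hour, then save a second one
urllib.request.urlretrieve(PPROF_URL, "baseline.pprof")
time.sleep(3600)
urllib.request.urlretrieve(PPROF_URL, "later.pprof")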
Rinse and repeat. Memory leaks are annoying but systematically fixable.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com