Samson Tanimawo

Memory Leak Detection in Long-Running Services

The Slowest Incident to Diagnose

Memory leaks are sneaky. The service runs fine for hours. Then, slowly, it gets worse. Slower responses, more GC pauses, eventual OOM kills.

And when you look at the first 30 minutes of metrics, everything looks normal.

The Three Flavors of Memory Growth

1. True leaks

  • Objects allocated but never freed
  • Classic in C/C++, rare in Go/Java with GC
  • Grows linearly forever until OOM

2. Unbounded caches

  • Cache adds entries but never evicts
  • Common in Node.js, Python, Go
  • Grows until memory pressure triggers other issues

3. Memory fragmentation

  • Heap is large but not usable
  • Happens in long-running Java, Go, and .NET services
  • Not really a "leak" but behaves like one

All three cause the same symptom: memory grows over time. Treatment is different for each.

Detection Without Heap Dumps

Before you reach for pprof or heap dumps, the fastest diagnosis is graph-watching:

# Is memory growing linearly over the last 24 hours?
deriv(container_memory_working_set_bytes{service="api"}[24h]) > 0

# Is GC pause time increasing?
rate(jvm_gc_pause_seconds_sum[1h]) > 0.05

If memory is growing by ~500MB/day and GC pauses are increasing, you have a leak. Detection complete.

The question is where.

Go Memory Profiling

Go makes this relatively easy:

import _ "net/http/pprof"

// In main():
go func() {
    http.ListenAndServe(":6060", nil)
}()

Then:

# Get a heap profile
go tool pprof http://localhost:6060/debug/pprof/heap

# In the pprof shell:
(pprof) top
(pprof) list suspiciousFunction
(pprof) web # generates an SVG call graph

Look for:

  • Objects with high inuse_space
  • Objects with growing counts over time
  • Unexpected large maps or slices

Key trick: take two heap profiles 1 hour apart and diff them:

go tool pprof -base heap1.pprof heap2.pprof

What shows up as "new" allocations in the diff is almost certainly your leak.

Java Memory Profiling

Java is harder because the JVM adds layers:

# Dump the heap
jmap -dump:format=b,file=heap.hprof <pid>

# Analyze with Eclipse MAT or VisualVM

In MAT, look for:

  • Leak Suspects report (automatic)
  • Dominator tree (what's holding the most memory)
  • GC roots path (what's preventing garbage collection)

Common Java culprits:

  • Static collections (especially static Map)
  • ThreadLocal values without cleanup
  • Listeners/callbacks registered but never unregistered
  • finalize() methods delaying collection

Node.js Memory Profiling

// Enable the inspector
node --inspect app.js

// Then in Chrome DevTools → Memory → Heap Snapshot
// Take 3 snapshots: baseline, after 10 min, after 20 min
// Compare to find retained objects

Common Node culprits:

  • Event emitter listeners that accumulate
  • Closures holding references to large objects
  • Unbounded caches (remember, Node has no built-in LRU)
  • Stream buffers not being drained

Python Memory Profiling

import tracemalloc
tracemalloc.start()

# ... run the leaky operation ...

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
for stat in top_stats[:10]:
    print(stat)
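The two-profiles-one-hour-apart trick from the Go section works here too: tracemalloc can diff snapshots directly, so allocations that appeared between them stand out. A minimal sketch, where growing a module-level list stands in for your suspect code:

```python
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

# Stand-in for the leaky operation: grow a module-level list
cache = []
for i in range(10000):
    cache.append("payload-%d" % i)

current = tracemalloc.take_snapshot()

# Source lines ranked by how much NEW memory they hold since the baseline
for stat in current.compare_to(baseline, "lineno")[:5]:
    print(stat)
```

As with `pprof -base`, whatever dominates the diff is almost certainly the leak site.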

Or use memory_profiler:

from memory_profiler import profile

@profile
def suspect_function():
    # code here
    ...

Common Python culprits:

  • Global lists/dicts growing unbounded
  • Reference cycles with __del__ methods (uncollectable before Python 3.4)
  • C extensions leaking (hardest to find)
  • Pandas DataFrames kept around too long
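For the "global structure growing unbounded" case, a cruder stdlib-only check than tracemalloc is to count live objects by type and watch which counts climb between two calls. A sketch; `LeakyRecord` is a made-up class standing in for whatever your service hoards:

```python
import gc
from collections import Counter

def objects_by_type():
    """Count all live gc-tracked objects by type name."""
    return Counter(type(o).__name__ for o in gc.get_objects())

# Demonstration: "leak" 5000 instances and watch the count appear
class LeakyRecord:
    pass

before = objects_by_type()
hoard = [LeakyRecord() for _ in range(5000)]  # the simulated leak
after = objects_by_type()

# Counter subtraction keeps only types whose counts grew
print((after - before)["LeakyRecord"])  # 5000
```

In a real service you would call `objects_by_type()` from a debug endpoint an hour apart and diff the results.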

The Cache Leak Special Case

The most common "leak" isn't a leak at all. It's a cache without eviction.

# BAD: unbounded
cache = {}
def get_user(user_id):
    if user_id not in cache:
        cache[user_id] = fetch_from_db(user_id)
    return cache[user_id]

# GOOD: bounded LRU
from functools import lru_cache

@lru_cache(maxsize=10000)
def get_user(user_id):
    return fetch_from_db(user_id)

Always bound your caches. Always.
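`lru_cache` only helps when the whole lookup is a pure function of its arguments. For hand-rolled caches (explicit writes, invalidation, non-hashable values), the stdlib `OrderedDict` gives you the same bound in a few lines. A sketch, with an arbitrary `MAX_ENTRIES` you'd size to your memory budget:

```python
from collections import OrderedDict

MAX_ENTRIES = 10000  # arbitrary bound; size to your memory budget

class BoundedCache:
    """Dict-like cache that evicts the least-recently-used entry at capacity."""

    def __init__(self, maxsize=MAX_ENTRIES):
        self.maxsize = maxsize
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key in self._data:
            self._data.move_to_end(key)  # mark as recently used
            return self._data[key]
        return default

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict the oldest entry

# Usage: memory stays bounded no matter how many keys pass through
cache = BoundedCache(maxsize=3)
for i in range(100):
    cache.put(i, i * i)
print(len(cache._data))  # 3 -- never exceeds maxsize
```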

Fragmentation in Go

Go's garbage collector can leave the heap fragmented. You see:

  • RSS (runtime memory) is high
  • Heap profile shows low in-use allocations
  • runtime.GC() doesn't reduce usage much

Solution: tune GOGC (or, on Go 1.19+, set a soft limit with GOMEMLIMIT) or force memory release:

import "runtime/debug"

debug.SetGCPercent(20) // more aggressive GC
debug.FreeOSMemory()   // return memory to the OS

The Long-Running Service Pattern

Services that run for weeks without restart accumulate cruft. Even without leaks.

We use this pattern:

deployment_policy:
  max_uptime: 7d
  restart_schedule: "rolling restart every 7 days"

Every pod gets restarted weekly during a quiet window. Eliminates slow memory growth as a class of problem.

This isn't defeat. It's acknowledging that long-running processes in any language eventually accumulate state you don't want.

The Diagnostic Checklist

When a service is suspected of leaking:

  1. Is memory growing linearly or logarithmically? (linear = real leak)
  2. Is GC frequency/duration increasing? (yes = real pressure)
  3. Are request rates growing proportionally? (yes = normal growth, not leak)
  4. Take heap profile, save baseline
  5. Wait 1 hour, take second profile, diff
  6. Look for unexpected high-count objects
  7. Trace back to allocation site
  8. Fix the leak, deploy, watch metrics for 24h

Rinse and repeat. Memory leaks are annoying but systematically fixable.
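Step 1 of the checklist can even be automated: fit a least-squares line through recent memory samples and alert when the slope exceeds a budget. A minimal sketch on synthetic data, with a made-up 10 MB/hour threshold:

```python
def growth_rate(samples):
    """Least-squares slope of (timestamp_seconds, bytes) samples, in bytes/sec."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den

# Synthetic example: a ~500 MB/day leak on a 2 GB baseline, sampled every 10 min
LEAK_BYTES_PER_SEC = 500 * 1024**2 / 86400
samples = [(t, 2 * 1024**3 + LEAK_BYTES_PER_SEC * t) for t in range(0, 86400, 600)]

bytes_per_hour = growth_rate(samples) * 3600
THRESHOLD = 10 * 1024**2  # hypothetical alert budget: 10 MB/hour
print(bytes_per_hour > THRESHOLD)  # True for this synthetic leak
```

Feed it real samples (e.g. scraped from the working-set metric above) and you get the "is it linear growth?" answer without staring at dashboards.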


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
