Samson Tanimawo
Debugging Kubernetes OOMKilled: A Step-by-Step Guide

The Dreaded OOMKilled

$ kubectl describe pod api-service-7f8d9c-abc12
...
State: Terminated
Reason: OOMKilled
Exit Code: 137

Your container got killed because it used more memory than its limit. Sounds simple. Debugging it is not.
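Exit code 137 is worth decoding: it's 128 plus 9, meaning the process died from SIGKILL, the signal both the kubelet and the kernel OOM killer use. A tiny illustrative snippet (the helper function is mine, not part of kubectl):

```python
import signal

# Kubernetes reports exit code 128 + N when a container dies from signal N.
def killed_by_signal(exit_code):
    """Return the signal number encoded in an exit code, or None."""
    return exit_code - 128 if exit_code > 128 else None

print(killed_by_signal(137) == signal.SIGKILL)  # True: 137 = 128 + SIGKILL (9)
```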

Step 1: Confirm the OOM

# Check pod events
kubectl get events --field-selector involvedObject.name=api-service-7f8d9c-abc12

# Check container status
kubectl get pod api-service-7f8d9c-abc12 -o jsonpath='{.status.containerStatuses[0].lastState}'

# Check node-level OOM events
kubectl describe node <node-name> | grep -A5 'OOMKilling'

Important distinction:

  • Container OOM: Container exceeded its memory limit → K8s kills it
  • Node OOM: Node ran out of memory → kernel OOM killer picks a victim
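The distinction shows up in the pod status itself: a container OOM leaves `reason: OOMKilled` in the container's `lastState`, while a node under memory pressure typically evicts the pod with `reason: Evicted` at the pod level. A sketch of telling the two apart from `kubectl get pod ... -o json` output (field names follow the pod status API; the classifier helper is hypothetical):

```python
import json

def classify_oom(pod_status):
    """Classify a pod's last termination from its status dict."""
    # Node-pressure evictions set a pod-level reason.
    if pod_status.get("reason") == "Evicted":
        return "node OOM (pod evicted under memory pressure)"
    # Container OOMs show up in lastState.terminated of a container status.
    for cs in pod_status.get("containerStatuses", []):
        term = cs.get("lastState", {}).get("terminated", {})
        if term.get("reason") == "OOMKilled":
            return f"container OOM (exit code {term.get('exitCode')})"
    return "no OOM recorded"

status = json.loads("""{
  "containerStatuses": [
    {"lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}
  ]
}""")
print(classify_oom(status))  # container OOM (exit code 137)
```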

Step 2: Understand Your Memory Usage

# Current memory usage vs limits
kubectl top pod api-service-7f8d9c-abc12

# Historical memory usage (Prometheus)
curl -s 'prometheus:9090/api/v1/query_range' \
--data-urlencode 'query=container_memory_working_set_bytes{pod="api-service-7f8d9c-abc12"}' \
--data-urlencode 'start=2024-03-15T00:00:00Z' \
--data-urlencode 'end=2024-03-15T12:00:00Z' \
--data-urlencode 'step=60s'

Key memory metrics:

container_memory_working_set_bytes  # what K8s uses for OOM decisions
container_memory_rss                # resident set size (actual RAM)
container_memory_cache              # file system cache (reclaimable)
container_memory_usage_bytes        # total, cache included (misleading!)

Always look at working_set_bytes, not usage_bytes. The latter includes cache that the kernel can reclaim.
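Concretely, cAdvisor derives the working set by subtracting reclaimable (inactive) file cache from total usage, so a pod can look close to its limit on usage_bytes while its working set sits comfortably below it. A toy illustration with made-up numbers:

```python
# Illustrative numbers (MiB), not real measurements.
usage = 900           # container_memory_usage_bytes: everything, cache included
inactive_file = 350   # page cache the kernel can reclaim under pressure

# cAdvisor computes the working set as usage minus inactive file cache;
# this is the figure compared against the memory limit.
working_set = usage - inactive_file

print(working_set)  # 550 -- well under a 768 MiB limit, unlike raw usage
```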

Step 3: Find the Memory Leak

For Node.js:

// Add a heap snapshot endpoint
const v8 = require('v8');

app.get('/debug/heap', (req, res) => {
  // writeHeapSnapshot() writes a .heapsnapshot file and returns its path
  const snapshotPath = v8.writeHeapSnapshot();
  res.json({ snapshot: snapshotPath });
});

// Track memory over time
setInterval(() => {
  const used = process.memoryUsage();
  console.log(JSON.stringify({
    rss: Math.round(used.rss / 1024 / 1024) + 'MB',
    heapUsed: Math.round(used.heapUsed / 1024 / 1024) + 'MB',
    heapTotal: Math.round(used.heapTotal / 1024 / 1024) + 'MB',
    external: Math.round(used.external / 1024 / 1024) + 'MB'
  }));
}, 30000); // every 30 seconds

For Python:

import tracemalloc

tracemalloc.start(25)  # keep 25 stack frames per allocation

@app.route('/debug/memory')
def memory_snapshot():
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')[:20]
    return jsonify([
        {
            'file': str(stat.traceback),
            'size_mb': round(stat.size / 1024 / 1024, 2),
            'count': stat.count
        }
        for stat in top_stats
    ])

For Go:

import (
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* handlers on DefaultServeMux
)

// In main():
go func() {
    http.ListenAndServe(":6060", nil)
}()

// Then: go tool pprof http://localhost:6060/debug/pprof/heap

Step 4: Common Causes and Fixes

1. Memory Limit Too Low

# Check actual usage pattern first
# If peak usage is 450MB and limit is 512MB, that's too tight
resources:
  requests:
    memory: 256Mi  # based on average usage
  limits:
    memory: 768Mi  # 1.5x peak usage for headroom
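The sizing rule above (request near average usage, limit at roughly 1.5x peak) is easy to mechanize. A hypothetical helper, not a real kubectl feature:

```python
def suggest_limits(avg_mib, peak_mib, headroom=1.5):
    """Suggest request/limit values: request near average usage,
    limit at `headroom` times the observed peak."""
    return {
        "requests": {"memory": f"{avg_mib}Mi"},
        "limits": {"memory": f"{int(peak_mib * headroom)}Mi"},
    }

# A peak of 450MiB against a 512MiB limit is too tight;
# this suggests a 675Mi limit instead.
print(suggest_limits(avg_mib=256, peak_mib=450))
```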

2. Unbounded Caches

# Bad: cache grows forever
cache = {}

def get_user(user_id):
    if user_id not in cache:
        cache[user_id] = db.query(user_id)  # never evicted!
    return cache[user_id]

# Good: bounded cache with LRU eviction
from functools import lru_cache

@lru_cache(maxsize=10000)  # bounded!
def get_user(user_id):
    return db.query(user_id)

3. Connection Accumulation

# Bad: connections never closed
def process_request():
    conn = create_db_connection()  # opens a connection
    result = conn.query('SELECT...')
    return result  # connection leaked!

# Good: always close connections
def process_request():
    with get_db_connection() as conn:  # auto-closed on exit
        return conn.query('SELECT...')

Step 5: Set Up Proactive Alerts

- alert: MemoryApproachingLimit
  expr: |
    container_memory_working_set_bytes
      / container_spec_memory_limit_bytes > 0.85
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.pod }} using {{ $value | humanizePercentage }} of memory limit"
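The alert expression is just a ratio check; the same logic in plain code (an illustrative helper mirroring the PromQL above):

```python
def memory_alert(working_set_bytes, limit_bytes, threshold=0.85):
    """Fire when working set exceeds the given fraction of the limit."""
    return working_set_bytes / limit_bytes > threshold

limit = 512 * 1024 * 1024  # 512Mi limit
print(memory_alert(int(0.9 * limit), limit))  # True: 90% of limit, page someone
print(memory_alert(int(0.5 * limit), limit))  # False: plenty of headroom
```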

Catch it at 85% instead of discovering it at OOMKilled.

If you want AI-powered memory analysis that finds leaks before they crash your pods, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
