The Dreaded OOMKilled
$ kubectl describe pod api-service-7f8d9c-abc12
...
State: Terminated
Reason: OOMKilled
Exit Code: 137
Your container was killed because it used more memory than its limit; exit code 137 is 128 + 9, meaning the process received SIGKILL. Sounds simple. Debugging it is not.
Step 1: Confirm the OOM
# Check pod events
kubectl get events --field-selector involvedObject.name=api-service-7f8d9c-abc12
# Check container status
kubectl get pod api-service-7f8d9c-abc12 -o jsonpath='{.status.containerStatuses[0].lastState}'
# Check node-level OOM events
kubectl describe node <node-name> | grep -A5 'OOMKilling'
Important distinction:
- Container OOM: Container exceeded its memory limit → K8s kills it
- Node OOM: Node ran out of memory → kernel OOM killer picks a victim
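The distinction can be automated from the pod's status. Here's a minimal sketch (the helper name `classify_oom` is hypothetical; the dict shape mirrors what `kubectl get pod -o json` returns under `.status.containerStatuses[].lastState`):

```python
# Classify a terminated container's lastState (hypothetical helper).
# Exit code 137 = 128 + SIGKILL(9): the process was killed by a signal.
def classify_oom(last_state: dict) -> str:
    term = last_state.get("terminated")
    if not term:
        return "not terminated"
    if term.get("reason") == "OOMKilled":
        return "container OOM (exceeded its memory limit)"
    if term.get("exitCode") == 137:
        # SIGKILL without an OOMKilled reason often means the node's
        # kernel OOM killer picked this process as a victim.
        return "possible node OOM (SIGKILLed, no OOMKilled reason)"
    return f"terminated: {term.get('reason', 'unknown')}"
```

Run it against the `lastState` JSON from Step 1 to separate the two cases quickly across many pods.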
Step 2: Understand Your Memory Usage
# Current memory usage vs limits
kubectl top pod api-service-7f8d9c-abc12
# Historical memory usage (Prometheus)
curl -s 'prometheus:9090/api/v1/query_range' \
--data-urlencode 'query=container_memory_working_set_bytes{pod="api-service-7f8d9c-abc12"}' \
--data-urlencode 'start=2024-03-15T00:00:00Z' \
--data-urlencode 'end=2024-03-15T12:00:00Z' \
--data-urlencode 'step=60s'
Key memory metrics:
container_memory_working_set_bytes: # What K8s uses for OOM decisions
container_memory_rss: # Resident Set Size (actual RAM)
container_memory_cache: # File system cache (reclaimable)
container_memory_usage_bytes: # Total, including cache (misleading!)
Always look at working_set_bytes, not usage_bytes. The latter includes cache that the kernel can reclaim.
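The relationship between the two metrics can be sketched numerically (assumption: cAdvisor derives the working set by subtracting the inactive, reclaimable file cache from total usage):

```python
# Working set ~= total usage minus inactive (reclaimable) file cache.
# This is why usage_bytes over-reports: under memory pressure the kernel
# drops that cache instead of OOM-killing the container.
def working_set_bytes(usage_bytes: int, inactive_file_bytes: int) -> int:
    return max(0, usage_bytes - inactive_file_bytes)

# Example: 900 MiB of "usage" may only be a 600 MiB working set.
MiB = 1024 * 1024
ws = working_set_bytes(900 * MiB, 300 * MiB)
```

A pod showing 900Mi of `usage_bytes` against a 1Gi limit may be nowhere near an OOM if 300Mi of that is reclaimable cache.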
Step 3: Find the Memory Leak
For Node.js:
// Add heap snapshot endpoint
const v8 = require('v8');

app.get('/debug/heap', (req, res) => {
  // writeHeapSnapshot() writes a .heapsnapshot file to disk
  // and returns its filename (it does not return a stream)
  const filename = v8.writeHeapSnapshot();
  res.json({ snapshot: filename });
});
// Track memory over time
setInterval(() => {
  const used = process.memoryUsage();
  console.log(JSON.stringify({
    rss: Math.round(used.rss / 1024 / 1024) + 'MB',
    heapUsed: Math.round(used.heapUsed / 1024 / 1024) + 'MB',
    heapTotal: Math.round(used.heapTotal / 1024 / 1024) + 'MB',
    external: Math.round(used.external / 1024 / 1024) + 'MB'
  }));
}, 30000); // Every 30 seconds
For Python:
import tracemalloc
from flask import jsonify  # assumes an existing Flask app

tracemalloc.start(25)  # Keep up to 25 stack frames per allocation

@app.route('/debug/memory')
def memory_snapshot():
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')[:20]
    return jsonify([
        {
            'file': str(stat.traceback),
            'size_mb': round(stat.size / 1024 / 1024, 2),
            'count': stat.count
        }
        for stat in top_stats
    ])
For Go:
import (
    "net/http"
    _ "net/http/pprof" // Registers /debug/pprof handlers on DefaultServeMux
)

// In main():
go func() {
    http.ListenAndServe(":6060", nil)
}()

// Then: go tool pprof http://localhost:6060/debug/pprof/heap
Step 4: Common Causes and Fixes
1. Memory Limit Too Low
# Check actual usage pattern first
# If peak usage is 450MB and limit is 512MB, that's too tight
resources:
  requests:
    memory: 256Mi  # Based on average usage
  limits:
    memory: 768Mi  # ~1.5x peak usage for headroom
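This sizing rule of thumb can be sketched as a helper (hypothetical function; the 1.5x multiplier and 64Mi rounding are the article's heuristic, not anything Kubernetes enforces):

```python
def suggest_resources(samples_mib, headroom=1.5):
    """Suggest request/limit values from memory usage samples in MiB.

    request ~= average usage, limit ~= headroom x peak usage,
    both rounded up to the nearest 64Mi for tidy manifests.
    """
    def round_up(x, step=64):
        # Round up to the next multiple of `step`
        return ((int(x) + step - 1) // step) * step

    avg = sum(samples_mib) / len(samples_mib)
    peak = max(samples_mib)
    return {
        "request": f"{round_up(avg)}Mi",
        "limit": f"{round_up(peak * headroom)}Mi",
    }
```

Feed it a day or two of samples from the Prometheus query in Step 2 rather than a single `kubectl top` reading, since OOMs come from peaks, not averages.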
2. Unbounded Caches
# Bad: cache grows forever
cache = {}

def get_user(user_id):
    if user_id not in cache:
        cache[user_id] = db.query(user_id)  # Never evicted!
    return cache[user_id]

# Good: bounded cache with LRU eviction
from functools import lru_cache

@lru_cache(maxsize=10000)  # Bounded!
def get_user(user_id):
    return db.query(user_id)
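Note that lru_cache bounds size but never expires entries. If stale data is also a concern, a minimal size-plus-TTL cache is not much more code (a hypothetical `TTLCache` sketch, stdlib only; the injectable clock is just for testability):

```python
import time
from collections import OrderedDict

class TTLCache:
    """Size-bounded cache with per-entry expiry; least recently used entries evicted first."""

    def __init__(self, maxsize=10000, ttl_seconds=300, clock=time.monotonic):
        self.maxsize = maxsize
        self.ttl = ttl_seconds
        self.clock = clock  # injectable for testing
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None or entry[0] < self.clock():
            self._data.pop(key, None)  # drop expired entry
            return default
        self._data.move_to_end(key)  # mark as recently used
        return entry[1]

    def put(self, key, value):
        self._data[key] = (self.clock() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict least recently used
```

Either bound works; the point is that every cache needs some eviction policy, or it is a memory leak with a nicer name.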
3. Connection Accumulation
# Bad: connection never closed
def process_request():
    conn = create_db_connection()  # Opens a connection
    result = conn.query('SELECT...')
    return result  # Connection leaked!

# Good: always close connections
def process_request():
    with get_db_connection() as conn:  # Auto-closes
        return conn.query('SELECT...')
Step 5: Set Up Proactive Alerts
- alert: MemoryApproachingLimit
  expr: |
    container_memory_working_set_bytes
      / container_spec_memory_limit_bytes > 0.85
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.pod }} using {{ $value | humanizePercentage }} of memory limit"
Catch it at 85% instead of discovering it at OOMKilled.
If you want AI-powered memory analysis that finds leaks before they crash your pods, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com