By Vijayendra Hunasgi
Abstract: A service started dying with 5xx errors and repeated pod restarts. The root cause was an insidious memory leak. This post walks through a common real-world incident and shows how to triage, stabilize, debug, and fix it using kubectl, cgroup introspection, autoscaling, and code-level changes. I also provide a reproducible demo (GitHub repo) so you can follow along.
🧭 Table of Contents
- The Incident & Timeline
- Repro Demo: GitHub Memory-Leak Repo
- Troubleshooting Commands Cheat Sheet
- Stabilization: Stop the Bleeding
- Autoscaling & HPA Diagnostics
- Reading Signals: Events, cgroups, QoS
- The Fix: Code + YAML
- Lessons Learned & Best Practices
- Call to Action & Feedback
1. The Incident & Timeline
Assume observability tooling flags an HTTP 5xx surge and repeated pod restarts. Within minutes, the signals narrow down to memory pressure and OOMKilled events.
Initial actions:
- Run kubectl get events → look for “Killing” / “OOMKilled” warnings
- Describe the pods → confirm ExitCode 137
- Check node conditions → look for MemoryPressure
- Sort pods by memory usage → spot abnormal growth
The next step is to stabilize the service, then dig deeper into the leak source.
2. Repro Demo: GitHub Memory-Leak Repo
To make this investigation reproducible, I created a demo repository:
👉 github.com/Vijayendrahunasgi/memory-leak-k8s
This includes:
- memleak.py — a Python app that continuously allocates memory to simulate a leak (a minimal sketch follows below)
- deployment.yaml — the Kubernetes Deployment manifest (with limits/requests)
- hpa.yaml — the HorizontalPodAutoscaler manifest
- Instructions to run it and observe the OOM behavior
You can clone it and apply it in a test cluster to follow along.
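For reference, here is a minimal sketch of what memleak.py does; the repo's version may differ in details, but the core idea is a list that grows forever:
import time

# Minimal sketch of the leaking app; the repo's memleak.py may differ in details.
arr = []  # references are kept forever, so nothing is ever freed

while True:
    arr.append(bytearray(1 * 1024 * 1024))  # allocate and retain 1 MiB each second
    time.sleep(1)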
3. Troubleshooting Commands Cheat Sheet
The following commands cover the checks used throughout this post; the trailing comment on each line explains what it does.
kubectl get events --field-selector reason=Killing,type=Warning # Lists recent OOMKilled events quickly
kubectl describe pod <pod> # Shows restart reasons, exit codes, and probe status
kubectl describe nodes # Reveals node-level conditions like MemoryPressure
kubectl top pod --sort-by=memory # Ranks pods by memory usage
# Shows configured vs real memory bounds (split for readability)
kubectl get pod <pod> \
-o custom-columns="POD:.metadata.name,REQ:.spec.containers[].resources.requests.memory,LIM:.spec.containers[].resources.limits.memory" # Requests vs limits
kubectl get pod <pod> -o jsonpath='{.status.qosClass}' # Shows QoS class (Guaranteed / Burstable / BestEffort)
# cgroup v1 paths; adjust if using cgroup v2 on your nodes
kubectl exec <pod> -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes # Reads cgroup-enforced memory limit
kubectl exec <pod> -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes # Reads actual memory usage
kubectl scale deploy/memory-leak --replicas=4 # Scale out to reduce per-pod pressure
kubectl set resources deploy/memory-leak --limits=memory=512Mi --requests=memory=256Mi # Adjust memory headroom
kubectl rollout restart deploy/memory-leak # Restart pods to free leaked memory
kubectl get hpa # List HPAs in namespace
kubectl describe hpa memory-leak-hpa # See HPA events, target metrics
kubectl get hpa memory-leak-hpa -o yaml # Inspect HPA status / scaling conditions
4. Stabilization: Stop the Bleeding
Whenever we see OOMKilled loops, the result is downtime, so the first step is to stop the bleeding, much like first aid. Here is how to stabilize the app: scale out the replicas (to get some breathing room), then increase the memory allocated to the pods, and restart the Deployment if needed.
kubectl scale deploy/memory-leak --replicas=4
kubectl set resources deploy/memory-leak --limits=memory=512Mi --requests=memory=256Mi
kubectl rollout restart deploy/memory-leak
Then check resource usage and recent events:
kubectl top pod --sort-by=memory
kubectl get events --field-selector reason=Killing,type=Warning
Sample output after stabilization:
NAME CPU(cores) MEMORY(bytes)
memory-leak-5b4c7d5fb8-xyz12 0.15 320Mi
memory-leak-5b4c7d5fb8-abc34 0.12 290Mi
memory-leak-5b4c7d5fb8-foo56 0.17 310Mi
memory-leak-5b4c7d5fb8-bar78 0.14 300Mi
💡 Pro tip: Always set memory requests — if a pod lacks a request, memory-based HPA calculations break (usage / request = NaN or inf).
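To make the math concrete, here is a tiny, hypothetical Python illustration of the utilization calculation the HPA relies on (this is not the controller's actual code):
# Hypothetical illustration of the memory-utilization math used by the HPA
usage_mi = 320      # observed usage in Mi (e.g. from kubectl top pod)
request_mi = 256    # spec.containers[].resources.requests.memory, in Mi

utilization_pct = 100 * usage_mi / request_mi
print(f"utilization: {utilization_pct:.0f}%")  # -> utilization: 125%

# With no request set there is nothing to divide by, so the HPA cannot
# compute a memory utilization for the pod and memory-based scaling breaks.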
5. Autoscaling & HPA Diagnostics
Before escalating to the development team, we can buy time by enabling a HorizontalPodAutoscaler (HPA), which adds new pods automatically under load. This can be achieved as follows.
Apply the following HPA manifest:
hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: memory-leak-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: memory-leak
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
Then verify that the HPA is applied and working as expected:
kubectl get hpa
kubectl describe hpa memory-leak-hpa
kubectl get hpa memory-leak-hpa -o yaml | sed -n '/status:/,$p'
Sample kubectl describe hpa output:
Name: memory-leak-hpa
Namespace: default
ScaleTargetRef: Deployment/memory-leak
Min replicas: 2
Max replicas: 10
Metrics: ( current / target )
resource memory on pods 280Mi / 400Mi
type: Utilization
target: 70%
After these checks we can confirm:
✅ HPA was scaling correctly based on memory utilization.
✅ Desired replicas matched observed scaling events.
✅ The AbleToScale and ScalingActive conditions were true.
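If you prefer checking those conditions programmatically, here is a hedged sketch using the official Kubernetes Python client (assuming a recent client version that exposes AutoscalingV2Api, and the HPA name/namespace from this demo):
from kubernetes import client, config

# Load credentials from ~/.kube/config (use config.load_incluster_config() inside a pod).
config.load_kube_config()

autoscaling = client.AutoscalingV2Api()
hpa = autoscaling.read_namespaced_horizontal_pod_autoscaler("memory-leak-hpa", "default")

# Print the scaling conditions (AbleToScale, ScalingActive, ScalingLimited) and replica counts.
for cond in hpa.status.conditions or []:
    print(f"{cond.type}: {cond.status} ({cond.reason})")
print("current replicas:", hpa.status.current_replicas, "desired:", hpa.status.desired_replicas)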
6. Reading Signals: Events, cgroups, QoS
After stopping the bleeding for the time being, we want to dig deeper and identify the actual cause of the resource exhaustion. The following signals help establish the pattern.
Events:
Warning Killing <timestamp> kubelet Killing container with id docker://...: out of memory
From kubectl describe pod <pod>:
State: Terminated
Reason: OOMKilled
Exit Code: 137
Restart Count: 5
Check the memory limit allocated to the pod:
kubectl exec <pod> -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes
# 536870912 (512Mi)
Check the current usage:
kubectl exec <pod> -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes
# Rising usage approaching limit
QoS & Eviction Priority:
Under node memory pressure, the kubelet evicts Burstable pods before Guaranteed ones, so check the QoS class of the affected pods:
kubectl get pod <pod> -o jsonpath='{.status.qosClass}'
# Burstable
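Beyond kubectl exec, the application itself can report the same cgroup numbers. The following is a minimal, hypothetical Python sketch (not part of the demo repo) that logs container memory usage against the limit, trying the cgroup v1 paths first and falling back to cgroup v2:
import os
import time

# cgroup v1 and v2 expose memory accounting under different file names.
V1_LIMIT = "/sys/fs/cgroup/memory/memory.limit_in_bytes"
V1_USAGE = "/sys/fs/cgroup/memory/memory.usage_in_bytes"
V2_LIMIT = "/sys/fs/cgroup/memory.max"
V2_USAGE = "/sys/fs/cgroup/memory.current"

def read_bytes(path):
    with open(path) as f:
        value = f.read().strip()
    return None if value == "max" else int(value)  # cgroup v2 reports "max" when no limit is set

def memory_status():
    if os.path.exists(V1_USAGE):
        return read_bytes(V1_USAGE), read_bytes(V1_LIMIT)
    return read_bytes(V2_USAGE), read_bytes(V2_LIMIT)

if __name__ == "__main__":
    while True:
        usage, limit = memory_status()
        pct = f"{100 * usage / limit:.1f}%" if limit else "n/a"
        print(f"memory usage: {usage} / {limit} bytes ({pct})", flush=True)
        time.sleep(10)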
7. The Fix: Code + YAML
Code-level:
This is the leaking loop from the demo repo's memleak.py:
while True:
    arr.append(bytearray(1 * 1024 * 1024))  # allocate 1 MiB per iteration
    time.sleep(1)
✅ Fix: bound the list size, evict old data, or use a bounded cache.
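One hedged way to apply that fix, assuming the goal is simply to cap how much the list retains: a collections.deque with maxlen acts as a simple bounded cache that evicts the oldest buffers automatically.
import time
from collections import deque

# Bounded cache: once maxlen is reached, the oldest buffer is dropped
# automatically, so retained memory stays roughly constant (~100 MiB here).
arr = deque(maxlen=100)

while True:
    arr.append(bytearray(1 * 1024 * 1024))  # still allocates 1 MiB per iteration...
    time.sleep(1)                           # ...but old entries are evicted instead of leaking
In a real service the equivalent is usually an LRU or TTL cache around whatever structure was growing without bound.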
deployment.yaml
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"
8. Lessons Learned & Best Practices
✅ Detect early using kubectl get events, kubectl describe, and cgroup metrics.
✅ Always define requests and limits.
✅ Burstable pods are evicted first under pressure.
✅ HPA is a guardrail, not a cure.
✅ Instrument apps for memory observability.
✅ Use bounded caches and eviction policies.
✅ Validate fixes in staging and monitor 24–48 hrs.
✅ Cross-share insights across teams to strengthen runbooks.
9. Call to Action & Feedback
Thank you for reading this deep dive! 🙌
👉 Explore the demo repo: Memory-Leak Demo on GitHub
If this helped — or if you’ve had your own “OOM panic” — drop a comment below!
Let’s share our Kubernetes war stories and learn from each other.