
OOMKilled Pods: A guide to troubleshooting.

By Vijayendra Hunasgi

Abstract: A service started dying with 5xx errors and repeated pod restarts.

The root cause was an insidious memory leak.

This post walks through a common real-world incident and shows how to triage, stabilize, debug, and fix it using kubectl, cgroup introspection, autoscaling, and code-level changes.

I also provide a reproducible demo (GitHub repo) so you can follow along.


🧭 Table of Contents

  1. The Incident & Timeline
  2. Repro Demo: GitHub Memory-Leak Repo
  3. Troubleshooting Commands Cheat Sheet
  4. Stabilization: Stop the Bleeding
  5. Autoscaling & HPA Diagnostics
  6. Reading Signals: Events, cgroups, QoS
  7. The Fix: Code + YAML
  8. Lessons Learned & Best Practices
  9. Call to Action & Feedback

1. The Incident & Timeline

Assume your observability tooling flags an HTTP 5xx surge and repeated pod restarts.

Within minutes, the signals narrow down to memory pressure and OOMKilled events.

Initial actions:

  • Run kubectl get events → look for “Killing” / “OOMKilled” warnings
  • Describe the pods → confirm ExitCode 137
  • Check node conditions → look for MemoryPressure
  • Sort pods by memory usage → spot abnormal growth

The next step is to stabilize the service, then dig deeper into the leak source.


2. Repro Demo: GitHub Memory-Leak Repo

To make this investigation reproducible, I created a demo repository:

👉 github.com/Vijayendrahunasgi/memory-leak-k8s

This includes:

  • memleak.py — a Python app that continuously allocates memory (simulating a leak)
  • deployment.yaml — Kubernetes Deployment manifest (with limits/requests)
  • hpa.yaml — HorizontalPodAutoscaler YAML
  • Instructions to run and observe OOM behavior

You can clone it and apply it in a test cluster to follow along.
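If you just want the gist without cloning, the leak simulator boils down to an allocation loop that never releases its references. A minimal sketch of memleak.py (the file in the repo may differ in details):

# memleak.py (sketch) -- simulate an unbounded memory leak
import time

arr = []  # references are kept forever, so the garbage collector can never reclaim them

while True:
    arr.append(bytearray(1 * 1024 * 1024))  # allocate and retain 1 MiB every second
    time.sleep(1)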


3. Troubleshooting Commands Cheat Sheet

The following list gives each command and what it does; use them as a quick reference.

kubectl get events --field-selector reason=Killing,type=Warning  # Lists recent OOMKilled events quickly
kubectl describe pod <pod>                                       # Shows restart reasons, exit codes, and probe status
kubectl describe nodes                                           # Reveals node-level conditions like MemoryPressure
kubectl top pod --sort-by=memory                                 # Ranks pods by memory usage

# Shows configured vs real memory bounds (split for readability)
kubectl get pod <pod> \
  -o custom-columns="POD:.metadata.name,REQ:.spec.containers[].resources.requests.memory,LIM:.spec.containers[].resources.limits.memory"  # Requests vs limits

kubectl get pod <pod> -o jsonpath='{.status.qosClass}'           # Shows QoS class (Guaranteed / Burstable / BestEffort)

# cgroup v1 paths; adjust if using cgroup v2 on your nodes
kubectl exec <pod> -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes  # Reads cgroup-enforced memory limit
kubectl exec <pod> -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes  # Reads actual memory usage

kubectl scale deploy/memory-leak --replicas=4                    # Scale out to reduce per-pod pressure
kubectl set resources deploy/memory-leak --limits=memory=512Mi --requests=memory=256Mi  # Adjust memory headroom
kubectl rollout restart deploy/memory-leak                        # Restart pods to free leaked memory

kubectl get hpa                                                  # List HPAs in namespace
kubectl describe hpa memory-leak-hpa                             # See HPA events, target metrics
kubectl get hpa memory-leak-hpa -o yaml                          # Inspect HPA status / scaling conditions

4. Stabilization: Stop the Bleeding

OOMKilled pods translate directly into downtime, so the first priority is to stop the bleeding, first-aid style. Here's how to stabilize the app: increase replicas (to get some breathing room), then raise the memory allocated to the pods, and restart the rollout if needed.

kubectl scale deploy/memory-leak --replicas=4
kubectl set resources deploy/memory-leak --limits=memory=512Mi --requests=memory=256Mi
kubectl rollout restart deploy/memory-leak

Then we need to check resource usage:

kubectl top pod --sort-by=memory
kubectl get events --field-selector reason=Killing,type=Warning

Sample output after stabilization:

NAME                          CPU(cores)   MEMORY(bytes)
memory-leak-5b4c7d5fb8-xyz12   0.15         320Mi
memory-leak-5b4c7d5fb8-abc34   0.12         290Mi
memory-leak-5b4c7d5fb8-foo56   0.17         310Mi
memory-leak-5b4c7d5fb8-bar78   0.14         300Mi

💡 Pro tip: Always set memory requests: if a pod has no memory request, memory-based HPA utilization (usage / request) cannot be computed, and the HPA will not scale on that metric.


5. Autoscaling & HPA Diagnostics

Before we escalate the issue to the engineering team (the developers), we can buy time by enabling a Horizontal Pod Autoscaler (HPA), which adds new pods automatically as load grows. This can be achieved as follows.

Apply the following HPA manifest:

hpa.yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: memory-leak-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: memory-leak
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70

Then verify that the HPA is applied and working as expected:

kubectl get hpa
kubectl describe hpa memory-leak-hpa
kubectl get hpa memory-leak-hpa -o yaml | sed -n '/status:/,$p'

Sample kubectl describe hpa output:

Name:                            memory-leak-hpa
Namespace:                       default
ScaleTargetRef:                  Deployment/memory-leak
Min replicas:                    2
Max replicas:                    10
Metrics:                          ( current / target )
  resource memory on pods        280Mi / 400Mi
  type: Utilization
  target: 70%

After checking, we can confirm:

✅ The HPA was scaling correctly based on memory utilization.

✅ Desired replicas matched the observed scaling events.

✅ The AbleToScale and ScalingActive conditions were True.
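For context, the replica count the HPA aims for comes from a simple ratio of current to target utilization. A simplified sketch (the real controller also applies a tolerance and stabilization windows):

# Simplified sketch of the HPA replica calculation
import math

def desired_replicas(current_replicas: int, current_utilization: float, target_utilization: float) -> int:
    # e.g. 4 replicas at 110% average memory utilization with a 70% target -> ceil(4 * 110 / 70) = 7
    return math.ceil(current_replicas * current_utilization / target_utilization)

print(desired_replicas(4, 110, 70))  # 7

As long as average utilization stays under the target, the HPA holds the replica count steady; once it climbs above, more pods are added up to maxReplicas.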


6. Reading Signals: Events, cgroups, QoS

With the bleeding stopped for the time being, we can dig deeper to identify the actual issue causing the resource exhaustion. The following signals help establish the pattern.

Events (from kubectl get events):

Warning  Killing  <timestamp>  kubelet  Killing container with id docker://...: out of memory

From kubectl describe pod <pod>:

State:   Terminated
Reason:  OOMKilled
Exit Code: 137
Restart Count: 5

Check the memory limit allocated to the pod:

kubectl exec <pod> -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes
# 536870912  (512Mi)

Check the current usage:

kubectl exec <pod> -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes
# Rising usage approaching limit
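On cgroup v2 nodes the same numbers live at /sys/fs/cgroup/memory.current and /sys/fs/cgroup/memory.max. Here is a small Python helper, a sketch that handles both layouts, which you could run inside the container or wire into the app's own logging:

# cgroup_mem.py (sketch) -- read container memory usage and limit under cgroup v1 or v2
from pathlib import Path

def read_cgroup_memory():
    v2_usage = Path("/sys/fs/cgroup/memory.current")
    if v2_usage.exists():  # cgroup v2 (unified hierarchy)
        usage = int(v2_usage.read_text())
        limit_raw = Path("/sys/fs/cgroup/memory.max").read_text().strip()
        limit = None if limit_raw == "max" else int(limit_raw)  # "max" means no limit
    else:                  # cgroup v1
        usage = int(Path("/sys/fs/cgroup/memory/memory.usage_in_bytes").read_text())
        limit = int(Path("/sys/fs/cgroup/memory/memory.limit_in_bytes").read_text())
    return usage, limit

usage, limit = read_cgroup_memory()
print(f"usage={usage / 2**20:.1f}Mi limit={'unlimited' if limit is None else str(limit // 2**20) + 'Mi'}")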

QoS & Eviction Priority:
Under node memory pressure, Burstable pods are evicted before Guaranteed ones, so check the pod's QoS class:

kubectl get pod <pod> -o jsonpath='{.status.qosClass}'
# Burstable
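With the pod-level signals all pointing at memory exhaustion, the last step before a code fix is pinpointing which code path is allocating. In a Python service, the standard-library tracemalloc module can do that. A minimal sketch (the leaked list here is just a hypothetical stand-in for real traffic):

# trace_leak.py (sketch) -- find the top allocation sites in a Python app
import tracemalloc

tracemalloc.start()

# ... exercise the suspect workload for a while ...
leaked = [bytearray(1024 * 1024) for _ in range(50)]  # stand-in for the real leak

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:5]:
    print(stat)  # prints file, line number, and bytes still allocated per call site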

7. The Fix: Code + YAML

Code-level:
The leak in the demo repo's memleak.py boils down to this loop:

while True:
    arr.append(bytearray(1 * 1024 * 1024))  # allocate 1 MiB per iteration
    time.sleep(1)

✅ Fix: limit the list size, evict old data, or use a bounded cache.
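A minimal sketch of what the bounded version can look like (assuming a deque-based buffer; the actual fix in the repo may differ):

# Bounded version of the allocation loop: old chunks are evicted automatically
import time
from collections import deque

arr = deque(maxlen=100)  # holds at most 100 chunks (~100 MiB); older entries are dropped

while True:
    arr.append(bytearray(1 * 1024 * 1024))  # memory now plateaus instead of growing forever
    time.sleep(1)

With the cap in place, memory usage levels off well below the 512Mi limit instead of climbing until the kubelet kills the container.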

YAML-level: set explicit memory requests and limits in deployment.yaml:

resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"

8. Lessons Learned & Best Practices

✅ Detect early using kubectl get events, kubectl describe, and cgroup metrics.

✅ Always define requests and limits.

✅ Burstable pods are evicted first under pressure.

✅ HPA is a guardrail, not a cure.

✅ Instrument apps for memory observability.

✅ Use bounded caches and eviction policies.

✅ Validate fixes in staging and monitor 24–48 hrs.

✅ Cross-share insights across teams to strengthen runbooks.


9. Call to Action & Feedback

Thank you for reading this deep dive! 🙌

👉 Explore the demo repo: github.com/Vijayendrahunasgi/memory-leak-k8s

If this helped — or if you’ve had your own “OOM panic” — drop a comment below!

Let’s share our Kubernetes war stories and learn from each other.

