
Shreyans Sonthalia

Why Your Kubernetes Pod Keeps Getting Killed — And It's Not an OOMKill

A real-world debugging guide: from mysterious pod terminations to discovering a hidden kernel memory leak consuming 55% of node RAM.


The Incident

It was a regular morning when we noticed something off. One of our production services — running on an EKS cluster — had been terminated and a new pod had spun up in its place. No deployment had been triggered. No config changes. The pod just... died.

The Grafana dashboard for the old pod told a strange story:

  • Memory usage had climbed to 832 MiB, then abruptly dropped to zero
  • CPU dropped to zero at the same time
  • After a ~45 minute gap, a new pod appeared and started running normally

The new pod was already using 757 MiB of memory and running just fine. So what killed the old one?

This is the story of how we debugged it — and what we found.


Step 1: The Obvious Suspect — OOMKill

When a Kubernetes pod dies unexpectedly, the first thing most engineers check is whether it was killed for using too much memory (an OOMKill). We looked at the deployment spec:

resources:
  requests:
    cpu: 100m
    memory: 500Mi
  # No limits set

No memory limit was configured. In Kubernetes, if you don't set a limit, the container can use as much memory as the node has available. So this wasn't a container-level OOMKill — the pod had no ceiling to hit.

But wait — if the new pod was happily running at 757 MiB, why would 832 MiB on the old pod be a problem? Something else was going on.
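For contrast, here is a sketch of what the spec could look like with a ceiling in place. The 1Gi figure is purely illustrative, not a recommendation for this service; a real value should come from observed usage plus headroom:

```yaml
resources:
  requests:
    cpu: 100m
    memory: 500Mi
  limits:
    memory: 1Gi   # illustrative ceiling, not a sizing recommendation
```

With a limit set, exceeding it produces a container-level OOMKill with a visible reason in the pod status, rather than unbounded growth that only the node feels.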


Step 2: Checking Kubernetes Events

We tried to pull events for the terminated pod:

kubectl get events -n live --field-selector involvedObject.name=<pod-name>

Nothing. Kubernetes only retains events for about an hour, and the pod had died over 4 hours ago. The events had expired.

But when we checked broader events in the namespace, we found something interesting:

TaintManagerEviction  pod/<new-pod>  Cancelling deletion of Pod

And the node had recent events:

NodeNotReady   node/<node-name>   Node status is now: NodeNotReady
NodeReady      node/<node-name>   Node status is now: NodeReady

The node itself had gone NotReady. When a Kubernetes node stops responding to the API server, all pods on that node get evicted. This explained the pod termination — but why did the node go NotReady?

What Does NodeNotReady Mean?

Every node in a Kubernetes cluster runs a process called the kubelet. The kubelet sends periodic heartbeats to the API server (the control plane) saying "I'm alive and healthy." If the API server doesn't receive a heartbeat within a grace period (default 40 seconds), it marks the node as NotReady and begins evicting pods to reschedule them elsewhere.

A node goes NotReady when the kubelet process is too overwhelmed to send these heartbeats — usually due to extreme resource pressure (CPU, memory, or disk).


Step 3: The CPU Credit Theory (A Red Herring)

The node was a t3a.medium instance on AWS. T3/T3a instances are burstable — they don't give you full CPU all the time. We initially suspected that the instance had exhausted its CPU credits and was being throttled, causing the kubelet to miss heartbeats.

Not familiar with AWS burstable instances and CPU credits? Read our deep dive: AWS Burstable Instances Explained: CPU Credits, Throttling, and Why Your t3 Instance Isn't What You Think

We checked the credit configuration:

aws ec2 describe-instance-credit-specifications \
  --instance-ids <instance-id>
CpuCredits: unlimited

T3 Unlimited mode was already enabled — meaning the instance could burst beyond its credit balance without throttling (you just pay for the extra usage). We verified with CloudWatch: CPU credits were at 0 but surplus credits were maxed at 576. The instance was not being throttled.

CPU credits: ruled out.

But CloudWatch revealed something alarming: the instance had been running at ~100% CPU utilization for the entire day.


Step 4: What's Eating the CPU?

We checked CPU usage of all pods on the node using kubectl top:

service pod:           7m    (0.007 CPUs)
weave-scope-agent:    40m
aws-node:             23m
kube-proxy:            7m
ebs-csi:               3m
efs-csi:               5m
──────────────────────────
Total:               ~85m   (out of 2000m available)

Pods were barely using any CPU. Yet CloudWatch showed 100% at the instance level. The CPU was being consumed by something outside of Kubernetes pods — at the operating system or kernel level.
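To make the comparison concrete, a small shell sketch can total the millicores. The table below is hard-coded from the figures above; on a live cluster you could pipe `kubectl top pods -A --no-headers` into the same awk program instead:

```shell
#!/bin/sh
# Sum pod CPU usage in millicores. Sample lines mirror the incident's
# `kubectl top` output; pipe real output into awk on a live cluster.
total=$(awk '{ gsub(/m$/, "", $2); sum += $2 } END { print sum }' <<'EOF'
service-pod        7m
weave-scope-agent  40m
aws-node           23m
kube-proxy         7m
ebs-csi            3m
efs-csi            5m
EOF
)
echo "pods total: ${total}m of 2000m allocatable"
```

The 85m total against a 100% busy instance is what pointed us below the container layer.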


Step 5: Getting Inside the Node

We needed to look at the node's operating system directly. We used AWS Systems Manager (SSM) to run commands on the instance without SSH:

aws ssm send-command \
  --instance-ids <instance-id> \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["cat /proc/loadavg"]'

The result:

34.04 25.03 22.70

A load average of 34 on a 2-CPU machine. That's 17x the capacity. The system was completely overloaded.

What is Load Average?

Load average represents the average number of processes that are either running on a CPU, waiting in the run queue, or (on Linux specifically) in uninterruptible sleep, typically waiting on disk I/O. On a 2-CPU machine, a load average of 2.0 means both CPUs are fully utilized. A load average of 34 means 34 processes are competing for 2 CPUs, so each process spends most of its time waiting.

The three numbers represent the 1-minute, 5-minute, and 15-minute averages. All three being high meant this had been going on for a long time.
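The mental arithmetic can be sketched in shell. The values are hard-coded from the incident; on a live node you would read the first field of /proc/loadavg and use `nproc` for the CPU count:

```shell
#!/bin/sh
# Load per CPU: anything well above 1.0 means processes are queuing.
# Values hard-coded from the incident's /proc/loadavg and instance type.
load1=34.04
cpus=2   # t3a.medium exposes 2 vCPUs
per_cpu=$(awk -v l="$load1" -v c="$cpus" 'BEGIN { printf "%.1f", l / c }')
echo "runnable processes per CPU: $per_cpu"
```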


Step 6: Finding the Real Bottleneck with Linux PSI

Linux has a feature called Pressure Stall Information (PSI) that tells you exactly which resource is the bottleneck. We checked /proc/pressure/:

CPU:    some avg10=85.41  avg60=84.62  avg300=82.10
Memory: some avg10=98.98  avg60=98.90  avg300=98.38
        full avg10=62.85  avg60=63.91  avg300=63.33
IO:     some avg10=0.04   avg60=0.16   avg300=0.21

The numbers told a clear story:

  • some = percentage of time at least one process was stalled on this resource
  • full = percentage of time ALL processes were stalled

99% of the time, some process was waiting for memory. 63% of the time, ALL processes were completely stalled.

This wasn't a CPU problem at all — it was a memory problem that manifested as high CPU usage.
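Pulling a single number out of PSI output is easy to script. This sketch parses the incident's /proc/pressure/memory sample with awk; on a real node you would read the file directly:

```shell
#!/bin/sh
# Extract the avg10 "full" stall percentage from PSI-formatted output.
# Sample values are the incident's /proc/pressure/memory readings.
psi='some avg10=98.98 avg60=98.90 avg300=98.38 total=0
full avg10=62.85 avg60=63.91 avg300=63.33 total=0'
full10=$(printf '%s\n' "$psi" | awk '/^full/ { sub(/^avg10=/, "", $2); print $2 }')
echo "all processes stalled ${full10}% of the last 10s"
```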


Step 7: Swap Thrashing — The Real Killer

The memory stats confirmed it:

MemTotal:      3,936 MB
MemFree:          86 MB     (only 86 MB free!)
MemAvailable:    735 MB     (after counting reclaimable cache)
SwapTotal:     1,048 MB
SwapFree:        549 MB     (500 MB of swap in use)
Committed_AS:  5,001 MB     (5 GB committed on a 4 GB machine!)

The node had 5 GB of memory committed on a machine with only 4 GB of RAM. The overflow was being handled by swap, a section of the disk used as overflow memory. But disk is orders of magnitude slower than RAM, and when the system constantly moves pages between RAM and disk, you get swap thrashing: the CPU spends all its time waiting for disk I/O instead of doing useful work.
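A quick way to quantify the overcommit is the ratio of Committed_AS to MemTotal. This sketch hard-codes the incident's /proc/meminfo numbers; above 1.0, the kernel has promised more memory than physically exists, and whatever is actually touched beyond RAM must be backed by swap:

```shell
#!/bin/sh
# Overcommit ratio: committed address space vs. physical RAM.
# Figures are from the incident's /proc/meminfo.
committed_mb=5001
total_mb=3936
ratio=$(awk -v c="$committed_mb" -v t="$total_mb" 'BEGIN { printf "%.2f", c / t }')
echo "Committed_AS / MemTotal = $ratio"
```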

Want to understand swap, swap thrashing, and why memory problems cause CPU spikes? Read our explainer: Linux Memory Explained: Swap, Kernel Slab, and skbuff — What Kubernetes Doesn't Show You

This explained everything: swap thrashing -> kubelet can't send heartbeats -> NodeNotReady -> pod evicted. But where was all the memory going?


Step 8: The Hidden Memory Consumer — Kernel Slab

We checked memory distribution across all pods on the node:

PODS (total):                    616 MB
├── main service                 381 MB
├── weave-scope-agent             64 MB
├── aws-node (VPC CNI)            39 MB
├── promtail                      30 MB
├── ebs-csi-node                  26 MB
├── kube-proxy                    24 MB

SYSTEM PROCESSES:                 58 MB
PAGE CACHE:                      831 MB
FREE:                             87 MB

That's only about 1.6 GB accounted for. On a 4 GB node, where was the other 2+ GB?

KERNEL SLAB:                   2,194 MB
├── SReclaimable:                 50 MB   (can be freed)
├── SUnreclaim:                2,143 MB   (CANNOT be freed!)

2.1 GB of non-reclaimable kernel memory. Over half the node's RAM was consumed by the Linux kernel itself, completely invisible to all Kubernetes monitoring tools.

What is kernel slab memory and why can't Kubernetes see it? This is covered in detail in: Linux Memory Explained: Swap, Kernel Slab, and skbuff — What Kubernetes Doesn't Show You

Normal SUnreclaim on a healthy node is 50-200 MB. Our node had 2,143 MB. Something was leaking memory inside the kernel.


Step 9: Inside the Slab — 1.66 Million Leaked Network Packets

We examined /proc/slabinfo to see what was consuming the slab:

SLAB OBJECT              COUNT        x  SIZE      =  TOTAL
────────────────────────────────────────────────────────────
kmalloc-1k            1,667,384    x  1,024 B   =  1,632 MB
skbuff_head_cache     1,657,980    x    256 B   =    414 MB
────────────────────────────────────────────────────────────
These two alone:                                   2,046 MB

1.66 million skbuff_head_cache entries — each one representing a network packet header in the Linux kernel. And 1.67 million kmalloc-1k allocations (the associated packet data). The almost 1:1 ratio confirmed this was a network subsystem memory leak: millions of network packets stuck in kernel memory, never being cleaned up.
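As a cross-check, multiplying the object counts by their object sizes reproduces the roughly 2 GB total. The counts and sizes below are the ones read from /proc/slabinfo above:

```shell
#!/bin/sh
# Back-of-envelope check: slab object count x object size = memory consumed.
awk 'BEGIN {
  kmalloc = 1667384 * 1024   # kmalloc-1k: packet data buffers
  skbuff  = 1657980 * 256    # skbuff_head_cache: per-packet metadata
  printf "leaked slab total: %.1f GiB\n", (kmalloc + skbuff) / (1024 ^ 3)
}'
```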

What is skbuff and how does it relate to network packets? Explained in: Linux Memory Explained: Swap, Kernel Slab, and skbuff — What Kubernetes Doesn't Show You


Step 10: It's Not Just One Node

Our affected pod ran on a dedicated node. Maybe this was a one-off? We checked two other nodes in the cluster:

                      affected node   large node       another small node
                      (t3a.medium)    (t3a.xlarge)     (t3a.medium)
──────────────────────────────────────────────────────────────────────────
Total RAM             3,936 MB        16,207 MB        3,938 MB
Slab (SUnreclaim)     2,143 MB         4,533 MB        1,744 MB
skbuff count          1,667,384        3,309,501        1,310,669
Memory pressure       98.98%           0.00%            0.00%
Load average          32.64            2.11             0.12

Every node had the same leak. The t3a.xlarge node (16 GB) had an even bigger leak at 4.5 GB — but survived because it had enough RAM headroom. The other t3a.medium nodes were ticking time bombs.


Step 11: The Culprit — An Abandoned Monitoring Tool

What was common across all nodes and was intercepting network traffic? A network visualization DaemonSet.

We had Weave Scope running on every node — a tool that captures and analyzes network traffic to build a real-time map of your infrastructure.

kubectl get daemonsets -n weave
NAME                DESIRED   CURRENT   READY   AGE
weave-scope-agent   16        16        16      2y326d

Key findings:

  • Installed 2 years and 326 days ago via raw kubectl apply (no Helm, no GitOps)
  • Running weaveworks/scope:1.13.2 — the last version ever released
  • Weaveworks, the company behind it, shut down in 2024
  • The DaemonSet was running on all 16 nodes, intercepting all network traffic
  • Its packet interception was creating socket buffers in kernel space that were never freed

Over weeks and months, these accumulated into the millions, consuming gigabytes of kernel memory on every node.


The Fix

We deleted the entire namespace:

kubectl delete namespace weave

The effect was immediate:

                          BEFORE              AFTER
──────────────────────────────────────────────────────
Slab (SUnreclaim)         2,143 MB            74 MB
MemFree                   87 MB               1,937 MB
MemAvailable              735 MB              2,600 MB
Memory pressure           98.98%              0.00%
Load average              32.64               0.39

When the agent processes were killed, the kernel cleaned up all the orphaned socket buffers. 2 GB of memory was freed instantly. No node restart was even needed.


Lessons Learned

1. Your monitoring tools can be the problem

A monitoring tool designed to give visibility into our infrastructure was silently killing it. Tools that intercept network traffic at the kernel level can cause kernel-level resource leaks that are invisible to standard Kubernetes metrics.

2. Kubernetes metrics have a blind spot

kubectl top and Prometheus container metrics only show userspace memory used by containers. The 2.1 GB of kernel slab memory was completely invisible. We only found it by running commands on the node itself (via SSM) and checking /proc/meminfo and /proc/slabinfo.

If you're running node-exporter, consider alerting on node_memory_SUnreclaim_bytes — it would have caught this early.
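A sketch of such an alert rule, assuming the standard node-exporter meminfo metrics are being scraped; the 1 GiB threshold, 30m duration, and labels are placeholders to tune per node size, not recommendations:

```yaml
groups:
  - name: kernel-memory
    rules:
      - alert: KernelSlabUnreclaimHigh
        expr: node_memory_SUnreclaim_bytes > 1 * 1024 * 1024 * 1024  # placeholder threshold
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Non-reclaimable kernel slab above 1 GiB on {{ $labels.instance }}"
```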

3. Small nodes amplify kernel-level issues

A t3a.medium (4 GB RAM) leaves very little headroom after kubelet, container runtime, CNI plugins, CSI drivers, DaemonSet pods, and OS overhead. Any kernel-level issue eats directly into the limited space available for your workloads.

4. Audit your DaemonSets regularly

DaemonSets run on every node. A single misbehaving DaemonSet multiplies its impact across your entire infrastructure. Review them periodically:

kubectl get daemonsets --all-namespaces

Ask: Is this still needed? Is it maintained? When was it last updated?

5. Abandoned open-source software is a liability

Running unmaintained software in production — especially software that operates at the kernel level — is a risk that's easy to forget about. If the maintainers or company behind a tool have moved on, you should too.

6. High CPU doesn't always mean high computation

Our node showed 100% CPU, but actual computation was negligible. The CPU was spent on memory management — swapping pages in and out of disk. When you see high CPU coupled with high memory usage, check for swap thrashing first.

7. Follow the evidence, not assumptions

Our investigation path: OOMKill? (no) -> CPU credits? (no) -> Node issue? (yes, NodeNotReady) -> What caused it? (memory pressure) -> Where's the memory? (kernel slab) -> What's in the slab? (leaked socket buffers) -> What's leaking? (abandoned DaemonSet). Each wrong hypothesis was eliminated with data, not guesswork.


Debugging Cheatsheet

Kubernetes-level

# Pod events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Node conditions
kubectl describe node <node-name> | grep -A5 Conditions

# All pods on a node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> -o wide

# Pod resource usage
kubectl top pods -n <namespace>

# List all DaemonSets
kubectl get daemonsets --all-namespaces

OS-level (via SSM or SSH)

# System pressure — which resource is the bottleneck?
cat /proc/pressure/cpu
cat /proc/pressure/memory
cat /proc/pressure/io

# Memory breakdown — look for SUnreclaim
grep -E "MemTotal|MemFree|MemAvailable|Slab|SReclaimable|SUnreclaim|SwapTotal|SwapFree" /proc/meminfo

# Top kernel slab consumers (column 2 is active objects; needs root)
cat /proc/slabinfo | sort -k2 -rn | head -10

# Load average
cat /proc/loadavg

AWS-level

# CPU credit balance (burstable instances)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=<id> \
  --start-time <start> --end-time <end> \
  --period 300 --statistics Average

# Run commands on a node without SSH
aws ssm send-command \
  --instance-ids <id> \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["your-command"]'



The most dangerous problems in production aren't the ones that set off alarms — they're the ones that slowly accumulate in places you're not looking.
