<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shreyans Sonthalia</title>
    <description>The latest articles on DEV Community by Shreyans Sonthalia (@ssshreyans26).</description>
    <link>https://dev.to/ssshreyans26</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3874860%2Ff8671b9e-9208-4c87-936c-2a34766b2b94.jpg</url>
      <title>DEV Community: Shreyans Sonthalia</title>
      <link>https://dev.to/ssshreyans26</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ssshreyans26"/>
    <language>en</language>
    <item>
      <title>Why Your Kubernetes Pod Keeps Getting Killed — And It's Not an OOMKill</title>
      <dc:creator>Shreyans Sonthalia</dc:creator>
      <pubDate>Wed, 15 Apr 2026 12:16:25 +0000</pubDate>
      <link>https://dev.to/ssshreyans26/why-your-kubernetes-pod-keeps-getting-killed-and-its-not-an-oomkill-3ji6</link>
      <guid>https://dev.to/ssshreyans26/why-your-kubernetes-pod-keeps-getting-killed-and-its-not-an-oomkill-3ji6</guid>
      <description>&lt;p&gt;&lt;em&gt;A real-world debugging guide: from mysterious pod terminations to discovering a hidden kernel memory leak consuming 55% of node RAM.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Incident
&lt;/h2&gt;

&lt;p&gt;It was a regular morning when we noticed something off. One of our production services — running on an EKS cluster — had been terminated and a new pod had spun up in its place. No deployment had been triggered. No config changes. The pod just... died.&lt;/p&gt;

&lt;p&gt;The Grafana dashboard for the old pod told a strange story:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory usage had climbed to &lt;strong&gt;832 MiB&lt;/strong&gt;, then abruptly dropped to zero&lt;/li&gt;
&lt;li&gt;CPU dropped to zero at the same time&lt;/li&gt;
&lt;li&gt;After a ~45 minute gap, a new pod appeared and started running normally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The new pod was already using &lt;strong&gt;757 MiB&lt;/strong&gt; of memory and running just fine. So what killed the old one?&lt;/p&gt;

&lt;p&gt;This is the story of how we debugged it — and what we found.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: The Obvious Suspect — OOMKill
&lt;/h2&gt;

&lt;p&gt;When a Kubernetes pod dies unexpectedly, the first thing most engineers check is whether it was killed for using too much memory (an OOMKill). We looked at the deployment spec:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100m&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;500Mi&lt;/span&gt;
  &lt;span class="c1"&gt;# No limits set&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No memory limit was configured. In Kubernetes, if you don't set a &lt;code&gt;limit&lt;/code&gt;, the container can use as much memory as the node has available. So this wasn't a container-level OOMKill — the pod had no ceiling to hit.&lt;/p&gt;
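
&lt;p&gt;For reference, a ceiling would look like this (the &lt;code&gt;1Gi&lt;/code&gt; value is purely illustrative, not a recommendation for this service):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;resources:
  requests:
    cpu: 100m
    memory: 500Mi
  limits:
    memory: 1Gi   # container is OOMKilled if it crosses this
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With a limit set, runaway container memory surfaces as a clean, attributable OOMKill event instead of silent node-level pressure.&lt;/p&gt;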

&lt;p&gt;&lt;strong&gt;But wait&lt;/strong&gt; — if the new pod was happily running at 757 MiB, why would 832 MiB on the old pod be a problem? Something else was going on.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Checking Kubernetes Events
&lt;/h2&gt;

&lt;p&gt;We tried to pull events for the terminated pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get events &lt;span class="nt"&gt;-n&lt;/span&gt; live &lt;span class="nt"&gt;--field-selector&lt;/span&gt; involvedObject.name&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;pod-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing. Kubernetes only retains events for about an hour, and the pod had died over 4 hours ago. The events had expired.&lt;/p&gt;

&lt;p&gt;But when we checked broader events in the namespace, we found something interesting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TaintManagerEviction  pod/&amp;lt;new-pod&amp;gt;  Cancelling deletion of Pod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the &lt;strong&gt;node&lt;/strong&gt; had recent events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NodeNotReady   node/&amp;lt;node-name&amp;gt;   Node status is now: NodeNotReady
NodeReady      node/&amp;lt;node-name&amp;gt;   Node status is now: NodeReady
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The node itself had gone &lt;strong&gt;NotReady&lt;/strong&gt;. When a Kubernetes node stops responding to the API server, the control plane taints it and, after a toleration window (five minutes by default), evicts the pods on it so they can be rescheduled elsewhere. This explained the pod termination — but why did the node go NotReady?&lt;/p&gt;

&lt;h3&gt;
  
  
  What Does NodeNotReady Mean?
&lt;/h3&gt;

&lt;p&gt;Every node in a Kubernetes cluster runs a process called the &lt;strong&gt;kubelet&lt;/strong&gt;. The kubelet sends periodic heartbeats to the API server (the control plane) saying "I'm alive and healthy." If the API server doesn't receive a heartbeat within a grace period (default 40 seconds), it marks the node as &lt;code&gt;NotReady&lt;/code&gt; and begins evicting pods to reschedule them elsewhere.&lt;/p&gt;

&lt;p&gt;A node goes NotReady when the kubelet process is too overwhelmed to send these heartbeats — usually due to extreme resource pressure (CPU, memory, or disk).&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: The CPU Credit Theory (A Red Herring)
&lt;/h2&gt;

&lt;p&gt;The node was a &lt;code&gt;t3a.medium&lt;/code&gt; instance on AWS. T3/T3a instances are &lt;strong&gt;burstable&lt;/strong&gt; — they don't give you full CPU all the time. We initially suspected that the instance had exhausted its CPU credits and was being throttled, causing the kubelet to miss heartbeats.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Not familiar with AWS burstable instances and CPU credits?&lt;/strong&gt; Read our deep dive: &lt;a href="https://dev.to/ssshreyans26/aws-burstable-instances-explained-cpu-credits-throttling-and-why-your-t3-instance-isnt-what-you-39o4"&gt;AWS Burstable Instances Explained: CPU Credits, Throttling, and Why Your t3 Instance Isn't What You Think&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We checked the credit configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ec2 describe-instance-credit-specifications &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--instance-ids&lt;/span&gt; &amp;lt;instance-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CpuCredits: unlimited
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;T3 Unlimited mode&lt;/strong&gt; was already enabled — meaning the instance could burst beyond its credit balance without throttling (you just pay for the extra usage). We verified with CloudWatch: CPU credits were at 0 but surplus credits were maxed at 576. The instance was not being throttled.&lt;/p&gt;

&lt;p&gt;CPU credits: ruled out.&lt;/p&gt;

&lt;p&gt;But CloudWatch revealed something alarming: &lt;strong&gt;the instance had been running at ~100% CPU utilization for the entire day&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: What's Eating the CPU?
&lt;/h2&gt;

&lt;p&gt;We checked CPU usage of all pods on the node using &lt;code&gt;kubectl top&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;service pod:           7m    (0.007 CPUs)
weave-scope-agent:    40m
aws-node:             23m
kube-proxy:            7m
ebs-csi:               3m
efs-csi:               5m
──────────────────────────
Total:               ~85m   (out of 2000m available)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pods were barely using any CPU. Yet CloudWatch showed 100% at the instance level. The CPU was being consumed by something &lt;strong&gt;outside of Kubernetes pods&lt;/strong&gt; — at the operating system or kernel level.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5: Getting Inside the Node
&lt;/h2&gt;

&lt;p&gt;We needed to look at the node's operating system directly. We used AWS Systems Manager (SSM) to run commands on the instance without SSH:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ssm send-command &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--instance-ids&lt;/span&gt; &amp;lt;instance-id&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--document-name&lt;/span&gt; &lt;span class="s2"&gt;"AWS-RunShellScript"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parameters&lt;/span&gt; &lt;span class="s1"&gt;'commands=["cat /proc/loadavg"]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;34.04 25.03 22.70
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;A load average of 34 on a 2-CPU machine.&lt;/strong&gt; That's 17x the capacity. The system was completely overloaded.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Load Average?
&lt;/h3&gt;

&lt;p&gt;Load average is the average number of processes that are either running on a CPU or waiting in the run queue. On Linux it also counts processes in uninterruptible sleep, typically waiting on disk I/O, which is exactly the state swap thrashing produces. On a 2-CPU machine, a load average of 2.0 means both CPUs are fully utilized. A load average of 34 means there are 34 processes competing for 2 CPUs — each process spends most of its time waiting.&lt;/p&gt;

&lt;p&gt;The three numbers represent the 1-minute, 5-minute, and 15-minute averages. All three being high meant this had been going on for a long time.&lt;/p&gt;
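
&lt;p&gt;A quick way to put a load average in context is to normalize it by the CPU count. A minimal sketch, assuming a Linux box with &lt;code&gt;nproc&lt;/code&gt; available:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Load average only means something relative to the number of CPUs
cpus=$(nproc)
load1=$(awk '{print $1}' /proc/loadavg)
awk -v l="$load1" -v c="$cpus" 'BEGIN { printf "load per CPU: %.2f\n", l/c }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Anything consistently above 1.0 per CPU means work is queuing; our node was at 17.&lt;/p&gt;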




&lt;h2&gt;
  
  
  Step 6: Finding the Real Bottleneck with Linux PSI
&lt;/h2&gt;

&lt;p&gt;Linux has a feature called &lt;strong&gt;Pressure Stall Information (PSI)&lt;/strong&gt; that tells you exactly which resource is the bottleneck. We checked &lt;code&gt;/proc/pressure/&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CPU:    some avg10=85.41  avg60=84.62  avg300=82.10
Memory: some avg10=98.98  avg60=98.90  avg300=98.38
        full avg10=62.85  avg60=63.91  avg300=63.33
IO:     some avg10=0.04   avg60=0.16   avg300=0.21
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The numbers told a clear story:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;some&lt;/code&gt;&lt;/strong&gt; = percentage of time at least one process was stalled on this resource&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;full&lt;/code&gt;&lt;/strong&gt; = percentage of time ALL processes were stalled&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;99% of the time, some process was waiting for memory. 63% of the time, ALL processes were completely stalled.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This wasn't a CPU problem at all — it was a &lt;strong&gt;memory problem&lt;/strong&gt; that manifested as high CPU usage.&lt;/p&gt;
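
&lt;p&gt;PSI lines are plain &lt;code&gt;key=value&lt;/code&gt; text, so they are easy to script against. A minimal sketch, using a sample line (on a live node the input would come from &lt;code&gt;/proc/pressure/memory&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Pull avg10 out of a PSI line; on a node, replace the sample with:
#   line=$(grep ^some /proc/pressure/memory)
line='some avg10=98.98 avg60=98.90 avg300=98.38 total=123456789'
echo "$line" | grep -o 'avg10=[0-9.]*' | cut -d= -f2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;As a rough rule of thumb, sustained &lt;code&gt;full&lt;/code&gt; pressure above even 10% is serious; our 63% meant the node spent most of its life frozen.&lt;/p&gt;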




&lt;h2&gt;
  
  
  Step 7: Swap Thrashing — The Real Killer
&lt;/h2&gt;

&lt;p&gt;The memory stats confirmed it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MemTotal:      3,936 MB
MemFree:          86 MB     (only 86 MB free!)
MemAvailable:    735 MB     (after counting reclaimable cache)
SwapTotal:     1,048 MB
SwapFree:        549 MB     (500 MB of swap in use)
Committed_AS:  5,001 MB     (5 GB committed on a 4 GB machine!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The node had 5 GB of memory committed on a machine with only 4 GB of RAM. The overflow was being handled by &lt;strong&gt;swap&lt;/strong&gt; — a section of the disk used as overflow memory. But disk access is orders of magnitude slower than RAM, and when the system constantly shuttles pages between RAM and disk, you get &lt;strong&gt;swap thrashing&lt;/strong&gt;: the CPU spends all its time waiting for disk I/O instead of doing useful work.&lt;/p&gt;
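
&lt;p&gt;You can confirm active thrashing, as opposed to old but idle swap usage, by watching the kernel's cumulative swap counters on the node:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Pages swapped in/out since boot; sample twice a few seconds apart —
# rapidly climbing numbers mean the node is actively thrashing
grep -E '^pswp(in|out)' /proc/vmstat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;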

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Want to understand swap, swap thrashing, and why memory problems cause CPU spikes?&lt;/strong&gt; Read our explainer: &lt;a href="https://dev.to/ssshreyans26/linux-memory-explained-swap-kernel-slab-and-skbuff-what-kubernetes-doesnt-show-you-i1a"&gt;Linux Memory Explained: Swap, Kernel Slab, and skbuff — What Kubernetes Doesn't Show You&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This explained everything: swap thrashing -&amp;gt; kubelet can't send heartbeats -&amp;gt; NodeNotReady -&amp;gt; pod evicted. But where was all the memory going?&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 8: The Hidden Memory Consumer — Kernel Slab
&lt;/h2&gt;

&lt;p&gt;We checked memory distribution across all pods on the node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PODS (total):                    616 MB
├── main service                 381 MB
├── weave-scope-agent             64 MB
├── aws-node (VPC CNI)            39 MB
├── promtail                      30 MB
├── ebs-csi-node                  26 MB
├── kube-proxy                    24 MB

SYSTEM PROCESSES:                 58 MB
PAGE CACHE:                      831 MB
FREE:                             87 MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's only about 1.6 GB accounted for. On a 4 GB node, where was the other &lt;strong&gt;2+ GB?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;KERNEL SLAB:                   2,194 MB
├── SReclaimable:                 50 MB   (can be freed)
├── SUnreclaim:                2,143 MB   (CANNOT be freed!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2.1 GB of non-reclaimable kernel memory.&lt;/strong&gt; Over half the node's RAM was consumed by the Linux kernel itself, completely invisible to all Kubernetes monitoring tools.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What is kernel slab memory and why can't Kubernetes see it?&lt;/strong&gt; This is covered in detail in: &lt;a href="https://dev.to/ssshreyans26/linux-memory-explained-swap-kernel-slab-and-skbuff-what-kubernetes-doesnt-show-you-i1a"&gt;Linux Memory Explained: Swap, Kernel Slab, and skbuff — What Kubernetes Doesn't Show You&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Normal &lt;code&gt;SUnreclaim&lt;/code&gt; on a healthy node is &lt;strong&gt;50-200 MB&lt;/strong&gt;. Our node had &lt;strong&gt;2,143 MB&lt;/strong&gt;. Something was leaking memory inside the kernel.&lt;/p&gt;
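
&lt;p&gt;This is a one-liner to check on any Linux node. The 500 MB threshold below is an arbitrary sanity bound we picked, not an official limit:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print SUnreclaim in MB and warn when it looks inflated
# (500 MB is an arbitrary sanity threshold)
kb=$(awk '/^SUnreclaim:/ {print $2}' /proc/meminfo)
mb=$((kb / 1024))
echo "SUnreclaim: ${mb} MB"
if [ "$mb" -gt 500 ]; then echo "WARN: unusually high unreclaimable slab"; fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;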




&lt;h2&gt;
  
  
  Step 9: Inside the Slab — 1.66 Million Leaked Network Packets
&lt;/h2&gt;

&lt;p&gt;We examined &lt;code&gt;/proc/slabinfo&lt;/code&gt; to see what was consuming the slab:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SLAB OBJECT              COUNT        x  SIZE      =  TOTAL
────────────────────────────────────────────────────────────
kmalloc-1k            1,667,384    x  1,024 B   =  1,632 MB
skbuff_head_cache     1,657,980    x    256 B   =    414 MB
────────────────────────────────────────────────────────────
These two alone:                                   2,046 MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;1.66 million &lt;code&gt;skbuff_head_cache&lt;/code&gt; entries&lt;/strong&gt; — each one representing a network packet header in the Linux kernel. And 1.67 million &lt;code&gt;kmalloc-1k&lt;/code&gt; allocations (the associated packet data). The almost 1:1 ratio confirmed this was a &lt;strong&gt;network subsystem memory leak&lt;/strong&gt;: millions of network packets stuck in kernel memory, never being cleaned up.&lt;/p&gt;
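
&lt;p&gt;The totals in the table are just object count × object size, which you can reproduce from the raw numbers (small differences from the table come down to rounding and the moment of sampling):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# objects x size, in MiB — the same arithmetic behind the table above
awk 'BEGIN { printf "kmalloc-1k: %.0f MiB\n", 1667384*1024/1048576 }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;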

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What is &lt;code&gt;skbuff&lt;/code&gt; and how does it relate to network packets?&lt;/strong&gt; Explained in: &lt;a href="https://dev.to/ssshreyans26/linux-memory-explained-swap-kernel-slab-and-skbuff-what-kubernetes-doesnt-show-you-i1a"&gt;Linux Memory Explained: Swap, Kernel Slab, and skbuff — What Kubernetes Doesn't Show You&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Step 10: It's Not Just One Node
&lt;/h2&gt;

&lt;p&gt;Our affected pod ran on a dedicated node. Maybe this was a one-off? We checked two other nodes in the cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                      affected node   large node       another small node
                      (t3a.medium)    (t3a.xlarge)     (t3a.medium)
──────────────────────────────────────────────────────────────────────────
Total RAM             3,936 MB        16,207 MB        3,938 MB
Slab (SUnreclaim)     2,143 MB         4,533 MB        1,744 MB
skbuff count          1,667,384        3,309,501        1,310,669
Memory pressure       98.98%           0.00%            0.00%
Load average          32.64            2.11             0.12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Every node had the same leak.&lt;/strong&gt; The &lt;code&gt;t3a.xlarge&lt;/code&gt; node (16 GB) had an even bigger leak at 4.5 GB — but survived because it had enough RAM headroom. The other &lt;code&gt;t3a.medium&lt;/code&gt; nodes were ticking time bombs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 11: The Culprit — An Abandoned Monitoring Tool
&lt;/h2&gt;

&lt;p&gt;What was common across all nodes and was intercepting network traffic? &lt;strong&gt;A network visualization DaemonSet.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We had &lt;a href="https://github.com/weaveworks/scope" rel="noopener noreferrer"&gt;Weave Scope&lt;/a&gt; running on every node — a tool that captures and analyzes network traffic to build a real-time map of your infrastructure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get daemonsets &lt;span class="nt"&gt;-n&lt;/span&gt; weave
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME                DESIRED   CURRENT   READY   AGE
weave-scope-agent   16        16        16      2y326d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key findings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Installed &lt;strong&gt;2 years and 326 days ago&lt;/strong&gt; via raw &lt;code&gt;kubectl apply&lt;/code&gt; (no Helm, no GitOps)&lt;/li&gt;
&lt;li&gt;Running &lt;code&gt;weaveworks/scope:1.13.2&lt;/code&gt; — the &lt;strong&gt;last version ever released&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Weaveworks, the company behind it, shut down in 2024&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;The DaemonSet was running on all 16 nodes, intercepting all network traffic&lt;/li&gt;
&lt;li&gt;Its packet interception was creating socket buffers in kernel space that were never freed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Over weeks and months, these accumulated into the millions, consuming gigabytes of kernel memory on every node.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;We deleted the entire namespace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete namespace weave
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The effect was immediate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                          BEFORE              AFTER
──────────────────────────────────────────────────────
Slab (SUnreclaim)         2,143 MB            74 MB
MemFree                   87 MB               1,937 MB
MemAvailable              735 MB              2,600 MB
Memory pressure           98.98%              0.00%
Load average              32.64               0.39
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the agent processes were killed, the kernel cleaned up all the orphaned socket buffers. &lt;strong&gt;2 GB of memory was freed instantly.&lt;/strong&gt; No node restart was even needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Your monitoring tools can be the problem
&lt;/h3&gt;

&lt;p&gt;A monitoring tool designed to give visibility into our infrastructure was silently killing it. Tools that intercept network traffic at the kernel level can cause kernel-level resource leaks that are invisible to standard Kubernetes metrics.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Kubernetes metrics have a blind spot
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;kubectl top&lt;/code&gt; and Prometheus container metrics only show &lt;strong&gt;userspace&lt;/strong&gt; memory used by containers. The 2.1 GB of kernel slab memory was completely invisible. We only found it by getting a shell on the node itself (via SSM, in our case) and checking &lt;code&gt;/proc/meminfo&lt;/code&gt; and &lt;code&gt;/proc/slabinfo&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you're running node-exporter, consider alerting on &lt;code&gt;node_memory_SUnreclaim_bytes&lt;/code&gt; — it would have caught this early.&lt;/p&gt;
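
&lt;p&gt;A sketch of such an alert rule (the threshold and duration are starting points to tune, not canonical values):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;groups:
  - name: node-kernel-memory
    rules:
      - alert: KernelSlabUnreclaimHigh
        expr: node_memory_SUnreclaim_bytes &amp;gt; 500 * 1024 * 1024
        for: 30m
        annotations:
          summary: "Unreclaimable kernel slab above 500 MB on {{ $labels.instance }}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;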

&lt;h3&gt;
  
  
  3. Small nodes amplify kernel-level issues
&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;t3a.medium&lt;/code&gt; (4 GB RAM) leaves very little headroom after kubelet, container runtime, CNI plugins, CSI drivers, DaemonSet pods, and OS overhead. Any kernel-level issue eats directly into the limited space available for your workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Audit your DaemonSets regularly
&lt;/h3&gt;

&lt;p&gt;DaemonSets run on every node. A single misbehaving DaemonSet multiplies its impact across your entire infrastructure. Review them periodically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get daemonsets &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ask: Is this still needed? Is it maintained? When was it last updated?&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Abandoned open-source software is a liability
&lt;/h3&gt;

&lt;p&gt;Running unmaintained software in production — especially software that operates at the kernel level — is a risk that's easy to forget about. If the maintainers or company behind a tool have moved on, you should too.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. High CPU doesn't always mean high computation
&lt;/h3&gt;

&lt;p&gt;Our node showed 100% CPU, but actual computation was negligible. The CPU was spent on memory management — swapping pages in and out of disk. When you see high CPU coupled with high memory usage, check for swap thrashing first.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Follow the evidence, not assumptions
&lt;/h3&gt;

&lt;p&gt;Our investigation path: OOMKill? (no) -&amp;gt; CPU credits? (no) -&amp;gt; Node issue? (yes, NodeNotReady) -&amp;gt; What caused it? (memory pressure) -&amp;gt; Where's the memory? (kernel slab) -&amp;gt; What's in the slab? (leaked socket buffers) -&amp;gt; What's leaking? (abandoned DaemonSet). Each wrong hypothesis was eliminated with data, not guesswork.&lt;/p&gt;




&lt;h2&gt;
  
  
  Debugging Cheatsheet
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Kubernetes-level
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pod events&lt;/span&gt;
kubectl get events &lt;span class="nt"&gt;-n&lt;/span&gt; &amp;lt;namespace&amp;gt; &lt;span class="nt"&gt;--sort-by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'.lastTimestamp'&lt;/span&gt;

&lt;span class="c"&gt;# Node conditions&lt;/span&gt;
kubectl describe node &amp;lt;node-name&amp;gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A5&lt;/span&gt; Conditions

&lt;span class="c"&gt;# All pods on a node&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt; &lt;span class="nt"&gt;--field-selector&lt;/span&gt; spec.nodeName&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;node-name&amp;gt; &lt;span class="nt"&gt;-o&lt;/span&gt; wide

&lt;span class="c"&gt;# Pod resource usage&lt;/span&gt;
kubectl top pods &lt;span class="nt"&gt;-n&lt;/span&gt; &amp;lt;namespace&amp;gt;

&lt;span class="c"&gt;# List all DaemonSets&lt;/span&gt;
kubectl get daemonsets &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  OS-level (via SSM or SSH)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# System pressure — which resource is the bottleneck?&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; /proc/pressure/cpu
&lt;span class="nb"&gt;cat&lt;/span&gt; /proc/pressure/memory
&lt;span class="nb"&gt;cat&lt;/span&gt; /proc/pressure/io

&lt;span class="c"&gt;# Memory breakdown — look for SUnreclaim&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s2"&gt;"MemTotal|MemFree|MemAvailable|Slab|SReclaimable|SUnreclaim|SwapTotal|SwapFree"&lt;/span&gt; /proc/meminfo

&lt;span class="c"&gt;# Top kernel slab consumers by memory (num_objs x objsize; needs root)&lt;/span&gt;
awk &lt;span class="s1"&gt;'NR&amp;gt;2 {printf "%-24s %10.1f MB\n", $1, $3*$4/1048576}'&lt;/span&gt; /proc/slabinfo | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-k2&lt;/span&gt; &lt;span class="nt"&gt;-rn&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-10&lt;/span&gt;

&lt;span class="c"&gt;# Load average&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; /proc/loadavg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  AWS-level
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# CPU credit balance (burstable instances)&lt;/span&gt;
aws cloudwatch get-metric-statistics &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; AWS/EC2 &lt;span class="nt"&gt;--metric-name&lt;/span&gt; CPUCreditBalance &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dimensions&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;InstanceId,Value&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-time&lt;/span&gt; &amp;lt;start&amp;gt; &lt;span class="nt"&gt;--end-time&lt;/span&gt; &amp;lt;end&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--period&lt;/span&gt; 300 &lt;span class="nt"&gt;--statistics&lt;/span&gt; Average

&lt;span class="c"&gt;# Run commands on a node without SSH&lt;/span&gt;
aws ssm send-command &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--instance-ids&lt;/span&gt; &amp;lt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--document-name&lt;/span&gt; &lt;span class="s2"&gt;"AWS-RunShellScript"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parameters&lt;/span&gt; &lt;span class="s1"&gt;'commands=["your-command"]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/ssshreyans26/aws-burstable-instances-explained-cpu-credits-throttling-and-why-your-t3-instance-isnt-what-you-39o4"&gt;AWS Burstable Instances Explained: CPU Credits, Throttling, and Why Your t3 Instance Isn't What You Think&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/ssshreyans26/linux-memory-explained-swap-kernel-slab-and-skbuff-what-kubernetes-doesnt-show-you-i1a"&gt;Linux Memory Explained: Swap, Kernel Slab, and skbuff — What Kubernetes Doesn't Show You&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;The most dangerous problems in production aren't the ones that set off alarms — they're the ones that slowly accumulate in places you're not looking.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>linux</category>
      <category>sre</category>
    </item>
    <item>
      <title>Linux Memory Explained: Swap, Kernel Slab, and skbuff — What Kubernetes Doesn't Show You</title>
      <dc:creator>Shreyans Sonthalia</dc:creator>
      <pubDate>Wed, 15 Apr 2026 12:10:50 +0000</pubDate>
      <link>https://dev.to/ssshreyans26/linux-memory-explained-swap-kernel-slab-and-skbuff-what-kubernetes-doesnt-show-you-i1a</link>
      <guid>https://dev.to/ssshreyans26/linux-memory-explained-swap-kernel-slab-and-skbuff-what-kubernetes-doesnt-show-you-i1a</guid>
      <description>&lt;p&gt;&lt;em&gt;Your &lt;code&gt;kubectl top&lt;/code&gt; says the node has plenty of free memory. The node crashes anyway. Here's what's hiding in the gap.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem With Kubernetes Memory Metrics
&lt;/h2&gt;

&lt;p&gt;When you run &lt;code&gt;kubectl top node&lt;/code&gt;, you see something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-2-1-35   45m          2%     616Mi           15%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;15% memory usage. Looks healthy, right?&lt;/p&gt;

&lt;p&gt;But the node is swap thrashing, the load average is 34, and pods are being evicted. How?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Because Kubernetes only shows you userspace memory&lt;/strong&gt; — the memory your containers are using. It doesn't show you what the Linux kernel is consuming behind the scenes. On the node we were debugging, the kernel was secretly eating &lt;strong&gt;2.1 GB&lt;/strong&gt; out of 4 GB — and &lt;code&gt;kubectl&lt;/code&gt; had no idea.&lt;/p&gt;

&lt;p&gt;This post explains the layers of Linux memory that Kubernetes can't see, and how to find them when things go wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Linux Organizes Memory
&lt;/h2&gt;

&lt;p&gt;When you check &lt;code&gt;/proc/meminfo&lt;/code&gt; on a Linux machine, you see dozens of entries. Here's how they fit together on a 4 GB node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total RAM (MemTotal):                      3,936 MB
├── Used by applications (Anonymous pages):     617 MB
│   ├── Container processes (what kubectl sees)
│   └── System processes (kubelet, containerd, etc.)
├── Page Cache (file-backed pages):             831 MB
│   └── Cached file data (can be reclaimed)
├── Kernel Slab:                              2,194 MB  ← invisible to k8s
│   ├── SReclaimable:      50 MB (can be freed)
│   └── SUnreclaim:     2,143 MB (cannot be freed!)
├── Kernel Stack, Page Tables, etc.:             60 MB
└── Free:                                        87 MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kubernetes metrics cover the first bucket. Everything else is the OS and kernel.&lt;/p&gt;
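
&lt;p&gt;You can pull the major buckets from any Linux node directly. A rough sketch — &lt;code&gt;/proc/meminfo&lt;/code&gt; reports in kB:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# The major buckets, converted from kB to MB
awk '/^(MemTotal|MemFree|Cached|Slab|SReclaimable|SUnreclaim):/ {printf "%-14s %7d MB\n", $1, $2/1024}' /proc/meminfo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;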

&lt;p&gt;Let's break down each layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 1: Application Memory (What Kubernetes Shows)
&lt;/h2&gt;

&lt;p&gt;This is the memory your processes actively use — variables, heap allocations, stack frames. In Linux terms, these are &lt;strong&gt;anonymous pages&lt;/strong&gt; (memory not backed by any file on disk).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# What Kubernetes reports&lt;/span&gt;
kubectl top pods &lt;span class="nt"&gt;-n&lt;/span&gt; live

NAME                         CPU     MEMORY
nightfort-688ccc5974-p47qs   7m      381Mi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This 381 MiB is the &lt;strong&gt;Resident Set Size (RSS)&lt;/strong&gt; of the container's processes — the amount of physical RAM their memory allocations are currently occupying.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why This Number Isn't the Full Picture
&lt;/h3&gt;

&lt;p&gt;RSS only counts memory &lt;strong&gt;your process asked for&lt;/strong&gt;. It doesn't count:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory the kernel allocated &lt;strong&gt;on behalf of&lt;/strong&gt; your process (network buffers, file descriptors)&lt;/li&gt;
&lt;li&gt;Kernel data structures for managing your containers (cgroups, namespaces)&lt;/li&gt;
&lt;li&gt;Shared libraries loaded once but used by multiple containers&lt;/li&gt;
&lt;/ul&gt;
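&lt;p&gt;As a quick illustration (with made-up numbers), the value that container memory metrics roughly aggregate is the &lt;code&gt;VmRSS&lt;/code&gt; field of each process's &lt;code&gt;/proc/PID/status&lt;/code&gt;. A minimal sketch:&lt;/p&gt;

```python
# Illustrative sketch (sample values are invented): parse VmRSS, the resident
# set size, from /proc/PID/status-style text. This per-process number is
# roughly what kubectl-style container memory metrics sum up.
SAMPLE_STATUS = """\
Name:\tnode
VmRSS:\t  390144 kB
VmSwap:\t       0 kB
"""

def rss_mib(status_text):
    """Extract VmRSS in MiB from /proc/PID/status-style text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            kb = int(line.split()[1])
            return kb / 1024
    raise ValueError("VmRSS not found")

print(round(rss_mib(SAMPLE_STATUS)))  # prints 381
```

&lt;p&gt;Summing this over a pod's processes gets you close to the 381 MiB figure above — and nothing in it accounts for kernel-side allocations.&lt;/p&gt;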




&lt;h2&gt;
  
  
  Layer 2: Page Cache
&lt;/h2&gt;

&lt;p&gt;The page cache is Linux's way of &lt;strong&gt;caching file data in RAM&lt;/strong&gt; so that repeated reads don't hit the disk.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;First read of a file:   Disk → RAM (page cache) → Process     [slow]
Second read:             Page cache → Process                   [fast]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On our node, 831 MB was used for page cache. This sounds like a lot, but page cache is &lt;strong&gt;reclaimable&lt;/strong&gt; — the kernel will automatically free it when applications need more RAM. It's essentially "free memory being used productively."&lt;/p&gt;

&lt;p&gt;This is why &lt;code&gt;MemAvailable&lt;/code&gt; is often much higher than &lt;code&gt;MemFree&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MemFree:        87 MB    (truly unused)
MemAvailable:  735 MB    (free + reclaimable cache)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key insight&lt;/strong&gt;: If you see low &lt;code&gt;MemFree&lt;/code&gt; but healthy &lt;code&gt;MemAvailable&lt;/code&gt;, your system is fine — the kernel is just being smart about caching. Panic when &lt;code&gt;MemAvailable&lt;/code&gt; is low.&lt;/p&gt;
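&lt;p&gt;A minimal sketch of that check, using invented &lt;code&gt;/proc/meminfo&lt;/code&gt; values:&lt;/p&gt;

```python
# Sketch (invented values): low MemFree with a healthy MemAvailable is fine;
# only a low MemAvailable fraction is a real warning sign.
SAMPLE_MEMINFO = """\
MemTotal:       4096000 kB
MemFree:          89088 kB
MemAvailable:    752640 kB
"""

def meminfo_kb(text):
    """Parse 'Key:  value kB' lines into a dict of ints (kB)."""
    out = {}
    for line in text.splitlines():
        key, rest = line.split(":", 1)
        out[key] = int(rest.split()[0])
    return out

def memory_ok(meminfo_text, min_available_frac=0.10):
    """Healthy if MemAvailable is at least min_available_frac of MemTotal."""
    m = meminfo_kb(meminfo_text)
    return m["MemAvailable"] / m["MemTotal"] >= min_available_frac

print(memory_ok(SAMPLE_MEMINFO))  # prints True (about 18% available, 2% free)
```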




&lt;h2&gt;
  
  
  Layer 3: Kernel Slab Memory (The Hidden Consumer)
&lt;/h2&gt;

&lt;p&gt;This is where things get interesting — and where our production incident hid for months.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the Slab Allocator?
&lt;/h3&gt;

&lt;p&gt;The Linux kernel constantly needs to create and destroy small data structures: file descriptors, inode objects, network packet headers, process descriptors, and hundreds of other internal types. Allocating and freeing these one at a time from the general-purpose memory allocator would be slow.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;slab allocator&lt;/strong&gt; solves this by maintaining &lt;strong&gt;pre-allocated pools&lt;/strong&gt; for each object type. Think of it like a restaurant kitchen with separate prep stations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Instead of:
  "I need an inode" → malloc(sizeof(inode)) → slow, fragmentation

The kernel does:
  "I need an inode" → grab one from the inode pool → fast, no fragmentation
  "Done with inode" → return it to the pool → ready for reuse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each pool is called a &lt;strong&gt;slab cache&lt;/strong&gt;. You can see all of them in &lt;code&gt;/proc/slabinfo&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /proc/slabinfo | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-k3&lt;/span&gt; &lt;span class="nt"&gt;-rn&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kmalloc-1k        1,667,384   1024 bytes each  →  1,632 MB
skbuff_head_cache 1,657,980    256 bytes each  →    414 MB
dentry                9,248    192 bytes each  →    1.7 MB
xfs_inode             9,649   1024 bytes each  →    9.4 MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
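&lt;p&gt;A rough way to turn slab counts into megabytes is &lt;code&gt;num_objs * objsize&lt;/code&gt;. A sketch over simplified sample lines (real &lt;code&gt;/proc/slabinfo&lt;/code&gt; has more columns and a header):&lt;/p&gt;

```python
# Rough sketch: approximate each slab cache's footprint as num_objs * objsize.
# The sample lines mimic the columns used here (name, active_objs, num_objs,
# objsize); the values mirror the broken node above.
SAMPLE_SLABINFO = [
    "kmalloc-1k        1667384 1667384 1024",
    "skbuff_head_cache 1657980 1657980  256",
    "dentry               9248    9248  192",
]

def slab_mb(line):
    """Return (cache_name, approximate MiB) for one simplified slabinfo line."""
    fields = line.split()
    name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
    return name, num_objs * objsize / (1024 * 1024)

for line in SAMPLE_SLABINFO:
    name, mb = slab_mb(line)
    print(f"{name:18s} {mb:8.1f} MiB")
```

&lt;p&gt;The totals land close to the table above; exact figures differ slightly because the kernel also tracks per-slab overhead.&lt;/p&gt;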



&lt;h3&gt;
  
  
  SReclaimable vs SUnreclaim
&lt;/h3&gt;

&lt;p&gt;Slab memory is split into two categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SReclaimable&lt;/strong&gt; — Slab caches that hold &lt;strong&gt;cached data&lt;/strong&gt; the kernel can regenerate. The biggest example is the &lt;strong&gt;dentry cache&lt;/strong&gt; (directory entry cache), which caches filesystem path lookups. If memory is needed, the kernel can shrink these caches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SUnreclaim&lt;/strong&gt; — Slab caches that hold &lt;strong&gt;active data&lt;/strong&gt; the kernel is currently using. Network packet buffers, open file descriptors, active inode structures. These &lt;strong&gt;cannot be freed&lt;/strong&gt; until the code that created them explicitly releases them.&lt;/p&gt;

&lt;p&gt;On a healthy node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SReclaimable:    200 MB   (caches, will shrink if needed)
SUnreclaim:      100 MB   (active kernel objects)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On our broken node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SReclaimable:     50 MB
SUnreclaim:    2,143 MB   ← 21x normal!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why Kubernetes Can't See Slab Memory
&lt;/h3&gt;

&lt;p&gt;Kubernetes resource metrics come from &lt;strong&gt;cgroups&lt;/strong&gt; (control groups), which track memory allocated by processes inside containers. These slab allocations were charged to the kernel, &lt;strong&gt;not to any container's cgroup&lt;/strong&gt; — cgroup v2 can attribute some kernel memory to containers, but allocations made in interrupt context, like network buffers, typically land outside every container. Even if your container triggered the kernel allocation (by sending a network packet, for example), the slab memory shows up as kernel memory, not container memory.&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;kubectl top&lt;/code&gt; won't show it&lt;/li&gt;
&lt;li&gt;Prometheus container metrics won't show it&lt;/li&gt;
&lt;li&gt;Your pod's memory limit won't be hit by it&lt;/li&gt;
&lt;li&gt;But it still uses physical RAM on the node&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The only way to see it is by checking &lt;code&gt;/proc/meminfo&lt;/code&gt; or using node-exporter's &lt;code&gt;node_memory_SUnreclaim_bytes&lt;/code&gt; metric.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 4: Swap — The Emergency Overflow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is Swap?
&lt;/h3&gt;

&lt;p&gt;Swap is a section of the disk that Linux uses as &lt;strong&gt;overflow memory&lt;/strong&gt; when physical RAM is full.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RAM (4 GB)     →  Fast (nanoseconds)    →  Expensive
Disk/Swap      →  Slow (milliseconds)   →  Cheap
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the kernel needs to free up RAM (because something needs more memory and there's nothing reclaimable left), it takes memory pages that haven't been accessed recently and writes them to the swap area on disk. This is called &lt;strong&gt;swapping out&lt;/strong&gt; or &lt;strong&gt;paging out&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Step-by-Step Example
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Everything fits in RAM&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RAM  [App 750MB] [Kubelet 200MB] [Other 500MB] [Cache 700MB] [Free 1.8GB]
Swap [empty]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All processes' memory is in RAM. Memory access is fast. No problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2: RAM fills up&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RAM  [App 830MB] [Kubelet 200MB] [Other 800MB] [Cache 700MB] [Slab 2.1GB] [Free 87MB]
Swap [empty]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Free memory is nearly gone. The kernel starts shrinking the page cache, but slab (SUnreclaim) can't be freed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 3: Swap kicks in&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RAM  [App 750MB] [Kubelet 100MB] [Other 600MB] [Slab 2.1GB] [Cache 300MB]
Swap [Kubelet-old-pages 100MB | App-idle-pages 80MB | Other 320MB] = 500MB used
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The kernel identified memory pages that hadn't been accessed recently and moved them to disk. RAM now has room for active work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 4: Swap thrashing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where things go catastrophically wrong. When a process needs a page that was swapped out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Normal access (page in RAM):
  CPU: "Give me address 0x1234"
  RAM: "Here you go"
  → 100 nanoseconds

Swapped access (page on disk):
  CPU: "Give me address 0x1234"
  RAM: "Not here — it's on disk"              → PAGE FAULT
  Kernel: "I need to load it from swap"
  Kernel: "But RAM is full. Let me swap OUT another page first"
  Disk write: Evict some other page to swap    → 1-5 milliseconds
  Disk read: Load the requested page           → 1-5 milliseconds
  CPU: "Finally!"
  → 2-10 milliseconds total (100,000x slower)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now multiply this by dozens of processes, all needing pages that were swapped out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Process A needs a page → it's on disk → swap in A, swap out B → 5ms
Process B runs → needs its page → swapped out by A! → swap in B, swap out C → 5ms
Process C runs → needs its page → swapped out by B! → swap in C, swap out A → 5ms
Process A runs → needs its page → swapped out by C! → ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This circular eviction is &lt;strong&gt;swap thrashing&lt;/strong&gt;. The system does almost no useful work — all CPU time is spent managing page faults and disk I/O.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Swap Thrashing Looks Like a CPU Problem
&lt;/h3&gt;

&lt;p&gt;CloudWatch and &lt;code&gt;top&lt;/code&gt; will show 100% CPU utilization during swap thrashing. But the CPU isn't doing computation. Here's the breakdown:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Actual computation:      ~5%     (your app, kubelet, etc.)
Kernel swap management:  ~30%    (deciding what to evict, page table updates)
I/O wait:               ~65%    (waiting for disk reads/writes)
────────────────────────────────
Total:                  ~100%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The load average also skyrockets because Linux counts processes in &lt;strong&gt;uninterruptible sleep&lt;/strong&gt; (waiting for disk I/O) in the load average. If 30 processes are all waiting for swap pages, the load average shows 30 — even though very little CPU work is happening.&lt;/p&gt;

&lt;p&gt;This is why our node showed a load average of 34 with pods using only 85m of CPU. The CPUs weren't busy computing — they were busy &lt;strong&gt;waiting for the disk&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is skbuff? (Socket Buffers)
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;sk_buff&lt;/code&gt; (socket buffer) is the data structure at the heart of Linux networking. Every network packet — in or out — is represented by an &lt;code&gt;sk_buff&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anatomy of a Network Packet in Linux
&lt;/h3&gt;

&lt;p&gt;When your container sends an HTTP request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Application: send("GET /health HTTP/1.1\r\n...")
    ↓
Kernel: allocate an sk_buff
    ├── skbuff_head_cache entry (256 bytes) — metadata, pointers, protocol info
    └── kmalloc-1k entry (1024 bytes) — the actual packet data
    ↓
Network stack: add TCP header, IP header, Ethernet header
    ↓
Network driver: transmit the packet
    ↓
Kernel: free the sk_buff ← THIS is what wasn't happening
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On a healthy system, &lt;code&gt;sk_buff&lt;/code&gt; structures are allocated when a packet is created and freed when the packet is sent/received/dropped. The slab pool recycles them efficiently.&lt;/p&gt;

&lt;h3&gt;
  
  
  What a Leak Looks Like
&lt;/h3&gt;

&lt;p&gt;On our node, we found:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;skbuff_head_cache:  1,657,980 objects  (414 MB)
kmalloc-1k:         1,667,384 objects  (1,632 MB)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The almost 1:1 ratio between skbuff headers and 1KB allocations is the signature of a network packet leak. Each packet consists of a header + data buffer. 1.66 million packets were stuck in kernel memory, never freed.&lt;/p&gt;

&lt;p&gt;At a normal rate of ~1000 packets/second, 1.66 million packets represents about &lt;strong&gt;28 minutes of traffic&lt;/strong&gt; that was captured and never released. Over days and weeks, with the leaking tool constantly intercepting traffic, this accumulated to gigabytes.&lt;/p&gt;
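&lt;p&gt;The back-of-envelope math, with the packet rate taken as an assumption rather than a measured value:&lt;/p&gt;

```python
# Back-of-envelope check of the leak's age. The 1000 packets/second rate is
# an assumed figure, not a measurement.
def leak_duration_minutes(leaked_objects, packets_per_second=1000):
    """Minutes of traffic needed to produce this many sk_buff objects."""
    return leaked_objects / packets_per_second / 60

print(round(leak_duration_minutes(1_657_980)))  # prints 28
```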




&lt;h2&gt;
  
  
  How to Investigate Memory Issues on Kubernetes Nodes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Check if the problem is even memory
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /proc/pressure/memory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;some avg10=98.98 avg60=98.90 avg300=98.38 total=381246311078
full avg10=62.85 avg60=63.91 avg300=63.33 total=281968539996
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;some &amp;gt; 50%&lt;/code&gt; → memory pressure exists&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;full &amp;gt; 10%&lt;/code&gt; → severe memory pressure (all tasks stalling)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;full &amp;gt; 50%&lt;/code&gt; → critical — system is barely functional&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 2: Get the full memory breakdown
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s2"&gt;"MemTotal|MemFree|MemAvailable|Buffers|Cached|Slab|SReclaimable|SUnreclaim|SwapTotal|SwapFree|AnonPages|Committed_AS"&lt;/span&gt; /proc/meminfo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read it as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MemTotal         → Total physical RAM
MemFree          → Completely unused RAM
MemAvailable     → Free + reclaimable (what's actually available)
AnonPages        → Application memory (what kubectl roughly shows)
Cached + Buffers → Page cache (reclaimable, usually harmless)
Slab             → Kernel internal allocations
  SReclaimable   → Kernel caches (can be freed)
  SUnreclaim     → Active kernel objects (cannot be freed!)
SwapTotal        → Total swap space
SwapFree         → Unused swap (SwapTotal - SwapFree = swap used)
Committed_AS     → Total memory promised to all processes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Red flags&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SUnreclaim&lt;/code&gt; &amp;gt; 500 MB on a small node → possible kernel memory leak&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Committed_AS&lt;/code&gt; &amp;gt; &lt;code&gt;MemTotal + SwapTotal&lt;/code&gt; → system is overcommitted&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SwapFree&lt;/code&gt; much less than &lt;code&gt;SwapTotal&lt;/code&gt; → active swapping&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MemAvailable&lt;/code&gt; &amp;lt; 10% of &lt;code&gt;MemTotal&lt;/code&gt; → trouble ahead&lt;/li&gt;
&lt;/ul&gt;
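&lt;p&gt;Those red flags can be encoded as a small checker — a sketch with invented sample values, all in kB as &lt;code&gt;/proc/meminfo&lt;/code&gt; reports them:&lt;/p&gt;

```python
# Sketch: the red flags above as checks over parsed /proc/meminfo values
# (all in kB, as /proc/meminfo reports them; sample values are invented).
SAMPLE = {
    "MemTotal": 4096000, "MemAvailable": 350000,
    "SUnreclaim": 2194432, "SwapTotal": 2097152, "SwapFree": 1572864,
    "Committed_AS": 5500000,
}

def red_flags(m):
    """Return the list of triggered red flags for a meminfo dict (kB values)."""
    flags = []
    if m["SUnreclaim"] > 500 * 1024:                    # over 500 MB
        flags.append("possible kernel memory leak")
    if m["Committed_AS"] > m["MemTotal"] + m["SwapTotal"]:
        flags.append("overcommitted")
    if m["SwapTotal"] - m["SwapFree"] > 0.5 * m["SwapTotal"]:
        flags.append("active swapping")
    if m["MemTotal"] > 10 * m["MemAvailable"]:          # under 10% available
        flags.append("low available memory")
    return flags

print(red_flags(SAMPLE))  # ['possible kernel memory leak', 'low available memory']
```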

&lt;h3&gt;
  
  
  Step 3: If slab is high, find out what's in it
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Show top slab consumers by object count&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; /proc/slabinfo | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-k3&lt;/span&gt; &lt;span class="nt"&gt;-rn&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common slab objects and what they mean:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Object&lt;/th&gt;
&lt;th&gt;What It Is&lt;/th&gt;
&lt;th&gt;High Count Means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;skbuff_head_cache&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Network packet headers&lt;/td&gt;
&lt;td&gt;Network packet leak or very high traffic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kmalloc-*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;General kernel allocations&lt;/td&gt;
&lt;td&gt;Often paired with another leak&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dentry&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Directory entry cache&lt;/td&gt;
&lt;td&gt;Many files/paths accessed (usually reclaimable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;inode_cache&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;File inode cache&lt;/td&gt;
&lt;td&gt;Many files accessed (usually reclaimable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ext4_inode_cache&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ext4 filesystem inodes&lt;/td&gt;
&lt;td&gt;Same as above, ext4 specific&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;nf_conntrack&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Connection tracking entries&lt;/td&gt;
&lt;td&gt;Too many network connections / conntrack leak&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 4: Check for swap thrashing
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Load average (should be &amp;lt; number of CPUs)&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; /proc/loadavg

&lt;span class="c"&gt;# Swap usage&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s2"&gt;"SwapTotal|SwapFree"&lt;/span&gt; /proc/meminfo

&lt;span class="c"&gt;# If swap is being actively used, check swap I/O&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; /proc/vmstat | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s2"&gt;"pswpin|pswpout"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pswpin&lt;/code&gt; = pages swapped in from disk (high = thrashing)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pswpout&lt;/code&gt; = pages swapped out to disk (high = thrashing)&lt;/li&gt;
&lt;/ul&gt;
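&lt;p&gt;Since &lt;code&gt;pswpin&lt;/code&gt; and &lt;code&gt;pswpout&lt;/code&gt; are cumulative counters, what matters is their rate of change between two samples. A sketch with invented values:&lt;/p&gt;

```python
# Sketch: pswpin and pswpout are cumulative counters in /proc/vmstat, so
# thrashing shows up as a large delta between two samples taken a few
# seconds apart (the sample values below are invented).
def swap_rate(before, after, interval_s):
    """Pages swapped in plus out, per second, between two vmstat samples."""
    delta = (after["pswpin"] - before["pswpin"]) + (after["pswpout"] - before["pswpout"])
    return delta / interval_s

t0 = {"pswpin": 1_200_000, "pswpout": 1_500_000}
t1 = {"pswpin": 1_212_000, "pswpout": 1_518_000}
print(swap_rate(t0, t1, interval_s=10))  # prints 3000.0 (pages/s: thrashing)
```

&lt;p&gt;Sustained rates in the thousands of pages per second are the thrashing signature; near-zero rates with nonzero swap usage just mean old cold pages are parked on disk.&lt;/p&gt;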




&lt;h2&gt;
  
  
  Monitoring: What to Alert On
&lt;/h2&gt;

&lt;p&gt;If you're running Prometheus with node-exporter, set up alerts for these metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Alert when non-reclaimable slab memory exceeds 500MB&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HighKernelSlabMemory&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node_memory_SUnreclaim_bytes &amp;gt; 500 * 1024 * &lt;/span&gt;&lt;span class="m"&gt;1024&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;High&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;non-reclaimable&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;kernel&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;slab&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;memory&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.instance&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;

&lt;span class="c1"&gt;# Alert when swap usage exceeds 50%&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HighSwapUsage&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;(1 - node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0.5&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;

&lt;span class="c1"&gt;# Alert when memory pressure is high (PSI)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MemoryPressureHigh&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node_pressure_memory_stalled_seconds_total rate &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0.5&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;

&lt;span class="c1"&gt;# Alert when available memory is critically low&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LowAvailableMemory&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes &amp;lt; &lt;/span&gt;&lt;span class="m"&gt;0.1&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;kubectl top&lt;/code&gt; only shows container memory.&lt;/strong&gt; The kernel can consume gigabytes that are invisible to Kubernetes. Always check &lt;code&gt;/proc/meminfo&lt;/code&gt; when debugging node-level memory issues.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High &lt;code&gt;SUnreclaim&lt;/code&gt; means something is wrong.&lt;/strong&gt; Normal is 50-200 MB. If it's in the gigabytes, you have a kernel memory leak — find the leaking slab cache in &lt;code&gt;/proc/slabinfo&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Swap thrashing masquerades as a CPU problem.&lt;/strong&gt; If you see high CPU + high load average + swap usage, the CPU isn't busy computing — it's busy waiting for disk I/O from swap.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Page cache is not a problem.&lt;/strong&gt; Low &lt;code&gt;MemFree&lt;/code&gt; with healthy &lt;code&gt;MemAvailable&lt;/code&gt; is normal — the kernel is caching files intelligently. Only worry when &lt;code&gt;MemAvailable&lt;/code&gt; drops.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Network monitoring tools can leak socket buffers.&lt;/strong&gt; Any tool that intercepts packets at the kernel level (Weave Scope, long-running tcpdump, certain service mesh sidecars) can accumulate &lt;code&gt;sk_buff&lt;/code&gt; objects in slab memory over time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor &lt;code&gt;node_memory_SUnreclaim_bytes&lt;/code&gt;.&lt;/strong&gt; This is the one metric that would have caught our issue months before it caused an outage.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;This post is part of a series on debugging Kubernetes pod terminations. Read the full incident story: &lt;a href="https://dev.tolink-to-blog-1"&gt;Why Your Kubernetes Pod Keeps Getting Killed — And It's Not an OOMKill&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>linux</category>
      <category>monitoring</category>
      <category>performance</category>
    </item>
    <item>
      <title>AWS Burstable Instances Explained: CPU Credits, Throttling, and Why Your t3 Instance Isn't What You Think</title>
      <dc:creator>Shreyans Sonthalia</dc:creator>
      <pubDate>Wed, 15 Apr 2026 12:10:14 +0000</pubDate>
      <link>https://dev.to/ssshreyans26/aws-burstable-instances-explained-cpu-credits-throttling-and-why-your-t3-instance-isnt-what-you-39o4</link>
      <guid>https://dev.to/ssshreyans26/aws-burstable-instances-explained-cpu-credits-throttling-and-why-your-t3-instance-isnt-what-you-39o4</guid>
      <description>&lt;p&gt;&lt;em&gt;You launched a t3a.medium with "2 vCPUs" but you're not getting 2 CPUs. Here's what you're actually paying for.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Misconception
&lt;/h2&gt;

&lt;p&gt;You go to the AWS console, launch a &lt;code&gt;t3a.medium&lt;/code&gt;, and see this in the spec:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;vCPUs&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;4 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Price&lt;/td&gt;
&lt;td&gt;~$0.047/hr&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most engineers assume they're getting 2 full CPU cores, always available, for $0.047/hr. &lt;strong&gt;That's not what's happening.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What the "T" in T3 Means
&lt;/h2&gt;

&lt;p&gt;AWS has several instance families:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Family&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;CPU Model&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;T3/T3a/T4g&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Burstable&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Shared, credit-based&lt;/td&gt;
&lt;td&gt;t3a.medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;M5/M6i/M7i&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;General purpose&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Dedicated&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;m5.large&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;C5/C6i/C7i&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Compute optimized&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Dedicated&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;c5.large&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;"T" stands for burstable&lt;/strong&gt;. When you buy a T-series instance, you're not buying dedicated CPU cores. You're buying a &lt;strong&gt;fraction&lt;/strong&gt; of a CPU with the ability to temporarily use more.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;t3a.medium&lt;/code&gt; gives you &lt;strong&gt;20% of each vCPU as a baseline&lt;/strong&gt; — meaning you can continuously use &lt;strong&gt;0.4 vCPUs&lt;/strong&gt; (20% x 2). The other 80% is available on-demand, but only if you have &lt;strong&gt;CPU credits&lt;/strong&gt; to spend.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why is it cheaper?
&lt;/h3&gt;

&lt;p&gt;This is the deal AWS offers: because most workloads don't use 100% CPU all the time, AWS can pack ~5 burstable instances onto the same physical hardware that would serve 1 dedicated instance. You get a discount; AWS gets better hardware utilization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;t3a.medium:  2 vCPUs (burstable, 20% baseline)  →  ~$0.047/hr
m5.large:    2 vCPUs (dedicated, 100% always)    →  ~$0.096/hr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The m5.large costs about twice as much because those vCPUs are reserved for you, always.&lt;/p&gt;




&lt;h2&gt;
  
  
  How CPU Credits Work
&lt;/h2&gt;

&lt;p&gt;The credit system is how AWS meters your burst usage.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Basic Math
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1 CPU credit = 1 vCPU running at 100% for 1 minute&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;t3a.medium&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Earns&lt;/strong&gt;: 24 credits per hour (12 per vCPU x 2 vCPUs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Baseline&lt;/strong&gt;: 20% per vCPU (this is what 24 credits/hr translates to)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maximum balance&lt;/strong&gt;: 576 credits (can bank up to 24 hours worth)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A Real-World Example
&lt;/h3&gt;

&lt;p&gt;Say you're running a Kubernetes node with a service that normally uses &lt;strong&gt;0.01 CPU&lt;/strong&gt; (1% of one core). That's well under the 0.4 baseline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Earning:     24 credits/hour
Spending:    ~0.6 credits/hour  (0.01 CPU ≈ 0.5% utilization)
Net:         +23.4 credits/hour accumulating
Max balance: 576 credits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your credit balance slowly fills up over 24 hours. Life is good.&lt;/p&gt;

&lt;p&gt;Now imagine a traffic spike hits and the node needs full CPU:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hour 1:  2.0 vCPUs used (100%)  → spends 120 credits, earns 24
Hour 2:  2.0 vCPUs used (100%)  → net drain: 96 credits/hour
Hour 3:  2.0 vCPUs used (100%)  → balance: 576 - 288 = 288
...
After 6 hours at full CPU:      → 576 credits exhausted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
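&lt;p&gt;One nuance worth making explicit: the instance keeps earning its 24 credits/hour even while bursting, so a full balance drains at the net rate rather than the gross spend. A quick sketch with the same t3a.medium figures:&lt;/p&gt;

```python
# Time for a full t3a.medium credit balance to drain at 100% CPU.
# Credits accrue continuously, so the balance falls at (spend - earn).
max_balance = 576      # credits (24 hours of accrual)
spend_per_hour = 120   # 2 vCPUs x 60 minutes at 100%
earn_per_hour = 24

net_drain = spend_per_hour - earn_per_hour         # 96 credits/hour
hours_until_throttled = max_balance / net_drain    # 6.0

print(hours_until_throttled)  # 6.0
```

&lt;p&gt;Ignoring the concurrent earning gives the rougher 576 / 120 ≈ 4.8 hours figure; either way, a spike sustained for a few hours empties the tank.&lt;/p&gt;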



&lt;h3&gt;
  
  
  The Prepaid Data Plan Analogy
&lt;/h3&gt;

&lt;p&gt;Think of CPU credits like a &lt;strong&gt;prepaid mobile data plan&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You get 1 GB/day at &lt;strong&gt;4G speed&lt;/strong&gt; (full CPU)&lt;/li&gt;
&lt;li&gt;After 1 GB is used up, you're &lt;strong&gt;throttled to 2G speed&lt;/strong&gt; (20% baseline)&lt;/li&gt;
&lt;li&gt;You can still use the internet, but everything is painfully slow&lt;/li&gt;
&lt;li&gt;The next day your quota refills and you're back to full speed&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Happens When Credits Hit Zero
&lt;/h2&gt;

&lt;p&gt;This is where things get serious.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;With credits:    2.0 vCPUs available at full speed
Without credits: 2.0 vCPUs CAPPED at 20% → effectively 0.4 vCPUs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AWS hypervisor literally &lt;strong&gt;limits how many CPU cycles your instance can execute&lt;/strong&gt;. Your instance still shows 2 vCPUs, but each one can only do 20% of the work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Impact on Kubernetes
&lt;/h3&gt;

&lt;p&gt;On a Kubernetes node throttled to 0.4 vCPUs, everything competes for scraps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubelet              → needs CPU for heartbeats every 10s
kube-proxy           → needs CPU for network rules
containerd           → container runtime
OS processes         → systemd, journald, etc.
Your application     → the thing you actually care about
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the kubelet can't send a heartbeat to the API server within 40 seconds (the default &lt;code&gt;node-monitor-grace-period&lt;/code&gt;), the API server marks the node as &lt;strong&gt;NodeNotReady&lt;/strong&gt; and starts evicting pods. Your application goes down — not because it was using too much CPU, but because the underlying node was throttled.&lt;/p&gt;




&lt;h2&gt;
  
  
  T3 Unlimited Mode
&lt;/h2&gt;

&lt;p&gt;AWS offers a way out: &lt;strong&gt;T3 Unlimited mode&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check current mode&lt;/span&gt;
aws ec2 describe-instance-credit-specifications &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--instance-ids&lt;/span&gt; &amp;lt;instance-id&amp;gt;

&lt;span class="c"&gt;# Enable unlimited mode&lt;/span&gt;
aws ec2 modify-instance-credit-specification &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--instance-credit-specification&lt;/span&gt; &lt;span class="nv"&gt;InstanceId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;,CpuCredits&lt;span class="o"&gt;=&lt;/span&gt;unlimited
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Unlimited mode:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your instance &lt;strong&gt;never gets throttled&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;When credits are exhausted, you keep bursting at full speed&lt;/li&gt;
&lt;li&gt;You pay a small surcharge for "surplus credits" (~$0.05 per vCPU-hour on t3a)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  When Unlimited Mode Costs Extra
&lt;/h3&gt;

&lt;p&gt;You only pay extra when:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your earned credits are exhausted, AND&lt;/li&gt;
&lt;li&gt;You're using more than baseline (20%)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If your average usage is below 20%, Unlimited mode costs &lt;strong&gt;nothing extra&lt;/strong&gt; — you earn enough credits to cover the occasional burst.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Average 10% usage:  Free — credits cover all bursts
Average 20% usage:  Free — exactly at baseline
Average 50% usage:  Extra cost — 30% surplus x $0.05/vCPU-hr
Average 100% usage: Expensive — just use a dedicated instance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
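&lt;p&gt;To put numbers on that ladder, here's a sketch of the surplus charge for a t3a.medium, assuming the ~$0.05 per surplus vCPU-hour rate quoted earlier (the function is illustrative, not an AWS API):&lt;/p&gt;

```python
# Unlimited-mode surplus charge for a t3a.medium (2 vCPUs, 20% baseline).
# You pay only for average utilization above baseline, per vCPU-hour.
VCPUS = 2
BASELINE = 0.20
SURPLUS_RATE = 0.05  # USD per surplus vCPU-hour (approximate)

def surplus_cost_per_hour(avg_utilization: float) -> float:
    """Extra hourly cost when average per-vCPU utilization exceeds baseline."""
    over_baseline = max(avg_utilization - BASELINE, 0.0)
    return over_baseline * VCPUS * SURPLUS_RATE

print(surplus_cost_per_hour(0.10))            # 0.0 -- credits cover it
print(round(surplus_cost_per_hour(0.50), 4))  # 0.03
```

&lt;p&gt;At 50% average utilization that's roughly $0.03/hr on top of the $0.047 base, about $0.077/hr total, most of the way to an m5.large's $0.096/hr. That's exactly why sustained-high workloads belong on dedicated instances.&lt;/p&gt;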






&lt;h2&gt;
  
  
  Credit Balance: How to Check and What to Look For
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Via CloudWatch
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws cloudwatch get-metric-statistics &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; AWS/EC2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metric-name&lt;/span&gt; CPUCreditBalance &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dimensions&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;InstanceId,Value&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;instance-id&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'6 hours ago'&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--period&lt;/span&gt; 300 &lt;span class="nt"&gt;--statistics&lt;/span&gt; Average
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Metrics to Monitor
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What It Means&lt;/th&gt;
&lt;th&gt;Alert When&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CPUCreditBalance&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Earned credits remaining&lt;/td&gt;
&lt;td&gt;Drops below 50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CPUSurplusCreditBalance&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Surplus credits used (Unlimited mode)&lt;/td&gt;
&lt;td&gt;Consistently above 0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CPUSurplusCreditsCharged&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Surplus credits you're paying for&lt;/td&gt;
&lt;td&gt;Unexpected charges&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CPUCreditUsage&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Credits spent in the period&lt;/td&gt;
&lt;td&gt;Sustained high usage&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Reading the Credit Balance
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;576 credits  → Full (24 hours of baseline earned)
200 credits  → Healthy — some bursting happening
50 credits   → Warning — approaching exhaustion
0 credits    → Standard mode: THROTTLED / Unlimited mode: paying surplus
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Instance Comparison: When to Use What
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Recommended&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dev/staging environments&lt;/td&gt;
&lt;td&gt;t3a.medium&lt;/td&gt;
&lt;td&gt;Low baseline usage, cost-effective&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kubernetes worker nodes (production)&lt;/td&gt;
&lt;td&gt;m5.large or m6i.large&lt;/td&gt;
&lt;td&gt;Predictable performance, no throttling risk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI/CD build agents&lt;/td&gt;
&lt;td&gt;t3a.xlarge with Unlimited&lt;/td&gt;
&lt;td&gt;Burst during builds, idle otherwise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Databases&lt;/td&gt;
&lt;td&gt;m5/r5 series&lt;/td&gt;
&lt;td&gt;Never throttle a database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch processing&lt;/td&gt;
&lt;td&gt;c5/c6i series&lt;/td&gt;
&lt;td&gt;Sustained compute needs dedicated CPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single dedicated-node workloads&lt;/td&gt;
&lt;td&gt;m5.large over t3a.medium&lt;/td&gt;
&lt;td&gt;Same vCPU count, guaranteed performance, ~2x the cost&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Hidden Cost of Burstable
&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;t3a.medium&lt;/code&gt; at $0.047/hr seems cheaper than an &lt;code&gt;m5.large&lt;/code&gt; at $0.096/hr. But consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When a t3a node gets throttled and your pod gets evicted, what's the cost of that downtime?&lt;/li&gt;
&lt;li&gt;When you spend 3 hours debugging why a pod keeps dying, what's the engineering cost?&lt;/li&gt;
&lt;li&gt;If you enable Unlimited and burst frequently, the surplus charges can approach dedicated instance pricing anyway&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For production Kubernetes nodes, the small extra cost of dedicated instances often pays for itself in reliability and reduced debugging time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Reference: T3/T3a Instance Family
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instance&lt;/th&gt;
&lt;th&gt;vCPUs&lt;/th&gt;
&lt;th&gt;RAM&lt;/th&gt;
&lt;th&gt;Baseline/vCPU&lt;/th&gt;
&lt;th&gt;Credits/hr&lt;/th&gt;
&lt;th&gt;Max Balance&lt;/th&gt;
&lt;th&gt;Price/hr (Mumbai)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;t3a.micro&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1 GiB&lt;/td&gt;
&lt;td&gt;10%&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;288&lt;/td&gt;
&lt;td&gt;~$0.012&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;t3a.small&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2 GiB&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;576&lt;/td&gt;
&lt;td&gt;~$0.024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;t3a.medium&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;4 GiB&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;576&lt;/td&gt;
&lt;td&gt;~$0.047&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;t3a.large&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;8 GiB&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;36&lt;/td&gt;
&lt;td&gt;864&lt;/td&gt;
&lt;td&gt;~$0.075&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;t3a.xlarge&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;16 GiB&lt;/td&gt;
&lt;td&gt;40%&lt;/td&gt;
&lt;td&gt;96&lt;/td&gt;
&lt;td&gt;2304&lt;/td&gt;
&lt;td&gt;~$0.150&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note: Baseline percentages are &lt;strong&gt;per vCPU&lt;/strong&gt;. A t3a.medium with 20% baseline on 2 vCPUs gives you 0.4 vCPUs of sustained compute.&lt;/p&gt;
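&lt;p&gt;Since the per-vCPU baseline trips people up, here's the sustained-capacity arithmetic for the instances in the table (vCPU counts and baselines as listed above):&lt;/p&gt;

```python
# Sustained capacity = baseline fraction per vCPU x vCPU count.
# Figures from the T3a quick-reference table.
instances = {
    "t3a.micro":  (2, 0.10),
    "t3a.small":  (2, 0.20),
    "t3a.medium": (2, 0.20),
    "t3a.large":  (2, 0.30),
    "t3a.xlarge": (4, 0.40),
}

for name, (vcpus, baseline) in instances.items():
    print(f"{name:<11} {vcpus * baseline:.1f} sustained vCPUs")
```

&lt;p&gt;So even the largest instance here, the t3a.xlarge, only guarantees 1.6 of its 4 vCPUs as sustained compute.&lt;/p&gt;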




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;T-series instances are not dedicated compute.&lt;/strong&gt; The "2 vCPUs" you see is the burst ceiling, not the sustained capacity. Your sustained capacity is the baseline percentage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CPU credit exhaustion causes throttling, not failure.&lt;/strong&gt; Your instance doesn't stop — it slows down. This is often worse than a crash because it causes cascading timeouts and hard-to-diagnose performance issues.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enable Unlimited mode on all production T-series instances.&lt;/strong&gt; There's no reason to risk throttling in production. The surplus cost is minimal for occasional bursts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;If you consistently need more than baseline, switch to a dedicated instance.&lt;/strong&gt; T-series instances are designed for workloads that are mostly idle with occasional spikes — not for sustained high CPU usage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor &lt;code&gt;CPUCreditBalance&lt;/code&gt; in CloudWatch.&lt;/strong&gt; Set up alerts before credits hit zero so you can react proactively.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;This post is part of a series on debugging Kubernetes pod terminations. Read the full incident story: &lt;a href="https://dev.tolink-to-blog-1"&gt;Why Your Kubernetes Pod Keeps Getting Killed — And It's Not an OOMKill&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>performance</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
