DEV Community

Shreyans Sonthalia

Linux Memory Explained: Swap, Kernel Slab, and skbuff — What Kubernetes Doesn't Show You

Your kubectl top says the node has plenty of free memory. The node crashes anyway. Here's what's hiding in the gap.


The Problem With Kubernetes Memory Metrics

When you run kubectl top node, you see something like:

NAME              CPU     MEMORY
ip-10-2-1-35     45m     616Mi/3936Mi    (15%)

15% memory usage. Looks healthy, right?

But the node is swap thrashing, the load average is 34, and pods are being evicted. How?

Because Kubernetes only shows you userspace memory — the memory your containers are using. It doesn't show you what the Linux kernel is consuming behind the scenes. On the node we were debugging, the kernel was secretly eating 2.1 GB out of 4 GB — and kubectl had no idea.

This post explains the layers of Linux memory that Kubernetes can't see, and how to find them when things go wrong.


How Linux Organizes Memory

When you check /proc/meminfo on a Linux machine, you see dozens of entries. Here's how they fit together on a 4 GB node:

Total RAM: 4,096 MB
├── Used by applications (Anonymous pages):     617 MB
│   ├── Container processes (what kubectl sees)
│   └── System processes (kubelet, containerd, etc.)
├── Page Cache (file-backed pages):             831 MB
│   └── Cached file data (can be reclaimed)
├── Kernel Slab:                              2,194 MB  ← invisible to k8s
│   ├── SReclaimable:      50 MB (can be freed)
│   └── SUnreclaim:     2,143 MB (cannot be freed!)
├── Kernel Stack, Page Tables, etc.:             60 MB
└── Free:                                        87 MB

Kubernetes metrics cover the first bucket. Everything else is the OS and kernel.
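The tree above can be reproduced from /proc/meminfo with a little awk. Here's a sketch — the here-doc holds sample values matching the 4 GB node above, so the output is deterministic; on a real node, call the function with no argument to read the live /proc/meminfo instead:

```shell
# Summarize the major memory buckets from a /proc/meminfo-style file.
# /proc/meminfo values are in kB.
meminfo_breakdown() {
  awk '
    /^MemTotal:/  {total=$2}
    /^MemFree:/   {free=$2}
    /^Buffers:/   {buf=$2}
    /^Cached:/    {cache=$2}
    /^AnonPages:/ {anon=$2}
    /^Slab:/      {slab=$2}
    END {
      printf "Applications (anon): %5d MB\n", anon/1024
      printf "Page cache:          %5d MB\n", (cache+buf)/1024
      printf "Kernel slab:         %5d MB\n", slab/1024
      printf "Free:                %5d MB\n", free/1024
      printf "Total:               %5d MB\n", total/1024
    }' "${1:-/proc/meminfo}"
}

# Sample data mirroring the 4 GB node above:
cat > /tmp/meminfo.sample <<'EOF'
MemTotal:        4194304 kB
MemFree:           89088 kB
Buffers:           20480 kB
Cached:           830464 kB
AnonPages:        631808 kB
Slab:            2246656 kB
EOF
meminfo_breakdown /tmp/meminfo.sample
```

Notice that the buckets reported by the kernel itself already disagree with what kubectl shows — the slab line alone is larger than all application memory combined.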

Let's break down each layer.


Layer 1: Application Memory (What Kubernetes Shows)

This is the memory your processes actively use — variables, heap allocations, stack frames. In Linux terms, these are anonymous pages (memory not backed by any file on disk).

# What Kubernetes reports
kubectl top pods -n live

NAME                         CPU     MEMORY
nightfort-688ccc5974-p47qs   7m      381Mi

This 381 MiB is the Resident Set Size (RSS) of the container's processes — the amount of physical RAM their memory allocations are currently occupying.

Why This Number Isn't the Full Picture

RSS only counts memory your process asked for. It doesn't count:

  • Memory the kernel allocated on behalf of your process (network buffers, file descriptors)
  • Kernel data structures for managing your containers (cgroups, namespaces)
  • Shared libraries loaded once but used by multiple containers
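RSS isn't the only per-container number the kernel keeps, though. On cgroup v2, each pod's memory.stat file splits the memory charged to it into anon, file (page cache), slab, and sock buckets. A sketch that parses it — the sample values below are illustrative; on a node, point it at the pod's directory under /sys/fs/cgroup instead:

```shell
# Print the main buckets from a cgroup v2 memory.stat file (values in bytes).
parse_memory_stat() {
  awk '$1 ~ /^(anon|file|slab|sock)$/ {
         printf "%-5s %6d MB\n", $1, $2/1048576
       }' "$1"
}

# Illustrative sample for a container with 381 MiB of anonymous memory:
cat > /tmp/memory.stat.sample <<'EOF'
anon 399507456
file 52428800
slab 8388608
sock 4194304
EOF
parse_memory_stat /tmp/memory.stat.sample
```

The catch, as the rest of this post shows, is that plenty of kernel memory never gets charged to any container's cgroup at all.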

Layer 2: Page Cache

The page cache is Linux's way of caching file data in RAM so that repeated reads don't hit the disk.

First read of a file:   Disk → RAM (page cache) → Process     [slow]
Second read:             Page cache → Process                   [fast]

On our node, 831 MB was used for page cache. This sounds like a lot, but page cache is reclaimable — the kernel will automatically free it when applications need more RAM. It's essentially "free memory being used productively."

This is why MemAvailable is often much higher than MemFree:

MemFree:        87 MB    (truly unused)
MemAvailable:  735 MB    (free + reclaimable cache)

Key insight: If you see low MemFree but healthy MemAvailable, your system is fine — the kernel is just being smart about caching. Panic when MemAvailable is low.
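That rule of thumb is easy to script. A sketch, using sample numbers matching the node above (call the function with no argument to check the live /proc/meminfo; the 10% threshold is a reasonable default, not a kernel constant):

```shell
# Exit 0 when MemAvailable is a healthy fraction of MemTotal, 1 otherwise.
mem_available_ok() {
  awk '/^MemTotal:/{t=$2} /^MemAvailable:/{a=$2}
       END { pct = 100*a/t
             printf "MemAvailable: %.0f%% of MemTotal\n", pct
             exit (pct < 10) }' "${1:-/proc/meminfo}"
}

cat > /tmp/meminfo.sample <<'EOF'
MemTotal:       4194304 kB
MemAvailable:    752640 kB
EOF
mem_available_ok /tmp/meminfo.sample && echo "fine" || echo "panic: low MemAvailable"
```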


Layer 3: Kernel Slab Memory (The Hidden Consumer)

This is where things get interesting — and where our production incident hid for months.

What is the Slab Allocator?

The Linux kernel constantly needs to create and destroy small data structures: file descriptors, inode objects, network packet headers, process descriptors, and hundreds of other internal types. Allocating and freeing these one at a time from the general-purpose memory allocator would be slow.

The slab allocator solves this by maintaining pre-allocated pools for each object type. Think of it like a restaurant kitchen with separate prep stations:

Instead of:
  "I need an inode" → malloc(sizeof(inode)) → slow, fragmentation

The kernel does:
  "I need an inode" → grab one from the inode pool → fast, no fragmentation
  "Done with inode" → return it to the pool → ready for reuse

Each pool is called a slab cache. You can see all of them in /proc/slabinfo:

# Root required; skip the two header lines, then sort by object count (column 3)
sudo cat /proc/slabinfo | tail -n +3 | sort -k3 -rn | head -10
kmalloc-1k        1,667,384   1024 bytes each  →  1,632 MB
skbuff_head_cache 1,657,980    256 bytes each  →    414 MB
dentry                9,248    192 bytes each  →    1.7 MB
xfs_inode             9,649   1024 bytes each  →    9.4 MB
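A cache's footprint is num_objs × objsize (columns 3 and 4 of /proc/slabinfo). A sketch that ranks caches by total size — shown here against sample lines mirroring the output above, so the MB figures differ slightly from the rounded ones in the table; on a node, feed it /proc/slabinfo (root required, skip the two header lines first):

```shell
# Rank slab caches by total size (num_objs * objsize), largest first.
slab_by_size() {
  awk 'NF >= 4 { printf "%6d MB  %-20s %9d objs x %4d B\n",
                 $3*$4/1048576, $1, $3, $4 }' "$1" | sort -rn
}

# Sample lines in slabinfo column order: name active_objs num_objs objsize ...
cat > /tmp/slabinfo.sample <<'EOF'
kmalloc-1k 1667384 1667384 1024 4 1
skbuff_head_cache 1657980 1657980 256 16 1
xfs_inode 9649 9649 1024 4 1
dentry 9248 9248 192 21 1
EOF
slab_by_size /tmp/slabinfo.sample
```

The `slabtop` utility (from procps) gives a live version of the same view.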

SReclaimable vs SUnreclaim

Slab memory is split into two categories:

SReclaimable — Slab caches that hold cached data the kernel can regenerate. The biggest example is the dentry cache (directory entry cache), which caches filesystem path lookups. If memory is needed, the kernel can shrink these caches.

SUnreclaim — Slab caches that hold active data the kernel is currently using. Network packet buffers, open file descriptors, active inode structures. These cannot be freed until the code that created them explicitly releases them.

On a healthy node:

SReclaimable:    200 MB   (caches, will shrink if needed)
SUnreclaim:      100 MB   (active kernel objects)

On our broken node:

SReclaimable:     50 MB
SUnreclaim:    2,143 MB   ← 21x normal!

Why Kubernetes Can't See Slab Memory

Kubernetes resource metrics come from cgroups (control groups), which track memory charged to the processes inside containers. Much kernel slab memory is never charged to a container's cgroup — allocations made in interrupt context or in shared kernel paths land in the kernel's own accounting instead. So even if your container triggered the allocation (by sending a network packet, for example), the slab memory shows up as kernel memory, not container memory.

This means:

  • kubectl top won't show it
  • Prometheus container metrics won't show it
  • Your pod's memory limit won't be hit by it
  • But it still uses physical RAM on the node

The only way to see it is by checking /proc/meminfo or using node-exporter's node_memory_SUnreclaim_bytes metric.


Layer 4: Swap — The Emergency Overflow

What is Swap?

Swap is a section of the disk that Linux uses as overflow memory when physical RAM is full.

RAM (4 GB)     →  Fast (nanoseconds)    →  Expensive
Disk/Swap      →  Slow (milliseconds)   →  Cheap

When the kernel needs to free up RAM (because something needs more memory and there's nothing reclaimable left), it takes memory pages that haven't been accessed recently and writes them to the swap area on disk. This is called swapping out or paging out.

A Step-by-Step Example

Stage 1: Everything fits in RAM

RAM  [App 750MB] [Kubelet 200MB] [Other 500MB] [Cache 700MB] [Free 1.8GB]
Swap [empty]

All processes' memory is in RAM. Memory access is fast. No problems.

Stage 2: RAM fills up

RAM  [App 830MB] [Kubelet 200MB] [Other 500MB] [Cache 379MB] [Slab 2.1GB] [Free 87MB]
Swap [empty]

Free memory is nearly gone. The kernel starts shrinking the page cache, but slab (SUnreclaim) can't be freed.

Stage 3: Swap kicks in

RAM  [App 750MB] [Kubelet 100MB] [Other 600MB] [Slab 2.1GB] [Cache 300MB]
Swap [Kubelet-old-pages 100MB | App-idle-pages 80MB | Other 320MB] = 500MB used

The kernel identified memory pages that hadn't been accessed recently and moved them to disk. RAM now has room for active work.

Stage 4: Swap thrashing

This is where things go catastrophically wrong. When a process needs a page that was swapped out:

Normal access (page in RAM):
  CPU: "Give me address 0x1234"
  RAM: "Here you go"
  → 100 nanoseconds

Swapped access (page on disk):
  CPU: "Give me address 0x1234"
  RAM: "Not here — it's on disk"              → PAGE FAULT
  Kernel: "I need to load it from swap"
  Kernel: "But RAM is full. Let me swap OUT another page first"
  Disk write: Evict some other page to swap    → 1-5 milliseconds
  Disk read: Load the requested page           → 1-5 milliseconds
  CPU: "Finally!"
  → 2-10 milliseconds total (100,000x slower)

Now multiply this by dozens of processes, all needing pages that were swapped out:

Process A needs a page → it's on disk → swap in A, swap out B → 5ms
Process B runs → needs its page → swapped out by A! → swap in B, swap out C → 5ms
Process C runs → needs its page → swapped out by B! → swap in C, swap out A → 5ms
Process A runs → needs its page → swapped out by C! → ...

This circular eviction is swap thrashing. The system does almost no useful work — all CPU time is spent managing page faults and disk I/O.

Why Swap Thrashing Looks Like a CPU Problem

CloudWatch and top will show 100% CPU utilization during swap thrashing. But the CPU isn't doing computation. Here's the breakdown:

Actual computation:      ~5%     (your app, kubelet, etc.)
Kernel swap management:  ~30%    (deciding what to evict, page table updates)
I/O wait:               ~65%    (waiting for disk reads/writes)
────────────────────────────────
Total:                  ~100%

The load average also skyrockets because Linux counts processes in uninterruptible sleep (waiting for disk I/O) in the load average. If 30 processes are all waiting for swap pages, the load average shows 30 — even though very little CPU work is happening.

This is why our node showed a load average of 34 with pods using only 85m of CPU. The CPUs weren't busy computing — they were busy waiting for the disk.
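You can confirm this split from /proc/stat: the fifth value on the cpu line counts iowait jiffies. A sketch computing the iowait share between two snapshots (the two lines below are made-up sample readings chosen to mirror the breakdown above; on a node, read the real cpu line twice, a few seconds apart):

```shell
# iowait share of CPU time between two /proc/stat "cpu" lines.
# Fields after "cpu" are: user nice system idle iowait irq softirq ...
iowait_pct() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x); split(b, y)
    for (i = 2; i <= n; i++) total += y[i] - x[i]   # total elapsed jiffies
    printf "iowait: %.0f%% of CPU time\n", 100 * (y[6] - x[6]) / total
  }'
}

# Sample snapshots taken a few seconds apart on a thrashing node:
iowait_pct "cpu 100 0 300 500 100 0 0" "cpu 150 0 600 520 750 0 0"
```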


What is skbuff? (Socket Buffers)

sk_buff (socket buffer) is the data structure at the heart of Linux networking. Every network packet — in or out — is represented by an sk_buff.

Anatomy of a Network Packet in Linux

When your container sends an HTTP request:

Application: send("GET /health HTTP/1.1\r\n...")
    ↓
Kernel: allocate an sk_buff
    ├── skbuff_head_cache entry (256 bytes) — metadata, pointers, protocol info
    └── kmalloc-1k entry (1024 bytes) — the actual packet data
    ↓
Network stack: add TCP header, IP header, Ethernet header
    ↓
Network driver: transmit the packet
    ↓
Kernel: free the sk_buff ← THIS is what wasn't happening

On a healthy system, sk_buff structures are allocated when a packet is created and freed when the packet is sent/received/dropped. The slab pool recycles them efficiently.

What a Leak Looks Like

On our node, we found:

skbuff_head_cache:  1,657,980 objects  (414 MB)
kmalloc-1k:         1,667,384 objects  (1,632 MB)

The almost 1:1 ratio between skbuff headers and 1KB allocations is the signature of a network packet leak. Each packet consists of a header + data buffer. 1.66 million packets were stuck in kernel memory, never freed.

At a normal rate of ~1000 packets/second, 1.66 million packets represents about 28 minutes of traffic that was captured and never released. Over days and weeks, with the leaking tool constantly intercepting traffic, this accumulated to gigabytes.
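That signature is straightforward to check for. A sketch comparing the two object counts — the thresholds here are heuristics, not kernel constants, and the sample lines mirror the broken node above (the real /proc/slabinfo needs root):

```shell
# Flag a probable sk_buff leak: skbuff_head_cache and kmalloc-1k object
# counts tracking each other at a high absolute level.
skbuff_leak_check() {
  awk '$1 == "skbuff_head_cache" {s=$3}
       $1 == "kmalloc-1k"        {k=$3}
       END {
         ratio = s / k
         printf "skbuff=%d kmalloc-1k=%d ratio=%.2f\n", s, k, ratio
         if (s > 100000 && ratio > 0.9 && ratio < 1.1)
           print "likely network packet leak"
       }' "$1"
}

cat > /tmp/slabinfo.sample <<'EOF'
skbuff_head_cache 1657980 1657980 256 16 1
kmalloc-1k 1667384 1667384 1024 4 1
EOF
skbuff_leak_check /tmp/slabinfo.sample
```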


How to Investigate Memory Issues on Kubernetes Nodes

Step 1: Check if the problem is even memory

cat /proc/pressure/memory
some avg10=98.98 avg60=98.90 avg300=98.38 total=381246311078
full avg10=62.85 avg60=63.91 avg300=63.33 total=281968539996
  • some > 50% → memory pressure exists
  • full > 10% → severe memory pressure (all tasks stalling)
  • full > 50% → critical — system is barely functional
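Those thresholds turn into a quick triage script. A sketch over the PSI format shown above, using the avg10 (10-second average) column — the sample file mirrors the broken node; drop the argument to read the live /proc/pressure/memory:

```shell
# Classify memory pressure from a /proc/pressure/memory-style file.
psi_memory_check() {
  awk -F'[ =]' '/^some/ {some=$3} /^full/ {full=$3}
    END {
      printf "some=%.2f%% full=%.2f%% -> ", some, full
      if      (full > 50) print "critical: system barely functional"
      else if (full > 10) print "severe memory pressure"
      else if (some > 50) print "memory pressure exists"
      else                print "ok"
    }' "${1:-/proc/pressure/memory}"
}

cat > /tmp/psi.sample <<'EOF'
some avg10=98.98 avg60=98.90 avg300=98.38 total=381246311078
full avg10=62.85 avg60=63.91 avg300=63.33 total=281968539996
EOF
psi_memory_check /tmp/psi.sample
```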

Step 2: Get the full memory breakdown

grep -E "MemTotal|MemFree|MemAvailable|Buffers|Cached|Slab|SReclaimable|SUnreclaim|SwapTotal|SwapFree|AnonPages|Committed_AS" /proc/meminfo

Read it as:

MemTotal         → Total physical RAM
MemFree          → Completely unused RAM
MemAvailable     → Free + reclaimable (what's actually available)
AnonPages        → Application memory (what kubectl roughly shows)
Cached + Buffers → Page cache (reclaimable, usually harmless)
Slab             → Kernel internal allocations
  SReclaimable   → Kernel caches (can be freed)
  SUnreclaim     → Active kernel objects (cannot be freed!)
SwapTotal        → Total swap space
SwapFree         → Unused swap (SwapTotal - SwapFree = swap used)
Committed_AS     → Total memory promised to all processes

Red flags:

  • SUnreclaim > 500 MB on a small node → possible kernel memory leak
  • Committed_AS > MemTotal + SwapTotal → system is overcommitted
  • SwapFree much less than SwapTotal → active swapping
  • MemAvailable < 10% of MemTotal → trouble ahead
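The overcommit flag in particular is a plain comparison. A sketch — the sample values below are illustrative of an overcommitted node, not taken from the incident; drop the argument to check the live /proc/meminfo:

```shell
# Overcommit check: Committed_AS vs MemTotal + SwapTotal (all in kB).
overcommit_check() {
  awk '/^MemTotal:/     {t=$2}
       /^SwapTotal:/    {s=$2}
       /^Committed_AS:/ {c=$2}
       END { printf "committed %d MB of %d MB (RAM+swap)%s\n",
             c/1024, (t+s)/1024, (c > t+s) ? "  <-- overcommitted" : "" }' \
      "${1:-/proc/meminfo}"
}

cat > /tmp/meminfo.sample <<'EOF'
MemTotal:      4194304 kB
SwapTotal:     2097152 kB
Committed_AS:  7340032 kB
EOF
overcommit_check /tmp/meminfo.sample
```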

Step 3: If slab is high, find out what's in it

# Show top slab consumers by object count (root required; skip the 2 header lines)
sudo cat /proc/slabinfo | tail -n +3 | sort -k3 -rn | head -10

Common slab objects and what they mean:

Object              What It Is                    High Count Means
------------------  ----------------------------  -----------------------------------------------
skbuff_head_cache   Network packet headers        Network packet leak or very high traffic
kmalloc-*           General kernel allocations    Often paired with another leak
dentry              Directory entry cache         Many files/paths accessed (usually reclaimable)
inode_cache         File inode cache              Many files accessed (usually reclaimable)
ext4_inode_cache    ext4 filesystem inodes        Same as above, ext4-specific
nf_conntrack        Connection tracking entries   Too many network connections / conntrack leak

Step 4: Check for swap thrashing

# Load average (should be < number of CPUs)
cat /proc/loadavg

# Swap usage
grep -E "SwapTotal|SwapFree" /proc/meminfo

# If swap is being actively used, check swap I/O
cat /proc/vmstat | grep -E "pswpin|pswpout"
  • pswpin = pages swapped in from disk (high = thrashing)
  • pswpout = pages swapped out to disk (high = thrashing)
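Because pswpin and pswpout are cumulative counters since boot, their absolute values mean little — what matters is the rate. A sketch that diffs two readings (the numbers below are made-up sample readings; on a node, grep /proc/vmstat twice, a few seconds apart, and pass the four counters plus the interval):

```shell
# Pages-per-second swap rates from two pswpin/pswpout readings.
swap_rate() {  # args: pswpin0 pswpout0 pswpin1 pswpout1 interval_seconds
  awk -v i0="$1" -v o0="$2" -v i1="$3" -v o1="$4" -v t="$5" 'BEGIN {
    in_rate  = (i1 - i0) / t
    out_rate = (o1 - o0) / t
    printf "swap-in: %d pages/s, swap-out: %d pages/s\n", in_rate, out_rate
    if (in_rate > 1000 && out_rate > 1000) print "swap thrashing likely"
  }'
}

# Sample readings taken 5 seconds apart on a thrashing node:
swap_rate 1200 3400 26200 29400 5
```

Sustained high rates in both directions at once are the thrashing pattern: the same pages are being cycled between RAM and disk.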

Monitoring: What to Alert On

If you're running Prometheus with node-exporter, set up alerts for these metrics:

# Alert when non-reclaimable slab memory exceeds 500MB
- alert: HighKernelSlabMemory
  expr: node_memory_SUnreclaim_bytes > 500 * 1024 * 1024
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "High non-reclaimable kernel slab memory on {{ $labels.instance }}"

# Alert when swap usage exceeds 50%
- alert: HighSwapUsage
  expr: (1 - node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes) > 0.5
  for: 15m
  labels:
    severity: warning

# Alert when memory pressure is high (PSI)
- alert: MemoryPressureHigh
  expr: rate(node_pressure_memory_stalled_seconds_total[5m]) > 0.5
  for: 5m
  labels:
    severity: critical

# Alert when available memory is critically low
- alert: LowAvailableMemory
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
  for: 10m
  labels:
    severity: critical

Key Takeaways

  1. kubectl top only shows container memory. The kernel can consume gigabytes that are invisible to Kubernetes. Always check /proc/meminfo when debugging node-level memory issues.

  2. High SUnreclaim means something is wrong. Normal is 50-200 MB. If it's in the gigabytes, you have a kernel memory leak — find the leaking slab cache in /proc/slabinfo.

  3. Swap thrashing masquerades as a CPU problem. If you see high CPU + high load average + swap usage, the CPU isn't busy computing — it's busy waiting for disk I/O from swap.

  4. Page cache is not a problem. Low MemFree with healthy MemAvailable is normal — the kernel is caching files intelligently. Only worry when MemAvailable drops.

  5. Network monitoring tools can leak socket buffers. Any tool that intercepts packets at the kernel level (Weave Scope, long-running tcpdump, certain service mesh sidecars) can accumulate sk_buff objects in slab memory over time.

  6. Monitor node_memory_SUnreclaim_bytes. This is the one metric that would have caught our issue months before it caused an outage.


This post is part of a series on debugging Kubernetes pod terminations. Read the full incident story: Why Your Kubernetes Pod Keeps Getting Killed — And It's Not an OOMKill
