DEV Community

iapilgrim

How I Troubleshot a KVM Memory Issue That Led to Swap & High CPU (Runbook + Real Scenario)

Recently, I noticed something strange on one of my KVM hypervisors.

The server wasn’t heavily loaded, but earlier I saw:

  • qemu-system-x86 consuming 800%+ CPU
  • kswapd running hot
  • Swap usage near 100%

But when I checked later:

  • CPU was low
  • RAM had plenty free
  • Swap was still full

Here’s the exact troubleshooting flow I followed — and how you can do the same.


🧠 Environment Context

  • Hypervisor: KVM + libvirt
  • Host RAM: 314 GB
  • Swap: 976 MB
  • Multiple VMs running
  • Problem VM: testnet-node3

🔍 Step 1 — Identify High CPU Process

First signal:

ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head -n 10

Output showed:

qemu-system-x86  818%

⚠️ Important: in ps and top, CPU percentages are per core — 100% = one fully used core.

So:

  • 800% = ~8 cores fully used

That means one VM was heavily consuming CPU.
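As a quick sanity check, a ps %CPU figure can be converted into "cores fully used" (assuming the standard one-core-per-100% convention — the helper name is my own):

```shell
# Convert a ps %CPU figure into an approximate number of fully used cores.
# Assumption: ps reports per-core percentages, so 100% = 1 core.
cores_used() { awk -v pct="$1" 'BEGIN { printf "%.1f\n", pct / 100 }'; }

cores_used 818    # ~8.2 cores — compare against `nproc` to see host headroom
```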


🔎 Step 2 — Identify Which VM Maps to That Process

Each running VM appears on the host as its own qemu-system-x86 process.

To map PID to VM:

ps -fp <PID>

Or list VMs:

virsh list --all

To see details:

virsh dominfo <vm-name>

This is how I identified:

testnet-node3
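With many VMs, you can also pull the guest name straight out of the qemu process's command line — libvirt-launched qemu processes carry a `-name guest=<domain>` argument, though you should verify that against your own `ps -fp <PID>` output:

```shell
# Extract the libvirt guest name from a qemu command line string.
# libvirt typically passes "-name guest=<domain>,debug-threads=on" to qemu.
guest_from_cmdline() {
  echo "$1" | grep -o 'guest=[^, ]*' | cut -d= -f2
}

# Usage on a live PID (args in /proc/<PID>/cmdline are NUL-separated):
# guest_from_cmdline "$(tr '\0' ' ' < /proc/<PID>/cmdline)"
```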

📊 Step 3 — Check Host Memory & Swap

Next, I checked memory:

free -h

Output:

Mem:   314Gi total
       217Gi used
        94Gi free
Swap:  976Mi total
       963Mi used

Swap was 98% used.

But RAM still had 94GB free.

This is the point where many people panic — and misdiagnose the problem.


🧪 Step 4 — Check If System Is Under Active Memory Pressure

The key command:

vmstat 1 5

Focus on:

  • si → swap in
  • so → swap out

If both stay at 0 across the samples:

You are NOT under active memory pressure.

In my case:

si = 0
so = 0

Meaning:

  • Swap usage was historical
  • Not current
  • System was stable
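Rather than eyeballing the columns, the si/so check can be scripted. This assumes the default procps vmstat layout, where si and so are columns 7 and 8 (check your vmstat header):

```shell
# Report whether any swap-in/out activity occurred in a vmstat sample.
# Assumption: default procps vmstat layout (si = column 7, so = column 8).
swap_activity() {
  awk 'NR > 2 { si += $7; so += $8 }
       END { print (si + so ? "ACTIVE swapping" : "no active swapping") }'
}

# Usage: sample once per second for 5 seconds
# vmstat 1 5 | swap_activity
```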

🔥 Why Swap Can Stay Full Even With Free RAM

Linux does NOT proactively move swapped pages back into RAM — a page is only swapped back in when something touches it.

So:

  • VM previously caused pressure
  • Kernel swapped ~1GB
  • Memory pressure disappeared
  • Swap remained full

This is normal Linux behavior.
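If the full swap bothers you (for example, it trips monitoring alerts), you can drain it back into RAM with swapoff -a && swapon -a — but only when free RAM comfortably exceeds the swap in use. A guarded sketch, using a 2× rule of thumb of my own choosing:

```shell
# Decide whether it is safe to drain swap back into RAM.
# Rule of thumb (assumption): free RAM should be at least twice the used swap.
can_drain_swap() {  # args: free_ram_mb  swap_used_mb
  [ "$1" -ge $(( $2 * 2 )) ]
}

# Usage (requires root to actually drain; swapoff can take a while):
# free_mb=$(free -m | awk '/^Mem:/  {print $4}')
# swap_mb=$(free -m | awk '/^Swap:/ {print $3}')
# can_drain_swap "$free_mb" "$swap_mb" && sudo swapoff -a && sudo swapon -a
```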


🧠 Step 5 — Check VM Memory Allocation

Then I inspected the VM:

virsh dominfo testnet-node3

Output:

Max memory: 98304000 KiB

Convert:

98304000 KiB = 93.75 GiB ≈ 94 GB

So the VM had ~94GB allocated.
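The unit conversion is easy to script rather than do by hand — virsh dominfo reports memory in KiB:

```shell
# Convert a KiB figure (as printed by `virsh dominfo`) to GiB.
kib_to_gib() { awk -v kib="$1" 'BEGIN { printf "%.2f\n", kib / 1024 / 1024 }'; }

kib_to_gib 98304000    # → 93.75, i.e. the ~94 GB above
```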


❓ Was The VM Actually Memory Starved?

Before increasing RAM, you must check inside the guest.

Inside VM:

free -h
vmstat 1 5

If inside the VM:

  • Swap used
  • OOM killer logs
  • Memory >90% used

Then increasing RAM makes sense.

If not, the CPU spike is more likely workload-related than memory-related.


🚀 Step 6 — Increase VM RAM Safely

Memory changes made with --config take effect on the next boot — and since the VM was already stopped, this was safe to apply.

Target: 128 GB

128GB in KiB:

128 × 1024 × 1024 = 134217728 KiB

Commands (run setmaxmem first — current memory cannot exceed the maximum; libvirt accepts scaled suffixes like 128G as well as a raw KiB value):

virsh setmaxmem testnet-node3 128G --config
virsh setmem testnet-node3 128G --config

Verify:

virsh dominfo testnet-node3

Then start:

virsh start testnet-node3

📊 Step 7 — Verify Host Stability After Resize

After starting the VM:

free -h
Mem:   314Gi total
       221Gi used
        89Gi free
Swap:   0B used

Swap usage was back at zero.

Then:

vmstat 1 5

Confirmed:

  • si = 0
  • so = 0
  • CPU mostly idle

System healthy.


🧩 Root Cause Pattern

Here’s the chain that usually happens:

  1. VM workload spikes
  2. Guest consumes heavy memory
  3. Host experiences memory pressure
  4. Host swap fills
  5. kswapd increases CPU
  6. qemu process CPU rises
  7. After workload stabilizes → swap remains full

Without checking vmstat, people misdiagnose this.


🛑 Common Mistakes

❌ Increasing RAM without checking guest usage
❌ Assuming 100% swap = system dying
❌ Ignoring vmstat
❌ Allocating 100% host RAM to VMs


📐 Capacity Planning Rule for KVM Hosts

For large-memory hosts (like 314GB):

  • Leave 16–32GB minimum for host OS
  • Never allocate 100% to guests
  • Monitor swap regularly
  • Keep swap small (1–4GB is fine for large RAM systems)

🧠 Pro Tips

Check total VM memory allocation:

virsh list --name | while read -r vm; do
  [ -n "$vm" ] && virsh dominfo "$vm" | grep -i memory
done
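To go a step further and compare total guest allocation against host RAM, sum the Max memory values. The parser below assumes virsh dominfo prints a "Max memory:" line in KiB (the libvirt default); the virsh loop is kept in comments so the parser can be used on captured output too:

```shell
# Parse the "Max memory:" value (in KiB) from virsh dominfo output on stdin.
max_mem_kib() { awk '/^Max memory:/ {print $3}'; }

# Sum allocation across all defined VMs, in GiB (requires libvirt):
# total=0
# for vm in $(virsh list --all --name); do
#   [ -n "$vm" ] && total=$(( total + $(virsh dominfo "$vm" | max_mem_kib) ))
# done
# echo "Total allocated to guests: $(( total / 1024 / 1024 )) GiB"
```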

See if swapping is active:

vmstat 1

See which process consumes most memory:

ps -eo pid,comm,%mem,%cpu --sort=-%mem | head

🎯 Final Takeaway

Swap usage alone does NOT mean there is a memory problem.

The real indicators are:

  • Active swap in/out (vmstat)
  • OOM events
  • Sustained high CPU from kswapd
  • Guest-level memory pressure

In my case:

  • VM memory was increased from 94GB → 128GB
  • Host remained healthy
  • No swap pressure
  • System stable

If you're running KVM in production, understanding this memory + swap + CPU interaction is critical.

Blindly adding RAM is easy.

Diagnosing correctly is what makes you a good systems engineer.
