DEV Community

iapilgrim

How I Troubleshot a KVM Memory Issue That Led to Swap & High CPU (Runbook + Real Scenario)

Recently, I noticed something strange on one of my KVM hypervisors.

The server wasn’t heavily loaded, but earlier I saw:

  • qemu-system-x86 consuming 800%+ CPU
  • kswapd running hot
  • Swap usage near 100%

But when I checked later:

  • CPU was low
  • RAM had plenty free
  • Swap was still full

Here’s the exact troubleshooting flow I followed — and how you can do the same.


🧠 Environment Context

  • Hypervisor: KVM + libvirt
  • Host RAM: 314 GB
  • Swap: 976 MB
  • Multiple VMs running
  • Problem VM: testnet-node3

🔍 Step 1 — Identify High CPU Process

First signal:

ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head -n 10

Output showed:

qemu-system-x86  818%

⚠️ Important: in ps and top, CPU percentages are per core — 100% = one fully used core.

So:

  • 800% = ~8 cores fully used

That means one VM was heavily consuming CPU.
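As a quick sanity check, a ps %CPU figure can be converted into "cores fully used" (assuming the standard one-core-per-100% convention — the helper name is my own):

```shell
# Convert a ps %CPU figure into an approximate number of fully used cores.
# Assumption: ps reports per-core percentages, so 100% = 1 core.
cores_used() { awk -v pct="$1" 'BEGIN { printf "%.1f\n", pct / 100 }'; }

cores_used 818    # ~8.2 cores — compare against `nproc` to see host headroom
```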


🔎 Step 2 — Identify Which VM Maps to That Process

Each running VM appears on the host as its own qemu-system-x86 process.

To map PID to VM:

ps -fp <PID>

Or list VMs:

virsh list --all

To see details:

virsh dominfo <vm-name>

This is how I identified:

testnet-node3
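With many VMs, you can also pull the guest name straight out of the qemu process's command line — libvirt-launched qemu processes carry a `-name guest=<domain>` argument, though you should verify that against your own `ps -fp <PID>` output:

```shell
# Extract the libvirt guest name from a qemu command line string.
# libvirt typically passes "-name guest=<domain>,debug-threads=on" to qemu.
guest_from_cmdline() {
  echo "$1" | grep -o 'guest=[^, ]*' | cut -d= -f2
}

# Usage on a live PID (args in /proc/<PID>/cmdline are NUL-separated):
# guest_from_cmdline "$(tr '\0' ' ' < /proc/<PID>/cmdline)"
```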

📊 Step 3 — Check Host Memory & Swap

Next, I checked memory:

free -h

Output:

Mem:   314Gi total
       217Gi used
        94Gi free
Swap:  976Mi total
       963Mi used

Swap was 98% used.

But RAM still had 94GB free.

This is the point where many people panic — and misdiagnose the problem.


🧪 Step 4 — Check If System Is Under Active Memory Pressure

The key command:

vmstat 1 5

Focus on:

  • si → swap in
  • so → swap out

If both stay at 0 across the samples:

You are NOT under active memory pressure.

In my case:

si = 0
so = 0

Meaning:

  • Swap usage was historical
  • Not current
  • System was stable
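Rather than eyeballing the columns, the si/so check can be scripted. This assumes the default procps vmstat layout, where si and so are columns 7 and 8 (check your vmstat header):

```shell
# Report whether any swap-in/out activity occurred in a vmstat sample.
# Assumption: default procps vmstat layout (si = column 7, so = column 8).
swap_activity() {
  awk 'NR > 2 { si += $7; so += $8 }
       END { print (si + so ? "ACTIVE swapping" : "no active swapping") }'
}

# Usage: sample once per second for 5 seconds
# vmstat 1 5 | swap_activity
```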

🔥 Why Swap Can Stay Full Even With Free RAM

Linux does NOT proactively move swapped pages back into RAM — a page is only swapped back in when something touches it.

So:

  • VM previously caused pressure
  • Kernel swapped ~1GB
  • Memory pressure disappeared
  • Swap remained full

This is normal Linux behavior.
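If the full swap bothers you (for example, it trips monitoring alerts), you can drain it back into RAM with swapoff -a && swapon -a — but only when free RAM comfortably exceeds the swap in use. A guarded sketch, using a 2× rule of thumb of my own choosing:

```shell
# Decide whether it is safe to drain swap back into RAM.
# Rule of thumb (assumption): free RAM should be at least twice the used swap.
can_drain_swap() {  # args: free_ram_mb  swap_used_mb
  [ "$1" -ge $(( $2 * 2 )) ]
}

# Usage (requires root to actually drain; swapoff can take a while):
# free_mb=$(free -m | awk '/^Mem:/  {print $4}')
# swap_mb=$(free -m | awk '/^Swap:/ {print $3}')
# can_drain_swap "$free_mb" "$swap_mb" && sudo swapoff -a && sudo swapon -a
```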


🧠 Step 5 — Check VM Memory Allocation

Then I inspected the VM:

virsh dominfo testnet-node3

Output:

Max memory: 98304000 KiB

Convert:

98304000 KiB = 93.75 GiB ≈ 94 GB

So the VM had ~94GB allocated.
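The unit conversion is easy to script rather than do by hand — virsh dominfo reports memory in KiB:

```shell
# Convert a KiB figure (as printed by `virsh dominfo`) to GiB.
kib_to_gib() { awk -v kib="$1" 'BEGIN { printf "%.2f\n", kib / 1024 / 1024 }'; }

kib_to_gib 98304000    # → 93.75, i.e. the ~94 GB above
```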


❓ Was The VM Actually Memory Starved?

Before increasing RAM, you must check inside the guest.

Inside VM:

free -h
vmstat 1 5

If inside the VM:

  • Swap used
  • OOM killer logs
  • Memory >90% used

Then increasing RAM makes sense.

If not, the CPU spike is more likely workload-related than memory-related.


🚀 Step 6 — Increase VM RAM Safely

Memory changes made with --config take effect on the next boot — and since the VM was already stopped, this was safe to apply.

Target: 128 GB

128GB in KiB:

128 × 1024 × 1024 = 134217728 KiB

Commands (run setmaxmem first — current memory cannot exceed the maximum; libvirt accepts scaled suffixes like 128G as well as a raw KiB value):

virsh setmaxmem testnet-node3 128G --config
virsh setmem testnet-node3 128G --config

Verify:

virsh dominfo testnet-node3

Then start:

virsh start testnet-node3

📊 Step 7 — Verify Host Stability After Resize

After starting the VM:

free -h
Mem:   314Gi total
       221Gi used
        89Gi free
Swap:   0B used

Swap usage was back at zero.

Then:

vmstat 1 5

Confirmed:

  • si = 0
  • so = 0
  • CPU mostly idle

System healthy.


🧩 Root Cause Pattern

Here’s the chain that usually happens:

  1. VM workload spikes
  2. Guest consumes heavy memory
  3. Host experiences memory pressure
  4. Host swap fills
  5. kswapd increases CPU
  6. qemu process CPU rises
  7. After workload stabilizes → swap remains full

Without checking vmstat, people misdiagnose this.


🛑 Common Mistakes

❌ Increasing RAM without checking guest usage
❌ Assuming 100% swap = system dying
❌ Ignoring vmstat
❌ Allocating 100% host RAM to VMs


📐 Capacity Planning Rule for KVM Hosts

For large-memory hosts (like 314GB):

  • Leave 16–32GB minimum for host OS
  • Never allocate 100% to guests
  • Monitor swap regularly
  • Keep swap small (1–4GB is fine for large RAM systems)

🧠 Pro Tips

Check total VM memory allocation:

virsh list --name | while read -r vm; do
  [ -n "$vm" ] && virsh dominfo "$vm" | grep -i memory
done
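To go a step further and compare total guest allocation against host RAM, sum the Max memory values. The parser below assumes virsh dominfo prints a "Max memory:" line in KiB (the libvirt default); the virsh loop is kept in comments so the parser can be used on captured output too:

```shell
# Parse the "Max memory:" value (in KiB) from virsh dominfo output on stdin.
max_mem_kib() { awk '/^Max memory:/ {print $3}'; }

# Sum allocation across all defined VMs, in GiB (requires libvirt):
# total=0
# for vm in $(virsh list --all --name); do
#   [ -n "$vm" ] && total=$(( total + $(virsh dominfo "$vm" | max_mem_kib) ))
# done
# echo "Total allocated to guests: $(( total / 1024 / 1024 )) GiB"
```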

See if swapping is active:

vmstat 1

See which process consumes most memory:

ps -eo pid,comm,%mem,%cpu --sort=-%mem | head

🎯 Final Takeaway

Swap usage alone does NOT mean there is a memory problem.

The real indicators are:

  • Active swap in/out (vmstat)
  • OOM events
  • Sustained high CPU from kswapd
  • Guest-level memory pressure

In my case:

  • VM memory was increased from 94GB → 128GB
  • Host remained healthy
  • No swap pressure
  • System stable

If you're running KVM in production, understanding this memory + swap + CPU interaction is critical.

Blindly adding RAM is easy.

Diagnosing correctly is what makes you a good systems engineer.
