It started with a "the site won't load" message
Around 8:00 AM, a message landed on my phone: "are the sites back up, friend? not loading."
I opened mustafaerbay.com.tr: an 8-second timeout. The other apps on the same VPS, same thing. Not the first incident, but the deepest one so far.
Diagnostic flow: SSH works but it's hanging
First SSH:
$ ssh vps
[5 second wait]
$ uptime
05:27:24 up 9 days, 7:51, 3 users, load average: 52.51, 76.02, 70.66
Load 52, 76, 70. On a healthy system that should be around 4, the core count. This is hell.
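The rule of thumb behind that number: a 1-minute load well above the core count means the run queue is backed up. A quick generic check, nothing incident-specific:

CORES=$(nproc)
LOAD1=$(awk '{print $1}' /proc/loadavg)
awk -v l="$LOAD1" -v c="$CORES" 'BEGIN{printf "load/core = %.2f\n", l/c}'   # well above 1.0 = contention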
free -h:
       total   used   free   shared  buff/cache  available
Mem:   7.6Gi   7.5Gi  122Mi   147Mi       330Mi       76Mi
Swap:  4.0Gi   3.9Gi  106Mi
All 7.6 GB of RAM is used; 76 MB is available. Swap is effectively full too, with 106 MB free. Every memory allocation is hitting swap, so the system is running 4 to 5 times slower than normal.
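On kernels with PSI support (4.20+) there is one more number worth pulling in this state: memory pressure, which reports the share of wall-clock time tasks spend stalled waiting for memory.

$ cat /proc/pressure/memory

The "some" line is the fraction of time at least one task was stalled on memory; the "full" line is the fraction of time every task was. In a thrash like this you'd expect the full avg10 to be very high.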
Which processes are doing this damage? ps aux --sort=-%cpu:
USER      PID     %CPU  %MEM  COMMAND
root      54      92.9   0.0  kcompactd0                               ← HERE
ubuntu    382248  24.4  32.3  node ... astro build --out-dir dist-new
root      383691  26.4   7.0  node ... next build
github-+  379827   4.4   0.9  Runner.Worker spawnclient
kcompactd0 at 92% CPU. That was a new acquaintance for me. It's the memory-compaction daemon of the Linux memory subsystem: when RAM is so fragmented that the kernel can't find a contiguous free block, it kicks kcompactd in to coalesce small free chunks into larger ones. Seeing it eat 92% CPU means the system is spending most of its CPU just looking for memory.
And I can see why: an Astro build (2.5 GB) and a Next.js build (615 MB) were running simultaneously. That's about 3 GB from those two alone; the remaining 4 GB is system services + containers + sshd. Total demand > 7.6 GB → swap → swap full → kcompactd panics.
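The fragmentation itself is visible in /proc/buddyinfo: each row is a memory zone and each column counts free blocks of one order, doubling from 4 KB on the left. A generic check, not output from the incident:

$ cat /proc/buddyinfo

When the right-hand columns sit at or near zero, no large contiguous blocks are left, which is exactly the condition that wakes kcompactd.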
Why are the sites timing out?
free -h had already shown only 76 MB available. When sshd accepts a connection, it fork(2)s a child process to serve it, and fork wants RAM. With the machine thrashing wildly on swap, the fork stalls, the handshake times out, and the client sees connection reset by peer.
curl dies a similar half-death:
$ curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" --max-time 8 https://mustafaerbay.com.tr
000 8.006s
Nginx accepts the connection (port 443 is open) and tries to proxy_pass to the Node app on 127.0.0.1:3040, but the app can't respond (no RAM). Timeout.
Who triggered this Astro build?
$ ps -p 382248 -o pid,user,cmd
PID USER CMD
382248 ubuntu node /opt/mustafaerbay/node_modules/.bin/astro build --out-dir dist-new
ubuntu user, output to /opt/mustafaerbay/dist-new. That belongs to my update.sh deploy script. So I triggered it — I did a git push, the VPS deploy timer pulled, the build started.
How long has it been running? 18+ minutes. A normal Astro build is 4-5 minutes. Stuck. Under memory pressure every step slows down, and the actual work can't finish.
No fix, hard reset
I tried gh run cancel; it didn't get through (under RAM thrashing, processes are too starved to act on signals). My kill 382248 over SSH mostly didn't take either. The OOM killer wasn't catching up, because it too needs memory to do its work; the system was so bloated that even the kernel was running slow.
Only one option left: log into Hostinger and hit hard reset. 90 seconds later it's back:
$ ssh vps 'uptime; free -h | head -2'
05:38:07 up 0 min, 2 users, load average: 2.74, 0.58, 0.19
       total   used   free   shared  buff/cache  available
Mem:   7.6Gi   1.4Gi  4.9Gi    83Mi       1.6Gi      6.1Gi
6.1 GB available. All containers auto-started thanks to restart: unless-stopped. The Postgres instances recovered through their WAL. Sites returning 200.
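One escape hatch I only learned about afterwards: when userspace is too starved to act on signals, the kernel's Magic SysRq interface can invoke the OOM killer directly, with no cooperation needed from the dying processes. It requires root and sysrq to be enabled, so take it as a note for next time rather than something I used:

$ echo f | sudo tee /proc/sysrq-trigger   # 'f' = kernel OOM-kills the biggest memory hog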
Now: how to keep this from happening again
I started working on it that first night. The single reflex of "make the build use less" isn't enough — that's defensive. I have to be preventive. I need a few layers.
1. Pre-flight resource guard (workflow)
- name: Pre-flight resource check
  id: preflight
  run: |
    AVAIL_GB=$(df -BG / | tail -1 | awk '{print $4}' | tr -d 'G')        # free disk on /, in GB
    LOAD=$(awk '{print $1}' /proc/loadavg)                               # 1-minute load average
    LOAD_INT=${LOAD%.*}                                                  # integer part, for -gt
    MEM_AVAIL_MB=$(awk '/MemAvailable/{print int($2/1024)}' /proc/meminfo)
    if [ "$AVAIL_GB" -lt 5 ] || [ "$LOAD_INT" -gt 8 ] || [ "$MEM_AVAIL_MB" -lt 1500 ]; then
      echo "skip=1" >> "$GITHUB_OUTPUT"
      echo "::warning::skipping - insufficient resources"
    fi
Before the workflow even starts, check whether the VPS can breathe. If not, graceful skip → success exit, no email spam.
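The skip output does nothing by itself; later steps have to honor it. A minimal sketch of the gating (the build step here is a stand-in, not my actual workflow):

- name: Build and deploy
  if: steps.preflight.outputs.skip != '1'
  run: ./update.sh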
2. URL polling instead of sleep 360
I used to have sleep 360 (6 minutes) in the workflow, just to wait for the deploy to finish. Sleep doesn't actively use RAM but it occupies a runner slot. If an OOM happens during the build, the sleep step gets SIGKILL'd → workflow fail → email.
New version:
# Poll for up to 9 minutes (108 attempts x 5 s) instead of one blind sleep
for i in $(seq 1 108); do
  if curl -fsS -o /dev/null --max-time 5 "$URL"; then
    echo "Deploy detected on attempt ${i}"
    exit 0
  fi
  sleep 5
done
echo "::warning::deploy not detected after 9 minutes"   # fail-soft: warn, don't fail
URL polling: the step moves on the moment the site goes live, and many short sleeps are more resilient than one 6-minute sleep in OOM-killable situations.
3. AI quirk auto-fixer
The day before, three different AI output quirks had broken cron jobs. I gathered them all into a single normalizer built on a "fix instead of reject" strategy. I'd written about this earlier, but it's relevant here: fail-soft mentality everywhere.
4. Pipeline-health monitor
This might be the most important one. The file /var/lib/mustafaerbay/health-state holds the latest status (healthy / degraded). A cron runs every 4 hours and checks the most recent Bluesky post. If 4+ hours have passed with no post and the stored state was healthy, the state flips and a single DEGRADED email goes out; when posting resumes, it flips back and a single RECOVERED email goes out. Alerts fire only on state changes: even if 100 crons run while the state is unchanged, no email is sent.
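The core logic fits in a few lines. A hedged sketch; check_latest_post and send_mail are hypothetical stand-ins for the real checks:

#!/usr/bin/env bash
STATE_FILE=/var/lib/mustafaerbay/health-state
PREV=$(cat "$STATE_FILE" 2>/dev/null || echo healthy)

if check_latest_post; then CURR=healthy; else CURR=degraded; fi

if [ "$CURR" != "$PREV" ]; then              # alert only on a transition
  case "$CURR" in
    degraded) send_mail "DEGRADED: no Bluesky post in 4+ hours" ;;
    healthy)  send_mail "RECOVERED: posting resumed" ;;
  esac
fi
echo "$CURR" > "$STATE_FILE"                 # persist the state every run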
In this morning's incident, 16 cron jobs failed back to back, and only 1 email came through. In the classic "send an email for every workflow run" world, 16 emails would have arrived.
A deeper lesson
If I hadn't lived through this OOM, I wouldn't have built the layered defense. You have to live it once — you need that concrete experience that says this can happen. Now I have pre-flight at the start of every cron, polling instead of build wait, an AI output normalizer, and state-change alerts.
None of these prevent the OOM on its own. They prevent it together. Classic Swiss cheese model:
[OOM conditions build up] →
layer 1: pre-flight skips the run (it sees resources are insufficient)
→ layer 2: the polling wait survives where a long sleep would be killed
→ layer 3: the auto-fixer doesn't trip over AI quirks
→ layer 4: the pipeline-health monitor alerts only on state changes
→ result: one skipped cron per half hour, no email, a healthy system
A single defense would eventually have a hole found in it. Four cheese slices in a row, holes that don't line up: the ball doesn't get through.
⚠️ When you see kcompactd at 92% CPU
It means the system is suffocating: the kernel can't find free contiguous memory. The typical cause is multiple large processes asking for RAM at the same time. Temporary fix: kill the most expensive process. Permanent fix: don't let them start in parallel.
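That last line is a couple of lines of shell. A hedged sketch with flock(1); the lock path and the wrapper are placeholders, not my actual update.sh:

#!/usr/bin/env bash
# Serialize heavy builds: a second deploy skips instead of piling onto RAM
exec 9>/var/lock/deploy.lock
if ! flock -n 9; then
  echo "another build is already running, skipping this deploy" >&2
  exit 0   # fail-soft: a skip, not an error
fi
npx astro build --out-dir dist-new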
Conclusion
I got caught off guard by this OOM. But if you treat the post-mortem as a chance to build a four-layer pipeline-reliability system, you have to call it a good thing. A bad day can produce value, provided you write it up.
The next cron will run in 30 minutes. Pre-flight will be checking. If available memory is still under 1.5 GB, or disk under 5 GB, or load above 8, it'll skip. I'm no longer worried, because I've seen this system work: the pipeline keeps going on its own.