It started with a "the site won't load" message
Around 8:00 AM, a message landed on my phone: "are the sites back up, friend? not loading."
I opened mustafaerbay.com.tr: an 8-second timeout. The other apps on the same VPS, same thing. Not the first incident, but the deepest one so far.
Diagnostic flow: SSH works but it's hanging
First SSH:
$ ssh vps
[5 second wait]
$ uptime
05:27:24 up 9 days, 7:51, 3 users, load average: 52.51, 76.02, 70.66
Load 52, 76, 70. On a healthy system that should be around 4, the core count. This is hell.
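The rule of thumb behind that number: a 1-minute load well above the core count means the run queue is backed up. A quick generic check, nothing incident-specific:

CORES=$(nproc)
LOAD1=$(awk '{print $1}' /proc/loadavg)
awk -v l="$LOAD1" -v c="$CORES" 'BEGIN{printf "load/core = %.2f\n", l/c}'   # well above 1.0 = contention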
free -h:
       total   used   free   shared  buff/cache  available
Mem:   7.6Gi   7.5Gi  122Mi   147Mi       330Mi       76Mi
Swap:  4.0Gi   3.9Gi  106Mi
All 7.6 GB of RAM is used; 76 MB is available. Swap is effectively full too, with 106 MB free. Every memory allocation is hitting swap, so the system is running 4 to 5 times slower than normal.
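On kernels with PSI support (4.20+) there is one more number worth pulling in this state: memory pressure, which reports the share of wall-clock time tasks spend stalled waiting for memory.

$ cat /proc/pressure/memory

The "some" line is the fraction of time at least one task was stalled on memory; the "full" line is the fraction of time every task was. In a thrash like this you'd expect the full avg10 to be very high.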
Which processes are doing this damage? ps aux --sort=-%cpu:
USER      PID     %CPU  %MEM  COMMAND
root      54      92.9   0.0  kcompactd0                               ← HERE
ubuntu    382248  24.4  32.3  node ... astro build --out-dir dist-new
root      383691  26.4   7.0  node ... next build
github-+  379827   4.4   0.9  Runner.Worker spawnclient
kcompactd0 at 92% CPU. That was a new acquaintance for me. It's the memory-compaction daemon of the Linux memory subsystem: when RAM is so fragmented that the kernel can't find a contiguous free block, it kicks kcompactd in to coalesce small free chunks into larger ones. Seeing it eat 92% CPU means the system is spending most of its CPU just looking for memory.
And I can see why: an Astro build (2.5 GB) and a Next.js build (615 MB) were running simultaneously. That's about 3 GB from those two alone; the remaining 4 GB is system services + containers + sshd. Total demand > 7.6 GB → swap → swap full → kcompactd panics.
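The fragmentation itself is visible in /proc/buddyinfo: each row is a memory zone and each column counts free blocks of one order, doubling from 4 KB on the left. A generic check, not output from the incident:

$ cat /proc/buddyinfo

When the right-hand columns sit at or near zero, no large contiguous blocks are left, which is exactly the condition that wakes kcompactd.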
Why are the sites timing out?
free -h had already shown only 76 MB available. When sshd accepts a connection, it fork(2)s a child process to serve it, and fork wants RAM. With the machine thrashing wildly on swap, the fork stalls, the handshake times out, and the client sees connection reset by peer.
curl dies a similar half-death:
$ curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" --max-time 8 https://mustafaerbay.com.tr
000 8.006s
Nginx accepts the connection (port 443 is open) and tries to proxy_pass to the Node app on 127.0.0.1:3040, but the app can't respond (no RAM). Timeout.
Who triggered this Astro build?
$ ps -p 382248 -o pid,user,cmd
PID USER CMD
382248 ubuntu node /opt/mustafaerbay/node_modules/.bin/astro build --out-dir dist-new
ubuntu user, output to /opt/mustafaerbay/dist-new. That belongs to my update.sh deploy script. So I triggered it — I did a git push, the VPS deploy timer pulled, the build started.
How long has it been running? 18+ minutes. A normal Astro build is 4-5 minutes. Stuck. Under memory pressure every step slows down, and the actual work can't finish.
No fix, hard reset
I tried gh run cancel; it didn't get through (under RAM thrashing, processes are too starved to act on signals). My kill 382248 over SSH mostly didn't take either. The OOM killer wasn't catching up, because it too needs memory to do its work; the system was so bloated that even the kernel was running slow.
Only one option left: log into Hostinger and hit hard reset. 90 seconds later it's back:
$ ssh vps 'uptime; free -h | head -2'
05:38:07 up 0 min, 2 users, load average: 2.74, 0.58, 0.19
       total   used   free   shared  buff/cache  available
Mem:   7.6Gi   1.4Gi  4.9Gi    83Mi       1.6Gi      6.1Gi
6.1 GB available. All containers auto-started thanks to restart: unless-stopped. The Postgres instances recovered through their WAL. Sites returning 200.
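One escape hatch I only learned about afterwards: when userspace is too starved to act on signals, the kernel's Magic SysRq interface can invoke the OOM killer directly, with no cooperation needed from the dying processes. It requires root and sysrq to be enabled, so take it as a note for next time rather than something I used:

$ echo f | sudo tee /proc/sysrq-trigger   # 'f' = kernel OOM-kills the biggest memory hog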
Now: how to keep this from happening again
I started working on it that first night. The single reflex of "make the build use less" isn't enough — that's defensive. I have to be preventive. I need a few layers.
1. Pre-flight resource guard (workflow)
- name: Pre-flight resource check
  id: preflight
  run: |
    AVAIL_GB=$(df -BG / | tail -1 | awk '{print $4}' | tr -d 'G')        # free disk on /, in GB
    LOAD=$(awk '{print $1}' /proc/loadavg)                               # 1-minute load average
    LOAD_INT=${LOAD%.*}                                                  # integer part, for -gt
    MEM_AVAIL_MB=$(awk '/MemAvailable/{print int($2/1024)}' /proc/meminfo)
    if [ "$AVAIL_GB" -lt 5 ] || [ "$LOAD_INT" -gt 8 ] || [ "$MEM_AVAIL_MB" -lt 1500 ]; then
      echo "skip=1" >> "$GITHUB_OUTPUT"
      echo "::warning::skipping - insufficient resources"
    fi
Before the workflow even starts, check whether the VPS can breathe. If not, graceful skip → success exit, no email spam.
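The skip output does nothing by itself; later steps have to honor it. A minimal sketch of the gating (the build step here is a stand-in, not my actual workflow):

- name: Build and deploy
  if: steps.preflight.outputs.skip != '1'
  run: ./update.sh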
2. URL polling instead of sleep 360
I used to have sleep 360 (6 minutes) in the workflow, just to wait for the deploy to finish. Sleep doesn't actively use RAM but it occupies a runner slot. If an OOM happens during the build, the sleep step gets SIGKILL'd → workflow fail → email.
New version:
# Poll for up to 9 minutes (108 attempts x 5 s) instead of one blind sleep
for i in $(seq 1 108); do
  if curl -fsS -o /dev/null --max-time 5 "$URL"; then
    echo "Deploy detected on attempt ${i}"
    exit 0
  fi
  sleep 5
done
echo "::warning::deploy not detected after 9 minutes"   # fail-soft: warn, don't fail
URL polling: the step moves on the moment the site goes live, and many short sleeps are more resilient than one 6-minute sleep in OOM-killable situations.
3. AI quirk auto-fixer
The day before, three different AI output quirks had broken cron jobs. I gathered them all into a single normalizer built on a "fix instead of reject" strategy. I'd written about this earlier, but it's relevant here: fail-soft mentality everywhere.
4. Pipeline-health monitor
This might be the most important one. The file /var/lib/mustafaerbay/health-state holds the latest status (healthy / degraded). A cron runs every 4 hours and checks the most recent Bluesky post. If 4+ hours have passed with no post and the stored state was healthy, the state flips and a single DEGRADED email goes out; when posting resumes, it flips back and a single RECOVERED email goes out. Alerts fire only on state changes: even if 100 crons run while the state is unchanged, no email is sent.
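The core logic fits in a few lines. A hedged sketch; check_latest_post and send_mail are hypothetical stand-ins for the real checks:

#!/usr/bin/env bash
STATE_FILE=/var/lib/mustafaerbay/health-state
PREV=$(cat "$STATE_FILE" 2>/dev/null || echo healthy)

if check_latest_post; then CURR=healthy; else CURR=degraded; fi

if [ "$CURR" != "$PREV" ]; then              # alert only on a transition
  case "$CURR" in
    degraded) send_mail "DEGRADED: no Bluesky post in 4+ hours" ;;
    healthy)  send_mail "RECOVERED: posting resumed" ;;
  esac
fi
echo "$CURR" > "$STATE_FILE"                 # persist the state every run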
In this morning's incident, 16 cron jobs failed back to back, and only 1 email came through. In the classic "send an email for every workflow run" world, 16 emails would have arrived.
A deeper lesson
If I hadn't lived through this OOM, I wouldn't have built the layered defense. You have to live it once — you need that concrete experience that says this can happen. Now I have pre-flight at the start of every cron, polling instead of build wait, an AI output normalizer, and state-change alerts.
None of these prevent the OOM on its own. They prevent it together. Classic Swiss cheese model:
[OOM conditions build up] →
layer 1: pre-flight skips the run (it sees resources are insufficient)
→ layer 2: the polling wait survives where a long sleep would be killed
→ layer 3: the auto-fixer doesn't trip over AI quirks
→ layer 4: the pipeline-health monitor alerts only on state changes
→ result: one skipped cron per half hour, no email, a healthy system
A single defense would eventually have a hole found in it. Four cheese slices in a row, holes that don't line up: the ball doesn't get through.
⚠️ When you see kcompactd at 92% CPU
It means the system is suffocating: the kernel can't find free contiguous memory. The typical cause is multiple large processes asking for RAM at the same time. Temporary fix: kill the most expensive process. Permanent fix: don't let them start in parallel.
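That last line is a couple of lines of shell. A hedged sketch with flock(1); the lock path and the wrapper are placeholders, not my actual update.sh:

#!/usr/bin/env bash
# Serialize heavy builds: a second deploy skips instead of piling onto RAM
exec 9>/var/lock/deploy.lock
if ! flock -n 9; then
  echo "another build is already running, skipping this deploy" >&2
  exit 0   # fail-soft: a skip, not an error
fi
npx astro build --out-dir dist-new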
Conclusion
I got caught off guard by this OOM. But if you treat the post-mortem as a chance to build a four-layer pipeline-reliability system, you have to call it a good thing. A bad day can produce value, provided you write it up.
The next cron will run in 30 minutes. Pre-flight will be checking. If available memory is still under 1.5 GB, or disk under 5 GB, or load above 8, it'll skip. I'm no longer worried, because I've seen this system work: the pipeline keeps going on its own.