Mustafa ERBAY

Originally published at mustafaerbay.com.tr

3rd OOM on the VPS: Parallel Builds and a flock Mutex Story

Scene: "the sites are down dude"

Wednesday, May 7, afternoon. A message on my phone: "the sites are down dude."

Quick check: my own blog (mustafaerbay.com.tr), the other Next.js apps living on the same VPS, islistesi — none of them load. Only hesapciyiz.com and spamkalkani.com return 200. They're hosted on different infrastructure, so this time they survived.

First SSH attempt:

$ ssh vps
Connection timed out during banner exchange
Connection to 141.95.1.22 port 22 timed out

I'm seeing this error message for the third time. Banner exchange timeout means "TCP opened but sshd couldn't even send a hello." Typically there's only one cause: the system is out of RAM and sshd can't fork for a new connection.

curl told the same story:

$ curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" --max-time 8 https://mustafaerbay.com.tr
000 8.006s

The port is open at the network layer, but the processes can't keep up.

The Pattern Felt Familiar

This incident has happened to me on the same VPS for the third time:

  • April 28: Disk hit 100% full (Docker 33 GB build cache + 23 GB unused images). Half a day of outage.
  • May 4: RAM ran out and swap exploded (kcompactd at 92% CPU, sshd couldn't accept). Hard reboot.
  • May 7: Same picture again.

After the first two I had prepared defenses. I'd built layered protection:

  • Pre-flight resource guard (workflow start checks disk/load/RAM, skips if below threshold).
  • Disk-cleanup timer (auto-clean Docker build cache + dangling images).
  • Polling-based deploy wait (live URL polling instead of sleep 360 — no resource burn).
  • Pipeline-health monitor (single mail on state-change, no spam).

All of these were running. Yet today it still blew up. So something was still missing.
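For context, the pre-flight guard is roughly the following shape. This is a minimal sketch rather than the exact script, and the threshold values are illustrative assumptions, not the ones from my repo:

#!/usr/bin/env bash
# Pre-flight resource guard (sketch): bail out early if the VPS is already strained.
set -euo pipefail

MIN_AVAIL_MB=1500     # require at least ~1.5 GB of available RAM (illustrative threshold)
MAX_LOAD1=4           # 1-minute load average ceiling
MAX_DISK_PCT=85       # root filesystem usage ceiling

avail_mb=$(awk '/MemAvailable/ {print int($2/1024)}' /proc/meminfo)
load1=$(awk '{print int($1)}' /proc/loadavg)
disk_pct=$(df --output=pcent / | tail -1 | tr -dc '0-9')

if [ "$avail_mb" -lt "$MIN_AVAIL_MB" ] || [ "$load1" -ge "$MAX_LOAD1" ] || [ "$disk_pct" -ge "$MAX_DISK_PCT" ]; then
  echo "pre-flight: resources too tight (avail=${avail_mb}MB load=${load1} disk=${disk_pct}%), skipping this run"
  exit 0   # graceful skip; the next scheduled run will check again
fi
echo "pre-flight: OK (avail=${avail_mb}MB load=${load1} disk=${disk_pct}%)"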

Diagnostic Flow

Since I couldn't reach the VPS, the first move was on the GitHub Actions side. I checked for active workflows:

$ gh run list --status=in_progress --limit 5
in_progress  Generate Content  schedule  25490764410  18m10s  ...

A content-generate cron had been running for 18 minutes. Normally it takes 12-14 minutes — 18 means it's stuck. I sent the cancel command:

$ gh run cancel 25490764410
✓ Request to cancel workflow 25490764410 submitted.

I sent the cancel, but wasn't sure it would arrive. The cancel signal goes from GitHub to the self-hosted runner on the VPS, then down to the step process. If the process is already thrashing under OOM, even SIGTERM doesn't get delivered properly.

What I expected happened — for several minutes SSH refused connections:

$ for i in 1 2 3 4; do sleep 25; ssh -o ConnectTimeout=15 vps 'uptime'; done
[Try 1] Connection timed out during banner exchange
[Try 2] Connection timed out during banner exchange
[Try 3] Connection timed out during banner exchange
[Try 4] Connection timed out during banner exchange

After 100 seconds it was still timing out. Even the Linux OOM killer can't keep up, because it too needs memory to do its work; the system is thrashing swap so hard that even kernel calls crawl.

Only one option left: hard reset. I rebooted the VPS from the OVH panel. 90 seconds later it came back:

$ ssh vps 'uptime; free -h | head -2'
 11:01:14 up 1 min,  2 users,  load average: 0.98, 0.43, 0.16
               total        used        free      shared  buff/cache   available
Mem:           7.6Gi       1.6Gi       4.5Gi       105Mi       1.8Gi       6.0Gi

All containers auto-started thanks to restart: unless-stopped, postgreses recovered with WAL, nginx came up. Within ~2 minutes all sites returned 200.

But the real work is the root cause.

Root Cause: The Parallel Build Assumption

I looked at the log of the mustafaerbay workflow I had cancelled. It was in the validate step, running an astro build. On its own, that build eats about 2.5 GB of RAM, which is fine on a 7.6 GB system: roughly 5 GB of the VPS is already allocated to other containers (postgreses, redises, apps), and the remaining 1-2 GB of free memory plus reclaimable cache covers a single build.

But at the same moment another project's docker-compose build had also kicked off. Half a dozen different projects live on the same VPS — my own side products and customer applications mixed. Mustafaerbay (Astro), gercekveri.com, islistesi as my own projects, plus several other apps running Next.js + postgres + redis.

Each has its own update mechanism for deploy. They have independent cycles. One day, two of them enter the build phase within the same minute.

2 parallel builds × ~2 GB each = 4 GB of additional RAM demand, an amount the VPS doesn't have. The system drops into swap; when swap fills, kcompactd burns CPU on memory compaction; no fork-able RAM remains; sshd can't take a connection. Hostage situation.

All the protections I'd built so far are point-in-time checks scoped to my own pipeline: at the start of a workflow I check whether a threshold has been crossed, and if so, I skip. The problem: once my build has started, I can't see another project's build kicking in. Pre-flight is a snapshot, while the build itself takes 4-5 minutes; anything can change in that window.

So my system was written with a "single-tenant VPS" assumption. Reality: multi-tenant.

The Fix: A VPS-Wide Build Mutex

A classic mutex problem. Multiple tasks accessing a single resource (RAM); they need to be serialized. On Linux, you do this most cleanly with flock:

# Classic usage
flock -w 900 /var/lock/vps-build.lock <build-command>

-w 900: if the lock can't be acquired within 15 minutes, give up and exit non-zero (1 by default; the code is configurable with -E). This is critical for a graceful skip: waiting forever would tie up the runner and the queue would balloon.
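If you want to see that timeout behavior without touching a real deploy, a throwaway two-terminal demo (the /tmp/demo.lock path is arbitrary) looks like this:

# terminal 1: hold the lock for a minute
flock /tmp/demo.lock sleep 60

# terminal 2: try to grab the same lock with a 5-second timeout
flock -w 5 /tmp/demo.lock echo "got it"
echo $?   # prints 1, the default conflict exit code (change it with -E <code> if you need to)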

I applied it in three places:

1. mustafaerbay's VPS deploy script (deploy/update.sh):

echo "building (into dist-new)"
rm -rf dist-new
if ! flock -w 900 /var/lock/vps-build.lock \
       env DEPLOY_TARGET=node npx astro build --out-dir dist-new; then
  echo "BUILD FAILED OR LOCK TIMEOUT — let the next deploy try"
  rm -rf dist-new
  exit 0
fi

The flow: acquire the lock, then run the build. If the lock can't be acquired (another project is building), flock waits up to 15 minutes. If time runs out, skip; a minute later the deploy timer pulls again and retries.
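One refinement worth considering (an option, not something in the script above): give the lock timeout its own exit code with flock's -E flag, so that "someone else is building" doesn't look like a failed build in the logs. A sketch along the same lines as the script above:

# distinguish "lock timed out" from "build actually failed"
flock -w 900 -E 99 /var/lock/vps-build.lock \
    env DEPLOY_TARGET=node npx astro build --out-dir dist-new
rc=$?
if [ "$rc" -eq 99 ]; then
  echo "BUILD LOCK TIMEOUT: another project is building, skipping this cycle"
  exit 0
elif [ "$rc" -ne 0 ]; then
  echo "BUILD FAILED (exit $rc)"
  rm -rf dist-new
  exit 0
fi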

2. The GitHub Actions validate step (content-generate.yml):

if flock -w 900 /var/lock/vps-build.lock \
    env DEPLOY_TARGET=node npx astro build --out-dir dist-validate; then

The workflow build also respects the same lock.

3. A reusable wrapper for other projects (/usr/local/bin/vps-build-lock.sh):

#!/usr/bin/env bash
exec flock -w 900 /var/lock/vps-build.lock "$@"

Adding this to the start of my other projects' deploy scripts is enough:

# Before
docker-compose build app

# After
/usr/local/bin/vps-build-lock.sh docker-compose build app

Why flock and Why 15 Minutes

flock is a kernel-level advisory file lock on Linux (the flock(2) syscall under the hood, via util-linux). If the process dies (kill -9 included), the lock is released automatically as soon as its file descriptors are closed. That's critical: with a manual state file, a crashed process would leave a stale lock behind that I'd have to clean up by hand.
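You can convince yourself of the auto-release with a small interactive experiment (again a throwaway lock file, nothing from the deploy setup):

# take the lock in a background job, then kill the holder with no chance to clean up
flock /tmp/demo.lock sleep 300 &
sleep 1                                              # give it a moment to acquire the lock
flock -n /tmp/demo.lock true || echo "lock is held"  # prints: lock is held
kill -9 %1                                           # SIGKILL, so no cleanup code runs in the holder
sleep 1
flock -n /tmp/demo.lock echo "lock acquired, nothing stale left behind"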

The 15-minute choice isn't arbitrary:

  • A single build takes max ~4 minutes (Astro + Pagefind included).
  • In the worst case, with 3 builds in queue, the last one starts after 12 minutes.
  • A 15-minute threshold means I only give up gracefully when I'm more than about 3 builds behind.
  • Since the cron runs every 30 minutes, a skip just means "try again on the next run."

Critical point: During the wait phase there's no RAM consumption. The flock-blocked process hasn't actually started the work yet — it's sleeping in the kernel. It just holds a few MB of stack. No RAM fire because the source of the actual fire — parallel build memory demand — is blocked.
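If you want to see this for yourself while a build is queued, a quick spot-check (output will vary) is:

# waiting flock processes show up in state "S" (sleeping) with a tiny resident set (RSS, in KB)
ps -o pid,stat,rss,etime,cmd -C flock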

Deeper Lesson

This incident taught me 3 things:

1. The "single-tenant" assumption is a hidden bomb in multi-tenant environments.

I wrote "good citizen" rules inside each project: rate-limit, retry-with-backoff, preflight check, cleanup timer. But these are for resources the project is aware of itself. For shared resources — RAM in this case — you need system-level coordination. Each project knowing its own ethics isn't enough.

2. SIGTERM doesn't always arrive.

I cancelled the workflow from GitHub, and it never took effect: under system-wide thrashing, signal delivery slows to a crawl and the process may never get the chance to act on it. Don't treat cancel as a rescue. Block the problem at the start; keep the hard reboot as the nuclear option.

3. If you've made the same mistake 3 times, you haven't solved the problem.

After the first OOM I added disk-cleanup. After the second, polling-wait. In this third incident I understood: the real problem is neither disk nor timing — it's resource reservation. The earlier fixes treated symptoms, not the disease. flock treats the disease. You can still kick off parallel builds, but when you do, they queue instead of running concurrently.

💡 If You Want to Apply This on Your Own VPS

  • The /var/lock/vps-build.lock file is created automatically by the first flock call; you don't need to create it manually.
  • The flock package comes with util-linux on Linux (default on Ubuntu/Debian). Verify it's installed via which flock.
  • Tune the lock timeout to your environment: build duration × 3-4 is reasonable.
  • You can also choose flock -n (non-blocking) to skip immediately instead of waiting, but then a contended run just bails out and that cycle's build is lost. -w (timed wait) is more forgiving.
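For completeness, the fail-fast -n variant, assuming the same lock path and a docker-compose based deploy, looks roughly like this:

# fail-fast variant: skip this cycle immediately if another build holds the lock
if flock -n /var/lock/vps-build.lock docker-compose build app; then
  echo "build finished"
else
  # note: this branch also catches a genuinely failed build, which is why I prefer -w plus a distinct -E code
  echo "lock busy or build failed, skipping; the next cycle will retry"
  exit 0
fi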

Conclusion

Right now the mutex is active as of commit df6fd98. If the next cron kicks off two parallel builds, one will wait up to 15 minutes for the other. A RAM fire can't happen the same way again, because only one build runs at a time.

I'm writing this post on the same day it happened, still as tired as if I'd just finished a workout. But writing the post-mortem while it's fresh has value: the next time the same thing happens at the same hour, I can read last time's notes and remember what I did.

If you have a VPS like this, you'll go multi-tenant sooner or later. Put flock in from the start.
