A Cron Job Took Our Server to Load 41 by Attacking Itself

#devops #linux #bash #sysadmin

An rsync on a */1 (every-minute) cron schedule took our staging box to a load average of 41 one afternoon, and it took me longer than I want to admit to work out why. The sync normally finished in about twenty seconds. That day the backup target's NFS mount went sluggish, the sync started taking ninety seconds, and cron, which does not know or care whether the last run is still going, launched a fresh copy every single minute on top of it.

Inside ten minutes there were a half-dozen rsyncs all reading the same tree off the same slow disk, each one making the disk slower, each new minute adding another. The box wasn't under attack. It was attacking itself, one polite copy at a time. The thing that stung was that nothing was broken — every individual rsync was correct, the disk eventually recovered on its own, and the only reason it became an outage is that cron has no concept of "the last one is still running."

That's the trap with scheduled jobs: a command that's perfectly fine when you run it by hand can take down a server the first time it runs longer than its interval with nobody watching.

The fix everyone reaches for first is the wrong one

The instinct is a PID file: write $$ to /var/run/job.pid on start, check whether that file exists on the next run, bail if it does. It almost works. Then one run gets kill -9'd, or the box reboots mid-job, and the PID file is left behind pointing at a process that died on Tuesday. Now every future run sees a "lock" owned by a PID that no longer exists, and the job never runs again — the opposite failure, just as silent.

There's also a race between the check and the write, and the times you most need the lock to be clean are exactly the times cleanup didn't happen, because the process died before it could clean up.
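To make the stale-file failure concrete, here is a minimal sketch of the naive guard. The temp directory, the function name naive_guarded_run, and the simulated crash are all assumptions for illustration, not code from the incident.

```shell
#!/bin/bash
# Sketch of the naive PID-file guard and how a crash leaves it stuck.
# The directory and the function name are hypothetical.
work_dir="$(mktemp -d)"
pid_file="$work_dir/job.pid"   # stand-in for /var/run/job.pid

naive_guarded_run() {
    if [ -e "$pid_file" ]; then
        echo "skipped"          # file exists, so assume a run is active
        return 0
    fi
    echo "$$" > "$pid_file"
    echo "ran"
    rm -f "$pid_file"           # only reached on a clean finish
}

first="$(naive_guarded_run)"    # clean run: file created, then removed

# Simulate a run that was kill -9'd: start a process, record its PID,
# kill it, and leave the file behind exactly as a crash would.
sleep 30 &
dead_pid=$!
echo "$dead_pid" > "$pid_file"
kill -9 "$dead_pid"
wait "$dead_pid" 2>/dev/null || true

after_crash="$(naive_guarded_run)"   # stale file blocks this run
later="$(naive_guarded_run)"         # and every run after it

echo "first=$first after_crash=$after_crash later=$later"
rm -rf "$work_dir"
```

The guard only ever asks whether the file exists, so a file left by a dead process is indistinguishable from a live lock.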

flock has none of that. The lock isn't a file you create and delete. It's a lock the kernel holds on an open file descriptor, and the kernel releases it automatically the instant that descriptor closes. The process exiting closes it. So does crashing. So does kill -9. There is no state to leave behind, which is the entire reason it survives the failure modes a PID file can't. The one caveat is inheritance: a child process that inherits the descriptor and outlives the script keeps the lock held until it exits too.
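The kill -9 claim is easy to check by hand. The sketch below assumes the util-linux flock command is installed; the temp lock file, fd 9, and the 30-second sleep are arbitrary choices for the demo.

```shell
#!/bin/bash
# Sketch: a flock lock vanishes when its holder is SIGKILLed.
# The temp file, fd 9, and the 30-second sleep are illustrative assumptions.
lock_file="$(mktemp)"

# Holder: open the lock file on fd 9, take the lock, then become a long
# sleep that inherits the open descriptor (and with it, the lock).
(
    exec 9>"$lock_file"
    flock 9
    exec sleep 30
) &
holder_pid=$!

# Poll until a non-blocking attempt is refused, i.e. the holder has the lock
while_held="acquired"
for _ in $(seq 1 25); do
    if ! flock -n "$lock_file" true; then
        while_held="refused"
        break
    fi
    sleep 0.2
done

# SIGKILL the holder: no trap fires and no cleanup code runs
kill -9 "$holder_pid"
wait "$holder_pid" 2>/dev/null || true

# The kernel closed fd 9 with the process, so the lock is free again
if flock -n "$lock_file" true; then after_kill="acquired"; else after_kill="refused"; fi

echo "while held: $while_held, after kill -9: $after_kill"
rm -f "$lock_file"
```

The lock file itself still exists after the kill. What went away is the kernel lock attached to the dead process's descriptor, which is why nothing needs cleaning up.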

The single-instance pattern

#!/bin/bash
# Script: backup-with-lock.sh
# Purpose: Stop a cron job from overlapping itself when one run runs long
set -euo pipefail

CHECK="✓"
CROSS="✗"

# /run/lock is tmpfs, cleared cleanly on reboot. Never /tmp — temp-cleaners
# delete files there, and a deleted lock mid-run lets a second copy run.
LOCK_FILE="/run/lock/$(basename "$0").lock"

# The > opens (and creates) the lock file on fd 200 and holds it open for the
# whole script. The lock lives on this descriptor, not on the file existing.
exec 200>"$LOCK_FILE"

# -n = non-blocking: if a previous run still holds the lock, give up now
# instead of queueing another copy behind it.
if ! flock -n 200; then
    echo "$CROSS $(date '+%F %T') previous run still active — skipping" >&2
    exit 0
fi

echo "$CHECK $(date '+%F %T') lock acquired — starting"
rsync -a --delete /data/ /mnt/backup/data/
echo "$CHECK $(date '+%F %T') finished — kernel releases the lock on exit"

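The two lines doing the work are exec 200>"$LOCK_FILE" and flock -n 200. The first opens the lock file on a descriptor that stays open for the life of the process. The second tries to grab the lock without waiting; if another run already holds it, flock returns non-zero, we log it and exit 0. A skipped run is normal, not an error, so it shouldn't register as a failure with anything that watches exit codes. One caveat: cron mails any output a job produces regardless of exit status, so if MAILTO is set, redirect the job's output to a log file in the crontab line or the skip message will still land in your inbox.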

Notice there is no cleanup. No trap to remove a PID file, no rm at the end. When this script exits for any reason, fd 200 closes and the lock is gone. That "for any reason" is the whole point.
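To watch the pattern skip an overlapping run, here is a small sketch that launches the same locked job twice. The function name locked_job, the temp files, and the 3-second sleep standing in for a slow rsync are assumptions for the demo.

```shell
#!/bin/bash
# Sketch: two overlapping runs of the same locked job; the second skips.
lock_file="$(mktemp)"    # temp path so this runs without root
log_file="$(mktemp)"

locked_job() {
    # Each call runs in its own subshell, like separate cron launches
    (
        exec 200>"$lock_file"
        if ! flock -n 200; then
            echo "skipped" >> "$log_file"
            exit 0
        fi
        echo "started" >> "$log_file"
        sleep 3              # stand-in for the slow rsync
    )
}

locked_job &             # first run takes the lock and holds it for 3 seconds
sleep 1
locked_job               # second run overlaps, finds the lock held, skips
wait                     # let the first run finish

started_count="$(grep -c '^started$' "$log_file")"
skipped_count="$(grep -c '^skipped$' "$log_file")"
echo "started=$started_count skipped=$skipped_count"
rm -f "$lock_file" "$log_file"
```

Neither subshell removes anything on the way out. The first run's lock disappears when its subshell exits and fd 200 closes.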

You can lock a job without editing it at all

If the misbehaving job is already deployed and you don't want to touch it, wrap it from the crontab line:

# Skip the run if the last one is still going
*/1 * * * * /usr/bin/flock -n /run/lock/sync.lock /usr/local/bin/sync.sh

flock runs sync.sh only if it can grab the lock; if last minute's run is still holding it, this minute's run exits immediately and does nothing. It's the fastest retrofit for a job that's already on fire — no redeploy.

One thing worth burning into memory: -n skips, -w 30 waits up to thirty seconds then gives up, and a bare flock with neither blocks forever. On a fast cron schedule that bare form turns your "skipped" runs into a pile of stuck processes — the exact thing you were trying to prevent.
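The difference between the modes shows up in the exit codes. This sketch assumes util-linux flock and uses a background holder that keeps the lock for about four seconds; the temp file and the timings are arbitrary.

```shell
#!/bin/bash
# Sketch: how -n and -w behave while another process holds the lock.
lock_file="$(mktemp)"

flock "$lock_file" sleep 4 &    # holder: keeps the lock for about 4 seconds
sleep 1                         # give the holder time to acquire it

# -n: refuses immediately with a non-zero status (1 unless -E changes it)
flock -n "$lock_file" true
rc_nonblock=$?

# -w 1: waits up to one second, then gives up with the same status
flock -w 1 "$lock_file" true
rc_short_wait=$?

# -w 10: the wait is long enough for the holder to finish, so it succeeds
flock -w 10 "$lock_file" true
rc_long_wait=$?

wait
echo "-n: $rc_nonblock  -w 1: $rc_short_wait  -w 10: $rc_long_wait"
rm -f "$lock_file"
```

A bare flock with neither flag would behave like the last call with no upper bound, waiting for as long as the holder keeps the lock.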

The part that actually mattered

The load-41 afternoon ended the moment I wrapped that rsync in flock -n. The slow NFS mount was still slow, but now exactly one sync ran at a time and the extras skipped harmlessly until the disk recovered. Locking didn't fix the slow disk — it stopped a transient slow disk from becoming a self-inflicted outage. That's the difference between a script that works when you run it and one that survives unattended.

A lock alone isn't the whole story, though. If the locked job itself hangs, it holds the lock forever and every future run skips, so the job silently stops running and you find out days later. That's why locking belongs alongside two other guards: a runtime bound from timeout, and retries for transient failures.
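One way to pair the two guards is to put coreutils timeout inside the flock call, so a hung job is killed and its lock released. This is a sketch under assumptions: the 2-second limit and the sleep 30 standing in for a hung job are demo values, and the 300-second limit in the crontab comment is hypothetical.

```shell
#!/bin/bash
# Sketch: bound a locked job's runtime so a hang cannot hold the lock forever.
# In a crontab the same idea might look like this (300 is a hypothetical limit):
#   */1 * * * * /usr/bin/flock -n /run/lock/sync.lock /usr/bin/timeout 300 /usr/local/bin/sync.sh
lock_file="$(mktemp)"

# "sleep 30" plays the hung job; timeout kills it after 2 seconds.
# flock passes the command's exit status through, and timeout exits
# with 124 when it had to kill the command.
flock -n "$lock_file" timeout 2 sleep 30
rc_hung=$?

# The hung job is gone, so the next run can take the lock right away
flock -n "$lock_file" true
rc_next=$?

echo "hung run exit: $rc_hung, next run exit: $rc_next"
rm -f "$lock_file"
```

The 124 exit status is what a wrapper or monitor can watch for to tell "killed for running too long" apart from an ordinary failure.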

Full script, the -n vs -w decision, and the FAQ on where the lock file should live: https://bashsnippets.xyz/snippets/bash-flock-single-instance

If you're hardening a cron job, the next two guards are timeout and retry with backoff; the Hardened Cron Wrapper Generator stitches all three into one wrapper, and the full reasoning is in Bash Scripts That Survive Cron. The rest of the library is at https://bashsnippets.xyz
