When I woke up, 16 crons had failed back-to-back
My disk-cleanup timer ran last night at 03:30 (right on schedule). When I checked in the morning, the GitHub Actions panel was bright red: 16 crons had failed. Not one of them produced fresh content. The pipeline-health monitor had sent a "DEGRADED" email. One mail in the inbox.
I opened the run logs. They were all dying in the exact same place — Checkout step succeeded, Verify Node succeeded, Install dependencies succeeded, then the workflow died before the next step even started. Not normal at all.
GitHub's UI showed no detail: the runner-side error never made it into the job log, which just said "Job failed". So I SSH'd into the VPS:
$ ssh vps 'sudo journalctl -u "actions.runner.*" --since "8 hours ago" | grep -iE "error|fail|missing"' | head -10
Missing file: /home/github-runner/runner-mustafaerbay/_work/_temp/_runner_file_commands/set_output_xyz123
Missing file: /home/github-runner/runner-mustafaerbay/_work/_temp/_runner_file_commands/set_output_abc456
Missing file: ...
Missing file: set_output_*. These are the files the GitHub Actions runner uses to pass state between steps.
GitHub Actions steps share state through a file whose path the runner exposes in the $GITHUB_OUTPUT environment variable:
echo "new_slugs=$NEW_SLUGS" >> "$GITHUB_OUTPUT"
This file lives under _work/_temp/_runner_file_commands/. The runner creates a unique file for each step. It's written in a structured format, then read by the next step as ${{ steps.<id>.outputs.<name> }}.
If that file is missing when the runner needs it, the runner behaves like it has lost its own state, and the job dies between steps. That's exactly the failure pattern in the logs above.
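To make the mechanism concrete, here's roughly what one step-to-step handoff looks like at the file level. The step id, output name, and file suffix below are illustrative, not taken from the incident logs:

# Inside a step with id "discover", the workflow writes an output:
echo "new_slugs=post-a,post-b" >> "$GITHUB_OUTPUT"

# $GITHUB_OUTPUT points at a per-step file the runner created, e.g.:
#   _work/_temp/_runner_file_commands/set_output_<random-suffix>
# Its contents are plain name=value lines:
#   new_slugs=post-a,post-b
# A later step then reads the value back as:
#   ${{ steps.discover.outputs.new_slugs }}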
I tracked down the cause with a sinking feeling
The day before, I had written my disk-cleanup.sh script. It had this line in it:
find /home/github-runner -path '*/_work/_temp/*' -mtime +7 -delete
That single line deletes everything older than 7 days, directories included. The _runner_file_commands directory may well have been created more than 7 days earlier, but the runner was still actively using it.
The -delete action tells find to remove files AND directories (it implies -depth, so it works bottom-up). It refuses to remove a non-empty directory, but that's the only guardrail: any file, or any empty directory, whose modification time is more than a week old gets swept away as stale, whether or not something still depends on it.
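You can reproduce the failure mode in a scratch directory. Everything below is illustrative and safe to run (touch -d assumes GNU coreutils):

# Stand-in for the runner's state directory
mkdir -p /tmp/repro/_work/_temp/_runner_file_commands
# Create a "state" file and backdate it past the 7-day window
touch -d '8 days ago' /tmp/repro/_work/_temp/_runner_file_commands/set_output_demo
# The exact line from my script, pointed at the scratch dir
find /tmp/repro -path '*/_work/_temp/*' -mtime +7 -delete
# The state file is gone, even though its "owner" still needed it
ls -la /tmp/repro/_work/_temp/_runner_file_commands/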
First thing I did was figure out the actual state. I checked over SSH:
$ ls -la /home/github-runner/runner-mustafaerbay/_work/_temp/_runner_file_commands/
total 8
drwxr-xr-x 2 github-runner github-runner 4096 May 3 03:30 .
drwxr-xr-x 5 github-runner github-runner 4096 May 3 03:30 ..
Empty. Just "." and "..". The directory itself had survived, but every state file inside it was gone. The runner still believed the directory was intact and kept writing into it, yet when the hourly cron fired, the state it expected to find had vanished, and the job crashed.
A restart would actually have been harmless, because the runner re-initializes its state on startup. But at 03:30 it wasn't restarting; it was idle. When the cron triggered, it reached for state that no longer existed and failed.
The fast recovery
I restarted the runner service:
$ ssh vps 'sudo systemctl restart actions.runner.<repo-slug>.<runner-name>.service'
After the restart, the runner rebuilt its state directory. The next cron passed cleanly.
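To confirm it actually came back, a quick check (same service-name placeholders as in the restart command):

$ ssh vps 'systemctl is-active actions.runner.<repo-slug>.<runner-name>.service'
active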
I fixed disk-cleanup.sh. Now only whitelisted files get deleted; directories stay untouched:
# Old (DANGEROUS)
find /home/github-runner -path '*/_work/_temp/*' -mtime +7 -delete
# New (safe — only known single-use file patterns)
find /home/github-runner -path '*/_work/_temp/*' -type f \
  \( -name 'set_output_*' -o -name 'set_env_*' -o -name 'add_path_*' -o -name '*.tmp' -o -name '*.log' \) \
  -mtime +7 -delete
Two key differences:
- -type f — files only, not directories
- A -name whitelist — only known single-use filenames
Runner state directories (like _runner_file_commands) are now off-limits. Old single-use set_output_* files get cleaned up (these are created once per step and are useless after they're consumed).
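One habit I've added since (my own convention, nothing the runner requires): before arming -delete, run the identical expression with -print to preview exactly what would be removed:

# Same filter, -print instead of -delete: lists the would-be victims, removes nothing
find /home/github-runner -path '*/_work/_temp/*' -type f \
  \( -name 'set_output_*' -o -name 'set_env_*' -o -name 'add_path_*' -o -name '*.tmp' -o -name '*.log' \) \
  -mtime +7 -print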
The deeper lesson
The real reason I'm writing this isn't to share the embarrassment of an incident I caused myself. The reason runs deeper:
"When you set up automation, changing its inputs without understanding them is more dangerous than the automation itself."
When I wrote disk-cleanup.sh, I thought of _work/_temp as "temporary files". The word temp literally means "temporary". Anything older than 7 days is probably leftover. Sounds reasonable.
But it isn't. _work/_temp is the runner's active state storage. Despite the name temp, things like _runner_file_commands are critical state: created at the start of each step, consumed when the step ends. And those files can legitimately sit there for more than 7 days, because the runner may sleep for long stretches between jobs.
The bottom line: before you set up automation, learn the contracts of the system you're touching.
For disk-cleanup.sh, the new principles are:
- Whitelist > blacklist (list patterns to delete, don't say "everything old")
- Files > directories (a directory continuing to exist may be critical to state)
- Bump the time window (7 days might be too short, the runner can idle for a long time, I bumped it to 14)
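Put together, those three principles can be encoded in one small script. This is a sketch rather than my actual disk-cleanup.sh; the variable names are invented, and the only load-bearing part is the final find:

#!/usr/bin/env bash
set -euo pipefail

ROOT='/home/github-runner'
MAX_AGE_DAYS=14                     # principle 3: generous window, the runner can idle
PATTERNS=('set_output_*' 'set_env_*' 'add_path_*' '*.tmp' '*.log')  # principle 1: whitelist

# Build "-name p1 -o -name p2 ..." from the whitelist
expr=()
for p in "${PATTERNS[@]}"; do expr+=(-o -name "$p"); done
expr=("${expr[@]:1}")               # drop the leading -o

# principle 2: -type f means directories are never candidates
find "$ROOT" -path '*/_work/_temp/*' -type f \
  \( "${expr[@]}" \) -mtime "+$MAX_AGE_DAYS" -delete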
⚠️ To anyone running a GitHub Actions self-hosted runner
Never apply blanket cleanup to the _work/_temp directory. Don't touch these patterns inside it: _runner_file_commands/, _runner_temp/, or any directory at all. Only clean up single-use files like set_output_* and set_env_*, via a whitelist. Damaging runner state means 16 hours of downtime.
Wrap-up
This event was a calibration error for me. The disk-cleanup.sh script was useful, but a scope mistake by its owner broke the owner's own system. I'm writing this openly because the cause of the 16-hour downtime was me: not GitHub, not AI, not a third-party bug.
I made a note to myself: when writing runbooks, ask three more questions:
- What does this script delete? (Exactly. List it.)
- Who owns the things it deletes? (A system service? The runner? Application data?)
- Does that owner have a list of patterns it accepts being cleaned? (If not, I don't have permission to delete.)
The find ... -delete I wrote without asking any of these three questions cost 16 hours of outage. Asking takes 30 seconds. The trade-off is now extremely obvious.