📝 Executive Summary
TL;DR: A critical production server crash due to a full disk from uncleaned temporary files highlighted the urgent need for robust automation in system housekeeping. The article details a progression from emergency cron jobs to modern solutions like systemd-tmpfiles and logrotate, ultimately advocating for ephemeral infrastructure to eliminate disk creep entirely.
🎯 Key Takeaways
- Systemic disk management is crucial, as relying on developers for perfect self-cleaning code is insufficient; active, automated housekeeping is required for finite disk resources.
- Modern Linux systems offer superior, declarative tools like systemd-tmpfiles for managing temporary directories and logrotate for automated log file rotation, compression, and deletion.
- Adopting an ephemeral infrastructure model, where servers are treated as disposable and regularly replaced from "Golden AMIs," completely eliminates the problem of disk creep and local state accumulation.
Discover how a 3 AM server crash from a full disk led to a revelation in automation, moving from hacky cron jobs to robust systemd-tmpfiles and ephemeral infrastructure to ensure it never happens again.
That Time a Full Disk Crashed Production and I Vowed: "Never Again"
It was 3:17 AM. My phone, buzzing angrily on the nightstand, lit up with a PagerDuty alert: [CRITICAL] Host Down: prod-api-gateway-03. I stumbled to my desk, heart pounding, and SSH'd into the jump box. Trying to get into the sick server was like trying to push a string: the connection would just hang. The monitoring graphs told the story: CPU was flatlined, but Disk I/O was pegged at 100% and the root partition was completely full. After a painful reboot and a frantic twenty minutes in single-user mode, I found the culprit: the /tmp directory was choked with hundreds of thousands of zero-byte session files from a misbehaving PHP application. We'd been manually cleaning it every few weeks, but this time, we forgot. The business was down, all because of digital dust bunnies we failed to sweep up.
That was the moment. That was when I realized that some tasks aren't just tedious; they are ticking time bombs. Automating them isn't about saving time; it's about saving the entire system. And trust me, you'll have a moment like this too.
The "Why": It's Not a Leak, It's a Flood
So, why does this happen? It's rarely one single, massive file. It's the slow, insidious creep of neglect. Think about it:
- Orphaned Temp Files: A process starts, creates a temporary file in /tmp or /var/tmp, and then crashes before it can clean up after itself.
- Verbose Application Logs: Your developers, in a debug-fueled frenzy, turn logging up to 11. Your log rotation is set to keep 30 days of logs, but now each day's log is 5GB instead of 50MB.
- User Uploads & Caches: A web application that handles file uploads might store them in a temporary location before moving them to S3. If that move fails, the file stays put. Forever.
The root cause isn't a single "bug," but a systemic failure to treat disk space as a finite resource that requires active, automated housekeeping. You can't just rely on developers to write perfect, self-cleaning code. You have to build a resilient system that cleans up after them.
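Before any of the fixes below, you need to see where the space actually went. A quick triage sketch, using only tools found on any stock Linux box:

```shell
# How full is each filesystem? Focus on the one at (or near) 100%.
df -h

# Walk down from the root of the full filesystem one level at a time,
# staying on that filesystem (-x) and putting the biggest offenders last.
sudo du -xh --max-depth=1 / 2>/dev/null | sort -h | tail -15

# Don't forget deleted-but-still-open files: a process holding a deleted
# log keeps the space allocated until it exits or is restarted.
sudo lsof +L1 | head -20
```

Repeat the `du` step on the biggest directory it reports until you hit the culprit.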
The Fixes: From Duct Tape to Deep Space
Look, I've been there. You're in a panic and need to get the system stable. Here's how we've tackled this, from the immediate "stop the bleeding" fix to the architectural shift that makes the problem obsolete.
Solution 1: The "Screaming Server" Quick Fix (Cron Job)
This is the classic, old-school, "get it done now" approach. It's ugly, and it can be dangerous if you're not careful, but it works. We're going to create a simple cron job that runs nightly and cleans out old files from a specific directory.
Let's say our problem is the /tmp directory. We'll add a job to the root user's crontab.
# Open the crontab for editing
sudo crontab -e
# Add this line to run at 2:00 AM every day
0 2 * * * /usr/bin/find /tmp -type f -mtime +7 -delete
What this does: At 2 AM, the find command looks for any files (-type f) in /tmp that haven't been modified in over 7 days (-mtime +7) and deletes them (-delete). It's simple and effective for immediate relief.
Pro Tip: Be incredibly careful with commands like this. Always test your find command without the -delete flag first to see what it would delete. A typo in the path (e.g., /etc instead of /app/etc) can wipe your entire system. You've been warned.
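Following that advice, a safe workflow is print first, count, then delete. A sketch:

```shell
# 1. Dry run: print what WOULD be deleted (note: no -delete yet)
find /tmp -type f -mtime +7 -print

# 2. Sanity-check the volume before committing
find /tmp -type f -mtime +7 | wc -l

# 3. Only once the list looks right, swap -print for -delete:
# find /tmp -type f -mtime +7 -delete
```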
Solution 2: The "Permanent" Fix (systemd-tmpfiles & logrotate)
Cron jobs are fine, but modern Linux systems give us better, safer, and more declarative tools. This is the approach you should be using on any production server built in the last decade.
For Temporary Directories: Use systemd-tmpfiles. It's designed specifically for this purpose. You create a simple configuration file to define the cleanup policy.
Create a file at /etc/tmpfiles.d/myapp.conf:
# /etc/tmpfiles.d/myapp.conf
# Type, Path, Mode, UID, GID, Age, Argument
# Delete files in our app's temp dir older than 3 days
d /var/www/myapp/tmp 0755 www-data www-data 3d
# Age out the main temp directories (systemd-tmpfiles considers access,
# modification, and status-change times, whichever is most recent).
# Note: the "x" type would *exclude* paths from cleanup; "d" creates the
# directory if needed and cleans its contents by age.
d /tmp 1777 root root 1d
d /var/tmp 1777 root root 5d
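Nothing happens until systemd picks the file up. A sketch of how to apply and verify the policy by hand (the timer name is the systemd default):

```shell
# Create any directories declared in the config right now
sudo systemd-tmpfiles --create /etc/tmpfiles.d/myapp.conf

# Run the cleanup pass manually to exercise the Age rules
sudo systemd-tmpfiles --clean /etc/tmpfiles.d/myapp.conf

# Confirm the daily cleanup timer is active (this is what runs it for you)
systemctl status systemd-tmpfiles-clean.timer
```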
For Log Files: Use logrotate. This utility is the undisputed champion of log management. It can rotate, compress, and delete old logs automatically.
Create a file at /etc/logrotate.d/myapp:
/var/log/myapp/*.log {
daily
rotate 14
compress
delaycompress
missingok
notifempty
create 0640 www-data adm
}
This configuration tells logrotate to check the logs daily, keep 14 days' worth of rotated logs, compress the old ones (delaying compression by one cycle so the application can finish writing), and recreate the active log file with the correct permissions. This is the robust, set-and-forget solution.
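logrotate ships its own dry-run mode, so you can verify the stanza before the nightly run fires. A sketch:

```shell
# Debug mode: parse the config and show what WOULD happen, without rotating
sudo logrotate --debug /etc/logrotate.d/myapp

# Force an immediate rotation once the debug output looks right
sudo logrotate --force /etc/logrotate.d/myapp
```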
Solution 3: The "Nuclear" Option (Ephemeral Infrastructure)
Here's where we take a step back and question the premise. Why are we even cleaning the server? In a modern cloud-native world, servers should be treated like cattle, not pets. If one gets sick, you don't nurse it back to health; you replace it.
The real automation isn't about cleaning up a full disk; it's about making the disk's state irrelevant.
| The Old Way | The Ephemeral Way |
| --- | --- |
| A long-running server (e.g., prod-db-01) accumulates cruft over months or years. | An instance in an Auto Scaling Group (ASG) lives for a few hours or days at most. |
| You SSH in to patch, configure, and clean the server manually or with scripts. | The instance is built from a "Golden AMI" (Amazon Machine Image) that is already patched and configured. |
| When the disk fills up, it's an emergency that requires manual intervention. | When an instance has an issue (or on a regular schedule), it is simply terminated. The ASG automatically launches a brand new, clean replacement from the golden image. |
This approach completely eliminates the problem of disk creep. There is no manual cleanup because the server doesn't live long enough to get dirty. All state, like logs or user uploads, should be immediately shipped off the instance to a centralized service like an ELK stack, CloudWatch Logs, or an S3 bucket. The local disk is temporary and disposable.
Moving to this model is a significant architectural shift, but it's the ultimate form of automation. You're not just automating the fix; you're automating the problem out of existence. And you'll finally be able to sleep through the night.
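On AWS, the "regular schedule" part can be as simple as kicking off an instance refresh, which rolls every instance in the group onto a fresh copy of the golden image. A sketch using the AWS CLI (the ASG name is hypothetical; requires valid credentials):

```shell
# Roll the whole group onto fresh instances, keeping at least 90% healthy
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name prod-api-gateway-asg \
  --preferences '{"MinHealthyPercentage": 90}'

# Watch the refresh progress
aws autoscaling describe-instance-refreshes \
  --auto-scaling-group-name prod-api-gateway-asg
```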
🔗 Read the original article on TechResolve.blog