📝 Executive Summary
TL;DR: A critical production server crash due to a full disk from uncleaned temporary files highlighted the urgent need for robust automation in system housekeeping. The article details a progression from emergency cron jobs to modern solutions like systemd-tmpfiles and logrotate, ultimately advocating for ephemeral infrastructure to eliminate disk creep entirely.
🎯 Key Takeaways
- Systemic disk management is crucial, as relying on developers for perfect self-cleaning code is insufficient; active, automated housekeeping is required for finite disk resources.
- Modern Linux systems offer superior, declarative tools like systemd-tmpfiles for managing temporary directories and logrotate for automated log file rotation, compression, and deletion.
- Adopting an ephemeral infrastructure model, where servers are treated as disposable and regularly replaced from "Golden AMIs," completely eliminates the problem of disk creep and local state accumulation.
Discover how a 3 AM server crash from a full disk led to a revelation in automation, moving from hacky cron jobs to robust systemd-tmpfiles and ephemeral infrastructure to ensure it never happens again.
That Time a Full Disk Crashed Production and I Vowed: "Never Again"
It was 3:17 AM. My phone, buzzing angrily on the nightstand, lit up with a PagerDuty alert: [CRITICAL] Host Down: prod-api-gateway-03. I stumbled to my desk, heart pounding, and SSH'd into the jump box. Trying to get into the sick server was like trying to push a string: the connection would just hang. The monitoring graphs told the story: CPU was flatlined, but Disk I/O was pegged at 100% and the root partition was completely full. After a painful reboot and a frantic twenty minutes in single-user mode, I found the culprit: the /tmp directory was choked with hundreds of thousands of zero-byte session files from a misbehaving PHP application. We'd been manually cleaning it every few weeks, but this time, we forgot. The business was down, all because of digital dust bunnies we failed to sweep up.
That was the moment. That was when I realized that some tasks aren't just tedious; they are ticking time bombs. Automating them isn't about saving time; it's about saving the entire system. And trust me, you'll have a moment like this too.
The "Why": It's Not a Leak, It's a Flood
So, why does this happen? It's rarely one single, massive file. It's the slow, insidious creep of neglect. Think about it:
- Orphaned Temp Files: A process starts, creates a temporary file in /tmp or /var/tmp, and then crashes before it can clean up after itself.
- Verbose Application Logs: Your developers, in a debug-fueled frenzy, turn logging up to 11. Your log rotation is set to keep 30 days of logs, but now each day's log is 5GB instead of 50MB.
- User Uploads & Caches: A web application that handles file uploads might store them in a temporary location before moving them to S3. If that move fails, the file stays put. Forever.
The root cause isn't a single "bug," but a systemic failure to treat disk space as a finite resource that requires active, automated housekeeping. You can't just rely on developers to write perfect, self-cleaning code. You have to build a resilient system that cleans up after them.
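Before any of the fixes below, you need to see where the space actually went. A quick triage sketch, using only tools found on any stock Linux box:

```shell
# How full is each filesystem? Focus on the one at (or near) 100%.
df -h

# Walk down from the root of the full filesystem one level at a time,
# staying on that filesystem (-x) and putting the biggest offenders last.
sudo du -xh --max-depth=1 / 2>/dev/null | sort -h | tail -15

# Don't forget deleted-but-still-open files: a process holding a deleted
# log keeps the space allocated until it exits or is restarted.
sudo lsof +L1 | head -20
```

Repeat the `du` step on the biggest directory it reports until you hit the culprit.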
The Fixes: From Duct Tape to Deep Space
Look, I've been there. You're in a panic and need to get the system stable. Here's how we've tackled this, from the immediate "stop the bleeding" fix to the architectural shift that makes the problem obsolete.
Solution 1: The "Screaming Server" Quick Fix (Cron Job)
This is the classic, old-school, "get it done now" approach. It's ugly, and it can be dangerous if you're not careful, but it works. We're going to create a simple cron job that runs nightly and cleans out old files from a specific directory.
Let's say our problem is the /tmp directory. We'll add a job to the root user's crontab.
# Open the crontab for editing
sudo crontab -e
# Add this line to run at 2:00 AM every day
0 2 * * * /usr/bin/find /tmp -type f -mtime +7 -delete
What this does: At 2 AM, the find command looks for any files (-type f) in /tmp that haven't been modified in over 7 days (-mtime +7) and deletes them (-delete). It's simple and effective for immediate relief.
Pro Tip: Be incredibly careful with commands like this. Always test your find command without the -delete flag first to see what it would delete. A typo in the path (e.g., /etc instead of /app/etc) can wipe your entire system. You've been warned.
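Following that advice, a safe workflow is print first, count, then delete. A sketch:

```shell
# 1. Dry run: print what WOULD be deleted (note: no -delete yet)
find /tmp -type f -mtime +7 -print

# 2. Sanity-check the volume before committing
find /tmp -type f -mtime +7 | wc -l

# 3. Only once the list looks right, swap -print for -delete:
# find /tmp -type f -mtime +7 -delete
```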
Solution 2: The "Permanent" Fix (systemd-tmpfiles & logrotate)
Cron jobs are fine, but modern Linux systems give us better, safer, and more declarative tools. This is the approach you should be using on any production server built in the last decade.
For Temporary Directories: Use systemd-tmpfiles. It's designed specifically for this purpose. You create a simple configuration file to define the cleanup policy.
Create a file at /etc/tmpfiles.d/myapp.conf:
# /etc/tmpfiles.d/myapp.conf
# Type, Path, Mode, UID, GID, Age, Argument
# Delete files in our app's temp dir older than 3 days
d /var/www/myapp/tmp 0755 www-data www-data 3d
# Age out the main temp directories (systemd-tmpfiles considers access,
# modification, and status-change times, whichever is most recent).
# Note: the "x" type would *exclude* paths from cleanup; "d" creates the
# directory if needed and cleans its contents by age.
d /tmp 1777 root root 1d
d /var/tmp 1777 root root 5d
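Nothing happens until systemd picks the file up. A sketch of how to apply and verify the policy by hand (the timer name is the systemd default):

```shell
# Create any directories declared in the config right now
sudo systemd-tmpfiles --create /etc/tmpfiles.d/myapp.conf

# Run the cleanup pass manually to exercise the Age rules
sudo systemd-tmpfiles --clean /etc/tmpfiles.d/myapp.conf

# Confirm the daily cleanup timer is active (this is what runs it for you)
systemctl status systemd-tmpfiles-clean.timer
```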
For Log Files: Use logrotate. This utility is the undisputed champion of log management. It can rotate, compress, and delete old logs automatically.
Create a file at /etc/logrotate.d/myapp:
/var/log/myapp/*.log {
daily
rotate 14
compress
delaycompress
missingok
notifempty
create 0640 www-data adm
}
This configuration tells logrotate to check the logs daily, keep 14 days' worth of rotated logs, compress the old ones (delaying compression by one cycle so the application can finish writing), and recreate the active log file with the correct permissions. This is the robust, set-and-forget solution.
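logrotate ships its own dry-run mode, so you can verify the stanza before the nightly run fires. A sketch:

```shell
# Debug mode: parse the config and show what WOULD happen, without rotating
sudo logrotate --debug /etc/logrotate.d/myapp

# Force an immediate rotation once the debug output looks right
sudo logrotate --force /etc/logrotate.d/myapp
```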
Solution 3: The "Nuclear" Option (Ephemeral Infrastructure)
Here's where we take a step back and question the premise. Why are we even cleaning the server? In a modern cloud-native world, servers should be treated like cattle, not pets. If one gets sick, you don't nurse it back to health; you replace it.
The real automation isn't about cleaning up a full disk; it's about making the disk's state irrelevant.
| The Old Way | The Ephemeral Way |
| --- | --- |
| A long-running server (e.g., prod-db-01) accumulates cruft over months or years. | An instance in an Auto Scaling Group (ASG) lives for a few hours or days at most. |
| You SSH in to patch, configure, and clean the server manually or with scripts. | The instance is built from a "Golden AMI" (Amazon Machine Image) that is already patched and configured. |
| When the disk fills up, it's an emergency that requires manual intervention. | When an instance has an issue (or on a regular schedule), it is simply terminated. The ASG automatically launches a brand new, clean replacement from the golden image. |
This approach completely eliminates the problem of disk creep. There is no manual cleanup because the server doesn't live long enough to get dirty. All state, like logs or user uploads, should be immediately shipped off the instance to a centralized service like an ELK stack, CloudWatch Logs, or an S3 bucket. The local disk is temporary and disposable.
Moving to this model is a significant architectural shift, but it's the ultimate form of automation. You're not just automating the fix; you're automating the problem out of existence. And you'll finally be able to sleep through the night.
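On AWS, the "regular schedule" part can be as simple as kicking off an instance refresh, which rolls every instance in the group onto a fresh copy of the golden image. A sketch using the AWS CLI (the ASG name is hypothetical; requires valid credentials):

```shell
# Roll the whole group onto fresh instances, keeping at least 90% healthy
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name prod-api-gateway-asg \
  --preferences '{"MinHealthyPercentage": 90}'

# Watch the refresh progress
aws autoscaling describe-instance-refreshes \
  --auto-scaling-group-name prod-api-gateway-asg
```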
🔗 Read the original article on TechResolve.blog