🚀 Executive Summary
TL;DR: To prevent critical service outages caused by full disks, this guide outlines an automated system for Linux servers. It combines a Python script to monitor disk usage and trigger a Bash cleanup script when a predefined threshold is breached, all orchestrated by a cron job for continuous, proactive management.
🎯 Key Takeaways
- A Python script (disk_monitor.py) uses shutil.disk_usage() to check the disk usage percentage and subprocess.run() to execute a cleanup script if USAGE_THRESHOLD is exceeded.
- The cleanup script (cleanup_logs) is a Bash script that uses the find command with -type f, -mtime +$RETENTION_DAYS, and -delete to remove old log and cache files from specified directories.
- Robust logging via log_message() in the Python script is crucial for tracking monitoring actions, capturing cleanup script output (stdout/stderr), and debugging issues such as permission errors.
- Automation is achieved by scheduling the Python monitoring script to run periodically (e.g., hourly) with a cron job, specifying absolute paths to the Python interpreter and the script.
- Critical safety checks include verifying directory existence in the cleanup script (if [ -d "$DIR" ]) and always testing find commands without -delete first to prevent accidental data loss.
Monitor Linux Disk Space Usage and Trigger Cleanup Scripts
Introduction
As SysAdmins, Developers, and DevOps Engineers, we know that unchecked disk space usage is a ticking time bomb. A full disk can bring down critical applications, cause data corruption, and lead to hours of painful, reactive troubleshooting. The “out of disk space” error is one of the most common yet preventable causes of service outages. Proactive monitoring isn’t just a best practice; it’s essential for maintaining system stability and reliability.
This tutorial provides a practical, automated solution. We will build a system that constantly monitors disk usage on a Linux server. When a predefined threshold is breached, it will automatically trigger a cleanup script to reclaim space by deleting old logs, cache files, or other designated temporary data. This “set-it-and-forget-it” approach turns a reactive problem into a proactive, self-healing system.
Prerequisites
Before we begin, ensure you have the following:
- Access to a Linux server (e.g., Ubuntu, CentOS, Debian) with shell access.
- Python 3 installed on the server. Most modern Linux distributions include it by default.
- Basic familiarity with the command line and editing text files.
- Permissions to create and execute scripts in a user’s home directory.
- Access to configure cron jobs for the user.
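You can verify the Python and cron prerequisites in seconds (crontab -l simply reports that no crontab exists yet if you have never created one):

# Confirm Python 3 is installed and that you can manage cron jobs.
python3 --version
crontab -l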
Step-by-Step Guide
We will break down this process into three main steps: creating the monitoring script, developing the cleanup script, and finally, automating the entire workflow with a cron job.
Step 1: Create the Disk Space Monitoring Script (Python)
Our first component is a Python script that will act as the brain of our operation. It will check the disk usage for a specified partition and decide if a cleanup is necessary.
Create a file named disk_monitor.py in a suitable directory, like /home/user/scripts/, and add the following code:
# Python Script: disk_monitor.py
import shutil
import subprocess
import os
from datetime import datetime

# --- Configuration ---
# The mount point to monitor (e.g., '/' for the root filesystem).
MONITOR_PATH = "/"
# The usage threshold (in percent) that triggers the cleanup.
USAGE_THRESHOLD = 85
# The full path to the cleanup script to be executed.
CLEANUP_SCRIPT_PATH = "/home/user/scripts/cleanup_logs"
# Path to the log file for this monitoring script.
LOG_FILE = "/home/user/logs/disk_monitor.log"

def get_disk_usage(path):
    """Returns the disk usage percentage for the given path."""
    try:
        total, used, free = shutil.disk_usage(path)
        # Calculate usage percentage
        return (used / total) * 100
    except FileNotFoundError:
        log_message(f"Error: The path '{path}' does not exist.")
        return None

def trigger_cleanup():
    """Executes the cleanup script."""
    log_message("Threshold exceeded. Triggering cleanup script...")
    try:
        # Ensure the script is executable
        os.chmod(CLEANUP_SCRIPT_PATH, 0o755)
        # Run the script and capture output
        result = subprocess.run([CLEANUP_SCRIPT_PATH], capture_output=True, text=True, check=True)
        log_message("Cleanup script executed successfully.")
        log_message(f"Cleanup Script STDOUT: {result.stdout.strip()}")
        if result.stderr:
            log_message(f"Cleanup Script STDERR: {result.stderr.strip()}")
    except FileNotFoundError:
        log_message(f"Error: Cleanup script not found at '{CLEANUP_SCRIPT_PATH}'.")
    except subprocess.CalledProcessError as e:
        log_message(f"Error executing cleanup script. Return code: {e.returncode}")
        log_message(f"STDOUT: {e.stdout.strip()}")
        log_message(f"STDERR: {e.stderr.strip()}")
    except Exception as e:
        log_message(f"An unexpected error occurred during cleanup: {e}")

def log_message(message):
    """Appends a timestamped message to the log file."""
    # Ensure the log directory exists
    os.makedirs(os.path.dirname(LOG_FILE), exist_ok=True)
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(LOG_FILE, "a") as f:
        f.write(f"{timestamp} - {message}\n")

def main():
    """Main function to check disk usage and trigger cleanup if needed."""
    log_message("--- Starting disk usage check ---")
    current_usage = get_disk_usage(MONITOR_PATH)
    if current_usage is None:
        log_message("Could not retrieve disk usage. Exiting.")
        return
    log_message(f"Current usage for '{MONITOR_PATH}' is {current_usage:.2f}%. Threshold is {USAGE_THRESHOLD}%.")
    if current_usage > USAGE_THRESHOLD:
        trigger_cleanup()
        # Optional: Re-check usage after cleanup
        new_usage = get_disk_usage(MONITOR_PATH)
        if new_usage is not None:
            log_message(f"Disk usage after cleanup is {new_usage:.2f}%.")
    else:
        log_message("Disk usage is within acceptable limits. No action taken.")

if __name__ == "__main__":
    main()
Code Logic Explained:
- Configuration: At the top, we define the key variables: MONITOR_PATH (the filesystem to check), USAGE_THRESHOLD (the trigger point), and the path to our cleanup script. This makes the script easy to adapt.
- get_disk_usage(): This function uses Python's built-in shutil.disk_usage() to get the total, used, and free space for the given path and returns the usage as a percentage.
- trigger_cleanup(): If the threshold is breached, this function is called. It uses subprocess.run() to execute our shell script. We capture the output (stdout and stderr) for logging purposes, which is crucial for debugging.
- log_message(): Robust logging is key for any automation. This function writes timestamped messages to a log file, so you have a complete history of checks and actions performed.
- main(): This is the entry point that orchestrates the check. It gets the current usage, compares it to the threshold, and calls the cleanup function if necessary.
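Before scheduling anything, it's worth a quick manual run to confirm the monitor writes to its log. A minimal sanity check, assuming the same paths as in the configuration above:

# Run the monitor once by hand, then inspect its log.
python3 /home/user/scripts/disk_monitor.py
tail -n 5 /home/user/logs/disk_monitor.log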
Step 2: Develop the Automated Cleanup Script (Bash)
Next, we need the script that does the actual work of freeing up disk space. This script should be tailored to your specific needs, but a common use case is deleting old log files. Always be extremely cautious with scripts that delete files.
Create a file named cleanup_logs in the same directory (/home/user/scripts/):
#!/bin/bash
# cleanup_logs: This script removes old files to free up disk space.
# --- Configuration ---
# Directory containing old log files to be cleaned up.
# IMPORTANT: Double-check this path!
LOG_DIR="/var/log/my_app_logs"
# Directory for old cached data.
CACHE_DIR="/home/user/app_cache"
# Delete files older than this many days.
RETENTION_DAYS=14
echo "--- Starting Cleanup Process ---"
# --- Action 1: Clean up old application logs ---
echo "Searching for files older than ${RETENTION_DAYS} days in ${LOG_DIR}..."
# Check if the directory exists before proceeding
if [ -d "$LOG_DIR" ]; then
# Use 'find' to locate and delete files; the '-print' action lists each file as it is deleted.
# The 'find' command is powerful and should be used with care.
find "$LOG_DIR" -type f -name "*.log" -mtime +$RETENTION_DAYS -print -delete
echo "Old log file cleanup complete."
else
echo "Warning: Log directory ${LOG_DIR} not found. Skipping."
fi
# --- Action 2: Clean up old cache files ---
echo "Searching for files older than ${RETENTION_DAYS} days in ${CACHE_DIR}..."
if [ -d "$CACHE_DIR" ]; then
find "$CACHE_DIR" -type f -mtime +$RETENTION_DAYS -print -delete
echo "Old cache file cleanup complete."
else
echo "Warning: Cache directory ${CACHE_DIR} not found. Skipping."
fi
echo "--- Cleanup Process Finished ---"
Code Logic Explained:
- Configuration: Similar to the Python script, we define the target directories (LOG_DIR, CACHE_DIR) and RETENTION_DAYS at the top.
- Safety Checks: Before attempting to delete anything, the script checks that the target directories actually exist using if [ -d "$DIR" ]. This prevents catastrophic errors if a path is misconfigured.
- The find Command: This is the core of the script:
  - find "$LOG_DIR": Specifies the starting directory.
  - -type f: Finds only files, not directories.
  - -name "*.log": Narrows the search to files ending in .log.
  - -mtime +$RETENTION_DAYS: The crucial part, matching files last modified more than the specified number of days ago.
  - -print: Prints the path of each file found, which is useful for logging.
  - -delete: The action that performs the deletion. Always test your find command without -delete first to see what it would affect (see the dry-run example below).
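For example, a dry run of the first cleanup action, using the LOG_DIR and RETENTION_DAYS values configured above, only lists the matching files:

# Dry run: list the files that would be deleted, without deleting them.
find /var/log/my_app_logs -type f -name "*.log" -mtime +14 -print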
Step 3: Integrate and Automate with Cron
With both scripts ready, the final step is to schedule the Python monitor to run automatically. The perfect tool for this is cron, the standard Linux job scheduler.
We’ll set up a cron job to run the disk_monitor.py script every hour. Cron will handle the scheduling, and our Python script will handle the logic.
To edit your user’s crontab, open your cron editor with the command:
crontab -e
Then, add the following line to the end of the file. This tells cron to execute our script at the beginning of every hour, every day.
# Run the disk space monitor every hour.
0 * * * * /usr/bin/python3 /home/user/scripts/disk_monitor.py
Cron Syntax Explained:
- 0: The minute of the hour (0 means at the top of the hour).
- *: The hour (an asterisk means "every hour").
- *: The day of the month.
- *: The month.
- *: The day of the week.
- /usr/bin/python3 /home/user/scripts/disk_monitor.py: The command to execute. Note that we use full, absolute paths to both the Python interpreter and the script, because cron runs with a minimal environment; adjust the interpreter path to whatever which python3 reports on your system.
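If a scheduled run ever misbehaves, it also helps to capture anything the script prints to stdout or stderr. A common variant of the same entry (the cron log path here is just an example) redirects both streams to a file:

# Run hourly and append any uncaught output to a separate cron log.
0 * * * * /usr/bin/python3 /home/user/scripts/disk_monitor.py >> /home/user/logs/cron_disk_monitor.log 2>&1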
Common Pitfalls
1. Permission Errors
A frequent issue is that the user running the cron job lacks permission to delete files in the target directories (e.g., /var/log/my_app_logs). The cron job runs and the Python script executes, but the cleanup script fails, and without logging it would do so silently. Our robust logging catches this: check disk_monitor.log for "Permission denied" errors in the STDERR output from the cleanup script. Ensure the cron user owns the target directories or has write permission on them.
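A quick way to surface such failures, assuming the log path from the monitoring script's configuration:

# Search the monitor's log for permission failures reported by the cleanup script.
grep -i "permission denied" /home/user/logs/disk_monitor.log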
2. Overly Aggressive Cleanup Logic
The find ... -delete command is unforgiving. A typo in the LOG_DIR path or setting RETENTION_DAYS to 0 could wipe out critical data. Before automating, always run your find command manually without the -delete flag to get a list of files that would be deleted. This dry run is your most important safety check.
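As an extra guard (our own addition, not part of the script above), you can make cleanup_logs refuse to run with a dangerous retention value by adding a check near the top:

# Abort if RETENTION_DAYS is unset or less than 1.
if [ -z "$RETENTION_DAYS" ] || [ "$RETENTION_DAYS" -lt 1 ]; then
    echo "Error: RETENTION_DAYS must be at least 1. Aborting." >&2
    exit 1
fi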
Conclusion
You have now successfully built an automated, self-healing system for managing disk space on a Linux server. By combining the logical power of Python for monitoring and the file-handling efficiency of a Bash script for cleanup, you’ve created a reliable solution that saves you from emergency interventions. The key takeaways are proactive monitoring, careful script design with safety checks, robust logging for visibility, and reliable automation using cron. This setup not only enhances system stability but also frees up valuable time for you to focus on more complex engineering challenges.
👉 Read the original article on TechResolve.blog
☕ Support my work
If this article helped you, you can buy me a coffee: