Anguishe

Posted on • Originally published at bashsnippets.xyz

My Nginx Died at 2 AM and Nobody Noticed for 6 Hours. Now I Have a Watchdog Script.

Nginx crashed on a Saturday night. An OOM kill, probably — I was running a Node app that leaked memory like a broken faucet. The service went down at 2:14 AM. I found out at 8:30 AM when I opened my laptop and saw Slack messages from six hours earlier asking why the site was down.

The fix took 10 seconds: sudo systemctl start nginx. The downtime cost me a weekend of credibility.

The thing is, systemctl already knows when a service dies. I just wasn't asking it to check. So I wrote a script that asks every 60 seconds and restarts the service if it's down. Took less time to write than it did to explain the outage to my team.


The Script

#!/bin/bash

CHECK="✓"
CROSS="✗"

# --- Configuration ---
SERVICE="nginx"                              # Change to your service name
LOG_FILE="/var/log/service-watchdog.log"
DATE=$(date '+%Y-%m-%d %H:%M:%S')
NOTIFY_EMAIL=""                              # Optional: you@example.com

# --- Check if service is running ---
if systemctl is-active --quiet "$SERVICE"; then
  echo "$CHECK [$DATE] $SERVICE is running"
else
  echo "$CROSS [$DATE] $SERVICE is NOT running — attempting restart..."
  echo "$CROSS [$DATE] $SERVICE DOWN — restarting" >> "$LOG_FILE"

  # --- Attempt restart ---
  if sudo systemctl start "$SERVICE"; then
    echo "$CHECK [$DATE] $SERVICE restarted successfully" | tee -a "$LOG_FILE"

    # --- Optional: send email notification ---
    if [ -n "$NOTIFY_EMAIL" ]; then
      echo "$SERVICE was down and has been restarted on $(hostname) at $DATE" \
        | mail -s "[RECOVERED] $SERVICE restarted" "$NOTIFY_EMAIL"
    fi
  else
    echo "$CROSS [$DATE] $SERVICE FAILED to restart — manual intervention needed" \
      | tee -a "$LOG_FILE"

    if [ -n "$NOTIFY_EMAIL" ]; then
      echo "$SERVICE failed to restart on $(hostname) at $DATE. Check: journalctl -u $SERVICE" \
        | mail -s "[CRITICAL] $SERVICE restart failed" "$NOTIFY_EMAIL"
    fi
  fi
fi

Why systemctl is-active --quiet and Not Something Else

I've seen people use ps aux | grep nginx for this. Don't. Here's why:

ps aux | grep nginx has a classic gotcha: the grep command itself shows up in the results, because the word "nginx" is in the grep command line. People "fix" this with grep -v grep, which works but is fragile and ugly. You're parsing process tables to answer a question that systemd already tracks natively.

systemctl is-active --quiet "$SERVICE" asks systemd directly: "is this unit in the active state?" The --quiet flag suppresses output and just returns an exit code. 0 means active. Anything else means it's not running. Clean, reliable, no string parsing.
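A minimal sketch of that exit-code check on its own. The helper name service_state is hypothetical and not part of the script above; it just turns the exit code into a word so the behavior is easy to see at a terminal.

```bash
#!/bin/bash
# Hypothetical helper (not part of the script above): maps the exit code of
# `systemctl is-active --quiet` to a word, with no output parsing involved.
service_state() {
  local unit="$1"
  if systemctl is-active --quiet "$unit"; then
    echo "up"      # exit code 0: the unit is active
  else
    echo "down"    # non-zero exit code: e.g. the unit is inactive or failed
  fi
}

# Example: service_state nginx
```

Compare that with the ps aux | grep approach: there is no text to match, so there is nothing for a stray process name to match by accident.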


The Two-Level Failure Check

This isn't just "is it down → restart it." There are two separate failure modes:

Level 1: Is the service running? If yes, print the check mark and exit. Nothing is written to $LOG_FILE, so that file only records outages and recoveries.

Level 2: If the service is down and we try to restart it — did the restart actually work? systemctl start can fail for a dozen reasons: masked unit, broken config file, dependency that's also down, port already in use by something else. The script checks the exit code of the start command and sends a different email depending on whether recovery succeeded or failed.

The [RECOVERED] email means the script fixed it and you can keep sleeping. The [CRITICAL] email means something is actually broken and you need to look at it. That distinction matters at 3 AM.
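One caveat worth hedging against: depending on how the unit is defined, systemctl start can return 0 for a service that starts and then crashes a moment later. A rough sketch of a stricter Level 2 check follows, assuming a short wait is acceptable. The names restart_and_verify and VERIFY_DELAY are hypothetical, and the 5-second default is a guess, not a tested value.

```bash
#!/bin/bash
# Sketch only: restart_and_verify and VERIFY_DELAY are hypothetical names.
VERIFY_DELAY="${VERIFY_DELAY:-5}"   # seconds to wait before re-checking (assumed value)

restart_and_verify() {
  local unit="$1"
  # Did the start command itself succeed?
  sudo systemctl start "$unit" || return 1
  # Give the process a moment to crash if it is going to.
  sleep "$VERIFY_DELAY"
  # Is the unit still active after the delay?
  systemctl is-active --quiet "$unit" || return 2
}

# Usage inside the watchdog:
#   if restart_and_verify "$SERVICE"; then ...send [RECOVERED]... else ...send [CRITICAL]... fi
```

A return code of 1 means the start command failed outright; 2 means it started but did not stay up, which usually points at the application rather than systemd.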


Setting It Up with Cron

crontab -e

Add this line:

* * * * * /home/user/service-watchdog.sh >> /var/log/watchdog-cron.log 2>&1

That runs every single minute. Is that overkill? Maybe. But the script finishes in under 100ms when the service is healthy, and the alternative is 6 hours of downtime on a Saturday night. I'll take the overkill.
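One thing a one-minute schedule can run into: systemctl start waits for the start job to finish, so a slow restart may still be in progress when cron fires the next run. A minimal sketch of a guard against overlapping runs, using flock from util-linux. LOCK_FILE and acquire_lock are hypothetical names, and the /tmp path is an assumption.

```bash
#!/bin/bash
# Sketch only: LOCK_FILE and acquire_lock are hypothetical names.
LOCK_FILE="${LOCK_FILE:-/tmp/service-watchdog.lock}"

acquire_lock() {
  # Open the lock file on file descriptor 9; it stays open until the script exits.
  exec 9>"$LOCK_FILE" || return 1
  # -n = don't wait: fail immediately if another run already holds the lock.
  flock -n 9
}

# At the top of the watchdog script:
#   acquire_lock || exit 0   # a previous run is still busy; skip this minute
```

The lock is released automatically when the script exits and the descriptor closes, so there is no stale lock file to clean up after a crash.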

One gotcha with cron and sudo: cron runs with a minimal environment and no terminal. If sudo systemctl start needs a password, sudo has no terminal to prompt on, so the command fails and the restart never happens. You need a sudoers rule:

sudo visudo
# Add this line (use the path that `command -v systemctl` prints on your system,
# often /usr/bin/systemctl rather than /bin/systemctl):
youruser ALL=(ALL) NOPASSWD: /bin/systemctl start nginx

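Or just run the watchdog from root's crontab (sudo crontab -e). Root can also write to the log files under /var/log, which a regular user usually can't.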
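Either way, it helps to make a missing sudoers rule fail loudly instead of quietly. Here is a sketch using sudo's -n (non-interactive) flag. The wrapper name start_service is hypothetical and not part of the script above.

```bash
#!/bin/bash
# Sketch only: start_service is a hypothetical wrapper, not part of the script above.
start_service() {
  local unit="$1"
  if [ "$(id -u)" -eq 0 ]; then
    # Already root (e.g. running from root's crontab): no sudo needed.
    systemctl start "$unit"
  else
    # -n = non-interactive: sudo exits with an error instead of prompting
    # when a password would be required.
    sudo -n systemctl start "$unit"
  fi
}

# In the watchdog, this would replace the bare call:
#   if start_service "$SERVICE"; then ...
```

With this in place, a missing NOPASSWD rule shows up as a failed restart and a [CRITICAL] email rather than a cron job that never does anything.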


What Else I Watch With This

The SERVICE variable takes any systemd unit name. I run separate copies for:

  • nginx — the web server
  • mysql or mariadb — the database
  • docker — the container daemon
  • Custom services: my-node-app.service, redis-server, postgresql

If you want to watch multiple services in one script, loop through them:

SERVICES=("nginx" "mysql" "redis-server")
for SERVICE in "${SERVICES[@]}"; do
  # ... same check logic ...
done

But I prefer separate scripts per service because the logs stay clean and each one can have a different notification strategy.
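If you do go with the single-script loop, here is one way the "same check logic" placeholder could be filled in. This is a sketch, not the original script: check_service and check_all are hypothetical names, and logging and email are left out to keep it short.

```bash
#!/bin/bash
# Sketch only: check_service and check_all are hypothetical names wrapping the
# same check-then-start logic as the single-service script.
SERVICES=("nginx" "mysql" "redis-server")

check_service() {
  local unit="$1"
  if systemctl is-active --quiet "$unit"; then
    echo "$unit is running"
  elif sudo systemctl start "$unit"; then
    echo "$unit was down and has been restarted"
  else
    echo "$unit FAILED to restart"
    return 1
  fi
}

check_all() {
  local failures=0 unit
  for unit in "${SERVICES[@]}"; do
    # Keep going after a failure so one broken unit doesn't hide the others.
    check_service "$unit" || failures=$((failures + 1))
  done
  [ "$failures" -eq 0 ]
}
```

Calling check_all at the bottom of the script runs the whole list and exits non-zero if any service could not be restarted.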


Pairing This With Other Scripts

This watchdog handles the restart. But if you also want to know why the service died, pair it with:

Between these three scripts, you've got a basic monitoring stack that runs entirely on cron and costs nothing.


Full script, the line-by-line breakdown, cron setup walkthrough, and three more variations:

bashsnippets.xyz/snippets/restart-service-if-stopped.html

If you're managing any Linux server with services that need to stay up, this takes 5 minutes to deploy and runs quietly forever.
