Anguishe

Posted on May 21 • Originally published at bashsnippets.xyz

My Nginx Died at 2 AM and Nobody Noticed for 6 Hours. Now I Have a Watchdog Script.

#productivity #linux #bash #devops

Nginx crashed on a Saturday night. An OOM kill, probably — I was running a Node app that leaked memory like a broken faucet. The service went down at 2:14 AM. I found out at 8:30 AM when I opened my laptop and saw Slack messages from six hours earlier asking why the site was down.

The fix took 10 seconds: sudo systemctl start nginx. The downtime cost me a weekend of credibility.

The thing is, systemctl already knows when a service dies. I just wasn't asking it to check. So I wrote a script that asks every 60 seconds and restarts the service if it's down. Took less time to write than it did to explain the outage to my team.

The Script

#!/bin/bash

CHECK="✓"
CROSS="✗"

# --- Configuration ---
SERVICE="nginx"                              # Change to your service name
LOG_FILE="/var/log/service-watchdog.log"
DATE=$(date '+%Y-%m-%d %H:%M:%S')
NOTIFY_EMAIL=""                              # Optional: you@example.com

# --- Check if service is running ---
if systemctl is-active --quiet "$SERVICE"; then
  echo "$CHECK [$DATE] $SERVICE is running"
else
  echo "$CROSS [$DATE] $SERVICE is NOT running — attempting restart..."
  echo "$CROSS [$DATE] $SERVICE DOWN — restarting" >> "$LOG_FILE"

  # --- Attempt restart ---
  if sudo systemctl start "$SERVICE"; then
    echo "$CHECK [$DATE] $SERVICE restarted successfully" | tee -a "$LOG_FILE"

    # --- Optional: send email notification ---
    if [ -n "$NOTIFY_EMAIL" ]; then
      echo "$SERVICE was down and has been restarted on $(hostname) at $DATE" \
        | mail -s "[RECOVERED] $SERVICE restarted" "$NOTIFY_EMAIL"
    fi
  else
    echo "$CROSS [$DATE] $SERVICE FAILED to restart — manual intervention needed" \
      | tee -a "$LOG_FILE"

    if [ -n "$NOTIFY_EMAIL" ]; then
      echo "$SERVICE failed to restart on $(hostname) at $DATE. Check: journalctl -u $SERVICE" \
        | mail -s "[CRITICAL] $SERVICE restart failed" "$NOTIFY_EMAIL"
    fi
  fi
fi
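One detail in the script above: DATE is captured once at the top, so the "restarted successfully" line carries the time the check began, not the time the restart finished. A minimal sketch of a logging helper that stamps each line when it is written follows. The log function name is hypothetical and not part of the original script.

```bash
#!/bin/bash
# Hypothetical helper, not part of the original script.
# Falls back to the same log path the article uses if LOG_FILE is unset.
LOG_FILE="${LOG_FILE:-/var/log/service-watchdog.log}"

# log MARK MESSAGE...
# Prints one timestamped line to stdout and appends the same line to LOG_FILE.
# The timestamp is taken at call time, not at script start.
log() {
  local mark="$1"
  shift
  printf '%s [%s] %s\n' "$mark" "$(date '+%Y-%m-%d %H:%M:%S')" "$*" \
    | tee -a "$LOG_FILE"
}
```

With this in place, a line such as log "$CROSS" "$SERVICE DOWN, restarting" would replace the separate echo and tee calls for incident messages.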

Why `systemctl is-active --quiet` and Not Something Else

I've seen people use ps aux | grep nginx for this. Don't. Here's why:

ps aux | grep nginx has a classic gotcha: the grep command itself shows up in the results because the word "nginx" is in its own command line. People "fix" this with grep -v grep, which works but is fragile and ugly. Worse, a match only tells you that some process with nginx in its command line exists, not that systemd considers the unit healthy. You're parsing process tables to answer a question that systemd already tracks natively.

systemctl is-active --quiet "$SERVICE" asks systemd directly: "is this unit in the active state?" The --quiet flag suppresses output and just returns an exit code. 0 means active. Anything else means the unit is not active, whether it is stopped, failed, or still starting. Clean, reliable, no string parsing.
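As a rough illustration of the exit-code approach, the sketch below wraps the check in two small functions. The names is_running and describe_unit are made up for this example. The second call to systemctl is-active, without --quiet, is only there to fetch the state word for a human-readable message.

```bash
#!/bin/bash
# Hypothetical helpers for illustration; only systemctl is-active is real.

# Returns 0 if systemd reports the unit as active, non-zero otherwise.
is_running() {
  systemctl is-active --quiet "$1"
}

# Prints a one-line summary of the unit's state.
describe_unit() {
  local unit="$1"
  local state
  if is_running "$unit"; then
    echo "$unit: active"
  else
    # Without --quiet, is-active prints the state word (inactive, failed, ...)
    # and still exits non-zero, hence the "|| true".
    state=$(systemctl is-active "$unit" 2>/dev/null) || true
    echo "$unit: not active (${state:-unknown})"
  fi
}
```

Calling describe_unit nginx prints a single line, and the decision itself never depends on parsing text: the branch is chosen by the exit code alone.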

The Two-Level Failure Check

This isn't just "is it down → restart it." There are two separate failure modes:

Level 1: Is the service running? If yes, print the check mark and exit. Nothing is appended to LOG_FILE, so that log only ever records incidents.

Level 2: If the service is down and we try to restart it — did the restart actually work? systemctl start can fail for a dozen reasons: masked unit, broken config file, dependency that's also down, port already in use by something else. The script checks the exit code of the start command and sends a different email depending on whether recovery succeeded or failed.

The [RECOVERED] email means the script fixed it and you can keep sleeping. The [CRITICAL] email means something is actually broken and you need to look at it. That distinction matters at 3 AM.
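The two levels can also be expressed as distinct return codes, which makes the outcome easy to branch on. The sketch below is one possible shape, not the original script: check_and_recover and VERIFY_DELAY are hypothetical names, and the extra is-active check after the start is an addition that assumes you want to catch a service that starts and then exits right away.

```bash
#!/bin/bash
# Hypothetical sketch; return codes are this example's own convention:
#   0 = already running, 1 = was down and recovered, 2 = restart failed.

VERIFY_DELAY="${VERIFY_DELAY:-2}"   # seconds to wait before re-checking

check_and_recover() {
  local unit="$1"

  # Level 1: nothing to do if the unit is active.
  if systemctl is-active --quiet "$unit"; then
    return 0
  fi

  # Level 2: try to start it, then confirm it is still active shortly after.
  if sudo systemctl start "$unit"; then
    sleep "$VERIFY_DELAY"
    if systemctl is-active --quiet "$unit"; then
      return 1
    fi
  fi
  return 2
}
```

A caller can then map 1 to the [RECOVERED] email and 2 to the [CRITICAL] one, for example with rc=0; check_and_recover nginx || rc=$? followed by a case statement on $rc.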

Setting It Up with Cron

crontab -e

Add this line:

* * * * * /home/user/service-watchdog.sh >> /var/log/watchdog-cron.log 2>&1

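That runs every single minute. Is that overkill? Maybe. But the script finishes in under 100ms when the service is healthy, and the alternative is 6 hours of downtime on a Saturday night. I'll take the overkill. The one cost is that the redirect appends a status line to watchdog-cron.log on every run, so rotate that file or drop the redirect if you only want incident records.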

One gotcha with cron and sudo: cron runs with a minimal environment and no terminal. If sudo systemctl start needs a password, sudo has no way to prompt for it, so the command fails and the service never gets restarted. You need a sudoers rule:

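sudo visudo
# Add this line. Use the path that "command -v systemctl" prints on your
# system; on many distros it is /usr/bin/systemctl rather than /bin/systemctl.
youruser ALL=(ALL) NOPASSWD: /bin/systemctl start nginx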

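Or just run the watchdog from root's crontab (sudo crontab -e). That also lets it write to the log files under /var/log, which an unprivileged user normally cannot do.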
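Before trusting the cron job, it can help to confirm that the restart command really is allowed without a password. The following is a hedged sketch of such a pre-flight check. The function name can_start_without_password is hypothetical. It relies on sudo's -n flag (never prompt) combined with -l followed by a command, which asks the sudo policy whether that command is permitted without running it.

```bash
#!/bin/bash
# Hypothetical pre-flight check, not part of the original watchdog.

# Succeeds only if "systemctl start UNIT" is permitted via sudo
# without a password prompt. Nothing is actually started.
can_start_without_password() {
  local unit="$1"
  local systemctl_path
  # Resolve the real path so the check matches what sudoers would see.
  systemctl_path=$(command -v systemctl) || return 1
  # -n: fail instead of prompting; -l CMD: query the policy for CMD.
  sudo -n -l "$systemctl_path" start "$unit" >/dev/null 2>&1
}
```

Running can_start_without_password nginx || echo "sudoers rule missing" from the same user account that owns the crontab gives a quick yes or no before the first real outage does.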

What Else I Watch With This

The SERVICE variable takes any systemd unit name. I run separate copies for:

nginx — the web server
mysql or mariadb — the database
docker — the container daemon
Other units: my-node-app.service (a custom service), redis-server, postgresql

If you want to watch multiple services in one script, loop through them:

SERVICES=("nginx" "mysql" "redis-server")
for SERVICE in "${SERVICES[@]}"; do
  # ... same check logic ...
done
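Filled in, the loop might look like the sketch below. check_service and check_all are hypothetical names. The body is a trimmed version of the single-service logic, with logging and email left out to keep it short.

```bash
#!/bin/bash
# Sketch of the multi-service loop. check_service is a hypothetical wrapper
# around the same check-then-start logic as the single-service script.

SERVICES=("nginx" "mysql" "redis-server")

# Checks one unit and tries to start it if it is not active.
check_service() {
  local service="$1"
  if systemctl is-active --quiet "$service"; then
    echo "✓ $service is running"
  elif sudo systemctl start "$service"; then
    echo "✓ $service restarted successfully"
  else
    echo "✗ $service FAILED to restart"
    return 1
  fi
}

# Checks every unit passed as an argument.
# Returns non-zero if at least one unit could not be restarted.
check_all() {
  local failures=0
  local service
  for service in "$@"; do
    # Keep going after a failure so one broken unit does not hide the others.
    check_service "$service" || failures=$((failures + 1))
  done
  [ "$failures" -eq 0 ]
}

# Usage: check_all "${SERVICES[@]}"
```

The loop deliberately continues after a failed restart, so a broken database does not stop the web server from being checked.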

But I prefer separate scripts per service because the logs stay clean and each one can have a different notification strategy.

Pairing This With Other Scripts

This watchdog handles the restart. But if you also want to know why the service died, pair it with:

Monitor CPU & RAM Usage — catches the OOM conditions that kill services in the first place
Send Email Alert from Bash — the email sending setup if you've never configured mail on Linux

Between these three scripts, you've got a basic monitoring stack that runs entirely on cron and costs nothing.

Full script, the line-by-line breakdown, cron setup walkthrough, and three more variations:

bashsnippets.xyz/snippets/restart-service-if-stopped.html

If you're managing any Linux server with services that need to stay up, this takes 5 minutes to deploy and runs quietly forever.
