Nginx crashed on a Saturday night. An OOM kill, probably — I was running a Node app that leaked memory like a broken faucet. The service went down at 2:14 AM. I found out at 8:30 AM when I opened my laptop and saw Slack messages from six hours earlier asking why the site was down.
The fix took 10 seconds: sudo systemctl start nginx. The downtime cost me a weekend of credibility.
The thing is, systemctl already knows when a service dies. I just wasn't asking it to check. So I wrote a script that asks every 60 seconds and restarts the service if it's down. Took less time to write than it did to explain the outage to my team.
The Script
#!/bin/bash
CHECK="✓"
CROSS="✗"

# --- Configuration ---
SERVICE="nginx"                          # Change to your service name
LOG_FILE="/var/log/service-watchdog.log"
DATE=$(date '+%Y-%m-%d %H:%M:%S')
NOTIFY_EMAIL=""                          # Optional: you@example.com

# --- Check if service is running ---
if systemctl is-active --quiet "$SERVICE"; then
    echo "$CHECK [$DATE] $SERVICE is running"
else
    echo "$CROSS [$DATE] $SERVICE is NOT running — attempting restart..."
    echo "$CROSS [$DATE] $SERVICE DOWN — restarting" >> "$LOG_FILE"

    # --- Attempt restart ---
    if sudo systemctl start "$SERVICE"; then
        echo "$CHECK [$DATE] $SERVICE restarted successfully" | tee -a "$LOG_FILE"

        # --- Optional: send email notification ---
        if [ -n "$NOTIFY_EMAIL" ]; then
            echo "$SERVICE was down and has been restarted on $(hostname) at $DATE" \
                | mail -s "[RECOVERED] $SERVICE restarted" "$NOTIFY_EMAIL"
        fi
    else
        echo "$CROSS [$DATE] $SERVICE FAILED to restart — manual intervention needed" \
            | tee -a "$LOG_FILE"
        if [ -n "$NOTIFY_EMAIL" ]; then
            echo "$SERVICE failed to restart on $(hostname) at $DATE. Check: journalctl -u $SERVICE" \
                | mail -s "[CRITICAL] $SERVICE restart failed" "$NOTIFY_EMAIL"
        fi
    fi
fi
Why systemctl is-active --quiet and Not Something Else
I've seen people use ps aux | grep nginx for this. Don't. Here's why:
ps aux | grep nginx has a classic gotcha — the grep command itself shows up in the results because the word "nginx" is in the grep command line. People "fix" this with grep -v grep which works but is fragile and ugly. You're parsing process tables to answer a question that systemd already tracks natively.
systemctl is-active --quiet "$SERVICE" asks systemd directly: "is this unit in the active state?" The --quiet flag suppresses output and just returns an exit code. 0 means active. Anything else means it's not running. Clean, reliable, no string parsing.
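To make that exit-code contract concrete, here's a minimal sketch that wraps it in a tiny helper. The function name service_status and the SYSTEMCTL override variable are my own assumptions for illustration, not part of the script above; the override only exists so you can try the logic on a machine without systemd.

```shell
#!/bin/bash
# SYSTEMCTL is a hypothetical override; it defaults to the real systemctl.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Print "up" when the unit is active, "down" for any non-zero exit code.
service_status() {
    local service="$1"
    if "$SYSTEMCTL" is-active --quiet "$service"; then
        echo "up"
    else
        echo "down"
    fi
}
```

Calling service_status nginx prints a single word you can branch on, with no text parsing of process listings involved.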
The Two-Level Failure Check
This isn't just "is it down → restart it." There are two separate failure modes:
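Level 1: Is the service running? If yes, print the check mark to stdout and exit. Nothing is appended to LOG_FILE on the healthy path, so that file only ever records incidents. (The cron line below does capture the stdout line in watchdog-cron.log once a minute, so rotate that file or discard the output if the noise bothers you.)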
Level 2: If the service is down and we try to restart it — did the restart actually work? systemctl start can fail for a dozen reasons: masked unit, broken config file, dependency that's also down, port already in use by something else. The script checks the exit code of the start command and sends a different email depending on whether recovery succeeded or failed.
The [RECOVERED] email means the script fixed it and you can keep sleeping. The [CRITICAL] email means something is actually broken and you need to look at it. That distinction matters at 3 AM.
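If you'd rather have other tooling react to those two levels, one option is to fold them into distinct exit codes. The sketch below is an assumption-heavy variation, not the script above: the function name check_and_recover, the SYSTEMCTL override, and the 0/1/2 code mapping are all hypothetical, it assumes it runs as root (no sudo), and it adds a second is-active check after the start, which the original script does not do.

```shell
#!/bin/bash
# SYSTEMCTL is a hypothetical override; it defaults to the real systemctl.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Returns 0 = already healthy, 1 = was down and recovered, 2 = still down.
check_and_recover() {
    local service="$1"

    # Level 1: nothing to do if the unit is already active.
    if "$SYSTEMCTL" is-active --quiet "$service"; then
        return 0
    fi

    # Level 2: try to start it, then confirm it is actually active.
    if "$SYSTEMCTL" start "$service" \
        && "$SYSTEMCTL" is-active --quiet "$service"; then
        return 1
    fi

    return 2
}
```

A caller can then map 1 to the [RECOVERED] path and 2 to the [CRITICAL] path without re-running any checks.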
Setting It Up with Cron
crontab -e
Add this line:
* * * * * /home/user/service-watchdog.sh >> /var/log/watchdog-cron.log 2>&1
That runs every single minute. Is that overkill? Maybe. But the script finishes in under 100ms when the service is healthy, and the alternative is 6 hours of downtime on a Saturday night. I'll take the overkill.
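If you provision servers from a script, editing the crontab by hand gets old. Here's a rough sketch of adding the entry non-interactively without duplicating it on repeat runs. The helper name with_watchdog_entry is hypothetical, and the path in CRON_LINE is just the one from the example above.

```shell
#!/bin/bash
# The cron entry from the article; adjust the script path for your machine.
CRON_LINE='* * * * * /home/user/service-watchdog.sh >> /var/log/watchdog-cron.log 2>&1'

# Print the given crontab text with CRON_LINE present exactly once.
with_watchdog_entry() {
    local current="$1"
    if printf '%s\n' "$current" | grep -qxF -- "$CRON_LINE"; then
        # Entry already installed: pass the crontab through unchanged.
        printf '%s\n' "$current"
    else
        if [ -n "$current" ]; then
            printf '%s\n' "$current"
        fi
        printf '%s\n' "$CRON_LINE"
    fi
}

# To install for the current user (not run here):
#   with_watchdog_entry "$(crontab -l 2>/dev/null)" | crontab -
```

Because the function only transforms text, you can inspect its output before piping it into crontab.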
One gotcha with cron and sudo: cron runs with a minimal environment and no terminal. If sudo systemctl start needs a password, sudo has nowhere to prompt for one, so the start fails and the service stays down. You need a sudoers rule:
sudo visudo
# Add this line (the path must match what `command -v systemctl` prints on your system):
youruser ALL=(ALL) NOPASSWD: /bin/systemctl start nginx
Or just run the watchdog cron as root.
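If you stay with the sudoers route, sudo's -n (non-interactive) flag is worth adding: it makes sudo exit with an error instead of asking for a password, so a missing rule shows up as a clear failure in the log. A minimal sketch, where the function name start_unattended and the SUDO override variable are hypothetical names of mine:

```shell
#!/bin/bash
# SUDO is a hypothetical override; it defaults to the real sudo.
SUDO="${SUDO:-sudo}"

# Start a unit without ever prompting; report failure on stderr.
start_unattended() {
    local service="$1"
    if "$SUDO" -n systemctl start "$service"; then
        echo "started $service"
    else
        echo "could not start $service (sudoers rule missing or start failed)" >&2
        return 1
    fi
}
```

Swapping this in for the plain sudo systemctl start call keeps the rest of the script's success and failure branches unchanged.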
What Else I Watch With This
The SERVICE variable takes any systemd unit name. I run separate copies for:
- nginx: the web server
- mysql or mariadb: the database
- docker: the container daemon
- Custom services: my-node-app.service, redis-server, postgresql
If you want to watch multiple services in one script, loop through them:
SERVICES=("nginx" "mysql" "redis-server")
for SERVICE in "${SERVICES[@]}"; do
    # ... same check logic ...
done
But I prefer separate scripts per service because the logs stay clean and each one can have a different notification strategy.
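If you do go the single-script route, you can still keep the logs separate. Here's one possible shape, sketched under assumptions: the function name watch_all, the SYSTEMCTL and LOG_DIR override variables, and the watchdog-<service>.log naming are all hypothetical, and it assumes the script runs as root so no sudo is needed.

```shell
#!/bin/bash
# Hypothetical overrides; defaults point at the real systemctl and /var/log.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"
LOG_DIR="${LOG_DIR:-/var/log}"

# Check each named unit, restart the ones that are down, and log per service.
# The return value is the number of services that could not be restarted.
watch_all() {
    local service failures=0
    for service in "$@"; do
        # Healthy services produce no log output at all.
        if "$SYSTEMCTL" is-active --quiet "$service"; then
            continue
        fi
        if "$SYSTEMCTL" start "$service"; then
            echo "$(date '+%F %T') $service restarted" \
                >> "$LOG_DIR/watchdog-$service.log"
        else
            echo "$(date '+%F %T') $service FAILED to restart" \
                >> "$LOG_DIR/watchdog-$service.log"
            failures=$((failures + 1))
        fi
    done
    return "$failures"
}
```

You'd call it as watch_all nginx mysql redis-server from cron. A failure in one service doesn't stop the loop from checking the rest.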
Pairing This With Other Scripts
This watchdog handles the restart. But if you also want to know why the service died, pair it with:
- Monitor CPU & RAM Usage: catches the OOM conditions that kill services in the first place
- Send Email Alert from Bash: the email sending setup if you've never configured mail on Linux
Between these three scripts, you've got a basic monitoring stack that runs entirely on cron and costs nothing.
Full script, the line-by-line breakdown, cron setup walkthrough, and three more variations:
bashsnippets.xyz/snippets/restart-service-if-stopped.html
If you're managing any Linux server with services that need to stay up, this takes 5 minutes to deploy and runs quietly forever.