DEV Community

Sandro 🦖☄️

Stop Over-Engineering: A 100-line bash script that saved my servers

We've all been there. Your website goes down at 3 AM. MySQL crashed. NGINX stopped responding. And you're scrambling to SSH into the server while your phone buzzes with angry customer emails.

Then someone suggests: "You should use Prometheus + Grafana + Alertmanager + PagerDuty!"

Sure. Or... hear me out... you could just use a 100-line bash script that checks your sites every minute and restarts services automatically when they fail.

The Problem with Enterprise Monitoring

Don't get me wrong - tools like Datadog, New Relic, and Prometheus are amazing. But they're also:

  • 🎯 Overkill for small projects
  • 💰 Expensive for startups
  • 🧩 Complex to set up and maintain
  • 🐌 Slow to deploy (days/weeks of configuration)
  • 📚 Require learning new query languages and dashboards

Meanwhile, your website is still down.

Enter: The 100-Line Solution

What if monitoring could be this simple?

# 1. Add your websites
echo "https://example.com" >> sites.txt

# 2. Install
sudo ./install.sh

# 3. Done. Seriously.

That's it. Every minute, your server now:

  1. ✅ Checks if your websites respond
  2. 🔍 Detects if services are overwhelmed (not just down!)
  3. 🔧 Automatically restarts MySQL, NGINX, or Apache
  4. 📝 Logs only failures (no disk space waste)
  5. 🔄 Tracks failure counts to avoid false positives
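To make that loop concrete, here is a rough sketch of what the per-minute check could look like. It is not the project's actual monitor.sh: the function names (`http_status`, `check_all_sites`), the rule that 2xx/3xx counts as healthy, and the skipping of `#` comment lines in sites.txt are all assumptions for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: function names, the 2xx/3xx rule, and comment skipping are assumptions.

TIMEOUT="${TIMEOUT:-10}"   # seconds before a request counts as failed

# Print the HTTP status code for a URL; curl prints 000 when no response arrives.
http_status() {
    curl -s -o /dev/null -w '%{http_code}' --max-time "$TIMEOUT" "$1"
}

# Read URLs from a file (one per line) and print "OK <url>" or "FAIL <url> <code>".
check_all_sites() {
    local sites_file=$1 url code
    while IFS= read -r url || [[ -n "$url" ]]; do
        # Skip blank lines and lines starting with '#'
        [[ -z "$url" || "$url" == \#* ]] && continue
        code=$(http_status "$url")
        if [[ "$code" =~ ^[23][0-9][0-9]$ ]]; then
            echo "OK $url"
        else
            echo "FAIL $url $code"
        fi
    done < "$sites_file"
}
```

Calling `check_all_sites sites.txt` from cron once a minute gives you one line per site, which the rest of the script can turn into failure counts and recovery actions.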

How It Works (The Smart Part)

A plain process check only tells you whether a service is "running." That's not enough.

Here's what makes this script intelligent:

1. Load-Based Detection

# Don't just check if MySQL is running...
# Check if it's actually RESPONSIVE
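check_mysql_health() {
    local current_connections

    # Try to ping MySQL; if it doesn't answer within 3 seconds, it's unhealthy
    timeout 3 mysqladmin ping >/dev/null 2>&1 || return 1

    # It's alive! But is it overwhelmed?
    # "Threads" in `mysqladmin status` is the number of open client connections
    current_connections=$(mysqladmin status | grep -oP 'Threads: \K\d+')

    if [[ "$current_connections" -gt 150 ]]; then
        # Too many connections - restart before it crashes
        return 1
    fi

    return 0  # Responsive and under the threshold
}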

Your site can be down even when services show as "running" - when they're overloaded with traffic or locked up processing queries.
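If you want to see the overload decision in isolation, the parsing and threshold test can be pulled into two small helpers. This is a sketch, not the project's code: `parse_mysql_threads` and `mysql_is_overloaded` are hypothetical names, and the input is assumed to be the single status line that `mysqladmin status` prints.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, not taken from the project.

MYSQL_MAX_CONNECTIONS="${MYSQL_MAX_CONNECTIONS:-150}"

# Extract the "Threads: N" value from one line of `mysqladmin status` output.
# Prints nothing and returns 1 if the field is missing.
parse_mysql_threads() {
    local status_line=$1
    if [[ "$status_line" =~ Threads:[[:space:]]*([0-9]+) ]]; then
        echo "${BASH_REMATCH[1]}"
    else
        return 1
    fi
}

# Return 0 when the thread count exceeds the threshold,
# 1 when it is within the threshold, and 2 when the line cannot be parsed.
mysql_is_overloaded() {
    local threads
    threads=$(parse_mysql_threads "$1") || return 2
    (( threads > MYSQL_MAX_CONNECTIONS ))
}
```

Something like `mysql_is_overloaded "$(mysqladmin status)"` would then be the whole "is it drowning?" test. Keeping the parser separate also means you can try it against a pasted status line without touching a live database.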

2. Advanced Health Checks

# NGINX example: Test config + connectivity + load
check_nginx_health() {
    local active_conn

    # 1. Validate config before trying to use it
    nginx -t 2>/dev/null || return 1

    # 2. Can it accept connections?
    timeout 2 bash -c "echo > /dev/tcp/localhost/80" || return 1

    # 3. Is it drowning in connections?
    #    (requires the stub_status module to be exposed at /nginx_status)
    active_conn=$(curl -s --max-time 2 http://localhost/nginx_status | grep -oP 'Active connections: \K\d+')
    [[ "$active_conn" -gt 1000 ]] && return 1

    return 0  # All good!
}
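The same pattern should carry over to Apache, which the config covers with a busy-workers threshold. Below is a hedged sketch, assuming mod_status is enabled and its machine-readable page is reachable at `http://localhost/server-status?auto`. The URL variable and both function names are assumptions, not part of the project.

```bash
#!/usr/bin/env bash
# Sketch only: assumes mod_status is enabled and reachable at the URL below.

APACHE_MAX_WORKERS="${APACHE_MAX_WORKERS:-150}"
APACHE_STATUS_URL="${APACHE_STATUS_URL:-http://localhost/server-status?auto}"  # assumed location

# Pull the BusyWorkers value out of mod_status "?auto" output read from stdin.
# Exits non-zero if the field is not present.
parse_busy_workers() {
    awk -F': *' '$1 == "BusyWorkers" { print $2; found = 1 } END { exit !found }'
}

check_apache_health() {
    local busy
    # 1. Can we fetch and parse the status page at all?
    busy=$(curl -s --max-time 2 "$APACHE_STATUS_URL" | parse_busy_workers) || return 1
    # 2. Is it running out of workers?
    (( busy <= APACHE_MAX_WORKERS ))
}
```

A failed fetch is treated as unhealthy here, which is a design choice: if the status page stops answering, the web server is probably not serving customers either.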

3. Smart Recovery Logic

# Only restart after 3 consecutive failures (avoid false positives)
if [[ "$current_failures" -ge 3 ]]; then
    # Restart services in order: Database first, then web server
    for service in "${SERVICES[@]}"; do
        systemctl restart "$service"
    done
fi
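Where does `$current_failures` come from? Cron starts a fresh process every minute, so the count has to live on disk. One way to do it, sketched under assumptions: the state directory, the file naming, and the function names below are hypothetical and may differ from what the project does.

```bash
#!/usr/bin/env bash
# Sketch only: the state directory and file naming are assumptions.

STATE_DIR="${STATE_DIR:-/var/lib/site-monitor}"   # hypothetical location
FAILURE_THRESHOLD="${FAILURE_THRESHOLD:-3}"

# Map a URL to a state file name by replacing anything non-alphanumeric.
state_file_for() {
    echo "$STATE_DIR/${1//[^A-Za-z0-9]/_}.count"
}

# Record one check result ("ok" or "fail") and print the new consecutive-failure count.
record_result() {
    local url=$1 result=$2 file count=0
    file=$(state_file_for "$url")
    mkdir -p "$STATE_DIR"
    [[ -f "$file" ]] && count=$(<"$file")
    if [[ "$result" == "ok" ]]; then
        count=0                  # any success resets the streak
    else
        count=$((count + 1))
    fi
    echo "$count" > "$file"
    echo "$count"
}

# Return 0 when the streak has reached the threshold and recovery should run.
needs_recovery() {
    (( $1 >= FAILURE_THRESHOLD ))
}
```

Resetting to zero on any success is what makes the failures "consecutive": a single slow response at 3 AM never triggers a restart on its own.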

Real-World Example

Let's say your e-commerce site suddenly gets featured on Reddit (congrats! 🎉). Traffic spikes 10x:

Traditional Monitoring:

  • 📊 Dashboards show high CPU/memory
  • 🚨 Alerts fire
  • 👨‍💻 You get paged
  • ⏰ You wake up, investigate, manually restart services
  • 💸 Lost sales during downtime

This Script:

  • 🔍 Detects MySQL has 200 active connections (threshold: 150)
  • 🤖 Automatically restarts MySQL after 3 consecutive failed checks (the restart itself takes seconds)
  • 📝 Logs: "MySQL OVERLOADED (200 connections) - restarted"
  • 😴 You stay asleep
  • 💰 Sales continue

Installation (Seriously, It's This Easy)

# 1. Clone the repo
git clone https://github.com/sgumz/site-monitor.git
cd site-monitor

# 2. Add your websites
cat > sites.txt << EOF
https://example.com
https://api.example.com
https://www.example.com
EOF

# 3. Optional: Customize thresholds
vim config.conf  # Adjust MySQL/NGINX/Apache thresholds

# 4. Install (creates cron job, sets up logging)
sudo ./install.sh

# 5. Watch it work
sudo tail -f /var/log/site-monitor/monitor.log
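Curious what "creates cron job" might involve? Here is a minimal sketch of that one step. The real install.sh may do it differently: the function name, the cron file name, and the install path in the example are all assumptions.

```bash
#!/usr/bin/env bash
# Sketch only: the install path and cron file name are assumptions.

# Write a cron.d entry that runs the monitor every minute as root.
# $1 = path of the cron file to create, $2 = absolute path to monitor.sh
write_cron_entry() {
    local cron_file=$1 script_path=$2
    cat > "$cron_file" <<EOF
# Run the site monitor once per minute
* * * * * root $script_path >/dev/null 2>&1
EOF
    chmod 644 "$cron_file"
}

# Example (hypothetical paths):
#   write_cron_entry /etc/cron.d/site-monitor /opt/site-monitor/monitor.sh
```

Files in /etc/cron.d need the extra user column (`root` here), which is the main difference from a personal crontab line.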

Output:

[2025-10-20 14:23:45] FAILURE: https://example.com - HTTP 000 (1/3 failures)
[2025-10-20 14:24:45] FAILURE: https://example.com - HTTP 000 (2/3 failures)
[2025-10-20 14:25:45] FAILURE: https://example.com - HTTP 000 (3/3 failures)
[2025-10-20 14:25:46] RECOVERY: Starting recovery for https://example.com
[2025-10-20 14:25:47] RECOVERY: MySQL OVERLOADED (187 connections) - restarted
[2025-10-20 14:25:49] RECOVERY: NGINX responsive - no action needed
[2025-10-20 14:25:50] RECOVERY: Recovery completed
[2025-10-20 14:26:45] SUCCESS: https://example.com back online (HTTP 200)
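Log lines in that format are easy to produce with a tiny helper. The sketch below assumes the log path from the installation step; the function name `log_event` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: the helper name is illustrative; the log path is the one tailed above.

LOG_FILE="${LOG_FILE:-/var/log/site-monitor/monitor.log}"

# Append one line in the "[timestamp] LEVEL: message" format.
log_event() {
    local level=$1 message=$2
    printf '[%s] %s: %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$level" "$message" >> "$LOG_FILE"
}
```

For example, `log_event FAILURE "https://example.com - HTTP 000 (1/3 failures)"` appends a line shaped like the first one in the output above.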

Configuration Options

Everything is configurable in config.conf:

# HTTP Settings
TIMEOUT=10                    # Request timeout
FAILURE_THRESHOLD=3           # Failures before recovery

# Services to manage (in order)
SERVICES=("mysql" "nginx")    # Or: ("mysql" "apache2")

# Load Thresholds
MYSQL_MAX_CONNECTIONS=150     # Restart if connections exceed this
NGINX_MAX_CONNECTIONS=1000    # Restart if connections exceed this
APACHE_MAX_WORKERS=150        # Restart if busy workers exceed this

# Logging
LOG_SUCCESS=false             # Only log failures (save disk space)
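Because config.conf is plain bash syntax, loading it can be as simple as sourcing the file and filling in defaults for anything missing. A sketch under that assumption: the `load_config` function is hypothetical, and the fallback values just mirror the listing above.

```bash
#!/usr/bin/env bash
# Sketch only: the loader function is an assumption; defaults mirror the listing above.

load_config() {
    local config_file=$1
    # config.conf is plain bash, so sourcing it sets the variables directly.
    # Only source files you trust: anything in them runs as the monitor's user.
    [[ -r "$config_file" ]] && source "$config_file"

    # Fall back to defaults for anything the file did not set.
    : "${TIMEOUT:=10}"
    : "${FAILURE_THRESHOLD:=3}"
    : "${MYSQL_MAX_CONNECTIONS:=150}"
    : "${NGINX_MAX_CONNECTIONS:=1000}"
    : "${APACHE_MAX_WORKERS:=150}"
    : "${LOG_SUCCESS:=false}"
    if [[ -z "${SERVICES+x}" ]]; then SERVICES=("mysql" "nginx"); fi
}
```

With this approach a config file containing only `TIMEOUT=5` still yields a complete set of settings, so you can keep config.conf short.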

When to Use This vs. Enterprise Tools

Use This Simple Script When:

  • 🎯 You have < 50 websites to monitor
  • 💰 You're on a budget (it's free!)
  • ⚡ You need it deployed TODAY
  • 🔧 You manage your own Ubuntu servers
  • 🎓 You want to understand what's happening (no black box)

Use Enterprise Tools When:

  • 📊 You need fancy dashboards and metrics
  • 🌍 You have distributed microservices
  • 👥 You have a dedicated DevOps team
  • 💼 You need compliance/audit trails
  • 🔗 You need integration with 50+ other tools

Performance & Resource Usage

This script is incredibly lightweight:

  • CPU: Near zero (runs for ~1 second per minute)
  • Memory: ~5MB
  • Disk: <1MB logs per month (with default settings)
  • Network: One HTTP GET per site per minute

Compare that to running Prometheus + Grafana (hundreds of MB of RAM).

Production-Ready Features

Don't let the simplicity fool you - this runs in production:

✅ State Tracking: Counts consecutive failures per site
✅ Log Rotation: Yearly rotation via logrotate
✅ Error Handling: Graceful failures, timeout protection
✅ No Dependencies: Just bash + curl + systemctl (already on Ubuntu)
✅ Tested: Works on Ubuntu 22.04 LTS
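For the log rotation piece, a yearly logrotate rule could look like the sketch below. Only `yearly` comes from the feature list; the retention count, compression, the destination file name, and the function name are assumptions.

```bash
#!/usr/bin/env bash
# Sketch only: everything except "yearly" is an assumption.

# Write a logrotate rule for the monitor's logs.
# $1 = destination file, e.g. /etc/logrotate.d/site-monitor (hypothetical name)
write_logrotate_config() {
    local dest=$1
    cat > "$dest" <<'EOF'
/var/log/site-monitor/*.log {
    yearly
    rotate 2
    compress
    missingok
    notifempty
}
EOF
}
```

Since only failures are logged, a yearly cycle keeps the file small without losing the history you would want when reviewing an incident.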

Advanced Use Cases

Multi-Server Deployment

Deploy to multiple servers with different site lists:

# Server 1: Monitor frontend sites
echo "https://app.example.com" > sites.txt

# Server 2: Monitor API endpoints
echo "https://api.example.com" > sites.txt

# Server 3: Monitor admin tools
echo "https://admin.example.com" > sites.txt

Custom Services

Not just MySQL/NGINX! Add any systemd service:

# Add Redis, PHP-FPM, whatever you need
SERVICES=("mysql" "nginx" "redis-server" "php8.1-fpm")
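Before trusting a longer service list in production, it helps to see what would be restarted and in which order. Here is a sketch of an ordered restart with a dry-run switch. `DRY_RUN` and `restart_services` are hypothetical additions, not options the project documents.

```bash
#!/usr/bin/env bash
# Sketch only: DRY_RUN and the function name are assumptions for illustration.

SERVICES=("mysql" "nginx" "redis-server" "php8.1-fpm")
DRY_RUN="${DRY_RUN:-0}"

# Restart each service in array order; return non-zero if any restart fails.
restart_services() {
    local service failed=0
    for service in "${SERVICES[@]}"; do
        if [[ "$DRY_RUN" == "1" ]]; then
            echo "would run: systemctl restart $service"
        elif ! systemctl restart "$service"; then
            echo "restart failed: $service" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

Running it once with `DRY_RUN=1` prints the planned commands without touching anything, which is a cheap way to catch a typo in a unit name.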

Integration with Existing Tools

Still want Slack notifications? Just add a webhook:

# In monitor.sh, add right after the line that logs a failure:
curl -s -X POST -H 'Content-Type: application/json' "YOUR_SLACK_WEBHOOK" \
  -d "{\"text\":\"🚨 $url is down! Auto-recovering...\"}"
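One caveat with hand-built JSON: if the message ever contains a double quote or a backslash, the payload breaks. A small escaping helper avoids that. This is a sketch with hypothetical function names, and `SLACK_WEBHOOK_URL` is a variable you would set to your own webhook. It handles quotes, backslashes, and newlines only, not every control character.

```bash
#!/usr/bin/env bash
# Sketch only: function names are illustrative; SLACK_WEBHOOK_URL must be your own webhook.

# Escape backslashes, double quotes, and newlines so the text is safe inside a JSON string.
json_escape() {
    local s=$1
    s=${s//\\/\\\\}        # \  ->  \\
    s=${s//\"/\\\"}        # "  ->  \"
    s=${s//$'\n'/\\n}      # newline -> \n
    printf '%s' "$s"
}

# Build the JSON body for a Slack incoming webhook.
slack_payload() {
    printf '{"text":"%s"}' "$(json_escape "$1")"
}

# Send the message; the timeout keeps a slow Slack from stalling the monitor.
notify_slack() {
    curl -s -X POST -H 'Content-Type: application/json' \
        --max-time 5 --data "$(slack_payload "$1")" "$SLACK_WEBHOOK_URL" >/dev/null
}
```

Usage would be something like `notify_slack "$url is down! Auto-recovering..."` at the point where the script logs a failure.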

The Philosophy: Simple > Complex

This project follows the Unix philosophy:

  • Do one thing well
  • Use plain text for data
  • Build small, composable tools

Your monitoring doesn't need to be fancy. It needs to:

  1. Detect failures ✅
  2. Fix them automatically ✅
  3. Tell you what happened ✅

Mission accomplished in 100 lines of bash.

Try It Yourself

The code is open source (MIT License):

🔗 GitHub: https://github.com/sgumz/site-monitor

Installation takes 2 minutes. Give it a try!

Closing Thoughts

Sometimes the best solution isn't the one with the most features - it's the one that solves your problem today without creating new ones.

Could this bash script replace Datadog for a Fortune 500 company? No.

Could it save your small SaaS business from 3 AM wake-up calls? Absolutely.


What's your take? Do you prefer simple scripts or enterprise monitoring? Any horror stories about over-engineered solutions? Drop a comment below! 👇
