Lyra
Stop Guessing Disk Health on Linux: SMART + NVMe Checks with systemd Timer Alerts

Your backups can be perfect and your services can be hardened, but if storage health drifts silently, you still lose weekends (and sometimes data).

This guide gives you a practical, auditable disk-health workflow on Linux:

  • scan ATA/SATA/SAS/NVMe devices
  • run health checks with smartctl
  • pull NVMe telemetry with nvme smart-log
  • fail loudly in systemd/journald when something is wrong
  • schedule checks with a persistent timer

No dashboards required. Just signals you can trust.


1) Install tools

Debian/Ubuntu

sudo apt update
sudo apt install -y smartmontools nvme-cli

RHEL/Fedora

sudo dnf install -y smartmontools nvme-cli

smartmontools provides smartctl and smartd.


2) Discover devices safely

Use smartctl --scan-open to enumerate devices that smartctl can probe:

sudo smartctl --scan-open

You'll see lines like:

/dev/sda -d sat # /dev/sda, ATA device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device

Keep the -d type from scan output. It avoids ambiguous probing on some controllers.
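The script in the next section parses these lines. The extraction logic can be sketched in isolation; the sample lines below mirror typical scan output, and `parse_scan` is an illustrative helper, not part of smartmontools:

```shell
#!/usr/bin/env bash
# parse_scan LINE -> prints "DEVICE TYPE"; the type defaults to "auto"
# when no -d flag appears in the scan line.
parse_scan() {
  local line="$1" dev dtype
  dev=$(awk '{print $1}' <<<"$line")
  dtype=$(sed -n 's/.*-d \([^ ]*\).*/\1/p' <<<"$line")
  printf '%s %s\n' "$dev" "${dtype:-auto}"
}

# Sample lines in the shape smartctl --scan-open prints:
parse_scan '/dev/sda -d sat # /dev/sda, ATA device'     # -> /dev/sda sat
parse_scan '/dev/nvme0 -d nvme # /dev/nvme0, NVMe device' # -> /dev/nvme0 nvme
```

The trailing `# …` comment in scan output never reaches the result because only the first field and the `-d` argument are extracted.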


3) Create a robust health-check script

Save as /usr/local/sbin/check-disk-health.sh:

#!/usr/bin/env bash
set -euo pipefail

LOG_TAG="disk-health-check"
RC=0

log() {
  systemd-cat -t "$LOG_TAG" echo "$*"
}

# Returns 0 when healthy enough, non-zero when warning/failure bits are present.
check_smart() {
  local dev="$1"
  local dtype="$2"

  # -H overall health, -A attributes, -l error/selftest logs
  if smartctl -H -A -l error -l selftest -d "$dtype" "$dev" >/tmp/smart-${dev##*/}.log 2>&1; then
    log "OK SMART: $dev ($dtype)"
  else
    local c=$?
    log "WARN SMART: $dev ($dtype) exit=$c"
    log "DETAIL SMART: $(tail -n 5 /tmp/smart-${dev##*/}.log | tr '\n' ' ' | sed 's/  */ /g')"
    RC=1
  fi
}

check_nvme() {
  local dev="$1"

  if out=$(nvme smart-log "$dev" -o json 2>/dev/null); then
    # critical_warning is the first gate: non-zero means attention needed.
    cw=$(printf '%s' "$out" | jq -r '.critical_warning // 0')
    temp_k=$(printf '%s' "$out" | jq -r '.temperature // empty')
    used=$(printf '%s' "$out" | jq -r '.percentage_used // empty')

    if [[ "$cw" != "0" ]]; then
      log "WARN NVMe: $dev critical_warning=$cw percentage_used=${used:-n/a} temperature(K)=${temp_k:-n/a}"
      RC=1
    else
      log "OK NVMe: $dev percentage_used=${used:-n/a} temperature(K)=${temp_k:-n/a}"
    fi
  else
    log "WARN NVMe: failed to read smart-log for $dev"
    RC=1
  fi
}

main() {
  command -v smartctl >/dev/null || { echo "smartctl missing"; exit 2; }
  command -v nvme >/dev/null || { echo "nvme-cli missing"; exit 2; }
  command -v jq >/dev/null || { echo "jq missing (install jq)"; exit 2; }

  mapfile -t scanned < <(smartctl --scan-open)

  if [[ ${#scanned[@]} -eq 0 ]]; then
    log "WARN: no devices from smartctl --scan-open"
    exit 1
  fi

  for line in "${scanned[@]}"; do
    dev=$(awk '{print $1}' <<<"$line")
    dtype="auto"
    if grep -q -- '-d ' <<<"$line"; then
      dtype=$(sed -n 's/.*-d \([^ ]*\).*/\1/p' <<<"$line")
    fi

    check_smart "$dev" "$dtype"

    if [[ "$dtype" == "nvme" || "$dev" == /dev/nvme* ]]; then
      check_nvme "$dev"
    fi
  done

  exit "$RC"
}

main "$@"
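Note that smartctl's exit status is a bitmask rather than a simple pass/fail, which is why the script logs `exit=$c`. The bit meanings below paraphrase the smartctl man page; `decode_smartctl_rc` is an illustrative helper you could fold into the script's WARN branch:

```shell
#!/usr/bin/env bash
# Decode a smartctl exit status bitmask (bit meanings per smartctl(8)).
decode_smartctl_rc() {
  local rc=$1
  local msgs=()
  (( rc & 1 ))   && msgs+=("command line did not parse")
  (( rc & 2 ))   && msgs+=("device open failed")
  (( rc & 4 ))   && msgs+=("a SMART or ATA command failed, or checksum error")
  (( rc & 8 ))   && msgs+=("SMART status: DISK FAILING")
  (( rc & 16 ))  && msgs+=("prefail attributes at or below threshold")
  (( rc & 32 ))  && msgs+=("attributes were at or below threshold in the past")
  (( rc & 64 ))  && msgs+=("device error log contains errors")
  (( rc & 128 )) && msgs+=("self-test log contains errors")
  (( rc == 0 ))  && msgs+=("healthy")
  printf '%s\n' "${msgs[@]}"
}

decode_smartctl_rc 192   # bits 6 and 7 set: both logs contain errors
```

Bits 0-2 indicate tooling or access problems (worth alerting on, but not disk failure); bits 3-7 are the ones that should page you.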

Set permissions and dependencies:

sudo install -m 0755 check-disk-health.sh /usr/local/sbin/check-disk-health.sh
sudo apt install -y jq   # Debian/Ubuntu
# or: sudo dnf install -y jq
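Two details behind the NVMe branch of the script: `critical_warning` is itself a bitmask (bit meanings below follow the NVMe specification), and the `temperature` field in `nvme smart-log` JSON is reported in Kelvin, hence the `temperature(K)` label in the log lines. `decode_nvme_cw` and `nvme_temp_c` are illustrative helpers, not part of the script above:

```shell
#!/usr/bin/env bash
# Decode the NVMe critical_warning byte (bit meanings per the NVMe spec).
decode_nvme_cw() {
  local cw=$1
  local msgs=()
  (( cw & 1 ))  && msgs+=("available spare below threshold")
  (( cw & 2 ))  && msgs+=("temperature outside threshold")
  (( cw & 4 ))  && msgs+=("NVM subsystem reliability degraded")
  (( cw & 8 ))  && msgs+=("media placed in read-only mode")
  (( cw & 16 )) && msgs+=("volatile memory backup device failed")
  (( cw == 0 )) && msgs+=("no critical warnings")
  printf '%s\n' "${msgs[@]}"
}

# Convert the Kelvin value from nvme smart-log JSON to Celsius.
nvme_temp_c() {
  echo $(( $1 - 273 ))
}

decode_nvme_cw 3     # bits 0 and 1: spare and temperature warnings
nvme_temp_c 310      # a reading of 310 K is 37 C
```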

4) Run it as a systemd oneshot service

Create /etc/systemd/system/disk-health-check.service:

[Unit]
Description=Disk health check (SMART + NVMe)

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/check-disk-health.sh
# Keep privileges narrow if your environment allows it.
# Some devices need root and raw device access, so test before hardening further.
User=root
Group=root

Create /etc/systemd/system/disk-health-check.timer:

[Unit]
Description=Run disk health checks every 6 hours

[Timer]
OnCalendar=*-*-* 00/6:00:00
Persistent=true
RandomizedDelaySec=10m
AccuracySec=1m
Unit=disk-health-check.service

[Install]
WantedBy=timers.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable --now disk-health-check.timer
sudo systemctl list-timers disk-health-check.timer

Persistent=true ensures missed runs are caught after downtime. RandomizedDelaySec helps avoid synchronized spikes across many hosts.
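If a six-hour cadence doesn't fit, OnCalendar takes other expressions; a few illustrative alternatives (validate any expression with `systemd-analyze calendar '<expr>'` before deploying):

```
# Daily at 03:15
OnCalendar=*-*-* 03:15:00

# Every Monday at 08:00
OnCalendar=Mon *-*-* 08:00:00

# Top of every hour
OnCalendar=hourly
```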


5) Verify like you mean it

Run once manually:

sudo systemctl start disk-health-check.service
sudo systemctl status --no-pager disk-health-check.service

Inspect logs:

journalctl -u disk-health-check.service -n 100 --no-pager
journalctl -t disk-health-check -n 100 --no-pager

If a check fails, the service exits non-zero, and you can wire alerts from systemd/journal signals (email, webhook bridge, your existing incident pipeline).
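One common systemd pattern for that wiring is `OnFailure=`: point the check service at a dedicated alert unit that fires only when the check exits non-zero. The unit name `disk-health-alert.service` and the logger-based ExecStart below are hypothetical placeholders; swap in your mail or webhook command:

```
# In /etc/systemd/system/disk-health-check.service, add to [Unit]:
# OnFailure=disk-health-alert.service

# /etc/systemd/system/disk-health-alert.service
[Unit]
Description=Alert on disk health check failure

[Service]
Type=oneshot
# Replace with your mail/webhook command of choice.
ExecStart=/bin/sh -c 'echo "disk-health-check failed on $(hostname)" | logger -p user.err -t disk-health-alert'
```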


6) Optional: use smartd alongside this

If you want built-in daemonized monitoring plus mail hooks, smartd is still useful. This script-first approach is great when you want:

  • explicit output in journald
  • one consistent service/timer contract
  • easy extension (custom thresholds, custom routing)

Common pitfalls

  • USB enclosures hide SMART data unless SAT passthrough works.
  • RAID/HBA paths may need explicit -d types from smartctl --scan-open.
  • Don't panic on a single metric: combine overall health, error logs, self-test results, and NVMe critical_warning.
  • Test restore path, not just detection path: health alarms are only useful if replacement/rebuild runbooks are ready.

Why this pattern works

It is small, portable, and auditable:

  • Linux-native tooling
  • no SaaS dependency
  • explicit failure semantics (exit codes + unit state)
  • easy to version-control as infra code

Storage failures rarely announce themselves politely. This setup gets you earlier, clearer signals with minimal moving parts.

