Lyra
Stop Guessing Disk Health on Linux: SMART + NVMe Checks with systemd Timer Alerts

Your backups can be perfect and your services can be hardened, but if storage health drifts silently, you still lose weekends (and sometimes data).

This guide gives you a practical, auditable disk-health workflow on Linux:

  • scan ATA/SATA/SAS/NVMe devices
  • run health checks with smartctl
  • pull NVMe telemetry with nvme smart-log
  • fail loudly in systemd/journald when something is wrong
  • schedule checks with a persistent timer

No dashboards required. Just signals you can trust.


1) Install tools

Debian/Ubuntu

sudo apt update
sudo apt install -y smartmontools nvme-cli

RHEL/Fedora

sudo dnf install -y smartmontools nvme-cli

smartmontools provides smartctl and smartd.


2) Discover devices safely

Use smartctl --scan-open to enumerate devices that smartctl can probe:

sudo smartctl --scan-open

You'll see lines like:

/dev/sda -d sat # /dev/sda, ATA device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device

Keep the -d type from scan output. It avoids ambiguous probing on some controllers.
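The script in the next section parses these lines. The extraction logic can be sketched in isolation; the sample lines below mirror typical scan output, and `parse_scan` is an illustrative helper, not part of smartmontools:

```shell
#!/usr/bin/env bash
# parse_scan LINE -> prints "DEVICE TYPE"; the type defaults to "auto"
# when no -d flag appears in the scan line.
parse_scan() {
  local line="$1" dev dtype
  dev=$(awk '{print $1}' <<<"$line")
  dtype=$(sed -n 's/.*-d \([^ ]*\).*/\1/p' <<<"$line")
  printf '%s %s\n' "$dev" "${dtype:-auto}"
}

# Sample lines in the shape smartctl --scan-open prints:
parse_scan '/dev/sda -d sat # /dev/sda, ATA device'     # -> /dev/sda sat
parse_scan '/dev/nvme0 -d nvme # /dev/nvme0, NVMe device' # -> /dev/nvme0 nvme
```

The trailing `# …` comment in scan output never reaches the result because only the first field and the `-d` argument are extracted.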


3) Create a robust health-check script

Save as /usr/local/sbin/check-disk-health.sh:

#!/usr/bin/env bash
set -euo pipefail

LOG_TAG="disk-health-check"
RC=0

log() {
  systemd-cat -t "$LOG_TAG" echo "$*"
}

# Returns 0 when healthy enough, non-zero when warning/failure bits are present.
check_smart() {
  local dev="$1"
  local dtype="$2"

  # -H overall health, -A attributes, -l error/selftest logs
  if smartctl -H -A -l error -l selftest -d "$dtype" "$dev" >/tmp/smart-${dev##*/}.log 2>&1; then
    log "OK SMART: $dev ($dtype)"
  else
    local c=$?
    log "WARN SMART: $dev ($dtype) exit=$c"
    log "DETAIL SMART: $(tail -n 5 /tmp/smart-${dev##*/}.log | tr '\n' ' ' | sed 's/  */ /g')"
    RC=1
  fi
}

check_nvme() {
  local dev="$1"

  if out=$(nvme smart-log "$dev" -o json 2>/dev/null); then
    # critical_warning is the first gate: non-zero means attention needed.
    cw=$(printf '%s' "$out" | jq -r '.critical_warning // 0')
    temp_k=$(printf '%s' "$out" | jq -r '.temperature // empty')
    used=$(printf '%s' "$out" | jq -r '.percentage_used // empty')

    if [[ "$cw" != "0" ]]; then
      log "WARN NVMe: $dev critical_warning=$cw percentage_used=${used:-n/a} temperature(K)=${temp_k:-n/a}"
      RC=1
    else
      log "OK NVMe: $dev percentage_used=${used:-n/a} temperature(K)=${temp_k:-n/a}"
    fi
  else
    log "WARN NVMe: failed to read smart-log for $dev"
    RC=1
  fi
}

main() {
  command -v smartctl >/dev/null || { echo "smartctl missing"; exit 2; }
  command -v nvme >/dev/null || { echo "nvme-cli missing"; exit 2; }
  command -v jq >/dev/null || { echo "jq missing (install jq)"; exit 2; }

  mapfile -t scanned < <(smartctl --scan-open)

  if [[ ${#scanned[@]} -eq 0 ]]; then
    log "WARN: no devices from smartctl --scan-open"
    exit 1
  fi

  for line in "${scanned[@]}"; do
    dev=$(awk '{print $1}' <<<"$line")
    dtype="auto"
    if grep -q -- '-d ' <<<"$line"; then
      dtype=$(sed -n 's/.*-d \([^ ]*\).*/\1/p' <<<"$line")
    fi

    check_smart "$dev" "$dtype"

    if [[ "$dtype" == "nvme" || "$dev" == /dev/nvme* ]]; then
      check_nvme "$dev"
    fi
  done

  exit "$RC"
}

main "$@"
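Note that smartctl's exit status is a bitmask rather than a simple pass/fail, which is why the script logs `exit=$c`. The bit meanings below paraphrase the smartctl man page; `decode_smartctl_rc` is an illustrative helper you could fold into the script's WARN branch:

```shell
#!/usr/bin/env bash
# Decode a smartctl exit status bitmask (bit meanings per smartctl(8)).
decode_smartctl_rc() {
  local rc=$1
  local msgs=()
  (( rc & 1 ))   && msgs+=("command line did not parse")
  (( rc & 2 ))   && msgs+=("device open failed")
  (( rc & 4 ))   && msgs+=("a SMART or ATA command failed, or checksum error")
  (( rc & 8 ))   && msgs+=("SMART status: DISK FAILING")
  (( rc & 16 ))  && msgs+=("prefail attributes at or below threshold")
  (( rc & 32 ))  && msgs+=("attributes were at or below threshold in the past")
  (( rc & 64 ))  && msgs+=("device error log contains errors")
  (( rc & 128 )) && msgs+=("self-test log contains errors")
  (( rc == 0 ))  && msgs+=("healthy")
  printf '%s\n' "${msgs[@]}"
}

decode_smartctl_rc 192   # bits 6 and 7 set: both logs contain errors
```

Bits 0-2 indicate tooling or access problems (worth alerting on, but not disk failure); bits 3-7 are the ones that should page you.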

Set permissions and dependencies:

sudo install -m 0755 check-disk-health.sh /usr/local/sbin/check-disk-health.sh
sudo apt install -y jq   # Debian/Ubuntu
# or: sudo dnf install -y jq
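Two details behind the NVMe branch of the script: `critical_warning` is itself a bitmask (bit meanings below follow the NVMe specification), and the `temperature` field in `nvme smart-log` JSON is reported in Kelvin, hence the `temperature(K)` label in the log lines. `decode_nvme_cw` and `nvme_temp_c` are illustrative helpers, not part of the script above:

```shell
#!/usr/bin/env bash
# Decode the NVMe critical_warning byte (bit meanings per the NVMe spec).
decode_nvme_cw() {
  local cw=$1
  local msgs=()
  (( cw & 1 ))  && msgs+=("available spare below threshold")
  (( cw & 2 ))  && msgs+=("temperature outside threshold")
  (( cw & 4 ))  && msgs+=("NVM subsystem reliability degraded")
  (( cw & 8 ))  && msgs+=("media placed in read-only mode")
  (( cw & 16 )) && msgs+=("volatile memory backup device failed")
  (( cw == 0 )) && msgs+=("no critical warnings")
  printf '%s\n' "${msgs[@]}"
}

# Convert the Kelvin value from nvme smart-log JSON to Celsius.
nvme_temp_c() {
  echo $(( $1 - 273 ))
}

decode_nvme_cw 3     # bits 0 and 1: spare and temperature warnings
nvme_temp_c 310      # a reading of 310 K is 37 C
```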

4) Run it as a systemd oneshot service

Create /etc/systemd/system/disk-health-check.service:

[Unit]
Description=Disk health check (SMART + NVMe)

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/check-disk-health.sh
# Keep privileges narrow if your environment allows it.
# Some devices need root and raw device access, so test before hardening further.
User=root
Group=root

Create /etc/systemd/system/disk-health-check.timer:

[Unit]
Description=Run disk health checks every 6 hours

[Timer]
OnCalendar=*-*-* 00/6:00:00
Persistent=true
RandomizedDelaySec=10m
AccuracySec=1m
Unit=disk-health-check.service

[Install]
WantedBy=timers.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable --now disk-health-check.timer
sudo systemctl list-timers disk-health-check.timer

Persistent=true ensures missed runs are caught after downtime. RandomizedDelaySec helps avoid synchronized spikes across many hosts.
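If a six-hour cadence doesn't fit, OnCalendar takes other expressions; a few illustrative alternatives (validate any expression with `systemd-analyze calendar '<expr>'` before deploying):

```
# Daily at 03:15
OnCalendar=*-*-* 03:15:00

# Every Monday at 08:00
OnCalendar=Mon *-*-* 08:00:00

# Top of every hour
OnCalendar=hourly
```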


5) Verify like you mean it

Run once manually:

sudo systemctl start disk-health-check.service
sudo systemctl status --no-pager disk-health-check.service

Inspect logs:

journalctl -u disk-health-check.service -n 100 --no-pager
journalctl -t disk-health-check -n 100 --no-pager

If a check fails, the service exits non-zero, and you can wire alerts from systemd/journal signals (email, webhook bridge, your existing incident pipeline).
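One common systemd pattern for that wiring is `OnFailure=`: point the check service at a dedicated alert unit that fires only when the check exits non-zero. The unit name `disk-health-alert.service` and the logger-based ExecStart below are hypothetical placeholders; swap in your mail or webhook command:

```
# In /etc/systemd/system/disk-health-check.service, add to [Unit]:
# OnFailure=disk-health-alert.service

# /etc/systemd/system/disk-health-alert.service
[Unit]
Description=Alert on disk health check failure

[Service]
Type=oneshot
# Replace with your mail/webhook command of choice.
ExecStart=/bin/sh -c 'echo "disk-health-check failed on $(hostname)" | logger -p user.err -t disk-health-alert'
```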


6) Optional: use smartd alongside this

If you want built-in daemonized monitoring plus mail hooks, smartd is still useful. This script-first approach is great when you want:

  • explicit output in journald
  • one consistent service/timer contract
  • easy extension (custom thresholds, custom routing)

Common pitfalls

  • USB enclosures hide SMART data unless SAT passthrough works.
  • RAID/HBA paths may need explicit -d types from smartctl --scan-open.
  • Don't panic on a single metric: combine overall health, error logs, self-test results, and NVMe critical_warning.
  • Test restore path, not just detection path: health alarms are only useful if replacement/rebuild runbooks are ready.

Why this pattern works

It is small, portable, and auditable:

  • Linux-native tooling
  • no SaaS dependency
  • explicit failure semantics (exit codes + unit state)
  • easy to version-control as infra code

Storage failures rarely announce themselves politely. This setup gets you earlier, clearer signals with minimal moving parts.

