Your backups can be perfect and your services can be hardened, but if storage health drifts silently, you still lose weekends (and sometimes data).
This guide gives you a practical, auditable disk-health workflow on Linux:
- scan ATA/SATA/SAS/NVMe devices
- run health checks with smartctl
- pull NVMe telemetry with nvme smart-log
- fail loudly in systemd/journald when something is wrong
- schedule checks with a persistent timer
No dashboards required. Just signals you can trust.
1) Install tools
Debian/Ubuntu:

```shell
sudo apt update
sudo apt install -y smartmontools nvme-cli
```

RHEL/Fedora:

```shell
sudo dnf install -y smartmontools nvme-cli
```
smartmontools provides smartctl and smartd.
2) Discover devices safely
Use smartctl --scan-open to enumerate devices that smartctl can probe:

```shell
sudo smartctl --scan-open
```

You'll see lines like:

```
/dev/sda -d sat    # /dev/sda, ATA device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device
```
Keep the -d type from scan output. It avoids ambiguous probing on some controllers.
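For a one-off probe you can pass that type explicitly, e.g. sudo smartctl -H -d sat /dev/sda. The scan lines themselves split cleanly with standard tools; in this sketch the input line is a hardcoded sample rather than live scan output:

```shell
# Split a smartctl --scan-open line into device node and probe type.
# The sample line is illustrative; real input comes from the scan itself.
line='/dev/sda -d sat # /dev/sda, ATA device'

dev=$(awk '{print $1}' <<<"$line")                    # first field: device node
dtype=$(sed -n 's/.*-d \([^ ]*\).*/\1/p' <<<"$line")  # token after "-d": probe type

echo "$dev $dtype"   # prints "/dev/sda sat"
```

This is exactly the parsing the health-check script below performs per scanned line.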
3) Create a robust health-check script
Save as /usr/local/sbin/check-disk-health.sh:
```shell
#!/usr/bin/env bash
set -euo pipefail

LOG_TAG="disk-health-check"
RC=0

log() {
  systemd-cat -t "$LOG_TAG" echo "$*"
}

# Exits cleanly when healthy enough; sets RC=1 when warning/failure bits are present.
check_smart() {
  local dev="$1"
  local dtype="$2"
  # -H overall health, -A attributes, -l error/selftest logs
  if smartctl -H -A -l error -l selftest -d "$dtype" "$dev" >"/tmp/smart-${dev##*/}.log" 2>&1; then
    log "OK SMART: $dev ($dtype)"
  else
    local c=$?
    log "WARN SMART: $dev ($dtype) exit=$c"
    log "DETAIL SMART: $(tail -n 5 "/tmp/smart-${dev##*/}.log" | tr '\n' ' ' | sed 's/  */ /g')"
    RC=1
  fi
}

check_nvme() {
  local dev="$1"
  local out cw temp_k used
  if out=$(nvme smart-log "$dev" -o json 2>/dev/null); then
    # critical_warning is the first gate: non-zero means attention needed.
    cw=$(printf '%s' "$out" | jq -r '.critical_warning // 0')
    temp_k=$(printf '%s' "$out" | jq -r '.temperature // empty')
    used=$(printf '%s' "$out" | jq -r '.percentage_used // empty')
    if [[ "$cw" != "0" ]]; then
      log "WARN NVMe: $dev critical_warning=$cw percentage_used=${used:-n/a} temperature(K)=${temp_k:-n/a}"
      RC=1
    else
      log "OK NVMe: $dev percentage_used=${used:-n/a} temperature(K)=${temp_k:-n/a}"
    fi
  else
    log "WARN NVMe: failed to read smart-log for $dev"
    RC=1
  fi
}

main() {
  command -v smartctl >/dev/null || { echo "smartctl missing"; exit 2; }
  command -v nvme >/dev/null || { echo "nvme-cli missing"; exit 2; }
  command -v jq >/dev/null || { echo "jq missing (install jq)"; exit 2; }

  mapfile -t scanned < <(smartctl --scan-open)
  if [[ ${#scanned[@]} -eq 0 ]]; then
    log "WARN: no devices from smartctl --scan-open"
    exit 1
  fi

  for line in "${scanned[@]}"; do
    dev=$(awk '{print $1}' <<<"$line")
    dtype="auto"
    if grep -q -- '-d ' <<<"$line"; then
      dtype=$(sed -n 's/.*-d \([^ ]*\).*/\1/p' <<<"$line")
    fi
    check_smart "$dev" "$dtype"
    if [[ "$dtype" == "nvme" || "$dev" == /dev/nvme* ]]; then
      check_nvme "$dev"
    fi
  done

  exit "$RC"
}

main "$@"
```
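The exit=$c value logged by check_smart is not a plain pass/fail: smartctl returns a bitmask, documented in smartctl(8) under RETURN VALUES. Decoding it turns a bare number into an actionable message. A standalone sketch:

```shell
# Decode the smartctl exit-status bitmask (see smartctl(8), "RETURN VALUES").
decode_smartctl_rc() {
  local rc=$1 out="" m
  local -a msgs=()
  (( rc & 1 ))   && msgs+=("command line did not parse")
  (( rc & 2 ))   && msgs+=("device open failed")
  (( rc & 4 ))   && msgs+=("SMART or ATA command failed, or checksum error")
  (( rc & 8 ))   && msgs+=("SMART status: DISK FAILING")
  (( rc & 16 ))  && msgs+=("prefail attributes at or below threshold")
  (( rc & 32 ))  && msgs+=("attributes were at or below threshold in the past")
  (( rc & 64 ))  && msgs+=("device error log contains records")
  (( rc & 128 )) && msgs+=("self-test log contains errors")
  for m in "${msgs[@]}"; do out="${out:+$out; }$m"; done
  echo "${out:-no failure bits set}"
}

# Example: 72 = 8 + 64
decode_smartctl_rc 72   # prints "SMART status: DISK FAILING; device error log contains records"
```

Bits 3 and up are the ones worth paging on; bits 1-2 usually mean a probing problem, not a dying disk.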
Set permissions and install dependencies:

```shell
sudo chmod 0755 /usr/local/sbin/check-disk-health.sh
sudo apt install -y jq   # Debian/Ubuntu
# or: sudo dnf install -y jq
```
4) Run it as a systemd oneshot service
Create /etc/systemd/system/disk-health-check.service:
```ini
[Unit]
Description=Disk health check (SMART + NVMe)

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/check-disk-health.sh
# Keep privileges narrow if your environment allows it.
# Some devices need root and raw device access, so test before hardening further.
User=root
Group=root
```

A local disk check has no network dependency, so no ordering on network-online.target is needed.
Create /etc/systemd/system/disk-health-check.timer:
```ini
[Unit]
Description=Run disk health checks every 6 hours

[Timer]
OnCalendar=*-*-* 00/6:00:00
Persistent=true
RandomizedDelaySec=10m
AccuracySec=1m
Unit=disk-health-check.service

[Install]
WantedBy=timers.target
```
Enable and start:

```shell
sudo systemctl daemon-reload
sudo systemctl enable --now disk-health-check.timer
sudo systemctl list-timers disk-health-check.timer
```
Persistent=true ensures missed runs are caught after downtime. RandomizedDelaySec helps avoid synchronized spikes across many hosts.
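The cadence itself is one line to change. A few other valid OnCalendar forms, per systemd.time(7):

```ini
# In the [Timer] section — pick one cadence:
OnCalendar=*-*-* 00/6:00:00   # every 6 hours (as above)
OnCalendar=daily              # once a day, at 00:00
OnCalendar=Mon..Fri 07:30     # weekday mornings
```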
5) Verify like you mean it
Run once manually:

```shell
sudo systemctl start disk-health-check.service
sudo systemctl status --no-pager disk-health-check.service
```

Inspect logs:

```shell
journalctl -u disk-health-check.service -n 100 --no-pager
journalctl -t disk-health-check -n 100 --no-pager
```
If a check fails, the service exits non-zero, and you can wire alerts from systemd/journal signals (email, webhook bridge, your existing incident pipeline).
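One low-friction wiring is systemd's OnFailure= hook. Everything named below (the alert unit and the notify script) is a hypothetical placeholder for your own transport:

```ini
# Add to disk-health-check.service, [Unit] section:
OnFailure=disk-health-alert.service

# /etc/systemd/system/disk-health-alert.service (sketch):
[Unit]
Description=Alert on failed disk health check

[Service]
Type=oneshot
# notify-disk-failure.sh is a placeholder: mail, webhook, pager, etc.
# %H expands to the local hostname.
ExecStart=/usr/local/sbin/notify-disk-failure.sh "disk-health-check failed on %H"
```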
6) Optional: use smartd alongside this
If you want built-in daemonized monitoring plus mail hooks, smartd is still useful. This script-first approach is great when you want:
- explicit output in journald
- one consistent service/timer contract
- easy extension (custom thresholds, custom routing)
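For reference, a minimal smartd configuration that covers similar ground; treat the schedule and mail flags as a starting sketch and confirm the syntax against smartd.conf(5):

```
# /etc/smartd.conf
# Monitor all detected devices (-a), schedule a short self-test daily
# between 02:00 and 03:00 (-s), and mail root on trouble (-m).
DEVICESCAN -a -s S/../.././02 -m root
```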
Common pitfalls
- USB enclosures hide SMART data unless SAT passthrough works.
- RAID/HBA paths may need explicit -d types from smartctl --scan-open.
- Don't panic on a single metric: combine overall health, error logs, self-test results, and NVMe critical_warning.
- Test the restore path, not just the detection path: health alarms are only useful if replacement/rebuild runbooks are ready.
Why this pattern works
It is small, portable, and auditable:
- Linux-native tooling
- no SaaS dependency
- explicit failure semantics (exit codes + unit state)
- easy to version-control as infra code
Storage failures rarely announce themselves politely. This setup gets you earlier, clearer signals with minimal moving parts.
References
- smartctl(8) manual (Arch mirror): https://man.archlinux.org/man/smartctl.8.en
- systemd.timer(5): https://manpages.debian.org/testing/systemd/systemd.timer.5.en.html
- nvme-smart-log(1): https://manpages.debian.org/testing/nvme-cli/nvme-smart-log.1.en.html
- Debian smartmontools package details (smartctl, smartd): https://packages.debian.org/sid/smartmontools
- ArchWiki S.M.A.R.T. operational notes: https://wiki.archlinux.org/title/S.M.A.R.T.