Cost-Effective Disaster Recovery: Managing ZFS Snapshots on Proxmox VE

Why Simple is Better for Public Sector IT

In my daily work as an Infrastructure Engineer in the public sector, I often face a common dilemma: We need enterprise-grade data integrity and auditability, but we don't always have the budget for high-end backup appliances.

Complexity is the enemy of security. That's why I prefer leveraging the native capabilities of ZFS directly on our Proxmox VE hosts, rather than adding layers of third-party software that might introduce new vulnerabilities or compliance issues.

The Challenge: Automated Snapshots without Bloat

I needed a way to:

  1. Create recurring snapshots of critical VMs and containers
  2. Rotate them automatically (GFS: hourly, daily, weekly retention)
  3. Have zero external dependencies (just a shell script)
  4. Be fully auditable via syslog integration
  5. Fail gracefully if a dataset is unavailable

The Solution

I wrote a lightweight wrapper script around the native zfs command. It's designed to be run as a cron job on any Debian-based Proxmox node.

Core Features

Here's the production-ready version with proper error handling:

#!/bin/bash
#
# ZFS Snapshot Manager for Proxmox VE
# Retention: 24 hourly, 7 daily, 4 weekly
# Author: Patrick Bloem

set -euo pipefail  # Exit on error, undefined vars, pipe failures

# Configuration
DATASET="${1:-rpool/data}"
LOG_FACILITY="local0"
TIMESTAMP=$(date +"%Y%m%d-%H%M%S")
RETENTION_HOURLY=24
RETENTION_DAILY=7   # reserved for the daily variant (see Deployment below)

# Logging function
log() {
    logger -t "zfs-snapshot" -p "${LOG_FACILITY}.info" "$1"
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1"
}

error_exit() {
    logger -t "zfs-snapshot" -p "${LOG_FACILITY}.err" "ERROR: $1"
    echo "ERROR: $1" >&2
    exit 1
}

# Verify dataset exists
if ! zfs list -H -o name "$DATASET" &>/dev/null; then
    error_exit "Dataset $DATASET not found"
fi

# Create snapshot
SNAPSHOT_NAME="${DATASET}@auto-hourly-${TIMESTAMP}"
log "Creating snapshot: $SNAPSHOT_NAME"

if ! zfs snapshot -r "$SNAPSHOT_NAME" 2>&1 | logger -t "zfs-snapshot"; then
    error_exit "Failed to create snapshot $SNAPSHOT_NAME"
fi

# Prune old hourly snapshots (keep last N)
log "Pruning old snapshots (keeping last $RETENTION_HOURLY hourly)"

zfs list -H -t snapshot -o name -s creation \
    | grep "${DATASET}@auto-hourly-" \
    | head -n -"$RETENTION_HOURLY" \
    | while read -r old_snap; do
        log "Destroying old snapshot: $old_snap"
        zfs destroy "$old_snap" || log "Warning: Could not destroy $old_snap"
    done

log "Snapshot rotation completed successfully"

Deployment

Add this to your crontab for hourly execution:

# Run every hour at minute 5
5 * * * * /usr/local/bin/zfs-snapshot-manager.sh rpool/data 2>&1 | logger -t zfs-snapshot

For daily/weekly snapshots, create separate scripts with adjusted retention policies or use a tag-based approach.
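
One way to avoid maintaining three near-identical scripts is to pass the tag and retention in as arguments and let cron supply them per tier. This is a minimal sketch of that idea; the argument handling is my own and not part of the script above:

#!/bin/bash
# Hypothetical tag-based variant: dataset, tag and retention come from cron
# Usage: zfs-snapshot-manager.sh <dataset> <tag> <keep>
set -euo pipefail

DATASET="${1:-rpool/data}"
TAG="${2:-hourly}"       # hourly | daily | weekly
KEEP="${3:-24}"          # number of snapshots to keep for this tag
TIMESTAMP=$(date +"%Y%m%d-%H%M%S")

zfs snapshot -r "${DATASET}@auto-${TAG}-${TIMESTAMP}"

# Prune only snapshots that carry the same tag
zfs list -H -t snapshot -o name -s creation \
    | grep "${DATASET}@auto-${TAG}-" \
    | head -n -"$KEEP" \
    | xargs -r -n1 zfs destroy

# Matching crontab entries, one line per tier:
# 5 * * * *   /usr/local/bin/zfs-snapshot-manager.sh rpool/data hourly 24
# 15 2 * * *  /usr/local/bin/zfs-snapshot-manager.sh rpool/data daily 7
# 30 3 * * 0  /usr/local/bin/zfs-snapshot-manager.sh rpool/data weekly 4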

Why This Works for Compliance

  1. Auditability: All actions are logged to syslog, which can be forwarded to a central SIEM
  2. Atomicity: ZFS snapshots are atomic and crash-consistent
  3. Transparency: No proprietary tools; every action is traceable via zfs list -t snapshot (see the audit example after this list)
  4. Idempotency: Safe to re-run at any time (timestamped names prevent collisions with existing snapshots)
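
For example, an auditor can reconstruct the full snapshot history directly on the host. Something along these lines works on any node (journalctl shown here; any syslog target works just as well):

# Every snapshot, with creation time, straight from ZFS
zfs list -r -t snapshot -o name,creation -s creation rpool/data

# The script's own audit trail, filtered by its syslog tag
journalctl -t zfs-snapshot --since "7 days ago"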

Advanced: Integration with Offsite Replication

In production, I combine this with zfs send/recv for offsite replication:

# Example: Replicate to remote NAS
LATEST_SNAP=$(zfs list -H -t snapshot -o name -s creation | grep "auto-hourly" | tail -1)
zfs send -R "$LATEST_SNAP" | ssh backup-host "zfs recv -F backup/proxmox"
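
A full -R stream on every run would be wasteful over a WAN; once the initial copy exists, the hourly job only needs to send the delta between the two most recent snapshots. A rough sketch, assuming at least two hourly snapshots already exist on the sending side:

# Incremental replication: send only the changes since the previous hourly snapshot
SNAPS=$(zfs list -H -t snapshot -o name -s creation | grep "auto-hourly")
PREV_SNAP=$(echo "$SNAPS" | tail -n 2 | head -n 1)
LATEST_SNAP=$(echo "$SNAPS" | tail -n 1)

zfs send -R -i "$PREV_SNAP" "$LATEST_SNAP" | ssh backup-host "zfs recv -F backup/proxmox"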

This gives us:

  • Recovery Point Objective (RPO): 1 hour
  • Recovery Time Objective (RTO): Minutes (just mount the dataset)
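
Restoring a single file does not even need a rollback: every snapshot is browsable read-only through the hidden .zfs directory on the dataset's mountpoint. For illustration (the container path and snapshot name here are examples, not from the script above):

# Browse a container subvolume's snapshots and copy a file back
ls /rpool/data/subvol-101-disk-0/.zfs/snapshot/
cp /rpool/data/subvol-101-disk-0/.zfs/snapshot/auto-hourly-20250101-120500/etc/fstab /root/fstab.restored

# Or roll the dataset back to its most recent snapshot (discards anything newer)
zfs rollback rpool/data/subvol-101-disk-0@auto-hourly-20250101-120500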

Get the Full Script

I've published the full, hardened version of this tool on GitHub with additional features:

  • Multi-dataset support
  • Configurable retention policies
  • Integration with Prometheus for monitoring
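
The Prometheus side does not need a dedicated exporter; the node_exporter textfile collector is enough. The published script may wire this up differently, but the general pattern is to drop a metric file after each run (the path is a common convention and must match node_exporter's --collector.textfile.directory flag):

# Write a success timestamp for node_exporter's textfile collector after each run
cat > /var/lib/node_exporter/textfile_collector/zfs_snapshot.prom <<EOF
# HELP zfs_snapshot_last_success_timestamp_seconds Unix time of the last successful snapshot run
# TYPE zfs_snapshot_last_success_timestamp_seconds gauge
zfs_snapshot_last_success_timestamp_seconds $(date +%s)
EOF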

👉 Check it out here: proxmox-zfs-snapshot-manager on GitHub

Feel free to fork it or suggest improvements. In the public sector, sharing reliable, open-source tooling is the best way to ensure we all build more resilient infrastructure.


About me: I'm Patrick, a Senior Infrastructure Engineer focusing on Linux hardening and virtualization in the public sector. Connect with me on LinkedIn or check out my other projects on GitHub.
