Daniel Glover

Posted on Apr 2 • Originally published at danieljamesglover.com

Proxmox Backup and Disaster Recovery Guide

#proxmox #backup #virtualisation #infrastructure

Most backup strategies look fine in a diagram. There is production, there is backup storage, there is some kind of retention policy, and there is a comforting sentence about disaster recovery. Then a host fails, a datastore corrupts, or ransomware lands on the management network, and you discover the truth: you did not have a recovery strategy, you had a hope strategy.

I like Proxmox because it makes the mechanics of backup straightforward. The dangerous bit is that this can create false confidence. Clicking "Add backup job" is easy. Building a recovery setup that survives hardware failure, operator error and a bad week is not. This guide is the practical version of what I look for when setting up Proxmox backups for a small IT team or a serious home lab.

If you need the leadership view first, read my guide to an IT disaster recovery plan that actually works. If you are thinking specifically about active attack scenarios, pair this with the ransomware response playbook. This post is about the hands-on Proxmox layer underneath both.

What Good Looks Like

A usable Proxmox backup and DR setup does five things well:

It backs up every important VM and container automatically.
It keeps enough restore points to be useful without filling storage.
It isolates backup storage from the main virtualisation environment.
It proves recoverability with regular test restores.
It gives you an off-site option so one hardware problem does not take out everything.

That sounds obvious, but most weak setups fail on one of those five points. The most common mistake is treating backup completion as the goal. It is not. Recovery within an acceptable time is the goal.

Start With the Failure Modes

Before touching the GUI, define what you are defending against.

For most Proxmox environments, the realistic failure modes are:

Single VM or container issue. Bad update, accidental deletion, filesystem corruption.
Host failure. Disk dies, motherboard dies, bad kernel update, RAID issue.
Storage failure. Local datastore corruption or accidental removal.
Security incident. Compromised admin account, malicious encryption, attacker targeting backup infrastructure.
Site loss. Fire, theft, power event, or catastrophic network issue.

Your design should map to those scenarios. Local backups help with VM mistakes. Separate backup infrastructure helps with host failure. Off-site replication helps with site loss. If one control is supposed to solve everything, it probably solves less than you think.
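One way to keep yourself honest is to write the mapping down and check it. The sketch below is illustrative Python, not anything Proxmox ships. The control names and the coverage I have assigned to each are my own assumptions, so adjust them to your environment.

```python
# Illustrative only: control names and their coverage are assumptions.
FAILURE_MODES = [
    "single_vm_issue",
    "host_failure",
    "storage_failure",
    "security_incident",
    "site_loss",
]

COVERAGE = {
    "local_backup": {"single_vm_issue"},
    "separate_backup_host": {"single_vm_issue", "host_failure", "storage_failure"},
    "isolated_backup_access": {"security_incident"},
    "offsite_copy": {"host_failure", "storage_failure", "site_loss"},
}


def uncovered(controls):
    """Return the failure modes that none of the given controls address."""
    covered = set()
    for control in controls:
        covered |= COVERAGE[control]
    return [mode for mode in FAILURE_MODES if mode not in covered]


print(uncovered(["local_backup"]))
print(uncovered(list(COVERAGE)))
```

Running it with only a local backup leaves four of the five failure modes unanswered, which is the whole point of the exercise.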

Proxmox Backup Server Beats Dumping Files to NFS

Proxmox VE can back up to file-level storage, and that is still better than having nothing. But if you are taking this seriously, use Proxmox Backup Server.

The reason is not fashion. It is capability.

Proxmox Backup Server gives you de-duplicated storage, better retention handling, verification workflows, and sync jobs to a remote PBS instance. Proxmox also explicitly recommends PBS on a dedicated host because of those advanced features. For a small team, that translates into two practical benefits: lower storage growth and a cleaner path to off-site copies.
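To make the storage point concrete, here is a rough back-of-the-envelope model in Python. Every number in it is an assumption (a 100 GB VM, 2% of data changing per day, 15 restore points), and real de-duplication results depend heavily on the workload, so treat it as an idealised comparison rather than a prediction.

```python
def full_copies_gb(vm_size_gb, restore_points):
    """Storage if every restore point is kept as a complete, separate archive."""
    return vm_size_gb * restore_points


def deduplicated_gb(vm_size_gb, restore_points, daily_change_pct):
    """Idealised estimate: one full copy plus only the changed data afterwards."""
    changed_per_backup = vm_size_gb * daily_change_pct // 100
    return vm_size_gb + (restore_points - 1) * changed_per_backup


print(full_copies_gb(100, 15))
print(deduplicated_gb(100, 15, 2))
```

The model ignores compression and assumes changes never overlap, so the real gap will differ, but the shape of the difference is why retention gets cheaper on a de-duplicating store.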

If you are still writing backup archives to the same host cluster, ask yourself a brutal question: what happens if the host storage, the hypervisor, and the backup target all fail together? If the answer is "that would be awkward", you do not have enough separation.

My Baseline Architecture

For a small but sensible setup, I would aim for this baseline:

Primary Proxmox VE host or cluster running production workloads.
Dedicated Proxmox Backup Server on separate storage, ideally not on the same disks as the workloads.
Nightly backup jobs for all critical VMs and containers.
Prune schedule that keeps short-term and long-term restore points.
Verification jobs to detect corruption before you need a restore.
Off-site copy using a second PBS instance or another synced destination.

If you can only afford one improvement this month, make it the dedicated PBS host. If you can afford two, add off-site sync.

Backup Mode Choices Matter

Proxmox gives you different backup modes, and they are not interchangeable.

For VMs, snapshot mode is usually the right default because it keeps downtime low. With the guest agent enabled, Proxmox can freeze and thaw the filesystem to improve consistency. Stop mode gives you the highest consistency, but at the cost of downtime, so I reserve it for workloads where application-level consistency matters more than availability.

For containers, the decision depends more on storage and tolerance for interruption. Snapshot mode is excellent when the underlying storage supports it. Suspend mode can reduce downtime, but it needs temporary space. Stop mode is the blunt instrument. Use it when simplicity matters more than elegance.

The mistake here is picking one mode for everything without thinking about workload behaviour. Domain controllers, databases, app servers and throwaway lab machines do not all deserve identical handling.
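The reasoning above can be written down as a small rule of thumb. This is a sketch of my own decision logic in Python, not a Proxmox API. The function and its parameters are hypothetical, and only the returned names (snapshot, suspend, stop) correspond to the backup modes discussed here.

```python
def suggest_backup_mode(guest_type, needs_app_consistency=False,
                        downtime_acceptable=False,
                        storage_supports_snapshots=True):
    """Suggest a backup mode for a 'vm' or 'container' guest (rule of thumb only)."""
    if guest_type == "vm":
        # Stop mode only when consistency outranks availability.
        if needs_app_consistency and downtime_acceptable:
            return "stop"
        return "snapshot"
    if guest_type == "container":
        if storage_supports_snapshots:
            return "snapshot"
        # Without snapshot-capable storage: accept the stop,
        # or trade temporary space for less downtime.
        return "stop" if downtime_acceptable else "suspend"
    raise ValueError(f"unknown guest type: {guest_type!r}")


print(suggest_backup_mode("vm"))
print(suggest_backup_mode("container", storage_supports_snapshots=False))
```

The value is less in the function than in being forced to answer its questions for each workload.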

Retention: Keep More Than Yesterday, Less Than Forever

Retention is where people either hoard uselessly or prune themselves into danger.

A practical retention policy for a small team might look like this:

Keep 7 daily backups
Keep 4 weekly backups
Keep 3 monthly backups
Keep 1-2 yearly backups for critical systems

That gives you recent rollback points, medium-term safety for slow-burn issues, and a minimal historical archive. The exact numbers depend on data churn and storage budget, but the principle is stable: retain enough history to catch corruption, mistakes and delayed detection.

Remember that ransomware and insider mistakes are not always discovered on the same day they happen. If your retention only covers the last three days, you may be faithfully preserving three versions of a bad outcome.
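To see what a policy like this actually leaves on disk, you can simulate it. The Python below is a simplified model of keep-daily, keep-weekly, keep-monthly and keep-yearly pruning that I wrote for illustration. It is not the PBS implementation, and it ignores details such as multiple backups per day and time zones. It assumes one backup per night for 400 nights ending on 30 June 2024, and a hypothetical incident that began on 20 June.

```python
from datetime import date, timedelta


def simulate_prune(backup_days, keep_daily=7, keep_weekly=4,
                   keep_monthly=3, keep_yearly=1):
    """Return the set of backup dates a simplified keep-N policy would retain."""
    rules = [
        (keep_daily, lambda d: d),
        (keep_weekly, lambda d: tuple(d.isocalendar())[:2]),  # (ISO year, ISO week)
        (keep_monthly, lambda d: (d.year, d.month)),
        (keep_yearly, lambda d: d.year),
    ]
    kept, removed = set(), set()
    for limit, period_of in rules:
        # Periods already represented by backups kept under earlier rules.
        covered = {period_of(d) for d in kept}
        chosen = set()
        for day in sorted(backup_days, reverse=True):  # newest first
            if day in kept or day in removed:
                continue
            period = period_of(day)
            if period in covered:
                continue
            if period in chosen:
                removed.add(day)  # older backup from a period we already kept
                continue
            if len(chosen) >= limit:
                break
            chosen.add(period)
            kept.add(day)
    return kept


def has_clean_restore_point(kept, incident_day):
    """True if at least one retained backup predates the incident."""
    return any(day < incident_day for day in kept)


newest = date(2024, 6, 30)
nightly = [newest - timedelta(days=n) for n in range(400)]

full_policy = simulate_prune(nightly)
three_days = simulate_prune(nightly, keep_daily=3, keep_weekly=0,
                            keep_monthly=0, keep_yearly=0)

incident = date(2024, 6, 20)  # hypothetical: damage began ten days before discovery
print(len(full_policy), has_clean_restore_point(full_policy, incident))
print(len(three_days), has_clean_restore_point(three_days, incident))
```

In this model the three-day policy has no restore point from before the incident, while the layered policy still does. Before trusting any numbers, compare them against a dry run of your real prune settings.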

Backup Isolation Is Non-Negotiable

I have written before that attackers increasingly target backup systems because destroying recovery options dramatically increases the chance of payment. The same logic applies in Proxmox environments.

A few practical rules:

Put backup infrastructure on a restricted management network.
Do not use the same credentials everywhere.
Limit who can reach PBS over the network.
Do not mount the backup datastore casually on random admin machines.
Treat the backup server as a critical security asset, not a storage afterthought.

If your Proxmox host, your PBS server and your admin workstation all sit on the same flat network with shared credentials, you have convenience, not resilience. The home lab segmentation guide applies here too. Good backup design and good network design are joined at the hip.
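As a small illustration of "limit who can reach PBS", here is how I might express the intended policy in Python so it can be checked against an inventory. The subnet and addresses are made-up examples, and this only models the rule. The actual enforcement still belongs in your firewall.

```python
import ipaddress

# Hypothetical management subnet allowed to reach the backup server.
ALLOWED_SOURCES = [ipaddress.ip_network("10.10.50.0/28")]


def may_reach_backup_server(source_ip):
    """Return True if the source address falls inside an allowed subnet."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in ALLOWED_SOURCES)


for host in ["10.10.50.12", "10.10.50.40", "192.168.1.50"]:
    print(host, may_reach_backup_server(host))
```

If the list of addresses that pass this check is longer than the list of people who genuinely administer backups, the network is flatter than you think.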

Off-Site Copies Are Where DR Becomes Real

Local backup is backup. Off-site backup is disaster recovery.

This is where PBS sync jobs are so useful. The off-site PBS adds your primary PBS as a remote and pulls its datastore contents into a local datastore on a schedule, which gives you a workable off-site pattern without building a completely separate toolchain. For small teams, that is attractive because it keeps operations consistent. Same interface, same restore logic, less context switching.

The point is not just geography. It is blast radius.

If the primary site is lost, encrypted or electrically unwell, you need a copy that was not affected by the same event. That might be a second office, a co-lo box, or a trusted remote location over a tunnel. However you do it, make sure the off-site path is tested and not just documented.
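Blast radius is easy to reason about if you list where each copy lives and then take something away. The following is a minimal Python sketch with invented site and host names. It only models location, so it will not tell you whether a compromised credential can reach every copy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    site: str
    host: str


def surviving_copies(copies, lost_site=None, lost_host=None):
    """Return names of copies not located at the lost site or on the lost host."""
    return [
        copy.name
        for copy in copies
        if copy.site != lost_site and copy.host != lost_host
    ]


# Hypothetical inventory of where backup copies live.
copies = [
    BackupCopy("local-datastore", site="office", host="pve1"),
    BackupCopy("dedicated-pbs", site="office", host="pbs1"),
    BackupCopy("offsite-pbs", site="colo", host="pbs2"),
]

print(surviving_copies(copies, lost_host="pve1"))
print(surviving_copies(copies, lost_site="office"))
```

If any plausible event returns an empty list, that event is the one your design does not survive.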

Verification and Restore Testing

This is the part that separates adults from optimists.

A backup job finishing successfully does not prove the backup is restorable. It proves a process completed. You still need to validate the result.

My rule is simple:

Verify backups routinely so corruption surfaces early.
Test restores monthly for at least one representative workload.
Time the restore so you know whether your recovery objectives are fantasy.
Document the gotchas you hit during restore, not just the happy path.

A sensible restore test matrix includes:

One small utility VM
One business-critical application VM
One container
One file-level restore from inside a backup

If you have never restored a Proxmox VM under pressure, do not assume you know how long it takes. You will discover little frictions you forgot to model: VLAN mappings, IP conflicts, DNS updates, missing credentials, application service dependencies. This is exactly why testing exists.
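Recording the timings somewhere you can compare them against targets makes the gap visible. This Python sketch uses invented workload names, measured times and targets purely as placeholders. Substitute the figures from your own drills.

```python
# Hypothetical results from a restore drill, in minutes.
RESTORE_TESTS = [
    {"workload": "utility-vm", "measured_min": 12, "target_min": 30},
    {"workload": "business-app-vm", "measured_min": 95, "target_min": 60},
    {"workload": "container", "measured_min": 6, "target_min": 15},
    {"workload": "file-level-restore", "measured_min": 20, "target_min": 20},
]


def missed_targets(tests):
    """Return (workload, overrun in minutes) for restores slower than their target."""
    return [
        (t["workload"], t["measured_min"] - t["target_min"])
        for t in tests
        if t["measured_min"] > t["target_min"]
    ]


print(missed_targets(RESTORE_TESTS))
```

Anything this reports is either a recovery objective to renegotiate or a restore path to speed up. Both are better discovered in a drill than in an incident.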

A Simple Recovery Runbook Template

Every Proxmox environment should have a short recovery runbook. Not a sixty-page binder. A short document somebody can use at 3 AM.

Mine would include:

Where backups live
Which credentials are needed and where they are stored
The order in which core systems should be restored
Network dependencies for each critical VM
Approximate restore times from the last successful test
Who signs off on failover or rebuild decisions

This is where technical recovery joins business reality. Restoring the wrong system first can waste hours. Your monitoring, identity and DNS services often matter more than the application people shout about first.
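Restore order is a dependency problem, so it can be derived rather than argued about. Here is a sketch using Python's standard `graphlib` module (3.9 or later). The system names and their dependencies are hypothetical, and your own graph will look different.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical systems; each maps to the services it needs running first.
DEPENDS_ON = {
    "dns": set(),
    "identity": {"dns"},
    "monitoring": {"dns"},
    "app-database": {"dns", "identity"},
    "app-frontend": {"app-database"},
}

# static_order() yields every system after the ones it depends on.
restore_order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(restore_order)
```

Several valid orders can exist, so treat the output as a constraint check rather than a fixed sequence. What matters is that nothing is restored before the services it depends on.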

Common Proxmox Backup Mistakes

I see the same issues repeatedly:

Backing up to the same failure domain. If your VM storage and backup storage rely on the same physical box, one hardware failure can take out production and its backups together.

No guest agent on important VMs. Snapshot backups are better with proper filesystem quiescing.

No retention logic. Backups either accumulate until storage fills or get pruned too aggressively.

No verification. Corruption remains invisible until the worst possible moment.

No restore drills. The team learns the process during an incident instead of before one.

No off-site copy. One bad event can wipe out both production and recovery.

These are all fixable. None of them are glamorous. They are also the difference between a manageable incident and a career-limiting week.

Where I Would Start This Week

If your current Proxmox backup setup is basic, do these in order:

Stand up a dedicated Proxmox Backup Server.
Move critical VM and container backups onto a nightly schedule.
Apply a sensible retention policy.
Restrict network access to PBS.
Run one restore test and record the actual time.
Add a remote PBS sync for off-site protection.

That sequence gets you from "we do backups" to "we have the start of a disaster recovery capability".

The main thing to remember is that Proxmox makes backup accessible, but accessibility is not the same as resilience. Resilience comes from separation, verification, testing and repetition. If you build those habits into your environment now, the next incident is still unpleasant, but it stops being existential.

And that is the real goal. Not perfect diagrams. Not backup dashboards full of green ticks. Just the quiet confidence that when something breaks, you know exactly how you are getting it back.
