Schrödinger’s Backups: A Lesson in Uncertainty

Backups are a lot like Schrödinger’s cat. You think you have them, but when it’s time to check, they might not be what you expect! Even with the best tools in your arsenal, like pgBackRest, things can still go hilariously wrong. It’s not that pgBackRest falters; instead, the chaos is caused by human miscommunication and policy missteps. It’s a reminder that no matter how sharp the tool, it’s all about how you wield it!


Incident 1:

So, here’s the deal. A big miscommunication-mismanagement-misalignment went down between the top-level decision-makers behind one of my previous company’s backup retention policies. Result? BOOM. Backup retention was set to 60 days in the pgBackRest stanza, instead of the seven years written in some obscure corner of Confluence (and it was a single stanza, with no typical hot/cold separation, omg!!). One fine day, I went hunting for an old backup after a developer mistake in code deleted last quarter’s data (hence the need to recover from an old backup)… and there was nothing! Zero, zilch, zip, nada, nothing.


This blog by pgstef is an excellent overview of retention policies for those new to Postgres. We had weekly automated backup restore tests configured, but they were ultimately futile in safeguarding against such a significant mistake by management in defining the retention policy.
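For context, a minimal sketch of what such a misconfigured stanza might look like (paths, stanza name, and values are assumptions for illustration, not the actual setup from the incident):

```ini
[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full-type=time
repo1-retention-full=60        # 60 days of full backups kept -- the misconfigured value
# repo1-retention-full=2555    # roughly seven years, what the written policy required

[prod-db]
pg1-path=/var/lib/postgresql/data
```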


Incident 2:

But wait, it gets better (or worse). The company had a second backup server with a different pgBackRest stanza for a few more databases, where retention was set to seven years as expected. Unfortunately, they had lumped cold and hot backups into a single stanza. What did this lead to? WAL archives going back to the dawn of time itself, like 3.2 billion years ago. The storage ballooned to petabytes. PETABYTES. Yes, you read that right. It was as if the company was trying to archive the entire universe. Talk about overkill.


Problem 3:

Of course, we eventually separated the cold and hot backup stanzas for each database cluster. You’d think that would resolve the issue, right? Well, the quantum curse wasn’t entirely lifted. A single 32TB database took over an entire day to back up! Just imagine the Recovery Time Objective (RTO) for a point-in-time recovery (PITR). 😭 Just contemplating it was enough to break our spirits. But as they say, we don’t bow to the god of death just yet.
We upgraded pgBackRest from v1.1 to v2.46, increased process-max to 12, switched compression to lz4, upgraded the instance to r5b.16xlarge (64 vCPU, 512 GB RAM), and changed the storage to io2 Block Express (64K IOPS). This brought the full backup time down to 4 hours, with incrementals taking about 1 hour. Nonetheless, there’s still a looming threat: if the database growth rate isn’t curbed, the RTO will continue to rise.
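A minimal sketch of the relevant pgBackRest settings described above (stanza name and path are assumptions):

```ini
[global]
process-max=12        # run backup/restore with 12 parallel processes
compress-type=lz4     # much faster than the default gzip for large databases

[prod-db]
pg1-path=/var/lib/postgresql/data
```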

Enter PostgreSQL 17 to Save the Day:

Enter PostgreSQL 17 (I wrote this article when PG17 RC1 was released and used it for the demo; you should use the GA version). With this, we could do a full backup once and then rely on incremental backups going forward. No more waiting for days on end. For month-end and year-end backups? Just restore them on ZFS with high compression, and shut down the database when not in use. No more insane delays. Just streamlined, efficient recovery and peace of mind.

1. Performing the Initial Full Backup
To set up your backup strategy, start with a full backup of your database.

2. Creating Incremental Backups
After completing the full backup, you can perform incremental backups to capture only the changes made since the last backup.

3. Using pg_combinebackup to Merge and Propagate Full Backups
The pg_combinebackup utility, introduced in PostgreSQL 17, manages and combines incremental backups efficiently. It allows you to merge incremental backups with a previous full backup, creating a new, up-to-date full backup without having to perform a full backup from scratch.

Below are the steps demonstrating the above concepts, tested on Postgres v17rc1:

A) Make sure WAL summarization is turned on before you take the first full backup.

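Roughly, enabling WAL summarization looks like this in psql (a sketch, not the exact commands from the original screenshots):

```sql
-- summarize_wal must be on for PG17 incremental backups;
-- it can be changed with a reload, no restart needed.
ALTER SYSTEM SET summarize_wal = on;
SELECT pg_reload_conf();
SHOW summarize_wal;   -- should report 'on'
```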

B) Sample database and dummy data generation for tests (note: I will add data in between the incremental backups).

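Something along these lines, run in psql (the database name, table, and row counts are illustrative assumptions):

```sql
-- Create a demo database and load some dummy rows
CREATE DATABASE demo;
\c demo
CREATE TABLE t_events (id bigint PRIMARY KEY, payload text);
INSERT INTO t_events
SELECT g, md5(g::text) FROM generate_series(1, 1000000) AS g;
SELECT count(*) FROM t_events;
```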

C) Create the first full backup using pg_basebackup in the usual way

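For example (the target directory is an assumption):

```bash
# Plain-format full base backup with streamed WAL, fast checkpoint, and progress reporting
pg_basebackup -D /backups/full -Fp -Xstream -c fast -P
```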

D) Create next incrementals and change data for tests

The 1st incremental uses the backup_manifest of the first full backup; for the 2nd incremental we can use either the 1st incremental or the first full. Here we use the 1st incremental as the base for the 2nd incremental.

1st incremental

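A sketch of the first incremental, reusing the paths from the full backup above:

```bash
# Add some data before the incremental so there is something to capture (illustrative)
psql -d demo -c "INSERT INTO t_events SELECT g, md5(g::text) FROM generate_series(1000001, 1200000) g;"

# 1st incremental, based on the manifest of the full backup
pg_basebackup -D /backups/incr1 --incremental=/backups/full/backup_manifest -Fp -Xstream -P
```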

2nd incremental

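And the second incremental, chained off the first incremental’s manifest:

```bash
# More data changes between backups (illustrative)
psql -d demo -c "INSERT INTO t_events SELECT g, md5(g::text) FROM generate_series(1200001, 1400000) g;"

# 2nd incremental, based on the 1st incremental's manifest
pg_basebackup -D /backups/incr2 --incremental=/backups/incr1/backup_manifest -Fp -Xstream -P
```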

E) Use pg_combinebackup to shift the full backup forward in time without taking another full backup.

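Roughly (the output directory is an assumption; list the backups from oldest to newest):

```bash
# Merge the full backup and both incrementals into a new synthetic full backup
pg_combinebackup /backups/full /backups/incr1 /backups/incr2 -o /backups/combined_full
```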

This is a very powerful concept: we are moving the full backup ahead in time and bringing it closer to the most recent incremental. Because of this, we not only save on resources but also reduce recovery time significantly.

F) Restore new combined full backup and compare data

I will restore it on another port (6666) for the sake of testing and easy comparison in the blog format; you should always restore on a different host to minimize risk.

Restoring database on port 6666:
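A sketch, assuming the combined backup directory from step E:

```bash
# The combined backup is a regular data directory; start it on a different port for testing
chmod 700 /backups/combined_full
pg_ctl -D /backups/combined_full -o "-p 6666" start
```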

Verifying data by comparing restored vs original

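For instance, comparing row counts between the two instances (table and database names follow the assumptions above):

```bash
# Restored instance on port 6666 vs. the original on 5432
psql -p 6666 -d demo -tAc "SELECT count(*) FROM t_events;"
psql -p 5432 -d demo -tAc "SELECT count(*) FROM t_events;"
```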

PostgreSQL 17’s native incremental backup may not address human-induced challenges by default, but it certainly simplifies key aspects of backup management. It enables easier implementation of retention policies, simplifies hot and cold data separation, and allows for a single full backup followed by incremental backups indefinitely, significantly improving recovery time objectives (RTO).

Finally, we have defeated Schrödinger and kept our backups alive.


PS: Quantum Gravity: Optimizing Cold Storage with ZFS
It’s a wild idea, not yet used in production (that I know of). Try ZFS for high compression and effective storage management. You can restore backups on ZFS, apply high compression (we were able to achieve a ~6x compression ratio with zstd), keep the restored DB shut down, and isolate the machine from external connections completely, and you have a very efficient cold backup with a 100% guaranteed restored Postgres.
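A minimal sketch of such a setup (the pool, device, and dataset names are assumptions):

```bash
# Create a ZFS dataset with zstd compression for cold backups
zpool create coldpool /dev/nvme1n1
zfs create -o compression=zstd -o mountpoint=/coldbackups coldpool/pg

# Restore the combined backup onto the compressed dataset, then keep the instance shut down
cp -a /backups/combined_full /coldbackups/pg17_cold
zfs get compressratio coldpool/pg
```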
