Schrödinger’s Backups: A Lesson in Uncertainty

Mayur (Do not drink & database)
5 min readSep 12, 2024

--

Backups are a lot like Schrödinger’s cat. You think you have them, but when it’s time to check, they might not be what you expect! Even with the best tools in your arsenal, like pgBackRest, things can still go hilariously wrong. It’s not that pgBackRest falters; instead, the chaos is caused by human miscommunication and policy missteps. It’s a reminder that no matter how sharp the tool, it’s all about how you wield it!

So, here’s the deal. A big miscommunication-mismanagement-misalignment went down between the top level decision-makers behind the one of my previous company’s backup retention policy. Result? BOOM. Backup retention was set to 60 days in pgbackrest stanza, instead of the seven years as written in the some obscure corner in the confluence (and it was a single stanza, no typical hot/cold separation omg!!). One fine day, I went hunting for an old backup after a developer mistake in code deleted last quarters data (Hence needed recovery from a old backup)… and there was nothing! Zero, zilch, zip, nada, nothing.

This blog by pgstef is an excellent overview of retention policy for those new to the postgres. We had weekly automated backup restore tests configured, but they were ultimately futile in safeguarding against such a significant mistake by management in defining the retention policy.

Time based retention option, anything older will be removed.

But wait, it gets better (or worse). The company had a second backup server with different pgbackrest stanza for a few more databases where retention was set to seven years as expected. Unfortunately, they had lumped cold and hot backups into a single stanza. What did this lead to? WAL archives going back to the dawn of time itself — like, 3.2 billion years ago. The storage ballooned to petabytes. PETABYTES. Yes, you read that right. It was as if the company was trying to archive the entire universe. Talk about overkill.

I can brag of working on Petabytescale data

Of course, we eventually separated the cold and hot backup stanzas for each database cluster. You’d think that would resolve the issue, right? Well, the quantum curse wasn’t entirely lifted. A single 32TB database took over an entire day to back up! Just imagine the Recovery Time Objective (RTO) for a point-in-time recovery (PITR). 😭 Just contemplating it was enough to break our spirits. But as they say, we don’t bow to the god of death just yet. We upgraded pgBackRest from v1.1 to v2.46, increased the process-max to 12, switched compression to lz4, upgraded the instance to r5b.16xlarge (64vCPU,512GB ram) and changed the storage to io2 blockexpress (64K IOPS). This brought the full backup time down to 4 hours, with incrementals taking about 1 hour. Nonetheless, there’s still a looming threat: if the database growth rate isn’t curbed, the RTO will continue to rise.

Enter The PostgreSQL 17 :

Enter PostgreSQL 17 (Still early days, just on RC1 so this is not yet in Production). With this, we could do a full backup once and then rely on incremental backups going forward. No more waiting for days on end. For month-end and year-end backups? Just restore them on ZFS with high compression, and shut down the database when not in use. No more insane delays. Just streamlined, efficient recovery and peace of mind.

1. Performing the Initial Full Backup

To set up your backup strategy, start with a full backup of your database.

2. Creating Incremental Backups

After completing the full backup, you can perform incremental backups to capture only the changes made since the last backup.

3. Using pg_combinebackup to Merge and Propagate Full Backups

The pg_combinebackup utility is introduced to manage and combine incremental backups efficiently. This utility allows you to merge incremental backups with a previous full backup, creating a new, up-to-date full backup without having to perform a full backup from scratch.

Below are the steps demonstrating above concept tested on Postgres-v17rc1

A) Make sure wal summary is turned on before you take 1st full backup.

B) Sample database and dummy data gen for tests (note: I will increment data in between incremental backup).

C) Create first full backup using pg_baseback in the usual way

D) Create next incrementals and change data for tests

1st incremental uses backup_manifest of First full, second incremental we can use either 1st incremental or first full. Here we use 1st incremental as a base for 2nd incremental.

1st incremental backup
2nd incremental backup

E) Use pg_combinebackup to shift full backup forward in time without taking another full backup

Combined full backup, 2 incrementals and stored it in another folder for restoration

F) Restore new combined full backup and compare data

I will restore it on another port 6666 for the sake of testing and easy comparison for the blog format, you should always restore on a different host to minimize risks.

Restoring database on port 6666
Comparing data

PostgreSQL 17’s native incremental backup may not fully address all the previously discussed challenges by default, but it certainly simplifies key aspects of backup management. It enables easier implementation of retention policies, facilitates hot and cold data separation, and allows for a single full backup followed by incremental backups indefinitely, significantly improving recovery time objectives (RTO).

Finally, We have defeated schrodinger and kept our backup alive.

PS: Quantum Gravity : Optimizing Cold Storage with ZFS

It’s a wild idea, not yet used in Production(that I know of). Try ZFS for high compression and effective storage management. You can restore backups on ZFS, apply high compression (we were able to achieve ~6x compression ratio with zstd) and keep restored db but turned off, isolate machine from external connections completely and you have a very efficient cold backup with a 100% guaranteed restored Postgres.

--

--

Mayur (Do not drink & database)
Mayur (Do not drink & database)

Written by Mayur (Do not drink & database)

Database Specialist @Veeam, Ex-Wise. Data, Scaling and Database Comedy.

No responses yet