I was building and testing a PostgreSQL backup and restore workflow using pgBackRest.
The idea was simple:
- take backups
- restore them automatically
- validate the database
- make recovery predictable
Instead, I ended up repeatedly breaking PostgreSQL recovery itself.
At one point, PostgreSQL refused to start entirely, the application depending on it failed to start, and I started seeing errors like:
invalid checkpoint record
could not locate a valid checkpoint record at 0/DEAD
Later, I also hit timeline mismatch errors like:
ERROR: [058]: target timeline 3 forked from backup timeline 2
At first, I thought:
pgBackRest restores were corrupting PostgreSQL.
That assumption turned out to be completely wrong.
The real problem was the way I was handling recovery.
What I Was Building
I was testing a PostgreSQL backup/restore flow locally after repeated restore failures elsewhere.
To isolate the issue properly, I moved PostgreSQL onto my local machine and started testing the restore logic independently through API-triggered workflows.
The restore flow looked roughly like this:
- Download backup repo
- Stop PostgreSQL
- Restore backup
- Start PostgreSQL
- Validate database
Sounds straightforward.
It wasn't.
The First Major Failure
After a restore attempt, PostgreSQL refused to start.
The logs looked like this:
LOG: database system was interrupted
LOG: invalid checkpoint record
PANIC: could not locate a valid checkpoint record at 0/DEAD
At that point:
- PostgreSQL was down
- the application couldn't start
- authentication-related functionality stopped working
- and repeated restore attempts made things even worse
What confused me initially was this:
The restore itself appeared to complete.
But PostgreSQL would immediately run into recovery problems afterward.
My Wrong Assumption
This was the real issue.
Every time recovery failed, I kept seeing files like:
- backup_label
- recovery.signal
- standby.signal
So I assumed they were leftover artifacts from failed restores.
My restore automation started aggressively cleaning them up.
Something like this:
rm -f recovery.signal standby.signal backup_label
I genuinely believed this was helping PostgreSQL start cleanly.
In reality:
I was deleting the exact recovery metadata PostgreSQL needed.
That misunderstanding caused almost every major issue afterward.
What PostgreSQL Was Actually Trying To Do
This was the turning point.
pgBackRest wasn't randomly writing junk files into the data directory.
Those files exist for a reason.
During restore:
- backup_label tells PostgreSQL where recovery should begin
- recovery.signal tells PostgreSQL to enter recovery mode
- WAL replay reconstructs a consistent database state
PostgreSQL was actually trying to perform a valid recovery process.
My automation kept interrupting or invalidating it.
Once I understood that, the entire problem started making sense.
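Once that clicked, the right move was to inspect these files instead of deleting them. A tiny illustrative sketch of that shift in mindset (the data directory path and helper name are placeholders, not my actual code):

```python
# Illustrative only: instead of deleting recovery metadata, surface it so you
# can see what PostgreSQL is about to do on startup.
# PG_DATA_DIR is an assumed path; adjust for your environment.
from pathlib import Path

PG_DATA_DIR = Path("/var/lib/postgresql/data")

def describe_recovery_state(data_dir: Path = PG_DATA_DIR) -> None:
    meanings = {
        "backup_label": "recovery will start from the backup's start checkpoint",
        "recovery.signal": "PostgreSQL will enter targeted recovery mode",
        "standby.signal": "PostgreSQL will start as a standby",
    }
    for name, meaning in meanings.items():
        state = "present" if (data_dir / name).exists() else "absent"
        print(f"{name}: {state} ({meaning})")

describe_recovery_state()
```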
The Recovery Loop Problem
Because my cleanup logic removed recovery metadata prematurely, PostgreSQL ended up in inconsistent states repeatedly.
Sometimes it would:
- enter recovery mode
- fail WAL replay
- lose checkpoint continuity
- refuse startup entirely
Other times it would partially start, but remain stuck in recovery mode.
That led me to add extra logic just to stabilize startup behavior.
For example:
SELECT pg_is_in_recovery();
and when required:
SELECT pg_promote();
The goal wasn't to "force PostgreSQL to work".
The goal was:
let PostgreSQL finish recovery properly, then promote only when necessary.
That distinction mattered a lot.
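Conceptually, that logic looked something like the sketch below: poll pg_is_in_recovery() and call pg_promote() only once recovery has clearly stalled. This is a rough illustration using psycopg2; the DSN, timeout, and helper name are placeholders, not my exact code:

```python
# Poll recovery status and promote only if recovery mode persists past a deadline.
import time
import psycopg2

def promote_if_stuck_in_recovery(dsn: str, timeout_s: int = 120, interval_s: int = 5) -> None:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute("SELECT pg_is_in_recovery();")
            if not cur.fetchone()[0]:
                return  # recovery finished on its own; nothing to do
        time.sleep(interval_s)
    # Recovery mode persisted past the timeout: promote explicitly.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT pg_promote();")
```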
The Timeline Mismatch Error
At one stage, I also hit this:
ERROR: [058]: target timeline 3 forked from backup timeline 2
This one was especially confusing at first.
The issue was not just corrupted startup state anymore.
Now PostgreSQL was rejecting WAL history itself.
This happened because earlier restore attempts had already created inconsistent recovery timelines.
I had essentially created multiple broken recovery histories while repeatedly testing and modifying the restore process.
That was another important lesson:
PostgreSQL backups are not just data files.
They are tightly connected to WAL history and recovery timelines.
At this point, I realized I was no longer debugging a simple restore failure. I was debugging recovery history itself.
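If you hit this, it helps to check which timeline the restored cluster is actually on before attempting yet another restore. A small read-only sketch (the DSN is a placeholder, not my setup):

```python
# Read the cluster's current timeline and checkpoint info; useful when a
# restore complains about forked timelines.
import psycopg2

with psycopg2.connect("dbname=postgres") as conn, conn.cursor() as cur:
    cur.execute("SELECT timeline_id, checkpoint_lsn, redo_lsn FROM pg_control_checkpoint();")
    timeline_id, checkpoint_lsn, redo_lsn = cur.fetchone()
    print(f"timeline={timeline_id} checkpoint={checkpoint_lsn} redo={redo_lsn}")
```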
The Real Problem In My Restore Flow
Initially, my restore logic tried to "fix" PostgreSQL after restore.
That approach was fundamentally flawed.
The older flow looked roughly like this:
| Old Approach | Problem |
|---|---|
| Delta restore | Mixed old/new recovery state |
| Delete backup_label | Broke recovery metadata |
| Delete recovery.signal | Interrupted recovery |
| Force archive changes | Caused WAL continuity issues |
| Hope PostgreSQL starts | No validation or recovery awareness |
I was treating recovery artifacts like corruption.
They weren't corruption.
They were part of PostgreSQL recovery itself.
The Change That Finally Fixed It
The biggest realization was this:
Stop fighting PostgreSQL recovery.
Instead of trying to manually "clean up" PostgreSQL after restore, I changed the restore flow completely.
The corrected restore flow became:
- Stop PostgreSQL cleanly
- Completely empty the data directory
- Run pgBackRest restore properly
- Let PostgreSQL recover normally
- Wait for readiness
- Promote only if recovery mode persists
- Validate using pgBackRest check
The critical change was this:
self._run_pgbackrest("restore", "--type=immediate")
And equally important:
self._empty_directory(self.pg_data_dir)
Instead of attempting partial or delta-style recovery cleanup, the restore process now starts from a completely clean data directory.
That eliminated a huge amount of inconsistent state.
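For clarity, _empty_directory is just a helper that clears out everything inside PGDATA without removing the directory itself, so ownership and permissions survive. Roughly something like this sketch (names are illustrative, not my exact implementation):

```python
# Remove all contents of a directory while keeping the directory itself.
import shutil
from pathlib import Path

def empty_directory(path: str) -> None:
    root = Path(path)
    for entry in root.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)   # regular subdirectories (base, pg_wal, ...)
        else:
            entry.unlink()         # files and symlinks
```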
Why --type=immediate Helped
This turned out to be extremely important.
--type=immediate tells pgBackRest to set a recovery target of "immediate":
recovery stops as soon as the restored backup reaches a consistent state, instead of replaying every archived WAL segment.
That meant:
- PostgreSQL could perform proper WAL-based recovery
- recovery metadata stayed intact
- WAL replay remained valid
- timeline handling became predictable
Most importantly:
PostgreSQL itself was finally allowed to control recovery correctly.
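For context, a wrapper like _run_pgbackrest typically just shells out to the pgBackRest CLI. A minimal sketch of the idea, assuming a stanza named main (yours will differ):

```python
# Thin wrapper around the pgBackRest CLI; "main" is an assumed stanza name.
import subprocess

PGBACKREST_STANZA = "main"

def run_pgbackrest(command: str, *options: str) -> None:
    argv = ["pgbackrest", f"--stanza={PGBACKREST_STANZA}", *options, command]
    subprocess.run(argv, check=True)

# Restore the latest backup and stop recovery once consistency is reached:
run_pgbackrest("restore", "--type=immediate")
```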
The Mistake That Increased the Blast Radius
One thing I learned the hard way:
Never test restore automation against a database actively used by an application.
Even though this was a testing workflow, the PostgreSQL instance was still tied to application startup behavior.
So whenever PostgreSQL failed:
- application startup failed too
- user-related functionality broke
- debugging became much harder under pressure
After repeated failures, I moved the restore testing flow entirely onto my local machine and isolated PostgreSQL from the rest of the application stack.
That made debugging significantly easier.
Another Subtle Issue: Backup Failures After Restore
I also ran into another confusing problem after some restore attempts.
In certain cases, subsequent backups started failing unexpectedly after a restore.
Part of the issue came from mixing:
- restore operations
- delta-style restore assumptions
- and archive/WAL state inconsistencies
At one stage, I was also toggling archive-related behavior incorrectly during recovery experiments, which further complicated WAL continuity.
This reinforced another important realization:
PostgreSQL backups are tightly coupled with WAL history and recovery timelines.
Even when the database appears to start correctly, inconsistent recovery state can break future backup behavior in subtle ways.
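A sanity check that helps here is to confirm the archiving settings and let pgBackRest verify the archive path before trusting any new backups. A small sketch (the DSN and stanza name are assumptions):

```python
# Post-restore check: confirm WAL archiving settings, then run `pgbackrest check`,
# which forces a WAL switch and verifies the segment reaches the repository.
import subprocess
import psycopg2

def verify_archiving(dsn: str = "dbname=postgres", stanza: str = "main") -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SHOW archive_mode;")
        print("archive_mode =", cur.fetchone()[0])
        cur.execute("SHOW archive_command;")
        print("archive_command =", cur.fetchone()[0])
    subprocess.run(["pgbackrest", f"--stanza={stanza}", "check"], check=True)
```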
What I Learned From This
This experience completely changed how I think about PostgreSQL recovery.
Some major lessons:
- backup_label and recovery.signal are not garbage files
- PostgreSQL recovery is heavily WAL-dependent
- Timelines matter more than most people realize
- Partial cleanup creates inconsistent recovery states
- A clean restore is often safer than trying to "repair" recovery manually
- pgBackRest already knows how to orchestrate PostgreSQL recovery properly
- Restore validation matters as much as backup creation
- Backup testing should happen in isolated environments
Most importantly:
PostgreSQL recovery is not something you should "fight".
Once I stopped trying to override recovery behavior manually and instead allowed PostgreSQL + pgBackRest to handle recovery the way they were designed to, the restore flow finally became stable.
The Final Restore Flow That Actually Worked
After multiple failed recovery attempts, timeline mismatches, and broken startup states, I stopped trying to manually "fix" PostgreSQL recovery and instead simplified the restore process completely.
The final stable flow looked roughly like this:
# simplified restore flow
stop_postgres()                          # stop the server cleanly
empty_data_directory()                   # start from a completely empty PGDATA
pgbackrest_restore("--type=immediate")   # let pgBackRest lay down data files and recovery metadata
start_postgres()                         # PostgreSQL replays WAL and recovers on its own
wait_for_connection()
if postgres_is_in_recovery():
    promote_postgres()                   # promote only if recovery mode persists
pgbackrest_check()                       # validate the stanza after restore
The important part here is not the code itself.
It's the recovery philosophy behind it.
The earlier versions of my restore logic tried to:
- partially clean recovery state
- remove recovery metadata
- force PostgreSQL out of recovery
- preserve old data directory state
That approach kept creating inconsistent recovery conditions.
The corrected flow instead does three important things:
- starts from a completely clean data directory
- lets pgBackRest manage recovery metadata properly
- allows PostgreSQL to perform WAL recovery the way it was designed to
The biggest change was no longer treating files like backup_label or recovery.signal as corruption artifacts.
They were part of the recovery process itself.
Final Thought
At the beginning, I thought PostgreSQL restores were failing because the database was corrupted.
In reality, the corruption was coming from my own recovery assumptions.
The system wasn't broken.
My mental model of PostgreSQL recovery was.