PostgreSQL Restore Failures: It Wasn’t pgBackRest, It Was My Recovery Logic

Mohamed Hussain S

I was building and testing a PostgreSQL backup and restore workflow using pgBackRest.

The idea was simple:

  • take backups
  • restore them automatically
  • validate the database
  • make recovery predictable

Instead, I ended up repeatedly breaking PostgreSQL recovery itself.

At one point, PostgreSQL refused to start entirely, the application depending on it failed to start, and I started seeing errors like:

invalid checkpoint record
could not locate a valid checkpoint record at 0/DEAD

Later, I also hit timeline mismatch errors like:

ERROR: [058]: target timeline 3 forked from backup timeline 2

At first, I thought:

pgBackRest restores were corrupting PostgreSQL.

That assumption turned out to be completely wrong.

The real problem was the way I was handling recovery.


What I Was Building

I was testing a PostgreSQL backup/restore flow locally after repeated restore failures elsewhere.

To isolate the issue properly, I moved PostgreSQL onto my local machine and started testing the restore logic independently through API-triggered workflows.

The restore flow looked roughly like this:

  1. Download backup repo
  2. Stop PostgreSQL
  3. Restore backup
  4. Start PostgreSQL
  5. Validate database

Sounds straightforward.

It wasn't.
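Under the hood, that flow boiled down to shelling out to `pg_ctl` and pgBackRest. A minimal sketch of the idea, assuming a local stanza named `main` and a conventional data directory (both are my placeholders, not requirements):

```python
import subprocess

PG_DATA = "/var/lib/postgresql/data"   # assumed data directory
STANZA = "main"                        # assumed pgBackRest stanza name

def pgbackrest_cmd(*args):
    """Build a pgBackRest command line for the assumed stanza."""
    return ["pgbackrest", f"--stanza={STANZA}", *args]

def naive_restore_flow():
    """Steps 2-5 of the flow above (step 1, fetching the repo, omitted)."""
    subprocess.run(["pg_ctl", "-D", PG_DATA, "stop"], check=True)   # stop
    subprocess.run(pgbackrest_cmd("restore"), check=True)           # restore
    subprocess.run(["pg_ctl", "-D", PG_DATA, "start"], check=True)  # start
    subprocess.run(pgbackrest_cmd("check"), check=True)             # validate
```

This is roughly what my automation did at first, and it is exactly where the trouble started.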


The First Major Failure

After a restore attempt, PostgreSQL refused to start.

The logs looked like this:

LOG: database system was interrupted
LOG: invalid checkpoint record
PANIC: could not locate a valid checkpoint record at 0/DEAD

At that point:

  • PostgreSQL was down
  • the application couldn't start
  • authentication-related functionality stopped working
  • and repeated restore attempts made things even worse

What confused me initially was this:

The restore itself appeared to complete.

But PostgreSQL would immediately run into recovery problems afterward.


My Wrong Assumption

This was the real issue.

Every time recovery failed, I kept seeing files like:

  • backup_label
  • recovery.signal
  • standby.signal

So I assumed they were leftover artifacts from failed restores.

My restore automation started aggressively cleaning them up.

Something like this:

rm -f recovery.signal standby.signal backup_label

I genuinely believed this was helping PostgreSQL start cleanly.

In reality:

I was deleting the exact recovery metadata PostgreSQL needed.

That misunderstanding caused almost every major issue afterward.


What PostgreSQL Was Actually Trying To Do

This was the turning point.

pgBackRest wasn't randomly writing junk files into the data directory.

Those files exist for a reason.

During restore:

  • backup_label tells PostgreSQL where recovery should begin
  • recovery.signal tells PostgreSQL to enter recovery mode
  • WAL replay reconstructs a consistent database state

PostgreSQL was actually trying to perform a valid recovery process.

My automation kept interrupting or invalidating it.

Once I understood that, the entire problem started making sense.
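A debugging habit that would have saved me a lot of time: after a restore, check which recovery-control files are actually present before touching anything. A small sketch (the helper name is mine):

```python
from pathlib import Path

def recovery_files_present(pg_data_dir):
    """Report which recovery-control files exist in the data directory.

    These files are startup metadata PostgreSQL reads during recovery,
    not junk to be deleted.
    """
    names = ["backup_label", "recovery.signal", "standby.signal"]
    return {name: (Path(pg_data_dir) / name).exists() for name in names}
```

Seeing `backup_label` and `recovery.signal` present right after a restore is the expected state, not a sign of a failed previous attempt.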


The Recovery Loop Problem

Because my cleanup logic removed recovery metadata prematurely, PostgreSQL ended up in inconsistent states repeatedly.

Sometimes it would:

  • enter recovery mode
  • fail WAL replay
  • lose checkpoint continuity
  • refuse startup entirely

Other times it would partially start, but remain stuck in recovery mode.

That led to additional logic being added just to stabilize startup behavior.

For example:

SELECT pg_is_in_recovery();

and when required:

SELECT pg_promote();

The goal wasn't to "force PostgreSQL to work".

The goal was:

let PostgreSQL finish recovery properly, then promote only when necessary.

That distinction mattered a lot.
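That "promote only when necessary" logic can be expressed as a small helper over any DB-API connection (psycopg2 or similar). This is a sketch of the idea, not my exact code; `pg_is_in_recovery()` and `pg_promote()` are standard PostgreSQL functions (`pg_promote()` exists since version 12):

```python
def finish_recovery_if_needed(conn):
    """Promote only if the server is still in recovery after startup.

    Returns True if a promotion was issued.
    `conn` is any DB-API connection to the restored server.
    """
    with conn.cursor() as cur:
        cur.execute("SELECT pg_is_in_recovery();")
        (in_recovery,) = cur.fetchone()
        if in_recovery:
            # pg_promote() waits, by default, until promotion completes
            cur.execute("SELECT pg_promote();")
        return in_recovery
```

The key design point: the check comes first, and promotion is the exception path, so a normally completed recovery is left alone.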


The Timeline Mismatch Error

At one stage, I also hit this:

ERROR: [058]: target timeline 3 forked from backup timeline 2

This one was especially confusing at first.

The issue was not just corrupted startup state anymore.

Now PostgreSQL was rejecting WAL history itself.

This happened because earlier restore attempts had already created inconsistent recovery timelines.

I had essentially created multiple broken recovery histories while repeatedly testing and modifying the restore process.

That was another important lesson:

PostgreSQL backups are not just data files.
They are tightly connected to WAL history and recovery timelines.

At this point, I realized I was no longer debugging a simple restore failure. I was debugging recovery history itself.
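What finally helped here was looking at the cluster's actual timeline instead of guessing. `pg_controldata` reports it; a sketch that shells out and parses the relevant line (the helper functions are mine, while the "Latest checkpoint's TimeLineID" label is standard `pg_controldata` output):

```python
import re
import subprocess

def parse_timeline(controldata_output):
    """Extract the current timeline ID from pg_controldata output."""
    match = re.search(r"Latest checkpoint's TimeLineID:\s*(\d+)",
                      controldata_output)
    return int(match.group(1)) if match else None

def current_timeline(pg_data_dir):
    """Return the timeline of the cluster in `pg_data_dir`."""
    out = subprocess.run(["pg_controldata", pg_data_dir],
                         capture_output=True, text=True, check=True).stdout
    return parse_timeline(out)
```

Comparing this value before and after each restore attempt makes "forked from backup timeline" errors much less mysterious.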


The Real Problem In My Restore Flow

Initially, my restore logic tried to "fix" PostgreSQL after restore.

That approach was fundamentally flawed.

The older flow looked roughly like this:

| Old approach | Problem |
| --- | --- |
| Delta restore | Mixed old/new recovery state |
| Delete backup_label | Broke recovery metadata |
| Delete recovery.signal | Interrupted recovery |
| Force archive changes | Caused WAL continuity issues |
| Hope PostgreSQL starts | No validation or recovery awareness |

I was treating recovery artifacts like corruption.

They weren't corruption.

They were part of PostgreSQL recovery itself.


The Change That Finally Fixed It

The biggest realization was this:

Stop fighting PostgreSQL recovery.

Instead of trying to manually "clean up" PostgreSQL after restore, I changed the restore flow completely.

The corrected restore flow became:

  1. Stop PostgreSQL cleanly
  2. Completely empty the data directory
  3. Run pgBackRest restore properly
  4. Let PostgreSQL recover normally
  5. Wait for readiness
  6. Promote only if recovery mode persists
  7. Validate using pgBackRest check

The critical change was this:

self._run_pgbackrest("restore", "--type=immediate")

And equally important:

self._empty_directory(self.pg_data_dir)

Instead of attempting partial or delta-style recovery cleanup, the restore process now starts from a completely clean data directory.

That eliminated a huge amount of inconsistent state.
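The "empty the data directory" step is worth getting right: PostgreSQL expects the data directory itself to exist with the correct ownership and permissions, so the sketch below clears its contents rather than deleting and recreating it. The name mirrors my `_empty_directory` helper but is illustrative only, and you should be extremely careful what path you hand it:

```python
import shutil
from pathlib import Path

def empty_directory(path):
    """Delete everything inside `path`, keeping the directory itself."""
    for child in Path(path).iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            # files and symlinks (e.g. a symlinked pg_wal) are unlinked
            child.unlink()
```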


Why --type=immediate Helped

This turned out to be extremely important.

--type=immediate tells pgBackRest to configure a recovery target of "immediate":

end recovery as soon as the restored cluster reaches a consistent state (the end of the backup), instead of replaying all available WAL.

That meant:

  • PostgreSQL could perform proper WAL-based recovery
  • recovery metadata stayed intact
  • WAL replay remained valid
  • timeline handling became predictable

Most importantly:

PostgreSQL itself was finally allowed to control recovery correctly.
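Put together, the restore itself is a single pgBackRest invocation. A sketch of how the command line is assembled (the stanza name and path are placeholders; `--stanza`, `--pg1-path`, and `--type` are real pgBackRest options):

```python
def restore_immediate_cmd(stanza, pg_data_dir):
    """pgBackRest restore that ends recovery at the first consistent point."""
    return ["pgbackrest", f"--stanza={stanza}", f"--pg1-path={pg_data_dir}",
            "--type=immediate", "restore"]
```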


The Mistake That Increased the Blast Radius

One thing I learned the hard way:

Never test restore automation against a database actively used by an application.

Even though this was a testing workflow, the PostgreSQL instance was still tied to application startup behavior.

So whenever PostgreSQL failed:

  • application startup failed too
  • user-related functionality broke
  • debugging became much harder under pressure

After repeated failures, I moved the restore testing flow entirely onto my local machine and isolated PostgreSQL from the rest of the application stack.

That made debugging significantly easier.


Another Subtle Issue: Backup Failures After Restore

I also ran into another confusing problem after some restore attempts.

In certain cases, subsequent backups started failing unexpectedly after a restore.

Part of the issue came from mixing:

  • restore operations
  • delta-style restore assumptions
  • and archive/WAL state inconsistencies

At one stage, I was also toggling archive-related behavior incorrectly during recovery experiments, which further complicated WAL continuity.

This reinforced another important realization:

PostgreSQL backups are tightly coupled with WAL history and recovery timelines.

Even when the database appears to start correctly, inconsistent recovery state can break future backup behavior in subtle ways.


What I Learned From This

This experience completely changed how I think about PostgreSQL recovery.

Some major lessons:

  • backup_label and recovery.signal are not garbage files
  • PostgreSQL recovery is heavily WAL-dependent
  • Timelines matter more than most people realize
  • Partial cleanup creates inconsistent recovery states
  • A clean restore is often safer than trying to "repair" recovery manually
  • pgBackRest already knows how to orchestrate PostgreSQL recovery properly
  • Restore validation matters as much as backup creation
  • Backup testing should happen in isolated environments

Most importantly:

PostgreSQL recovery is not something you should "fight".

Once I stopped trying to override recovery behavior manually and instead allowed PostgreSQL + pgBackRest to handle recovery the way they were designed to, the restore flow finally became stable.


The Final Restore Flow That Actually Worked

After multiple failed recovery attempts, timeline mismatches, and broken startup states, I stopped trying to manually "fix" PostgreSQL recovery and instead simplified the restore process completely.

The final stable flow looked roughly like this:

# simplified restore flow

stop_postgres()                         # 1. stop cleanly
empty_data_directory()                  # 2. start from a clean slate
pgbackrest_restore("--type=immediate")  # 3. restore properly
start_postgres()                        # 4. let PostgreSQL recover
wait_for_connection()                   # 5. wait for readiness

if postgres_is_in_recovery():           # 6. promote only if needed
    promote_postgres()

pgbackrest_check()                      # 7. validate

The important part here is not the code itself.

It's the recovery philosophy behind it.
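That said, one helper in the sketch is worth spelling out, because `wait_for_connection()` is often implemented as a fixed sleep, which is exactly the kind of guessing that caused trouble earlier. A polling version (`pg_isready` is the standard PostgreSQL client tool; the generic `wait_for` helper is my own):

```python
import subprocess
import time

def pg_isready(host="localhost", port=5432):
    """True when the server accepts connections (pg_isready exits 0)."""
    result = subprocess.run(["pg_isready", "-h", host, "-p", str(port)],
                            capture_output=True)
    return result.returncode == 0

def wait_for(check, timeout=60.0, interval=1.0):
    """Poll `check` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False
```

Usage would be something like `wait_for(pg_isready, timeout=120)` right after `start_postgres()`.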

The earlier versions of my restore logic tried to:

  • partially clean recovery state
  • remove recovery metadata
  • force PostgreSQL out of recovery
  • preserve old data directory state

That approach kept creating inconsistent recovery conditions.

The corrected flow instead does three important things:

  • starts from a completely clean data directory
  • lets pgBackRest manage recovery metadata properly
  • allows PostgreSQL to perform WAL recovery the way it was designed to

The biggest change was no longer treating files like backup_label or recovery.signal as corruption artifacts.

They were part of the recovery process itself.


Final Thought

At the beginning, I thought PostgreSQL restores were failing because the database was corrupted.

In reality, the corruption was coming from my own recovery assumptions.

The system wasn't broken.

My mental model of PostgreSQL recovery was.
