I was building and testing a PostgreSQL backup and restore workflow using pgBackRest.
The idea was simple:
- take backups
- restore them automatically
- validate the database
- make recovery predictable
Instead, I ended up repeatedly breaking PostgreSQL recovery itself.
At one point, PostgreSQL refused to start entirely, the application depending on it failed to start, and I started seeing errors like:
invalid checkpoint record
could not locate a valid checkpoint record at 0/DEAD
Later, I also hit timeline mismatch errors like:
ERROR: [058]: target timeline 3 forked from backup timeline 2
At first, I thought:
pgBackRest restores were corrupting PostgreSQL.
That assumption turned out to be completely wrong.
The real problem was the way I was handling recovery.
What I Was Building
I was testing a PostgreSQL backup/restore flow locally after repeated restore failures elsewhere.
To isolate the issue properly, I moved PostgreSQL onto my local machine and started testing the restore logic independently through API-triggered workflows.
The restore flow looked roughly like this:
- Download backup repo
- Stop PostgreSQL
- Restore backup
- Start PostgreSQL
- Validate database
Sounds straightforward.
It wasn't.
The First Major Failure
After a restore attempt, PostgreSQL refused to start.
The logs looked like this:
LOG: database system was interrupted
LOG: invalid checkpoint record
PANIC: could not locate a valid checkpoint record at 0/DEAD
At that point:
- PostgreSQL was down
- the application couldn't start
- authentication-related functionality stopped working
- and repeated restore attempts made things even worse
What confused me initially was this:
The restore itself appeared to complete.
But PostgreSQL would immediately run into recovery problems afterward.
My Wrong Assumption
This was the real issue.
Every time recovery failed, I kept seeing files like:
- backup_label
- recovery.signal
- standby.signal
So I assumed they were leftover artifacts from failed restores.
My restore automation started aggressively cleaning them up.
Something like this:
rm -f recovery.signal standby.signal backup_label
I genuinely believed this was helping PostgreSQL start cleanly.
In reality:
I was deleting the exact recovery metadata PostgreSQL needed.
That misunderstanding caused almost every major issue afterward.
What PostgreSQL Was Actually Trying To Do
This was the turning point.
pgBackRest wasn't randomly writing junk files into the data directory.
Those files exist for a reason.
During restore:
- backup_label tells PostgreSQL where recovery should begin
- recovery.signal tells PostgreSQL to enter recovery mode
- WAL replay reconstructs a consistent database state
PostgreSQL was actually trying to perform a valid recovery process.
My automation kept interrupting or invalidating it.
Once I understood that, the entire problem started making sense.
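Once that clicked, the right move was to inspect these files instead of deleting them. A tiny illustrative sketch of that shift in mindset (the data directory path and helper name are placeholders, not my actual code):

```python
# Illustrative only: instead of deleting recovery metadata, surface it so you
# can see what PostgreSQL is about to do on startup.
# PG_DATA_DIR is an assumed path; adjust for your environment.
from pathlib import Path

PG_DATA_DIR = Path("/var/lib/postgresql/data")

def describe_recovery_state(data_dir: Path = PG_DATA_DIR) -> None:
    meanings = {
        "backup_label": "recovery will start from the backup's start checkpoint",
        "recovery.signal": "PostgreSQL will enter targeted recovery mode",
        "standby.signal": "PostgreSQL will start as a standby",
    }
    for name, meaning in meanings.items():
        state = "present" if (data_dir / name).exists() else "absent"
        print(f"{name}: {state} ({meaning})")

describe_recovery_state()
```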
The Recovery Loop Problem
Because my cleanup logic removed recovery metadata prematurely, PostgreSQL ended up in inconsistent states repeatedly.
Sometimes it would:
- enter recovery mode
- fail WAL replay
- lose checkpoint continuity
- refuse startup entirely
Other times it would partially start, but remain stuck in recovery mode.
That led me to add extra logic just to stabilize startup behavior.
For example:
SELECT pg_is_in_recovery();
and when required:
SELECT pg_promote();
The goal wasn't to "force PostgreSQL to work".
The goal was:
let PostgreSQL finish recovery properly, then promote only when necessary.
That distinction mattered a lot.
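Conceptually, that logic looked something like the sketch below: poll pg_is_in_recovery() and call pg_promote() only once recovery has clearly stalled. This is a rough illustration using psycopg2; the DSN, timeout, and helper name are placeholders, not my exact code:

```python
# Poll recovery status and promote only if recovery mode persists past a deadline.
import time
import psycopg2

def promote_if_stuck_in_recovery(dsn: str, timeout_s: int = 120, interval_s: int = 5) -> None:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute("SELECT pg_is_in_recovery();")
            if not cur.fetchone()[0]:
                return  # recovery finished on its own; nothing to do
        time.sleep(interval_s)
    # Recovery mode persisted past the timeout: promote explicitly.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT pg_promote();")
```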
The Timeline Mismatch Error
At one stage, I also hit this:
ERROR: [058]: target timeline 3 forked from backup timeline 2
This one was especially confusing at first.
The issue was not just corrupted startup state anymore.
Now PostgreSQL was rejecting WAL history itself.
This happened because earlier restore attempts had already created inconsistent recovery timelines.
I had essentially created multiple broken recovery histories while repeatedly testing and modifying the restore process.
That was another important lesson:
PostgreSQL backups are not just data files.
They are tightly connected to WAL history and recovery timelines.
At this point, I realized I was no longer debugging a simple restore failure. I was debugging recovery history itself.
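If you hit this, it helps to check which timeline the restored cluster is actually on before attempting yet another restore. A small read-only sketch (the DSN is a placeholder, not my setup):

```python
# Read the cluster's current timeline and checkpoint info; useful when a
# restore complains about forked timelines.
import psycopg2

with psycopg2.connect("dbname=postgres") as conn, conn.cursor() as cur:
    cur.execute("SELECT timeline_id, checkpoint_lsn, redo_lsn FROM pg_control_checkpoint();")
    timeline_id, checkpoint_lsn, redo_lsn = cur.fetchone()
    print(f"timeline={timeline_id} checkpoint={checkpoint_lsn} redo={redo_lsn}")
```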
The Real Problem In My Restore Flow
Initially, my restore logic tried to "fix" PostgreSQL after restore.
That approach was fundamentally flawed.
The older flow looked roughly like this:
| Old Approach | Problem |
|---|---|
| Delta restore | Mixed old/new recovery state |
| Delete backup_label | Broke recovery metadata |
| Delete recovery.signal | Interrupted recovery |
| Force archive changes | Caused WAL continuity issues |
| Hope PostgreSQL starts | No validation or recovery awareness |
I was treating recovery artifacts like corruption.
They weren't corruption.
They were part of PostgreSQL recovery itself.
The Change That Finally Fixed It
The biggest realization was this:
Stop fighting PostgreSQL recovery.
Instead of trying to manually "clean up" PostgreSQL after restore, I changed the restore flow completely.
The corrected restore flow became:
- Stop PostgreSQL cleanly
- Completely empty the data directory
- Run pgBackRest restore properly
- Let PostgreSQL recover normally
- Wait for readiness
- Promote only if recovery mode persists
- Validate using pgBackRest check
The critical change was this:
self._run_pgbackrest("restore", "--type=immediate")
And equally important:
self._empty_directory(self.pg_data_dir)
Instead of attempting partial or delta-style recovery cleanup, the restore process now starts from a completely clean data directory.
That eliminated a huge amount of inconsistent state.
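For clarity, _empty_directory is just a helper that clears out everything inside PGDATA without removing the directory itself, so ownership and permissions survive. Roughly something like this sketch (names are illustrative, not my exact implementation):

```python
# Remove all contents of a directory while keeping the directory itself.
import shutil
from pathlib import Path

def empty_directory(path: str) -> None:
    root = Path(path)
    for entry in root.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)   # regular subdirectories (base, pg_wal, ...)
        else:
            entry.unlink()         # files and symlinks
```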
Why --type=immediate Helped
This turned out to be extremely important.
--type=immediate tells pgBackRest to set a recovery target of "immediate":
recovery stops as soon as the restored backup reaches a consistent state, instead of replaying every archived WAL segment.
That meant:
- PostgreSQL could perform proper WAL-based recovery
- recovery metadata stayed intact
- WAL replay remained valid
- timeline handling became predictable
Most importantly:
PostgreSQL itself was finally allowed to control recovery correctly.
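For context, a wrapper like _run_pgbackrest typically just shells out to the pgBackRest CLI. A minimal sketch of the idea, assuming a stanza named main (yours will differ):

```python
# Thin wrapper around the pgBackRest CLI; "main" is an assumed stanza name.
import subprocess

PGBACKREST_STANZA = "main"

def run_pgbackrest(command: str, *options: str) -> None:
    argv = ["pgbackrest", f"--stanza={PGBACKREST_STANZA}", *options, command]
    subprocess.run(argv, check=True)

# Restore the latest backup and stop recovery once consistency is reached:
run_pgbackrest("restore", "--type=immediate")
```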
The Mistake That Increased the Blast Radius
One thing I learned the hard way:
Never test restore automation against a database actively used by an application.
Even though this was a testing workflow, the PostgreSQL instance was still tied to application startup behavior.
So whenever PostgreSQL failed:
- application startup failed too
- user-related functionality broke
- debugging became much harder under pressure
After repeated failures, I moved the restore testing flow entirely onto my local machine and isolated PostgreSQL from the rest of the application stack.
That made debugging significantly easier.
Another Subtle Issue: Backup Failures After Restore
I also ran into another confusing problem after some restore attempts.
In certain cases, subsequent backups started failing unexpectedly after a restore.
Part of the issue came from mixing:
- restore operations
- delta-style restore assumptions
- and archive/WAL state inconsistencies
At one stage, I was also toggling archive-related behavior incorrectly during recovery experiments, which further complicated WAL continuity.
This reinforced another important realization:
PostgreSQL backups are tightly coupled with WAL history and recovery timelines.
Even when the database appears to start correctly, inconsistent recovery state can break future backup behavior in subtle ways.
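A sanity check that helps here is to confirm the archiving settings and let pgBackRest verify the archive path before trusting any new backups. A small sketch (the DSN and stanza name are assumptions):

```python
# Post-restore check: confirm WAL archiving settings, then run `pgbackrest check`,
# which forces a WAL switch and verifies the segment reaches the repository.
import subprocess
import psycopg2

def verify_archiving(dsn: str = "dbname=postgres", stanza: str = "main") -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SHOW archive_mode;")
        print("archive_mode =", cur.fetchone()[0])
        cur.execute("SHOW archive_command;")
        print("archive_command =", cur.fetchone()[0])
    subprocess.run(["pgbackrest", f"--stanza={stanza}", "check"], check=True)
```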
What I Learned From This
This experience completely changed how I think about PostgreSQL recovery.
Some major lessons:
- backup_label and recovery.signal are not garbage files
- PostgreSQL recovery is heavily WAL-dependent
- Timelines matter more than most people realize
- Partial cleanup creates inconsistent recovery states
- A clean restore is often safer than trying to "repair" recovery manually
- pgBackRest already knows how to orchestrate PostgreSQL recovery properly
- Restore validation matters as much as backup creation
- Backup testing should happen in isolated environments
Most importantly:
PostgreSQL recovery is not something you should "fight".
Once I stopped trying to override recovery behavior manually and instead allowed PostgreSQL + pgBackRest to handle recovery the way they were designed to, the restore flow finally became stable.
The Final Restore Flow That Actually Worked
After multiple failed recovery attempts, timeline mismatches, and broken startup states, I stopped trying to manually "fix" PostgreSQL recovery and instead simplified the restore process completely.
The final stable flow looked roughly like this:
# simplified restore flow
stop_postgres()                          # stop the server cleanly
empty_data_directory()                   # start from a completely empty PGDATA
pgbackrest_restore("--type=immediate")   # let pgBackRest lay down data files and recovery metadata
start_postgres()                         # PostgreSQL replays WAL and recovers on its own
wait_for_connection()
if postgres_is_in_recovery():
    promote_postgres()                   # promote only if recovery mode persists
pgbackrest_check()                       # validate the stanza after restore
The important part here is not the code itself.
It's the recovery philosophy behind it.
The earlier versions of my restore logic tried to:
- partially clean recovery state
- remove recovery metadata
- force PostgreSQL out of recovery
- preserve old data directory state
That approach kept creating inconsistent recovery conditions.
The corrected flow instead does three important things:
- starts from a completely clean data directory
- lets pgBackRest manage recovery metadata properly
- allows PostgreSQL to perform WAL recovery the way it was designed to
The biggest change was no longer treating files like backup_label or recovery.signal as corruption artifacts.
They were part of the recovery process itself.
Final Thought
At the beginning, I thought PostgreSQL restores were failing because the database was corrupted.
In reality, the corruption was coming from my own recovery assumptions.
The system wasn't broken.
My mental model of PostgreSQL recovery was.