Why database backups don’t fix integration failures (and what actually does)
I used to treat integration failures as data problems.
Restore the database.
Re-run the job.
Patch the gap.
It works — until it doesn’t.
Because most of the time, the data isn’t missing.
The event just never made it where it needed to go.
The gap no one owns
Most systems are built around state.
Databases, backups, snapshots — all focused on what the data looks like.
But integrations are about how data moves.
A typical flow:
ERP → Service A → Service B → API
Now imagine:
- ERP emits an `order.created` event
- Service A forwards it
- Service B times out
- No retry is triggered
- The upstream system assumes success
Nothing crashes.
No alerts fire.
Until someone notices the order was never fulfilled.
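The failure mode above can be sketched in a few lines. This is a minimal illustration, not real integration code: `forward_to_service_b` and `handle_order_created` are hypothetical names, and the timeout is simulated.

```python
delivered = []  # stands in for Service B's side effects

def forward_to_service_b(event):
    """Hypothetical delivery call: simulates Service B timing out."""
    raise TimeoutError("Service B did not respond")

def handle_order_created(event):
    # Service A: fire-and-forget forwarding, no retry, no dead letter.
    try:
        forward_to_service_b(event)
        delivered.append(event)
    except TimeoutError:
        # The failure is swallowed: nothing is persisted, no alert fires.
        pass
    # The function returns normally, so the upstream ERP assumes success.

handle_order_created({"type": "order.created", "order_id": "A-1042"})
# `delivered` is still empty, yet every service reports a healthy run.
```

Nothing in this path ever records that the event existed, which is exactly why a database restore can't bring it back.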
Why backups fall short
When this happens, teams usually:
- restore a backup
- re-run a job
- patch data manually
But backups only restore state.
They don’t tell you:
- what failed
- what never arrived
And they can’t reconstruct a missing event.
The real problem: delivery
At some point it becomes clear:
This isn’t a data problem.
It’s a delivery problem.
Most systems rely on:
- webhooks
- ad hoc retry logic
- logs spread across services
When something fails mid-flight, debugging becomes guesswork.
And if something is lost entirely, recovery becomes manual.
A different approach: replay
Instead of restoring state, replay flow.
If events are:
- captured
- stored
- replayable
You can recover without guessing.
Not by re-running jobs.
Not by patching data.
But by replaying what actually happened.
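One way to make that concrete is an append-only event log that records every event before delivery is attempted, then re-delivers whatever never succeeded. A minimal in-memory sketch (class and method names are my own; a real system would persist this durably):

```python
class EventLog:
    """Append-only capture of every event crossing the integration."""

    def __init__(self):
        self._entries = []

    def capture(self, event):
        """Store the event BEFORE attempting delivery; return its index."""
        self._entries.append({"event": event, "status": "pending"})
        return len(self._entries) - 1

    def mark_delivered(self, idx):
        self._entries[idx]["status"] = "delivered"

    def replay_failed(self, deliver):
        """Re-deliver only the events that never reached their destination."""
        replayed = 0
        for entry in self._entries:
            if entry["status"] != "delivered":
                deliver(entry["event"])
                entry["status"] = "delivered"
                replayed += 1
        return replayed

log = EventLog()
log.capture({"type": "order.created", "order_id": "A-1042"})
# The first delivery attempt times out, so the entry stays "pending"
# instead of vanishing. Later, replay targets exactly that entry:
received = []
count = log.replay_failed(received.append)
```

The key design choice is capturing before delivery: the log holds the event even when the downstream call fails, so recovery is a query over statuses rather than archaeology over service logs.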
What this enables
- Re-deliver only failed events
- Trace what happened to a specific event
- Apply updated retry or routing policies
Recovery becomes predictable — assuming idempotent consumers.
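Replay means a consumer may see the same event twice, so it has to be a safe no-op the second time. A minimal idempotency sketch, deduplicating on a per-event ID (the class and field names here are illustrative):

```python
class IdempotentConsumer:
    """Consumer that tolerates duplicate deliveries caused by replay."""

    def __init__(self):
        self._seen = set()   # IDs of events already processed
        self.fulfilled = []  # side effects (e.g. orders fulfilled)

    def handle(self, event):
        event_id = event["id"]
        if event_id in self._seen:
            return False  # duplicate from a replay: safe no-op
        self._seen.add(event_id)
        self.fulfilled.append(event["order_id"])
        return True

consumer = IdempotentConsumer()
event = {"id": "evt-1", "order_id": "A-1042"}
first = consumer.handle(event)
second = consumer.handle(event)  # replayed duplicate, ignored
```

In production the seen-set would live in durable storage (and eventually be pruned), which is part of the storage overhead listed below.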
Trade-offs
- Storage overhead
- Idempotency requirements
- Architectural shift
Final thought
If you can’t replay what happened between your systems,
you don’t really have a recovery strategy.
You have a snapshot.