Why database backups don’t fix integration failures (and what actually does)
I used to treat integration failures as data problems.
Restore the database.
Re-run the job.
Patch the gap.
It works — until it doesn’t.
Because most of the time, the data isn’t missing.
The event just never made it where it needed to go.
The gap no one owns
Most systems are built around state.
Databases, backups, snapshots — all focused on what the data looks like.
But integrations are about how data moves.
A typical flow:
ERP → Service A → Service B → API
Now imagine:
- ERP emits an `order.created` event
- Service A forwards it
- Service B times out
- No retry is triggered
- The upstream system assumes success
Nothing crashes.
No alerts fire.
Until someone notices the order was never fulfilled.
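The failure mode above can be sketched in a few lines. This is a minimal illustration, not real integration code: `forward_to_service_b` and `handle_order_created` are hypothetical names, and the timeout is simulated.

```python
delivered = []  # stands in for Service B's side effects

def forward_to_service_b(event):
    """Hypothetical delivery call: simulates Service B timing out."""
    raise TimeoutError("Service B did not respond")

def handle_order_created(event):
    # Service A: fire-and-forget forwarding, no retry, no dead letter.
    try:
        forward_to_service_b(event)
        delivered.append(event)
    except TimeoutError:
        # The failure is swallowed: nothing is persisted, no alert fires.
        pass
    # The function returns normally, so the upstream ERP assumes success.

handle_order_created({"type": "order.created", "order_id": "A-1042"})
# `delivered` is still empty, yet every service reports a healthy run.
```

Nothing in this path ever records that the event existed, which is exactly why a database restore can't bring it back.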
Why backups fall short
When this happens, teams usually:
- restore a backup
- re-run a job
- patch data manually
But backups only restore state.
They don’t tell you:
- what failed
- what never arrived
And they can’t reconstruct a missing event.
The real problem: delivery
At some point it becomes clear:
This isn’t a data problem.
It’s a delivery problem.
Most systems rely on:
- webhooks
- ad hoc retry logic
- logs spread across services
When something fails mid-flight, debugging becomes guesswork.
And if something is lost entirely, recovery becomes manual.
A different approach: replay
Instead of restoring state, replay flow.
If events are:
- captured
- stored
- replayable
You can recover without guessing.
Not by re-running jobs.
Not by patching data.
But by replaying what actually happened.
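One way to make that concrete is an append-only event log that records every event before delivery is attempted, then re-delivers whatever never succeeded. A minimal in-memory sketch (class and method names are my own; a real system would persist this durably):

```python
class EventLog:
    """Append-only capture of every event crossing the integration."""

    def __init__(self):
        self._entries = []

    def capture(self, event):
        """Store the event BEFORE attempting delivery; return its index."""
        self._entries.append({"event": event, "status": "pending"})
        return len(self._entries) - 1

    def mark_delivered(self, idx):
        self._entries[idx]["status"] = "delivered"

    def replay_failed(self, deliver):
        """Re-deliver only the events that never reached their destination."""
        replayed = 0
        for entry in self._entries:
            if entry["status"] != "delivered":
                deliver(entry["event"])
                entry["status"] = "delivered"
                replayed += 1
        return replayed

log = EventLog()
log.capture({"type": "order.created", "order_id": "A-1042"})
# The first delivery attempt times out, so the entry stays "pending"
# instead of vanishing. Later, replay targets exactly that entry:
received = []
count = log.replay_failed(received.append)
```

The key design choice is capturing before delivery: the log holds the event even when the downstream call fails, so recovery is a query over statuses rather than archaeology over service logs.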
What this enables
- Re-deliver only failed events
- Trace what happened to a specific event
- Apply updated retry or routing policies
Recovery becomes predictable — assuming idempotent consumers.
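Replay means a consumer may see the same event twice, so it has to be a safe no-op the second time. A minimal idempotency sketch, deduplicating on a per-event ID (the class and field names here are illustrative):

```python
class IdempotentConsumer:
    """Consumer that tolerates duplicate deliveries caused by replay."""

    def __init__(self):
        self._seen = set()   # IDs of events already processed
        self.fulfilled = []  # side effects (e.g. orders fulfilled)

    def handle(self, event):
        event_id = event["id"]
        if event_id in self._seen:
            return False  # duplicate from a replay: safe no-op
        self._seen.add(event_id)
        self.fulfilled.append(event["order_id"])
        return True

consumer = IdempotentConsumer()
event = {"id": "evt-1", "order_id": "A-1042"}
first = consumer.handle(event)
second = consumer.handle(event)  # replayed duplicate, ignored
```

In production the seen-set would live in durable storage (and eventually be pruned), which is part of the storage overhead listed below.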
Trade-offs
- Storage overhead
- Idempotency requirements
- Architectural shift
Final thought
If you can’t replay what happened between your systems,
you don’t really have a recovery strategy.
You have a snapshot.