Why database backups don’t fix integration failures (and what actually does)

Sverre Senneset

I used to treat integration failures as data problems.

Restore the database.

Re-run the job.

Patch the gap.

It works — until it doesn’t.

Because most of the time, the data isn’t missing.

The event just never made it where it needed to go.


The gap no one owns

Most systems are built around state.

Databases, backups, snapshots — all focused on what the data looks like.

But integrations are about how data moves.

A typical flow:
ERP → Service A → Service B → API

Now imagine:

  • ERP emits an order.created event
  • Service A forwards it
  • Service B times out
  • No retry is triggered
  • The upstream system assumes success

Nothing crashes.

No alerts fire.

Until someone notices the order was never fulfilled.
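The failure mode above can be sketched in a few lines. This is a minimal illustration, not a real integration: the service names and the `order.created` event are the hypothetical ones from the flow above, and the bug is that a timeout is swallowed and reported as success.

```python
def forward_to_service_b(event):
    # Simulate Service B timing out mid-flight.
    raise TimeoutError("Service B did not respond")

def service_a_forward(event):
    try:
        forward_to_service_b(event)
        return "delivered"
    except TimeoutError:
        # No retry, no dead-letter queue, no alert:
        # the event is silently dropped.
        return "delivered"  # upstream is still told everything is fine

status = service_a_forward({"type": "order.created", "order_id": 42})
print(status)  # "delivered" — but Service B never received the event
```

Nothing in this path ever raises to the caller, which is exactly why no alert fires.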


Why backups fall short

When this happens, teams usually:

  • restore a backup
  • re-run a job
  • patch data manually

But backups only restore state.

They don’t tell you:

  • what failed
  • what never arrived

And they can’t reconstruct a missing event.


The real problem: delivery

At some point it becomes clear:

This isn’t a data problem.

It’s a delivery problem.

Most systems rely on:

  • webhooks
  • ad hoc retry logic
  • logs spread across services

When something fails mid-flight, debugging becomes guesswork.

And if something is lost entirely, recovery becomes manual.
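A typical ad hoc retry loop looks something like the sketch below (names and limits are illustrative). It handles transient blips, but once the attempts are exhausted the event is simply gone, because nothing stored it anywhere replayable.

```python
import time

def deliver_with_retries(send, event, max_attempts=3, base_delay=0.01):
    for attempt in range(1, max_attempts + 1):
        try:
            send(event)
            return True
        except Exception:
            if attempt == max_attempts:
                return False  # retries exhausted; the event is dropped
            # exponential backoff between attempts
            time.sleep(base_delay * 2 ** (attempt - 1))

def always_failing(event):
    raise ConnectionError("Service B unavailable")

ok = deliver_with_retries(always_failing, {"type": "order.created"})
print(ok)  # False — delivery failed, and no record of the event remains
```

Retries reduce how often you lose events; they don’t change what happens when you do.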


A different approach: replay

Instead of restoring state, replay flow.

If events are:

  • captured
  • stored
  • replayable

You can recover without guessing.

Not by re-running jobs.

Not by patching data.

But by replaying what actually happened.
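As a rough sketch of the idea, assuming events are captured in an append-only log before delivery is attempted: recovery becomes "re-drive the undelivered records from the log" rather than "restore a backup and guess". `EventLog` and its methods are illustrative, not any specific product’s API.

```python
class EventLog:
    def __init__(self):
        self._events = []  # append-only store of {event, status}

    def capture(self, event):
        self._events.append({"event": event, "status": "pending"})

    def mark_delivered(self, index):
        self._events[index]["status"] = "delivered"

    def replay_failed(self, deliver):
        # Re-deliver only the events that never made it through.
        for i, record in enumerate(self._events):
            if record["status"] != "delivered":
                deliver(record["event"])
                self.mark_delivered(i)

log = EventLog()
log.capture({"type": "order.created", "order_id": 1})
log.capture({"type": "order.created", "order_id": 2})
log.mark_delivered(0)  # first event got through; second was lost

recovered = []
log.replay_failed(recovered.append)
print(recovered)  # [{'type': 'order.created', 'order_id': 2}]
```

The key property is that the log is written *before* delivery, so a mid-flight failure can never erase the only copy of the event.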


What this enables

  • Re-deliver only failed events
  • Trace what happened to a specific event
  • Apply updated retry or routing policies

Recovery becomes predictable — assuming idempotent consumers.
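Replay implies duplicates, so consumers have to be idempotent. One common pattern, sketched here with illustrative names, is to deduplicate on a stable event id before applying any side effect.

```python
processed_ids = set()
fulfilled_orders = []

def handle_order_created(event):
    if event["id"] in processed_ids:
        return  # duplicate delivery from a replay; safe to ignore
    processed_ids.add(event["id"])
    fulfilled_orders.append(event["order_id"])  # the actual side effect

event = {"id": "evt-1", "order_id": 42}
handle_order_created(event)
handle_order_created(event)  # replayed duplicate
print(fulfilled_orders)  # [42] — the order is fulfilled exactly once
```

In a real system the seen-ids set would live in durable storage (or be enforced by a unique constraint), but the contract is the same: processing the same event twice must have the same effect as processing it once.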


Trade-offs

  • Storage overhead — replayable events have to be retained
  • Idempotency requirements — consumers must tolerate duplicate deliveries
  • Architectural shift — the event log, not the database, becomes the recovery source

Final thought

If you can’t replay what happened between your systems,

you don’t really have a recovery strategy.

You have a snapshot.

Top comments (1)

Sverre Senneset

Curious how others are handling this in practice.

Do you rely on retries, queues, or something else when events fail?

At one place we used to call it “heroes of the night” — people manually fixing things when integrations broke.

Probably a sign the system isn’t doing its job.