DEV Community

Cover image for I had to restore an entire database just to recover one deleted row
huangfei_dev
huangfei_dev

Posted on

I had to restore an entire database just to recover one deleted row

Let me vent for a second, because last week broke me a little.

I've been doing SRE / database ops for about 5 years, and I keep relearning the same painful lesson the hard way: having backups and being able to recover are two completely different things.

Here's what happened.

A developer ran a SQL statement against production and accidentally deleted a single row of live data. One row. Sounds trivial to fix, right? Just put it back. But there was a hard constraint: I couldn't touch or overwrite any of the other live data while doing it. No "just restore the table and hope nothing else changed."

So here's what "having backups" actually looked like in practice:

  1. The only usable backup was the previous day's XtraBackup physical backup.
  2. I had to spin up a brand-new database instance on a matching environment.
  3. Restore the entire backup into that throwaway instance — the whole dataset, just to get at one row.
  4. Dig through the restored data to find the exact row that got deleted.
  5. Carefully write that one row back into production without disturbing anything else.

It worked. But it took hours, every step was manual, and it was way more fiddly than it had any right to be. All for one row.

And here's the part that really got me: at no point in that whole process was I actually sure the backup was even good until I'd already restored it. If that XtraBackup had been silently corrupted or incomplete, I wouldn't have found out until I was already in the middle of an incident — which is the worst possible time to find out.

That's the thing nobody talks about. Almost every team I've worked with has backup jobs running. Almost none of them ever test whether those backups can actually be restored. The bar is usually "the cron job exited 0, so it's probably fine." But "the backup file exists" and "the backup restores cleanly into a working database" are not the same thing — and the gap between them is exactly where disasters live.

A few things I'm taking away from this:

  • An untested backup is just a hope, not a backup. If you've never restored it, you don't actually know you have it.
  • Recovery granularity matters. Full physical backups are great for "the whole server is gone" disasters, but painful when you just need one row back without side effects. And honestly, most real incidents are small surgical mistakes — not the whole thing burning down.
  • Recovery time is a feature. "We can recover eventually" and "we can recover in 10 minutes" are completely different things. In the middle of an incident, that difference is everything.

So I'm genuinely curious how other people handle this, because I suspect a lot of us are quietly doing the same manual dance:

  1. Do you ever actually test that your backups restore? How often — and is it manual or automated?
  2. When you need to recover a small amount of data (one row, one table) without affecting the rest, what's your process?
  3. Has anyone here had a "the backup looked fine but didn't restore" moment? What happened?
  4. If you don't test restores regularly — is it time, tooling, or just trust?

Really want to hear how others deal with this. My honest guess is that for most teams the real answer is "we don't test, we just pray" — tell me I'm wrong.

Top comments (1)

Collapse
 
jgsabedra profile image
João Gabriel Sabedra Vieira

I'm starting out in programming, so I've never dealt with backups on this scale, but this completely changed the way I think about it. I always thought that 'I have a backup' was enough. The idea that an untested backup is basically just hope really stuck with me. Saving this for my future self. Thanks for sharing!