DEV Community

François Gingras

Do backup drills

Story time! Not long ago (read: Christmas), we had an unfortunate event where all our main production data was wiped out because Murphy was being naughty. Phones were rung, texts were sent, and finally we converted our backup into the master database. This was not ideal, as we were now running without an actual backup and at risk of losing everything if one disk failed. The reason we did it this way was very simple: no one knew how the backup system worked.

Now we are back at the office, and the first order of business is obviously to fix the situation. The reason no one knew how it worked is simple: the guy who built it left the company, and we never had to use it in a real scenario. To make matters worse, the solution was built before our Kubernetes stack, so we were not using fancy replication toys but, instead, a bunch of custom scripts.

As a team, we had to spend around 30 developer hours to understand the problem, propose a fix, and apply it. After that exercise, half of the team understands how the backup system works, and next time we would probably spend much less time running without a proper backup.

This event made me reflect on how we NEVER drill backup recovery the way we do for fire emergencies. I strongly suggest that any team with a real production server run backup drills where:

  1. A real case is used. For example, delete the PostgreSQL data folder while the system is running and serving users.
  2. The drill targets new team members as a learning experience.
  3. The drill highlights potential confusion or better methods as your stack evolves over time.

Like a fire drill.
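If deleting the live data folder feels like too big a first step, a gentler rehearsal is to restore the latest backup into a scratch database and time it. Here is a minimal sketch of that idea, not our actual system (ours is custom scripts). It assumes the backup is a `pg_dump` custom-format file and that the standard PostgreSQL client tools are on the path. The function name, the `drill_restore` database, and the `users` table are all hypothetical placeholders.

```python
import subprocess
import time
from dataclasses import dataclass
from typing import Callable, Sequence

# A runner takes a command line and returns its stdout.
# Injecting it makes the drill easy to dry-run without a database.
Runner = Callable[[Sequence[str]], str]


def shell_runner(cmd: Sequence[str]) -> str:
    """Run a command, raise if it fails, and return its stdout."""
    done = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return done.stdout


@dataclass
class DrillReport:
    restored_rows: int   # rows found in the sanity-check table
    seconds: float       # wall-clock duration of the whole drill
    passed: bool         # True if at least min_rows rows came back


def run_restore_drill(
    dump_path: str,
    scratch_db: str = "drill_restore",  # hypothetical scratch database name
    check_table: str = "users",         # hypothetical table to sanity-check
    min_rows: int = 1,
    run: Runner = shell_runner,
) -> DrillReport:
    """Restore a dump into a throwaway database and check it has data."""
    started = time.monotonic()

    # Start from a clean slate so the drill is repeatable.
    run(["dropdb", "--if-exists", scratch_db])
    run(["createdb", scratch_db])

    # Restore the custom-format dump into the scratch database.
    run(["pg_restore", "--no-owner", "--dbname", scratch_db, dump_path])

    # Count rows in one table we know should not be empty.
    out = run([
        "psql", "--dbname", scratch_db, "--tuples-only", "--no-align",
        "--command", f"SELECT count(*) FROM {check_table}",
    ])
    rows = int(out.strip())

    elapsed = time.monotonic() - started
    return DrillReport(restored_rows=rows, seconds=elapsed, passed=rows >= min_rows)
```

Calling `run_restore_drill("/path/to/latest.dump")` (the path is a placeholder) gives you a number to track from drill to drill: how long a restore really takes. It does not replace the real thing, though. Someone new on the team should still be the one running it.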
