Discussion on: Learnings from a 5-hour production downtime!

Garvit Gupta

Hi John, no, we haven't disabled automated storage scaleup. We have added alarms for when the remaining storage is <30% so that we can increase the storage manually before the auto-scaleup threshold is hit.
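For anyone curious, that kind of alarm looks roughly like this. This is only a minimal sketch, assuming an AWS RDS instance monitored through CloudWatch; the instance identifier, SNS topic, and allocated size are placeholders, not our real values:

```python
import boto3

# Hypothetical values; substitute your own instance, topic, and size.
DB_INSTANCE_ID = "prod-db"
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:storage-alerts"
ALLOCATED_GIB = 500

cloudwatch = boto3.client("cloudwatch")

# Alarm when free storage drops below 30% of the allocated size.
# RDS reports FreeStorageSpace in bytes, so convert the threshold.
threshold_bytes = int(ALLOCATED_GIB * 0.30 * 1024**3)

cloudwatch.put_metric_alarm(
    AlarmName=f"{DB_INSTANCE_ID}-free-storage-below-30pct",
    Namespace="AWS/RDS",
    MetricName="FreeStorageSpace",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": DB_INSTANCE_ID}],
    Statistic="Average",
    Period=300,                # evaluate 5-minute averages
    EvaluationPeriods=1,
    Threshold=threshold_bytes,
    ComparisonOperator="LessThanThreshold",
    AlarmDescription="Scale storage manually before auto-scaling kicks in",
    AlarmActions=[SNS_TOPIC_ARN],
)
```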

> Also, it sounds like you hadn't tested restoration of backup before

What makes you think so? We have restored backups earlier, but unlike this time we never faced bottlenecks due to CPU or IOPS.

John P. Rouillard

Hello Garvit:

> We have restored backups earlier

Sorry, bad assumption on my part.

> but unlike this time we never faced bottlenecks due to CPU or IOPS.

Exactly this. How was/were your previous restore(s) different from this restore? Why didn't you see bottlenecks before?

I assume the smaller system (CPU bound) should have been able to restore the 3-hour-old backup quickly, based on prior experience. The service would then backfill all the data from the time of the backup to the time the restore completed. When you started the restore this time, what was the expected completion time: minutes, hours?
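For what it's worth, a crude back-of-the-envelope way to frame such an estimate; every number below is made up for illustration rather than taken from your incident:

```python
# Rough restore-time estimate from throughput limits.
# All values are hypothetical placeholders, not figures from the incident.
backup_size_gib = 500       # size of the backup being restored
provisioned_iops = 3000     # the instance's IOPS ceiling
avg_io_size_kib = 256       # typical I/O size during a bulk restore

throughput_mib_s = provisioned_iops * avg_io_size_kib / 1024
restore_seconds = backup_size_gib * 1024 / throughput_mib_s
print(f"~{throughput_mib_s:.0f} MiB/s -> ~{restore_seconds / 60:.0f} minutes")
```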

Since you worked around the issues by provisioning more CPU and IOPS, the restore wasn't physically limited by the ongoing disk activity from the automatic storage migration. Hmm, maybe that's a bad assumption. Did provisioning more IOPS move you to a different storage subsystem (away from the one handling the migration)? Also, did you notice what happened to the CPU use when you increased the IOPS?
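If it helps with that last question, the CPU and IOPS history for the restore window can still be pulled after the fact. A rough sketch, assuming AWS RDS metrics in CloudWatch, with a placeholder instance identifier and time window:

```python
import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client("cloudwatch")

# Placeholder instance identifier and incident window.
DB_INSTANCE_ID = "prod-db"
start = datetime(2022, 1, 1, 10, 0, tzinfo=timezone.utc)
end = datetime(2022, 1, 1, 15, 0, tzinfo=timezone.utc)

# Pull CPU and write IOPS side by side for the restore window,
# to see whether CPU climbed (or flatlined) when IOPS was raised.
for metric in ("CPUUtilization", "WriteIOPS"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": DB_INSTANCE_ID}],
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Average"],
    )
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(metric, point["Timestamp"], round(point["Average"], 2))
```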

I've had to do or assist with a few DRs in my career (catastrophic storage failure, virus, bugs, etc.), but it's all been on hardware in the company's own DCs. I've never had an issue where DR fell well outside the predicted and tested times. So I'm curious whether this is a new issue for cloud services.

What are your thoughts?