
Learnings from a 5-hour production downtime!

Garvit Gupta on March 02, 2024

As with all the incidents, it happened on a Friday evening! In this article, I’ll delve into the causes and prolonged recovery time of a recent 5-...
 
Thomas More

Thank you for sharing your insights. Proactive measures are crucial in maintaining the stability and performance of our database servers.

Considering your points, it's clear that maintaining adequate free storage on our database servers is essential to avoid storage bottlenecks, especially during critical incidents. In hindsight, increasing storage capacity proactively could have mitigated the risk of encountering such bottlenecks.
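To make this concrete, growing allocated storage ahead of time is a single API call. A minimal boto3 sketch follows; the instance identifier and target size are placeholders, not values from the article:

```python
import boto3

rds = boto3.client("rds")

# Grow allocated storage before free space gets low; RDS applies the change
# online, but a cooldown period applies before the next storage modification.
rds.modify_db_instance(
    DBInstanceIdentifier="prod-db",   # placeholder identifier
    AllocatedStorage=500,             # target size in GiB (placeholder)
    ApplyImmediately=True,
)
```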

Furthermore, your suggestion to over-provision resources during the restoration of backups is well noted. Over-provisioning resources can help ensure smoother operations and minimize the impact of potential bottlenecks during such critical processes.
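As a rough illustration, the restore itself can be pointed at a deliberately larger instance class and scaled back down once the backfill is done. A boto3 sketch with made-up identifiers:

```python
import boto3

rds = boto3.client("rds")

# Restore the snapshot into an over-provisioned instance class so the restore
# and the subsequent backfill are not starved for CPU; names are illustrative.
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="prod-db-restore",      # new instance (placeholder)
    DBSnapshotIdentifier="prod-db-snap-latest",  # snapshot to restore (placeholder)
    DBInstanceClass="db.r6g.4xlarge",            # larger than the usual class
)
# After the backfill completes, modify_db_instance can scale the class back down.
```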

Lastly, implementing rate limiting proactively to manage sudden traffic spikes is a sensible approach to prevent server overload and maintain optimal performance. By anticipating potential traffic spikes and implementing appropriate measures beforehand, we can better safeguard against disruptions and ensure the seamless functioning of our servers.
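The idea can be as small as a token bucket in front of the database calls. A plain-Python sketch, illustrative only and not tied to any particular gateway or framework:

```python
import time

class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should shed the request (e.g. respond with HTTP 429)

# limiter = TokenBucket(rate=100, capacity=200)
# if not limiter.allow(): reject the request instead of hitting the database
```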

Moving forward, we must prioritize proactive measures to address potential challenges before they escalate into critical incidents. By doing so, we can enhance the resilience and reliability of our database infrastructure.

If you need further assistance or coursework help in implementing these proactive measures, please feel free to reach out.

John P. Rouillard

Have you disabled the automatic RDS storage scaling? Also, it sounds like you hadn't tested restoration of backup before. Is that true?

Garvit Gupta • Edited

Hi John, no, we haven't disabled automated storage scale-up. We have added alarms for when remaining storage is <30%, so that we can increase the storage manually before the auto-scale-up threshold is reached.
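For anyone curious, an alarm like that can be set up along these lines (boto3 sketch with placeholder names, sizes, and SNS topic; FreeStorageSpace is reported in bytes, so the threshold is computed as 30% of allocated storage):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

allocated_gib = 500                                     # placeholder allocated storage
threshold_bytes = int(allocated_gib * 0.30 * 1024**3)   # alarm when <30% is free

cloudwatch.put_metric_alarm(
    AlarmName="rds-free-storage-below-30pct",           # placeholder alarm name
    Namespace="AWS/RDS",
    MetricName="FreeStorageSpace",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "prod-db"}],  # placeholder
    Statistic="Average",
    Period=300,
    EvaluationPeriods=1,
    Threshold=threshold_bytes,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:db-alerts"],      # placeholder
)
```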

Also, it sounds like you hadn't tested restoration of backup before

What makes you think so? We have restored backups earlier but unlike this time we never faced bottlenecks due to CPU or IOPS.

John P. Rouillard

Hello Garvit:

We have restored backups earlier
Sorry, bad assumption on my part.

but unlike this time we never faced bottlenecks due to CPU or IOPS.
Exactly this. How was/were your previous restore(s) different from this restore?
Why didn't you see bottlenecks before?

I assume the smaller system (CPU-bound) should have been able to restore the 3-hour-old backup quickly, based on prior experience. The service would then backfill all the data from the time of the backup to the time the restore was completed. When you started the restore this time, what was the expected completion time: minutes, hours?

Since you worked around the issues by provisioning more CPU and IOPS, the restore wasn't physically limited by the ongoing disk activity from the automatic storage migration. Hmm, maybe that's a bad assumption. Did provisioning more IOPS move you to a different storage subsystem (away from the one handling the migration)? Also, did you notice what happened to the CPU use when you increased the IOPS?
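For readers following along, the kind of change being discussed ("provision more CPU and IOPS") would look something like this boto3 sketch; the identifier, instance class, and IOPS value are placeholders, and I'm assuming a storage type (io1/io2/gp3) where IOPS can be set independently:

```python
import boto3

rds = boto3.client("rds")

# Placeholder values; not taken from the incident described in the article.
rds.modify_db_instance(
    DBInstanceIdentifier="prod-db",
    DBInstanceClass="db.r6g.4xlarge",  # more CPU
    Iops=12000,                        # higher provisioned IOPS (io1/io2/gp3)
    ApplyImmediately=True,
)
```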

I've had to do/assist with a few DRs in my career (catastrophic storage failure, virus, bugs, etc.), but it's all been on hardware in the company's DCs. I've never had an issue where DR fell well outside the predicted and tested times, so I'm curious whether this is a new issue for cloud services.

What are your thoughts?