In a cloud environment, professionals plan and release updates to speed up processes and adopt best-in-class practices. Most of the time, it goes well, but once in a blue moon, a bad push turns into a disaster, bringing the systems to a halt.
On Monday, 20 October, AWS experienced a worldwide outage that affected many apps, websites, and social media platforms. At first, no one expected it to turn into a global cloud outage, but users soon began reporting issues on Twitter.
Before long, it was confirmed that a single point of failure in the US-East region had grown into a far wider disruption.
Give this blog post a read to learn in depth how to recover from a cloud outage and keep your app available.
How a Routine AWS Update Triggered a Global Cloud Outage
Initially, AWS identified a DNS issue affecting the DynamoDB endpoint as the root cause and resolved it within 3 hours. However, EC2, the service responsible for creating virtual servers, subsequently failed.
EC2 is integrated with DynamoDB, so once the database became unreachable, EC2 stopped working too. Later, the load balancer service was disrupted, causing network issues.
When these three core services stopped responding, AWS services suddenly experienced a chain of failures and crashes.
Lambda, CloudWatch, SQS, and 75+ other services were affected by this outage, which shows how massive it was.
Because servers lost communication, everything paused for 15+ hours. Now that so much has moved to the cloud and is expected to be accessible at all times, a pause that long was hard to swallow.
Because the services were tightly integrated, requests piled up in queues and failed, so it took a whole day to get back on track.
Many news sites are now warning that a cloud outage can happen again, so it's better to be proactive and have a plan to limit the impact.
AWS confirmed that the problem started with DynamoDB, but the twist makes the story even more frustrating. Reportedly, AWS was completing its routine weekly testing and pushed an update that went wrong. The bad update crashed the database and took down the web apps and sites that depended on it, leaving them in a non-working state.
Let's learn how to detect a cloud outage, how to fix it, and how to keep the application available.
How to Detect an AWS Service Outage

If you hit an unexpected failure when accessing an application or website, the app itself may be crashing or temporarily unavailable, for example because of a failed or pending update. Alternatively, the problem could lie with the cloud provider, which is preventing the app or website from loading.
Here are the signs that may point to a global cloud outage:
Increase in Service Response Time
If an AWS app or service is responding more slowly than usual and the problem persists for an hour, the cause may be an outage, failed requests, or delayed responses.
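A sustained slowdown is easier to spot when you compare recent response times against a known baseline. The snippet below is a minimal sketch of that idea, not a production monitor. The function name, the 3x threshold, and the five-sample window are assumptions chosen for illustration.

```python
from statistics import median
from typing import Sequence


def is_degraded(
    latencies_ms: Sequence[float],
    baseline_ms: float,
    factor: float = 3.0,
    window: int = 5,
) -> bool:
    """Return True when recent responses are consistently slow.

    Compares the median of the last `window` samples with
    `baseline_ms * factor`. Using the median means a single slow
    request does not raise a false alarm.
    """
    if len(latencies_ms) < window:
        return False  # not enough data to judge yet
    recent = latencies_ms[-window:]
    return median(recent) > baseline_ms * factor
```

Feed it response times collected by whatever health probe you already run. One slow request will not trip it, but a run of slow requests will.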
AWS Dashboard Reflects the Service's Health
Check the official source first to identify the scope and root of the issues affecting your applications.
Navigate to the AWS Service Health Dashboard, which provides a comprehensive view of the latest status updates for AWS services.
Social Media Updates
Whenever such incidents occur, the provider posts updates on its official social media pages so everyone can understand what has happened.
Use third-party status-tracking tools to check whether other users have reported the same problems. Stay calm, you're not the only one.
Cascading Effect Creates Disturbance
In a cascading failure, a fault in one service disrupts the services connected to it, sometimes across regions. Your app itself may be running normally while the DNS server or a related component it depends on behaves unpredictably.
Let's jump into the cloud outage recovery tactics that can save you from the next disaster.
5 Ways to Recover from an Outage or Cloud Disaster
Outages are one of the most annoying things, just like headaches. After these mishaps, restoring the system's availability is a hectic task. Cloud service providers use the following tactics to recover from outages.
Amazon Route 53, Elastic Load Balancing, or other failover mechanisms can automatically redirect traffic to healthy resources to maintain availability.
All cloud service providers offer disaster recovery strategies and plans. Follow their documentation and guides to deal with such circumstances.
Manual failover mechanisms are good, but don't forget the automated failover mechanisms that providers offer. They improve uptime and help you meet your Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO).
Lastly, if all the core AWS services are affected by the outage, nothing we have mentioned above will fix the situation. Step back and wait for the official AWS announcement confirming that everything is under control and restored.
Don’t forget to perform sanity checks after the cloud outage is over.
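A post-outage sanity check can be as simple as running a list of named checks and reporting which ones still fail. The sketch below assumes each check is a function that returns True when healthy. The check names in the usage note are hypothetical placeholders for your own probes.

```python
from typing import Callable, Dict, List


def run_sanity_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of the ones that failed.

    A check that raises an exception is treated as failed, so one
    broken dependency cannot hide the results of the others.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed
```

For example, calling run_sanity_checks({"database_reachable": check_db, "queue_drained": check_queue}) returns an empty list once both hypothetical probes pass.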
How to Ensure Application Availability During a Cloud Outage?

Outages are unpredictable, so to reduce downtime and ensure reliability and resiliency, you need a clear approach. Let's look at how to make your system outage-resistant so you don’t have to go through something like the recent AWS worldwide outage again.
Single-Region Reliance Isn’t Enough
Reduce reliance on a single region to improve resiliency. Design the architecture so applications can run across multiple AWS regions.
Mapping Hybrid or Multicloud Service Providers
Whether you're a startup, a mid-sized business, or a large enterprise, consider on-premises infrastructure and other established cloud service providers as backups to keep your application running even during outages.
If services are well mapped and connected across AWS and other providers, you reduce the risk of a single point of failure and of vendor lock-in.
Perform Automated Failovers for Quick Recovery
You can also tailor automated failovers beyond AWS, planning for both worst-case and best-case scenarios. Define them, align them with your specific needs, and periodically refine and test your tactics.
Defining RTO, RPO, and disaster recovery (DR) policies, and practicing them regularly, is what helps you survive cloud disasters like this one.
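Automated failover comes down to an ordered list of endpoints and a health check that decides which one receives traffic. The snippet below is a simplified client-side sketch of that logic. It is not how Route 53 or any other managed service implements failover, and the endpoint URLs in the usage note are made up.

```python
from typing import Callable, Optional, Sequence


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy endpoint, in priority order.

    `endpoints` lists the primary first and the fallbacks after it.
    Returns None when every endpoint fails its health check, which
    is the signal to serve cached data or show a maintenance page.
    """
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None
```

With a hypothetical list such as ["https://us-east.example.com", "https://eu-west.example.com"], traffic moves to the second entry as soon as the first fails its health check, and moves back once it recovers.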
Offline Mode and Local Caching
Whether it belongs to a primary or a secondary service, data is a critical asset that can't be neglected. To keep the application available in all scenarios, consider adopting local caching or offline functionality so that some features continue to work even when the backend is unreachable.
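One common way to apply local caching is to keep the last successful response and serve it when the backend cannot be reached. The class below is a minimal in-memory sketch of that pattern. The class name, the one-hour staleness limit, and the fetch function are assumptions for illustration.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CachedFetcher:
    """Wrap a fetch function and fall back to the last good value."""

    def __init__(
        self,
        fetch: Callable[[str], Any],
        max_stale_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        try:
            value = self._fetch(key)
        except Exception:
            # Backend unreachable: serve the cached copy if it is fresh enough.
            cached = self._cache.get(key)
            if cached is not None and self._clock() - cached[0] <= self._max_stale:
                return cached[1]
            raise  # nothing usable in the cache, surface the error
        self._cache[key] = (self._clock(), value)
        return value
```

A real application would likely persist the cache to disk or a local database so it survives restarts. The in-memory dictionary keeps the sketch short.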
Reattempt Unfulfilled Requests
During the recovery process, not all services are restored at once, and some requests may still fail. Retry the failed requests, or use fallback mechanisms, to keep the application running.
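Retries work best when they are spaced out, so that clients do not hammer a service that is still recovering. The sketch below shows exponential backoff with random jitter, a widely used approach. The function name and default values are assumptions, and in practice you would catch only the exception types that are safe to retry.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `attempts` is exhausted.

    The wait before each retry is a random value between 0 and
    base_delay * 2**attempt, capped at max_delay, so many clients
    retrying at once do not all hit the service at the same moment.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, let the caller handle it
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
    raise ValueError("attempts must be at least 1")
```

Only retry operations that are safe to repeat. A request that charges a customer or writes a record should carry an idempotency key before it goes through a loop like this.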
Revisit Service Agreements and Contracts
Before relying on any service from AWS or another cloud service provider, go through its offerings and support clauses.
How will they provide support in the event of an outage, recovery, process interruption, or escalation?
How will they ensure all components are available?
It's a wrap now!
One thing to say here!
Disasters are common, but what matters is how we deal with them and how we make our systems resilient for the future. Be innovative, refine your recovery plan so you can handle outages proactively, and stay ahead of your competitors.
Final Words:
Cloud outages irritate us no matter how hard we try to make the architecture perfect. Whether an outage is short or long, it causes disruption and brings the system to a standstill. To improve uptime, having the right strategy and a proactive, automated recovery plan is worth it.
At Eternalight Infotech, we build automated cloud-based solutions and consult with our clients on best-in-class recovery plans to handle cloud outages, keeping applications available 24 hours a day.