Looking at public documentation from the last quarter of 2025, it is fair to say it was not a good time for cloud providers in terms of meeting their availability SLAs, given the global impact those cloud outages had on customers all over the world:
- October 20, 2025 - AWS US-East-1 had an outage caused by a DynamoDB DNS race condition that disrupted multiple dependent services
- October 29, 2025 - Azure had an outage affecting Azure Front Door and the Azure Portal
- November 18, 2025 - Cloudflare had an outage stemming from a faulty bot-management configuration
- November 25, 2025 - Google Meet had an outage
From a customer perspective, these events are a worrying sign of how dependent customers are on public cloud services, and of how many services are impacted when one of the hyperscale providers suffers an outage (whether from a large-volume DDoS attack or a misconfiguration caused by human error).
The question is – what can we, as customers, do to lower the risk of similar events in the future?
I’ve called this blog post an “unpopular opinion,” and I will explain why.
Single Cloud Provider – Single Point of Failure, but…
As architects, we always say – don’t put all your eggs in one basket.
Relying on a single cloud provider creates a single point of failure: once the CSP suffers an outage or misconfiguration, there is a good chance our systems will be down as well, and we will be unable to serve our customers.
A common way to address this problem is to design a multi-cloud environment (for example, decoupling everything on top of Kubernetes), but in doing so we lose many of the cloud-native capabilities that each CSP's services provide.
On the other hand, migrating to a multi-cloud architecture increases the overall complexity of the entire solution (deployment, maintenance, observability, incident response, etc.), not to mention the need for highly skilled personnel (DevOps, SRE, architects, cybersecurity, FinOps, etc.) with experience across different cloud providers, all while trying to maintain a single place for visibility and control.
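To make that complexity concrete, here is a minimal sketch (in Python, with hypothetical endpoint URLs) of client-side failover between two providers. Even this toy version says nothing about data replication, identity, or observability across clouds, which is where the real multi-cloud effort goes:

```python
import urllib.request

# Hypothetical endpoints for illustration only - replace with your own.
ENDPOINTS = [
    "https://api.primary-cloud.example.com/health",    # workload on CSP A
    "https://api.secondary-cloud.example.com/health",  # replica on CSP B
]

def first_healthy_endpoint(endpoints, timeout=2):
    """Return the first endpoint that answers its health check, or None."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except OSError:
            continue  # provider unreachable - try the next one
    return None

if __name__ == "__main__":
    target = first_healthy_endpoint(ENDPOINTS)
    print(f"Routing traffic to: {target}" if target else "All providers are down")
```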
DDoS mitigation
In the past, when organizations wanted to mitigate the risk of DDoS attacks on their perimeter, they either deployed a physical (or virtual) DDoS protection appliance in their DMZ, or purchased a DDoS protection service from their ISP, which naturally had much better control over the Internet bandwidth.
Today, most organizations will simply purchase a SaaS solution such as Cloudflare, Akamai, or any DDoS solution from their CSP of choice, which will handle large volumetric attacks much better than any organization or even an ISP.
Could your organization handle a DDoS attack by itself or through your local ISP? Perhaps, up to a certain volume, but few can sustain mitigation of extremely large DDoS attacks over time, so at the end of the day, most organizations choose the same few DDoS protection solutions, through which a large share of Internet traffic passes.
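To see why volume is the deciding factor, here is a back-of-the-envelope comparison. The numbers are illustrative, though publicly reported volumetric attacks have reached multiple terabits per second:

```python
# Illustrative bandwidth math: why self-hosted DDoS mitigation breaks down.
UPLINK_GBPS = 10   # a generous enterprise Internet uplink
ATTACK_TBPS = 5    # in the range of large attacks reported in recent years

attack_gbps = ATTACK_TBPS * 1_000
ratio = attack_gbps / UPLINK_GBPS
print(f"A {ATTACK_TBPS} Tbps attack is {ratio:,.0f}x a {UPLINK_GBPS} Gbps uplink;")
print("the pipe saturates long before any on-prem appliance can filter traffic.")
```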
Public Cloud? I can do it better by myself…
A very naïve opinion keeps appearing on many websites, social networks, and forums: moving to the public cloud and relying on an external provider with my data and systems was a mistake; I can do a better job in-house.
Really? Can your organization build and maintain a data center with the same efficiency, resiliency, and high availability as the hyperscale cloud providers?
The hyperscale cloud providers' main business is to build large and highly resilient data centers all around the world. They put a lot of resources into efficiency, observability, expansion, cybersecurity, and personnel recruitment.
How many organizations (private or public sector) can honestly say they have similar resources and expertise to invest in something that does not bring them any business value?
If you are a decision maker and the past outages and misconfigurations lead you to the conclusion that you can do a better job than the hyperscale CSPs, good luck with that.
Where’s my compensation?
As a paying customer of a public cloud solution, you might be wondering – I’m paying for a service, the CSP promised me an SLA for service availability, and if I cannot serve my customers, the CSP should compensate me for the time my services were unavailable and for the money I lost.
This is a logical belief; however, even if you choose to file a lawsuit against a hyperscale CSP and demand compensation, the CSP will point you to a section in its contract that specifically rules this out, and in the best case, the CSP will offer you credits to spend on its cloud services.
I’m not saying credits are not worth actual money, but they are far from compensation for revenue or reputational loss.
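As a back-of-the-envelope illustration, consider how a tiered SLA credit compares to actual revenue loss. The figures and credit tiers below are made up for the example and do not reflect any specific CSP's terms:

```python
# Illustrative only: credit tiers and figures below are assumptions,
# not any specific CSP's actual SLA terms.
MONTHLY_CLOUD_BILL = 50_000   # USD paid to the CSP per month
REVENUE_PER_HOUR   = 20_000   # USD your business earns per hour
OUTAGE_HOURS       = 6
HOURS_PER_MONTH    = 730

uptime_pct = 100 * (1 - OUTAGE_HOURS / HOURS_PER_MONTH)  # ~99.18%

# Hypothetical tiered credit schedule (percent of monthly bill).
if uptime_pct >= 99.9:
    credit_pct = 0
elif uptime_pct >= 99.0:
    credit_pct = 10
else:
    credit_pct = 25

credit = MONTHLY_CLOUD_BILL * credit_pct / 100
lost_revenue = REVENUE_PER_HOUR * OUTAGE_HOURS

print(f"Measured uptime: {uptime_pct:.2f}%")
print(f"Service credit:  ${credit:,.0f} (in cloud credits, not cash)")
print(f"Revenue lost:    ${lost_revenue:,.0f}")
```

Under these assumed numbers, a six-hour outage yields a $5,000 credit against $120,000 in lost revenue – two orders of magnitude apart.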
If you are a big cloud customer (in terms of monthly spend), or a valuable customer from the CSP's point of view (doing something that is in the CSP's focus or interest), you might be able to negotiate your terms, and who knows, perhaps you will get something beyond the standard terms from your CSP.
Risk Acceptance
Systems break and outages occur; unfortunately, this is nothing new, but rather a fact organizations need to live with.
I know that no CEO wants to stand before their board of directors, explain an outage and the resulting loss of revenue, and admit that it all happened because of the choice to move workloads to the public cloud. But let me ask you this: do you honestly believe you could have done a better job in-house? I seriously doubt it.
Moving workloads to the public cloud, or building systems in the cloud in the first place, has its risks, like many other things in life.
The organization needs to understand what its risk appetite is regarding system availability.
I personally believe that for most organizations, a short outage once (or perhaps twice) a year is within their risk appetite. Consider the alternative of having to build and maintain your own data center, hoping to achieve an SLA even close to that of the hyperscale cloud providers.
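For context, here is a quick sketch translating common availability targets ("nines") into the downtime they permit per year, which helps frame that risk-appetite conversation:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours

# Common availability targets ("nines") and the downtime they permit.
for sla in (99.0, 99.9, 99.95, 99.99, 99.999):
    downtime_hours = HOURS_PER_YEAR * (1 - sla / 100)
    print(f"{sla:>7}% uptime -> {downtime_hours * 60:8.1f} minutes/year "
          f"({downtime_hours:6.2f} hours)")
```

Even a 99.9% SLA permits close to nine hours of downtime per year, so "one short outage a year" is well within what most providers contractually promise.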
Summary
Throughout this blog post, I have shared my personal opinion about the different ways organizations look at cloud outages.
I wanted to close with something optimistic, or at least practical advice on how to mitigate or lower the risk of a cloud outage, but the hard reality is that there is no simple solution. At the end of the day, each organization has to do its own risk management, decide which risks it can live with, and look for alternatives for the risks that are beyond its risk appetite.
I still believe that the public cloud is the best place for most organizations' IT systems, considering the on-prem or legacy alternatives.
Don't get me wrong – cloud providers should be accountable for their mistakes. They must raise the bar on observability and change-management processes to make sure misconfigurations do not happen; otherwise, they breach their customers' trust, not to mention the huge impact they have on the global Internet.
You could claim that co-location, alternate clouds, neo-clouds, or even a sovereign cloud is the right solution for your organization, but at the end of the day, everything breaks eventually. There are no foolproof solutions, so be realistic: do you really expect all services to be up 100% of the time?
Recommended reading/watching
- Who Really Pays When the Cloud Fails? Inside the October 2025 AWS Outage by David Linthicum
- Cloud fragility is costing us billions by David Linthicum
- LinkedIn post by Allen Holub
About the author
Eyal Estrin is a seasoned cloud and information security architect, AWS Community Builder, and author of Cloud Security Handbook and Security for Cloud Native Applications. With over 25 years of experience in the IT industry, he brings deep expertise to his work.
Connect with Eyal on social media: https://linktr.ee/eyalestrin.
The opinions expressed here are his own and do not reflect those of his employer.