Denis Cooper

Posted on Nov 19 • Originally published at deniscooper.co.uk on Nov 7

Is the Cloud Really Fault Proof?

#dr #azure #guide #ha

What is the Cloud, Really?
Designing for Cloud Reliability

We have recently seen large-scale disruptions across two major cloud platforms: first AWS, then Azure. These high-profile outages have sparked fresh questions about cloud reliability and resilient design. I’ve seen plenty of posts and comments on forums and LinkedIn suggesting that “the cloud is doomed” and that we should all go back to running everything on-premises.

It seems there are still some cloud skeptics out there – and rightly so. The recent outages are a good reminder that simply running your systems in “the cloud” doesn’t make them immune to failure.

In this article, I want to explore what we can (and should) do to design resilient systems — and bust a few myths along the way. Using a cloud provider doesn’t automatically mean your systems are protected, backed up, or guaranteed 100% uptime. Reliability is something we design for, not something that comes out of the box.

What is the Cloud, Really?

Let’s start by being clear about what “the cloud” actually is. It’s a collection of interconnected data centres offering platforms and services you can deploy to or consume. Whether you’re using IaaS, PaaS, or a mix of both, you still have responsibility for how your workloads are configured, deployed, and maintained.

Yes, cloud providers offer automation and protection for certain failure scenarios, but if you need your workloads to stay up through a failure, you have to design for it.

I’ve heard people argue that relying on a single cloud provider is “putting all your eggs in one basket.” When an AWS or Azure region experiences issues, multiple businesses are impacted at once — and the scale of those events makes headlines.

But we should remember: the same could be said for on-premises solutions. Consider the impact if VMware pushed an update that broke virtual environments globally, or if a major telecom provider like BT, AT&T, or Verizon suffered a nationwide outage. Those events would take down thousands of businesses too.

The real takeaway isn’t where we host workloads — cloud or on-premises — but how we design them to handle real-world disruptions. Resilience comes from engineering systems that anticipate and mitigate failure, regardless of the platform.

Designing for Cloud Reliability

There are far too many technologies and architectures to cover in one post, so instead of listing specific tools, let’s focus on the core design principles from the Microsoft Well-Architected Framework and how they influence reliability.

The Five Pillars of the Well-Architected Framework

Reliability – Ensure your applications recover from failures and continue to function.
Security – Protect applications and data from threats.
Cost Optimisation – Deliver business value by managing costs effectively.
Operational Excellence – Keep systems running smoothly through automation and continuous improvement.
Performance Efficiency – Ensure your solution scales to meet demand efficiently.

True cloud reliability comes from understanding your shared responsibility model and architecting for redundancy, not from assuming the platform will never fail.

Let’s focus on the first pillar – Reliability – and explore how to apply it across both cloud and on-premises environments.

The Reliability Pillar – Building Resilient Systems

Goal: Ensure a system can recover from failures, continue operating correctly, and meet availability commitments.

Mindset: Reliability isn’t about never failing; it’s about failing gracefully and recovering predictably.

1. Design for Failure and Graceful Degradation

Cloud

Assume everything can and will fail – design stateless services, redundant regions, and fault-tolerant architectures.
Use managed services with built-in high availability backed by an SLA (e.g., Azure SQL HA, AWS RDS Multi-AZ).
Implement retry logic with exponential backoff and circuit breaker patterns.

On-Premises

Use redundant hardware (power, NICs, storage paths).
Implement clustering and heartbeat monitoring (e.g., Windows Failover Cluster, VMware HA).
Regularly test failover procedures.

Key principle: Failure is expected – resilience is engineered.
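
To make the retry and circuit breaker bullet from the Cloud list more concrete, here is a minimal Python sketch. It is an illustration under my own assumptions, not a production library: the names `CircuitBreaker`, `CircuitOpenError`, and `retry_with_backoff`, and all the default thresholds and delays, are hypothetical. In real projects you would normally reach for a maintained resilience library or your SDK's built-in retry policy.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures, then fails fast
    until `reset_timeout` seconds have passed, when one trial call is allowed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the breaker can be tested without waiting
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: fall through and allow a single trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry_with_backoff(func, max_attempts=4, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep, rng=random.random):
    """Retry `func`, doubling the delay ceiling each attempt and applying jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # retrying an open circuit only adds load, so give up now
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(ceiling * rng())  # random jitter avoids synchronised retries
```

A typical combination is `retry_with_backoff(lambda: breaker.call(fetch_orders))`, where `fetch_orders` stands in for whatever remote call you are protecting. The retries absorb short blips, while the breaker stops a struggling dependency from being hammered during a longer outage.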

2. Redundancy and High Availability

Cloud

Deploy across Availability Zones and paired regions.
Replicate data asynchronously (e.g., Azure GRS Storage, AWS S3 Cross-Region Replication).
Use load balancers and global routing (e.g., Azure Front Door, Traffic Manager) for failover. Global services can and do fail too, so don’t assume “global” means resilient. Combine multiple options where necessary.

On-Premises

Design redundant power, network, and cluster paths.
Use stretched clusters or DR sites with replication (e.g., SQL Always On, Veeam).

Key principle: No single point of failure.
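
It helps to put rough numbers on why single points of failure hurt and why redundancy pays off. The sketch below uses the standard textbook formulas for components in series and in parallel. It assumes failures are independent, which real outages often are not (a shared control plane or DNS dependency can take out "redundant" copies together), so treat the results as an optimistic upper bound. The function names are my own.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain where every component must be up."""
    return prod(availabilities)


def redundant_availability(availability, copies):
    """Availability of `copies` independent replicas where any one is enough."""
    return 1 - (1 - availability) ** copies


def downtime_minutes_per_year(availability):
    """Expected yearly downtime, using a 365-day year."""
    return (1 - availability) * 365 * 24 * 60
```

For example, `serial_availability([0.999, 0.999])` is about 0.998, so two 99.9% components chained together are less available than either alone. Going the other way, `redundant_availability(0.99, 2)` is about 0.9999, provided the two copies really do fail independently.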

3. Monitoring, Telemetry, and Health Modelling

Cloud

Use telemetry and health probes (Azure Monitor, Application Insights, CloudWatch).
Automate recovery actions and alerting.
Detect degradation early with availability tests and service health alerts.

On-Premises

Centralise monitoring (SCOM, Prometheus, Zabbix, or Nagios).
Correlate logs in a SIEM (Sentinel, Splunk).
Measure end-to-end service health, not just uptime.

Key principle: You can’t fix what you can’t see.
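
Measuring service health rather than just uptime usually means rolling several checks into one status. Here is a small sketch of that idea. The three-state model and the names `Health` and `evaluate_health` are my own illustration, not the API of Azure Monitor or any other tool mentioned above.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"    # a non-critical dependency is failing
    UNHEALTHY = "unhealthy"  # a critical dependency is failing


def evaluate_health(checks, critical):
    """Run each check and roll the results up into a single status.

    `checks` maps a dependency name to a zero-argument callable that returns
    True when healthy. `critical` is the set of names the service cannot
    function without.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a probe that raises counts as a failure

    failed = {name for name, ok in results.items() if not ok}
    if failed & set(critical):
        status = Health.UNHEALTHY
    elif failed:
        status = Health.DEGRADED
    else:
        status = Health.HEALTHY
    return status, results
```

The useful part is the middle state. A service that has lost its cache but can still reach its database is degraded, not down, and your alerting and load balancer probes can treat those two cases differently.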

4. Backup, Recovery, and Disaster Recovery (DR)

Cloud

Define RPO and RTO per workload.
Use Azure Backup, Site Recovery, or multi-region replication.
Automate DR testing and validate recovery playbooks.

On-Premises

Use snapshot-based backups and offsite replication.
Test restores regularly — not just backups.
Use immutable or offline storage to defend against ransomware.

Key principle: Backup is not recovery until tested.
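
Defining RPO and RTO per workload is only useful if something checks them. As a rough sketch, assuming you can obtain the timestamp of the last successful backup and the duration of the last tested restore from your backup tooling, a check might look like this. `RecoveryObjective` and `recovery_gaps` are hypothetical names of mine.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class RecoveryObjective:
    workload: str
    rpo: timedelta  # maximum tolerable window of data loss
    rto: timedelta  # maximum tolerable time to restore service


def recovery_gaps(objective: RecoveryObjective,
                  last_backup: datetime,
                  now: datetime,
                  tested_restore_time: Optional[timedelta]) -> List[str]:
    """Return a list of problems; an empty list means both objectives are met."""
    gaps = []
    if now - last_backup > objective.rpo:
        gaps.append("RPO breached: last backup is older than the objective")
    if tested_restore_time is None:
        # No restore test on record means the RTO is a hope, not a measurement.
        gaps.append("RTO unproven: no restore test on record")
    elif tested_restore_time > objective.rto:
        gaps.append("RTO breached: last tested restore took too long")
    return gaps
```

Note that a workload with no recorded restore test is flagged as a gap. That is deliberate: it encodes the principle above that a backup you have never restored is not yet a recovery plan.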

5. Capacity and Scalability Planning

Cloud

Use autoscaling (VMSS, AKS, App Service).
Design for scale-out, not scale-up.
Use queue-based load levelling to handle burst traffic.

On-Premises

Forecast capacity and monitor utilisation trends.
Use HCI or modular infrastructure for flexibility.
Consider hybrid cloud bursting for peak loads.

Key principle: Reliability fails when capacity is exhausted.
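
Queue-based load levelling is easier to picture with a toy simulation. The sketch below is purely illustrative: it uses an in-memory queue and made-up traffic numbers, whereas in practice the queue would be a durable service such as a message broker. The function name `level_load` is my own.

```python
from collections import deque


def level_load(arrivals_per_tick, capacity_per_tick):
    """Simulate a backend draining a queue at a fixed rate.

    Returns two lists: the work processed in each tick and the queue depth
    left at the end of each tick.
    """
    queue = deque()
    processed, depth = [], []
    for tick, arrivals in enumerate(arrivals_per_tick):
        # Producers enqueue immediately, however large the burst.
        queue.extend((tick, i) for i in range(arrivals))
        # The consumer never takes more than its capacity per tick.
        done = 0
        while queue and done < capacity_per_tick:
            queue.popleft()
            done += 1
        processed.append(done)
        depth.append(len(queue))
    return processed, depth
```

With `level_load([10, 0, 0, 2], capacity_per_tick=4)` the backend processes `[4, 4, 2, 2]` items per tick while the queue depth goes `[6, 2, 0, 0]`. The burst of ten is absorbed by the queue instead of overwhelming the backend, at the cost of some added latency for the queued items.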

6. Change Management and Chaos Testing

Cloud

Use IaC and CI/CD pipelines for predictable environments.
Deploy updates gradually with Blue/Green or Canary models.
Test resilience with chaos engineering (Azure Chaos Studio, AWS FIS).

On-Premises

Manage configuration with version control (Ansible, DSC, Puppet).
Validate updates in staging before production rollout.
Maintain rollback plans for firmware and software.

Key principle: Reliability is operational discipline, not luck.
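
For gradual rollouts, one common approach to canary routing is to hash a stable identifier into a bucket and compare it against the rollout percentage. This is a minimal sketch of that approach, not how any particular platform implements it. The function name `in_canary` and the `salt` parameter are assumptions of mine.

```python
import hashlib


def in_canary(user_id: str, rollout_percent: int, salt: str = "release-1") -> bool:
    """Deterministically decide whether a user gets the new version.

    The same user always lands in the same bucket for a given salt, so
    raising `rollout_percent` only ever adds users to the canary group.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return bucket < rollout_percent
```

Because the decision is deterministic, a user does not flip between old and new versions on every request. You can widen the rollout from, say, 5% to 25% to 100% while watching your health signals, and roll back by setting the percentage to zero.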

7. Dependency Management

Cloud

Map dependencies with Application Insights or Service Map.
Use queues and event-driven design to decouple services.
Prefer managed dependencies (databases, DNS, storage).

On-Premises

Segment workloads with VLANs or SDN.
Document dependencies in your CMDB.
Use APIs and message buses for internal decoupling.

Key principle: Loosely coupled systems fail independently, not catastrophically.
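
Once dependencies are mapped, you can ask a simple question of the map: if this component fails, what else goes with it? The sketch below walks a dependency map to find that blast radius. The service names are made up, and `impacted_by` is a hypothetical helper rather than a feature of any CMDB or monitoring tool.

```python
from collections import deque


def impacted_by(failed, depends_on):
    """Return every service that directly or indirectly depends on `failed`.

    `depends_on` maps each service to the set of services it cannot run
    without (hard dependencies only).
    """
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service != failed and service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted


# Hypothetical example map.
example_map = {
    "web": {"api"},
    "api": {"db", "cache"},
    "reports": {"db"},
    "db": set(),
    "cache": set(),
}
```

Here `impacted_by("db", example_map)` returns `{"api", "web", "reports"}`. Every hard dependency you replace with a queue or a graceful fallback removes an edge from this map and shrinks the set of services a single failure can take down.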

Summary

Principle	Cloud Focus	On-Prem Focus	Key Concept
Design for Failure	Fault-tolerant microservices	Clustered services	Fail gracefully
Redundancy	Multi-zone, multi-region	Hardware & site redundancy	Eliminate SPOFs
Monitoring	Azure Monitor, Log Analytics	SCOM, SNMP, SIEM	Detect & respond early
Backup & DR	Geo-redundant, automated	Offsite & tested	Recover predictably
Scalability	Autoscale, scale-out	Capacity planning	Avoid resource exhaustion
Change Control	IaC, pipelines, chaos testing	Config management, rollback	Controlled evolution
Dependency Mgmt	Queues, retries, isolation	Segmentation, decoupling	Contain failure domains

Final Thoughts

Cloud isn’t fault-proof — and it never will be. But neither is on-premises. Outages are inevitable, regardless of where systems live. What truly matters is how we design for those failures.

If you design with reliability in mind, by building for redundancy, automating recovery, monitoring intelligently, and testing relentlessly, you can deliver systems that stay resilient in the face of almost anything. Not every system needs the same level of reliability, though, so focus your investment and resilience design on the mission-critical ones.

Remember, cloud reliability isn’t about perfection — it’s about anticipating failure, mitigating impact, and keeping mission-critical systems running no matter where they live.

Because reliability isn’t a checkbox you tick once; it’s a discipline you live by.