I recently completed a project where I built both resilient and non-resilient cloud resources using Terraform to test a cloud resilience monitoring tool that's still under construction. The objective was to calculate the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for individual cloud resources, with the proof of concept being done on Azure.
As an AWS-native DevOps engineer, I had to get familiar with Azure fast. I took a 10-hour Udemy course and dove straight into provisioning infrastructure-as-code. It was my responsibility to set up all the resources, and that meant learning a completely new cloud platform while implementing disaster recovery solutions that actually work.
Understanding Disaster Recovery: RTO and RPO
The main differentiators between resilient and non-resilient resources came down to two things: disaster recovery configuration and the service tier you choose (Basic, Standard, or Premium). The configuration you deploy a resource with determines how much it can withstand during a disaster.
Recovery Time Objective (RTO) is how long it takes to restore systems and return operations to normal after a disruption. It's the maximum acceptable time within which recovery must be achieved to avoid significant business impact. For example, a system with an RTO of 5 hours must be back online within 5 hours of a failure.
Recovery Point Objective (RPO) defines the maximum amount of data loss tolerable, measured in time. It indicates how frequently data backups or replication occur. For example, an RPO of 15 minutes means data is backed up or replicated at least every 15 minutes. It shows how much data an organisation can afford to lose if systems go down.
RTO and RPO differ for every organisation, and the goal of disaster recovery is to ensure these objectives are met in case of any disaster. How do you ensure this? By provisioning resources with failure in mind. You can't build a perfect system, but you can build a perfect recovery system.
How long will it take to bounce back after downtime, and how much data would you lose? Those are things within your control.
What Makes Cloud Resources Resilient
One key thing that makes cloud resources resilient is redundancy: how many copies of a resource exist and where those copies live. You can have resources replicated within the same zone, but that alone doesn't make them resilient, because when something happens to that zone, every resource in it is affected.
Azure offers different redundancy options, each with its own availability and durability guarantees, ranging from a 99.9% (three nines) availability SLA up to 99.999999999% (eleven nines) of annual durability for locally redundant storage, and higher still for geo-redundant options. When it comes to storage, it's critical to replicate your data across different data centres, zones, and regions.
Azure Storage Redundancy Options
Locally Redundant Storage (LRS): Data is replicated three times within a single data centre, with at least 99.999999999% (eleven nines) durability of objects over a given year. It protects against local hardware failures but not data-centre-level faults.
Zone-Redundant Storage (ZRS): Data is synchronously replicated across three separate Availability Zones within a region, enhancing resilience against zone or data centre failures. Your data remains accessible even if one zone fails.
Geo-Redundant Storage (GRS): Combines LRS in the primary region with asynchronous replication to a secondary region to guard against entire regional failures.
Geo-Zone-Redundant Storage (GZRS): Combines ZRS in the primary region with asynchronous replication to a secondary region for both zonal and regional fault tolerance.
The kind of storage you choose determines how resilient the whole system is. It's literally how your data is stored and protected.
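In Terraform's azurerm provider, the redundancy option is a single argument on the storage account, so the gap between a non-resilient and a resilient account is one line of configuration. A minimal sketch, with placeholder names and resource group:

```hcl
# Non-resilient: locally redundant only, the cheapest option.
resource "azurerm_storage_account" "non_resilient" {
  name                     = "stnonresilientdemo" # placeholder, must be globally unique
  resource_group_name      = "rg-resilience-demo" # placeholder
  location                 = "westus2"
  account_tier             = "Standard"
  account_replication_type = "LRS"
}

# Resilient: zone-redundant in the primary region, plus
# asynchronous replication to the paired secondary region.
resource "azurerm_storage_account" "resilient" {
  name                     = "stresilientdemo" # placeholder
  resource_group_name      = "rg-resilience-demo"
  location                 = "westus2"
  account_tier             = "Standard"
  account_replication_type = "GZRS"
}
```

Changing `account_replication_type` between LRS, ZRS, GRS, and GZRS is what moves an account up and down the redundancy ladder described above.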
Resilience Is Component-Specific
Each component has different ways to be resilient, so how do you compare them? You don't. Resilience is assessed individually, based on how each component is configured.
The resources I deployed include:
- Virtual Machines
- SQL Database
- Azure Kubernetes Service (AKS)
- Azure Container Registry (ACR)
- Managed Disks
- Azure Data Lake Storage (ADLS)

Making them all resilient is a different exercise for each and must be worked on individually. Each service has its own disaster recovery mechanism - SQL failover groups, ADLS native replication, and so on.
Infrastructure as Code with Terraform
I provisioned each component using Terraform and used modules to set up individual resources. I configured a remote backend using Azure Blob Storage and implemented state locking with Azure storage to prevent concurrent modifications.
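A remote backend on Azure Blob Storage looks roughly like the sketch below; the resource group, storage account, and state key names are placeholders. The azurerm backend acquires a blob lease on the state file during operations, which is what provides the state locking:

```hcl
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-tfstate"    # placeholder
    storage_account_name = "sttfstatedemo" # placeholder
    container_name       = "tfstate"
    key                  = "resilience-poc.terraform.tfstate"
  }

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }
}

provider "azurerm" {
  features {}
}
```

With this in place, two engineers running `terraform apply` at the same time can't corrupt the state: the second run fails to acquire the blob lease and waits or errors out.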
Disaster Recovery Solutions by Resource
Virtual Machines
I configured Azure Site Recovery (ASR) for the resilient VM with replication to a paired region, set up a backup policy with scheduled snapshots in a Recovery Services Vault, used Premium SSDs, and deployed across multiple Availability Zones.
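The zone placement and Premium SSD parts of that setup can be sketched as follows. Everything here is illustrative: the names, region, image, and the network interface (assumed to be defined elsewhere) are placeholders, and in the real deployment multiple VMs would be spread across zones rather than a single VM pinned to one:

```hcl
resource "azurerm_linux_virtual_machine" "resilient" {
  name                  = "vm-resilient-demo"  # placeholder
  resource_group_name   = "rg-resilience-demo" # placeholder
  location              = "westus2"
  size                  = "Standard_D2s_v3"
  zone                  = "1" # each VM instance is pinned to one AZ
  admin_username        = "azureuser"
  network_interface_ids = [azurerm_network_interface.vm_nic.id] # assumed defined elsewhere

  admin_ssh_key {
    username   = "azureuser"
    public_key = file("~/.ssh/id_rsa.pub")
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Premium_LRS" # Premium SSD
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts-gen2"
    version   = "latest"
  }
}
```

The ASR replication and Recovery Services Vault pieces are covered in more detail later in the article.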
Azure Kubernetes Service (AKS)
The disaster recovery solution for AKS included multiple node pools spread across Availability Zones and a backup solution using Azure Backup. Interestingly, the backup vault couldn't be deleted immediately after running terraform destroy; it lingered for about a week, a safety feature that prevents accidental data loss.
Managed Disks
I used Premium SSD managed disks, automated snapshot policies to back up the disks, and cross-region replication for both the managed disk and its snapshots.
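A sketch of a zone-pinned Premium SSD disk and a snapshot of it, with placeholder names; in the project the snapshots were driven by an automated policy rather than declared one at a time like this:

```hcl
resource "azurerm_managed_disk" "resilient" {
  name                 = "disk-resilient-demo" # placeholder
  location             = "westus2"
  resource_group_name  = "rg-resilience-demo"  # placeholder
  storage_account_type = "Premium_LRS"         # Premium SSD
  create_option        = "Empty"
  disk_size_gb         = 128
  zone                 = "1" # pin the disk to an Availability Zone
}

# Point-in-time copy of the disk; snapshots can then be
# copied to another region for cross-region protection.
resource "azurerm_snapshot" "disk_snapshot" {
  name                = "snap-disk-resilient-demo"
  location            = azurerm_managed_disk.resilient.location
  resource_group_name = azurerm_managed_disk.resilient.resource_group_name
  create_option       = "Copy"
  source_uri          = azurerm_managed_disk.resilient.id
}
```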
Azure Container Registry (ACR)
For ACR, I set up geo-replication to a secondary region (from westus2 to centralus), ensuring container images and tags are automatically copied and synchronised.
Azure Data Lake Storage Gen2
ADLS was configured with GRS replication and cross-region failover readiness.
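In Terraform terms, ADLS Gen2 is just a storage account with the hierarchical namespace turned on, so GRS is again a single replication argument. A sketch with placeholder names:

```hcl
resource "azurerm_storage_account" "adls" {
  name                     = "stadlsresilientdemo" # placeholder, globally unique
  resource_group_name      = "rg-resilience-demo"  # placeholder
  location                 = "westus2"
  account_tier             = "Standard"
  account_replication_type = "GRS"  # async copy to the paired region
  is_hns_enabled           = true   # hierarchical namespace = ADLS Gen2
}

# A filesystem (container) inside the data lake.
resource "azurerm_storage_data_lake_gen2_filesystem" "fs" {
  name               = "data"
  storage_account_id = azurerm_storage_account.adls.id
}
```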
SQL Database
The SQL database was set up with zone redundancy, active geo-replication to a paired region, and failover groups for automatic coordinated failover.
Deep Dive into Key DR Solutions
Azure Site Recovery (ASR)
Azure Site Recovery is a Disaster Recovery as a Service solution provided by Azure. I configured it separately from the resources and passed it as a data source in the root module. I created a separate directory named asr-setup-vault with its own state file and provider block.
ASR setup involves configuring the ASR fabric, the ASR protection container, a replication policy, and the protection container mapping. A shared vault was configured for both the managed disk and the VM, and a backup policy was set up for the shared vault. The expected outcome is cross-region replicated VMs and managed disks.
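The chain of ASR resources described above can be sketched with the azurerm provider's Site Recovery resources. All names, regions, and retention values below are placeholders; in the real project this lived in its own asr-setup-vault directory with its own state file and provider block:

```hcl
# The vault lives in the target (recovery) region.
resource "azurerm_recovery_services_vault" "vault" {
  name                = "rsv-asr-demo" # placeholder
  location            = "centralus"
  resource_group_name = "rg-asr-demo"  # placeholder
  sku                 = "Standard"
}

# One fabric per region: source and target.
resource "azurerm_site_recovery_fabric" "primary" {
  name                = "fabric-westus2"
  resource_group_name = azurerm_recovery_services_vault.vault.resource_group_name
  recovery_vault_name = azurerm_recovery_services_vault.vault.name
  location            = "westus2"
}

resource "azurerm_site_recovery_fabric" "secondary" {
  name                = "fabric-centralus"
  resource_group_name = azurerm_recovery_services_vault.vault.resource_group_name
  recovery_vault_name = azurerm_recovery_services_vault.vault.name
  location            = "centralus"
}

# Protection containers group the replicated items in each fabric.
resource "azurerm_site_recovery_protection_container" "primary" {
  name                 = "container-primary"
  resource_group_name  = azurerm_recovery_services_vault.vault.resource_group_name
  recovery_vault_name  = azurerm_recovery_services_vault.vault.name
  recovery_fabric_name = azurerm_site_recovery_fabric.primary.name
}

resource "azurerm_site_recovery_protection_container" "secondary" {
  name                 = "container-secondary"
  resource_group_name  = azurerm_recovery_services_vault.vault.resource_group_name
  recovery_vault_name  = azurerm_recovery_services_vault.vault.name
  recovery_fabric_name = azurerm_site_recovery_fabric.secondary.name
}

# How often recovery points are kept and snapshots taken.
resource "azurerm_site_recovery_replication_policy" "policy" {
  name                                                 = "policy-24h"
  resource_group_name                                  = azurerm_recovery_services_vault.vault.resource_group_name
  recovery_vault_name                                  = azurerm_recovery_services_vault.vault.name
  recovery_point_retention_in_minutes                  = 24 * 60
  application_consistent_snapshot_frequency_in_minutes = 4 * 60
}

# The mapping ties source container -> target container under the policy.
resource "azurerm_site_recovery_protection_container_mapping" "mapping" {
  name                                      = "mapping-primary-to-secondary"
  resource_group_name                       = azurerm_recovery_services_vault.vault.resource_group_name
  recovery_vault_name                       = azurerm_recovery_services_vault.vault.name
  recovery_fabric_name                      = azurerm_site_recovery_fabric.primary.name
  recovery_source_protection_container_name = azurerm_site_recovery_protection_container.primary.name
  recovery_target_protection_container_id   = azurerm_site_recovery_protection_container.secondary.id
  recovery_replication_policy_id            = azurerm_site_recovery_replication_policy.policy.id
}
```

With this plumbing in place, individual VMs and disks are enrolled as replicated items against the mapping.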
Data Protection Backup Vault
I used a Data Protection backup vault to store snapshots and AKS backups. A data protection backup vault is a secure, centralised storage entity designed to store and manage backup data and recovery points over time. It acts as a container for backups, providing protection through encryption, data isolation, and access control mechanisms to ensure the integrity and availability of backup data even if production systems are compromised.
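Declaring the vault itself is short; a sketch with placeholder names, using geo-redundant storage for the backup data so the backups survive a regional outage:

```hcl
resource "azurerm_data_protection_backup_vault" "backup" {
  name                = "bvault-resilience-demo" # placeholder
  resource_group_name = "rg-resilience-demo"     # placeholder
  location            = "westus2"
  datastore_type      = "VaultStore"
  redundancy          = "GeoRedundant"

  identity {
    type = "SystemAssigned" # identity used to grant the vault access to AKS/disks
  }
}
```

The vault's managed identity then needs role assignments on the resources it backs up (e.g. the AKS cluster), which is where the access-control isolation mentioned above comes from.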
Multiple Node Pools Across Availability Zones
When multiple node pools in an AKS cluster are spread across Availability Zones, it means that each node pool's virtual machines are distributed across different isolated physical locations within the same Azure region. This configuration boosts the cluster's resilience and availability because even if one AZ experiences an outage, nodes in other AZs remain functional.
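On azurerm v3, zone spreading is the `zones` argument on each node pool. A sketch with placeholder names, showing a zone-spread system pool plus a second zone-spread pool for application workloads:

```hcl
resource "azurerm_kubernetes_cluster" "resilient" {
  name                = "aks-resilient-demo" # placeholder
  location            = "westus2"
  resource_group_name = "rg-resilience-demo" # placeholder
  dns_prefix          = "aksresilient"

  default_node_pool {
    name       = "system"
    vm_size    = "Standard_D2s_v3"
    node_count = 3
    zones      = ["1", "2", "3"] # spread nodes across Availability Zones
  }

  identity {
    type = "SystemAssigned"
  }
}

# A second node pool, also zone-spread, for application workloads.
resource "azurerm_kubernetes_cluster_node_pool" "apps" {
  name                  = "apps"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.resilient.id
  vm_size               = "Standard_D2s_v3"
  node_count            = 3
  zones                 = ["1", "2", "3"]
}
```

Note that zones only protect the nodes; zone-resilient workloads also need pod topology spread constraints and zone-redundant storage for persistent volumes.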
Geo-Replication for Container Registry
Geo-replication for ACR means that the container registry's contents, including container images and tags, are automatically copied and synchronised from the primary region to one or more secondary Azure regions. This ensures high availability and reduces latency for pulling images from different geographic locations.
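In Terraform, geo-replication is a `georeplications` block on the registry; it requires the Premium SKU. A sketch mirroring the westus2-to-centralus setup from the project (registry name is a placeholder):

```hcl
resource "azurerm_container_registry" "acr" {
  name                = "acrresilientdemo"   # placeholder, must be globally unique
  resource_group_name = "rg-resilience-demo" # placeholder
  location            = "westus2"
  sku                 = "Premium" # geo-replication requires Premium

  # Replica registry in the paired/secondary region; images and tags
  # pushed to the primary are synchronised here automatically.
  georeplications {
    location                = "centralus"
    zone_redundancy_enabled = true
  }
}
```

Clients pull from the nearest replica via the same registry login server, so no pipeline changes are needed when a replica is added.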
SQL Database Failover Groups
Failover groups for Azure SQL Database enable automatic and coordinated failover of a group of databases from a primary server in one Azure region to a secondary server in another region. This ensures high availability and disaster recovery by replicating databases geo-redundantly and allowing seamless switching to the secondary region if the primary becomes unavailable due to an outage or disaster.
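A sketch of the database and failover group; the primary and secondary `azurerm_mssql_server` resources are assumed to be defined elsewhere (one per region), and names and SKUs are placeholders:

```hcl
resource "azurerm_mssql_database" "db" {
  name           = "sqldb-resilient-demo"           # placeholder
  server_id      = azurerm_mssql_server.primary.id  # assumed defined elsewhere
  sku_name       = "P1"                             # zone redundancy needs Premium/Business Critical
  zone_redundant = true
}

resource "azurerm_mssql_failover_group" "fog" {
  name      = "fog-resilient-demo" # placeholder
  server_id = azurerm_mssql_server.primary.id
  databases = [azurerm_mssql_database.db.id]

  partner_server {
    id = azurerm_mssql_server.secondary.id # server in the paired region
  }

  # Automatic failover after the grace period expires.
  read_write_endpoint_failover_policy {
    mode          = "Automatic"
    grace_minutes = 60
  }
}
```

Applications connect through the failover group's listener endpoint rather than a specific server, so a failover doesn't require connection-string changes.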
Decision Framework for Disaster Recovery
Here's a simple decision tree I used to determine the right DR strategy:
Is this data critical to business operations?
- If YES: Can you afford to lose ANY data?
  - No data loss acceptable: use active geo-replication (SQL failover groups, GZRS storage).
  - Some data loss acceptable:
    - Under 1 hour of data loss: ASR for VMs, GZRS for storage.
    - 1-24 hours of data loss: daily backups, GRS storage.
- If NO (data not critical): use the most cost-effective option - LRS storage, Basic SKUs, no ASR.
Key Takeaways
Building resilient infrastructure isn't about making everything highly available; it's about understanding your business requirements and making informed decisions about where to invest in redundancy and disaster recovery. Each Azure service has its own DR mechanisms, and you need to understand them individually to build a truly resilient system.
This project taught me that moving from AWS to Azure isn't just about learning new service names; it's about understanding fundamentally different approaches to disaster recovery and resilience. The skills are transferable, but the implementation details matter.
Got questions about building resilient infrastructure or want to discuss disaster recovery strategies? I'm always happy to discuss!
I'm a DevOps engineer and technical writer currently open to new opportunities. If you're hiring or want to connect, reach out on LinkedIn or drop me an email at nkwochaijeawele@gmail.com.