On March 10, 2021, our news feeds featured the fire that destroyed a data center at OVHCloud, one of Europe’s major public cloud providers.
In its official Twitter update, OVHCloud recommended that customers activate their disaster recovery plans.
Many organizations still run their production workloads on premises, or run several production workloads in the public cloud. But how many have planned for a disaster? How many have tested their disaster recovery plans to see whether they meet business needs, and how many have ever verified that those plans actually work?
I cannot stress enough how important it is to plan your architecture in advance and make sure it supports your business goals.
When organizations first migrate workloads to the cloud, they tend to believe that everything in the cloud is fully redundant – and will always be available.
To debunk this myth, we need to understand how services are built.
Some services, like compute (virtual machines, managed databases, etc.), are deployed to a specific availability zone (AKA “data center”) by default. In case of a disaster in that availability zone, customers can lose resources and perhaps even data.
To add resiliency to compute services, we need to deploy them in multiple locations / availability zones, behind a load-balancer, and make sure they remain available in any scenario (data center disaster, DDoS attack, high volume of concurrent customers, etc.).
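The idea above can be sketched in a few lines of Python. This is a toy model, not a real load balancer: the instance names, zone names, and health flags are hypothetical, and in practice a managed load balancer with health checks does this for you.

```python
import random

class Instance:
    """A compute node pinned to one availability zone (names are made up)."""
    def __init__(self, name, zone):
        self.name = name
        self.zone = zone
        self.healthy = True

class LoadBalancer:
    """Routes each request to a random healthy instance, in any zone."""
    def __init__(self, instances):
        self.instances = instances

    def route(self):
        targets = [i for i in self.instances if i.healthy]
        if not targets:
            raise RuntimeError("no healthy targets in any availability zone")
        return random.choice(targets)

fleet = [Instance("web-1", "zone-a"), Instance("web-2", "zone-b")]
lb = LoadBalancer(fleet)

# Simulate a disaster in zone-a: its instance fails health checks,
# and traffic keeps flowing to the surviving zone.
for inst in fleet:
    if inst.zone == "zone-a":
        inst.healthy = False

print(lb.route().zone)  # zone-b
```

Because the service runs in two zones behind the balancer, losing an entire zone degrades capacity but does not take the service down.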
Other services, such as storage (Object storage, block storage, file storage), are copied to multiple availability zones within the same geographic area (AKA “Region”), without any action required on the customer side.
Before designing a new service, whether for an organization or for customers, it is important to evaluate exactly what those customers expect from the service. The customers could be senior management, business owners, internal or external customers, etc.
Active-Active disaster recovery
If the primary business goal is to build a complete mirror of IT infrastructure in the cloud and ensure it is available all the time, several issues need to be addressed:
- Compute – Deploy virtual machines (or containers) in different availability zones.
- Database – Deploy a cluster of managed database engines – each node in a different availability zone.
- Storage – Use managed storage services (object storage, block storage, file storage), instead of maintaining your own file servers.
- DNS – Use managed DNS services to reroute traffic between primary and secondary data centers.
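The DNS piece of the list above can be illustrated with a small Python sketch of failover routing: answer queries with the primary site while it is healthy, otherwise fall back to the secondary. The record names and IP addresses are hypothetical; managed DNS services implement this pattern with built-in health checks.

```python
# Hypothetical failover records for a service running in two sites.
RECORDS = {
    "primary":   {"ip": "203.0.113.10", "healthy": True},
    "secondary": {"ip": "203.0.113.20", "healthy": True},
}

def resolve(records):
    """Return the primary IP while it passes health checks, else the secondary."""
    if records["primary"]["healthy"]:
        return records["primary"]["ip"]
    if records["secondary"]["healthy"]:
        return records["secondary"]["ip"]
    raise RuntimeError("both sites are down")

print(resolve(RECORDS))  # 203.0.113.10 (primary is healthy)

# The primary data center goes down; DNS answers switch to the secondary.
RECORDS["primary"]["healthy"] = False
print(resolve(RECORDS))  # 203.0.113.20
```

In a real deployment, DNS TTLs matter: clients cache the old answer until the record expires, so short TTLs shorten the effective failover time.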
Building a highly available service has its own pros and cons.
Pros:
- Services remain available all the time.
- Minimal customer impact.
Cons:
- Maintaining an active-active architecture might cost more than the risk of downtime it mitigates.
- Data replication issues between sites need to be addressed.
- Both sites must be maintained at the same software/patching level.
Active-Standby disaster recovery
If the goal is to have a secondary data center available in case of disaster, there are several options:
- Manual deployment and fail-over:
- Compute - Create images of virtual servers (or containers) and replicate the images to both sites. In a disaster, manually deploy the compute nodes from pre-configured images.
- Database – Deploy a cluster of database servers in fail-over mode (or use read replicas).
- Storage – Use managed storage services’ built-in replication capabilities.
- Automatic deployment and fail-over:
- Compute – Use infrastructure as code (HashiCorp Terraform, AWS CloudFormation, Azure Resource Manager, Google Cloud Deployment Manager, etc.) to automatically deploy entire production environments using declarative languages.
- Database – Deploy database servers using infrastructure as code (deploy, configure, restore data).
- Storage - Use managed storage services’ built-in replication capabilities.
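The core idea behind the infrastructure-as-code tools named above is declarative: you describe the desired state, the tool diffs it against what actually exists, and applies only the changes needed. The sketch below models that plan step in plain Python; the resource names and attributes are hypothetical, not any particular tool's syntax.

```python
# Desired state: what the environment should look like (declared once).
desired = {
    "vm-web-1": {"size": "medium", "zone": "zone-a"},
    "vm-web-2": {"size": "medium", "zone": "zone-b"},
    "db-main":  {"size": "large",  "zone": "zone-a"},
}

# Actual state: what currently exists (e.g. a partially built standby site).
actual = {
    "vm-web-1": {"size": "small", "zone": "zone-a"},  # configuration drift
}

def plan(desired, actual):
    """Compute the changes needed to make `actual` match `desired`."""
    to_create = {k: v for k, v in desired.items() if k not in actual}
    to_update = {k: v for k, v in desired.items()
                 if k in actual and actual[k] != v}
    to_delete = [k for k in actual if k not in desired]
    return to_create, to_update, to_delete

create, update, delete = plan(desired, actual)
print(sorted(create))  # ['db-main', 'vm-web-2']
print(sorted(update))  # ['vm-web-1']
print(delete)          # []
```

Because the same declaration can be applied to an empty secondary region, this is what makes "deploy multiple identical environments within minutes" possible in a disaster.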
Building active-standby disaster recovery environments has pros and cons.
Pros:
- Easily deploy multiple identical environments within minutes.
- Low to zero cost for cloud resource consumption when not in use.
Cons:
- Learning curve for scripting languages.
- Data replication issues between sites need to be addressed.
- Both sites must be maintained at the same software/patching level.
A disaster recovery plan is not bullet-proof.
To stay on the safe side, advance planning is a must when building production environments. Customer expectations must be taken into consideration, along with the question of how long unexpected downtime can be tolerated.
Keep testing, at least once a year. If possible, conduct fail-over tests a few times a year.
If you have not tested your disaster recovery plan for a long time, do not expect it to work.
Your resources might have changed over time, employees might have left or may no longer remember how to recover your compute environments, etc.
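A simple way to make those fail-over tests meaningful is to time them against an agreed recovery time objective (RTO). The sketch below shows the arithmetic; the timestamps and the four-hour RTO are assumed example values, not figures from any real incident.

```python
from datetime import datetime, timedelta

# Assumed recovery time objective agreed with the business.
RTO = timedelta(hours=4)

# Timestamps recorded during a fail-over drill (example values).
outage_start = datetime(2021, 3, 10, 1, 0)
service_restored = datetime(2021, 3, 10, 3, 30)

recovery_time = service_restored - outage_start
print(recovery_time)         # 2:30:00
print(recovery_time <= RTO)  # True: the drill met the objective
```

Recording these numbers on every drill turns "we have a disaster recovery plan" into evidence that the plan still works with the people and resources you have today.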