From Bare Metal to Serverless: How to Evolve Your Disaster Recovery Strategy

Intro

Imagine this scenario:
You’re working at a successful, even profitable company, you’re using the latest cutting-edge technologies out there, and you’re feeling good. Things can’t get any better than this.

But one day, you wake up in the morning to 10+ missed calls and dozens of messages that yell “Production is Down!!!”.

You find out that a disaster has occurred (your data center was set on fire, there was a regional electricity outage — you name it!), and your entire system is down.

You are probably thinking by now, “Those are on-prem problems, I’m using AWS — I have nothing to worry about!”

But what happens if a hacker encrypts your entire environment? Or maybe LLM-generated code accidentally deletes your production database?
Or even a lighter option: an entire AWS region is down?

What can you do?

That’s where Disaster Recovery comes into play.

About Me
I’m Orel Bello, an AWS Community Builder and a passionate DevOps Engineer with over 4 years of experience, including the past 3 years at Melio. My tech journey began during my military service as a Deputy Commander in the Technological Control Center for the Israel Police. After earning a B.Sc. in Computer Science, I started as a Storage and Virtualization Engineer before discovering my true calling in DevOps. Now an AWS Certified Professional in both DevOps and Solutions Architecture, I specialize in building scalable, efficient, and cost-effective cloud solutions.

One thing you should know about Melio is that our entire architecture is fully serverless. We run a large-scale environment of Lambda functions, and naturally, Lambda has become our go-to solution for nearly every challenge we need to address.

What is Disaster Recovery?

Disaster recovery (or DR for short) is what it sounds like: recovering from a disaster. There are a lot of use cases that fall into this category.

The bottom line is that you need to define a workflow to get your application up and running when your main site is down.

Your Disaster Recovery Plan (DRP) shouldn’t be a separate initiative, it needs to be fully integrated with your architecture and application logic.

Before you start building your DRP, you need to decide on your desired recovery time objective (RTO) and recovery point objective (RPO).

RTO and RPO
RTO and RPO are the main components when designing a DRP; they determine your DR strategy.

Let me explain:

  • RTO (Recovery Time Objective) is the maximum acceptable downtime. How long can your system afford to be down?
  • RPO (Recovery Point Objective) is the maximum amount of data loss you are willing to endure. How much data can you afford to lose?

Lower RTO and RPO mean less downtime and less data loss, but also (probably) a more expensive DRP.

(Many aspects can affect our decision to choose our desired RPO and RTO, like KPIs, SLAs, or our commitment to our clients and partners).

For example:
A DRP with an RTO of 5 hours and an RPO of 15 minutes means that we accept up to 15 minutes of data loss (for example, by taking a scheduled snapshot every 15 minutes), and it will take up to 5 hours to get our system up and running again.
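To make the snapshot arithmetic concrete, here is a minimal sketch (with hypothetical values) of the rule behind it: a scheduled snapshot can lose up to one full interval of writes, so the interval must not exceed the RPO target.

```python
from datetime import timedelta

def meets_rpo(snapshot_interval: timedelta, rpo_target: timedelta) -> bool:
    """Worst case is a failure just before the next snapshot:
    you lose one full interval of writes. The interval therefore
    must not be longer than the RPO target."""
    return snapshot_interval <= rpo_target

# An RPO of 15 minutes is met by snapshots every 15 minutes (or more often)
assert meets_rpo(timedelta(minutes=15), rpo_target=timedelta(minutes=15))
# Hourly snapshots cannot meet a 15-minute RPO
assert not meets_rpo(timedelta(hours=1), rpo_target=timedelta(minutes=15))
```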

Rollback or cutover?

One more thing that you need to consider when designing a DRP is Rollback or Cutover.

Let’s say we’re designing a DRP for a full regional outage on AWS, and we’ve initiated a failover to our backup region. What happens when our main region comes back online?

Should we go back to our main region, or stay in the new one?

If we’re dealing with a hacker who encrypted our entire region, we may not have a main region to go back to.

So it’s really important to ask ourselves those questions before defining our DRP; the answer to those questions will determine our strategy.

How does DR work in an on-prem situation?

OK, now we know what a DR is a little better, but before we jump into DR on cloud, let’s get back to the basics.

How does the good, old-fashioned on-prem DR work?

On AWS, we can just spin up a new DR environment with a few clicks (yes, I know I’m exaggerating).

But on-prem, that’s a whole different story.

We have to plan ahead and run our entire workload accordingly.

What does this mean? Let’s start with a real-life example.

A while ago, when I served in the Israel Police, we needed a DR for the Israeli 911 emergency center, and the cloud wasn’t an option.

So we needed to build a new emergency center from scratch, in a different, physically isolated place, with all the required equipment (computers, phones, communication devices — you name it!). It may seem like a different use case that has nothing to do with a cloud DR, but the basic principles are exactly the same when you come to design a DRP for the cloud.

I wanted to understand how DR behaves on-prem, so I met with a Director of Storage Architecture to shed some light on it.

When designing a DRP on-prem, we have two main methods:

  • The first is to build a DR site within 300 meters of the main one, connected over Fibre Channel (FC) cables.
  • The second supports up to a 10 km radius, with a single cable running between the sites. (Some solutions support distances beyond 10 km, but with higher complexity and various downsides; that’s out of scope for this blog post.)

When talking about DR on-prem, we also need to choose:
Do we want a failover DR, which will be activated only when a disaster occurs?

Or do we want to utilize our DRP to be fast and resilient, and run on an active-active architecture?

Active-active means that we have both our main site and backup site working at the same time!

When using active-active, we want each site to be able to withstand all the traffic routed from the failed site when a disaster occurs. So each site can run at only 50% of its capacity at any given time, and when needed, it must handle 100% (the failed site’s traffic plus its own, all at once!). That’s a huge amount of underutilized resources!
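The 50% figure generalizes: with more active sites, the headroom each one must reserve shrinks. A small sketch of that math (site counts here are illustrative):

```python
def max_utilization_per_site(num_sites: int) -> float:
    """In an active-active setup that must survive one site failure,
    the surviving sites absorb the failed site's traffic, so each
    site can normally run at only (num_sites - 1) / num_sites of
    its capacity."""
    if num_sites < 2:
        raise ValueError("need at least two sites for failover")
    return (num_sites - 1) / num_sites

# Two sites: each may run at only 50%, so the survivor can absorb 100%
assert max_utilization_per_site(2) == 0.5
# Four sites: each may run at 75% — less waste, more coordination
assert max_utilization_per_site(4) == 0.75
```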

We have some serious trade-offs here, but is it really better on the cloud?

Different strategies and approaches for DR

So, when talking about an AWS DRP, we have a few strategies:

Backup and restore, pilot light, warm standby, and multi-site, ordered from the lowest cost with the longest RTO to the most expensive with the shortest RTO.

Let’s break them down:

Backup and restore:
This is the most basic one and pretty straightforward:

We take snapshots of our RDS at a regular interval (and of our EC2 or container images, depending on which compute services we use) and save them in our backup region.

Pros: It’s the simplest and cheapest one.

Cons: When we face a disaster, we will need to deploy all of our services from scratch and restore our RDS from a snapshot, which will result in a longer downtime.

Pilot light:
Similar to backup and restore, but here we keep our core functionality up and running on our backup region, so when we need to initialize a failover to the backup region, it will be faster.

Of course, as we said before, we get better RTO and the price goes up as well.

Warm standby:
Here we don’t only have our core functionality up and running on our backup region, but our entire scaled-down system is running on our backup region.

So when a disaster occurs, we just need to scale up our backup environment instead of deploying it from scratch!

Multi-site:
Here we have an active-active architecture, we have both our main region and the backup region running our fully scaled-up workload!
This method requires a different approach, and is harder to maintain; think about it, now you have twice the production to give you a headache!

But you get the ultimate RTO and RPO! You’re always live, and your users won’t be able to tell the difference if your main site is down — you will just need to have double the budget.

DR Strategies

How is a serverless DR different from traditional DR?

So we learned about DR on-prem and on AWS, but what about serverless?

Here’s where things get interesting.

The Trick: Pre-Deployment Without Paying for Idle
We can get the benefits of an active-active approach (like minimum RTO), but here’s the trick — we won’t pay for most of our backup resources as long as we don’t use them!

Keeping Environments in Sync with Stack Sets
So, we can deploy our services ahead of time, making sure they’re ready to serve traffic immediately when needed, but we won’t pay for the time they’re idle!

Now we can keep both of our environments up to date by deploying to both of them regularly at the same time with CloudFormation Stack Sets, which allows us to deploy a CloudFormation stack to multiple regions and even multiple accounts!
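As a hedged sketch of that Stack Sets flow (the stack set name, account ID, and regions below are hypothetical, and this assumes the stack set itself already exists), pushing stack instances to multiple regions with a boto3 CloudFormation client might look like:

```python
def deploy_to_dr_regions(cf_client, stack_set_name, account_id, regions):
    """Create stack instances of an existing CloudFormation stack set
    in several regions at once, keeping main and backup in sync.
    cf_client is a boto3 CloudFormation client (or any object exposing
    the same create_stack_instances method)."""
    response = cf_client.create_stack_instances(
        StackSetName=stack_set_name,
        Accounts=[account_id],
        Regions=regions,
    )
    # Stack set operations are asynchronous; the ID lets us poll status.
    return response["OperationId"]

# Example usage (hypothetical names):
# import boto3
# deploy_to_dr_regions(boto3.client("cloudformation"),
#                      "my-serverless-app", "123456789012",
#                      ["us-east-1", "eu-west-1"])
```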

Now all of our serverless components are deployed and ready for action, but we have many more resources to take care of, depending on how robust we want our solution to be.

Let’s tackle them one by one, starting with the most important one — our database (DB)!

Let’s Talk About Databases
Without a DB, we practically don’t have anything, so it’s one of the most crucial aspects (if not the most crucial one) to pay attention to when designing a DR.

Like we’ve seen before, we have many different approaches we can take, depending on the trade-off we want between the RTO and the cost.

High Availability DB Options
We can start with cross-region snapshots, move on to cross-region read replicas (promoting one to primary when a disaster occurs), and finally Aurora Global Database (or DynamoDB global tables if you are using a NoSQL DB).

Aurora Global Database ensures rapid recovery (an RTO of under 1 minute) with minimal data loss (an RPO of about 1 second), enabling robust business continuity.

But despite all that, there is a potential loss of up to 1 second of writes because of the asynchronous replication (if the DB itself is OK, as in a full-region outage, the data will be available on the original cluster as soon as it recovers), so it’s important to pay attention to this.
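As a hedged sketch (cluster names and ARN are hypothetical), attaching an existing Aurora cluster as the primary of a new global database with a boto3 RDS client might look like:

```python
def make_global(rds_client, global_id, source_cluster_arn):
    """Create an Aurora global database with an existing regional
    cluster as its primary; secondary regions can then be added
    and promoted during a failover. rds_client is a boto3 RDS
    client (or any object exposing create_global_cluster)."""
    resp = rds_client.create_global_cluster(
        GlobalClusterIdentifier=global_id,
        SourceDBClusterIdentifier=source_cluster_arn,
    )
    return resp["GlobalCluster"]["GlobalClusterIdentifier"]

# Example usage (hypothetical ARN):
# import boto3
# make_global(boto3.client("rds"), "my-global-db",
#             "arn:aws:rds:us-east-1:123456789012:cluster:main-cluster")
```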

Aurora Global DB sure is great! But even if you’re using it (or any other DB), it’s still extremely important to set up an immutable backup!

Immutable Backups: Your Last Line of Defense
You can do so by using AWS Backup, which supports immutability natively via Backup Vault Lock.
The purpose is to have a backup that no one can change or delete! So if, for example, a hacker gets into your system and encrypts or deletes your entire data set, you will still have your immutable copy to recover from. (It’s recommended to keep this copy in another region, or even another account!)
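A hedged sketch of locking a vault with a boto3 Backup client (vault name and retention values are hypothetical; after the grace period elapses, the lock itself becomes unchangeable):

```python
def lock_backup_vault(backup_client, vault_name, min_days=30, grace_days=3):
    """Apply a vault lock so recovery points in the vault cannot be
    deleted or have their retention shortened below min_days; once
    grace_days pass, the lock configuration itself is immutable.
    backup_client is a boto3 Backup client (or any object exposing
    put_backup_vault_lock_configuration)."""
    backup_client.put_backup_vault_lock_configuration(
        BackupVaultName=vault_name,
        MinRetentionDays=min_days,
        ChangeableForDays=grace_days,
    )

# Example usage (hypothetical vault name):
# import boto3
# lock_backup_vault(boto3.client("backup"), "dr-immutable-vault", min_days=35)
```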

Ok, back to our Serverless DRP.

How Do We Know the Region Is Down?
How does our system even know that our main environment is down? We can’t use any regional service (like an ELB) for this purpose, since it will be down as well if our entire region is experiencing an outage; so we have to use a global service: Route53.

We can set a failover routing policy with health checks and enable automatic failover to our backup region whenever our main environment becomes unavailable, ensuring a seamless failover mechanism (and even trigger a CloudWatch alarm to kick off other actions we need to take when our main site is down).
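A hedged sketch of one half of that setup (zone ID, record names, and health check ID are hypothetical), upserting a failover record with a boto3 Route53 client; the PRIMARY record carries the health check, and Route53 shifts traffic to the SECONDARY record when it fails:

```python
def upsert_failover_record(r53_client, zone_id, name, target_dns,
                           role, health_check_id=None):
    """UPSERT a CNAME record with a failover routing policy.
    role is "PRIMARY" or "SECONDARY"; the primary record should
    reference a health check so failover can trigger automatically.
    r53_client is a boto3 Route53 client (or a compatible object)."""
    record = {
        "Name": name,
        "Type": "CNAME",
        "TTL": 60,
        "SetIdentifier": f"{role.lower()}-record",
        "Failover": role,
        "ResourceRecords": [{"Value": target_dns}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    r53_client.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={"Changes": [{"Action": "UPSERT",
                                  "ResourceRecordSet": record}]},
    )
```

The same function would be called twice: once with `role="PRIMARY"` pointing at the main region and once with `role="SECONDARY"` pointing at the backup region.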

Other Key Services: S3 and CloudFront
Ok, but what about other services? Like S3 bucket or Cloudfront?

For S3 buckets, it’s pretty simple: we can set up cross-region replication to our backup region, and all new objects will be replicated to the backup bucket!
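A hedged sketch of enabling that replication with a boto3 S3 client (bucket names and the IAM role ARN are hypothetical; both buckets must have versioning enabled, and replication covers new objects only):

```python
def enable_crr(s3_client, source_bucket, dest_bucket_arn, role_arn):
    """Enable cross-region replication of all new objects in
    source_bucket to a versioned destination bucket in the backup
    region. s3_client is a boto3 S3 client (or a compatible object)."""
    s3_client.put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration={
            "Role": role_arn,  # IAM role S3 assumes to replicate
            "Rules": [{
                "ID": "dr-replication",
                "Status": "Enabled",
                "Prefix": "",  # empty prefix = replicate everything
                "Destination": {"Bucket": dest_bucket_arn},
            }],
        },
    )
```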

In a CloudFront distribution, we can set up an origin group with a failover origin, and whenever our primary origin becomes unavailable, requests will automatically be routed to the secondary one.
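A hedged sketch of the origin-group fragment of a CloudFront distribution config, built as a plain dict (origin IDs are hypothetical; the status codes listed are the failover criteria):

```python
def origin_failover_group(primary_id, secondary_id):
    """Build the OriginGroups section of a CloudFront distribution
    config: requests fall back to the secondary origin when the
    primary returns one of the listed error status codes."""
    return {
        "Quantity": 1,
        "Items": [{
            "Id": "dr-origin-group",
            "FailoverCriteria": {
                "StatusCodes": {"Quantity": 3, "Items": [500, 502, 503]},
            },
            "Members": {
                "Quantity": 2,
                "Items": [
                    {"OriginId": primary_id},    # main-region origin
                    {"OriginId": secondary_id},  # backup-region origin
                ],
            },
        }],
    }
```

The distribution’s cache behavior then targets the origin group’s ID instead of a single origin.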

But serverless is not just about saving money, it’s about high availability too!

Serverless = Built-in High Availability
When we are using traditional computing services, like EC2, we are bound to specific AZs, and even if just one AZ experiences an outage, and not the entire region, our system can still go down.

When using serverless, it’s no longer a concern!
Since we’re using Lambda functions together with managed services (like API Gateway integrated with SQS and SNS, a common pattern in serverless architectures), we get Multi-AZ resilience natively!

Conclusion

You’ve seen what DR is and how it’s implemented on-prem, different approaches to DR in the cloud, and finally how DR is taken to a whole other level with serverless!

DR is now easier than ever to implement, works automatically, and is cheaper too!

DR is one of the most important aspects of your workflow. It’s like insurance: you can have one for years and nothing will happen, but as soon as something bad happens, you don’t want to be caught without it.

So whether you choose backup and restore, pilot light, warm standby, or a multi-site, it doesn’t matter — as long as you make sure to implement one!

In the best case, a DR will add to your monthly bill, and you won’t see it delivering value most of the time. In the worst case, it can save your company’s time, money, and reputation when a disaster occurs.

Orel Bello
DevOps Platform Engineer @ Melio | AWS Community Builder
Passionate about scaling DevOps with simplicity and impact.
