Kazuya

Posted on Dec 6, 2025 • Edited on Dec 8, 2025

AWS re:Invent 2025 - Disaster Recovery (DR) with AWS Elastic Disaster Recovery Service (COP356)

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.

Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!

Overview

📖 AWS re:Invent 2025 - Disaster Recovery (DR) with AWS Elastic Disaster Recovery Service (COP356)

In this video, Yanzhu Ji, Senior Product Manager at AWS, presents AWS Elastic Disaster Recovery Service (AWS DRS). She explains how downtime costs Fortune 1000 companies $500K-$1M annually, with 76% experiencing outages in two years. The session covers RPO and RTO metrics, comparing backup, disaster recovery, and high availability strategies. AWS DRS uses agent-based replication to achieve seconds-level RPO and minutes-level RTO at moderate cost. The service continuously replicates block-level data to AWS staging areas, supports failover and failback processes, and offers point-in-time recovery for ransomware protection. Advanced features include post-launch actions and Systems Manager integration for automated validation. Pricing is $20/month per server plus staging and temporary failover costs.

This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

The Business Case for Disaster Recovery: Understanding Costs, Metrics, and Strategy Selection

Hi, everyone. My name is Yanzhu Ji, and I'm so grateful that you still decided to come here after a long day. I'm a Senior Product Manager from AWS Elastic Disaster Recovery Service. Today, I'm thrilled to talk to you about something every organization cares deeply about but often underestimates until it's too late: disaster recovery. I will show you how AWS makes disaster recovery simple, cost-effective, and resilient with AWS Elastic Disaster Recovery Service, or as we call it, AWS DRS.

This is today's agenda. I will start by talking about the importance of disaster recovery with real-life examples, then discuss how to choose the right disaster recovery solution based on your use case, and explain how AWS Elastic Disaster Recovery works. More importantly, we will go beyond the basics to talk about advanced features.

Let's look at some compelling business cases for resilience, starting with the real cost of downtime, and then we'll discuss why acting now makes more sense than ever before. Let's start with a concrete example. A global community platform lost 8.2 million dollars in revenue due to an outage. Note that this is only the direct revenue loss, not counting the reputational damage and loss of customer trust. Nor is this an isolated case. A leading U.S. airline lost 150 million dollars in profit due to an outage.

These are two examples, but we have more data. According to an IDC report, application downtime can cost Fortune 1000 companies 500 thousand to 1 million dollars per year. Additionally, 76 percent of companies have had an outage in the past two years. That's 76 percent, well over half. Now let's look at why acting now makes business sense. Customers who have realized how important a resilience strategy is, and have implemented one in their systems, have already observed a 12 to 20 percent increase in profit margins. Having a disaster recovery solution isn't just about avoiding losses; it's also about gaining a competitive advantage, because it protects your bottom line.

The question is not whether you should have disaster recovery, but whether you can afford not to. Before we dive into the disaster recovery solution, we need to establish two critical metrics that help you decide what strategy you need: RPO and RTO. The RPO, or Recovery Point Objective, answers the question: how much data can you afford to lose? Can you lose data from the past hour? Can you lose data from the past day? It's measured in time. RTO, or Recovery Time Objective, asks how quickly you want your service or application back to work and what the impact of every minute or every hour is to your business.

Given the numbers we saw earlier, 500 thousand to 1 million dollars in downtime losses for Fortune 1000 companies, every minute counts. Now that we know what RPO and RTO are, we can use these metrics to choose a strategy. We will start with the fundamental resilience strategy: backup. This means you make a copy of your data and can restore it if any loss or corruption happens. For backup, the RPO can be hours, and the RTO can also be hours, and it is the lowest-cost approach among the three strategies.

Of these three strategies, the middle layer is what we'll focus on today: disaster recovery. Disaster recovery is about returning to operations, not just restoring your data. You can set specific recovery targets for when a high-impact application experiences a failure. The metrics improve significantly: the RPO drops to seconds, the RTO can be minutes, and the cost is a moderate investment.

Finally, we have high availability. This is the premium tier and provides resistance to common failures through design and operational mechanisms. The RPO is near real-time and RTO is also near real-time. This represents a significant investment for your most critical applications that you want to protect.
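The trade-off between the three strategies can be sketched as a small lookup: given the RPO and RTO a workload can tolerate, pick the cheapest tier that still meets both. This is only an illustration. The session gives orders of magnitude (hours, seconds to minutes, near real-time), so the numeric thresholds and the `relative_cost` ranking below are assumptions chosen to match that ordering, not figures from AWS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    rpo_seconds: float   # data-loss window the tier can typically achieve
    rto_seconds: float   # time-to-recover the tier can typically achieve
    relative_cost: int   # 1 = cheapest tier, 3 = most expensive tier


# Assumed, illustrative values that only preserve the ordering from the talk.
STRATEGIES = [
    Strategy("backup", rpo_seconds=3600, rto_seconds=3600, relative_cost=1),
    Strategy("disaster recovery", rpo_seconds=10, rto_seconds=600, relative_cost=2),
    Strategy("high availability", rpo_seconds=1, rto_seconds=1, relative_cost=3),
]


def cheapest_strategy(rpo_target_seconds, rto_target_seconds):
    """Return the lowest-cost strategy meeting both targets, or None."""
    candidates = [
        s for s in STRATEGIES
        if s.rpo_seconds <= rpo_target_seconds
        and s.rto_seconds <= rto_target_seconds
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.relative_cost)


if __name__ == "__main__":
    # A workload that tolerates one minute of data loss and 30 minutes of downtime.
    choice = cheapest_strategy(rpo_target_seconds=60, rto_target_seconds=1800)
    print(choice.name if choice else "no tier meets these targets")
```

Filtering on both targets before minimizing cost mirrors the order of the argument in the talk: the metrics decide which tiers qualify, and cost decides among them.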

AWS Elastic Disaster Recovery Service: Architecture, Implementation, and Advanced Automation Features

Now that we know the strategies, let's look at the disaster recovery solution AWS offers. AWS DRS is an agent-based replication service. Being agent-based enables high performance because, unlike a snapshot-based model, the agent continuously replicates your data to protect it. That's why it can reduce the RPO to seconds.

The way you get started with AWS DRS is by installing a lightweight AWS replication agent on your source server. Your source server can be an EC2 instance on AWS, an on-premises server, a VMware or Hyper-V virtual machine, or even an instance hosted in another cloud. Once this agent is installed, it runs in the background and continuously replicates the block-level data of your source server to a lightweight staging area within AWS. You can specify which region you want it to replicate the data to. This way, when a disaster strikes, your data on AWS is only seconds behind your primary site.

After you finish the setup, AWS DRS allows you to run drills or tests to see if your DR is actually working. You don't want to wait until a disaster happens to test it. You can easily test on AWS. When a disaster event happens, such as an outage, cyberattack, or hardware failure, you can quickly launch a recovery instance in AWS. This recovery process is called a failover. The recovery instance is a fully provisioned EC2 instance on AWS that maps to your source server and is linked to your replicated data. The server applications or databases that were running on your primary site can now run on AWS and be ready to serve your traffic.
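For readers who script their drills instead of using the console, the sketch below builds the parameters for a drill or a real failover. The parameter names `isDrill`, `sourceServers`, and `sourceServerID` follow the shape of the `start_recovery` call on the boto3 `drs` client as I understand it; treat them as an assumption and check the current API reference before relying on them. The server ID is hypothetical, and the actual AWS call is left commented out so the example runs without credentials.

```python
def build_recovery_request(source_server_ids, is_drill=True):
    """Build keyword arguments for a drill (default) or a real failover.

    The key names are assumed to match the boto3 `drs` client's
    `start_recovery` parameters; verify them against the API reference.
    """
    if not source_server_ids:
        raise ValueError("at least one source server ID is required")
    return {
        "isDrill": is_drill,
        "sourceServers": [{"sourceServerID": sid} for sid in source_server_ids],
    }


if __name__ == "__main__":
    # Hypothetical source server ID, for illustration only.
    request = build_recovery_request(["s-1111aaaa2222bbbb3"])
    print(request)
    # With AWS credentials configured, the request could be sent like this:
    # import boto3
    # boto3.client("drs").start_recovery(**request)
```

Defaulting `is_drill` to `True` is a deliberate choice in this sketch: an accidental run launches a test, not a production failover.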

When the disaster is resolved on your primary site, that's when you want to return to that site from AWS. We call that process failback. You can fail back to your primary site anytime you're ready. Just don't forget to terminate the recovery instance on AWS to avoid further charges.

There are many benefits to using the AWS DRS service. Besides the RPO and RTO we discussed, it also has lower cost because the replication isn't one-to-one. We use a very low-cost staging area in AWS to replicate your data continuously. There's no third-party licensing fee or other fees because we manage this whole process for you. It also avoids manual setup because it's highly automated.

When we convert your on-premises server to AWS, we also copy the configuration from your server. For example, the service converts the disks, launches the correct instance type, and creates your recovery network. Additionally, it supports point-in-time snapshots, which means you can fail over to the most recent data set, or you can choose a specific point in time you want to recover to. In the case of a ransomware attack, earlier points in time, from before the attack, can be used as your recovery points.
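The point-in-time choice described above comes down to a simple rule: normally take the newest snapshot, but after a ransomware attack take the newest snapshot from before the compromise. The sketch below shows only that selection logic. The function name and the plain list of timestamps are illustrative assumptions and do not represent how AWS DRS stores or exposes its recovery points.

```python
from datetime import datetime, timezone


def pick_recovery_point(snapshot_times, compromised_at=None):
    """Return the newest usable snapshot time, or None if there is none.

    snapshot_times: iterable of timezone-aware datetimes.
    compromised_at: when the attack began, if known. Snapshots taken at or
    after this moment are treated as potentially infected and skipped.
    """
    if compromised_at is None:
        candidates = list(snapshot_times)
    else:
        candidates = [t for t in snapshot_times if t < compromised_at]
    return max(candidates) if candidates else None


if __name__ == "__main__":
    utc = timezone.utc
    snapshots = [datetime(2025, 12, 1, hour, tzinfo=utc) for hour in (8, 9, 10)]
    attack_started = datetime(2025, 12, 1, 9, 30, tzinfo=utc)
    print(pick_recovery_point(snapshots))                  # normal failover
    print(pick_recovery_point(snapshots, attack_started))  # ransomware case
```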

It is very easy to test. You can run a drill with one click on the DRS console. It also supports network settings. If your source server is in AWS, you can easily copy your source network infrastructure, including your security groups, and capture any port or route changes. Data banking means configuring DRS with a service control policy and a dedicated account, so that when a cyberattack happens or someone actively tries to delete your data, you have a clean environment in which the data is isolated from the attack.

Now, let's look at some architectural details on how DR works and how you can apply this to your application. There are four DR patterns. All four patterns have the same destination, which is the AWS cloud. The source server can be on-premises or in another cloud, and if you are already on AWS, you can still set up DR, either from one region to another or from one availability zone to another.

We will dive into the details of recovery from on-premises or from other clouds to AWS. This is the architecture. On the left side, you can see the servers on your own premises. You install the AWS replication agent on each of these servers. Once installed, the agent runs automatically and continuously, replicating your data to a lightweight staging area on AWS. One replication server can serve up to 15 servers, and the replication runs in parallel, which keeps its performance high.
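Taking the figure quoted in the session at face value (one replication server per up to 15 source servers), a rough lower bound on the number of replication servers in the staging area is a ceiling division. This is a back-of-the-envelope sketch only. The actual allocation is managed by the service and may depend on factors not covered in the talk.

```python
# Capacity figure as stated in the session; treat it as an assumption.
SOURCE_SERVERS_PER_REPLICATION_SERVER = 15


def min_replication_servers(source_server_count):
    """Smallest number of replication servers that covers all source servers."""
    if source_server_count < 0:
        raise ValueError("source_server_count must be non-negative")
    # Ceiling division without floating point.
    return -(-source_server_count // SOURCE_SERVERS_PER_REPLICATION_SERVER)


if __name__ == "__main__":
    for count in (1, 15, 16, 100):
        print(count, "source servers ->", min_replication_servers(count))
```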

When a disaster happens, such as a cyberattack or hardware failure, a recovery area is set up, and each of your servers is spun up as a mapped AWS EC2 instance with EBS volumes attached. The replication server does a few things. One is that it sends the replication status to AWS DRS so you can easily check the state of the current replication. DRS can automatically create and terminate this replication, so you have full control of the process from the console, and you can gather health updates and other status in one place.

Once the recovery instance is set up, your application is running successfully, and you have solved the problem at your primary site, you can launch a failback, and the agent control protocol will then allow traffic to come back to your primary site.

When you have disaster recovery set up on AWS from region to region, the workflow is very similar, but the data replication can leverage native AWS services, including S3 cross-region replication. This makes data transfer easy and efficient. The transfer is more performant because it uses the AWS backbone with minimal latency. Because both the source server and the target server are on AWS, the workflow is otherwise the same, just with better performance.

You might be curious at this point about how we charge for this service when it's fully managed and automated for your disaster recovery use case. The charges have three components. For the service itself, we charge a flat rate of $20 per month per server. For the staging area, which continuously receives the replicated data from your source server, we charge for the minimal EC2 and EBS usage it requires, which depends on your data volume and the EC2 instance type used. When there is a drill or disaster, there is a temporary failover cost, which is the EC2 instance mapped to your source server. This charge is temporary, and it stops once you have recovered from the disaster and terminated the instance.
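The three cost components can be combined into a rough monthly estimate. Only the $20 per server flat rate comes from the session. The staging cost per server and the failover hourly rate are inputs you would need to supply from your own EC2 and EBS pricing, and the values used in the example run are placeholders, not AWS prices.

```python
SERVICE_FEE_PER_SERVER_MONTHLY = 20.0  # flat rate quoted in the session (USD)


def estimate_monthly_cost(servers, staging_cost_per_server,
                          failover_hours=0.0, failover_hourly_rate=0.0):
    """Break a month's DRS bill into its three components.

    staging_cost_per_server: assumed monthly EC2 + EBS staging cost per server.
    failover_hours / failover_hourly_rate: assumed hours of drill or failover
    per server, and the assumed hourly price of each recovery instance.
    """
    if servers < 0:
        raise ValueError("servers must be non-negative")
    service = SERVICE_FEE_PER_SERVER_MONTHLY * servers
    staging = staging_cost_per_server * servers
    failover = servers * failover_hours * failover_hourly_rate
    return {
        "service": service,
        "staging": staging,
        "failover": failover,
        "total": service + staging + failover,
    }


if __name__ == "__main__":
    # Placeholder inputs: 10 servers, $15 staging each, one 8-hour drill.
    print(estimate_monthly_cost(10, 15.0, failover_hours=8,
                                failover_hourly_rate=0.10))
```

Keeping the components separate in the result makes it easy to see that the failover term is zero in months without a drill or disaster, which is the point made in the talk about that charge being temporary.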

Now that we know all the core functions and offerings of AWS Elastic Disaster Recovery, I will introduce an advanced feature that can simplify your workflow. It's called post-launch actions. Post-launch actions are a framework that automatically executes your predefined or custom actions when recovery instances are launched. It enables automatic validation, configuration, or testing tasks that you define in a script. Previously, you might have needed to do a lot of manual work once the instance was launched, but now that work can be fully handled by these predefined actions. We also offer Systems Manager integration. This feature leverages Systems Manager documents, so you can run commands and automation scripts on the recovery instance. Your administrative team can create actions and tasks like connectivity checks and permission validation. The framework reduces the complexity and human error in the entire disaster recovery process.
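As an illustration of the kind of validation script such an action might run on a recovery instance, the sketch below performs a TCP connectivity check and aggregates named checks into a single pass or fail result. It is a generic, standard-library example. The function names, hosts, and ports are hypothetical, and nothing here is the DRS or Systems Manager API.

```python
import socket


def port_is_reachable(host, port, timeout_seconds=3.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        return False


def run_checks(checks):
    """Run named checks and report each result plus an overall verdict.

    checks: mapping of check name -> zero-argument callable returning a bool.
    """
    results = {name: bool(check()) for name, check in checks.items()}
    return results, all(results.values())


if __name__ == "__main__":
    # Hypothetical targets: a local database port and a local web server.
    results, ok = run_checks({
        "database reachable": lambda: port_is_reachable("127.0.0.1", 5432),
        "web server reachable": lambda: port_is_reachable("127.0.0.1", 80),
    })
    for name, passed in results.items():
        print(f"{name}: {'PASS' if passed else 'FAIL'}")
    # A non-zero exit code lets the calling automation treat the run as failed.
    raise SystemExit(0 if ok else 1)
```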

If you want more information or step-by-step instructions on how to set up AWS Elastic Disaster Recovery, there are some resources available. Feel free to scan the QR code. We have details about how to do cross-region disaster recovery, how to do cross-account disaster recovery, and also information for different source servers, like other cloud providers or on-premises environments. We have all of this available. Thank you for attending this session, and we would appreciate it if you could fill out the session survey in the mobile app. Thank you.

This article is entirely auto-generated using Amazon Bedrock.
