DEV Community

Swapnil Pandey

AWS EDR step by step (on premise to AWS and back)

Recently I was asked to run a POC and training on AWS's disaster recovery service, AWS Elastic Disaster Recovery (AWS EDR), as part of our work as an AWS partner. During the POC and testing I realized that although AWS provides a good amount of documentation, a few points are not clearly explained. I put together the procedure below to make the process simpler for customers to understand and follow. If you are planning any other setup, consider consulting AWS or an AWS partner, as a planned deployment with their help can bring architectural and budgetary benefits compared with trial and error. Feel free to contact me too if you need any clarification or assistance.

In our scenario we picked a VM running on VMware, tested a failover to AWS and a failback to on premise, and captured the steps and configuration details. We used steps you are likely to need in a real setup and added explanations where necessary.

First things first, assumptions and pre-requisites:

  1. You know how to set up site-to-site VPN tunnels and have already configured connectivity
  2. You understand security groups and routes, and have configured rules and routes between on premise and the AWS VPC (in my case I had assistance from my internal security team)
  3. Test connectivity between on premise and the VPC by creating a test machine
  4. Create a key pair that you can use to log in to the EC2 instance after failover (this is not clearly mentioned in the documentation)
  5. Create a VPC and mark the Staging and Target subnets (if you don't know what staging and target subnets are, we discuss them briefly below, and the architecture diagrams give more detail)
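
The key pair from the prerequisites can also be created from the command line. The sketch below is only an illustration and assumes the AWS CLI is installed and configured; the function name, the key name and the Region in the example are placeholders, not values from this setup.

```shell
#!/usr/bin/env bash
# Sketch: create an EC2 key pair and keep the private key in a local .pem file.
# Assumes the AWS CLI is installed and has credentials for the target account.

create_failover_key() {
  local key_name="$1" region="$2"
  local key_file="${key_name}.pem"
  # create-key-pair returns the private key only once, in the KeyMaterial field.
  aws ec2 create-key-pair \
    --region "$region" \
    --key-name "$key_name" \
    --query 'KeyMaterial' \
    --output text > "$key_file" || return 1
  # Restrict permissions so SSH clients accept the key file.
  chmod 400 "$key_file"
  echo "$key_file"
}

# Example (hypothetical names):
#   create_failover_key drs-failover-key us-east-1
```

The same key pair name is what you later select in the launch settings, so that the recovered instance can be accessed after failover.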

First let's do some copy-paste :) I see many bloggers do that to make an article look lengthy; I am only including the basic information, with less text, and will get to the actual steps right after.

What is AWS EDR (AWS Elastic Disaster Recovery, abbreviated AWS DRS in the AWS documentation)

AWS Elastic Disaster Recovery (AWS DRS) minimizes downtime and data loss with fast, reliable recovery of on-premises and cloud-based applications using affordable storage, minimal compute, and point-in-time recovery.

You can increase IT resilience when you use AWS Elastic Disaster Recovery to replicate on-premises or cloud-based applications running on supported operating systems. Use the AWS Management Console to configure replication and launch settings, monitor data replication, and launch instances for drills or recovery.

Set up AWS Elastic Disaster Recovery on your source servers to initiate secure data replication. Your data is replicated to a staging area subnet in your AWS account, in the AWS Region you select. The staging area design reduces costs by using affordable storage and minimal compute resources to maintain ongoing replication.

You can perform non-disruptive tests to confirm that implementation is complete. During normal operation, maintain readiness by monitoring replication and periodically performing non-disruptive recovery and failback drills. AWS Elastic Disaster Recovery automatically converts your servers to boot and run natively on AWS when you launch instances for drills or recovery. If you need to recover applications, you can launch recovery instances on AWS within minutes, using the most up-to-date server state or a previous point in time. After your applications are running on AWS, you can choose to keep them there, or you can initiate data replication back to your primary site when the issue is resolved. You can fail back to your primary site whenever you’re ready.

How it works

Image description

Now, without wasting any time, let's get to our setup

Architectural explanation

Below is a diagram of the connectivity when replicating from on premise to AWS

Image description

Components as per diagram:

  1. On-premise datacenter, which hosts the original source server (I say "original" because after failover the AWS EC2 instance becomes the source server)
  2. Site-to-site VPN or Direct Connect connectivity to send data to AWS
  3. AWS regional API endpoints (which can be found here https://docs.aws.amazon.com/general/latest/gr/rande.html)
  4. AWS staging subnet, where the replication server is created and where the data disks and snapshots are kept

I have seen two options, each with its own architecture. The second option is not specifically discussed in any documentation I found, and it is the one I demonstrate and explain here.

There is a slight difference between Option 1 and Option 2, which I will explain once we have seen both options:

Option 1

Below is the architecture diagram for Option 1. Here we have VPN connectivity, and the AWS regional endpoints are reached through AWS PrivateLink: regional interface endpoints are created in the VPC, and some of them incur extra cost. Neither the source server nor the replication server has internet access in this case. More details are at the URL below. Since the topic is covered well there and I have tested it to be working, I have not covered it here; please let me know if you need a demonstration or documentation for this option too.

https://aws.amazon.com/blogs/storage/cross-region-aws-elastic-disaster-recovery-agent-installation-in-a-secured-network/

Image description

Option 2

The scenario demonstrated here uses this option. Points to note:

  1. We enable a public IP on the replication server and keep the replication server in an internet-facing (public) subnet
  2. Source servers need internet access on port 443 to reach the DRS service endpoint in AWS. The replication server also reaches the AWS regional endpoints through its public IP, but the replication data itself is transferred over the private network, using the site-to-site VPN. With support from the internal security team, outbound access can be opened for the required endpoints only
  3. The staging subnet is a public subnet, and the replication server in the staging area uses a public IP to connect to the AWS endpoints over the internet (this is one of the tweaks we used). Data transfer still happens over the site-to-site VPN or Direct Connect; only the connectivity to the AWS endpoints goes over the internet, to simplify things. This is explained in the steps below, so don't worry if it is not clear yet

More details on the network configuration are in this link, for reference:

https://docs.aws.amazon.com/drs/latest/userguide/Network-Requirements.html
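
The two network paths described for Option 2 (port 443 to the regional DRS endpoint, port 1500 to the replication server) can be checked from a Linux source server before installing anything. This is a minimal sketch, assuming bash and the coreutils `timeout` command are available; the function names and the private IP in the example are placeholders, and the port 1500 check only makes sense once a replication server exists in the staging subnet.

```shell
#!/usr/bin/env bash
# Sketch: check the two network paths that Option 2 relies on.

# Regional DRS API endpoint hostname (pattern: drs.REGION.amazonaws.com).
drs_endpoint() {
  echo "drs.$1.amazonaws.com"
}

# Return 0 if a TCP connection to host:port succeeds within the timeout.
check_tcp() {
  local host="$1" port="$2" secs="${3:-5}"
  timeout "$secs" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Example, run from the source server (the private IP is a placeholder):
#   check_tcp "$(drs_endpoint us-east-1)" 443 && echo "DRS endpoint reachable"
#   check_tcp 10.0.1.25 1500 && echo "replication server reachable over VPN"
```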

Below is the reference diagram for the architecture

Image description

Having seen the architecture, let's get into the configuration

AWS Service initialization, Replication settings

Let's start by initializing the AWS EDR service in the Region needed for replication (this is the same Region where we created the staging and private subnets)

Image description

Edit the default replication settings and select the staging area subnet (this is the public subnet where the replication server is created and where the disks and snapshots will be kept). Traffic from this subnet is outbound only. During service initialization a security group is created that allows only port 443 outbound and port 1500 inbound (we will edit the inbound rule to allow the on-premise subnet over the private connection). Select a "Replication server instance type" or leave the default, "t3.small", which can replicate 15 source servers at a time. The other options can be left at their defaults

Image description

The second screen is where the tweak comes into the picture. Here we select the "Always use AWS EDR security group" option; we will edit that security group to allow inbound traffic on port 1500 from on premise. The second setting to note is "use private IP for data replication", which we enable together with "create public IP".
A secret tip (again, not documented anywhere ;)):
If you don't have a site-to-site VPN and firewall and want to use the internet instead, select "create public IP" and disable the private IP option. Data is then replicated over the internet (EDR encrypts replication data by default). Finally, change the number of days to set snapshot retention (snapshots are typically taken at 10-minute intervals, and this setting controls for how many days point-in-time snapshots are kept)

Image description

Do any MAP tagging, if needed. MAP (the AWS Migration Acceleration Program) can accelerate migrations and deployments with a partner's assistance.

Image description

A few more important settings to note. If you set instance right-sizing to inactive, you will need to make one more change later in the launch template (I will highlight it once we are there). The other options are self-explanatory

Image description

And here we are at the "launch template". Here we select our target subnet and our security group. This can be a private subnet, with a security group that suits our needs. The instance type selected here completes the earlier change of disabling "right sizing"

Image description

The advanced options are not required, but a few are good to set: enable "public IP" if the target subnet is a public subnet and you want to access the application over the internet after failover, and select a key pair so you can log in to the EC2 instance after failover, since the VM will be converted from the on-premise VMDK format to AWS EC2 format

Image description

Confirm that this launch template version is also set as the default one

Image description

A view of security group created as part of the deployment

Image description

Above, the ICMP rule is added just to allow ping, and the last two rules are added to allow traffic across the VPN for data transfer
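
The inbound rule for port 1500 from on premise can also be added with the AWS CLI instead of the console. A minimal sketch, assuming the AWS CLI is configured; the function name, security group ID and CIDR below are placeholders, so substitute the group created during service initialization and your own on-premise subnet.

```shell
#!/usr/bin/env bash
# Sketch: allow replication traffic (TCP 1500) from the on-premise subnet
# into the security group used by the replication server.

allow_replication_from_onprem() {
  local sg_id="$1" onprem_cidr="$2"
  aws ec2 authorize-security-group-ingress \
    --group-id "$sg_id" \
    --protocol tcp \
    --port 1500 \
    --cidr "$onprem_cidr"
}

# Example (placeholder values):
#   allow_replication_from_onprem sg-0123456789abcdef0 192.168.10.0/24
```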

Image description

Create IAM user with appropriate role

Image description

Image description

Create access key

Go to user in IAM and select "Security Credentials" option

Image description

Select option, "Application running outside AWS"

Image description

Click save and download the key, as we will use the same key when installing the agent software on the source VM

Image description

We will create a failback user with the same procedure, attaching the failback access policy instead.
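
For reference, the same two users can be created with the AWS CLI. This is a sketch under assumptions: the screenshots do not survive here, so the policy names below are the AWS managed policies I would expect for agent installation and failback, not values confirmed by the steps above, and the user names are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: create an IAM user, attach one AWS managed policy, and create
# an access key for it. Assumes the AWS CLI is configured with IAM rights.

create_drs_user() {
  local user_name="$1" policy_name="$2"
  aws iam create-user --user-name "$user_name" || return 1
  aws iam attach-user-policy \
    --user-name "$user_name" \
    --policy-arn "arn:aws:iam::aws:policy/${policy_name}" || return 1
  # The secret access key is shown only once in this output; store it safely.
  aws iam create-access-key --user-name "$user_name"
}

# Example (user names are placeholders, policy names are assumptions):
#   create_drs_user drs-agent-installer AWSElasticDisasterRecoveryAgentInstallationPolicy
#   create_drs_user drs-failback AWSElasticDisasterRecoveryFailbackInstallationPolicy
```

Keeping the agent installation key and the failback key on separate users limits what each key can do if it leaks.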

Install software

Windows Installation

To install the AWS Replication Agent on a Windows source server, first ensure that your source server meets all the requirements listed in the Supported Operating Systems documentation.

https://docs.aws.amazon.com/drs/latest/userguide/Supported-Operating-Systems.html

Before installing the AWS Replication Agent, download the installer, AWSReplicationWindowsInstaller.exe. Copy or distribute the downloaded agent installer to each Windows source server that you want to add to AWS Elastic Disaster Recovery.

The agent installer download location follows this format:

https://aws-elastic-disaster-recovery-REGION.s3.REGION.amazonaws.com/latest/windows/AwsReplicationWindowsInstaller.exe

Note: Replace REGION with the AWS Region into which you are replicating.

  1. Run the agent installer file AWSReplicationWindowsInstaller.exe as an Administrator.

The installer will confirm that the installation of the AWS Replication Agent has started.

The installation of the AWS Replication Agent has started.

  2. The installer will prompt you to enter your AWS Region Name, the AWS Access Key ID and the AWS Secret Access Key that you previously generated. Enter the complete AWS Region name (for example: eu-central-1), and the full AWS Access Key ID and AWS Secret Access Key.

The installation of the AWS Replication Agent has started.
AWS Region name: us-east-1
AWS Access Key ID: AKIAI0SF0DNN71EXAMPLE
AWS Secret Access Key: wJalrXUtnFEMI/K71MDENG/bPxRfiCYEXAMPLEKEY

  3. Once you have entered your credentials, the installer will verify that the source server has enough free disk space for Agent installation and identify volumes for replication. The installer will display the identified disks and prompt you to choose the disks you want to replicate.

...
AWS Secret Access Key: wJalrXUtnFEMI/K71MDENG/bPxRfiCYEXAMPLEKEY
Verifying that the source server has enough free disk space to install the AWS Replication Agent.
(a minimum of 2GB of free disk space is required)
Identifying volumes for replication.
Choose the disks you want to replicate. Your disks are: c:
To replicate some of the disks, type the path of the disks, separated with a comma (for example, C:,D:).
To replicate all disks, press Enter:

To replicate some of the disks, type the path of the disks, separated by a comma, as illustrated in the installer (for example: C:,D:). To replicate all of the disks, press Enter. The installer will identify the selected disks and print their size.

...
Identifying volumes for replication.
Choose the disks you want to replicate. Your disks are: c:
To replicate some of the disks, type the path of the disks, separated with a comma (for example, C:,D:).
To replicate all disks, press Enter:
Disk to replicate identified: c:0 of size 30GiB

The installer will confirm that all of the disks were successfully identified.

...
Identifying volumes for replication.
Choose the disks you want to replicate. Your disks are: c:
To replicate some of the disks, type the path of the disks, separated with a comma (for example, C:,D:).
To replicate all disks, press Enter:
Disk to replicate identified: c:0 of size 30GiB
All volumes for replication were successfully identified

  4. After all of the disks that will be replicated have been successfully identified, the installer will download and install the AWS Replication Agent on the source server.

...
All volumes for replication were successfully identified
Downloading the AWS Replication Agent onto the source server... Finished
Installing the AWS Replication Agent onto the source server... Finished

  5. Once the AWS Replication Agent is installed, the server will be added to the Elastic Disaster Recovery Console and will undergo the initial sync process. The installer will provide you with the source server's ID.

...
All volumes for replication were successfully identified
Downloading the AWS Replication Agent onto the source server... Finished
Installing the AWS Replication Agent onto the source server... Finished
Syncing the source server with the Elastic Disaster Recovery Console... Finished
The following is the source server ID: s-3146f90b19example
The AWS Replication Agent was successfully installed.
Press Enter to close...

You can review this process in real time on the Source servers page.

Image description

Linux installation steps:

A tip for Linux: download the installer on your laptop, copy it to the Linux machine and run it there. This lets you skip the wget command, which sometimes fails.

Before installing, please ensure that you are aware of the following:

You need root privileges to run the Agent installer file on a Linux server. Alternatively, you can run the Agent Installer file with sudo permissions.

The Linux installer creates the "aws-replication" group and "aws-replication" user within that group. The Agent will run within the context of the newly created user. Agent installation will attempt to add the user to "sudoers". Installation will fail if the Agent is unable to add the newly created "aws-replication" user to "sudoers".

  1. Download the agent installer aws-replication-installer-init onto your Linux source server.

The Agent installer download location follows this format:

https://aws-elastic-disaster-recovery-REGION.s3.REGION.amazonaws.com/latest/linux/aws-replication-installer-init

Depending on which tool is available on the OS, use one of the commands below (the examples use us-east-1). Make sure the source server has the latest patches and packages installed, so that any required updates or dependency packages are in place beforehand.

wget -O ./aws-replication-installer-init https://aws-elastic-disaster-recovery-us-east-1.s3.us-east-1.amazonaws.com/latest/linux/aws-replication-installer-init

curl -o aws-replication-installer-init https://aws-elastic-disaster-recovery-us-east-1.s3.us-east-1.amazonaws.com/latest/linux/aws-replication-installer-init
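
The download locations used in this article (Linux installer, Windows installer, failback ISO) all follow one pattern, with the Region appearing twice in the hostname. The helper below is a small sketch that builds the URL for a given Region; the function name and the artifact keywords are my own, only the URL paths come from the links shown in this article.

```shell
#!/usr/bin/env bash
# Sketch: build the download URL for a DRS artifact in a given Region.

drs_artifact_url() {
  local region="$1" artifact="$2" path
  case "$artifact" in
    linux)    path="latest/linux/aws-replication-installer-init" ;;
    windows)  path="latest/windows/AwsReplicationWindowsInstaller.exe" ;;
    failback) path="latest/failback_livecd/aws-failback-livecd-64bit.iso" ;;
    *) echo "unknown artifact: $artifact" >&2; return 1 ;;
  esac
  echo "https://aws-elastic-disaster-recovery-${region}.s3.${region}.amazonaws.com/${path}"
}

# Example: download the Linux installer for eu-central-1
#   curl -o aws-replication-installer-init "$(drs_artifact_url eu-central-1 linux)"
```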

  2. Use the following command on your source server to make the installer executable and run it.

chmod +x aws-replication-installer-init; sudo ./aws-replication-installer-init

  3. The installer will prompt you to enter your AWS Region Name, the AWS Access Key ID and AWS Secret Access Key that you previously generated. Enter the complete AWS Region name (for example, eu-central-1), the full AWS Access Key ID and the full AWS Secret Access Key.

$ chmod +x aws-replication-installer-init; sudo ./aws-replication-installer-init
The installation of the AWS Replication Agent has started.
AWS Region name: us-east-1
AWS Access Key ID: AKIAI0SF0DNN71EXAMPLE
AWS Secret Access Key: wJalrXUtnFEMI/K71MDENG/bPxRfiCYEXAMPLEKEY

Failover from On-premise to AWS

Once replication is in sync, the server status shows as healthy and the pending action shows as "Initiate drill"

Image description

Select the instance to fail over under “source server”, choose “initiate recovery job” and initiate recovery (for testing you would run a drill, but since we are doing a failover we select recovery directly). A conversion server is launched and then deleted; its purpose is to convert the disks from the source format (VMDK/VMware) to AWS format and to install any packages needed on the EC2 instance after launch. See the details in the job: the process runs automatically, and the instance is launched in the AWS target subnet with the same data as on premise
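
The same recovery job can be started from the AWS CLI. This is a minimal sketch, assuming the AWS CLI is configured for the Region where the service was initialized; the function names are my own and the source server ID in the example is the sample ID printed by the installer, so replace it with yours.

```shell
#!/usr/bin/env bash
# Sketch: check replication state and start a recovery (or drill) job.

# Print the data replication state of one source server.
replication_state() {
  aws drs describe-source-servers \
    --filters "sourceServerIDs=$1" \
    --query 'items[0].dataReplicationInfo.dataReplicationState' \
    --output text
}

# Start a recovery job; pass "drill" as the second argument for a drill.
start_failover() {
  local server_id="$1" mode="${2:-recovery}"
  local drill_flag="--no-is-drill"
  if [ "$mode" = "drill" ]; then
    drill_flag="--is-drill"
  fi
  aws drs start-recovery \
    --source-servers "sourceServerID=${server_id}" \
    "$drill_flag"
}

# Example (sample ID from the installer output):
#   replication_state s-3146f90b19example
#   start_failover s-3146f90b19example drill
```

Running a drill first is the non-disruptive way to test; the recovery variant corresponds to the failover performed in this walkthrough.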

Image description

Image description

We can now connect to the target/recovery instance that was failed over to AWS, confirm the data and make changes

Image description

Image description

Failback from AWS to on premise

For more details, follow the link below (we will run through the setup in the meantime)

https://docs.aws.amazon.com/drs/latest/userguide/failback-performing-main.html

We need to download the “failback iso” and attach it to the original VM on premise. Download the ISO for the Region you selected (the example below is for eu-north-1)

https://aws-elastic-disaster-recovery-eu-north-1.s3.eu-north-1.amazonaws.com/latest/failback_livecd/aws-failback-livecd-64bit.iso

Upload the ISO to the datastore and attach it to the original source VM ;)

Image description

Change the boot option to boot from the ISO; for physical servers, attach the ISO as a USB drive or similar.

Image description
Provide the Region and IP details (the same IP details as the original server before failover)

Image description

Image description

Follow the prompts and enter the details as needed. The Access Key and Secret Key must belong to the failback user created earlier, the one with the failback policy attached.
The disks will be detected and the instance will start replication

Image description

We can see the reverse replication being triggered in the AWS EDR console

Image description

Image description

Once replication has finished, we will see a message to complete the failback

Image description
The original (on-premise) server will reboot; confirm the data, and the server is back in its original state

Image description

Select "complete failback" in the AWS EDR console to conclude the failback

Image description

After failback, reboot the VM on premise

Image description

This concludes failover and failback testing.

I hope this has given you the complete picture. Feel free to comment with any feedback, requests for clarification, mistakes you notice or points I have missed.
I can also do a write-up on application migration or other topics if there is interest; please mention it in the comments.
