My journey on AWS Region Migration: What I wish I had known

#aws #cloud

TL;DR

Always consult a Solutions Architect on the end-to-end use cases before going ahead; it saves you from having to improvise solutions later.

Research the available solutions; deployment goes much more smoothly when you already have a plan for each of them.

Never migrate the production environment without a foolproof plan that ensures the migration will not fail.

Background

For a brief introduction: my company runs a production workload on AWS. Recently, AWS launched a new region that happens to be much closer to our customers than the one we were using. The decision was clear: we had to move to the new region to get more out of our Cloud environment on AWS, starting with lower latency for our customers.

Before the migration, we had not adopted cross-region disaster recovery (DR), and cross-region communication was not part of the architecture design. Our main workload (containers) ran on EKS with Aurora/RDS as the database, integrations ran on Step Functions state machines and Lambda functions (exposed via Amazon API Gateway), and distribution was handled by CloudFront and S3.

Note that the architecture diagram is only a rough, simplified version of the real deployment.

The EKS migration went smoothly enough, with AWS Secrets Manager and External Secrets (running in EKS) handling variable injection.
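On the External Secrets side, repointing the workload at the new region mostly comes down to the `region` of the `SecretStore`. Below is a minimal sketch that renders such a manifest as JSON (which `kubectl apply -f` accepts), assuming the secrets already exist in Secrets Manager in the new region and that the External Secrets controller authenticates with its own AWS credentials (no `auth` block). The store name, namespace, and region are hypothetical placeholders, and the `apiVersion` depends on your External Secrets Operator release (`v1beta1` on older ones, `v1` on newer ones).

```python
import json


def secret_store_manifest(name: str, namespace: str, region: str) -> dict:
    """Render an External Secrets Operator SecretStore backed by AWS Secrets Manager.

    Assumption: no explicit auth block, so the controller's own AWS credentials
    (for example its IRSA role) are used to read the secrets.
    """
    return {
        # Assumption: older ESO releases serve v1beta1; newer ones also serve v1.
        "apiVersion": "external-secrets.io/v1beta1",
        "kind": "SecretStore",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "provider": {
                "aws": {
                    "service": "SecretsManager",
                    # The field that changes between the original and the new region.
                    "region": region,
                }
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical names; "us-west-2" only stands in for "the new region".
    manifest = secret_store_manifest("app-secrets", "production", "us-west-2")
    print(json.dumps(manifest, indent=2))
```

Keeping the region as the only parameter makes it easy to render one store per region while both environments run side by side during the cut-over.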

The Problems

The requirements for some of the deployments, though, were not clarified before the migration took place. For example, a third party's IP allowlist locked our outbound traffic to the public outbound IP in the original region, and scheduling their change within the same migration window would have been challenging at the very least.
- Solution: leave the workload that cannot be migrated yet in the original region, and schedule its migration for a later phase. This introduces cross-region communication (via Transit Gateway Peering in this case), which drives cost and latency up a bit, but that is probably a small price to pay to get the migration going.
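One detail worth knowing for this setup: Transit Gateway peering attachments do not propagate routes, so each side's transit gateway route table needs static routes to the other region's VPC CIDRs. A minimal sketch that builds the keyword arguments for the EC2 `CreateTransitGatewayRoute` call, one per remote CIDR; the route table ID, attachment ID, and CIDRs are hypothetical placeholders, and the same has to be done in the opposite direction on the other region's transit gateway.

```python
from __future__ import annotations

import ipaddress


def peering_static_routes(route_table_id: str, peering_attachment_id: str,
                          remote_cidrs: list[str]) -> list[dict]:
    """Build kwargs for ec2.create_transit_gateway_route, one per remote CIDR.

    Peering attachments do not propagate routes, so every remote VPC CIDR
    needs a static route pointing at the peering attachment.
    """
    routes = []
    for cidr in remote_cidrs:
        # strict=True (the default) rejects CIDRs with host bits set, e.g. "10.0.0.1/16".
        network = ipaddress.ip_network(cidr)
        routes.append({
            "TransitGatewayRouteTableId": route_table_id,
            "DestinationCidrBlock": str(network),
            "TransitGatewayAttachmentId": peering_attachment_id,
        })
    return routes


if __name__ == "__main__":
    # Hypothetical IDs and CIDRs for the new region's VPCs.
    for kwargs in peering_static_routes("tgw-rtb-0123456789abcdef0",
                                        "tgw-attach-0123456789abcdef0",
                                        ["10.1.0.0/16"]):
        print(kwargs)
        # With boto3 in the original region: ec2.create_transit_gateway_route(**kwargs)
```

Validating the CIDRs locally before calling the API catches typos early, which matters here because a missing or wrong static route simply shows up as a timeout across regions.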
Part of the workload that couldn't be migrated right away was exposed through a private API in API Gateway. Its caller (deployed on EKS) was not aware that it placed the call without using the VPC Endpoint. Because the call worked well before the migration, this hid the fact that the API was not invoked via the intended design (a Route 53 record redirecting the call via an ELB), and it did not become obvious until we ran the end-to-end test in the new region.
- Solution: over the cross-region connection, the private API in API Gateway can still be resolved and called (with a VPC interface endpoint and a private DNS entry), but I wanted to mention it here because it is easy to overlook.
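The difference between a call that silently relies on DNS and one that names the VPC endpoint explicitly can be sketched with the standard library. Below, `private_dns_url` is the default invoke URL, which only reaches a private API where the execute-api interface endpoint's private DNS (or an equivalent Route 53 record) resolves; `request_via_endpoint` instead targets an endpoint-specific DNS name and passes the API ID in the `x-apigw-api-id` header, one of the invocation methods API Gateway documents for private APIs. All IDs, the stage, the path, and the region are hypothetical placeholders.

```python
from urllib.request import Request


def private_dns_url(api_id: str, region: str, stage: str, path: str = "/") -> str:
    """Default invoke URL; reaches the private API only where this hostname
    resolves to the execute-api interface endpoint (private DNS or Route 53)."""
    return f"https://{api_id}.execute-api.{region}.amazonaws.com/{stage}{path}"


def request_via_endpoint(vpce_dns_name: str, api_id: str, stage: str,
                         path: str = "/") -> Request:
    """Build (without sending) a request aimed at one specific execute-api VPC
    endpoint by its endpoint-specific DNS name, identifying the API through the
    x-apigw-api-id header instead of relying on private DNS."""
    url = f"https://{vpce_dns_name}/{stage}{path}"
    return Request(url, headers={"x-apigw-api-id": api_id}, method="GET")


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(private_dns_url("abcdefghij", "us-east-1", "prod", "/orders"))
    req = request_via_endpoint(
        "vpce-0123456789abcdef0-abcdefgh.execute-api.us-east-1.vpce.amazonaws.com",
        "abcdefghij", "prod", "/orders")
    print(req.full_url, req.header_items())
    # Only works from a network that can reach the endpoint:
    # urllib.request.urlopen(req)
```

Making the endpoint explicit in the caller's configuration turns the hidden dependency into something that shows up in a review, rather than in an end-to-end test after the move.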
Data integration via S3: the data pipeline (not shown in the diagram), which consumes from S3 and was marked as not ready for the migration, has a strong dependency on the S3 bucket. On the other end, the bucket also has a data dependency on Aurora/RDS (a SQL query statement uploads data to the S3 bucket). A gentle reminder: S3 bucket names must be unique across all regions, so you cannot create a bucket with the same name in the new region while the original exists, and services in a VPC cannot reach S3 in a different region through an S3 Gateway endpoint.
- Solutions: there are various ways to handle this, depending on the requirements:
  - S3 Replication, so that data is replicated to the new region to serve the workload there, while the data pipeline integration keeps being served from the origin region (requires Versioning on both buckets and carefully designed rules to avoid replication looping).
  - S3 Multi-Region Access Points (MRAP): instead of replicating the whole bucket into the destination region, an MRAP provides access through a single, globally accessible access point (requires refactoring the S3 callers to use the access point).
  - An S3 interface VPC endpoint (+ Route 53 Private Hosted Zone + VPC Peering/Transit Gateway Peering): if private networking is already a strict requirement for your Cloud infrastructure, you are probably at least halfway there, and the small additional cost would not be so bad (depending on how much data needs to be transferred between the regions).
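The three options above differ mostly in what the S3 caller or bucket configuration looks like. A minimal sketch of the request payloads for each, assuming boto3-style parameter shapes; the role ARN, bucket names, prefixes, account ID, MRAP alias, and endpoint DNS name are hypothetical. For replication, each direction gets its own rule scoped to its own prefix (an assumed convention, e.g. `to-new-region/` and `to-origin/`) so the two rules never overlap.

```python
def versioning_configuration() -> dict:
    # Replication requires versioning on BOTH the source and the destination bucket:
    # s3.put_bucket_versioning(Bucket=..., VersioningConfiguration=versioning_configuration())
    return {"Status": "Enabled"}


def replication_configuration(role_arn: str, dest_bucket: str,
                              prefix: str, rule_id: str) -> dict:
    """ReplicationConfiguration for s3.put_bucket_replication, one direction only.

    Scoping each direction to its own prefix keeps the rules from overlapping
    when data flows both ways between the regions.
    """
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": rule_id,
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                # Must be present whenever a rule uses Filter.
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{dest_bucket}"},
            }
        ],
    }


def mrap_arn(account_id: str, mrap_alias: str) -> str:
    """ARN passed as Bucket= when calling S3 through a Multi-Region Access Point.

    boto3 signs MRAP requests with SigV4A, which needs the AWS CRT (boto3[crt]).
    """
    return f"arn:aws:s3::{account_id}:accesspoint/{mrap_alias}"


def interface_endpoint_url(endpoint_dns_name: str) -> str:
    """Turn an S3 interface endpoint DNS entry ("*.vpce-....s3.<region>.vpce.amazonaws.com")
    into an endpoint_url usable by an S3 client for bucket operations."""
    if not endpoint_dns_name.startswith("*."):
        raise ValueError("expected the wildcard DNS name of the S3 interface endpoint")
    return "https://bucket." + endpoint_dns_name[2:]


if __name__ == "__main__":
    print(replication_configuration("arn:aws:iam::111122223333:role/s3-replication",
                                    "new-region-bucket", "to-new-region/", "origin-to-new"))
    print(mrap_arn("111122223333", "mfzwi23gnjvgw.mrap"))
    print(interface_endpoint_url("*.vpce-1a2b3c4d-5e6f.s3.us-east-1.vpce.amazonaws.com"))
```

With boto3, these would plug into `put_bucket_versioning`, `put_bucket_replication(Bucket=..., ReplicationConfiguration=...)`, `get_object(Bucket=mrap_arn(...), Key=...)`, and `boto3.client("s3", region_name=..., endpoint_url=...)` respectively. Only the replication option leaves the callers untouched; the other two change how every caller addresses the bucket.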

PS: We used Transit Gateway here because it abstracts away the other VPC attachments (such as the data pipeline VPC). You can use VPC Peering instead if your network topology is simple enough, the VPCs have no overlapping IP CIDRs, and your network team would not bite your head off for it.
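The overlapping-CIDR condition is easy to check up front with Python's standard `ipaddress` module. A minimal sketch, assuming one IPv4 CIDR per VPC (real VPCs can also carry secondary CIDRs, which would need the same check); the VPC names and ranges are made up for illustration.

```python
import ipaddress
from itertools import combinations


def overlapping_vpcs(vpc_cidrs: dict) -> list:
    """Return (name_a, name_b) pairs whose IPv4 CIDR blocks overlap.

    Simplification: one CIDR per VPC; secondary CIDRs are not considered.
    """
    networks = {name: ipaddress.IPv4Network(cidr) for name, cidr in vpc_cidrs.items()}
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


if __name__ == "__main__":
    # Hypothetical CIDRs for the original, new-region, and data pipeline VPCs.
    vpcs = {"origin": "10.0.0.0/16", "new-region": "10.1.0.0/16", "data-pipeline": "10.0.128.0/20"}
    print(overlapping_vpcs(vpcs))  # [('origin', 'data-pipeline')]
```

An empty result is a prerequisite for VPC Peering, not a guarantee that it is the right choice; the topology and your network team still get a say.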

Conclusion

Plan ahead, and revise the plan until you are sure you have everything covered.

I have learned to apply these lessons, and I appreciate the support I got from my team. The result is a more efficient Cloud platform, and I am satisfied with it.

Thank you for reading this far. I hope you learned something, and I am more than happy to answer any questions you have about this.