DEV Community

Keisuke FURUYA for AWS Community Builders

Posted on • Originally published at builder.aws.com

What would you do if you were asked to expand the subnets where ECS and Aurora clusters are running?

I would like to share the process and lessons learned from a project involving subnet reconfiguration. While subnet modification is a task best avoided if possible, I hope this serves as a helpful guide for those facing similar challenges.

TL;DR

  • Due to challenges in the VPC network architecture, I migrated the subnets for the ALB, ECS cluster, and Aurora cluster.
  • All migrations were completed successfully with zero or minimal downtime.

The Challenge

In a system I was responsible for, we faced the following issues regarding network configuration:

  • IP address exhaustion: Due to service growth, we were running out of available IP addresses in our subnets.
  • Batch processing impact: Specifically, when a large number of batch tasks were executed simultaneously, the lack of available IPs became a critical bottleneck.

To resolve these issues, we decided to secure future scalability by expanding the subnet ranges.

The Ideal State

The system in question has a simple architecture: an API server and batch jobs running on ECS (Fargate) behind an ALB, with an Aurora (MySQL) database. Although we were already using public/private subnets, we reorganized them into the following structure based on AWS best practices:

  • Public Subnets: Housing the ALB and NAT Gateways.
  • Private Subnets: Housing the ECS clusters.
  • Isolated Subnets: Housing the Aurora clusters.

We secured these across three Availability Zones (AZs) using significantly wider CIDR ranges than before.
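As a sketch of how a wider range can be carved into that three-tier, three-AZ layout (the VPC CIDR and prefix lengths here are hypothetical examples, not the values from our project), Python's `ipaddress` module makes the subnet math easy to verify before touching the console:

```python
import ipaddress

# Hypothetical VPC CIDR; substitute your own range.
vpc = ipaddress.ip_network("10.0.0.0/16")

# Carve the VPC into /20 blocks and assign three tiers x three AZs.
blocks = list(vpc.subnets(new_prefix=20))
tiers = ["public", "private", "isolated"]
azs = ["us-west-2a", "us-west-2b", "us-west-2c"]

plan = {}
for i, (tier, az) in enumerate((t, a) for t in tiers for a in azs):
    plan[f"{tier}-{az}"] = blocks[i]

for name, cidr in plan.items():
    # AWS reserves 5 addresses in every subnet (network, router, DNS,
    # future use, broadcast), so subtract them from the usable count.
    print(f"{name}: {cidr} ({cidr.num_addresses - 5} usable IPs)")
```

A /20 per subnet leaves 4,091 usable addresses each, and the nine subnets consume only 9 of the 16 available /20 blocks, leaving headroom for future tiers.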

*(Image: subnet configuration before and after the migration)*

Step 1: ALB Migration

To create subnets with wider CIDR ranges, the existing subnets needed to be deleted. This required moving the ALB to a temporary "evacuation" subnet first. I considered two methods for the ALB subnet migration:

  1. Create a new ALB in the destination subnets and switch traffic using Route 53 weighted records.
    • Pros: Gradual traffic transition; fast rollback if issues occur.
    • Cons: If clients have DNS caching enabled, they might continue hitting the old ALB after the switch, leading to errors.
  2. Modify the existing ALB's subnet settings.
    • Pros: The ALB's DNS name and endpoint stay the same, so client-side DNS caching is not a concern.
    • Cons: The cutover is not gradual, and rollback means reapplying the old subnet settings.

I chose Method 2 (changing the existing ALB's subnet settings) because we could not fully control client-side DNS caching, which made the error risk of Method 1 unacceptable.

Lessons & Tips

Even when changing subnet settings, AWS performs a rolling update internally, allowing us to migrate the ALB without downtime. You can see exactly what is happening by monitoring the Elastic Network Interfaces (ENI). First, new ENIs are created for the ALB in the new subnets; once they reach a "Ready" state, the old ENIs are deleted. (I previously wrote a blog post about how understanding ENI behavior is the key to mastering VPC networking: "Mastering ENIs is mastering your VPC network.")

Note: Behavior may vary depending on your application characteristics, so always verify this in a staging environment first.
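For reference, the in-place change is a single CLI call, and the rolling ENI replacement described above can be observed alongside it. The ARN and subnet IDs below are placeholders, not values from our environment:

```shell
# Placeholder ARN and subnet IDs; substitute your own.
ALB_ARN="arn:aws:elasticloadbalancing:us-west-2:111122223333:loadbalancer/app/my-alb/0123456789abcdef"

# Point the existing ALB at the new, wider public subnets.
# AWS rolls this out by creating new ENIs before deleting the old ones.
aws elbv2 set-subnets \
  --load-balancer-arn "$ALB_ARN" \
  --subnets subnet-newpub-a subnet-newpub-b subnet-newpub-c

# Watch the ALB's ENIs during the change: new ENIs appear in the new
# subnets first, then the old ones are removed once the new are ready.
aws ec2 describe-network-interfaces \
  --filters "Name=description,Values=ELB app/my-alb/0123456789abcdef" \
  --query 'NetworkInterfaces[].[SubnetId,Status]' \
  --output table
```

Running the `describe-network-interfaces` query in a loop during the migration is an easy way to confirm the rolling behavior in your own environment.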

Step 2: Aurora Migration

After securing the wide isolated subnets, I migrated the Aurora cluster. Again, there are several ways to do this:

  1. Schedule a maintenance window, create a snapshot, restore it to a new cluster in the new subnets, and update the application connection strings.
  2. Create a Reader instance in the new subnet, scale down the cluster to one old Writer and one new Reader, and perform a failover.

Since Aurora failover typically completes in under 60 seconds, I chose Method 2 to minimize downtime.

https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Concepts.AuroraHighAvailability.html#Aurora.Managing.FaultTolerance

While often overlooked, a DB Subnet Group can include multiple subnets within the same AZ. For example, you could set up a configuration like this (assuming us-west-2d was not previously used):

  • us-west-2a: Old Subnet A
  • us-west-2b: Old Subnet B
  • us-west-2c: Old Subnet C
  • us-west-2d: New Subnet D

In this state, you can create a new Aurora instance and explicitly select us-west-2d. Once the new instance is ready, reduce the cluster to one old Writer and the one new Reader, then fail over.
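The sequence above can be sketched with the AWS CLI as follows. All identifiers, the instance class, and the subnet IDs are placeholders for illustration:

```shell
# Placeholder identifiers; substitute your own.
CLUSTER_ID="my-aurora-cluster"

# 1) Add the new subnet to the DB subnet group (old subnets stay for now).
aws rds modify-db-subnet-group \
  --db-subnet-group-name my-subnet-group \
  --subnet-ids subnet-old-a subnet-old-b subnet-old-c subnet-new-d

# 2) Create a Reader pinned to the new AZ, which places it in the new subnet.
aws rds create-db-instance \
  --db-cluster-identifier "$CLUSTER_ID" \
  --db-instance-identifier my-aurora-new-reader \
  --db-instance-class db.r6g.large \
  --engine aurora-mysql \
  --availability-zone us-west-2d

# 3) Once the new Reader is available and the cluster is reduced to the
#    old Writer plus this Reader, promote it. Cluster endpoints are unchanged.
aws rds failover-db-cluster \
  --db-cluster-identifier "$CLUSTER_ID" \
  --target-db-instance-identifier my-aurora-new-reader
```

After the failover, the old Writer becomes a Reader in the old subnet and can be deleted, at which point the old subnets can be removed from the subnet group.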

Lessons & Tips

By performing the subnet switch via failover, we didn't need to change the Aurora cluster endpoints, allowing us to complete the change with a very short maintenance window.

Step 3: ECS Service & Task Migration

The API servers running on the ECS cluster also needed to move. Since we already use rolling updates for our deployments, we were able to transition to the new subnets without downtime simply by updating the service's network configuration.
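As a sketch, the network change is an `update-service` call on the awsvpc-mode service; the cluster, service, subnet, and security group names below are hypothetical:

```shell
# Placeholder names/IDs; substitute your own.
# The service's rolling deployment launches new tasks in the new subnets
# before draining the old ones, so there is no downtime.
aws ecs update-service \
  --cluster my-cluster \
  --service my-api-service \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-newpriv-a,subnet-newpriv-b,subnet-newpriv-c],securityGroups=[sg-0123456789abcdef0],assignPublicIp=DISABLED}' \
  --force-new-deployment
```

Standalone batch tasks pick up the new subnets through whatever network configuration their scheduler (e.g. EventBridge rules or `run-task` calls) passes in, so remember to update those as well.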

Summary

I have shared the process and lessons learned while reorganizing subnets for future scalability. For the services discussed here—ALB, ECS, and Aurora—it is possible to switch subnets with zero or minimal downtime. I hope this article provides a helpful reference for your own infrastructure projects.
