Your First VPC
What is the first thing you do when creating a new AWS account? You might start tinkering with Lambdas, DynamoDB, API Gateway, and other serverless services. But when it comes time to get serious work done, you will eventually need to create a VPC.
A VPC (Virtual Private Cloud) is a logically isolated virtual network in the cloud where you can run resources (like servers and databases) with control over IP addresses, subnets, routing, and security.
You do your due diligence, you know about CIDRs, and you are no stranger to traditional network design principles. So naturally, you head to the VPC console and put together a setup.
Fast forward a few years: your account is serving traffic, and your services are being used by real paying customers.
You now have multiple environments across different accounts and you are starting to care more about High Availability, Scalability, and Security: your network setup is starting to hold you back.
So... What do you do? How can you evolve your VPC reliably, without causing disruption to the user (and to your developers)?

This was us
Let's hop into our time machine and go back to 2011.
There were no such things as Docker or Lambda functions, and the whole plethora of cloud technologies and services we have come to take for granted was just beginning to gain traction.
The VPC service launched in 2009, but it was still not the polished, managed experience we know today, with resources such as managed NAT Gateways becoming available as late as December 2015.
Most notably, the widespread expertise and standardized architectural references that are commonplace today were absent at the time.
It was in this context that THRON was migrating part of its infrastructure to the public AWS Cloud. THRON's first VPC was created in late 2011 and was designed to maintain connectivity to the existing on-premises servers.
Fast forward to 2025: THRON has fully ditched its on-premises infrastructure and is now entirely hosted on AWS.
THRON's core workloads all run in a single account (which we will refer to as MAIN), whose VPC was the subject of our makeover. However, a couple of other accounts exist in the same organization and host secondary workloads. Every account is replicated across the development, quality, and production environments.
The following diagram is a high level view of the starting VPC setup.

Let's walk through the main pain points of our network infrastructure:
Subnet Fragmentation
The MAIN VPC spanned a /19 block and was segmented into 12 /24 subnets (6 private and 6 public). Half of the subnets were marked as static and the other half as dynamic: originally, this distinction allowed the dynamic subnets to reach the on-premises servers, but over time the fragmentation simply led to some subnets being over- or under-utilized.
Why was this a problem?
We were starting to experience a shortage of available IP addresses in the private subnets. A /24 subnet has 256 addresses (251 usable, since AWS reserves 5 per subnet). This was sufficient for EC2-based setups, where a single IP can be shared by multiple services.
With the increasing adoption of serverless options, this was no longer the case:
- A single lambda function takes up 3 IP addresses (1 address per subnet)
- With Fargate, every single task in a service takes its own IP address

Single-AZ NAT Gateway
Our private subnets all routed through a single NAT Gateway in AZ A. This worked fine in practice: most services in private subnets didn't even need external internet access, relying instead on S3 and DynamoDB Gateway endpoints. In the unlikely event that AZ A went down, these services would lose their route to the public internet but would still function normally for the most part. Still, this was a limitation we wanted to remove.
The original network design was pragmatic and effective for its time and had proven reliable through years of production traffic. THRON's architecture had been carefully crafted to work within these constraints:
- Critical workloads were strategically distributed to ensure resilience
- Services in private subnets were architected with minimal external dependencies, using VPC endpoints for AWS services
- Some services ran on a secured ECS cluster deployed in public subnets to work around the single-NAT limitation.
However, as our infrastructure matured and our adoption of serverless technologies grew, the time had come to move to a less constrained solution.
VPC Features shopping list
After analyzing our setup issues, we established what we wanted. First, we needed to solve the existing problems:
- High Availability should be enabled on all private subnets, without exception
- CIDR Allocation:
  - Just 6 subnets (3 private, 3 public, one per AZ per type) to eliminate ambiguity when creating new resources
  - Wider private subnets to solve our IP shortage problem
- Bonus: Add IPv6 support to further future-proof our infrastructure.
We also had other crucial requirements:
- We could not afford any downtime whatsoever
- We wanted to manage everything with AWS CDK, our IaC tool of choice
- We wanted to minimize disruption to our development teams
Easy, let's start from scratch
It turns out, it is not as easy as it may seem. One might try to create another VPC with all new requirements, peer it to the existing one, and slowly migrate components one by one.
While the plan might be feasible on paper, a few caveats make it extremely tricky:
- Some services, most notably ALBs, work well across multiple subnets but cannot work across different VPCs, making the migration much more complex.
- Every piece of infrastructure would need to be redeployed, and parallel versions of the same infrastructure would need to exist and be maintained for a period. This would require massive effort from both us and our development team.
All of this left us with a single option: Evolve the existing VPC
Defining the Blueprints
Old VPC, new CIDR
One thing not everyone knows is that you can add up to 4 secondary CIDR blocks to an existing VPC. This alone made our plan to evolve our existing VPC possible.
The existing VPC CIDR was 10.74.128.0/19. After mapping out our organization, peering requirements, and services, we decided to add the 10.74.160.0/19 and 10.74.192.0/18 CIDR blocks.
This way, the VPC could be seen as a single 10.74.128.0/17 block. This is particularly important because peerings with MongoDB require a single CIDR block to target. Plus, it makes any modifications much cleaner.
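For illustration, here is a minimal CDK (TypeScript) sketch of associating the secondary blocks; the VPC ID is a hypothetical placeholder, and this shows just one way to express it, not necessarily how our stack is laid out:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

declare const stack: cdk.Stack;

// ID of the existing MAIN VPC (placeholder value).
const vpcId = 'vpc-0123456789abcdef0';

// Associate the two secondary CIDR blocks with the existing VPC.
const secondaryCidrA = new ec2.CfnVPCCidrBlock(stack, 'SecondaryCidrA', {
  vpcId,
  cidrBlock: '10.74.160.0/19',
});

const secondaryCidrB = new ec2.CfnVPCCidrBlock(stack, 'SecondaryCidrB', {
  vpcId,
  cidrBlock: '10.74.192.0/18',
});
```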
Next, the subnets. Unfortunately, you cannot resize or move subnets. At the same time, we could not afford to recreate and move lots of resources, as this would have been quite time-consuming.
To minimize the need to migrate resources, we decided to keep 3 of the existing 6 /24 public subnets. For our purposes, 256 addresses in each public subnet was sufficient.
In contrast, all existing private subnets had to be replaced. New private subnets were allocated in the newly added CIDR blocks, each being a /19 and hosting up to 8,187 usable IP addresses. That is the size of the entire original VPC, but for each individual subnet.
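A companion sketch for the new private subnets: each one is carved out of a secondary block, and an explicit dependency ensures the CIDR association exists before the subnet is created (VPC ID, AZ, and construct names are placeholders):

```typescript
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

declare const stack: cdk.Stack;
declare const secondaryCidrA: ec2.CfnVPCCidrBlock; // association from the previous sketch

// One of the three new /19 private subnets, allocated inside a secondary CIDR block.
const newPrivateSubnetA = new ec2.CfnSubnet(stack, 'NewPrivateSubnetA', {
  vpcId: 'vpc-0123456789abcdef0', // placeholder
  availabilityZone: 'eu-west-1a', // placeholder
  cidrBlock: '10.74.160.0/19',
});

// The subnet can only be created after its CIDR block is associated with the VPC.
newPrivateSubnetA.addDependency(secondaryCidrA);
```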
The picture below is the planned CIDR allocation for the expanded VPC.

Still, this allocation was not perfect. Wasted space sat between the public subnets and was hardly usable in the future due to its small size. If we needed to allocate additional subnets (for instance, isolated ones), we could add the 10.74.0.0/17 CIDR block (the block before the VPC) to make the entire VPC a /16 block.
With this, our IP shortage problem was solved.
IPv6 and NAT
Adding HA for NAT was quite straightforward: we added a NAT Gateway in each public subnet (one per AZ) and then updated the private subnets' route tables accordingly.
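In CDK terms, the per-AZ setup looks roughly like the sketch below; subnet and route table IDs are placeholders, and the loop is an illustration rather than our exact code:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

declare const stack: cdk.Stack;

// For each AZ: an Elastic IP, a NAT Gateway in that AZ's public subnet,
// and a default route in the corresponding private route table.
const azs = [
  { name: 'A', publicSubnetId: 'subnet-pub-a', privateRouteTableId: 'rtb-priv-a' },
  { name: 'B', publicSubnetId: 'subnet-pub-b', privateRouteTableId: 'rtb-priv-b' },
  { name: 'C', publicSubnetId: 'subnet-pub-c', privateRouteTableId: 'rtb-priv-c' },
];

for (const az of azs) {
  const eip = new ec2.CfnEIP(stack, `NatEip${az.name}`, { domain: 'vpc' });

  const nat = new ec2.CfnNatGateway(stack, `NatGateway${az.name}`, {
    subnetId: az.publicSubnetId,
    allocationId: eip.attrAllocationId,
  });

  // Route IPv4 internet-bound traffic from the private subnet through its local NAT Gateway.
  new ec2.CfnRoute(stack, `PrivateDefaultRoute${az.name}`, {
    routeTableId: az.privateRouteTableId,
    destinationCidrBlock: '0.0.0.0/0',
    natGatewayId: nat.ref,
  });
}
```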
IPv6 support was far more interesting.
You can add an IPv6 CIDR to your existing VPC: AWS does not let you choose it, but a random /56 block from the AWS pool gets assigned to you.
The common practice is to use /64 subnets. A single /56 block contains 256 different /64 subnets, and each /64 subnet contains 2^64 (roughly 1.8 × 10^19) IP addresses, which is enormous (keep in mind IPv6 uses 128-bit addresses). We assigned a single /64 block to each of our subnets.
Moreover, with IPv6, there is no real distinction between private and public addresses: every address is unique in the global address space.
This brings the need for a new component, the Egress Only Internet Gateway, to keep IPv6 addresses in private subnets, well... private.
Egress Only Internet Gateway (EIGW): An AWS VPC component that allows outbound-only IPv6 traffic to the internet while blocking inbound connections.
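A rough CDK sketch of the IPv6 pieces, assuming the same stack and placeholder IDs as in the earlier sketches: request the Amazon-provided /56 block, create the EIGW, and point ::/0 in a private route table at it:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

declare const stack: cdk.Stack;

const vpcId = 'vpc-0123456789abcdef0'; // placeholder

// Ask AWS for an Amazon-provided /56 IPv6 block (the range itself cannot be chosen).
new ec2.CfnVPCCidrBlock(stack, 'Ipv6Block', {
  vpcId,
  amazonProvidedIpv6CidrBlock: true,
});

// Egress Only Internet Gateway: outbound-only IPv6 connectivity for private subnets.
const eigw = new ec2.CfnEgressOnlyInternetGateway(stack, 'Eigw', { vpcId });

// Default IPv6 route for one private route table (ID is a placeholder).
new ec2.CfnRoute(stack, 'PrivateIpv6DefaultRoute', {
  routeTableId: 'rtb-priv-a',
  destinationIpv6CidrBlock: '::/0',
  egressOnlyInternetGatewayId: eigw.ref,
});

// Assigning a /64 from the VPC's allocated block to each subnet is omitted here for brevity.
```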
While IPv6 is not the focus of this article, it's worth noting that:
- Unlike public IPv4 addresses, public IPv6 addresses are free in AWS
- Unlike with a NAT gateway, you do not pay for egress traffic through an EIGW
- Not all AWS services supported IPv6 at the time. A hybrid approach (such as the one we implemented) was probably the wisest compromise.
In order to actually use IPv6, you need to enable it when creating new resources. Lambda, EC2, and ECS all have flags to enable it, and most other services offer some dual-stack endpoint option.
If you are running modern operating systems in your workload, they will automatically prefer IPv6 over IPv4 whenever a dual-stack domain is presented.


Bringing everything together
Now that we had an end goal, it was time to put a plan into action. We proceeded as follows:
- Import the current VPC with CDK
- Evolve the current VPC
- Update network resources
- Migrate resources from the old subnets
- Delete the old subnets
With our blueprint defined, we moved to execution.
Importing the VPC
Importing resources in CDK/CloudFormation means writing a template that matches the current infrastructure and then manually looking up and matching the ARN/logical ID of every resource in the stack via the CLI or console.
There was little that could go wrong during this process, but we made sure to do our homework beforehand:
- We mapped out our network infrastructure thoroughly. Using the CLI to view resources gave us a close mapping to the CloudFormation attributes. It was crucial that our template exactly matched the underlying infrastructure: CloudFormation drift detection and correction is not on par with Terraform or other IaC tools, and we could have run into problems during later stack updates.
- We checked that all VPC-related resources could be imported in CloudFormation. Here you can find the list of all supported resources for import.
CloudFormation does not handle resource imports very smoothly, due to a combination of a poor interface and the lack of a decent CLI, so the process was quite tedious. Here are some lessons we learned:
- Not all resources' logical IDs were available in the console. "Association" type resources especially required some AWS CLI investigation.
- We proceeded by importing small chunks at a time. Imports can fail for trivial reasons, and this means having to re-insert all the IDs manually.
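To give an idea of what "a template that matches the current infrastructure" means in practice, here is a hypothetical sketch of one such resource; every property must mirror what is already deployed, and the stack can then be brought under CloudFormation/CDK management with a resource import (for example via `cdk import`, which prompts for the physical ID of each resource):

```typescript
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

// A stack that mirrors the existing network 1:1 (all values below are placeholders).
class NetworkStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Must match the deployed subnet exactly, property by property.
    new ec2.CfnSubnet(this, 'PublicSubnetA', {
      vpcId: 'vpc-0123456789abcdef0',
      cidrBlock: '10.74.128.0/24',
      availabilityZone: 'eu-west-1a',
      mapPublicIpOnLaunch: true,
    });

    // ...one construct per existing subnet, route table, association, gateway, etc.
  }
}

const app = new cdk.App();
new NetworkStack(app, 'NetworkStack');
```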

Evolve the VPC
Evolving the VPC by adding new features was probably the most delicate part of this whole operation, as one small mistake could bring down the entire company infrastructure. One might think it would be enough to create a new version of the template and let CDK do its thing. In 99% of cases, this would be the right approach.
CloudFormation's standard behavior can be summed up as "Create First, Update and Destroy Later." This behavior was not always consistent across services and needed to be tested in development environments for critical cases like ours.
We concluded that some resources needed to have their name changed multiple times in order to be updated without any hiccups.
For some others, most crucially AWS::EC2::SubnetRouteTableAssociation, this was not enough. In this case, trying to update it caused CloudFormation to leave some subnets without a routing table for more than 15 seconds. The same operation, when performed via the AWS Console, was instantaneous (of course, we tested).
To avoid such downtime, we devised the following strategy:
- Synthesize the new template via `cdk synth`
- Take the current stack template and add some resources from the new template, most notably the new RouteTables. Deploy the old template with these new mixins.
- Remove the `SubnetRouteTableAssociation` resource from the stack without deleting it. This can be achieved by setting its `DeletionPolicy` to `Retain` and then removing it from the stack (sketched in code after this procedure). Then, manually update the `SubnetRouteTableAssociation` from the AWS Console.
- Import the new `SubnetRouteTableAssociation` into the template
- Deploy the original synthesized template with `cdk deploy`

Manual Update Procedure

The whole procedure was devised and tested repeatedly in development environments. When it came time to apply it to production, having a written checklist with all commits neatly tagged ensured the VPC evolution went surprisingly smoothly.
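To make the retain-and-reimport step more concrete, here is a hedged CDK sketch (IDs and construct names are hypothetical) of how the association is kept alive while it leaves the stack:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

declare const stack: cdk.Stack;

// Keep the association alive in AWS while removing it from the stack.
// With DeletionPolicy: Retain, CloudFormation "forgets" the resource on the
// next deploy instead of deleting it.
const assoc = new ec2.CfnSubnetRouteTableAssociation(stack, 'PrivateSubnetARouteAssoc', {
  subnetId: 'subnet-priv-a',      // placeholder
  routeTableId: 'rtb-priv-old-a', // placeholder
});
assoc.cfnOptions.deletionPolicy = cdk.CfnDeletionPolicy.RETAIN;

// Deploy, then delete this construct from the code and deploy again: the association
// survives in AWS. After repointing it to the new route table from the console,
// it can be brought back under management with a resource import (e.g. `cdk import`).
```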

Updating Network Resources
Updating the MAIN VPC alone was not enough: we needed to ensure connections to and from the VPC were not blocked when resources started to be deployed on the new CIDR blocks.
The first step was to update any security group rules referencing the old VPC CIDR. With AWS Config enabled, we could find them all with a single advanced query. Otherwise, we would have needed to resort to some quick scripting. This was also a good time to stop hardcoding the CIDR directly and use a prefix list instead.
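As a sketch of that change (the prefix list name and the security group are hypothetical), the VPC's CIDRs can live in a managed prefix list that security group rules reference instead of raw ranges:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

declare const stack: cdk.Stack;
declare const serviceSecurityGroup: ec2.SecurityGroup;

// A managed prefix list holding every CIDR that belongs to the VPC,
// so security group rules reference one object instead of hardcoded ranges.
const vpcPrefixList = new ec2.CfnPrefixList(stack, 'VpcCidrsPrefixList', {
  prefixListName: 'main-vpc-cidrs', // hypothetical name
  addressFamily: 'IPv4',
  maxEntries: 5,
  entries: [
    { cidr: '10.74.128.0/19', description: 'Original VPC CIDR' },
    { cidr: '10.74.160.0/19', description: 'Secondary CIDR' },
    { cidr: '10.74.192.0/18', description: 'Secondary CIDR' },
  ],
});

// Rules follow the prefix list automatically when a CIDR is added or removed.
serviceSecurityGroup.addIngressRule(
  ec2.Peer.prefixList(vpcPrefixList.attrPrefixListId),
  ec2.Port.tcp(443),
  'HTTPS from anywhere inside the VPC',
);
```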
At the same time, all peered VPCs had their route tables updated to route traffic to the new expanded CIDR. Likewise, any security group rules referencing the old VPC had to be updated. This was also true for the managed MongoDB Atlas clusters.
We confirmed everything was working by spinning up some EC2 instances in the new subnets and running connectivity tests.
Migrating resources from old subnets
At this point, we had 15 subnets, 9 of which had to go.
We assessed all resources to move by filtering all Elastic Network Interfaces (ENIs) in the soon-to-be-deleted subnets and found approximately 260 of them.
The effort required to migrate each varies greatly by project type and resource. As a rule of thumb, stateless and serverless resources require little work, while less managed and storage-related resources take more time and planning.
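For reference, this is roughly how such an ENI inventory can be gathered with the AWS SDK for JavaScript v3; the subnet IDs are placeholders and this is not necessarily the exact tooling we used:

```typescript
import {
  EC2Client,
  paginateDescribeNetworkInterfaces,
} from '@aws-sdk/client-ec2';

// Subnet IDs of the subnets scheduled for deletion (placeholders).
const oldSubnetIds = ['subnet-priv-old-a', 'subnet-priv-old-b', 'subnet-priv-old-c'];

async function listEnisInOldSubnets(): Promise<void> {
  const client = new EC2Client({});
  const paginator = paginateDescribeNetworkInterfaces(
    { client },
    { Filters: [{ Name: 'subnet-id', Values: oldSubnetIds }] },
  );

  for await (const page of paginator) {
    for (const eni of page.NetworkInterfaces ?? []) {
      // The description and interface type usually reveal which service owns the ENI
      // (Lambda, ECS tasks, load balancers, RDS, ...).
      console.log(eni.NetworkInterfaceId, eni.InterfaceType, eni.Description);
    }
  }
}

listEnisInOldSubnets().catch(console.error);
```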
Lambda, Fargate And ECS Cluster
These were by far the easiest resources to move. Luckily, they were also the majority of our resources.
For the ECS cluster, it was enough to change the Auto Scaling Group settings and start an instance refresh. In a couple of hours, the entire cluster and its services were successfully migrated.
For the majority of Lambda and Fargate projects, we pushed an update to our internal cdk-construct library that modified the default subnets for deployments and asked developer teams to update their projects. This took some time but required minimal effort from both sides.
While the majority of cases went smoothly, some required more attention:
- Scheduled Fargate tasks can fly under the radar when scanning ENIs. We spotted them with an EventBridge rule that caught and logged all events of tasks starting in soon-to-be-deleted subnets (see the sketch after this list).
- Some Lambda functions could share the same ENI if they were deployed in the same subnets and shared the same security group
- Some Lambda functions had multiple versions, and old versions retained their original VPC subnet configuration. These ENIs were very tricky to find; fortunately, we found the Lambda ENI Finder tool from the awslabs/aws-support-tools repository.
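Here is a hedged CDK sketch of such an EventBridge rule (subnet IDs and the log group are placeholders). ECS "Task State Change" events carry the task's network attachment details, including the subnet ID, which the pattern can match on:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import * as logs from 'aws-cdk-lib/aws-logs';

declare const stack: cdk.Stack;

const oldSubnetIds = ['subnet-priv-old-a', 'subnet-priv-old-b', 'subnet-priv-old-c']; // placeholders

const logGroup = new logs.LogGroup(stack, 'TasksInOldSubnetsLog', {
  retention: logs.RetentionDays.ONE_MONTH,
});

// Match ECS task state changes whose network attachment references an old subnet,
// and log them so stray scheduled tasks can be tracked down.
new events.Rule(stack, 'TasksInOldSubnetsRule', {
  eventPattern: {
    source: ['aws.ecs'],
    detailType: ['ECS Task State Change'],
    detail: {
      attachments: {
        details: {
          name: ['subnetId'],
          value: oldSubnetIds,
        },
      },
    },
  },
  targets: [new targets.CloudWatchLogGroup(logGroup)],
});
```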
Other Resources
Other resources we had to move were:
- EC2, RDS, Elasticsearch clusters: once deployed, these resources had to be recreated from scratch if they needed to be moved. For most of them, we had a routine recreation procedure, while for some EC2 instances, we had to dig deeper.
- ALB, NLB, API Gateway APIs, EFS: the key for these services was to update one AZ at a time. Errors could occur if resources in the AZ we had temporarily disabled still had open connections to these services, so we chose known low-traffic windows to minimize any possible issues.
A Few Notes for CDK-based projects
CDK introduced abstractions to improve developer experience, but that convenience came with reduced control:
- CDK often let us select subnets by type (public/private) without explicitly defining which subnets were used.
- Higher-level (L3) constructs could provision a large number of underlying resources that were opaque, tightly coupled, and difficult to customize or modify later. This cost us significant time: in one project that included EFS, implicit resources forced us to rely on complex CDK and CloudFormation workarounds just to modify a single security group rule.
As a rule of thumb, we learned not to over-rely on CDK abstractions and to prefer slightly more verbose, explicit code that gives tighter control.
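A small illustration of the difference (subnet IDs are placeholders): both forms are valid CDK, but the explicit one leaves no ambiguity about where a workload lands:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

declare const stack: cdk.Stack;
declare const vpc: ec2.IVpc;

// Implicit: ask CDK for "the private subnets" and let it decide which ones those are.
// Convenient, but the choice can silently change when the VPC layout evolves.
const implicitSelection: ec2.SubnetSelection = {
  subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS,
};

// Explicit: pin the exact subnets a workload must use (IDs are placeholders).
const explicitSelection: ec2.SubnetSelection = {
  subnets: [
    ec2.Subnet.fromSubnetId(stack, 'PrivateA', 'subnet-priv-a'),
    ec2.Subnet.fromSubnetId(stack, 'PrivateB', 'subnet-priv-b'),
    ec2.Subnet.fromSubnetId(stack, 'PrivateC', 'subnet-priv-c'),
  ],
};

// Either selection can then be passed as `vpcSubnets` to constructs such as
// Lambda functions, Fargate services, or load balancers.
```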
Finally, Delete the old Subnets
The last step was also the easiest. By this point, we handled everything cleanly via IaC.
Traffic was blocked on the soon-to-be-deleted subnets via a deny-all NACL for a couple of weeks as an extra safety measure.
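A minimal CDK sketch of that safety measure (VPC and subnet IDs are placeholders): a custom network ACL with no allow entries denies everything by default, so associating it with the old subnets effectively quarantines them:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

declare const stack: cdk.Stack;

const vpcId = 'vpc-0123456789abcdef0';                           // placeholder
const oldSubnetIds = ['subnet-priv-old-a', 'subnet-priv-old-b']; // placeholders

// A network ACL with no allow rules: everything is blocked by the implicit
// default deny, so any ENI still living in these subnets loses connectivity.
const quarantineAcl = new ec2.CfnNetworkAcl(stack, 'QuarantineAcl', { vpcId });

oldSubnetIds.forEach((subnetId, i) => {
  new ec2.CfnSubnetNetworkAclAssociation(stack, `QuarantineAclAssoc${i}`, {
    networkAclId: quarantineAcl.ref,
    subnetId,
  });
});
```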
After seeing no errors at all, it was finally time to bid farewell to our subnets—this time, for good.
Key Takeaways
Know your IaC: Whether you use CDK, plain CloudFormation, or Terraform, you need a deep understanding of how your tool of choice will handle critical updates to avoid disruption or downtime. Be prepared to tinker, and be confident using unorthodox patterns.
Do not blindly trust the docs: Sometimes "Update requires no interruption" may be true from a formal, resource-based perspective but not from a functional one.
Do not assume everyone cares about or understands what you are doing: Let's be honest—the only time people care about networking is when it is not working properly. To get everyone on board, try to minimize the effort required from other team members.
TEST, TEST, and TEST again: This is probably the golden rule. Test as many times as you can in development environments before applying your changes to production. Create and follow procedures for what you do, and know what to do if anything goes wrong.