This post was originally published on graycloudarch.com.
The request came from the security team: they needed network-level access from the nonprod account to the dev account so a vulnerability scanner could reach internal services. Simple enough on the surface. In practice, it exposed a gap we'd been living with for months --- and forced us to fix the network architecture we'd been deferring.
We had three standalone Transit Gateways: one in each workload account (dev, nonprod, and prod), completely isolated from each other, with no cross-account connectivity at all. The security scanner couldn't reach its targets, and adding point-to-point peering connections to fix it would have made everything worse.
But the TGW isolation was only part of the problem. We also had no inspection of traffic crossing our network boundary. Egress from workload pods went straight to the internet with no filtering. Ingress came through per-account load balancers with no centralized enforcement point. As the platform scaled toward additional workload accounts, this pattern was going to get expensive and hard to reason about.
So we didn't just fix the TGW. We rebuilt the network foundation: a centralized Inspection VPC with a Network Firewall inline, a single hub Transit Gateway shared across all accounts, and centralized security tooling (GuardDuty, CloudTrail, Security Hub) aggregated in a dedicated Security account. Two maintenance windows, a few weeks of module work, and the platform went from fragmented per-account networking to a coherent hub-spoke design with full traffic inspection.
The Architecture We Were Replacing
Before the migration, each workload account was self-contained. It had its own TGW, its own internet gateway, its own NAT gateways. Security tooling ran independently in each account with no aggregation. The management account had no single-pane visibility into what was happening across the environment.

The cost of running this way was about $150/month in TGW charges plus duplicated NAT gateway charges in each account. Every new workload account would multiply this cost again and add another independent security configuration.
The Target: Inspection VPC + Hub Transit Gateway
The target was AWS Security Reference Architecture Pattern B: an Inspection VPC that sits between the internet and all workload VPCs. All internet traffic --- ingress and egress --- flows through this VPC and through a Network Firewall before reaching any workload account.

Egress path: workload pod → TGW → Inspection VPC TGW subnets → Network Firewall → NAT Gateway → IGW → internet.

Ingress path: internet → IGW → centralized ALB (public subnet) → Network Firewall → TGW → workload VPC → pod.

Nothing crosses the network boundary without passing through the firewall. Workload accounts carry no internet-facing infrastructure at all --- no IGW, no NAT gateways, no public load balancers.
Phase 1: Module Changes
All Terraform work happened before scheduling any maintenance. The goal was to reach a state where the migration itself was just running pre-staged plan files in a specific sequence.
Transit Gateway: add a conditional create flag
The existing network module always created a TGW. We needed spoke accounts to declare the same module without spinning up their own gateway:
```hcl
variable "create_transit_gateway" {
  description = "Whether to create a Transit Gateway (false for hub-spoke spokes)"
  type        = bool
  default     = true
}

resource "aws_ec2_transit_gateway" "this" {
  count       = var.create_transit_gateway ? 1 : 0
  description = var.tgw_description
}

output "transit_gateway_id" {
  value = var.create_transit_gateway ? aws_ec2_transit_gateway.this[0].id : null
}
```
`default = true` means existing configurations need no changes. The flag only flips to `false` after the spoke attachment is confirmed working.
New module: vpc-attachment
The `vpc-attachment` module handles the spoke side of the hub relationship: create the TGW attachment, associate it with the hub's route table, and add routes to every private route table in the spoke VPC pointing at the hub TGW.
```hcl
resource "aws_ec2_transit_gateway_vpc_attachment" "this" {
  transit_gateway_id = var.transit_gateway_id
  vpc_id             = var.vpc_id
  subnet_ids         = var.subnet_ids

  tags = merge(var.tags, {
    Name = "${var.name}-hub-attachment"
  })
}

resource "aws_ec2_transit_gateway_route_table_association" "this" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.this.id
  transit_gateway_route_table_id = var.transit_gateway_route_table_id
}

resource "aws_route" "to_hub_tgw" {
  for_each = toset(var.vpc_route_table_ids)

  route_table_id         = each.value
  destination_cidr_block = "10.0.0.0/8"
  transit_gateway_id     = var.transit_gateway_id
}
```
The `10.0.0.0/8` supernet covers all workload VPC CIDRs without maintaining per-prefix route entries. It also covers the Inspection VPC CIDR (`10.100.0.0/20`) --- that's how return traffic from the centralized ALB finds its way back to pods in workload VPCs.

The Terragrunt config for a spoke account reads VPC details from the existing network dependency and hardcodes the hub TGW identifiers:
```hcl
dependency "network" {
  config_path = "../network"

  mock_outputs = {
    vpc_id                  = "vpc-mockid"
    private_subnet_ids      = ["subnet-mock1"]
    private_route_table_ids = ["rtb-mock1"]
  }
}

inputs = {
  transit_gateway_id             = "tgw-xxxxx"     # hub TGW, documented in runbook
  transit_gateway_route_table_id = "tgw-rtb-xxxxx"
}
```
We hardcoded the hub TGW and route table IDs rather than using cross-account data sources. The alternative --- reading TGW details from the Infrastructure account at plan time --- requires cross-account state access and adds complexity that isn't worth it for values that change maybe once in the platform's lifetime.
Hub route tables: workload isolation by default
A key design decision: workload accounts should not route to each other directly. Dev should not reach nonprod; nonprod should not reach prod. The hub TGW enforces this through route table structure:
- `default-association-rt`: all workload attachments associate here. The only route is 0.0.0.0/0 → inspection attachment. Workloads can reach the internet via the Inspection VPC, but cannot reach other workload VPCs.
- `default-propagation-rt`: the inspection attachment propagates workload CIDRs here for return traffic routing.
Inter-account communication is opt-in: you add an explicit route table entry for a specific attachment pair. By default, the architecture prevents lateral movement across workload accounts.
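As a sketch of that structure (resource names, attachment references, and CIDRs are illustrative, not the project's actual code), the association table carries a single default route, and an exception is one static route:

```hcl
# Default: everything from workload attachments goes to the Inspection VPC.
resource "aws_ec2_transit_gateway_route" "workloads_to_inspection" {
  destination_cidr_block         = "0.0.0.0/0"
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.inspection.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.default_association.id
}

# Opt-in exception, e.g. letting the scanner reach dev. Note that a route in
# the shared association table is visible to every spoke; strict per-pair
# isolation needs a dedicated association table for the source attachment.
resource "aws_ec2_transit_gateway_route" "to_dev" {
  destination_cidr_block         = "10.10.0.0/16" # dev VPC CIDR (illustrative)
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.dev.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.default_association.id
}
```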
Inspection VPC subnet layout
The Inspection VPC has three subnet tiers (public, firewall, and TGW attachment subnets) with carefully constructed route tables that force traffic through the firewall in both directions.

The asymmetric route table design ensures the firewall sees every packet crossing the network boundary, regardless of direction. Traffic entering from the internet hits the firewall before reaching workloads. Traffic from workloads hits the firewall before reaching the internet.
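A plausible Terraform sketch of those routes, assuming the firewall is managed by an `aws_networkfirewall_firewall` resource and that per-AZ route table IDs are passed in as map variables (both assumptions on my part, not the project's actual module):

```hcl
locals {
  # AZ -> firewall endpoint ID, read from the firewall's sync states.
  firewall_endpoints = {
    for ss in aws_networkfirewall_firewall.this.firewall_status[0].sync_states :
    ss.availability_zone => ss.attachment[0].endpoint_id
  }
}

# TGW-subnet route tables: egress from workloads must hit the firewall
# endpoint in the same AZ before it can reach a NAT gateway.
resource "aws_route" "tgw_to_firewall" {
  for_each               = local.firewall_endpoints
  route_table_id         = var.tgw_subnet_route_table_ids[each.key]
  destination_cidr_block = "0.0.0.0/0"
  vpc_endpoint_id        = each.value
}

# Public-subnet route tables: ingress headed for workload CIDRs is forced
# back through the firewall instead of going straight to the TGW.
resource "aws_route" "public_to_firewall" {
  for_each               = local.firewall_endpoints
  route_table_id         = var.public_subnet_route_table_ids[each.key]
  destination_cidr_block = "10.0.0.0/8"
  vpc_endpoint_id        = each.value
}
```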
Security baseline: convert to delegated admin model
GuardDuty and CloudTrail were running independently per account. We added `enable_guardduty` and `enable_cloudtrail` boolean variables to the security-baseline module so workload accounts could switch from standalone to member without touching the module invocation itself.
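The GuardDuty half of that change might look like the following sketch, mirroring the `create_transit_gateway` pattern (the variable name comes from the text; the resource wiring is illustrative):

```hcl
variable "enable_guardduty" {
  description = "Create a standalone GuardDuty detector (false once the account is an org member)"
  type        = bool
  default     = true
}

resource "aws_guardduty_detector" "this" {
  count  = var.enable_guardduty ? 1 : 0
  enable = true
}
```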
In the Security account, we deployed:
- GuardDuty as delegated admin with organization-level auto-enrollment. EKS Protection and S3 Protection enabled. All findings from all accounts visible in a single dashboard.
- CloudTrail organization trail writing to a cross-account S3 bucket. Log file validation and KMS encryption enabled. Per-account trails archived after the cutover --- not deleted, in case historical log formats differed.
- Security Hub with CIS AWS Foundations Benchmark and AWS Foundational Security Best Practices enabled across the full organization.
Phase 2: Two Maintenance Windows
Window 1: Deploy the hub (~45 minutes, low risk)
With no existing attachments and no workload traffic, deploying the hub infrastructure carried minimal risk. We applied the Infrastructure account TGW and Inspection VPC in a single window. The Network Firewall takes 5--10 minutes to reach READY state after creation --- account for that in your timing.
At the end of this window: hub TGW running, Inspection VPC active, Network Firewall endpoints healthy in both AZs, centralized ALB deployed. Nothing attached yet. We documented the TGW ID and route table IDs in the runbook before scheduling window 2.
Window 2: Spoke cutover (~2 hours)
The key insight for keeping applications running: create the hub attachment before destroying the standalone TGW. While both exist simultaneously, traffic continues flowing through the standalone path. The actual cutover is updating routes to point at the hub --- that's a single `terragrunt apply`, not the destruction of the old TGW.
T+0 --- Accept RAM share. Infrastructure account shares the hub TGW via Resource Access Manager. Workload accounts accept the share invitation. Pure metadata operation; zero network impact.
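From a workload account, accepting the share can be scripted along these lines. (If RAM sharing with AWS Organizations is enabled, shares inside the organization are accepted automatically and no invitation appears.)

```bash
# Find the pending invitation for the hub TGW share and accept it.
arn=$(aws ram get-resource-share-invitations \
  --query 'resourceShareInvitations[?status==`PENDING`].resourceShareInvitationArn' \
  --output text)
aws ram accept-resource-share-invitation --resource-share-invitation-arn "$arn"
```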
T+15 --- Deploy VPC attachments. Apply the `vpc-attachment` module in each workload account. At this point each spoke VPC has two routes for `10.0.0.0/8`: the existing one pointing at the standalone TGW, and the new one pointing at the hub. With identical prefix lengths, traffic still flows through the standalone path. Rollback at this stage is `terragrunt destroy` on the attachment module --- under five minutes.
T+30 --- Verify routes and test cross-account connectivity. Confirm hub routes are present in every private route table:

```bash
aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=vpc-xxxxx" \
  --query 'RouteTables[*].Routes[?DestinationCidrBlock==`10.0.0.0/8`]'
```
Then test actual cross-account traffic: connect from a dev instance to a service in the nonprod VPC. The hub TGW and Inspection VPC should route it correctly. This also validates that the firewall rule groups are permitting expected traffic --- catch any rule issues here, before cutting over production.
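A minimal version of that check, with the hostname, log group name, and source prefix all illustrative:

```bash
# From an instance in the dev VPC: can we reach a nonprod service via the hub?
nc -zv -w 5 internal-service.nonprod.example.internal 443

# Confirm the flow shows up in the firewall logs instead of being silently dropped.
aws logs filter-log-events \
  --log-group-name /inspection/network-firewall/flow \
  --filter-pattern '"10.20."' \
  --max-items 5
```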
T+45 --- Migrate security tooling. Apply the updated security-baseline to each workload account. GuardDuty converts from standalone admin to member; findings flow to the delegated admin in the Security account. The local CloudTrail trail is disabled; the organization trail is confirmed to be logging events from the account. Zero network impact.

```bash
# Verify GuardDuty membership
aws guardduty get-administrator-account --detector-id <id>
# Returns the Security account as administrator

# Verify the organization trail is capturing events
# Make an API call, wait ~15 minutes, check the Security account's S3 bucket
aws s3 ls s3://<org-trail-bucket>/AWSLogs/<account-id>/
```
T+60 --- Set `create_transit_gateway = false` in each spoke. This is the cutover. Run `terraform plan` first and confirm it shows only the TGW and its attached resources being destroyed --- nothing else. Apply dev first, watch the destruction complete, confirm application traffic is flowing through the hub. Then apply nonprod. About 3 minutes per account.
T+90 --- Health checks and close. Spot-check API endpoints, database connectivity, anything that traverses the network. Confirm egress traffic is hitting the firewall logs in the Infrastructure account. The maintenance window closed at the 90-minute mark; the actual work was done by T+75. We kept the window open for the last 15 minutes as a buffer.

The parallel attachment approach ensured there was never a moment where a workload account had no routing path. Even if the hub TGW had been misconfigured, traffic would have continued flowing through the standalone gateway until we chose to destroy it.
What We Ended Up With
One TGW in the Infrastructure account with three spoke attachments. Route tables that allow workload→internet traffic while preventing workload→workload lateral movement by default.

One Inspection VPC with Network Firewall endpoints in two AZs. All egress inspected against stateful domain filter rules and stateless port rules. All ingress from the centralized ALB inspected. Firewall policy updates apply to all workload accounts simultaneously --- no per-account changes needed.

One centralized ALB in the Infrastructure account, routing to EKS target groups in workload accounts via cross-account IAM role assumption. Workload accounts carry no public-facing load balancers.
One security console in the Security account. GuardDuty findings from all accounts in a single dashboard. CloudTrail logs from every account in one S3 bucket. Security Hub compliance posture for the full organization visible in one place.

Cost went from roughly $150--200/month (standalone TGWs, per-account NAT, independent security tooling) to approximately $50/month (single hub TGW plus attachment hours, shared NAT in the Inspection VPC, delegated security services). The savings were validated against AWS Cost Explorer after 30 days.

The original security scanner request --- cross-account access from nonprod to dev --- was live the same day. The compliance team had a single GuardDuty and Security Hub dashboard the same week.
More importantly: adding a new workload account to this architecture now takes about an hour. Create the VPC, deploy the `vpc-attachment` module pointing at the documented hub TGW ID, invite the new account as a GuardDuty and Security Hub member, and apply the security-baseline with `enable_guardduty = false`. Every new account inherits the full inspection and security posture without any per-account configuration. That's the actual value of a hub-spoke design --- not the one-time cost savings, but the fact that account seven is as well-secured and as easy to audit as account two.
Working through a multi-account network redesign, or building the inspection layer on top of an existing Transit Gateway setup? Get in touch --- this is the kind of platform architecture I work on regularly.