Zero-Downtime AWS Transit Gateway Hub-Spoke Migration

Glenn Gray

Originally published at graycloudarch.com.


The request came from the security team: they needed network-level access from the nonprod account to the dev account so a vulnerability scanner could reach internal services. Simple enough on the surface. In practice, it exposed a gap we'd been living with for months — and forced us to fix the network architecture we'd been deferring.

We had three standalone Transit Gateways: one in each workload account (dev, nonprod, and prod). Completely isolated from each other. No cross-account connectivity at all. The security scanner couldn't reach its targets, and adding more point-to-point peering connections to fix it would have made everything worse.

But the TGW isolation was only part of the problem. We also had no inspection of traffic crossing our network boundary. Egress from workload pods went straight to the internet with no filtering. Ingress came through per-account load balancers with no centralized enforcement point. As the platform scaled toward additional workload accounts, this pattern was going to get expensive and hard to reason about.

So we didn't just fix the TGW. We rebuilt the network foundation: a centralized Inspection VPC with a Network Firewall inline, a single hub Transit Gateway shared across all accounts, and centralized security tooling (GuardDuty, CloudTrail, Security Hub) aggregated in a dedicated Security account. Two maintenance windows, a few weeks of module work, and the platform went from fragmented per-account networking to a coherent hub-spoke design with full traffic inspection.

The Architecture We Were Replacing

Before the migration, each workload account was self-contained. It had its own TGW, its own internet gateway, its own NAT gateways. Security tooling ran independently in each account with no aggregation. The management account had no single-pane visibility into what was happening across the environment.

Before: Three isolated workload accounts — each with its own IGW, NAT Gateway, and standalone Transit Gateway, no cross-account connectivity

The cost of running this way was about $150/month in TGW charges plus duplicated NAT gateway charges in each account. Every new workload account would add another set of these charges and another independent security configuration to keep in sync.

The Target: Inspection VPC + Hub Transit Gateway

The target was AWS Security Reference Architecture Pattern B: an Inspection VPC that sits between the internet and all workload VPCs. All internet traffic — ingress and egress — flows through this VPC and through a Network Firewall before reaching any workload account.

After: Centralized hub with inline Network Firewall inspection — all traffic flows through the Infrastructure Account's Inspection VPC before reaching any workload

Egress path: workload pod → TGW → Inspection VPC TGW subnets → Network Firewall → NAT Gateway → IGW → internet.

Ingress path: internet → IGW → centralized ALB (public subnet) → Network Firewall → TGW → workload VPC → pod.

Nothing crosses the network boundary without passing through the firewall. Workload accounts carry no internet-facing infrastructure at all — no IGW, no NAT gateways, no public load balancers.

Phase 1: Module Changes

All Terraform work happened before scheduling any maintenance. The goal was to reach a state where the migration itself was just running pre-staged plan files in a specific sequence.

Transit Gateway: add a conditional create flag

The existing network module always created a TGW. We needed spoke accounts to declare the same module without spinning up their own gateway:

variable "create_transit_gateway" {
  description = "Whether to create a Transit Gateway (false for hub-spoke spokes)"
  type        = bool
  default     = true
}

resource "aws_ec2_transit_gateway" "this" {
  count       = var.create_transit_gateway ? 1 : 0
  description = var.tgw_description
}

output "transit_gateway_id" {
  value = var.create_transit_gateway ? aws_ec2_transit_gateway.this[0].id : null
}

default = true means existing configurations need no changes. The flag only flips to false after the spoke attachment is confirmed working.

New module: vpc-attachment

The vpc-attachment module handles the spoke side of the hub relationship: create the TGW attachment, associate it to the hub's route table, and add routes to every private route table in the spoke VPC pointing at the hub TGW.

resource "aws_ec2_transit_gateway_vpc_attachment" "this" {
  transit_gateway_id = var.transit_gateway_id
  vpc_id             = var.vpc_id
  subnet_ids         = var.subnet_ids

  tags = merge(var.tags, {
    Name = "${var.name}-hub-attachment"
  })
}

resource "aws_ec2_transit_gateway_route_table_association" "this" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.this.id
  transit_gateway_route_table_id = var.transit_gateway_route_table_id
}

resource "aws_route" "to_hub_tgw" {
  for_each               = toset(var.vpc_route_table_ids)
  route_table_id         = each.value
  destination_cidr_block = "10.0.0.0/8"
  transit_gateway_id     = var.transit_gateway_id
}

The 10.0.0.0/8 supernet covers all workload and Inspection VPC CIDRs without maintaining per-prefix route entries. It also covers the Inspection VPC CIDR (10.100.0.0/20) — that's how return traffic from the centralized ALB finds its way back to pods in workload VPCs.

The Terragrunt config for a spoke account reads VPC details from the existing network dependency and hardcodes the hub TGW identifiers:

dependency "network" {
  config_path = "../network"
  mock_outputs = {
    vpc_id                  = "vpc-mockid"
    private_subnet_ids      = ["subnet-mock1"]
    private_route_table_ids = ["rtb-mock1"]
  }
}

inputs = {
  transit_gateway_id             = "tgw-xxxxx"   # hub TGW, documented in runbook
  transit_gateway_route_table_id = "tgw-rtb-xxxxx"
}

We hardcoded the hub TGW and route table IDs rather than using cross-account data sources. The alternative — reading TGW details from the Infrastructure account at plan time — requires cross-account state access and adds complexity that isn't worth it for values that change maybe once in the platform's lifetime.

Hub route tables: workload isolation by default

A key design decision: workload accounts should not route to each other directly. Dev should not reach nonprod; nonprod should not reach prod. The hub TGW enforces this through route table structure:

  • default-association-rt: all workload attachments associate here. The only route is 0.0.0.0/0 → inspection attachment. Workloads can reach the internet via the Inspection VPC, but cannot reach other workload VPCs.
  • default-propagation-rt: the inspection attachment associates here, and workload attachments propagate their CIDRs into it so inspected return traffic routes back to the right spoke.

Inter-account communication is opt-in: you add an explicit route table entry for a specific attachment pair. By default, the architecture prevents lateral movement across workload accounts.
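As a sketch only (resource and variable names here are illustrative, not our actual module), the hub route table structure could be declared like this, with default association and propagation disabled so every relationship is explicit:

```hcl
# Hub TGW: disable default tables so association/propagation are explicit
resource "aws_ec2_transit_gateway" "hub" {
  description                     = "hub-tgw"
  default_route_table_association = "disable"
  default_route_table_propagation = "disable"
}

# All workload attachments associate here; the only route points at inspection
resource "aws_ec2_transit_gateway_route_table" "default_association" {
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
  tags               = { Name = "default-association-rt" }
}

resource "aws_ec2_transit_gateway_route" "to_inspection" {
  destination_cidr_block         = "0.0.0.0/0"
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.default_association.id
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.inspection.id
}

# Inspection attachment associates here; workload attachments propagate their
# CIDRs into it so return traffic finds the right spoke
resource "aws_ec2_transit_gateway_route_table" "default_propagation" {
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
  tags               = { Name = "default-propagation-rt" }
}

resource "aws_ec2_transit_gateway_route_table_propagation" "workloads" {
  for_each                       = toset(var.workload_attachment_ids) # illustrative
  transit_gateway_attachment_id  = each.value
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.default_propagation.id
}
```

Because no workload CIDRs appear in the association table, a packet from dev addressed to nonprod has no direct path; it can only go to the inspection attachment.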

Inspection VPC subnet layout

The Inspection VPC has three tiers with carefully constructed route tables that force traffic through the firewall in both directions:

Inspection VPC subnet layout — three tiers (public, firewall, TGW) with asymmetric route tables that force all traffic through Network Firewall endpoints in both directions

The asymmetric route table design ensures the firewall sees every packet crossing the network boundary, regardless of direction. Traffic entering from the internet hits the firewall before reaching workloads. Traffic from workloads hits the firewall before reaching the internet.
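A minimal sketch of those asymmetric route tables (the `var.*` identifiers are illustrative; a real deployment needs one route table per AZ so each subnet steers to its local firewall endpoint):

```hcl
# TGW subnets: everything arriving from workloads goes to the firewall first
resource "aws_route" "tgw_subnet_default" {
  route_table_id         = var.tgw_subnet_route_table_id
  destination_cidr_block = "0.0.0.0/0"
  vpc_endpoint_id        = var.firewall_endpoint_id # Network Firewall endpoint in this AZ
}

# Firewall subnets: inspected egress continues to the NAT gateway
resource "aws_route" "firewall_subnet_default" {
  route_table_id         = var.firewall_subnet_route_table_id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = var.nat_gateway_id
}

# Firewall subnets: inspected return traffic heads back to workloads via the TGW
resource "aws_route" "firewall_subnet_to_workloads" {
  route_table_id         = var.firewall_subnet_route_table_id
  destination_cidr_block = "10.0.0.0/8"
  transit_gateway_id     = var.hub_transit_gateway_id
}

# Public subnets: ALB traffic bound for workloads is steered through the
# firewall endpoint rather than directly at the TGW
resource "aws_route" "public_subnet_to_workloads" {
  route_table_id         = var.public_subnet_route_table_id
  destination_cidr_block = "10.0.0.0/8"
  vpc_endpoint_id        = var.firewall_endpoint_id
}
```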

Security baseline: convert to delegated admin model

GuardDuty and CloudTrail were running independently per account. We added enable_guardduty and enable_cloudtrail boolean variables to the security-baseline module so workload accounts could switch from standalone to member without touching the module invocation itself.
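The boolean pattern mirrors the TGW flag. A sketch of the GuardDuty side (names are illustrative):

```hcl
variable "enable_guardduty" {
  description = "Create a standalone GuardDuty detector (false once the account is an org member)"
  type        = bool
  default     = true
}

# Standalone detector only exists while the account is not yet an org member
resource "aws_guardduty_detector" "this" {
  count  = var.enable_guardduty ? 1 : 0
  enable = true
}
```

Flipping the flag to false removes the standalone detector; from then on the delegated admin in the Security account manages the account's detector through the organization configuration.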

In the Security account, we deployed:

  • GuardDuty as delegated admin with organization-level auto-enrollment. EKS Protection and S3 Protection enabled. All findings from all accounts visible in a single dashboard.
  • CloudTrail organization trail writing to a cross-account S3 bucket. Log file validation and KMS encryption enabled. Per-account trails archived after the cutover — not deleted, in case historical log formats differed.
  • Security Hub with CIS AWS Foundations Benchmark and AWS Foundational Security Best Practices enabled across the full organization.

Phase 2: Two Maintenance Windows

Window 1: Deploy the hub (~45 minutes, low risk)

With no existing attachments and no workload traffic, deploying the hub infrastructure carried minimal risk. We applied the Infrastructure account TGW and Inspection VPC in a single window. The Network Firewall takes 5–10 minutes to reach READY state after creation — account for that in your timing.

At the end of this window: hub TGW running, Inspection VPC active, Network Firewall endpoints healthy in both AZs, centralized ALB deployed. Nothing attached yet. We documented the TGW ID and route table IDs in the runbook before scheduling window 2.

Window 2: Spoke cutover (~2 hours)

The key insight for keeping applications running: create the hub attachment before destroying the standalone TGW. While both exist simultaneously, traffic continues flowing through the standalone path. The actual cutover is updating routes to point at the hub — that's a single terragrunt apply, not the destruction of the old TGW.

T+0 — Accept RAM share. Infrastructure account shares the hub TGW via Resource Access Manager. Workload accounts accept the share invitation. Pure metadata operation; zero network impact.
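The Infrastructure-account side of the share might look like this sketch (`var.workload_account_ids` is illustrative; within an Organization with RAM sharing enabled the invitation can be auto-accepted, otherwise each account accepts it manually):

```hcl
resource "aws_ram_resource_share" "tgw" {
  name                      = "hub-tgw-share"
  allow_external_principals = false # org-internal only
}

resource "aws_ram_resource_association" "tgw" {
  resource_arn       = aws_ec2_transit_gateway.hub.arn
  resource_share_arn = aws_ram_resource_share.tgw.arn
}

resource "aws_ram_principal_association" "workloads" {
  for_each           = toset(var.workload_account_ids)
  principal          = each.value
  resource_share_arn = aws_ram_resource_share.tgw.arn
}
```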

T+15 — Deploy VPC attachments. Apply the vpc-attachment module in each workload account. At this point each spoke VPC has two routes for 10.0.0.0/8: the existing one pointing at the standalone TGW, and the new one pointing at the hub. With identical prefix lengths, traffic still flows through the standalone path. Rollback at this stage is terragrunt destroy on the attachment module — under five minutes.

T+30 — Verify routes and test cross-account connectivity. Confirm hub routes are present in every private route table:

aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=vpc-xxxxx" \
  --query 'RouteTables[*].Routes[?DestinationCidrBlock==`10.0.0.0/8`]'

Then test actual cross-account traffic: connect from a dev instance to a service in the nonprod VPC. The hub TGW and Inspection VPC should route it correctly. This also validates that the firewall rule groups are permitting expected traffic — catch any rule issues here, before cutting over production.

T+45 — Migrate security tooling. Apply the updated security-baseline to each workload account. GuardDuty converts from standalone admin to member; findings flow to the Security account delegated admin. CloudTrail local trail disabled; organization trail confirmed logging events from the account. Zero network impact.

# Verify GuardDuty membership
aws guardduty get-administrator-account --detector-id <id>
# Returns the Security account as administrator

# Verify organization trail is capturing events
# Make an API call, wait ~15 minutes, check the Security account's S3 bucket
aws s3 ls s3://<org-trail-bucket>/AWSLogs/<account-id>/
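The delegated-admin wiring itself is two resources. A sketch, assuming a recent AWS provider (the detector reference and variable are illustrative):

```hcl
# Applied from the Organizations management account: delegate GuardDuty admin
resource "aws_guardduty_organization_admin_account" "this" {
  admin_account_id = var.security_account_id
}

# Applied in the Security (delegated admin) account: auto-enroll every
# current and future member account
resource "aws_guardduty_organization_configuration" "this" {
  detector_id                      = aws_guardduty_detector.admin.id
  auto_enable_organization_members = "ALL"
}
```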

T+60 — Set create_transit_gateway = false in each spoke. This is the cutover. Run terraform plan first and confirm it shows only the TGW and its attached resources being destroyed — nothing else. Apply dev first, watch the destruction complete, confirm application traffic is flowing through the hub. Then apply nonprod. About 3 minutes per account.

T+90 — Health checks and close. Spot-check API endpoints, database connectivity, anything that traverses the network. Confirm egress traffic is hitting the firewall logs in the Infrastructure account. The maintenance window closed at the 90-minute mark; actual work was done by T+75. We kept the window open for the last 15 minutes as a buffer.

The parallel attachment approach ensured there was never a moment where a workload account had no routing path. Even if the hub TGW had been misconfigured, traffic would have continued flowing through the standalone gateway until we chose to destroy it.

What We Ended Up With

One TGW in the Infrastructure account with three spoke attachments. Route tables that allow workload→internet traffic while preventing workload→workload lateral movement by default.

One Inspection VPC with Network Firewall endpoints in two AZs. All egress inspected against stateful domain filter rules and stateless port rules. All ingress from the centralized ALB inspected. Firewall policy updates apply to all workload accounts simultaneously — no per-account changes needed.
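A stateful domain-list rule group of the kind described could be sketched like this (the rule group name, capacity, and domain targets are illustrative, not our production policy):

```hcl
resource "aws_networkfirewall_rule_group" "egress_domains" {
  name     = "egress-domain-allowlist"
  type     = "STATEFUL"
  capacity = 100

  rule_group {
    rules_source {
      rules_source_list {
        generated_rules_type = "ALLOWLIST"
        target_types         = ["TLS_SNI", "HTTP_HOST"]
        targets              = [".amazonaws.com", ".github.com"] # illustrative
      }
    }
  }
}
```

Because the rule group is attached to the single firewall policy in the Inspection VPC, adding a domain here takes effect for every workload account on the next apply.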

One centralized ALB in the Infrastructure account, routing to EKS target groups in workload accounts via cross-account IAM role assumption. Workload accounts carry no public-facing load balancers.

One security console in the Security account. GuardDuty findings from all accounts in a single dashboard. CloudTrail logs from every account in one S3 bucket. Security Hub compliance posture for the full organization visible in one place.

Cost went from roughly $150–200/month (standalone TGWs, per-account NAT, independent security tooling) to approximately $50/month (single hub TGW plus attachment hours, shared NAT in the Inspection VPC, delegated security services). Cost savings validated against AWS Cost Explorer after 30 days.

The original security scanner request — cross-account access from nonprod to dev — was live the same day. The compliance team had a single GuardDuty and Security Hub dashboard the same week.

More importantly: adding a new workload account to this architecture now takes about an hour. Create the VPC, deploy the vpc-attachment module pointing at the documented hub TGW ID, invite the new account as a GuardDuty and Security Hub member, apply the security-baseline with enable_guardduty = false. Every new account inherits the full inspection and security posture without any per-account configuration. That's the actual value of a hub-spoke design — not the one-time cost savings, but the fact that account seven is as well-secured and as easy to audit as account two.


Working through a multi-account network redesign, or building the inspection layer on top of an existing Transit Gateway setup? Get in touch — this is the kind of platform architecture I work on regularly.
