Originally published on graycloudarch.com.
Two weeks after the platform went live — right after we onboarded our first high-volume content provider — I pulled up AWS Cost Explorer. ~$1,300/day in data transfer, and still climbing.
The architecture made sense when we designed it. Hub-and-spoke with a centralized inspection VPC: all internet-bound egress routes through Transit Gateway, then Network Firewall, then a NAT Gateway out. At the traffic volumes we anticipated pre-launch, the per-GB processing fees were a rounding error. At the traffic we were actually running, they weren't.
Four hours later, it was ~$175/day. The savings: ~$34,000/month.
Here's what was actually wrong, what we changed, and the incident that happened anyway.
What $0.130/GB buys you
The hub-and-spoke design routes all egress through a shared inspection VPC in a separate AWS account. The intent: centralize threat detection at the perimeter, enforce uniform security policy across workload accounts, keep each workload VPC clean. On paper, it's the right call.
Every byte of internet-bound traffic from an ECS task crosses three metered hops before it reaches the internet:
| Hop | Per-GB cost | Purpose |
|---|---|---|
| Transit Gateway | $0.020 | Routes from workload VPC into the inspection VPC |
| Network Firewall | $0.065 | Deep packet inspection |
| NAT Gateway | $0.045 | Provides public IP for internet egress |
| Total egress | $0.130 | |
At the traffic volumes this platform was running, that added up to over $1,000/day in the workload VPC alone — before fixed attachment and endpoint fees.
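To make the arithmetic concrete, here is a minimal Python sketch that sums the per-GB rates from the table and back-calculates the traffic volume implied by a given daily spend. The function names are mine, and the $1,000/day input is the floor quoted above, so the resulting volume is a rough lower bound rather than a measured figure.

```python
# Per-GB rates from the table above (USD).
HOP_RATES_USD_PER_GB = {
    "transit_gateway": 0.020,
    "network_firewall": 0.065,
    "nat_gateway": 0.045,
}


def egress_rate_per_gb(rates=HOP_RATES_USD_PER_GB):
    """Sum the per-GB charge across every metered hop."""
    return round(sum(rates.values()), 3)


def daily_cost(gb_per_day, rate_per_gb):
    """Daily data-processing cost for a given egress volume."""
    return gb_per_day * rate_per_gb


def implied_gb_per_day(daily_spend_usd, rate_per_gb):
    """Invert daily_cost: the volume that would produce a given daily spend."""
    return daily_spend_usd / rate_per_gb


rate = egress_rate_per_gb()
volume = implied_gb_per_day(1000, rate)
print(f"{rate:.3f} USD/GB, ~{volume:,.0f} GB/day at $1,000/day")
```

At $0.130/GB, $1,000/day corresponds to roughly 7.7 TB of egress per day, which is why a fee that looked like a rounding error pre-launch stopped being one.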
What the firewall was actually inspecting
Here's the part that doesn't surface in the architecture review: AWS service traffic — S3, ECR, Secrets Manager, SSM, CloudWatch — was already exiting via VPC Interface Endpoints. Private DNS resolved those service names to VPC endpoint IPs inside the workload VPC. That traffic never entered the inspection VPC at all.
What was actually going through the Network Firewall? Outbound HTTP calls from application code to external APIs. And that traffic doesn't benefit from NFW inspection for a straightforward reason: Network Firewall is designed to block inbound threats at the perimeter. It has no meaningful way to filter outbound API calls made by application code without also breaking the application. You'd need an explicit allow rule for every legitimate destination, which is impractical at API-call volume and variety.
We were paying $0.065/GB to pass traffic through a firewall that couldn't act on it.
Moving tasks out of the inspection path
The fix is, embarrassingly, the standard AWS ECS Fargate deployment pattern.
Add an Internet Gateway and public subnets to each workload VPC. Move tasks there, assign public IPs, and scope the TGW default route from 0.0.0.0/0 down to 10.0.0.0/8. Internet-bound egress exits via the local IGW at $0/GB. Internal traffic — responses routed back to the ALB in the infrastructure account — still traverses TGW, which is required for cross-account routing.
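The routing change is easier to reason about as a longest-prefix lookup. The sketch below models the public route table after the change with just the two routes described here; the table contents and the sample IP addresses are illustrative assumptions, and a real VPC route table also carries a local route for the VPC CIDR.

```python
import ipaddress

# Hypothetical public route table after the change: (destination CIDR, target).
PUBLIC_ROUTE_TABLE = [
    ("10.0.0.0/8", "transit-gateway"),
    ("0.0.0.0/0", "internet-gateway"),
]


def next_hop(destination_ip, routes=PUBLIC_ROUTE_TABLE):
    """Return the target of the most specific route matching destination_ip."""
    addr = ipaddress.ip_address(destination_ip)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes
        if addr in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None
    # The longest prefix wins, as it does in a VPC route table.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


print(next_hop("10.20.1.5"))   # internal address (made up for the example)
print(next_hop("52.94.0.10"))  # external address (made up for the example)
```

Anything inside 10.0.0.0/8 still goes to the TGW; everything else falls through to the default route and leaves via the IGW without touching a metered hop.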
The Terraform changes were small. The network module already had the flags:
# Workload VPC — network module (flags already existed, just needed enabling)
create_public_subnets = true
create_internet_gateway = true
# Workload VPC — network-attachment module
# was: destination_cidr_block = "0.0.0.0/0"
destination_cidr_block = "10.0.0.0/8"
# ECS service configs — all services, all environments
subnet_ids = dependency.network.outputs.public_subnet_ids
assign_public_ip = true
The apply sequence matters. Don't run these as a single run-all:
1. Apply the network module (creates IGW, public subnets, public route table with 0.0.0.0/0 → IGW)
2. Apply the network-attachment module (replaces 0.0.0.0/0 → TGW with 10.0.0.0/8 → TGW; adds public route table to the TGW attachment scope)
3. Apply ECS service configs (rolling subnet replacement via ALB health-check drain, no downtime)
Step 2 has a brief window — seconds — between destroying the old default TGW route and creating the new 10.0.0.0/8 route, during which tasks in private subnets lose internet egress. We scheduled that apply during low-traffic hours.
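One way to enforce that ordering is a small wrapper that applies one unit at a time and stops on the first failure. This is only a sketch: the three directory names are hypothetical stand-ins for wherever your Terragrunt units live, and in our case the last step covered several services and environments rather than a single directory.

```python
import subprocess

# Hypothetical Terragrunt unit directories; substitute your own layout.
APPLY_ORDER = ["network", "network-attachment", "ecs-services"]


def build_plan(order=APPLY_ORDER):
    """Return (working_dir, command) pairs in the order they must run."""
    return [(unit, ["terragrunt", "apply"]) for unit in order]


def apply_in_order(plan, runner=subprocess.run):
    """Run each apply to completion before starting the next.

    check=True makes subprocess.run raise on a non-zero exit code,
    so a failed step halts the sequence instead of rolling tasks anyway.
    """
    for working_dir, command in plan:
        runner(command, cwd=working_dir, check=True)
```

Calling `apply_in_order(build_plan())` runs the real applies; passing a different `runner` lets you dry-run the sequence first.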
The incident that happened anyway
We applied the ECS service configs pointing tasks at the public subnets. The deployment stalled almost immediately:
ResourceInitializationError: unable to retrieve secret … context deadline exceeded
New tasks couldn't reach Secrets Manager.
The cause was a gotcha buried in the Fargate documentation: map_public_ip_on_launch = true on the subnet is silently ignored by ECS Fargate. The task's network configuration must explicitly set assignPublicIp = ENABLED. Setting it only on the subnet does nothing.
Tasks in public subnets without a public IP have no path to the internet. With TGW now scoped to 10.0.0.0/8, there was no route to Secrets Manager either — the workload VPC had no Secrets Manager endpoint, and the previous internet path via the NAT Gateway was gone. The tasks couldn't initialize.
The existing tasks on the old deployment — still running in private subnets — kept serving all traffic throughout. No user-facing disruption.
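A check like the following would have caught this before the rollout. It is a sketch that works on plain dictionaries shaped like the ECS DescribeServices response (networkConfiguration → awsvpcConfiguration → subnets / assignPublicIp); the function name, service names, and subnet IDs are made up for illustration, and treating a missing assignPublicIp as DISABLED is an assumption of this sketch.

```python
def services_missing_public_ip(services, public_subnet_ids):
    """Return names of services in public subnets without assignPublicIp=ENABLED.

    `services` follows the shape of the ECS DescribeServices response.
    """
    public = set(public_subnet_ids)
    flagged = []
    for service in services:
        vpc_cfg = service.get("networkConfiguration", {}).get("awsvpcConfiguration", {})
        in_public_subnet = bool(public & set(vpc_cfg.get("subnets", [])))
        # Assumption: a missing value is treated the same as DISABLED.
        has_public_ip = vpc_cfg.get("assignPublicIp", "DISABLED") == "ENABLED"
        if in_public_subnet and not has_public_ip:
            flagged.append(service["serviceName"])
    return flagged


example_services = [
    {
        "serviceName": "api",
        "networkConfiguration": {
            "awsvpcConfiguration": {"subnets": ["subnet-pub1"], "assignPublicIp": "DISABLED"}
        },
    },
    {
        "serviceName": "worker",
        "networkConfiguration": {
            "awsvpcConfiguration": {"subnets": ["subnet-pub1"], "assignPublicIp": "ENABLED"}
        },
    },
]
print(services_missing_public_ip(example_services, ["subnet-pub1"]))
```

The subnet-level setting never enters the check, which mirrors how Fargate behaves: only the task's own network configuration counts.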
Full timeline (all times UTC-6):
- 12:26: ECS service configs applied (public subnets, assign_public_ip not yet set to true)
- 12:27: New tasks begin launching; fail with ResourceInitializationError
- 12:40: Root cause confirmed: assign_public_ip hardcoded false in the ECS service module
- 12:40: Second apply with assign_public_ip = true
- 12:44: New tasks with public IPs launch successfully
- 12:46: Old tasks drained
- 12:48: All services steady state
18 minutes from the first apply at 12:26 to new tasks launching successfully at 12:44, and 21 minutes from first failure to steady state.
Public subnets and security groups
The question I had to work through before making this change: does assigning a public IP to a Fargate task actually change the security posture?
No — with one condition.
A public IP on a Fargate task does not open any inbound ports. The security group is the effective security boundary, not the subnet type. If your tasks accept inbound connections only from the ALB security group, assigning a public IP changes the routing path for egress but doesn't expand the attack surface. No port becomes reachable from the internet that wasn't already reachable via the ALB.
This is the documented AWS deployment pattern. The ECS console, many AWS sample deployments, and the official Fargate getting-started guide default to public subnets with auto-assigned public IPs. The configuration we'd been running was the non-default, expensive variant, without a corresponding security benefit.
The one condition that matters: security group policy must not drift. With tasks on private subnets, a misconfigured security group that accidentally opens a port isn't directly internet-reachable. On public subnets, it is. IaC-only deployments and security group review in CI mitigate this, but it's worth knowing before you make the change.
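As a rough illustration of what a security group review in CI could look like, the sketch below flags any ingress rule that admits a source other than the ALB security group. It assumes rules shaped like the IpPermissions list from EC2 DescribeSecurityGroups; the function name and the security group IDs in the example are hypothetical.

```python
def non_alb_ingress(ip_permissions, alb_security_group_id):
    """Return ingress rules that admit any source other than the ALB security group.

    `ip_permissions` follows the IpPermissions shape from EC2 DescribeSecurityGroups.
    """
    flagged = []
    for rule in ip_permissions:
        cidr_sources = rule.get("IpRanges", []) + rule.get("Ipv6Ranges", [])
        prefix_lists = rule.get("PrefixListIds", [])
        other_groups = [
            pair
            for pair in rule.get("UserIdGroupPairs", [])
            if pair.get("GroupId") != alb_security_group_id
        ]
        # Any CIDR, prefix list, or non-ALB group source widens exposure.
        if cidr_sources or prefix_lists or other_groups:
            flagged.append(rule)
    return flagged


example_rules = [
    {"IpProtocol": "tcp", "FromPort": 8080, "ToPort": 8080,
     "UserIdGroupPairs": [{"GroupId": "sg-alb"}]},
    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
     "IpRanges": [{"CidrIp": "10.0.0.0/8"}]},
]
print(len(non_alb_ingress(example_rules, "sg-alb")))
```

Failing the pipeline whenever this returns a non-empty list keeps the "inbound only from the ALB" assumption from eroding unnoticed.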
Before and after
| | Before | After |
|---|---|---|
| Egress cost/GB | $0.130 | $0.000 |
| Ingress cost/GB | $0.085 | $0.085 |
| NFW endpoint fees | ~$570/mo | ~$570/mo |
| TGW attachment fees | ~$147/mo | ~$147/mo |
| Est. daily total | ~$1,300 | ~$175 |
| Est. monthly savings | — | ~$34,000 |
The TGW attachment hourly fees are unavoidable — the ALB lives in a separate account, and cross-account routing requires TGW regardless. But TGW data processing charges on egress ($0.020/GB) are eliminated because egress no longer traverses TGW. The NFW endpoint fees stay because inbound traffic still routes through the inspection VPC — the hub-and-spoke architecture is doing the right job for inbound, just not for egress.
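The monthly figure follows directly from the two daily totals in the table. A quick sketch of the arithmetic, assuming a 30-day month:

```python
DAILY_BEFORE_USD = 1300
DAILY_AFTER_USD = 175
DAYS_PER_MONTH = 30  # assumption: a flat 30-day month


def monthly_savings(before, after, days=DAYS_PER_MONTH):
    """Difference in daily spend, extended over a month."""
    return (before - after) * days


savings = monthly_savings(DAILY_BEFORE_USD, DAILY_AFTER_USD)
reduction = 1 - DAILY_AFTER_USD / DAILY_BEFORE_USD
print(f"~${savings:,}/month, {reduction:.0%} lower daily spend")
```

That works out to $33,750, which rounds to the ~$34,000/month above.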
What to verify before doing this
VPC Interface Endpoints for AWS services. If your workload VPCs don't have endpoints for the services your tasks call (Secrets Manager, ECR, SSM, S3), tasks need a working internet path to reach them. The incident above is what happens when you assume coverage you don't have. Audit your endpoint list before moving tasks to public subnets.
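A small sketch of that audit: compare the endpoint service names present in a VPC against the services your tasks depend on. The required list here is my assumption (ECR typically needs both its API and Docker endpoints, and CloudWatch Logs is included for log shipping), and the region in the example is arbitrary. The service-name format matches what EC2 DescribeVpcEndpoints returns in ServiceName.

```python
# Service suffixes the tasks are assumed to depend on; adjust to your workload.
REQUIRED_SERVICES = ["secretsmanager", "ecr.api", "ecr.dkr", "ssm", "s3", "logs"]


def missing_endpoints(existing_service_names, region, required=REQUIRED_SERVICES):
    """Return required services with no matching VPC endpoint.

    `existing_service_names` are ServiceName values such as
    'com.amazonaws.us-east-1.secretsmanager'.
    """
    present = set(existing_service_names)
    return [svc for svc in required if f"com.amazonaws.{region}.{svc}" not in present]


existing = [
    "com.amazonaws.us-east-1.s3",
    "com.amazonaws.us-east-1.ecr.api",
]
print(missing_endpoints(existing, "us-east-1"))
```

Any service this returns must be reachable over the internet path instead, so a task without a working public IP will fail the same way ours did.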
Security group inbound rules. Tasks should accept inbound only from the ALB security group. Anything broader — open to the VPC CIDR, open to a management CIDR — becomes internet-reachable when tasks get public IPs.
TGW route table coverage. The 10.0.0.0/8 → TGW route on the public route table has to cover every internal CIDR you need to route. If the ALB (or any other internal resource) is at an address outside 10.0.0.0/8, task responses will attempt to exit via the IGW and be silently dropped.
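This one is easy to check mechanically. The sketch below lists any internal CIDR that is not fully contained in the TGW route; the CIDRs in the example are made up.

```python
import ipaddress


def cidrs_outside_route(internal_cidrs, route_cidr="10.0.0.0/8"):
    """Return the internal CIDRs not fully covered by the TGW route."""
    route = ipaddress.ip_network(route_cidr)
    return [
        cidr
        for cidr in internal_cidrs
        if not ipaddress.ip_network(cidr).subnet_of(route)
    ]


# Hypothetical internal ranges: two inside 10.0.0.0/8, one outside it.
print(cidrs_outside_route(["10.1.0.0/16", "172.16.4.0/24", "10.200.0.0/20"]))
```

Anything it returns needs its own route to the TGW on the public route table before tasks move.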
The apply sequence. Don't apply the network module changes and the ECS service configs in the same Terragrunt run-all. The route table changes and subnet reassignment need to be sequenced, and rolling tasks before the route tables are stable creates a connectivity gap much like the one behind our incident.
The hub-and-spoke design is the right architecture for inbound inspection at the perimeter. It's the wrong tool for filtering outbound application API calls at volume — and the cost difference at scale is significant.
Working through a similar cost problem, or figuring out which parts of your egress path are actually doing useful work? Get in touch — this is the kind of architecture review I do regularly.

