DEV Community: Glenn Gray

AWS Egress Cost Elimination: ECS Fargate on Public Subnets

Glenn Gray — Tue, 30 Jun 2026 16:20:18 +0000

Originally published on graycloudarch.com.

Two weeks after the platform went live — right after we onboarded our first high-volume content provider — I pulled up AWS Cost Explorer. ~$1,300/day in data transfer, and still climbing.

The architecture made sense when we designed it. Hub-and-spoke with a centralized inspection VPC: all internet-bound egress routes through Transit Gateway, then Network Firewall, then a NAT Gateway out. At the traffic volumes we anticipated pre-launch, the per-GB processing fees were a rounding error. At the traffic we were actually running, they weren't.

Four hours later, it was ~$175/day. The savings: ~$34,000/month.

Here's what was actually wrong, what we changed, and the incident that happened anyway.

What $0.130/GB buys you

The hub-and-spoke design routes all egress through a shared inspection VPC in a separate AWS account. The intent: centralize threat detection at the perimeter, enforce uniform security policy across workload accounts, keep each workload VPC clean. On paper, it's the right call.

Every byte of internet-bound traffic from an ECS task crosses three metered hops before it reaches the internet:

Hop	Per-GB cost	Purpose
Transit Gateway	$0.020	Routes from workload VPC into the inspection VPC
Network Firewall	$0.065	Deep packet inspection
NAT Gateway	$0.045	Provides public IP for internet egress
Total egress	$0.130

At the traffic volumes this platform was running, that added up to over $1,000/day in the workload VPC alone — before fixed attachment and endpoint fees.
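To make the arithmetic concrete, here is a minimal Terraform sketch that needs no provider and can be evaluated with terraform apply or terraform console. The per-GB rates come from the table above; daily_egress_gb is a hypothetical volume chosen for illustration, not a measured figure from this platform.

```hcl
# Hypothetical input: internet-bound egress volume per day, in GB.
variable "daily_egress_gb" {
  description = "Assumed daily egress volume in GB (illustrative only)."
  type        = number
  default     = 8000
}

locals {
  # Per-GB processing rates from the table above.
  per_gb = {
    transit_gateway  = 0.020
    network_firewall = 0.065
    nat_gateway      = 0.045
  }

  # Every egress byte pays all three hops, so the rates add up.
  egress_per_gb     = sum(values(local.per_gb))
  daily_egress_cost = var.daily_egress_gb * local.egress_per_gb
}

output "egress_per_gb" {
  value = local.egress_per_gb
}

output "daily_egress_cost" {
  value = local.daily_egress_cost
}
```

At the assumed 8,000 GB/day this works out to $1,040/day in per-GB fees, which is the order of magnitude described above.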

What the firewall was actually inspecting

Here's the part that doesn't surface in the architecture review: AWS service traffic — S3, ECR, Secrets Manager, SSM, CloudWatch — was already exiting via VPC Interface Endpoints. Private DNS resolved those service names to VPC endpoint IPs inside the workload VPC. That traffic never entered the inspection VPC at all.

What was actually going through the Network Firewall? Outbound HTTP calls from application code to external APIs. And that traffic doesn't benefit from NFW inspection for a straightforward reason: Network Firewall is designed to block inbound threats at the perimeter. It has no meaningful way to filter outbound API calls made by application code without also breaking the application. You'd need an explicit allow rule for every legitimate destination, which is impractical at API-call volume and variety.

We were paying $0.065/GB to pass traffic through a firewall that couldn't act on it.

Moving tasks out of the inspection path

The fix is, embarrassingly, the standard AWS ECS Fargate deployment pattern.

Add an Internet Gateway and public subnets to each workload VPC. Move tasks there, assign public IPs, and scope the TGW default route from 0.0.0.0/0 down to 10.0.0.0/8. Internet-bound egress exits via the local IGW at $0/GB. Internal traffic — responses routed back to the ALB in the infrastructure account — still traverses TGW, which is required for cross-account routing.

The Terraform changes were small. The network module already had the flags:

# Workload VPC — network module (flags already existed, just needed enabling)
create_public_subnets   = true
create_internet_gateway = true

# Workload VPC — network-attachment module
# was: destination_cidr_block = "0.0.0.0/0"
destination_cidr_block = "10.0.0.0/8"

# ECS service configs — all services, all environments
subnet_ids       = dependency.network.outputs.public_subnet_ids
assign_public_ip = true

The apply sequence matters. Don't run these as a single run-all:

1. Apply the network module (creates IGW, public subnets, public route table with 0.0.0.0/0 → IGW)
2. Apply the network-attachment module (replaces 0.0.0.0/0 → TGW with 10.0.0.0/8 → TGW; adds public route table to the TGW attachment scope)
3. Apply ECS service configs (rolling subnet replacement via ALB health-check drain, no downtime)

Step 2 has a brief window — seconds — between destroying the old default TGW route and creating the new 10.0.0.0/8 route, during which tasks in private subnets lose internet egress. We scheduled that apply during low-traffic hours.

The incident that happened anyway

We applied the ECS service configs pointing tasks at the public subnets. The deployment stalled almost immediately:

ResourceInitializationError: unable to retrieve secret … context deadline exceeded

New tasks couldn't reach Secrets Manager.

The cause was a gotcha buried in the Fargate documentation: map_public_ip_on_launch = true on the subnet is silently ignored by ECS Fargate. The task's network configuration must explicitly set assignPublicIp = ENABLED. Setting it only on the subnet does nothing.

Tasks in public subnets without a public IP have no path to the internet. With TGW now scoped to 10.0.0.0/8, there was no route to Secrets Manager either — the workload VPC had no Secrets Manager endpoint, and the previous internet path via the NAT Gateway was gone. The tasks couldn't initialize.

The existing tasks on the old deployment — still running in private subnets — kept serving all traffic throughout. No user-facing disruption.

Full timeline (all times UTC-6):

12:26 — ECS service configs applied (public subnets, assign_public_ip not yet set to true)
12:27 — New tasks begin launching; fail with ResourceInitializationError
12:40 — Root cause confirmed: assign_public_ip hardcoded false in the ECS service module
12:40 — Second apply with assign_public_ip = true
12:44 — New tasks with public IPs launch successfully
12:46 — Old tasks drained
12:48 — All services steady state

17 minutes from first failure (12:27) to the first healthy tasks with public IPs (12:44), and 21 minutes to steady state (12:48).
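The root cause is easiest to see in the service resource itself. Below is a minimal sketch of how an ECS service module could expose the flag instead of hardcoding it. The variable names are assumptions, not the actual module's interface, and the load balancer and deployment settings are omitted for brevity.

```hcl
# Hypothetical module inputs; names are illustrative.
variable "name" { type = string }
variable "cluster_arn" { type = string }
variable "task_definition_arn" { type = string }
variable "subnet_ids" { type = list(string) }
variable "security_group_ids" { type = list(string) }

variable "desired_count" {
  type    = number
  default = 2
}

variable "assign_public_ip" {
  description = "Give each Fargate task ENI a public IP. Must be true for tasks in public subnets that rely on the IGW for egress."
  type        = bool
  default     = false
}

resource "aws_ecs_service" "this" {
  name            = var.name
  cluster         = var.cluster_arn
  task_definition = var.task_definition_arn
  desired_count   = var.desired_count
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = var.subnet_ids
    security_groups = var.security_group_ids

    # This is the setting Fargate honors. The subnet-level
    # map_public_ip_on_launch attribute has no effect on task ENIs.
    assign_public_ip = var.assign_public_ip
  }
}
```

Passing the value through as a variable keeps the decision at the call site, next to the subnet IDs it has to agree with.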

Public subnets and security groups

The question I had to work through before making this change: does assigning a public IP to a Fargate task actually change the security posture?

No — with one condition.

A public IP on a Fargate task does not open any inbound ports. The security group is the effective security boundary, not the subnet type. If your tasks accept inbound connections only from the ALB security group, assigning a public IP changes the routing path for egress but doesn't expand the attack surface. No port becomes reachable from the internet that wasn't already reachable via the ALB.

This is the documented AWS deployment pattern. The ECS console, every AWS sample deployment, and the official Fargate getting-started guide all default to public subnets with auto-assigned public IPs. The configuration we'd been running was the non-default, expensive variant — without a corresponding security benefit.

The one condition that matters: security group policy must not drift. With tasks on private subnets, a misconfigured security group that accidentally opens a port isn't directly internet-reachable. On public subnets, it is. IaC-only deployments and security group review in CI mitigate this, but it's worth knowing before you make the change.
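As a sketch of what "inbound only from the ALB security group" can look like in Terraform, assuming the ALB security group can be referenced from the task's VPC. All variable names here are hypothetical.

```hcl
# Hypothetical inputs; names are illustrative.
variable "service_name" { type = string }
variable "vpc_id" { type = string }
variable "alb_security_group_id" { type = string }

variable "container_port" {
  type    = number
  default = 8080
}

resource "aws_security_group" "task" {
  name_prefix = "${var.service_name}-task-"
  description = "Fargate tasks: inbound from the ALB only"
  vpc_id      = var.vpc_id
}

# The only ingress rule: the container port, from the ALB security group.
# No CIDR-based ingress exists, so the public IP exposes no listening port.
resource "aws_vpc_security_group_ingress_rule" "from_alb" {
  security_group_id            = aws_security_group.task.id
  referenced_security_group_id = var.alb_security_group_id
  ip_protocol                  = "tcp"
  from_port                    = var.container_port
  to_port                      = var.container_port
}

# Egress stays open so tasks can reach external APIs through the IGW.
resource "aws_vpc_security_group_egress_rule" "all_out" {
  security_group_id = aws_security_group.task.id
  cidr_ipv4         = "0.0.0.0/0"
  ip_protocol       = "-1"
}
```

If the ALB lives in another VPC and its security group cannot be referenced, the same ingress rule can be written against the ALB subnets' CIDR with cidr_ipv4 instead.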

Before and after

	Before	After
Egress cost/GB	$0.130	$0.000
Ingress cost/GB	$0.085	$0.085
NFW endpoint fees	~$570/mo	~$570/mo
TGW attachment fees	~$147/mo	~$147/mo
Est. daily total	~$1,300	~$175
Est. monthly savings	—	~$34,000

The TGW attachment hourly fees are unavoidable — the ALB lives in a separate account, and cross-account routing requires TGW regardless. But TGW data processing charges on egress ($0.020/GB) are eliminated because egress no longer traverses TGW. The NFW endpoint fees stay because inbound traffic still routes through the inspection VPC — the hub-and-spoke architecture is doing the right job for inbound, just not for egress.

What to verify before doing this

VPC Interface Endpoints for AWS services. If your workload VPCs don't have endpoints for the services your tasks call (Secrets Manager, ECR, SSM, S3), tasks need a working internet path to reach them. The incident above is what happens when you assume coverage you don't have. Audit your endpoint list before moving tasks to public subnets.

Security group inbound rules. Tasks should accept inbound only from the ALB security group. Anything broader — open to the VPC CIDR, open to a management CIDR — becomes internet-reachable when tasks get public IPs.

TGW route table coverage. The 10.0.0.0/8 → TGW route on the public route table has to cover every internal CIDR you need to route. If the ALB (or any other internal resource) is at an address outside 10.0.0.0/8, task responses will attempt to exit via the IGW and be silently dropped.

The apply sequence. Don't apply the network module changes and the ECS service configs in the same Terragrunt run-all. The route table changes and subnet reassignment need to be sequenced, and rolling tasks before the route tables are stable creates exactly the connectivity gap that caused our incident.
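The first and third checks above can also be expressed as code, so that coverage is declared rather than assumed. This is a minimal sketch with hypothetical variable names. The service list is an example and should match what your tasks actually call (S3 is typically served by a gateway endpoint and is not included here).

```hcl
# Hypothetical inputs; names and defaults are illustrative.
variable "region" { type = string }
variable "vpc_id" { type = string }
variable "endpoint_subnet_ids" { type = list(string) }
variable "endpoint_security_group_id" { type = string }
variable "public_route_table_id" { type = string }
variable "transit_gateway_id" { type = string }

variable "interface_endpoint_services" {
  description = "AWS services the tasks call at startup or runtime."
  type        = list(string)
  default     = ["secretsmanager", "ecr.api", "ecr.dkr", "ssm", "logs"]
}

variable "internal_cidrs" {
  description = "Every internal range that must keep routing over the TGW."
  type        = list(string)
  default     = ["10.0.0.0/8"]
}

# One interface endpoint per service; private DNS keeps SDK calls unchanged.
resource "aws_vpc_endpoint" "interface" {
  for_each = toset(var.interface_endpoint_services)

  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.${each.value}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.endpoint_subnet_ids
  security_group_ids  = [var.endpoint_security_group_id]
  private_dns_enabled = true
}

# One TGW route per internal range on the public route table. Anything not
# listed here follows the 0.0.0.0/0 route to the IGW.
resource "aws_route" "internal_via_tgw" {
  for_each = toset(var.internal_cidrs)

  route_table_id         = var.public_route_table_id
  destination_cidr_block = each.value
  transit_gateway_id     = var.transit_gateway_id
}
```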

The hub-and-spoke design is the right architecture for inbound inspection at the perimeter. It's the wrong tool for filtering outbound application API calls at volume — and the cost difference at scale is significant.

Working through a similar cost problem, or figuring out which parts of your egress path are actually doing useful work? Get in touch — this is the kind of architecture review I do regularly.

Composable Terraform Modules: Default Every Resource to False

Glenn Gray — Tue, 23 Jun 2026 00:21:55 +0000

Originally published on graycloudarch.com.

The workload account had passed every review. Provisioned with the same VPC module we'd used for six months. All defaults. No customizations needed.

Three months later, an audit flagged it: traffic from that account was bypassing the centralized inspection VPC. The Network Firewall wasn't seeing it. Direct path out through an internet gateway the module had created by default.

No error. No alert. The module did exactly what it was designed to do. We just hadn't designed it for this context.

That account had an IGW it never needed, because nobody explicitly told it not to create one.

The natural instinct, and where it breaks

The pull toward "batteries included" modules makes sense early. Network module creates VPC, subnets, IGW, NAT gateways, route tables — all of it. For a single-account setup, that's convenient.

The problem appears by account three, where some VPCs should have IGWs and some shouldn't. By account six — workload VPCs routing through a hub, an inspection VPC that owns the IGW and NAT, a sandbox account with direct access — you're forking modules, adding count = 0 overrides at the call site, or writing if/else logic at every deployment root. Each workaround is a signal that the module wasn't designed for multiple contexts.

The fix is a design rule: if a resource is not universally needed, its creation variable defaults to false. The caller opts in explicitly.

variable "create_internet_gateway" {
  description = "Create an IGW and default route in the public route table. Defaults to false because workload VPCs use hub-and-spoke routing through the centralized inspection VPC for all egress."
  type        = bool
  default     = false
}

variable "create_nat_gateway" {
  description = "Create NAT Gateways for private subnet egress. Defaults to false for hub-and-spoke VPCs where egress routes through the TGW to the centralized egress/inspection VPC."
  type        = bool
  default     = false
}

variable "create_public_subnets" {
  description = "Create public subnets, route table, and route table associations. Defaults to false for hub-and-spoke design. Set to true only for hub VPCs that own an IGW."
  type        = bool
  default     = false
}

The description isn't documentation for its own sake. It explains why the default is false — the specific architectural constraint that makes true wrong for most callers. When someone reads it at the call site, they know whether their context matches the assumption.

What the call sites look like

The hub VPC — the inspection VPC that owns the Network Firewall — explicitly opts in. Workload VPCs call the module with no overrides:

module "inspection_vpc" {
  source = "../../../..//common/modules/network"

  name               = "inspection"
  vpc_cidr           = "10.0.0.0/22"
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

  # These are true because this is the hub — explicit opt-in
  create_internet_gateway = true
  create_public_subnets   = true
  create_nat_gateway      = true
}

module "workload_vpc" {
  source = "../../../..//common/modules/network"

  name               = "workloads-prod"
  vpc_cidr           = "10.1.0.0/22"
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

  # No opt-ins needed — all defaults correct for hub-and-spoke
}

The workload_vpc call is safe to copy-paste for any new workload account. The security-relevant decisions are in the module, not scattered across caller configurations.

The resource count gate

Conditional creation only works if every resource that depends on the gated resource is also gated:

resource "aws_internet_gateway" "this" {
  count  = var.create_internet_gateway ? 1 : 0
  vpc_id = aws_vpc.this.id

  tags = merge(local.common_tags, {
    Name = "${var.name}-igw"
  })
}

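# Routes that depend on the IGW must also be gated. This route also indexes
# the public route table, which only exists when create_public_subnets is
# true, so it is gated on both flags.
resource "aws_route" "public_internet" {
  count                  = var.create_internet_gateway && var.create_public_subnets ? 1 : 0
  route_table_id         = aws_route_table.public[0].id
  destination_cidr_block = "0.0.0.0/0"
  gateway_id             = aws_internet_gateway.this[0].id
}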

Outputs have the same requirement:

output "internet_gateway_id" {
  value = var.create_internet_gateway ? aws_internet_gateway.this[0].id : null
}

A plan for a workload VPC shows zero IGW-related changes. Not suppressed — genuinely not there. The module doesn't create it, reference it, or output it.
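One way to make the dependency between gates explicit is a precondition on the gated resource, plus the one() function for the output. This is a sketch of a variant of the resource and output above, not the module as deployed. It assumes Terraform 1.2 or later and that the public route table is created only when create_public_subnets is true.

```hcl
resource "aws_internet_gateway" "this" {
  count  = var.create_internet_gateway ? 1 : 0
  vpc_id = aws_vpc.this.id

  lifecycle {
    # Evaluated only when an instance exists (count = 1), so workload
    # VPCs with both flags left at false never hit this check.
    precondition {
      condition     = var.create_public_subnets
      error_message = "create_internet_gateway = true requires create_public_subnets = true: the default route is added to the public route table."
    }
  }
}

output "internet_gateway_id" {
  description = "ID of the IGW, or null when the module did not create one."
  # one() returns the single element, or null for an empty list.
  value = one(aws_internet_gateway.this[*].id)
}
```

A caller that enables the IGW without public subnets then gets a readable error at plan time instead of an invalid index error.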

The same pattern, applied everywhere

Networking is the clearest example because the security stakes are visible, but the principle applies to every module type:

ALB module:

enable_deletion_protection = false — dev environments don't need it; prod opts in
enable_access_logs = false — caller enables when the S3 bucket for logs is ready
enable_https_redirect = false — explicit, not assumed; avoids broken behavior on internal ALBs

Security baseline module:

enable_guardduty = false, enable_security_hub = false, enable_config = false
One module, two contexts: the bootstrap account enables everything; sandbox accounts enable nothing
Without this: you're writing conditional logic at the call site for every new account type

Observability baseline:

enable_cloudwatch_alarms = false, enable_container_insights = false
Nonprod may or may not want alarms — the caller decides, not the module author

The pattern: if a resource is conditional on the deployment context, the module expresses that conditionality as a boolean defaulting to false.
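As an illustration of the same rule outside networking, here is a minimal sketch of an ALB module with two of the gates listed above. The variable names follow the list; everything else (inputs, bucket handling) is an assumption, and enable_https_redirect is left out to keep the example short.

```hcl
# Hypothetical module inputs.
variable "name" { type = string }
variable "subnet_ids" { type = list(string) }
variable "security_group_ids" { type = list(string) }

variable "internal" {
  type    = bool
  default = true
}

variable "enable_deletion_protection" {
  description = "Defaults to false so dev stacks can be destroyed. Prod opts in."
  type        = bool
  default     = false
}

variable "enable_access_logs" {
  description = "Defaults to false because the log bucket may not exist yet."
  type        = bool
  default     = false
}

variable "access_logs_bucket" {
  description = "S3 bucket for access logs. Only read when enable_access_logs is true."
  type        = string
  default     = null
}

resource "aws_lb" "this" {
  name               = var.name
  load_balancer_type = "application"
  internal           = var.internal
  subnets            = var.subnet_ids
  security_groups    = var.security_group_ids

  enable_deletion_protection = var.enable_deletion_protection

  # The block is emitted only when the caller opts in.
  dynamic "access_logs" {
    for_each = var.enable_access_logs ? [1] : []
    content {
      bucket  = var.access_logs_bucket
      enabled = true
    }
  }
}
```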

When to break it

Not every variable is a gate on resource creation. The rule doesn't apply to:

Configuration variables with opinionated defaults. instance_type = "t3.medium" should default to a sensible value. The question isn't "should we create this?" — the resource always exists, you're just setting its properties.

Required inputs with no safe default. vpc_cidr shouldn't have a default at all. Force the caller to declare it explicitly. A missing required input surfaces immediately; a wrong default doesn't.

Resources that must exist for the module to function. The VPC itself isn't gated — if the module is called, a VPC is created. If a resource is that foundational, don't hide it behind a boolean.

The line: create_* and enable_* variables gate resource existence. Configuration variables set properties of resources that always exist. Required inputs have no default.
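The three categories side by side, as a minimal sketch. The variable names reuse the examples from this section; the validation rule is an addition for illustration.

```hcl
# 1. Gate on resource existence: defaults to false, caller opts in.
variable "create_nat_gateway" {
  description = "Create NAT Gateways for private subnet egress."
  type        = bool
  default     = false
}

# 2. Configuration of a resource that always exists: opinionated default.
variable "instance_type" {
  description = "Instance type for the resource this module always creates."
  type        = string
  default     = "t3.medium"
}

# 3. Required input: no default, so a missing value fails at plan time.
variable "vpc_cidr" {
  description = "CIDR block for the VPC. Must be declared by the caller."
  type        = string

  validation {
    # cidrhost() errors on anything that is not a valid CIDR; can() turns
    # that error into false.
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "vpc_cidr must be a valid CIDR block, for example 10.1.0.0/22."
  }
}
```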

What the audit actually fixed

The inspection gap in that workload account had existed for months. The fix was changing the module default to false and re-applying across all accounts.

Because every other resource in the module was already following this pattern, the re-apply was clean. Zero unexpected changes on correctly-configured accounts — which is the second-order effect of this design rule: the module becomes safe to re-apply.

When everything that shouldn't exist defaults to not existing, terraform plan on a correctly-configured account comes back empty. That emptiness is a signal you can rely on. It means the module isn't hiding state you didn't ask for.

That's harder to achieve if you're starting from "batteries included" defaults and trying to carve out exceptions. It's straightforward if you start from false and require callers to opt in.

Standardizing Terraform module design across multiple accounts and environments — or inheriting a module library where the defaults aren't working in your favor? This is one of the first patterns I help teams establish. Get in touch.

ECS Fargate as a Migration Bridge: Running Two Orchestrators at Once

Glenn Gray — Tue, 16 Jun 2026 18:26:36 +0000

Originally published on graycloudarch.com.

Three months into the EKS buildout, someone asked a reasonable question: do we actually need all of this right now?

The cluster was running. The services were containerized. But the team was also operating cert-manager, an ingress controller, external-secrets-operator, and Karpenter — each with its own version compatibility matrix, each capable of generating its own 2am incident. None of it was directly related to shipping the product.

We made the decision to migrate to ECS Fargate first, with EKS as a future destination if and when the operational capacity caught up. Not a retreat — a deliberate two-step. The container images were already built. The IAM patterns were transferable. The application code hadn't changed. Only the orchestration layer was moving.

This is what that migration looked like, and why running both orchestrators simultaneously during the transition was the right pattern.

Why not skip straight to EKS

The decision framework for ECS vs. EKS is covered in a prior post — if you've already worked through that, skip ahead. The short version relevant here: EKS adds roughly fifteen operational concepts on top of running a service, each capable of failing independently. The bridge pattern is for teams where the orchestration question and the containerization question are both open at the same time. Trying to answer them together multiplies the blast radius.

The ECS → EKS migration later is largely mechanical. Task definitions become Helm charts, task roles become IRSA service account annotations, ALB target group registration becomes ingress controller configuration. The container image — the actual artifact — doesn't change. Build ECS as if you'll migrate it, and you will.

What the ECS foundation looks like in Terraform

Three modules compose to support any service:

# Shared per cluster
module "ecs_cluster" {
  source = "./modules/ecs-cluster"

  name               = "platform-prod"
  log_retention_days = 30
  capacity_providers = ["FARGATE", "FARGATE_SPOT"]
}

# Per service — IAM task role with least-privilege access
module "api_task_role" {
  source = "./modules/ecs-task-role"

  service_name   = "api"
  environment    = "prod"
  secrets_arns   = [aws_secretsmanager_secret.api_db.arn]
  ecr_account_id = var.shared_services_account_id
}

# Per service — ECS service + ALB registration
module "api_service" {
  source = "./modules/ecs-service"

  cluster_arn      = module.ecs_cluster.arn
  task_role_arn    = module.api_task_role.arn
  image            = "${var.ecr_registry}/api:${var.image_tag}"
  cpu              = 512
  memory           = 1024
  desired_count    = 2
  target_group_arn = aws_alb_target_group.api.arn

  environment_variables = {
    APP_ENV = "production"
  }

  secrets = {
    DB_PASSWORD = aws_secretsmanager_secret.api_db.arn
  }
}

The design constraint that matters most: keep the three modules independent. Don't build a composite "ecs-app" module that wraps all three. Independent modules mean each service can tune its task role and scaling behavior without touching the cluster, and the cluster can be upgraded without touching service configurations.

Cross-account ECR: the gotcha that hits every team

ECR lives in a shared-services account. ECS runs in the workloads account. This is standard multi-account architecture — and it means the ECS task execution role needs cross-account pull permissions that are easy to get wrong.

Two pieces are required:

# In the workloads account: task execution role policy
data "aws_iam_policy_document" "ecr_cross_account" {
  statement {
    actions = [
      "ecr:GetAuthorizationToken",
    ]
    resources = ["*"]  # GetAuthorizationToken is global; can't be scoped to a registry
  }

  statement {
    actions = [
      "ecr:BatchCheckLayerAvailability",
      "ecr:GetDownloadUrlForLayer",
      "ecr:BatchGetImage",
    ]
    resources = [
      "arn:aws:ecr:us-east-1:${var.shared_services_account_id}:repository/*"
    ]
  }
}

// In the shared-services account: ECR repository policy
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "AWS": "arn:aws:iam::WORKLOADS_ACCOUNT_ID:root"
    },
    "Action": [
      "ecr:BatchCheckLayerAvailability",
      "ecr:GetDownloadUrlForLayer",
      "ecr:BatchGetImage"
    ]
  }]
}

The common failure mode: the task execution role has the right IAM policy, but the ECR repository policy in the shared-services account doesn't grant the workloads account access. The image pull fails with what looks like a credentials problem, and the error message ("no basic auth credentials") does nothing to point at the repository policy as the cause.
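The repository policy can live in Terraform alongside the repository, so the second half of the grant is not a manual step. This is a minimal sketch applied in the shared-services account; workloads_account_id and repository_name are hypothetical variables.

```hcl
variable "workloads_account_id" {
  description = "Hypothetical: ID of the account whose ECS tasks pull images."
  type        = string
}

variable "repository_name" {
  description = "Hypothetical: name of an existing ECR repository, for example api."
  type        = string
}

data "aws_iam_policy_document" "ecr_repo_pull" {
  statement {
    sid    = "AllowWorkloadsAccountPull"
    effect = "Allow"

    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::${var.workloads_account_id}:root"]
    }

    # Same three pull actions as the JSON policy above.
    actions = [
      "ecr:BatchCheckLayerAvailability",
      "ecr:GetDownloadUrlForLayer",
      "ecr:BatchGetImage",
    ]
  }
}

resource "aws_ecr_repository_policy" "pull" {
  repository = var.repository_name
  policy     = data.aws_iam_policy_document.ecr_repo_pull.json
}
```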

Logging: what changes from EKS

On EKS, Fluent Bit runs as a DaemonSet — one per node, automatically collecting logs from every container. On ECS Fargate, there is no shared host and no DaemonSet. You configure logging per task definition.

The simplest approach, and the right default for most services, is the awslogs driver:

"logConfiguration": {
  "logDriver": "awslogs",
  "options": {
    "awslogs-group": "/ecs/api",
    "awslogs-region": "us-east-1",
    "awslogs-stream-prefix": "ecs",
    "awslogs-create-group": "true"
  }
}

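This sends all stdout/stderr from the container directly to CloudWatch. No sidecar and no configuration beyond the task definition. The awslogs-create-group: true option creates the log group automatically if it doesn't exist, which is useful during initial deployment, but it requires logs:CreateLogGroup on the task execution role, and the AWS managed AmazonECSTaskExecutionRolePolicy does not include that action.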

For services that need to ship logs to multiple destinations or apply structured filtering, FireLens is the right choice: a Fluent Bit or Fluentd container runs as a sidecar in the same task and routes logs where they need to go. The operational overhead is higher, but the routing flexibility is real.

Verify logging works before cutting traffic: aws logs tail /ecs/api --follow while a test request hits the new ECS service. If nothing appears, the task execution role is missing CloudWatch Logs write permissions or the log group name doesn't match.
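The awslogs driver writes with the task execution role's credentials, so that is where the permissions belong. Here is a minimal sketch of an inline policy for the /ecs/api group used above. The two variables are hypothetical, and the region is taken from the logConfiguration example.

```hcl
variable "account_id" {
  description = "Hypothetical: ID of the account that owns the log group."
  type        = string
}

variable "execution_role_name" {
  description = "Hypothetical: name of the ECS task execution role (not the task role)."
  type        = string
}

locals {
  log_group_arn = "arn:aws:logs:us-east-1:${var.account_id}:log-group:/ecs/api"
}

data "aws_iam_policy_document" "awslogs" {
  statement {
    sid       = "WriteContainerLogs"
    actions   = ["logs:CreateLogStream", "logs:PutLogEvents"]
    resources = ["${local.log_group_arn}:*"]
  }

  # Only needed when awslogs-create-group is "true".
  statement {
    sid       = "CreateLogGroupOnFirstDeploy"
    actions   = ["logs:CreateLogGroup"]
    resources = [local.log_group_arn, "${local.log_group_arn}:*"]
  }
}

resource "aws_iam_role_policy" "awslogs" {
  name   = "awslogs-write"
  role   = var.execution_role_name
  policy = data.aws_iam_policy_document.awslogs.json
}
```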

Running both orchestrators during the soak period

We migrated all production services to ECS Fargate, but we kept EKS running throughout a soak period. Not as a fallback — as a confirmed, immediate revert target.

The migration sequence for each service:

Deploy service on ECS Fargate, validate health checks and task stability
Cut DNS to the new ALB (see the companion post on zero-downtime DNS cutover)
Monitor for 72 hours: error rates, latency p99, ALB healthy host count
If metrics are nominal after 72 hours, deprovision from EKS

During the soak period, EKS was live and capable of receiving traffic within 60 seconds if the DNS record was reverted. This isn't a hypothetical backup — it was a committed operational state, with the rollback sequence documented and tested before we cut DNS.

The benefit of this pattern is that it changes the calculus on the cutover decision. If rollback requires re-provisioning on EKS from scratch, the team has every incentive to push through problems rather than revert. If rollback is "update one Route53 record and wait 60 seconds," the team can move fast and revert at the first real signal.

We didn't need to revert. But having the option meant we could make the migration decision cleanly.

The ECS Anywhere variation: running both indefinitely

For one service — a high-volume content delivery workload — the migration pattern extended beyond a time-limited soak period. That service runs on both ECS Fargate and ECS Anywhere simultaneously, with the ability to shift traffic between them at any time.

ECS Anywhere extends ECS to on-premises or edge nodes, registered as EXTERNAL capacity providers. The same ECS service, task definitions, and IAM patterns apply — what changes is the capacity provider:

resource "aws_ecs_service" "delivery" {
  name            = "delivery-central"
  cluster         = aws_ecs_cluster.platform.id
  task_definition = aws_ecs_task_definition.delivery.arn
  desired_count   = var.desired_count

  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = var.fargate_weight  # adjust to shift traffic
    base              = 0
  }

  capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.anywhere.name
    weight            = var.anywhere_weight
    base              = 0
  }
}

Shifting between Fargate and Anywhere is a Terraform variable change — no service restart, no DNS change, no downtime. The service is always running on both; only the task distribution changes.

This pattern works well for workloads that need geographic proximity to edge infrastructure or where data sovereignty makes cloud-only deployment impractical. It also provides a genuine multi-region/multi-location deployment model without requiring a separate orchestrator.

When to stay on ECS

ECS Fargate is the right long-term answer — not just the bridge — when:

Service count is small (under roughly 15-20 services) and autoscaling requirements are straightforward target-tracking
The team's operational capacity doesn't yet support cluster-level operations: node group upgrades, admission controller management, custom scheduler configuration
Deploys via Terraform or CI/CD pipelines are acceptable and GitOps isn't a hard requirement
No hard requirement for KEDA, HPA with custom metrics, or cluster-level bin-packing

The ECS vs. EKS decision framework is covered in more detail in an earlier post. The short version: it's an operational capacity question, not a features comparison.

The bridge pattern is valuable precisely because it decouples the containerization decision from the orchestration decision. You can containerize now, on ECS, without betting that the team is ready to operate Kubernetes. When the team is ready — and that readiness is genuinely there, not aspirational — the migration from ECS to EKS is mostly mechanical. The hard work of containerizing the application is already done.

Running a platform migration and figuring out the container orchestration path? This is the kind of decision I work through with teams regularly. Get in touch.

Zero-Downtime DNS Cutover with ACM and ALB on AWS

Glenn Gray — Tue, 09 Jun 2026 16:29:43 +0000

Originally published on graycloudarch.com.

We were migrating the services behind a high-volume content distribution platform from one orchestration layer to another — new ALB, new target groups, new ECS cluster — and the question came up: when do we touch the DNS record?

The answer was: last. After everything else is done, validated, and confirmed healthy under real traffic conditions. Not concurrently with the new infrastructure. Not "we'll validate it after we cut over." Last.

Most DNS-related outages aren't caused by the DNS change itself. They're caused by the things that weren't ready when the change was made — a certificate that hadn't finished validating, a health check that passed in staging but broke under production traffic patterns, a TTL that made rollback a 10-minute ordeal instead of a 30-second one. The pattern that avoids all of those is the same: do all the risky work before you touch DNS.

Pre-provision everything before touching DNS

The new ALB, ACM certificate, target groups, and health check configuration all exist before any DNS change. No user traffic touches the new infrastructure during this phase — you're building and validating it in parallel with the old system still serving.

ACM certificate validation is the step most teams rush. Request the certificate immediately and use DNS validation, not email validation. (If you're managing DNS in Cloudflare rather than Route53, the automation pattern is covered in a separate post. The principle is the same.)

resource "aws_acm_certificate" "new" {
  domain_name       = "api.example.com"
  validation_method = "DNS"

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_route53_record" "cert_validation" {
  for_each = {
    for dvo in aws_acm_certificate.new.domain_validation_options : dvo.domain_name => {
      name   = dvo.resource_record_name
      record = dvo.resource_record_value
      type   = dvo.resource_record_type
    }
  }

  zone_id = data.aws_route53_zone.primary.zone_id
  name    = each.value.name
  type    = each.value.type
  ttl     = 60
  records = [each.value.record]
}

resource "aws_acm_certificate_validation" "new" {
  certificate_arn         = aws_acm_certificate.new.arn
  validation_record_fqdns = [for r in aws_route53_record.cert_validation : r.fqdn]
}

The validation CNAME goes into Route53 while the old system is still serving traffic. You wait for ISSUED status — typically 5-30 minutes, but occasionally longer. This is not a step you do on the morning of the cutover.

After the certificate is validated, deploy the new ALB with listener rules and target groups. Register targets and confirm health checks are passing — not just passing in principle, but passing with the actual application container running.

resource "aws_alb_target_group" "new" {
  name        = "api-new-tg"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip"  # required for ECS Fargate

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 30
    timeout             = 10
    matcher             = "200"
  }
}

The key check before proceeding: aws elbv2 describe-target-health should show all targets as healthy. Not initial. Not draining. Healthy. ALB-level health checks and application-level smoke tests are different things — confirm both.
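One way to enforce the "certificate first, then ALB" ordering in Terraform is to have the HTTPS listener read the certificate ARN from the validation resource rather than from the certificate itself. This is a sketch that reuses the resource names from the snippets above (aws_alb.new, aws_alb_target_group.new, aws_acm_certificate_validation.new); the ssl_policy value is an assumption.

```hcl
resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_alb.new.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06" # assumed policy choice

  # Referencing the validation resource makes Terraform wait until the
  # certificate has been issued before creating the listener.
  certificate_arn = aws_acm_certificate_validation.new.certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_alb_target_group.new.arn
  }
}
```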

Lower the TTL — 48 hours in advance

This is the most commonly skipped step and the highest-leverage thing you can do before a DNS cutover.

Route53 lets you change TTL on an existing record at any time, and it takes effect immediately from Route53's perspective. The problem is DNS resolvers don't care about your new TTL — they cache based on the TTL they observed when they last fetched the record. A resolver that saw a 300-second TTL will hold that cached value for up to 300 seconds after you lower it.

In practice this means: if you lower the TTL shortly before the cutover window, resolvers that fetched the record just before the change (or that hold records longer than the TTL allows) are still serving it under the old TTL, so your effective rollback window is the old TTL, not the new one. If the cutover has problems and you need to revert, you're waiting several minutes for caches to expire, with production traffic flowing to a misconfigured endpoint.

Lower it 48 hours in advance:

resource "aws_route53_record" "api" {
  zone_id = data.aws_route53_zone.primary.zone_id
  name    = "api.example.com"
  type    = "A"

  alias {
    name                   = aws_alb.old.dns_name
    zone_id                = aws_alb.old.zone_id
    evaluate_target_health = true
  }

  # Alias records take no ttl argument: Route53 answers with the alias
  # target's TTL (60 seconds for an ALB). The 300 → 30 change described
  # below applies to non-alias records with an explicit TTL.
}

For non-alias A records with an explicit TTL: 300 → 30, committed and applied 48 hours before the cutover window. This is a one-line Terraform change. Apply it, verify it, and move on.

After 48 hours, all resolvers that previously cached the record at 300 seconds have expired their cache and re-fetched with the new 30-second TTL. Your rollback window is now 30-60 seconds.
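The caching arithmetic can be sketched as a small function. This is an illustration under stated assumptions, not a model of real resolver behavior: it assumes resolvers honor the TTL they observed, and the one-day old TTL in the example is hypothetical (the figures above use 300 seconds).

```python
def worst_case_stale_seconds(old_ttl: int, new_ttl: int, seconds_since_lowering: int) -> int:
    """Longest time a TTL-honoring resolver may keep the old answer after a change made now.

    A resolver that fetched the record just before the TTL was lowered keeps it for
    whatever remains of the old TTL. A resolver that fetched it just before the
    change, under the new TTL, keeps it for up to new_ttl.
    """
    remaining_old = max(0, old_ttl - seconds_since_lowering)
    return max(new_ttl, remaining_old)


# Hypothetical record that used to have a one-day TTL, lowered to 30 seconds.
print(worst_case_stale_seconds(86_400, 30, 30 * 60))        # lowered 30 minutes ago -> 84600
print(worst_case_stale_seconds(86_400, 30, 48 * 60 * 60))   # lowered 48 hours ago  -> 30
```

The 48-hour lead time is what guarantees the second case regardless of how long the old TTL was, and it leaves margin for resolvers that hold records longer than they should.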

The actual cutover — boring is the goal

With 48-hour TTL prep and pre-validated infrastructure, the cutover itself should take under two minutes and produce no errors.

Update the Route53 alias record to point to the new ALB:

resource "aws_route53_record" "api" {
  zone_id = data.aws_route53_zone.primary.zone_id
  name    = "api.example.com"
  type    = "A"

  alias {
    name                   = aws_alb.new.dns_name       # ← changed
    zone_id                = aws_alb.new.zone_id        # ← changed
    evaluate_target_health = true
  }
}

Apply. Watch three metrics in parallel for the next 5 minutes:

New ALB: TargetResponseTime p99 and HTTPCode_Target_5XX_Count
New ALB: HealthyHostCount — should stay constant
Application error rates from your monitoring platform

With 30-second TTL and pre-validated infrastructure, you should see full traffic shift to the new ALB within 2-3 minutes. If you're using weighted routing for a gradual shift, start at 10% new and watch those same metrics before moving to 100%.

# Optional: weighted routing for gradual shift
resource "aws_route53_record" "api_old" {
  zone_id        = data.aws_route53_zone.primary.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "old"

  weighted_routing_policy {
    weight = 0  # end state shown; start at 90 and reduce as you validate
  }

  alias {
    name                   = aws_alb.old.dns_name
    zone_id                = aws_alb.old.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "api_new" {
  zone_id        = data.aws_route53_zone.primary.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "new"

  weighted_routing_policy {
    weight = 100  # end state shown; start at 10 and increase as you validate
  }

  alias {
    name                   = aws_alb.new.dns_name
    zone_id                = aws_alb.new.zone_id
    evaluate_target_health = true
  }
}
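For the weighted variant, it helps to know what share of DNS answers each record should receive at a given pair of weights. Route 53 selects a weighted record with probability weight / sum of weights. A small sketch of that arithmetic follows; the function name is hypothetical, and real traffic will lag these numbers because resolvers cache answers for the TTL.

```python
def traffic_share(weights: dict[str, int]) -> dict[str, float]:
    """Expected fraction of DNS answers per set_identifier for a weighted record group."""
    if not weights:
        raise ValueError("at least one weighted record is required")
    for name, weight in weights.items():
        if not 0 <= weight <= 255:
            raise ValueError(f"weight for {name!r} must be between 0 and 255")
    total = sum(weights.values())
    if total == 0:
        # Route 53 treats a group where every weight is 0 as equal probability.
        return {name: 1 / len(weights) for name in weights}
    return {name: weight / total for name, weight in weights.items()}


print(traffic_share({"old": 90, "new": 10}))   # first step of a gradual shift
print(traffic_share({"old": 0, "new": 100}))   # end state
```

Note the all-zero case: setting both records to 0 does not stop traffic, it splits it evenly.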

Hold before decommission — at least 72 hours

The cutover is done. Metrics are green. The temptation is to clean up immediately.

Don't.

Keep the old ALB running for at least 72 hours. Two reasons:

First, any clients with hardcoded IPs — rather than DNS names — will break when the ALB is decommissioned. ALB IPs are not static. You won't know these clients exist until they start generating errors after teardown. The 72-hour window surfaces them while you still have an easy fix.

Second, some clients have unusually long DNS TTL caches or are behind corporate proxies that cache aggressively. They'll still be resolving to the old ALB IP for a while after the cutover. Those requests need somewhere to land.

After 72 hours, verify: no Route53 records point to the old ALB, no CloudWatch alarms are still scoped to the old ALB's metrics, no ECS services are still registered to the old target groups. Then destroy.

Decommission through Terraform, not the console, and keep the old ALB's configuration recoverable from version control for 30 days as a documented rollback artifact. A terraform destroy with a clear commit message is a better audit trail than a resource that disappeared from state.

Rollback: a scheduled option, not an emergency

The old ALB staying live through the hold period isn't just caution — it means rollback is a planned capability, not an emergency procedure.

Define the rollback trigger before the cutover window, not during it:

"If error rate exceeds 1% on the new ALB for more than 3 consecutive minutes after full traffic shift, revert."

Rollback is the same Terraform change in reverse — point the alias back at the old ALB. With 30-second TTL: traffic returns within 60 seconds. The old ALB never stopped running, so there's no warm-up time, no health check delay.
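Writing the trigger down as code removes any room for debate during the window. A minimal sketch, assuming one error-rate sample per minute expressed as a fraction; the function name is hypothetical, and the code reads the trigger as "three consecutive samples above 1%", which you may want to tighten or loosen.

```python
def should_rollback(error_rates: list[float], threshold: float = 0.01, consecutive: int = 3) -> bool:
    """True once `consecutive` back-to-back per-minute samples exceed `threshold`."""
    streak = 0
    for rate in error_rates:
        # Any sample at or below the threshold resets the streak.
        streak = streak + 1 if rate > threshold else 0
        if streak >= consecutive:
            return True
    return False


print(should_rollback([0.002, 0.02, 0.03, 0.015]))   # True: three bad minutes in a row
print(should_rollback([0.02, 0.02, 0.005, 0.02]))    # False: the streak was broken
```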

This framing matters for how the team experiences the cutover. If rollback requires scrambling, it creates pressure to push through problems rather than revert cleanly. If rollback is a pre-committed, 60-second operation, the team can move fast and be willing to revert at the first signal.

The cutovers that cause incidents are the ones where the rollback plan is "we'll figure it out if we need to."

Planning a production migration and want a second set of eyes on the cutover sequence? This is one of the higher-risk moments in a platform project, and the details that matter are usually in the runbook, not the architecture diagram. Get in touch.

Terraform CI Is Green. Here's What It Missed.

Glenn Gray — Tue, 02 Jun 2026 13:57:37 +0000

Originally published on graycloudarch.com.

The apply produced a diff nobody expected. The plan had been green. The PR had been approved. Two engineers had been moving fast through a Terraform monorepo — module changes, stack updates, new resources in parallel — and the CI was green on every single PR. Nobody saw the problem until the change was already in.

The cause wasn't bad code. It was a CI pattern so common it's nearly a default: run terraform plan only for stacks where files changed in the PR.

That sounds right. It is wrong.

The specific failure: changed-files detection doesn't know about consumers

Here's the shape of a typical monorepo CI setup:

modules/
  network/
    main.tf
stacks/
  prod-vpc/
    main.tf   ← sources from modules/network/
  dev-vpc/
    main.tf   ← sources from modules/network/

A PR modifies modules/network/main.tf. The changed-files action sees changes in modules/network/. It runs a plan for modules/network/. It does not run a plan for stacks/prod-vpc/ or stacks/dev-vpc/ — because those directories have no changed files.

Both of those stacks will produce a different plan when they're next applied. Nobody saw it before merge.

The logic is seductive: why run plans for stacks that haven't changed? But the premise is wrong. A stack that sources a changed module has changed — you just can't see it in the diff. The module change is the diff.

What actually works

Three approaches, in order of correctness:

Run plan for every stack on every PR. Expensive on a large monorepo, but correct. Terragrunt's run-all plan with --terragrunt-parallelism 8 makes this tractable in most codebases. If it's too slow, it's a signal the monorepo has grown past what a single pipeline can handle — and that's a different problem worth surfacing.

Build a dependency graph. Parse source = references to find all consumers of changed modules, add those stacks to the plan set. This is the right answer architecturally, but it requires build tooling to maintain the graph. Tools like Terragrunt's dependency blocks give you this for free if your dependency declarations are complete.

Practical middle ground. Run plan for all stacks in the same directory subtree as any changed module. Not as precise as a graph, but catches the most common failure: a module and its primary consumers living near each other in the directory structure. Works well for codebases where modules/ and stacks/ are adjacent siblings and team conventions keep related things together.

What doesn't work: paths-filter or the changed-files action scoped to the stack directory. It sees no diff, skips the plan, CI stays green, and the module change is invisible to reviewers until apply runs post-merge.
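The dependency-graph option can be prototyped in a few lines before investing in real tooling. This is a sketch under assumptions, not a complete implementation: it only follows local relative source paths, only finds direct consumers (no module-to-module chains), and takes file contents as a dict so it stays self-contained. The function name and the sample stacks are hypothetical.

```python
import posixpath
import re

# Matches `source = "..."` lines such as those inside module blocks.
SOURCE_RE = re.compile(r'^\s*source\s*=\s*"([^"]+)"', re.MULTILINE)


def stacks_to_plan(changed_files: list[str], stack_files: dict[str, str]) -> list[str]:
    """Return stack directories that changed directly or source a changed module.

    changed_files: repo-relative paths from the PR diff.
    stack_files:   {stack_dir: concatenated .tf content of that stack}.
    """
    def touched(directory: str) -> bool:
        prefix = directory.rstrip("/") + "/"
        return any(path.startswith(prefix) for path in changed_files)

    affected = set()
    for stack_dir, text in stack_files.items():
        if touched(stack_dir):
            affected.add(stack_dir)
            continue
        for source in SOURCE_RE.findall(text):
            if not source.startswith("."):
                continue  # registry or git source: not resolvable from the diff
            module_dir = posixpath.normpath(posixpath.join(stack_dir, source))
            if touched(module_dir):
                affected.add(stack_dir)
                break
    return sorted(affected)


stacks = {
    "stacks/prod-vpc": 'module "network" {\n  source = "../../modules/network"\n}\n',
    "stacks/dev-vpc": 'module "network" {\n  source = "../../modules/network"\n}\n',
    "stacks/dns": 'module "zone" {\n  source = "../../modules/dns"\n}\n',
}
print(stacks_to_plan(["modules/network/main.tf"], stacks))
# ['stacks/dev-vpc', 'stacks/prod-vpc']
```

The output is the plan set for the PR: both VPC stacks are included even though neither has a changed file.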

Three supporting fixes that complete the picture

The module consumer problem is the silent failure mode — it requires a deliberate fix to CI architecture. But there are three other common issues that are cheaper to address and eliminate most of the remaining review friction.

Put the plan in the PR comment, not in the logs.

A plan that lives in the Actions logs requires a reviewer to click through to the workflow run, find the right job, scroll to the plan output, and read it in isolation from the PR diff. Most reviewers don't. They check whether CI is green and click approve.

- name: Post plan to PR
  run: |
    PLAN=$(terraform show -no-color tfplan 2>&1 | head -200)
    gh pr comment ${{ github.event.pull_request.number }} \
      --body "### Terraform Plan
    \`\`\`
    ${PLAN}
    \`\`\`"
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

A reviewer who sees the plan inline — showing N resources to add, M to change, 0 to destroy — can make a real decision before clicking approve. The plan comment also becomes a lightweight audit trail: what did we expect to happen, and what actually happened.

Enable terraform fmt --check. For real this time.

Most codebases have it disabled. The comment is usually # TODO: fix formatting first. The fix is a one-time operation:

terraform fmt -recursive .
git commit -am "terraform fmt: normalize formatting before enforcing check"

Then enable the check as a separate fast job. It runs in under 10 seconds, has no false positives, and eliminates the category of review comments that are pure style — freeing reviewers to focus on substance.

Add tflint with the AWS ruleset.

terraform validate catches syntax errors and internal inconsistencies, including an argument passed to a module that no longer declares it. It does not catch instance types that don't exist, previous-generation instance types, deprecated syntax, or providers missing from required_providers. Invalid values surface only at apply time, and the rest go unflagged.

# .tflint.hcl
plugin "aws" {
  enabled = true
  version = "0.32.0"
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

The practical value is catching things that Terraform itself won't catch until it's talking to the AWS API — like an instance_type that was deprecated, or a required_providers block that's incomplete after a module upgrade.

What good Terraform CI looks like end-to-end

PR opened
  → terraform fmt --check        (fast; fails on style)
  → tflint                       (fast; catches deprecated/missing config)
  → terraform plan (all affected stacks)
  → plan posted to PR comment

PR merged
  → terraform apply (gated on approval + merge)

The critical constraint is the last line: apply never runs on open PRs. Plan runs freely and often; apply runs exactly once per PR, after merge, and only on approved changes.

On a monorepo with Terragrunt, run-all plan handles the multi-stack case. The plan comment step posts one comment per stack with a summary header, so reviewers can scan affected stacks without opening each workflow run.

What "CI is green" actually means

Green CI on a Terraform PR means syntax is valid, the workflow ran, and the specific stacks with changed files produced a plan. It does not mean the change is safe. It does not mean the full blast radius is visible.

The module consumer problem is the clearest example of this gap, but it's not the only one. Infrastructure review requires actually reading the plan — which requires the plan to be somewhere reviewers will look. Green CI that nobody reads is a false signal, and a fast-moving codebase will eventually prove that.

The four fixes here don't require new tools or platform investment. They require deciding that CI should actually help reviewers make decisions, not just confirm the workflow completed.

Working through Terraform CI gaps in a fast-moving monorepo? This is the kind of platform work I do regularly. Get in touch.

S3 Table Buckets in Terraform: What Nobody Warned Me About

Glenn Gray — Wed, 27 May 2026 12:45:15 +0000

Originally published on graycloudarch.com.

The table bucket created without errors. The Terraform apply was clean. The KMS key was attached. The task definition had the right IAM role.

Then the first read operation hit AccessDenied on a KMS decrypt call, and the error gave no hint about what was actually wrong.

We'd added S3 Table Buckets to a data lake architecture — purpose-built Iceberg storage with 10x faster metadata operations compared to standard S3. The architecture decision was straightforward. The Terraform implementation had five gotchas that weren't in any documentation I found before running into them. This post documents all of them.

The KMS Key Policy Needs a Service Principal You Won't Guess

The KMS principal for S3 Table Buckets is s3tables.amazonaws.com — not s3.amazonaws.com.

This distinction matters because most teams already have a KMS key for their S3 buckets with a key policy that includes s3.amazonaws.com as a service principal. When you encrypt a table bucket with that same key without updating the policy, the bucket creates successfully. ACLs, tags, Terraform state — everything looks right. The failure happens on the first metadata read, when the s3tables service tries to decrypt and the key policy doesn't authorize it.

The error is AccessDenied from KMS, which sends you looking at IAM policies on your application role before you think to check the key policy. The required addition:

{
  "Sid": "AllowS3TableBuckets",
  "Effect": "Allow",
  "Principal": {
    "Service": "s3tables.amazonaws.com"
  },
  "Action": ["kms:GenerateDataKey", "kms:Decrypt"],
  "Resource": "*"
}

In Terraform, this goes in your KMS key resource's policy document — either as an additional statement block in an aws_iam_policy_document data source, or as a JSON merge if your key policy is managed elsewhere. The fix took about 10 minutes once we knew what to look for. Finding it took most of an afternoon.

If you're using a customer-managed key (and you should be for any production data lake), add this statement before you create the table bucket, not after. The bucket will create cleanly either way — the failure only appears at access time.
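Because the bucket creates cleanly either way, a pre-flight check on the key policy is cheap insurance. The sketch below inspects a key policy document that has already been loaded as a dict. It is deliberately simple and makes assumptions: it ignores Deny statements and conditions and only recognises kms:* as a wildcard. The function name and sample policy are hypothetical.

```python
def _as_list(value):
    """IAM JSON allows a single string or a list in several positions."""
    return [value] if isinstance(value, (str, dict)) else list(value)


def service_is_allowed(policy: dict, service: str, actions: set[str]) -> bool:
    """True if Allow statements grant `actions` to the given service principal."""
    granted = set()
    for statement in _as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal", {})
        if not isinstance(principal, dict):
            continue  # e.g. "*": not a service-principal grant
        if service not in _as_list(principal.get("Service", [])):
            continue
        granted.update(_as_list(statement.get("Action", [])))
    return "kms:*" in granted or actions <= granted


needed = {"kms:GenerateDataKey", "kms:Decrypt"}
key_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "AllowS3", "Effect": "Allow",
         "Principal": {"Service": "s3.amazonaws.com"},
         "Action": ["kms:GenerateDataKey", "kms:Decrypt"], "Resource": "*"},
    ],
}
print(service_is_allowed(key_policy, "s3.amazonaws.com", needed))        # True
print(service_is_allowed(key_policy, "s3tables.amazonaws.com", needed))  # False
```

Run against the policy most teams already have, the second call is the one that fails, which is exactly the gap described above.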

The IAM Permissions Are in a Different Namespace

Once the KMS issue was resolved, the next AccessDenied came from a different place: table operations.

Standard S3 permissions — s3:GetObject, s3:PutObject, s3:ListBucket — don't apply to Table Bucket operations. Table Bucket operations live in the s3tables:* namespace: s3tables:GetTableBucket, s3tables:CreateTable, s3tables:GetTableData, s3tables:PutTableData.

Roles with full S3 access have no access to table buckets. Roles with table bucket permissions still need standard S3 for underlying object operations. You need both, explicitly granted.

The minimal IAM policy for a role that reads from table buckets:

data "aws_iam_policy_document" "table_bucket_read" {
  statement {
    actions = [
      "s3tables:GetTableBucket",
      "s3tables:ListTables",
      "s3tables:GetTable",
      "s3tables:GetTableData",
    ]
    resources = [
      aws_s3tables_table_bucket.this.arn,
      "${aws_s3tables_table_bucket.this.arn}/*",
    ]
  }

  # Underlying S3 object access — required in addition to s3tables:* permissions
  statement {
    actions   = ["s3:GetObject", "s3:ListBucket"]
    resources = [
      "arn:aws:s3:::${aws_s3tables_table_bucket.this.name}",
      "arn:aws:s3:::${aws_s3tables_table_bucket.this.name}/*",
    ]
  }
}

For write access, add s3tables:PutTableData, s3tables:CreateTable, and s3tables:DeleteTable as needed. The principle of least privilege applies here more strictly than with standard S3 — there's no s3tables:* wildcard shortcut that's safe to use in production.

Don't Mix Table Buckets and Standard S3 in the Same Terraform Component

This one is subtle and doesn't always bite you immediately.

The aws_s3tables_table_bucket resource uses a different API endpoint from aws_s3_bucket. When both resource types are in the same Terraform root module or Terragrunt component, the AWS provider's resource graph can produce ordering conflicts on concurrent applies. The symptom isn't usually an apply error — it's unexpected diffs on subsequent plans, where a table bucket resource shows changes that shouldn't be there based on the configuration.

The fix is isolation: one Terragrunt component for table buckets, one for standard S3 buckets, both pulling encryption keys from a separate KMS component. The dependency chain is explicit and clean:

terraform/
└── data-lake/
    ├── kms/                  # KMS key — created first
    │   └── main.tf
    ├── s3-standard/          # landing, raw, curated S3 buckets
    │   ├── main.tf
    │   └── terragrunt.hcl    # dependency on ../kms
    └── s3-table-buckets/     # table buckets in isolated state
        ├── main.tf
        └── terragrunt.hcl    # dependency on ../kms

The standard S3 and table bucket components have no dependency on each other — they both depend on the KMS component and nothing else. If a table bucket apply fails, it doesn't touch standard S3 state. terraform plan for one doesn't show noise from the other.

Check Region Availability Before Designing Around Table Buckets

S3 Table Buckets launched in a limited set of regions and have been expanding, but as of mid-2026 they're still not available everywhere. The list includes us-east-1, us-west-2, eu-west-1, and a handful of others — but not all regions where you might run a data platform.

The check is fast:

aws s3tables list-table-buckets --region us-east-1
# Returns an empty list if available, an endpoint error if not

If Table Buckets aren't available in your required region, the fallback is the pre-Table Buckets architecture: standard S3 plus Glue Data Catalog for Iceberg metadata management. That architecture works well and is broadly available. The Iceberg lakehouse post covers it.

Don't let region availability be a late discovery. Run this check before the architecture is committed.

terraform import for Existing Table Buckets Doesn't Work Cleanly

If a table bucket was created manually — console, CLI, a one-off script — before your Terraform module existed, bringing it under IaC management is messy.

The terraform import command for aws_s3tables_table_bucket expects a resource ID format that's different from what you'd derive from the ARN. The exact format is the table bucket name, not the ARN, not the resource ID from the console. AWS documentation is inconsistent about this.

Even when the import runs without errors, the resulting state may show plan diffs for attributes like created_at and arn that Terraform can't manage but includes in the resource schema. These show up as perpetual diffs that you can't suppress cleanly.

The safer path:

# Reference the existing bucket with a data source — don't try to import it
data "aws_s3tables_table_bucket" "existing" {
  name = "my-existing-table-bucket"
}

# Reference it from other resources
resource "aws_s3tables_namespace" "raw" {
  table_bucket_arn = data.aws_s3tables_table_bucket.existing.arn
  namespace        = "raw"
}

Use the data source to reference the existing bucket, manage new namespaces and tables via Terraform, and defer full ownership transfer (replacing the manually-created bucket with a Terraform-managed one) to a planned migration window. It's more work than import, but the state stays clean.

The architecture itself — Table Buckets replacing Glue Data Catalog for Iceberg metadata — is solid. These are operational details that mostly show up after the design decision is made. Better to find them here than at 2am during a data pipeline deployment.

Building out a data lake architecture on AWS and running into Table Bucket or Iceberg issues? This is the kind of platform work I do regularly. Get in touch.

ECS vs EKS in 2026: The Decision Framework—Including ECS Anywhere

Glenn Gray — Tue, 19 May 2026 14:19:34 +0000

Originally published on graycloudarch.com.

The CTO wanted to know why the platform team had picked EKS for their new environment. They'd been running ECS for two years without issues. The team lead explained they needed GitOps, better autoscaling, and "industry-standard tooling."

Three months later, they were debugging a cert-manager webhook failure at 11am. Two engineers had spent 30 hours the previous month on cluster operations. They hadn't shipped a net-new feature in six weeks.

EKS wasn't wrong for them. The timing was. They had three engineers, twelve services, and no one who'd operated a Kubernetes cluster in production before. The ecosystem they wanted required them to operate it first.

This is the ECS vs EKS conversation most teams don't have until after they've made the choice.

The Actual Decision Axis

Feature comparisons miss the point. Both ECS and EKS run containers reliably. The real question is: what does your team have to operate to make that happen — and what's the cost of getting it wrong?

Two axes matter:

Operational capacity: How much complexity can your team absorb while still shipping product? A 3-engineer platform team and a 15-engineer platform team are not playing the same game.

Kubernetes maturity: Have your engineers operated k8s in production under pressure? "We've done some k8s" and "we've debugged etcd under load" are not the same thing.

The answer to which one you should use today often changes in 18 months. A team that's right for ECS now may be right for EKS after their platform engineers have shipped 6 months of Kubernetes work. Building with that arc in mind matters.

What ECS Actually Gives You

No control plane. That's the headline. With Fargate, there are no nodes to patch, no node groups to right-size, no kubelet to troubleshoot. AWS manages the underlying compute entirely.

The IAM model is simpler by design. Task roles attach directly to task definitions — no service accounts, no IRSA, no Web Identity tokens to wire up. For engineers coming from EC2-era IAM, this maps cleanly to what they already know.

ECS Fargate has no cluster fixed cost. EKS charges $0.10/hr per cluster — $72/month whether you're running one service or fifty. At low service counts or in non-production environments, that difference is real.
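The fixed cost is easy to underestimate because it multiplies per cluster, not per service. A quick sketch of the arithmetic, using the $0.10/hr figure above and a 720-hour month; the three-environment example is hypothetical.

```python
EKS_CLUSTER_CENTS_PER_HOUR = 10  # $0.10/hr control-plane fee, kept in cents to avoid float drift


def eks_control_plane_monthly_usd(clusters: int, hours_per_month: int = 720) -> float:
    """Fixed EKS control-plane cost per month, before any compute."""
    return clusters * EKS_CLUSTER_CENTS_PER_HOUR * hours_per_month / 100


print(eks_control_plane_monthly_usd(1))   # 72.0
print(eks_control_plane_monthly_usd(3))   # 216.0 for dev, staging, and prod clusters
```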

AWS integrations are first-class rather than plugged in. ALB target group registration, CloudMap service discovery, Secrets Manager injection via ECS container secrets — these work without Helm charts or CRDs. The AWS API surface and the ECS API surface are the same surface.

The internal tools team: 3 engineers, zero Kubernetes background, 8 services. ECS Fargate with a shared Terraform module got them to production in three weeks. No platform team required.

What EKS Actually Gives You

Ecosystem depth that ECS simply doesn't have. Karpenter for bin-packing and just-in-time node provisioning. KEDA for event-driven autoscaling off SQS, Kafka, or custom metrics. Argo CD or Flux for GitOps with real reconciliation loops. External Secrets Operator, cert-manager, Prometheus Operator: the tooling is mature, battle-tested, and actively maintained.

ECS has no equivalent. The closest alternatives are either AWS-native (EventBridge Pipes, Application Auto Scaling) and less flexible, or custom-built and unmaintained after the engineer who wrote them leaves.

Karpenter in particular changes the EC2 cost math at scale. Intelligent bin-packing and spot interruption handling can cut compute costs 30-50% compared to fixed node groups. Below 20-30 nodes the savings often don't justify the operational overhead. Above that, it's hard to ignore.

Multi-cloud portability is real if you actually need it. Kubernetes manifests transfer to GKE or AKS. ECS task definitions do not. If "running this workload outside AWS" is a real scenario — not just theoretical — that matters.

The data platform I worked on: mixed batch and streaming workloads, KEDA scaling on SQS queue depth. ECS autoscaling would have required custom CloudWatch metrics and polling-based triggers. KEDA handled it natively in 20 lines of YAML. That alone settled the decision.

The Decision Tree

Walk through these in order. First yes wins.

Zero Kubernetes experience on the team? → ECS. The operational cost of learning k8s while building product is real and usually underestimated. The 30-hour/month cluster ops tax from the story above was paid by a team that had some k8s experience. Zero experience is worse.

Migrating from an existing ECS platform? → ECS. Rewrite and replatform simultaneously fails more often than it succeeds. Stabilize on ECS, migrate later when the workload is boring.

Need KEDA, custom-metric HPA, or Karpenter? → EKS. ECS autoscaling is Application Auto Scaling against CloudWatch metrics. It works, but the ceiling is lower and the custom metric path is significantly more work.

Need GitOps with Argo CD or Flux? → EKS. ECS has no native GitOps story. You can build one — CodePipeline + ECS deployment, Terraform-driven deployments — but you're building it. The operational difference is significant.

Five or more services sharing infrastructure? → EKS. The fixed cost justifies it; shared node pools improve utilization; the per-service overhead of ECS task definitions multiplies fast.

Default → ECS Fargate. Simpler, cheaper to start, and the migration path to EKS is well-understood.
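Because the first yes wins, the order of the questions is part of the logic. The sketch below encodes the tree as written; the TeamProfile fields are hypothetical names for the five questions, and the thresholds are the ones from the list above.

```python
from dataclasses import dataclass


@dataclass
class TeamProfile:
    has_k8s_experience: bool
    migrating_from_ecs: bool
    needs_advanced_autoscaling: bool  # KEDA, custom-metric HPA, or Karpenter
    needs_gitops: bool                # Argo CD or Flux
    services_sharing_infra: int


def choose_platform(team: TeamProfile) -> tuple[str, str]:
    """Walk the questions in order and return (platform, reason) for the first yes."""
    if not team.has_k8s_experience:
        return ("ECS", "zero Kubernetes experience on the team")
    if team.migrating_from_ecs:
        return ("ECS", "already on ECS; stabilize first, migrate later")
    if team.needs_advanced_autoscaling:
        return ("EKS", "needs KEDA, custom-metric HPA, or Karpenter")
    if team.needs_gitops:
        return ("EKS", "needs GitOps with Argo CD or Flux")
    if team.services_sharing_infra >= 5:
        return ("EKS", "five or more services sharing infrastructure")
    return ("ECS Fargate", "default")


# A team that wants GitOps and better autoscaling but has never run k8s in production.
wants_eks = TeamProfile(False, False, True, True, 12)
print(choose_platform(wants_eks))
# ('ECS', 'zero Kubernetes experience on the team')
```

The example shows why ordering matters: the team's wish list points at EKS, but the experience question is asked first and settles it.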

ECS Anywhere: The Third Option

ECS Anywhere gets overlooked in most comparisons because it doesn't fit neatly into a "cloud vs cloud" framing. It should be in the decision tree.

ECS Anywhere lets you register non-AWS compute — on-premises servers, VMs in other clouds, edge devices — as ECS external instances. Your task definitions, IAM roles, and tooling stay the same. The ECS control plane in AWS manages scheduling. The compute runs wherever you've registered it.

Where this actually wins:

Regulated environments with data residency requirements. If certain workloads must stay on-premises for compliance, ECS Anywhere lets you run them with the same tooling as your AWS workloads. On the GovCloud platform I built, we had ground system software that had to process flight data on local hardware before transmission. ECS Anywhere would have let us manage those workloads from the same ECS cluster as our cloud services — same Terraform modules, same IAM patterns, same observability pipeline.

Brownfield migration. If you're moving workloads from on-premises to AWS and want a consistent deployment target during the migration, ECS Anywhere gives you that. Register the on-prem servers, migrate task by task, deregister when done.

Edge compute. Consistent deployment tooling across dozens of edge nodes without running a k8s control plane at each site.

The constraint: ECS Anywhere instances are external infrastructure you own and patch. Fargate's "no nodes to manage" advantage disappears. The tradeoff is deliberate — you're accepting node management in exchange for placement control.

The Migration Path

ECS → EKS migration is well-understood and not particularly risky if the IaC is clean.

Containerized workloads move without changes. The two meaningful changes are IAM (task roles → IRSA service accounts — mechanical, not complex) and networking (ALB target group registration → Ingress or Service — also mechanical).

What breaks the migration is task definitions in CloudFormation or hand-managed console resources. If your ECS deployment is 100% Terraform with a module per service, the migration is boring. If it's six engineers' worth of one-off console configurations, it's archaeology.

Build ECS as if you'll migrate it. Keep task definitions in Terraform modules, service definitions composable, networking configuration explicit. The Jira ticket for "migrate from ECS to EKS" should feel like plumbing work, not a project.

Mistakes I See Repeatedly

Choosing EKS because it's "industry standard." Industry standard at Stripe is not industry standard at a 40-person SaaS company. The operational tax is the same either way.

Choosing ECS without accounting for the autoscaling ceiling. For workloads with bursty, event-driven traffic patterns, ECS autoscaling requires CloudWatch custom metrics and Application Auto Scaling policies that are genuinely annoying to tune. Know the ceiling before you hit it.

Single-cluster EKS for two services. The fixed cost of the control plane ($72/month), the operational overhead of running Kubernetes, and the learning curve are all real. For two or three services, this almost never makes sense.

Underestimating the Helm/CRD surface area. When a Helm-managed CRD conflicts with another controller at 2am, you need someone on the team who can debug it. "We'll figure it out" is not a plan.

Building a new platform or rearchitecting an existing container environment? The choice between ECS, EKS, and ECS Anywhere usually comes down to where your team is on the Kubernetes maturity curve and what your autoscaling requirements actually are — not which technology is more capable. Get in touch if you're working through this decision — it's a conversation I have with platform teams regularly, and the right answer depends on specifics that don't fit in a blog post.

Building Apache Iceberg Lakehouse Storage with S3 Table Buckets

Glenn Gray — Mon, 18 May 2026 17:54:46 +0000

Originally published on graycloudarch.com.

The data platform team had a deadline and a storage decision to make. They'd committed to Apache Iceberg as the table format — open standard, time travel, schema evolution, the usual reasons. What they hadn't locked down was where the data was actually going to live, and whether the storage layer would hold up under the metadata-heavy access patterns Iceberg requires.

The default answer is regular S3. It works. Most Iceberg deployments run on it. But AWS launched S3 Table Buckets in late 2024, and they're purpose-built for exactly this workload: Iceberg metadata operations. The numbers made the decision easy — 10x faster metadata queries, 50% or more improvement in query planning time compared to standard S3. The gotcha worth knowing upfront: S3 Table Bucket support requires AWS Provider 5.70 or later. If your Terraform modules are pinned to an older provider version, that's your first upgrade.

We built the storage layer as a three-zone medallion architecture, fully managed with Terraform. Here's how we did it — including a few things about Table Buckets that don't show up in most writeups.

The Medallion Architecture

One table bucket per environment. The three zones (raw, clean, curated) are namespaces inside the bucket, not separate buckets and not separate Glue databases in the legacy sense.

Raw is immutable. Once data lands there, it doesn't change — ETL failures don't corrupt the source record because the source record is untouched. Clean is normalized and domain-aligned, produced by Spark transforms. Curated is the analytics layer that Athena queries and BI dashboards read from.

The namespace naming convention we used was {zone}_{domain} — raw_crm, clean_customer, curated_sales_metrics. When you're looking at a table in Athena or debugging a failed transform job, the namespace name tells you exactly what tier you're in and what domain you're touching. Data lineage is readable from table names alone.
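A convention like this holds up better when something enforces it. A small sketch of a validator you could run in CI or in a transform job; the helper names and the lowercase-with-underscores rule for the domain part are assumptions, not part of the original setup.

```python
import re

ZONES = ("raw", "clean", "curated")
# Assumed rule: lowercase alphanumeric words joined by single underscores.
DOMAIN_RE = re.compile(r"[a-z0-9]+(?:_[a-z0-9]+)*")


def namespace_name(zone: str, domain: str) -> str:
    """Build a {zone}_{domain} namespace name, rejecting unknown zones."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone {zone!r}; expected one of {ZONES}")
    if not DOMAIN_RE.fullmatch(domain):
        raise ValueError(f"invalid domain {domain!r}")
    return f"{zone}_{domain}"


def split_namespace(name: str) -> tuple[str, str]:
    """Recover (zone, domain); the domain may itself contain underscores."""
    zone, _, domain = name.partition("_")
    namespace_name(zone, domain)  # reuse the validation above
    return zone, domain


print(namespace_name("curated", "sales_metrics"))   # curated_sales_metrics
print(split_namespace("raw_crm"))                   # ('raw', 'crm')
```

Splitting on the first underscore only is what lets multi-word domains like sales_metrics round-trip.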

Why Two Modules Instead of One

The first design question was whether to build a single composite module that creates the KMS key and the S3 Table Bucket together, or split them into separate modules. We split them.

The KMS key isn't just for the lake. It's used by five downstream services: Athena for query results, EMR for cluster encryption, MWAA for DAG storage, Kinesis for stream encryption, and Glue for transform outputs. If we bundled the key into the lake storage module, every one of those services would need a dependency chain that eventually resolves back through lake storage just to get a KMS key ARN. Separate modules mean the key has one owner, and everything else declares a dependency on it independently.

The KMS module:

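# kms-key/main.tf
# Account ID for the root-principal statement in the key policy below
data "aws_caller_identity" "current" {}

resource "aws_kms_key" "this" {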
  description             = var.description
  enable_key_rotation     = var.enable_key_rotation
  deletion_window_in_days = var.deletion_window_in_days

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "Enable IAM User Permissions"
        Effect = "Allow"
        Principal = { AWS = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root" }
        Action   = "kms:*"
        Resource = "*"
      },
      {
        Sid    = "Allow Service Access"
        Effect = "Allow"
        Principal = { Service = var.service_principals }
        Action = ["kms:Decrypt", "kms:GenerateDataKey", "kms:CreateGrant"]
        Resource = "*"
      }
    ]
  })
}

The service_principals variable takes a list of service principal strings — ["athena.amazonaws.com", "glue.amazonaws.com", "emr-serverless.amazonaws.com"] and so on. Adding a new service that needs key access is one line in the Terragrunt config, no module change required.
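For reference, the variable behind that input could look roughly like this. The validation rule is my addition for illustration and is not part of the module as described above:

```hcl
# kms-key/variables.tf (sketch)
variable "service_principals" {
  description = "AWS service principals allowed to use the key"
  type        = list(string)

  # Assumed guard: reject entries that are not service principal strings
  validation {
    condition = alltrue([
      for p in var.service_principals : can(regex("\\.amazonaws\\.com$", p))
    ])
    error_message = "Each entry must be a service principal such as athena.amazonaws.com."
  }
}
```

The Terragrunt leaf then only carries the list itself in its inputs block, which is why a new service is a one-line change.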

The S3 Table Bucket Module

The table bucket itself is straightforward:

# s3-table-bucket/main.tf
resource "aws_s3tables_table_bucket" "this" {
  name = var.bucket_name
}

One important thing that trips people up: S3 Table Buckets are not standard S3 buckets. They use the S3 Tables API, not the standard S3 API. Several standard S3 resources will fail with NoSuchBucket (404) if you try to attach them to a Table Bucket:

aws_s3_bucket_versioning
aws_s3_bucket_server_side_encryption_configuration
aws_s3_bucket_public_access_block
aws_s3_bucket_intelligent_tiering_configuration

Encryption is managed internally — AES256 is applied on creation automatically. You'll want ignore_changes = [encryption_configuration] in your lifecycle block or Terraform will constantly detect drift.
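In the module, that lifecycle rule would sit on the table bucket resource. A sketch, assuming a provider version that exposes encryption_configuration on aws_s3tables_table_bucket (older versions may not, in which case the reference will not validate):

```hcl
variable "bucket_name" {
  type = string
}

resource "aws_s3tables_table_bucket" "this" {
  name = var.bucket_name

  lifecycle {
    # Encryption is applied by the service on creation; ignoring the
    # attribute stops Terraform from reporting it as drift on every plan.
    ignore_changes = [encryption_configuration]
  }
}
```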

The Terragrunt dependency chain wires the KMS key ARN into the table bucket configuration:

# lake-storage/terragrunt.hcl
dependency "kms" {
  config_path = "../kms-key"
}

inputs = {
  bucket_name = "company-lake-${local.environment}"
  kms_key_arn = dependency.kms.outputs.key_arn
}

Glue Is Not the Catalog

This is the part that most S3 Table Bucket writeups get wrong, and it matters for how you structure the rest of your Terraform.

S3 Tables is the metadata source of truth. Glue is the integration layer. When you enable the S3 Tables analytics integration, AWS creates a federated catalog named s3tablescatalog in your Glue Data Catalog. Table buckets, namespaces, and tables are surfaced through that catalog hierarchy — Athena and EMR see them through Glue, but Glue doesn't own them.

This means you should not be creating aws_glue_catalog_database resources with location_uri S3 paths and trying to wire Iceberg metadata parameters onto them. That's the legacy Glue-over-S3-prefixes model. For S3 Tables, the catalog structure comes from the table bucket integration, not from manual Glue database provisioning.

In Terraform, aws_s3tables_table_bucket_policy handles access control on the table bucket; the analytics integration itself is enabled at the account level. Once enabled, Athena queries S3 Tables through the s3tablescatalog catalog automatically.

The namespace naming convention (raw, clean, curated with domain suffixes) is defined in the table bucket itself, not in Glue. Glue reflects it — it doesn't own it.

The Cost Model

For a 100TB lake, the comparison against standard S3 holds:

Storage Class	When	Monthly Cost
Standard	Active data	~$2,300
Standard-IA equivalent	Less-accessed data	~$400
Glacier equivalent	Archive	~$100

The metadata acceleration charge for Table Buckets is $0.00025 per 1,000 requests — on a 100TB lake with typical Iceberg file sizes, that's a few dollars a month. The performance improvement compounds the cost picture: 10x faster query planning means less Athena scan time, which means lower query costs as data volume grows.

One note: you cannot attach aws_s3_bucket_intelligent_tiering_configuration to a Table Bucket — it's a standard S3 resource and will fail. Storage cost optimization for Table Buckets happens through compaction and retention maintenance jobs (typically run on a schedule via MWAA or EMR), not through lifecycle policies.

Deployment Sequence

The deployment order is driven by dependencies: KMS must exist before S3 (bucket encryption needs the key ARN), and both must exist before the S3 Tables analytics integration (which creates the federated Glue catalog surface).

In practice, across three environments (dev, nonprod, prod), the full deployment took about four hours. Most of that was Terragrunt apply time — the actual resource creation for each component is fast, but we ran plan, reviewed, applied, and verified before moving to the next environment.

One deployment note: if you're using Athena and haven't enabled S3 Tables analytics integration in the account before, do that before the apply. Athena can query S3 Tables only after the integration is enabled and the s3tablescatalog catalog is visible in the Glue Data Catalog.

What the Data Team Inherited

When we handed this over to the data engineering team, they had a fully provisioned storage foundation — one table bucket per environment, three namespaces per bucket, encryption enabled, and Athena wired to query through the s3tablescatalog integration. They could start writing Spark jobs and creating tables immediately without worrying about storage configuration or catalog wiring after the fact.

The Terraform modules are reusable. Adding a new environment is one Terragrunt leaf config. Adding a new domain namespace is a namespace declaration on the existing bucket. The KMS key and integration configuration don't change.

S3 Table Buckets are still relatively new, and the Terraform provider support came together in late 2024. If your team is planning an Iceberg migration and hasn't evaluated Table Buckets yet, the metadata performance gains make a strong case for starting there rather than retrofitting later — just go in knowing they're a different API surface than standard S3, and structure your modules accordingly.

Building out a data platform and figuring out the storage and catalog architecture? Get in touch — this kind of infrastructure design work is something I do regularly, whether you're starting from scratch or migrating an existing lake.

The 5-Minute Tax I Killed With GitHub Actions

Glenn Gray — Mon, 18 May 2026 17:43:31 +0000

Originally published on graycloudarch.com.

Every time I finished writing a blog post, I had to do this:

cd sites/graycloudarch
hugo --minify
aws s3 sync public/ s3://graycloudarch-website --delete
aws cloudfront create-invalidation --distribution-id E1234ABCDEF --paths "/*"

Five minutes. Doesn't sound like much.

But when you're trying to publish 2-3 posts per week while working full-time, those 5 minutes add up. Not just in time—in friction.

"I just finished writing. Now I need to context-switch to deployment mode. What was that CloudFront ID again?"

Friction kills momentum.

What I Wanted

git push → site updates automatically → I move on to the next thing.

Zero thinking. Zero context switching. Zero "oh crap, I forgot to invalidate CloudFront."

The Solution: GitHub Actions

GitHub Actions can build and deploy your site every time you push to main. For free.

Here's the whole workflow:

name: Deploy graycloudarch.com

on:
  push:
    branches: [main]
    paths:
      - 'sites/graycloudarch/**'
      - 'content/graycloudarch/**'

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: true

      - uses: peaceiris/actions-hugo@v2
        with:
          hugo-version: 'latest'
          extended: true

      - name: Build site
        working-directory: ./sites/graycloudarch
        run: hugo --minify

      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Deploy
        working-directory: ./sites/graycloudarch
        run: |
          aws s3 sync public/ s3://graycloudarch-website --delete
          aws cloudfront create-invalidation \
            --distribution-id ${{ secrets.CLOUDFRONT_DISTRIBUTION }} \
            --paths "/*"

That's it. Push to main, GitHub Actions handles the rest.

The Part That Tripped Me Up

Hugo themes are usually Git submodules. If you don't check them out, your build fails with cryptic errors about missing layouts.

- uses: actions/checkout@v4
  with:
    submodules: true  # Don't forget this

Cost me 20 minutes of debugging before I realized. Now it's documented in code, not lost in my bash history.

Path Filtering: The Secret Sauce

I run two sites in one repo: graycloudarch.com and cloudpatterns.io.

Without path filtering, every push rebuilds both sites, even if I only changed one. Wasted build minutes, unnecessary CloudFront invalidations, slower feedback.

on:
  push:
    paths:
      - 'sites/graycloudarch/**'
      - 'content/graycloudarch/**'

Now GitHub Actions only runs when files for that site change. Fast, efficient, no waste.

Why This Matters

I'm trying to hit $3K/month by March 31. That's 9 weeks.

Every minute I spend deploying is a minute I'm not writing, not reaching out to clients, not building the course I want to sell.

Manual deployments are a tax on my time. This workflow eliminated that tax.

Now when I finish writing, I commit and push. Two minutes later, it's live. I'm already working on the next post.

The Real Win

It's not the 5 minutes per deployment.

It's the mental overhead.

Before: "Okay, post is done. Now I need to switch gears, build Hugo, sync to S3, remember that CloudFront command..."

After: "Post is done. git push. What's next?"

No context switch. No friction. Just ship and move on.

That's worth way more than 5 minutes.

Want to set this up for your site? The workflow above works for any Hugo + S3 + CloudFront setup. Just plug in your bucket names and distribution IDs in GitHub Secrets.

Or reach out if you want help automating your deployments. I do this for a living.

I Spent 6 Hours Automating a 30-Minute Task (And I'd Do It Again)

Glenn Gray — Mon, 18 May 2026 17:43:30 +0000

Originally published on graycloudarch.com.

Look, I know what you're thinking. "Glenn, you could've just clicked through the AWS console and had both sites live in an hour."

You're not wrong.

But here's the thing—I'm allergic to clicking through consoles. It's a professional hazard from spending the last 5 years building enterprise platforms where "just do it manually" gets you fired.

So when I sat down to launch graycloudarch.com and cloudpatterns.io, I did what any reasonable person would do: I spent 6 hours writing Terraform to automate a 30-minute task.

The Manual Way (aka Hell)

If I'd done this the normal way:

AWS Console → ACM → Request Certificate
Copy the DNS validation CNAME
Cloudflare → Add DNS record
Wait. Refresh. Wait more.
AWS Console → CloudFront → Create Distribution
Copy CloudFront domain
Cloudflare → Add another DNS record
Test. Find typo. Fix typo. Test again.
Repeat for second domain.

Time: 40 minutes if nothing breaks (it always breaks).

Chance I'd screw up a DNS record: 80%.

The Automated Way (aka Overkill)

One Terraform apply. That's it.

terraform apply
# Go make coffee
# Come back to two working sites

But the real magic isn't the deployment—it's what happens when AWS generates those ACM validation records:

resource "cloudflare_record" "cert_validation" {
  for_each = {
    for dvo in aws_acm_certificate.site.domain_validation_options :
      dvo.domain_name => {
        name   = dvo.resource_record_name
        record = dvo.resource_record_value
        type   = dvo.resource_record_type
      }
  }

  zone_id = data.cloudflare_zone.site.id
  name    = each.value.name
  value   = each.value.record
  type    = each.value.type
}

Terraform reads the validation records from AWS, creates them in Cloudflare, and waits for validation to complete. Zero copy-paste. Zero switching between browser tabs. Zero forgetting which CNAME goes where.
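The waiting step is not visible in the record resource itself. In a typical setup it comes from a separate validation resource; the sketch below assumes the aws_acm_certificate.site and cloudflare_record.cert_validation resources from the snippet above live in the same module, and the resource label "site" is just a placeholder:

```hcl
# Blocks until ACM reports the certificate as issued.
resource "aws_acm_certificate_validation" "site" {
  certificate_arn = aws_acm_certificate.site.arn

  # hostname is the FQDN Cloudflare reports for each validation record
  validation_record_fqdns = [
    for record in cloudflare_record.cert_validation : record.hostname
  ]
}
```

Pointing the CloudFront distribution at aws_acm_certificate_validation.site.certificate_arn, rather than at the certificate directly, makes Terraform hold the distribution until validation has finished.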

I don't touch Cloudflare. I don't touch AWS Console. I just run terraform apply and go do something useful.

Why This Matters (Spoiler: It's Not About Terraform)

I'm trying to hit $3K/month by March 31. That's 9 weeks away.

Every hour I spend clicking through AWS is an hour I'm not:

Writing blog posts
Reaching out to potential clients on LinkedIn
Building the course I want to sell
Actually making money

Manual infrastructure doesn't generate revenue. Published content generates revenue.

So yeah, I spent 6 hours automating something I could've done in 30 minutes. But now when I launch my third brand (and I will), it takes 10 minutes and one terraform apply.

That's the bet: upfront investment for long-term velocity.

What I Actually Built

The module is dead simple:

ACM certificate with DNS validation
S3 bucket for static hosting
CloudFront distribution
Cloudflare DNS records (both root and www)

Call it twice (once per brand), different inputs, same code:

module "graycloudarch" {
  source      = "../../modules/static-site"
  domain_name = "graycloudarch.com"
  bucket_name = "graycloudarch-website"
}

module "cloudpatterns" {
  source      = "../../modules/static-site"
  domain_name = "cloudpatterns.io"
  bucket_name = "cloudpatterns-website"
}

That's it. No duplication. No drift. No "wait, which CloudFront ID goes with which domain?"

The Part Where I Screwed Up

Of course it didn't work perfectly the first time.

Turns out when you register a domain through Cloudflare, they helpfully create a default parking page DNS record. When Terraform tried to create my root CNAME, it failed with "record already exists."

Took me 20 minutes to figure out I needed allow_overwrite = true in the Cloudflare resource.
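For anyone hitting the same error, the flag goes on the record resource. A sketch with hypothetical variable names, assuming a Cloudflare provider version that still ships cloudflare_record with the value argument used earlier in this post:

```hcl
# Hypothetical inputs; in the real module these would come from the zone
# data source and the CloudFront distribution.
variable "zone_id" { type = string }
variable "domain_name" { type = string }
variable "cloudfront_domain_name" { type = string }

resource "cloudflare_record" "root" {
  zone_id = var.zone_id
  name    = var.domain_name
  type    = "CNAME"
  value   = var.cloudfront_domain_name

  # Lets Terraform take over a matching pre-existing record
  # instead of failing with "record already exists"
  allow_overwrite = true
}
```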

20 minutes I'll never get back. But at least it's documented in Git now, not lost in my bash history.

Would I Do This Again?

Absolutely.

Not because it's faster (it's not, the first time).

Not because it's easier (it's definitely not).

Because when I'm sitting at 2am writing my fifth blog post of the week and I realize I need to spin up a third site for a new product line, I can do it in 10 minutes instead of canceling my writing session to spend 45 minutes in AWS console.

Automation is a bet on future you. I'm betting future Glenn will appreciate not having to remember how SSL validation works.

Want the code? It's not open source (yet), but if you're building something similar and want to talk through the architecture, hit me up. I'm always down to talk Terraform.

Or if you just want to tell me I'm insane for spending 6 hours on this, that's cool too. My DMs are open.

The IAM Trust Policy Chicken-and-Egg (That Isn't)

Glenn Gray — Wed, 13 May 2026 17:55:53 +0000

Originally published on graycloudarch.com.

The pipeline role needed to trust the deployment role. The deployment role needed to trust the pipeline role. When I wrote both in Terraform and ran plan, it stopped:

Error: Cycle: module.pipeline.aws_iam_role.exec → module.deploy.aws_iam_role.target → module.pipeline.aws_iam_role.exec

The instinct is to create one role first, then go back and edit the trust policy of the other after it exists. A manual bootstrap step. It works. It also means you can't terraform apply from a clean state and get a working result — someone has to remember the second pass. The IaC tells half the story.

There's a better answer. IAM trust policies don't validate that the ARNs they reference actually exist. AWS stores the JSON document and moves on. The cycle Terraform sees is real — it's a real edge in its dependency graph. The underlying constraint that dependency represents is not.

ARNs are deterministic before creation

IAM role ARNs follow a fixed format:

arn:aws:iam::<account-id>:role/<role-name>

The account ID is fixed. The role name is chosen at definition time. Which means the full ARN is computable before terraform apply runs — before the resource exists — as long as the name is stable.

AWS does not validate that a referenced principal ARN exists when you create or update a trust policy. It stores the JSON. The role becomes assumable once both sides exist, regardless of which one was created first.

This is different from a configuration error like referencing a nonexistent IAM role in an aws_iam_role_policy_attachment — that fails at apply time because Terraform tries to call the API and gets an error. A trust policy is just a JSON document stored against the role. If the ARN in the Principal field doesn't resolve to an existing entity yet, IAM doesn't complain. It just doesn't match anything. Yet.

The cycle Terraform sees

The dependency graph problem is real. Here's the code that creates it:

resource "aws_iam_role" "role_a" {
  assume_role_policy = jsonencode({
    Statement = [{
      Principal = { AWS = aws_iam_role.role_b.arn }  # depends on role_b
    }]
  })
}

resource "aws_iam_role" "role_b" {
  assume_role_policy = jsonencode({
    Statement = [{
      Principal = { AWS = aws_iam_role.role_a.arn }  # depends on role_a
    }]
  })
}

Terraform resolves: role_a needs role_b's ARN before creation → role_b needs role_a's ARN before creation → cycle. It stops before creating either resource.

The fix removes the dependency by computing what you already know:

data "aws_caller_identity" "current" {}

locals {
  account_id = data.aws_caller_identity.current.account_id

  role_a_arn = "arn:aws:iam::${local.account_id}:role/${var.role_a_name}"
  role_b_arn = "arn:aws:iam::${local.account_id}:role/${var.role_b_name}"
}

resource "aws_iam_role" "role_a" {
  name = var.role_a_name
  assume_role_policy = jsonencode({
    Statement = [{
      Principal = { AWS = local.role_b_arn }  # string, no Terraform dependency
    }]
  })
}

resource "aws_iam_role" "role_b" {
  name = var.role_b_name
  assume_role_policy = jsonencode({
    Statement = [{
      Principal = { AWS = local.role_a_arn }  # string, no Terraform dependency
    }]
  })
}

No cycle. Both roles are created in a single apply. The trust relationship is live as soon as both resources exist — which they will be, after the same plan.
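One thing to verify in your own account before relying on this: as far as I know, IAM resolves a role ARN in a trust policy's Principal element when the policy is saved, and can reject it with MalformedPolicyDocument (Invalid principal in policy) if that role does not exist yet. If you hit that, a variant keeps the same constructed ARNs but moves them into a condition, with the account root as the principal. Condition values are stored as plain strings. The sketch below reuses the variable and local names from the example above:

```hcl
data "aws_caller_identity" "current" {}

variable "role_a_name" { type = string }
variable "role_b_name" { type = string }

locals {
  account_id = data.aws_caller_identity.current.account_id
  root_arn   = "arn:aws:iam::${local.account_id}:root"

  role_a_arn = "arn:aws:iam::${local.account_id}:role/${var.role_a_name}"
  role_b_arn = "arn:aws:iam::${local.account_id}:role/${var.role_b_name}"
}

resource "aws_iam_role" "role_a" {
  name = var.role_a_name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = local.root_arn } # account root always exists
      # Narrow the trust to role_b via a string condition
      Condition = { ArnEquals = { "aws:PrincipalArn" = local.role_b_arn } }
    }]
  })
}

resource "aws_iam_role" "role_b" {
  name = var.role_b_name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = local.root_arn }
      Condition = { ArnEquals = { "aws:PrincipalArn" = local.role_a_arn } }
    }]
  })
}
```

The trade-off: with the account root as principal, each calling role also needs an identity policy that allows sts:AssumeRole on the other role's ARN. That policy can use the same constructed strings, so the graph still has no cycle.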

Where this pattern appears in practice

Cross-account deployment pipeline

CodePipeline execution role in account A assumes a deployment role in account B. The deployment role's trust policy needs to reference the pipeline role's ARN. Each Terraform root manages its own account's roles. The ARN construction pattern resolves the cross-account dependency: each module constructs the other account's role ARN from var.pipeline_account_id and a known role name — values passed in at plan time from tfvars or remote state outputs.

ECS task role and execution role

Launching an ECS task involves two roles: the execution role, which the ECS agent uses to pull images and write logs, and the task role, which the application code runs as. The principal that launches the task needs iam:PassRole on both. Some teams want the task role's trust policy to explicitly list the execution role's ARN as the allowed principal. You don't need to: ecs-tasks.amazonaws.com as the service principal removes the dependency entirely. But if your security posture requires explicit principal ARNs rather than the service principal, ARN construction handles it without a two-pass apply.

Permission boundary bootstrap with an SCP

An SCP requires that all new IAM roles include a specific permission boundary policy. The boundary is a managed policy that must exist before any roles referencing it can be created. This isn't a circular dependency — it's a sequential one. The boundary policy must be applied first, separately. Construct its ARN deterministically (arn:aws:iam::${var.account_id}:policy/${var.boundary_name}) and pass it in wherever roles are created. Document the bootstrap order with a Terraform precondition block or a clear README section. Different problem, different fix.
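A rough sketch of what that can look like on the role side. The role name, the trust policy, and the precondition wording are illustrative assumptions; the data source lookup is what actually fails the plan if the boundary policy has not been applied yet:

```hcl
variable "account_id" { type = string }
variable "boundary_name" { type = string }

locals {
  boundary_arn = "arn:aws:iam::${var.account_id}:policy/${var.boundary_name}"
}

# Errors at plan time when the boundary policy does not exist yet
data "aws_iam_policy" "boundary" {
  arn = local.boundary_arn
}

resource "aws_iam_role" "workload" {
  name                 = "example-workload" # hypothetical name
  permissions_boundary = data.aws_iam_policy.boundary.arn

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
    }]
  })

  lifecycle {
    # The error message doubles as bootstrap documentation
    precondition {
      condition     = var.boundary_name != ""
      error_message = "boundary_name is required. Apply the permission boundary component first, then this one."
    }
  }
}
```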

When the dependency is genuine

There's a scenario that looks identical to this but isn't: when a Terraform provisioner or data source needs to actually call a role — not just reference its ARN — during resource creation.

Example: a null_resource provisioner that runs aws sts assume-role and then operates in the target account. Here you need the role to exist and be assumable before the provisioner fires. ARN construction doesn't help — you need the resource active at execution time, not just its string value known at plan time. The correct fix is explicit depends_on, not local string construction.
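A sketch of that case, with hypothetical names throughout. The provisioner uses a constructed ARN string, so Terraform sees no implicit dependency on the role, and the ordering has to be stated explicitly:

```hcl
data "aws_caller_identity" "current" {}

variable "target_role_name" { type = string }

locals {
  account_id      = data.aws_caller_identity.current.account_id
  target_role_arn = "arn:aws:iam::${local.account_id}:role/${var.target_role_name}"
}

resource "aws_iam_role" "target" {
  name = var.target_role_name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = "arn:aws:iam::${local.account_id}:root" }
    }]
  })
}

resource "null_resource" "bootstrap" {
  # The command calls STS during apply, so the role must exist first.
  # Without this, the string ARN alone gives Terraform no ordering.
  depends_on = [aws_iam_role.target]

  provisioner "local-exec" {
    command = "aws sts assume-role --role-arn ${local.target_role_arn} --role-session-name bootstrap"
  }
}
```

In practice a freshly created role can take a few seconds to become assumable, so a retry around the CLI call may still be needed.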

The distinction: static JSON referencing an ARN string (solvable with ARN construction) vs. a runtime API call that needs the resource actually live (solvable with depends_on). If your code needs to assume the role during apply, you need ordering. If it just needs to name the role in a policy document, you don't.

The trap in the fix

Once you've internalized "construct ARNs deterministically," the next failure mode is role names that include Terraform-generated suffixes.

resource "aws_iam_role" "role_a" {
  name = "${var.prefix}-role-${random_id.suffix.hex}"  # ARN not deterministic until random_id exists
  ...
}

If the role name includes random_id.suffix.hex, the ARN can't be computed until the random_id resource is created. That brings the dependency back — you're back to needing a resource output to construct the name, and the cycle re-forms if any of those names are referenced in another role's trust policy.

The fix is stable, predictable role names: "${var.prefix}-${var.env}-pipeline" rather than generated suffixes. IAM role names are unique per account, not globally. The habit of appending random suffixes comes from S3 bucket naming, where global uniqueness is required. IAM doesn't have that constraint. There's no reason to make the name unpredictable.

If you have existing roles with generated names and need their ARNs, they're deterministic after the first apply — stored in state and readable via aws_iam_role.role_a.arn. The construction approach is for cases where you control the naming and are defining the role name yourself.

What generalizes

The IAM trust policy deadlock is the most common place engineers hit this pattern, but it's not the only one. Wherever you encounter a Terraform circular dependency involving a predictable string — ARNs, resource names, account IDs, region names — ask whether you actually need the resource output or whether you can compute the value from what you already know.

data.aws_caller_identity.current.account_id gives you the account without creating a dependency on any resource. A stable name gives you the ARN. The dependency graph edge exists only because you referenced the resource — remove the reference by computing the value directly, and the cycle disappears.

The broader principle: Terraform's graph is built from references. References that aren't necessary are constraints that aren't necessary.

Untangling IAM architecture across multiple accounts — trust policies, permission boundaries, SCPs, cross-account assume-role chains — is where subtle errors compound quietly and the blast radius is real. I work on this regularly.

What the first 24 hours of production CloudWatch data told us

Glenn Gray — Mon, 04 May 2026 18:43:32 +0000

Originally published on graycloudarch.com.

The morning after go-live, the first thing I looked at was CPU. One of the two delivery services was sitting at 99.8% average utilization across 9 tasks. P50 latency: 1,010ms.

We'd launched deliberately without autoscaling. The plan was to observe real traffic patterns before configuring a scaling policy: you can't tune a policy for a workload whose demand you haven't seen yet. What we didn't know was that the workload would reveal something about the task itself before we'd had a chance to watch it for a week.

Thirty-six hours after go-live, we'd shipped right-sizing changes, a working autoscaling configuration, and a new observability source for ALB-layer signals. All of it came directly from what the first day of production data said. Here's how we read it.

What 99.8% CPU means at 0.5 vCPU

The service was allocated 512 ECS CPU units per task — half a vCPU. CloudWatch was telling us the tasks were spending essentially all of their scheduled CPU time working.

The first instinct in this situation is to add tasks. Scale out horizontally. But adding more 0.5 vCPU containers when each one is already saturated doesn't change the constraint. In ECS, the scheduler distributes tasks across hosts, but the per-task CPU ceiling is set in the task definition. More tasks at ceiling is not materially different from fewer tasks at ceiling — you're distributing the same undersized unit more widely.

The signal wasn't about count. It was about the unit itself.

At 99.8% utilization, any burst in per-request processing time — a downstream API call that's slow, a cache miss, a spike in concurrent requests — queues. The task has no headroom to absorb it. That's where the 1,010ms p50 comes from: not that individual requests are slow, but that tasks are scheduled tightly enough that requests wait before they even start processing.

Right-sizing the task before configuring the autoscaler

We doubled the CPU allocation: 512 → 1,024 units. The rationale is mechanical once you see it: you can't configure a useful CPU-based autoscaling policy on a task that's already running at ceiling. If 100% CPU is the baseline, the autoscaler has nothing to respond to — it would scale out immediately on creation and never scale in.

Target tracking at 70% CPU requires headroom. A 1 vCPU task running the same workload that previously pinned a 0.5 vCPU task will land around 50% utilization — below the target, room to absorb variance before triggering a scale-out, and enough signal for scale-in to be meaningful rather than noise.

The second service had a different profile: 12 tasks, 1 vCPU each, hitting 92% at peak. Not saturated the same way, but thin on headroom. We went to 2 vCPU there.

Two other services in the platform were running the opposite problem — allocated more memory than they'd ever used. Those went the other direction: overprovisioned memory cut back based on observed peaks. The same 24-hour data window showed both problems at once.

Sequencing matters: right-size the task before you configure the autoscaler. Otherwise you're teaching a scaling policy to respond to a signal that's already maxed out, and the first thing it does is scale out to a floor that's still running on undersized tasks.

Why we chose CPU tracking instead of request count

The obvious autoscaling metric for an HTTP service is ALBRequestCountPerTarget. The ALB knows the request rate per target group; scaling on that metric tracks load linearly and is highly predictable.

We couldn't use it.

The platform uses a cross-account Lambda to register ECS tasks with ALB target groups at boot. Because of how the registration bridge works, the ECS service resource is provisioned with target_group_arn = null — the target group lives in a different account, and the service module doesn't know its ARN. ALBRequestCountPerTarget requires the target group ARN to be known to the Application Auto Scaling policy. Without it, there's no way to wire the metric across accounts without building additional dependency plumbing.
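To make the constraint concrete, this is roughly what a request-count policy needs. The variable names and the target value are placeholders; the point is the resource_label, which is built from the load balancer and target group ARN suffixes that the service module cannot see:

```hcl
# Hypothetical inputs. The two suffix values are the ones that live in
# the other account in the setup described above.
variable "ecs_resource_id" { type = string } # "service/<cluster>/<service>"
variable "alb_arn_suffix" { type = string }  # "app/<name>/<id>"
variable "tg_arn_suffix" { type = string }   # "targetgroup/<name>/<id>"

resource "aws_appautoscaling_policy" "requests" {
  name               = "request-count-target-tracking"
  policy_type        = "TargetTrackingScaling"
  resource_id        = var.ecs_resource_id
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"

  target_tracking_scaling_policy_configuration {
    target_value = 1000 # illustrative requests per target

    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      resource_label         = "${var.alb_arn_suffix}/${var.tg_arn_suffix}"
    }
  }
}
```

The policy also assumes an aws_appautoscaling_target already registered for the same service.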

CPU target tracking at 70% was the correct second choice. For a CPU-bound workload — which 99.8% utilization confirms this is — CPU is a meaningful proxy for load. The metric was there, it was clean, and the task was now sized to make it useful.

One thing worth noting: the cross-account registration bridge was the right architectural decision for the problem it solved. But it created a constraint three layers away in a scaling configuration we hadn't designed yet. Architecture decisions compound downstream. The fix here was straightforward; I've seen the same pattern take longer to untangle when the constraint wasn't recognized.

The observability gap app logs can't fill

Application logs were already flowing to BetterStack from both services. We had route-level latency, HTTP status codes, request counts, error breakdowns — everything that happens inside a container.

What the logs couldn't tell us was what happens above them. The ALB generates its own error signals: HTTPCode_ELB_5XX_Count for errors the load balancer generates before a request reaches a container, RejectedConnectionCount for connections refused at the ALB layer when backend capacity is exhausted, ActiveConnectionCount as a proxy for in-flight load per target group. None of this appears in application logs. If the ALB had been dropping connections during the 99.8% CPU period, we would have had no signal in our observability platform.

CloudWatch had the data. The gap was getting it into the same place as everything else.

A Lambda on a 60-second schedule in the infrastructure account (where the ALB lives) calls GetMetricData and ships structured JSON to BetterStack. One EventBridge rule, no ECS changes, effectively zero cost: one invocation per minute fits well inside Lambda's free tier, and each run makes a single CloudWatch API call. The metrics land alongside the application data and show the ALB layer that the app logs are blind to.

The design decision here was Lambda over an ECS sidecar. A sidecar would have run per-service, per-task, 24 hours a day, and required task definition changes across the platform. A single Lambda running once per minute in the account that owns the ALB costs nothing and touches no ECS configuration.
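The scheduling side is small. A sketch, assuming the exporter function already exists and its ARN and name are passed in; all resource and variable names here are hypothetical:

```hcl
variable "exporter_lambda_arn" { type = string }
variable "exporter_lambda_name" { type = string }

# Fires once per minute
resource "aws_cloudwatch_event_rule" "every_minute" {
  name                = "alb-metrics-export"
  schedule_expression = "rate(1 minute)"
}

resource "aws_cloudwatch_event_target" "exporter" {
  rule = aws_cloudwatch_event_rule.every_minute.name
  arn  = var.exporter_lambda_arn
}

# Allows EventBridge to invoke the function
resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowEventBridgeInvoke"
  action        = "lambda:InvokeFunction"
  function_name = var.exporter_lambda_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.every_minute.arn
}
```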

Autoscaling parameters worth explaining

For the higher-load service: min=9, max=20, CPU target=70%, scale-out cooldown=60s, scale-in cooldown=300s.

Setting min_capacity to 9 — the current running task count — was deliberate. We'd just established that 9 tasks was a functional floor for this workload at current traffic levels. An autoscaler configured with min=2 or min=4 would have attempted to scale in on the first quiet period, bringing the service back to a state we knew was already under-provisioned. Anchoring the floor to the observed stable-state count prevents that while we accumulate enough autoscaling history to set a meaningful long-term floor.

The asymmetric cooldowns — 60 seconds for scale-out, 5 minutes for scale-in — reflect the cost asymmetry of being wrong in each direction. Scaling out too slowly during a load spike means requests queue. Scaling in too aggressively during a brief quiet period means tasks are killed and restarted unnecessarily. The 5-minute scale-in cooldown is conservative; we'll revisit it once we have a week of data showing where the service naturally stabilizes.
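Expressed in Terraform, those parameters come out roughly as follows. The cluster and service variables and the resource names are placeholders; the numbers are the ones listed above:

```hcl
variable "cluster_name" { type = string }
variable "service_name" { type = string }

resource "aws_appautoscaling_target" "service" {
  service_namespace  = "ecs"
  scalable_dimension = "ecs:service:DesiredCount"
  resource_id        = "service/${var.cluster_name}/${var.service_name}"
  min_capacity       = 9  # observed stable-state task count
  max_capacity       = 20
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.service.service_namespace
  scalable_dimension = aws_appautoscaling_target.service.scalable_dimension
  resource_id        = aws_appautoscaling_target.service.resource_id

  target_tracking_scaling_policy_configuration {
    target_value       = 70  # average CPU percent
    scale_out_cooldown = 60  # seconds
    scale_in_cooldown  = 300 # seconds

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}
```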

What 24 hours of data drove

We launched expecting to spend the first week observing. What the data delivered instead was a complete picture of three distinct problems: a task sizing issue that was causing queuing, a scaling policy that needed the right foundation before it could be configured, and an observability gap for a class of signals that app logs fundamentally can't surface.

All three were solved from the same 24-hour data window. The pre-launch load testing hadn't revealed any of them — synthetic traffic and production ad-bidding traffic have different CPU profiles, and you don't know which until the real thing runs.

The thing I'd change if running this again: put a structured post-launch data review into the go-live plan, not the next morning's to-do list. Not a formal incident review — a deliberate hour with CloudWatch after the first day's traffic has run through. The data is there. The question is whether you've planned to look at it.

If you're planning a production go-live and want a structured approach to post-launch data review and stabilization — or you're staring at a service running at ceiling with no autoscaling — get in touch. This is the kind of platform work I do regularly, and the pattern here applies well beyond ad delivery.


The IAM Trust Policy Chicken-and-Egg (That Isn't)

ARNs are deterministic before creation

The cycle Terraform sees

Where this pattern appears in practice

When the dependency is genuine

The trap in the fix

What generalizes

What the first 24 hours of production CloudWatch data told us

`terraform import` for Existing Table Buckets Doesn't Work Cleanly