Zero-Downtime DNS Cutover with ACM and ALB on AWS

Originally published on graycloudarch.com.

We were migrating the services behind a high-volume content distribution platform from one orchestration layer to another — new ALB, new target groups, new ECS cluster — and the question came up: when do we touch the DNS record?

The answer was: last. After everything else is done, validated, and confirmed healthy under real traffic conditions. Not concurrently with the new infrastructure. Not "we'll validate it after we cut over." Last.

Most DNS-related outages aren't caused by the DNS change itself. They're caused by the things that weren't ready when the change was made — a certificate that hadn't finished validating, a health check that passed in staging but broke under production traffic patterns, a TTL that made rollback a 10-minute ordeal instead of a 30-second one. The pattern that avoids all of those is the same: do all the risky work before you touch DNS.

Pre-provision everything before touching DNS

The new ALB, ACM certificate, target groups, and health check configuration all exist before any DNS change. No user traffic touches the new infrastructure during this phase — you're building and validating it in parallel with the old system still serving.

ACM certificate validation is the step most teams rush. Request the certificate immediately and use DNS validation, not email validation. (If you're managing DNS in Cloudflare rather than Route53, the automation pattern is covered here — the principle is the same.)

resource "aws_acm_certificate" "new" {
  domain_name       = "api.example.com"
  validation_method = "DNS"

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_route53_record" "cert_validation" {
  for_each = {
    for dvo in aws_acm_certificate.new.domain_validation_options : dvo.domain_name => {
      name   = dvo.resource_record_name
      record = dvo.resource_record_value
      type   = dvo.resource_record_type
    }
  }

  zone_id = data.aws_route53_zone.primary.zone_id
  name    = each.value.name
  type    = each.value.type
  ttl     = 60
  records = [each.value.record]
}

resource "aws_acm_certificate_validation" "new" {
  certificate_arn         = aws_acm_certificate.new.arn
  validation_record_fqdns = [for r in aws_route53_record.cert_validation : r.fqdn]
}

The validation CNAME goes into Route53 while the old system is still serving traffic. You wait for ISSUED status — typically 5-30 minutes, but occasionally longer. This is not a step you do on the morning of the cutover.

After the certificate is validated, deploy the new ALB with listener rules and target groups. Register targets and confirm health checks are passing — not just passing in principle, but passing with the actual application container running.

resource "aws_alb_target_group" "new" {
  name        = "api-new-tg"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip"  # required for ECS Fargate

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 30
    timeout             = 10
    matcher             = "200"
  }
}

The key check before proceeding: aws elbv2 describe-target-health should show all targets as healthy. Not initial. Not draining. Healthy. ALB-level health checks and application-level smoke tests are different things — confirm both.
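As a rough illustration of that gate, the check can be scripted against the JSON the CLI prints. This is a minimal sketch, not part of the original runbook: the function names are hypothetical, and it assumes the output was requested with `--output json`, where each entry under `TargetHealthDescriptions` carries a `TargetHealth.State` field.

```python
import json


def unhealthy_targets(describe_output: str) -> list[tuple[str, str]]:
    """Return (target_id, state) for every target that is not 'healthy'."""
    payload = json.loads(describe_output)
    return [
        (d["Target"]["Id"], d["TargetHealth"]["State"])
        for d in payload.get("TargetHealthDescriptions", [])
        if d["TargetHealth"]["State"] != "healthy"
    ]


def ready_for_cutover(describe_output: str) -> bool:
    """True only if at least one target is registered and all are healthy."""
    payload = json.loads(describe_output)
    descriptions = payload.get("TargetHealthDescriptions", [])
    # An empty target group is not "healthy": nothing would serve traffic.
    return bool(descriptions) and not unhealthy_targets(describe_output)
```

One way to use it is to capture the output of `aws elbv2 describe-target-health --target-group-arn "$TG_ARN" --output json` (where `TG_ARN` is a placeholder for your target group) and pass the text to `ready_for_cutover`. It covers only the ALB-level check; the application-level smoke test still has to be run separately.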

Lower the TTL — 48 hours in advance

This is the most commonly skipped step and the highest-leverage thing you can do before a DNS cutover.

Route53 lets you change TTL on an existing record at any time, and it takes effect immediately from Route53's perspective. The problem is DNS resolvers don't care about your new TTL — they cache based on the TTL they observed when they last fetched the record. A resolver that saw a 300-second TTL will hold that cached value for up to 300 seconds after you lower it.

In practice this means the lead time has to exceed the old TTL. If the record carried a 1-hour TTL and you lower it 30 minutes before the cutover window, some resolvers are still holding an entry they cached under the old TTL, so your effective rollback window is governed by the old TTL, not the new one. If the cutover has problems and you need to revert, you're waiting on those caches to expire, with production traffic flowing to a misconfigured endpoint.

Lower it 48 hours in advance:

resource "aws_route53_record" "api" {
  zone_id = data.aws_route53_zone.primary.zone_id
  name    = "api.example.com"
  type    = "A"

  alias {
    name                   = aws_alb.old.dns_name
    zone_id                = aws_alb.old.zone_id
    evaluate_target_health = true
  }

  # Alias records take no ttl argument: Route 53 serves them with the
  # alias target's TTL (60 seconds for an ELB load balancer), so there
  # is nothing to lower on this record itself.
}

For non-alias A records with an explicit TTL: 300 → 30, committed and applied 48 hours before the cutover window. This is a one-line Terraform change. Apply it, verify it, and move on.

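By the time 48 hours have passed, every resolver that honors TTLs has long since expired the entry it cached at 300 seconds and re-fetched with the new 30-second TTL; the extra margin covers records with much longer TTLs and resolvers that hold entries past expiry. Your rollback window is now 30-60 seconds.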
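The caching arithmetic behind this can be sketched in a few lines. This is a simplified model under stated assumptions, not a measurement: it assumes every resolver honors TTLs exactly, and the function name and parameters are illustrative. A resolver that cached the record just before the TTL was lowered keeps it for the full old TTL; one that cached it just before the cutover keeps it for the new TTL.

```python
def worst_case_stale_seconds(old_ttl: int, new_ttl: int, lead_seconds: int) -> int:
    """Longest time a TTL-honoring resolver can serve the pre-cutover answer.

    old_ttl:      TTL on the record before it was lowered (seconds)
    new_ttl:      TTL after lowering (seconds)
    lead_seconds: time between lowering the TTL and changing the record
    """
    # Time left on an entry cached under the old TTL right before the lowering.
    remaining_old = max(0, old_ttl - lead_seconds)
    # An entry cached under the new TTL right before the cutover lasts new_ttl.
    return max(new_ttl, remaining_old)
```

With a 300-second TTL lowered to 30 seconds 48 hours ahead, the function returns 30. With a 3600-second TTL lowered only 30 minutes ahead, it returns 1800, which is the half-hour rollback wait that an early TTL change avoids.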

The actual cutover — boring is the goal

With 48-hour TTL prep and pre-validated infrastructure, the cutover itself should take under two minutes and produce no errors.

Update the Route53 alias record to point to the new ALB:

resource "aws_route53_record" "api" {
  zone_id = data.aws_route53_zone.primary.zone_id
  name    = "api.example.com"
  type    = "A"

  alias {
    name                   = aws_alb.new.dns_name       # ← changed
    zone_id                = aws_alb.new.zone_id        # ← changed
    evaluate_target_health = true
  }
}

Apply. Watch three metrics in parallel for the next 5 minutes:

New ALB: TargetResponseTime p99 and HTTPCode_Target_5XX_Count
New ALB: HealthyHostCount — should stay constant
Application error rates from your monitoring platform

With 30-second TTL and pre-validated infrastructure, you should see full traffic shift to the new ALB within 2-3 minutes. If you're using weighted routing for a gradual shift, start at 10% new and watch those same metrics before moving to 100%.

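# Optional: weighted routing for gradual shift.
# These two records replace the simple record above: Route 53 does not allow
# a simple record and weighted records with the same name and type.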
resource "aws_route53_record" "api_old" {
  zone_id        = data.aws_route53_zone.primary.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "old"

  weighted_routing_policy {
    weight = 0  # reduce from 100 as you validate
  }

  alias {
    name                   = aws_alb.old.dns_name
    zone_id                = aws_alb.old.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "api_new" {
  zone_id        = data.aws_route53_zone.primary.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "new"

  weighted_routing_policy {
    weight = 100  # increase from 0 as you validate
  }

  alias {
    name                   = aws_alb.new.dns_name
    zone_id                = aws_alb.new.zone_id
    evaluate_target_health = true
  }
}
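To reason about what a given pair of weights does, it helps to recall that Route 53 sends each record a share of queries equal to its weight divided by the sum of weights in the group. The helper below is a minimal sketch of that arithmetic; the function name is hypothetical, and it ignores the effect of `evaluate_target_health`, which can take an unhealthy record out of rotation.

```python
def traffic_share(weights: dict[str, int]) -> dict[str, float]:
    """Expected fraction of DNS answers per weighted record.

    Keys are set identifiers (for example "old" and "new"); values are weights.
    """
    total = sum(weights.values())
    if total == 0:
        # When every weight in the group is 0, Route 53 answers with all
        # records with equal probability.
        return {name: 1 / len(weights) for name in weights}
    return {name: weight / total for name, weight in weights.items()}
```

For the first step of a gradual shift, `traffic_share({"old": 90, "new": 10})` gives the new ALB a 0.1 share. These are shares of DNS answers, not of requests, so the traffic actually observed on each ALB lags behind and varies with resolver caching.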

Hold before decommission — at least 72 hours

The cutover is done. Metrics are green. The temptation is to clean up immediately.

Don't.

Keep the old ALB running for at least 72 hours. Two reasons:

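First, any clients with hardcoded IPs (rather than DNS names) will break when the ALB is decommissioned. ALB IPs are not static. Without a hold period, you won't know these clients exist until they start generating errors after teardown. During the hold, they show up as requests still arriving at the old ALB (its RequestCount metric stays above zero) well after every TTL has expired, so the 72-hour window surfaces them while you still have an easy fix.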

Second, some clients have unusually long DNS TTL caches or are behind corporate proxies that cache aggressively. They'll still be resolving to the old ALB IP for a while after the cutover. Those requests need somewhere to land.

After 72 hours, verify: no Route53 records point to the old ALB, no CloudWatch alarms are still scoped to the old ALB's metrics, no ECS services are still registered to the old target groups. Then destroy.

Decommission through Terraform, and keep the old ALB's configuration recoverable from version control for 30 days as a documented rollback artifact, even after physical decommission. A terraform destroy with a clear commit message is a better audit trail than a resource that quietly disappeared from state.

Rollback: a scheduled option, not an emergency

The old ALB staying live through the hold period isn't just caution — it means rollback is a planned capability, not an emergency procedure.

Define the rollback trigger before the cutover window, not during it:

"If error rate exceeds 1% on the new ALB for more than 3 consecutive minutes after full traffic shift, revert."
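A trigger phrased this way is precise enough to encode, which is a good test of whether it is actually unambiguous. The sketch below is one possible reading, not the author's tooling: it assumes one error-rate sample per minute expressed as a fraction (0.01 for 1%), and it treats three consecutive samples above the threshold as the trip condition. The function name and defaults are illustrative.

```python
from typing import Iterable


def should_rollback(
    error_rates: Iterable[float],
    threshold: float = 0.01,
    consecutive_minutes: int = 3,
) -> bool:
    """True once `consecutive_minutes` samples in a row exceed `threshold`.

    error_rates: per-minute error rates as fractions, oldest first.
    """
    streak = 0
    for rate in error_rates:
        # A sample at or below the threshold resets the streak.
        streak = streak + 1 if rate > threshold else 0
        if streak >= consecutive_minutes:
            return True
    return False
```

A brief spike such as `[0.0, 0.02, 0.03, 0.005, 0.02]` does not trip it, while `[0.002, 0.02, 0.03, 0.015]` does. Whether "more than 3 consecutive minutes" means three samples or four is exactly the kind of detail worth settling before the window opens.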

Rollback is the same Terraform change in reverse — point the alias back at the old ALB. With 30-second TTL: traffic returns within 60 seconds. The old ALB never stopped running, so there's no warm-up time, no health check delay.

This framing matters for how the team experiences the cutover. If rollback requires scrambling, it creates pressure to push through problems rather than revert cleanly. If rollback is a pre-committed, 60-second operation, the team can move fast and be willing to revert at the first signal.

The cutovers that cause incidents are the ones where the rollback plan is "we'll figure it out if we need to."
