<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Glenn Gray</title>
    <description>The latest articles on DEV Community by Glenn Gray (@tallgray1).</description>
    <link>https://dev.to/tallgray1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3817657%2F22cc7f4e-c345-484f-89b0-07068c02c9c7.png</url>
      <title>DEV Community: Glenn Gray</title>
      <link>https://dev.to/tallgray1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tallgray1"/>
    <language>en</language>
    <item>
      <title>Composable Terraform Modules: Default Every Resource to False</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Tue, 23 Jun 2026 00:21:55 +0000</pubDate>
      <link>https://dev.to/tallgray1/composable-terraform-modules-default-every-resource-to-false-1aco</link>
      <guid>https://dev.to/tallgray1/composable-terraform-modules-default-every-resource-to-false-1aco</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/composable-terraform-modules/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The workload account had passed every review. Provisioned with the same VPC module we'd used for six months. All defaults. No customizations needed.&lt;/p&gt;

&lt;p&gt;Three months later, an audit flagged it: traffic from that account was bypassing the centralized inspection VPC. The Network Firewall wasn't seeing it. Direct path out through an internet gateway the module had created by default.&lt;/p&gt;

&lt;p&gt;No error. No alert. The module did exactly what it was designed to do. We just hadn't designed it for this context.&lt;/p&gt;

&lt;p&gt;That account had an IGW it never needed, because nobody explicitly told it not to create one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The natural instinct, and where it breaks
&lt;/h2&gt;

&lt;p&gt;The pull toward "batteries included" modules makes sense early. Network module creates VPC, subnets, IGW, NAT gateways, route tables — all of it. For a single-account setup, that's convenient.&lt;/p&gt;

&lt;p&gt;The problem appears by account three, where some VPCs should have IGWs and some shouldn't. By account six — workload VPCs routing through a hub, an inspection VPC that owns the IGW and NAT, a sandbox account with direct access — you're forking modules, adding &lt;code&gt;count = 0&lt;/code&gt; overrides at the call site, or writing if/else logic at every deployment root. Each workaround is a signal that the module wasn't designed for multiple contexts.&lt;/p&gt;

&lt;p&gt;The fix is a design rule: &lt;strong&gt;if a resource is not universally needed, its creation variable defaults to &lt;code&gt;false&lt;/code&gt;. The caller opts in explicitly.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"create_internet_gateway"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Create an IGW and default route in the public route table. Defaults to false because
workload VPCs use hub-and-spoke routing through the centralized inspection VPC for all egress."&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;bool&lt;/span&gt;
  &lt;span class="nx"&gt;default&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"create_nat_gateway"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Create NAT Gateways for private subnet egress. Defaults to false for hub-and-spoke
VPCs where egress routes through the TGW to the centralized egress/inspection VPC."&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;bool&lt;/span&gt;
  &lt;span class="nx"&gt;default&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"create_public_subnets"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Create public subnets, route table, and route table associations. Defaults to false
for hub-and-spoke design. Set to true only for hub VPCs that own an IGW."&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;bool&lt;/span&gt;
  &lt;span class="nx"&gt;default&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The description isn't documentation for its own sake. It explains &lt;em&gt;why&lt;/em&gt; the default is &lt;code&gt;false&lt;/code&gt; — the specific architectural constraint that makes &lt;code&gt;true&lt;/code&gt; wrong for most callers. When someone reads it at the call site, they know whether their context matches the assumption.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the call sites look like
&lt;/h2&gt;

&lt;p&gt;The hub VPC — the inspection VPC that owns the Network Firewall — explicitly opts in. Workload VPCs call the module with no overrides:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"inspection_vpc"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../../../..//common/modules/network"&lt;/span&gt;

  &lt;span class="nx"&gt;name&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"inspection"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_cidr&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.0.0/22"&lt;/span&gt;
  &lt;span class="nx"&gt;availability_zones&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"us-east-1a"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1c"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# These are true because this is the hub — explicit opt-in&lt;/span&gt;
  &lt;span class="nx"&gt;create_internet_gateway&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;create_public_subnets&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;create_nat_gateway&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"workload_vpc"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../../../..//common/modules/network"&lt;/span&gt;

  &lt;span class="nx"&gt;name&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"workloads-prod"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_cidr&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.1.0.0/22"&lt;/span&gt;
  &lt;span class="nx"&gt;availability_zones&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"us-east-1a"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1c"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# No opt-ins needed — all defaults correct for hub-and-spoke&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F5vu857r3h3bi6eyty7i7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F5vu857r3h3bi6eyty7i7.png" alt="Workload VPC (left) has only private subnets; all egress routes through the TGW to the hub. Hub/inspection VPC (right) owns the IGW, Network Firewall, and NAT." width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;workload_vpc&lt;/code&gt; call is safe to copy-paste for any new workload account. The security-relevant decisions are in the module, not scattered across caller configurations.&lt;/p&gt;
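
&lt;p&gt;The third context mentioned earlier, a sandbox account with direct access, would presumably opt in at the call site the same way the hub does. A minimal sketch, assuming the same module; the module label, &lt;code&gt;name&lt;/code&gt;, and CIDR here are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;module "sandbox_vpc" {
  source = "../../../..//common/modules/network"

  name               = "sandbox"     # hypothetical
  vpc_cidr           = "10.2.0.0/22" # hypothetical
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

  # Explicit opt-in: this VPC is meant to have its own path out
  create_internet_gateway = true
  create_public_subnets   = true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Anyone reviewing this block can see the deviation from the default in the diff, which is the point of making the opt-in explicit.&lt;/p&gt;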

&lt;h2&gt;
  
  
  The resource count gate
&lt;/h2&gt;

&lt;p&gt;Conditional creation only works if every resource that &lt;em&gt;depends&lt;/em&gt; on the gated resource is also gated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_internet_gateway"&lt;/span&gt; &lt;span class="s2"&gt;"this"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;create_internet_gateway&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;common_tags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.name}-igw"&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Routes that depend on the IGW must also be gated&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_route"&lt;/span&gt; &lt;span class="s2"&gt;"public_internet"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt;                  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;create_internet_gateway&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="nx"&gt;route_table_id&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_route_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;destination_cidr_block&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;
  &lt;span class="nx"&gt;gateway_id&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_internet_gateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;this&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
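
&lt;p&gt;One edge worth checking in the gate above: the route indexes &lt;code&gt;aws_route_table.public[0]&lt;/code&gt;, and the &lt;code&gt;create_public_subnets&lt;/code&gt; description suggests that route table only exists when public subnets are enabled. If that is how the module is written (its full source isn't shown here), a caller who sets &lt;code&gt;create_internet_gateway = true&lt;/code&gt; on its own would hit an invalid-index error. A sketch of a gate that covers both conditions, under that assumption:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Assumption: aws_route_table.public is created only when
# var.create_public_subnets is true.
resource "aws_route" "public_internet" {
  # Gate on both flags so both [0] references below always resolve
  count = var.create_internet_gateway &amp;amp;&amp;amp; var.create_public_subnets ? 1 : 0

  route_table_id         = aws_route_table.public[0].id
  destination_cidr_block = "0.0.0.0/0"
  gateway_id             = aws_internet_gateway.this[0].id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;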



&lt;p&gt;Outputs have the same requirement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="s2"&gt;"internet_gateway_id"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;create_internet_gateway&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;aws_internet_gateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;this&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
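
&lt;p&gt;An alternative way to write the same output, assuming Terraform 0.15 or later, is the built-in &lt;code&gt;one()&lt;/code&gt; function. It returns the single element of a zero-or-one element list and &lt;code&gt;null&lt;/code&gt; when the list is empty, so the output no longer repeats the gating variable:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;output "internet_gateway_id" {
  description = "ID of the IGW, or null when create_internet_gateway is false."

  # The splat yields [] when count = 0, and one([]) evaluates to null
  value = one(aws_internet_gateway.this[*].id)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Either form works; this one stays correct even if the resource's &lt;code&gt;count&lt;/code&gt; expression changes later.&lt;/p&gt;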



&lt;p&gt;A plan for a workload VPC shows zero IGW-related changes. Not suppressed; genuinely not there. The module doesn't create the IGW or reference it, and the &lt;code&gt;internet_gateway_id&lt;/code&gt; output evaluates to &lt;code&gt;null&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The same pattern, applied everywhere
&lt;/h2&gt;

&lt;p&gt;Networking is the clearest example because the security stakes are visible, but the principle applies to every module type:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ALB module:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;enable_deletion_protection = false&lt;/code&gt; — dev environments don't need it; prod opts in&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;enable_access_logs = false&lt;/code&gt; — caller enables when the S3 bucket for logs is ready&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;enable_https_redirect = false&lt;/code&gt; — explicit, not assumed; avoids broken behavior on internal ALBs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Security baseline module:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;enable_guardduty = false&lt;/code&gt;, &lt;code&gt;enable_security_hub = false&lt;/code&gt;, &lt;code&gt;enable_config = false&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;One module, two contexts: the bootstrap account enables everything; sandbox accounts enable nothing&lt;/li&gt;
&lt;li&gt;Without this: you're writing conditional logic at the call site for every new account type&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Observability baseline:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;enable_cloudwatch_alarms = false&lt;/code&gt;, &lt;code&gt;enable_container_insights = false&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Nonprod may or may not want alarms — the caller decides, not the module author&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern: if a resource is conditional on the deployment context, the module expresses that conditionality as a boolean defaulting to &lt;code&gt;false&lt;/code&gt;.&lt;/p&gt;
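
&lt;p&gt;As a rough sketch of what that could look like inside the security baseline module: the variable names come from the list above, while the resource bodies are an assumption about a minimal implementation, not the actual module:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;variable "enable_guardduty" {
  description = "Enable GuardDuty in this account. Defaults to false; the bootstrap account opts in."
  type        = bool
  default     = false
}

variable "enable_security_hub" {
  description = "Enable Security Hub in this account. Defaults to false; the bootstrap account opts in."
  type        = bool
  default     = false
}

# Each service is created only when its flag is true
resource "aws_guardduty_detector" "this" {
  count  = var.enable_guardduty ? 1 : 0
  enable = true
}

resource "aws_securityhub_account" "this" {
  count = var.enable_security_hub ? 1 : 0
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A sandbox account calls this module with no arguments and gets nothing; the bootstrap account sets the flags it needs to &lt;code&gt;true&lt;/code&gt;.&lt;/p&gt;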

&lt;h2&gt;
  
  
  When to break it
&lt;/h2&gt;

&lt;p&gt;Not every variable is a gate on resource creation. The rule doesn't apply to:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration variables with opinionated defaults.&lt;/strong&gt; &lt;code&gt;instance_type = "t3.medium"&lt;/code&gt; should default to a sensible value. The question isn't "should we create this?" — the resource always exists, you're just setting its properties.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Required inputs with no safe default.&lt;/strong&gt; &lt;code&gt;vpc_cidr&lt;/code&gt; shouldn't have a default at all. Force the caller to declare it explicitly. A missing required input surfaces immediately; a wrong default doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources that must exist for the module to function.&lt;/strong&gt; The VPC itself isn't gated — if the module is called, a VPC is created. If a resource is that foundational, don't hide it behind a boolean.&lt;/p&gt;

&lt;p&gt;The line: &lt;code&gt;create_*&lt;/code&gt; and &lt;code&gt;enable_*&lt;/code&gt; variables gate resource existence. Configuration variables set properties of resources that always exist. Required inputs have no default.&lt;/p&gt;
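
&lt;p&gt;Side by side, the three kinds of variable might be declared like this. The names come from the examples above; the descriptions and the validation rule are illustrative assumptions, and the three declarations are not meant to belong to one module:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Gate: controls whether a resource exists, so it defaults to false
variable "create_nat_gateway" {
  description = "Create NAT Gateways for private subnet egress."
  type        = bool
  default     = false
}

# Configuration: the resource always exists, so an opinionated default is fine
variable "instance_type" {
  description = "Instance type for the service."
  type        = string
  default     = "t3.medium"
}

# Required input: no default, so a missing value fails at plan time
variable "vpc_cidr" {
  description = "CIDR block for the VPC."
  type        = string

  validation {
    # cidrhost() errors on a malformed CIDR; can() turns that into false
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "The vpc_cidr value must be a valid CIDR block, for example 10.1.0.0/22."
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;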

&lt;h2&gt;
  
  
  What the audit actually fixed
&lt;/h2&gt;

&lt;p&gt;The inspection gap in that workload account had existed for months. The fix was changing the module default to &lt;code&gt;false&lt;/code&gt; and re-applying across all accounts.&lt;/p&gt;

&lt;p&gt;Because every other resource in the module was already following this pattern, the re-apply was clean. Zero unexpected changes on correctly-configured accounts — which is the second-order effect of this design rule: &lt;strong&gt;the module becomes safe to re-apply&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When everything that shouldn't exist defaults to not existing, &lt;code&gt;terraform plan&lt;/code&gt; on a correctly-configured account comes back empty. That emptiness is a signal you can rely on. It means the module isn't hiding state you didn't ask for.&lt;/p&gt;

&lt;p&gt;That's harder to achieve if you're starting from "batteries included" defaults and trying to carve out exceptions. It's straightforward if you start from &lt;code&gt;false&lt;/code&gt; and require callers to opt in.&lt;/p&gt;

&lt;p&gt;Standardizing Terraform module design across multiple accounts and environments — or inheriting a module library where the defaults aren't working in your favor? This is one of the first patterns I help teams establish. &lt;a href="https://graycloudarch.com/contact/" rel="noopener noreferrer"&gt;Get in touch.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>aws</category>
      <category>iac</category>
      <category>modules</category>
    </item>
    <item>
      <title>ECS Fargate as a Migration Bridge: Running Two Orchestrators at Once</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Tue, 16 Jun 2026 18:26:36 +0000</pubDate>
      <link>https://dev.to/tallgray1/ecs-fargate-as-a-migration-bridge-running-two-orchestrators-at-once-fj4</link>
      <guid>https://dev.to/tallgray1/ecs-fargate-as-a-migration-bridge-running-two-orchestrators-at-once-fj4</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/ecs-migration-bridge/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Three months into the EKS buildout, someone asked a reasonable question: do we actually need all of this right now?&lt;/p&gt;

&lt;p&gt;The cluster was running. The services were containerized. But the team was also operating cert-manager, an ingress controller, external-secrets-operator, and Karpenter — each with its own version compatibility matrix, each capable of generating its own 2am incident. None of it was directly related to shipping the product.&lt;/p&gt;

&lt;p&gt;We made the decision to migrate to ECS Fargate first, with EKS as a future destination if and when the operational capacity caught up. Not a retreat — a deliberate two-step. The container images were already built. The IAM patterns were transferable. The application code hadn't changed. Only the orchestration layer was moving.&lt;/p&gt;

&lt;p&gt;This is what that migration looked like, and why running both orchestrators simultaneously during the transition was the right pattern.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not skip straight to EKS
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://graycloudarch.com/blog/ecs-vs-eks-decision-framework/" rel="noopener noreferrer"&gt;decision framework for ECS vs. EKS is covered in a prior post&lt;/a&gt; — if you've already worked through that, skip ahead. The short version relevant here: EKS adds roughly fifteen operational concepts on top of running a service, each capable of failing independently. The bridge pattern is for teams where the orchestration question and the containerization question are both open at the same time. Trying to answer them together multiplies the blast radius.&lt;/p&gt;

&lt;p&gt;The ECS → EKS migration later is largely mechanical. Task definitions become Helm charts, task roles become IRSA service account annotations, ALB target group registration becomes ingress controller configuration. The container image — the actual artifact — doesn't change. Build ECS as if you'll migrate it, and you will.&lt;/p&gt;
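
&lt;p&gt;To make one of those mappings concrete, here is a hedged sketch of the EKS side of "task role becomes IRSA annotation", written with the Terraform Kubernetes provider. The service account name, namespace, and the &lt;code&gt;api_irsa_role_arn&lt;/code&gt; variable are hypothetical; only the &lt;code&gt;eks.amazonaws.com/role-arn&lt;/code&gt; annotation key is fixed by EKS:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Hypothetical input: ARN of an IAM role whose trust policy allows the
# cluster's OIDC provider to assume it on behalf of this service account
variable "api_irsa_role_arn" {
  type = string
}

resource "kubernetes_service_account" "api" {
  metadata {
    name      = "api"     # hypothetical
    namespace = "default" # hypothetical

    annotations = {
      # Pods using this service account receive credentials for the role
      "eks.amazonaws.com/role-arn" = var.api_irsa_role_arn
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The permissions policy attached to the role can stay the same as the ECS task role's; what changes is the trust relationship and where the role is referenced.&lt;/p&gt;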

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzg6boec71dgkzq684vwe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzg6boec71dgkzq684vwe.png" alt="ECS as a migration bridge: old platform → ECS Fargate (soak period with old platform live) → EKS (optional future target)" width="798" height="113"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What the ECS foundation looks like in Terraform
&lt;/h2&gt;

&lt;p&gt;Three modules compose to support any service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Shared per cluster&lt;/span&gt;
&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"ecs_cluster"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"./modules/ecs-cluster"&lt;/span&gt;

  &lt;span class="nx"&gt;name&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"platform-prod"&lt;/span&gt;
  &lt;span class="nx"&gt;log_retention_days&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
  &lt;span class="nx"&gt;capacity_providers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"FARGATE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"FARGATE_SPOT"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Per service — IAM task role with least-privilege access&lt;/span&gt;
&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"api_task_role"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"./modules/ecs-task-role"&lt;/span&gt;

  &lt;span class="nx"&gt;service_name&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"api"&lt;/span&gt;
  &lt;span class="nx"&gt;environment&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"prod"&lt;/span&gt;
  &lt;span class="nx"&gt;secrets_arns&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_secretsmanager_secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;api_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;ecr_account_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;shared_services_account_id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Per service — ECS service + ALB registration&lt;/span&gt;
&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"api_service"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"./modules/ecs-service"&lt;/span&gt;

  &lt;span class="nx"&gt;cluster_arn&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ecs_cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;task_role_arn&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;api_task_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;image&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.ecr_registry}/api:${var.image_tag}"&lt;/span&gt;
  &lt;span class="nx"&gt;cpu&lt;/span&gt;              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;
  &lt;span class="nx"&gt;memory&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;
  &lt;span class="nx"&gt;desired_count&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
  &lt;span class="nx"&gt;target_group_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_alb_target_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;

  &lt;span class="nx"&gt;environment_variables&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;APP_ENV&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;secrets&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;DB_PASSWORD&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_secretsmanager_secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;api_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The design constraint that matters most: keep the three modules independent. Don't build a composite "ecs-app" module that wraps all three. Independent modules mean each service can tune its task role and scaling behavior without touching the cluster, and the cluster can be upgraded without touching service configurations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq0uyf7k88csdr9flb82f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq0uyf7k88csdr9flb82f.png" alt="ECS Terraform module composition: shared ecs-cluster module feeds into per-service ecs-task-role and ecs-service modules" width="800" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Cross-account ECR: the gotcha that hits every team
&lt;/h2&gt;

&lt;p&gt;ECR lives in a shared-services account. ECS runs in the workloads account. This is standard multi-account architecture — and it means the ECS task execution role needs cross-account pull permissions that are easy to get wrong.&lt;/p&gt;

&lt;p&gt;Two pieces are required:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In the workloads account: task execution role policy&lt;/span&gt;
&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_policy_document"&lt;/span&gt; &lt;span class="s2"&gt;"ecr_cross_account"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;actions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="s2"&gt;"ecr:GetAuthorizationToken"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;resources&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# GetAuthorizationToken is global; can't be scoped to a registry&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;actions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="s2"&gt;"ecr:BatchCheckLayerAvailability"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"ecr:GetDownloadUrlForLayer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"ecr:BatchGetImage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;resources&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="s2"&gt;"arn:aws:ecr:us-east-1:${var.shared_services_account_id}:repository/*"&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;p&gt;In the shared-services account, the ECR repository policy:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"AWS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::WORKLOADS_ACCOUNT_ID:root"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"ecr:BatchCheckLayerAvailability"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"ecr:GetDownloadUrlForLayer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"ecr:BatchGetImage"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The common failure mode: the task execution role has the right IAM policy, but the ECR repository policy in the shared-services account doesn't grant the workloads account access. The task then fails to pull its image with what looks like a credentials problem ("no basic auth credentials"), and nothing in that message points to the repository policy as the cause.&lt;/p&gt;
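
&lt;p&gt;If the shared-services account is also managed in Terraform, the repository policy can live next to the repository it protects. A minimal sketch, assuming a repository called &lt;code&gt;api&lt;/code&gt; and a &lt;code&gt;workloads_account_id&lt;/code&gt; variable (both names are illustrative, not from the original setup):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Sketch only. var.workloads_account_id and aws_ecr_repository.api are
# hypothetical names, not taken from the original configuration.
variable "workloads_account_id" {
  type        = string
  description = "Account ID of the workloads account that pulls images"
}

resource "aws_ecr_repository" "api" {
  name = "api"
}

# Same three pull actions as the JSON policy, granted to the workloads account.
data "aws_iam_policy_document" "cross_account_pull" {
  statement {
    sid    = "AllowWorkloadsAccountPull"
    effect = "Allow"

    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::${var.workloads_account_id}:root"]
    }

    actions = [
      "ecr:BatchCheckLayerAvailability",
      "ecr:GetDownloadUrlForLayer",
      "ecr:BatchGetImage",
    ]
  }
}

resource "aws_ecr_repository_policy" "api" {
  repository = aws_ecr_repository.api.name
  policy     = data.aws_iam_policy_document.cross_account_pull.json
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping the policy in the same module as the repository makes it harder to create a repository that no other account can pull from.&lt;/p&gt;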

&lt;h2&gt;
  
  
  Logging: what changes from EKS
&lt;/h2&gt;

&lt;p&gt;On EKS, Fluent Bit runs as a DaemonSet — one per node, automatically collecting logs from every container. On ECS Fargate, there is no shared host and no DaemonSet. You configure logging per task definition.&lt;/p&gt;

&lt;p&gt;The simplest approach, and the right default for most services, is the &lt;code&gt;awslogs&lt;/code&gt; driver:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"logConfiguration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"logDriver"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"awslogs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"awslogs-group"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/ecs/api"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"awslogs-region"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"awslogs-stream-prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ecs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"awslogs-create-group"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



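&lt;p&gt;This sends all stdout/stderr from the container directly to CloudWatch Logs. No sidecar and no configuration beyond the task definition. The task execution role needs &lt;code&gt;logs:CreateLogStream&lt;/code&gt; and &lt;code&gt;logs:PutLogEvents&lt;/code&gt;, which the AWS managed &lt;code&gt;AmazonECSTaskExecutionRolePolicy&lt;/code&gt; already grants. The &lt;code&gt;awslogs-create-group: true&lt;/code&gt; option creates the log group automatically if it doesn't exist, which is useful during initial deployment, but it also requires &lt;code&gt;logs:CreateLogGroup&lt;/code&gt; on the execution role, and the managed policy does not include that permission.&lt;/p&gt;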

&lt;p&gt;For services that need to ship logs to multiple destinations or apply structured filtering, FireLens is the right choice: a Fluent Bit or Fluentd container runs as a sidecar in the same task and routes logs where they need to go. The operational overhead is higher, but the routing flexibility is real.&lt;/p&gt;

&lt;p&gt;Verify logging works before cutting traffic: run &lt;code&gt;aws logs tail /ecs/api --follow&lt;/code&gt; while a test request hits the new ECS service. If nothing appears, the task execution role (not the task role) is missing CloudWatch Logs write permissions or the log group name doesn't match.&lt;/p&gt;
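
&lt;p&gt;Both failure causes can be ruled out ahead of time by managing the log group in Terraform and, if &lt;code&gt;awslogs-create-group&lt;/code&gt; stays enabled, granting the one permission it needs. A rough sketch, assuming the execution role is defined elsewhere as &lt;code&gt;aws_iam_role.task_execution&lt;/code&gt; (a hypothetical name) and that 30 days of retention is acceptable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Sketch only. aws_iam_role.task_execution is a hypothetical name for the
# task execution role; the 30-day retention is an arbitrary example value.
resource "aws_cloudwatch_log_group" "api" {
  name              = "/ecs/api" # must match awslogs-group in the task definition
  retention_in_days = 30
}

# Only needed if the task definition keeps awslogs-create-group = "true".
# The stream and put-event permissions normally come from the AWS managed
# AmazonECSTaskExecutionRolePolicy attached to the same role.
data "aws_iam_policy_document" "create_log_group" {
  statement {
    effect    = "Allow"
    actions   = ["logs:CreateLogGroup"]
    resources = ["*"] # scope this down to your log group ARNs in practice
  }
}

resource "aws_iam_role_policy" "create_log_group" {
  name   = "ecs-create-log-group"
  role   = aws_iam_role.task_execution.id
  policy = data.aws_iam_policy_document.create_log_group.json
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With the log group created up front, the extra IAM statement can be dropped and &lt;code&gt;awslogs-create-group&lt;/code&gt; removed from the task definition.&lt;/p&gt;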

&lt;h2&gt;
  
  
  Running both orchestrators during the soak period
&lt;/h2&gt;

&lt;p&gt;We migrated all production services to ECS Fargate, but we kept EKS running throughout a soak period. Not as a fallback — as a confirmed, immediate revert target.&lt;/p&gt;

&lt;p&gt;The migration sequence for each service:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deploy service on ECS Fargate, validate health checks and task stability&lt;/li&gt;
&lt;li&gt;Cut DNS to the new ALB (see the &lt;a href="https://graycloudarch.com/blog/dns-acm-alb-cutover/" rel="noopener noreferrer"&gt;companion post on zero-downtime DNS cutover&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Monitor for 72 hours: error rates, latency p99, ALB healthy host count&lt;/li&gt;
&lt;li&gt;If metrics are nominal after 72 hours, deprovision from EKS&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;During the soak period, EKS was live and capable of receiving traffic within 60 seconds if the DNS record was reverted. This isn't a hypothetical backup — it was a committed operational state, with the rollback sequence documented and tested before we cut DNS.&lt;/p&gt;

&lt;p&gt;The benefit of this pattern is that it changes the calculus on the cutover decision. If rollback requires re-provisioning on EKS from scratch, the team has every incentive to push through problems rather than revert. If rollback is "update one Route53 record and wait 60 seconds," the team can move fast and revert at the first real signal.&lt;/p&gt;
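
&lt;p&gt;One way to make that single-record revert explicit is to put the active target behind a variable. This is a sketch under assumptions, not the configuration we ran: &lt;code&gt;active_target&lt;/code&gt;, &lt;code&gt;ecs_alb_name&lt;/code&gt;, and &lt;code&gt;eks_alb_name&lt;/code&gt; are hypothetical variables, and both load balancers are looked up by name because the EKS-side ALB is often created outside Terraform:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Sketch only. var.active_target, var.ecs_alb_name, and var.eks_alb_name are
# hypothetical; the post does not show how the two load balancers are named.
variable "active_target" {
  type        = string
  default     = "ecs"
  description = "Which load balancer receives traffic: ecs or eks"

  validation {
    condition     = contains(["ecs", "eks"], var.active_target)
    error_message = "The active_target value must be ecs or eks."
  }
}

variable "ecs_alb_name" {
  type = string
}

variable "eks_alb_name" {
  type = string
}

# Look up both load balancers by name so neither has to be managed here.
data "aws_lb" "targets" {
  for_each = {
    ecs = var.ecs_alb_name
    eks = var.eks_alb_name
  }

  name = each.value
}

data "aws_route53_zone" "primary" {
  name = "example.com."
}

resource "aws_route53_record" "api" {
  zone_id = data.aws_route53_zone.primary.zone_id
  name    = "api.example.com"
  type    = "A"

  alias {
    name                   = data.aws_lb.targets[var.active_target].dns_name
    zone_id                = data.aws_lb.targets[var.active_target].zone_id
    evaluate_target_health = true
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Reverting is then &lt;code&gt;terraform apply -var 'active_target=eks'&lt;/code&gt;, and the plan shows exactly one changed record.&lt;/p&gt;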

&lt;p&gt;We didn't need to revert. But having the option meant we could make the migration decision cleanly.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ECS Anywhere variation: running both indefinitely
&lt;/h2&gt;

&lt;p&gt;For one service — a high-volume content delivery workload — the migration pattern extended beyond a time-limited soak period. That service runs on both ECS Fargate and ECS Anywhere simultaneously, with the ability to shift traffic between them at any time.&lt;/p&gt;

&lt;p&gt;ECS Anywhere extends ECS to on-premises or edge nodes, registered as &lt;code&gt;EXTERNAL&lt;/code&gt; capacity providers. The same ECS service, task definitions, and IAM patterns apply — what changes is the capacity provider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ecs_service"&lt;/span&gt; &lt;span class="s2"&gt;"delivery"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"delivery-central"&lt;/span&gt;
  &lt;span class="nx"&gt;cluster&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_ecs_cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;platform&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;task_definition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_ecs_task_definition&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;delivery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;desired_count&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;desired_count&lt;/span&gt;

  &lt;span class="nx"&gt;capacity_provider_strategy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;capacity_provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"FARGATE"&lt;/span&gt;
    &lt;span class="nx"&gt;weight&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fargate_weight&lt;/span&gt;  &lt;span class="c1"&gt;# adjust to shift traffic&lt;/span&gt;
    &lt;span class="nx"&gt;base&lt;/span&gt;              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;capacity_provider_strategy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;capacity_provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_ecs_capacity_provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;anywhere&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
    &lt;span class="nx"&gt;weight&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;anywhere_weight&lt;/span&gt;
    &lt;span class="nx"&gt;base&lt;/span&gt;              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Shifting between Fargate and Anywhere is a Terraform variable change — no service restart, no DNS change, no downtime. The service is always running on both; only the task distribution changes.&lt;/p&gt;

&lt;p&gt;This pattern works well for workloads that need geographic proximity to edge infrastructure or where data sovereignty makes cloud-only deployment impractical. It also provides a genuine multi-location deployment model without requiring a separate orchestrator.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to stay on ECS
&lt;/h2&gt;

&lt;p&gt;ECS Fargate is the right long-term answer — not just the bridge — when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Service count is small (under roughly 15-20 services) and autoscaling requirements are straightforward target-tracking&lt;/li&gt;
&lt;li&gt;The team's operational capacity doesn't yet support cluster-level operations: node group upgrades, admission controller management, custom scheduler configuration&lt;/li&gt;
&lt;li&gt;Deploys via Terraform or CI/CD pipelines are acceptable and GitOps isn't a hard requirement&lt;/li&gt;
&lt;li&gt;No hard requirement for KEDA, HPA with custom metrics, or cluster-level bin-packing&lt;/li&gt;
&lt;/ul&gt;
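
&lt;p&gt;For reference, "straightforward target-tracking" on an ECS service is about two resources in Terraform. A minimal sketch, assuming a service defined as &lt;code&gt;aws_ecs_service.api&lt;/code&gt; (hypothetical) and a 60 percent CPU target chosen purely for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Sketch only. The service reference, capacity bounds, and 60 percent target
# are illustrative assumptions.
resource "aws_appautoscaling_target" "api" {
  service_namespace  = "ecs"
  scalable_dimension = "ecs:service:DesiredCount"
  resource_id        = "service/${aws_ecs_cluster.platform.name}/${aws_ecs_service.api.name}"
  min_capacity       = 2
  max_capacity       = 10
}

resource "aws_appautoscaling_policy" "api_cpu" {
  name               = "api-cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.api.service_namespace
  scalable_dimension = aws_appautoscaling_target.api.scalable_dimension
  resource_id        = aws_appautoscaling_target.api.resource_id

  target_tracking_scaling_policy_configuration {
    target_value       = 60  # average CPU percent across the service
    scale_in_cooldown  = 300 # seconds; scale in slowly
    scale_out_cooldown = 60  # seconds; scale out quickly

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If your scaling needs fit in a block like this, that is a reasonable signal that KEDA or custom-metric HPA isn't a hard requirement yet.&lt;/p&gt;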

&lt;p&gt;The ECS vs. EKS decision framework is covered in more detail in an &lt;a href="https://graycloudarch.com/blog/ecs-vs-eks-decision-framework/" rel="noopener noreferrer"&gt;earlier post&lt;/a&gt;. The short version: it's an operational capacity question, not a features comparison.&lt;/p&gt;

&lt;p&gt;The bridge pattern is valuable precisely because it decouples the containerization decision from the orchestration decision. You can containerize now, on ECS, without betting that the team is ready to operate Kubernetes. When the team is ready — and that readiness is genuinely there, not aspirational — the migration from ECS to EKS is mostly mechanical. The hard work of containerizing the application is already done.&lt;/p&gt;

&lt;p&gt;Running a platform migration and figuring out the container orchestration path? This is the kind of decision I work through with teams regularly. &lt;a href="https://graycloudarch.com/contact/" rel="noopener noreferrer"&gt;Get in touch.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ecs</category>
      <category>eks</category>
      <category>fargate</category>
    </item>
    <item>
      <title>Zero-Downtime DNS Cutover with ACM and ALB on AWS</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Tue, 09 Jun 2026 16:29:43 +0000</pubDate>
      <link>https://dev.to/tallgray1/zero-downtime-dns-cutover-with-acm-and-alb-on-aws-58g9</link>
      <guid>https://dev.to/tallgray1/zero-downtime-dns-cutover-with-acm-and-alb-on-aws-58g9</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/dns-acm-alb-cutover/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;We were migrating the services behind a high-volume content distribution platform from one orchestration layer to another — new ALB, new target groups, new ECS cluster — and the question came up: when do we touch the DNS record?&lt;/p&gt;

&lt;p&gt;The answer was: last. After everything else is done, validated, and confirmed healthy under real traffic conditions. Not concurrently with the new infrastructure. Not "we'll validate it after we cut over." Last.&lt;/p&gt;

&lt;p&gt;Most DNS-related outages aren't caused by the DNS change itself. They're caused by the things that weren't ready when the change was made — a certificate that hadn't finished validating, a health check that passed in staging but broke under production traffic patterns, a TTL that made rollback a 10-minute ordeal instead of a 30-second one. The pattern that avoids all of those is the same: do all the risky work before you touch DNS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7hithm3b0czib5msvcw2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7hithm3b0czib5msvcw2.png" alt="DNS cutover phases: pre-provision → lower TTL at T-48h → cutover at T-0 → 72h soak → decommission" width="800" height="90"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pre-provision everything before touching DNS
&lt;/h2&gt;

&lt;p&gt;The new ALB, ACM certificate, target groups, and health check configuration all exist before any DNS change. No user traffic touches the new infrastructure during this phase — you're building and validating it in parallel with the old system still serving.&lt;/p&gt;

&lt;p&gt;ACM certificate validation is the step most teams rush. Request the certificate immediately and use DNS validation, not email validation. (If you're managing DNS in Cloudflare rather than Route53, &lt;a href="https://graycloudarch.com/blog/dns-hell-to-automated/" rel="noopener noreferrer"&gt;the automation pattern is covered here&lt;/a&gt; — the principle is the same.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_acm_certificate"&lt;/span&gt; &lt;span class="s2"&gt;"new"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;domain_name&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"api.example.com"&lt;/span&gt;
  &lt;span class="nx"&gt;validation_method&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DNS"&lt;/span&gt;

  &lt;span class="nx"&gt;lifecycle&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;create_before_destroy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_route53_record"&lt;/span&gt; &lt;span class="s2"&gt;"cert_validation"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt; &lt;span class="nx"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;aws_acm_certificate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;domain_validation_options&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;domain_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;name&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_record_name&lt;/span&gt;
      &lt;span class="nx"&gt;record&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_record_value&lt;/span&gt;
      &lt;span class="nx"&gt;type&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_record_type&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;zone_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_route53_zone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;zone_id&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt;
  &lt;span class="nx"&gt;ttl&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
  &lt;span class="nx"&gt;records&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;record&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_acm_certificate_validation"&lt;/span&gt; &lt;span class="s2"&gt;"new"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;certificate_arn&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_acm_certificate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;validation_record_fqdns&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="nx"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;aws_route53_record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cert_validation&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fqdn&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The validation CNAME goes into Route53 while the old system is still serving traffic. You wait for &lt;code&gt;ISSUED&lt;/code&gt; status — typically 5-30 minutes, but occasionally longer. This is not a step you do on the morning of the cutover.&lt;/p&gt;

&lt;p&gt;After the certificate is validated, deploy the new ALB with listener rules and target groups. Register targets and confirm health checks are passing — not just passing in principle, but passing with the actual application container running.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_alb_target_group"&lt;/span&gt; &lt;span class="s2"&gt;"new"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"api-new-tg"&lt;/span&gt;
  &lt;span class="nx"&gt;port&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8080&lt;/span&gt;
  &lt;span class="nx"&gt;protocol&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"HTTP"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_id&lt;/span&gt;
  &lt;span class="nx"&gt;target_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ip"&lt;/span&gt;  &lt;span class="c1"&gt;# required for ECS Fargate&lt;/span&gt;

  &lt;span class="nx"&gt;health_check&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;path&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/health"&lt;/span&gt;
    &lt;span class="nx"&gt;healthy_threshold&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="nx"&gt;unhealthy_threshold&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
    &lt;span class="nx"&gt;interval&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
    &lt;span class="nx"&gt;timeout&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
    &lt;span class="nx"&gt;matcher&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"200"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key check before proceeding: &lt;code&gt;aws elbv2 describe-target-health&lt;/code&gt; should show all targets as &lt;code&gt;healthy&lt;/code&gt;. Not &lt;code&gt;initial&lt;/code&gt;. Not &lt;code&gt;draining&lt;/code&gt;. Healthy. ALB-level health checks and application-level smoke tests are different things — confirm both.&lt;/p&gt;
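
&lt;p&gt;The HTTPS listener is where the certificate and the target group meet. A sketch of how the ordering can be enforced in Terraform, assuming the new load balancer is defined as &lt;code&gt;aws_alb.new&lt;/code&gt; and that the TLS policy shown suits your clients (the policy choice is an assumption):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Sketch only. The ssl_policy value is an assumption; pick the policy your
# clients require.
resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_alb.new.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"

  # Referencing the validation resource (not the certificate itself) makes
  # Terraform wait for ISSUED before it creates the listener.
  certificate_arn = aws_acm_certificate_validation.new.certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_alb_target_group.new.arn
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Pointing &lt;code&gt;certificate_arn&lt;/code&gt; at the validation resource means an apply can't attach a certificate that is still pending.&lt;/p&gt;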

&lt;h2&gt;
  
  
  Lower the TTL — 48 hours in advance
&lt;/h2&gt;

&lt;p&gt;This is the most commonly skipped step and the highest-leverage thing you can do before a DNS cutover.&lt;/p&gt;

&lt;p&gt;Route53 lets you change TTL on an existing record at any time, and it takes effect immediately from Route53's perspective. The problem is DNS resolvers don't care about your new TTL — they cache based on the TTL they observed when they last fetched the record. A resolver that saw a 300-second TTL will hold that cached value for up to 300 seconds after you lower it.&lt;/p&gt;

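&lt;p&gt;In practice this means: if you lower the TTL shortly before the cutover window, with less lead time than the old TTL, some resolvers are still holding the record at the old value, and your effective rollback window is still the old TTL, not the new one. If the cutover has problems and you need to revert, you're waiting for those caches to expire. With production traffic flowing to a misconfigured endpoint. A 48-hour lead covers records with TTLs of a day or more and leaves margin for resolvers that hold entries longer than they should.&lt;/p&gt;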

&lt;p&gt;Lower it 48 hours in advance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_route53_record"&lt;/span&gt; &lt;span class="s2"&gt;"api"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;zone_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_route53_zone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;zone_id&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"api.example.com"&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"A"&lt;/span&gt;

  &lt;span class="nx"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt;                   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_alb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;old&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dns_name&lt;/span&gt;
    &lt;span class="nx"&gt;zone_id&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_alb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;old&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;zone_id&lt;/span&gt;
    &lt;span class="nx"&gt;evaluate_target_health&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;# Alias records take no ttl argument: Route 53 answers with the&lt;/span&gt;
  &lt;span class="c1"&gt;# alias target's TTL (60 seconds for an ELB). The TTL-lowering step&lt;/span&gt;
  &lt;span class="c1"&gt;# applies to non-alias records that set ttl explicitly.&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For non-alias A records with an explicit TTL: &lt;code&gt;300&lt;/code&gt; → &lt;code&gt;30&lt;/code&gt;, committed and applied 48 hours before the cutover window. This is a one-line Terraform change. Apply it, verify it, and move on.&lt;/p&gt;

&lt;p&gt;After 48 hours, all resolvers that previously cached the record at 300 seconds have expired their cache and re-fetched with the new 30-second TTL. Your rollback window is now 30-60 seconds.&lt;/p&gt;
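
&lt;p&gt;For a record that does carry its own TTL, the change is literally one line. A sketch with placeholder values (the resource name and the documentation-range IP address are assumptions for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Sketch only. The resource name and IP address are placeholders.
resource "aws_route53_record" "api_explicit_ttl" {
  zone_id = data.aws_route53_zone.primary.zone_id
  name    = "api.example.com"
  type    = "A"
  ttl     = 30 # was 300; apply at least 48 hours before the cutover
  records = ["192.0.2.10"]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After applying, &lt;code&gt;dig api.example.com&lt;/code&gt; against a public resolver should show the answer's TTL counting down from 30 or less once the old cache entry has expired.&lt;/p&gt;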

&lt;h2&gt;
  
  
  The actual cutover — boring is the goal
&lt;/h2&gt;

&lt;p&gt;With 48-hour TTL prep and pre-validated infrastructure, the cutover itself should take under two minutes and produce no errors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2a98gsge9og66qegl9p6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2a98gsge9og66qegl9p6.png" alt="DNS cutover traffic routing: Route53 alias pointing to old ALB before cutover (solid line) and new ALB after cutover (dashed), both live during 72h soak" width="799" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Update the Route53 alias record to point to the new ALB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_route53_record"&lt;/span&gt; &lt;span class="s2"&gt;"api"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;zone_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_route53_zone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;zone_id&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"api.example.com"&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"A"&lt;/span&gt;

  &lt;span class="nx"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt;                   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_alb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dns_name&lt;/span&gt;       &lt;span class="c1"&gt;# ← changed&lt;/span&gt;
    &lt;span class="nx"&gt;zone_id&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_alb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;zone_id&lt;/span&gt;        &lt;span class="c1"&gt;# ← changed&lt;/span&gt;
    &lt;span class="nx"&gt;evaluate_target_health&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply. Watch three metrics in parallel for the next 5 minutes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New ALB: &lt;code&gt;TargetResponseTime&lt;/code&gt; p99 and &lt;code&gt;HTTPCode_Target_5XX_Count&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;New ALB: &lt;code&gt;HealthyHostCount&lt;/code&gt; — should stay constant&lt;/li&gt;
&lt;li&gt;Application error rates from your monitoring platform&lt;/li&gt;
&lt;/ul&gt;
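
&lt;p&gt;The first two checks can also be backed by CloudWatch alarms so the signal doesn't depend on someone watching a dashboard. A sketch only: the thresholds, periods, and the &lt;code&gt;alert_topic_arn&lt;/code&gt; variable are assumptions to tune for your own traffic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Sketch only. Thresholds, periods, and var.alert_topic_arn are assumptions.
variable "alert_topic_arn" {
  type        = string
  description = "SNS topic that receives cutover alarms"
}

# Fires when targets behind the new ALB return a burst of 5XX responses.
resource "aws_cloudwatch_metric_alarm" "new_alb_5xx" {
  alarm_name          = "api-new-alb-target-5xx"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_5XX_Count"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 2
  comparison_operator = "GreaterThanThreshold"
  threshold           = 10
  treat_missing_data  = "notBreaching" # no 5XX datapoints means no errors

  dimensions = {
    LoadBalancer = aws_alb.new.arn_suffix
  }

  alarm_actions = [var.alert_topic_arn]
}

# Fires when the healthy host count drops below the expected minimum.
resource "aws_cloudwatch_metric_alarm" "new_alb_healthy_hosts" {
  alarm_name          = "api-new-alb-healthy-hosts"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HealthyHostCount"
  statistic           = "Minimum"
  period              = 60
  evaluation_periods  = 2
  comparison_operator = "LessThanThreshold"
  threshold           = 2
  treat_missing_data  = "breaching"

  dimensions = {
    LoadBalancer = aws_alb.new.arn_suffix
    TargetGroup  = aws_alb_target_group.new.arn_suffix
  }

  alarm_actions = [var.alert_topic_arn]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Create the alarms during pre-provisioning so they are already in a known state when the record changes.&lt;/p&gt;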

&lt;p&gt;With 30-second TTL and pre-validated infrastructure, you should see full traffic shift to the new ALB within 2-3 minutes. If you're using weighted routing for a gradual shift, start at 10% new and watch those same metrics before moving to 100%.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Optional: weighted routing for gradual shift&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_route53_record"&lt;/span&gt; &lt;span class="s2"&gt;"api_old"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;zone_id&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_route53_zone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;zone_id&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"api.example.com"&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"A"&lt;/span&gt;
  &lt;span class="nx"&gt;set_identifier&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"old"&lt;/span&gt;

  &lt;span class="nx"&gt;weighted_routing_policy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;weight&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="c1"&gt;# reduce from 100 as you validate&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt;                   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_alb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;old&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dns_name&lt;/span&gt;
    &lt;span class="nx"&gt;zone_id&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_alb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;old&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;zone_id&lt;/span&gt;
    &lt;span class="nx"&gt;evaluate_target_health&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_route53_record"&lt;/span&gt; &lt;span class="s2"&gt;"api_new"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;zone_id&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_route53_zone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;zone_id&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"api.example.com"&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"A"&lt;/span&gt;
  &lt;span class="nx"&gt;set_identifier&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"new"&lt;/span&gt;

  &lt;span class="nx"&gt;weighted_routing_policy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;weight&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;  &lt;span class="c1"&gt;# end state; start at 10 and increase as you validate&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt;                   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_alb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dns_name&lt;/span&gt;
    &lt;span class="nx"&gt;zone_id&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_alb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;zone_id&lt;/span&gt;
    &lt;span class="nx"&gt;evaluate_target_health&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Hold before decommission — at least 72 hours
&lt;/h2&gt;

&lt;p&gt;The cutover is done. Metrics are green. The temptation is to clean up immediately.&lt;/p&gt;

&lt;p&gt;Don't.&lt;/p&gt;

&lt;p&gt;Keep the old ALB running for at least 72 hours. Two reasons:&lt;/p&gt;

&lt;p&gt;First, any clients with hardcoded IPs — rather than DNS names — will break when the ALB is decommissioned. ALB IPs are not static. You won't know these clients exist until they start generating errors after teardown. The 72-hour window surfaces them while you still have an easy fix.&lt;/p&gt;

&lt;p&gt;Second, some clients have unusually long DNS TTL caches or are behind corporate proxies that cache aggressively. They'll still be resolving to the old ALB IP for a while after the cutover. Those requests need somewhere to land.&lt;/p&gt;

&lt;p&gt;After 72 hours, verify: no Route53 records point to the old ALB, no CloudWatch alarms are still scoped to the old ALB's metrics, no ECS services are still registered to the old target groups. Then destroy.&lt;/p&gt;

&lt;p&gt;Decommission through Terraform rather than the console, and keep the old ALB's configuration recoverable from version control for 30 days as a documented rollback artifact. A &lt;code&gt;terraform destroy&lt;/code&gt; with a clear commit message is a better audit trail than a resource that silently disappeared from state.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rollback: a scheduled option, not an emergency
&lt;/h2&gt;

&lt;p&gt;The old ALB staying live through the hold period isn't just caution — it means rollback is a planned capability, not an emergency procedure.&lt;/p&gt;

&lt;p&gt;Define the rollback trigger before the cutover window, not during it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"If error rate exceeds 1% on the new ALB for more than 3 consecutive minutes after full traffic shift, revert."&lt;/p&gt;
&lt;/blockquote&gt;
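
&lt;p&gt;One way to make that trigger concrete is a CloudWatch alarm defined before the window opens. The following is a minimal sketch, not the alarm from this migration: it assumes "error rate" means target 5XX responses as a percentage of requests, reuses &lt;code&gt;aws_alb.new&lt;/code&gt; from the weighted-routing example, and uses an illustrative alarm name.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Sketch: alarm when the 5XX rate on the new ALB stays above 1% for 3 straight minutes
resource "aws_cloudwatch_metric_alarm" "new_alb_error_rate" {
  alarm_name          = "new-alb-5xx-error-rate" # illustrative name
  comparison_operator = "GreaterThanThreshold"
  threshold           = 1 # percent
  evaluation_periods  = 3 # three 60-second periods
  datapoints_to_alarm = 3 # all three must breach
  treat_missing_data  = "notBreaching" # no traffic produces no data points

  metric_query {
    id          = "error_rate"
    expression  = "100 * errors / requests"
    label       = "5XX error rate (%)"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      namespace   = "AWS/ApplicationELB"
      metric_name = "HTTPCode_Target_5XX_Count"
      period      = 60
      stat        = "Sum"
      dimensions = {
        LoadBalancer = aws_alb.new.arn_suffix
      }
    }
  }

  metric_query {
    id = "requests"
    metric {
      namespace   = "AWS/ApplicationELB"
      metric_name = "RequestCount"
      period      = 60
      stat        = "Sum"
      dimensions = {
        LoadBalancer = aws_alb.new.arn_suffix
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The alarm only tells you the trigger has fired. The revert itself is still a deliberate Terraform change made by a person.&lt;/p&gt;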

&lt;p&gt;Rollback is the same Terraform change in reverse — point the alias back at the old ALB. With 30-second TTL: traffic returns within 60 seconds. The old ALB never stopped running, so there's no warm-up time, no health check delay.&lt;/p&gt;

&lt;p&gt;This framing matters for how the team experiences the cutover. If rollback requires scrambling, it creates pressure to push through problems rather than revert cleanly. If rollback is a pre-committed, 60-second operation, the team can move fast and be willing to revert at the first signal.&lt;/p&gt;

&lt;p&gt;The cutovers that cause incidents are the ones where the rollback plan is "we'll figure it out if we need to."&lt;/p&gt;

&lt;p&gt;Planning a production migration and want a second set of eyes on the cutover sequence? This is one of the higher-risk moments in a platform project, and the details that matter are usually in the runbook, not the architecture diagram. &lt;a href="https://graycloudarch.com/contact/" rel="noopener noreferrer"&gt;Get in touch.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>alb</category>
      <category>acm</category>
      <category>route53</category>
    </item>
    <item>
      <title>Terraform CI Is Green. Here's What It Missed.</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Tue, 02 Jun 2026 13:57:37 +0000</pubDate>
      <link>https://dev.to/tallgray1/terraform-ci-is-green-heres-what-it-missed-1mj9</link>
      <guid>https://dev.to/tallgray1/terraform-ci-is-green-heres-what-it-missed-1mj9</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/terraform-ci-reviewer-experience/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The apply produced a diff nobody expected. The plan had been green. The PR had been approved. Two engineers had been moving fast through a Terraform monorepo — module changes, stack updates, new resources in parallel — and the CI was green on every single PR. Nobody saw the problem until the change was already in.&lt;/p&gt;

&lt;p&gt;The cause wasn't bad code. It was a CI pattern so common it's nearly a default: run &lt;code&gt;terraform plan&lt;/code&gt; only for stacks where files changed in the PR.&lt;/p&gt;

&lt;p&gt;That sounds right. It is wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  The specific failure: changed-files detection doesn't know about consumers
&lt;/h2&gt;

&lt;p&gt;Here's the shape of a typical monorepo CI setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;modules/
  network/
    main.tf
stacks/
  prod-vpc/
    main.tf   ← sources from modules/network/
  dev-vpc/
    main.tf   ← sources from modules/network/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A PR modifies &lt;code&gt;modules/network/main.tf&lt;/code&gt;. The changed-files action sees changes in &lt;code&gt;modules/network/&lt;/code&gt;. It runs a plan for &lt;code&gt;modules/network/&lt;/code&gt;. It does not run a plan for &lt;code&gt;stacks/prod-vpc/&lt;/code&gt; or &lt;code&gt;stacks/dev-vpc/&lt;/code&gt; — because those directories have no changed files.&lt;/p&gt;

&lt;p&gt;Both of those stacks will produce a different plan when they're next applied. Nobody saw it before merge.&lt;/p&gt;

&lt;p&gt;The logic is seductive: why run plans for stacks that haven't changed? But the premise is wrong. A stack that &lt;em&gt;sources&lt;/em&gt; a changed module has changed — you just can't see it in the diff. The module change is the diff.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually works
&lt;/h2&gt;

&lt;p&gt;Three approaches, in order of correctness:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run plan for every stack on every PR.&lt;/strong&gt; Expensive on a large monorepo, but correct. Terragrunt's &lt;code&gt;run-all plan&lt;/code&gt; with &lt;code&gt;--terragrunt-parallelism 8&lt;/code&gt; makes this tractable in most codebases. If it's too slow, it's a signal the monorepo has grown past what a single pipeline can handle — and that's a different problem worth surfacing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build a dependency graph.&lt;/strong&gt; Parse &lt;code&gt;source =&lt;/code&gt; references to find all consumers of changed modules, add those stacks to the plan set. This is the right answer architecturally, but it requires build tooling to maintain the graph. Tools like &lt;a href="https://terragrunt.gruntwork.io/docs/reference/config-blocks-and-attributes/#dependency" rel="noopener noreferrer"&gt;Terragrunt's dependency blocks&lt;/a&gt; give you this for free if your dependency declarations are complete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical middle ground.&lt;/strong&gt; Run plan for all stacks in the same directory subtree as any changed module. Not as precise as a graph, but catches the most common failure: a module and its primary consumers living near each other in the directory structure. Works well for codebases where &lt;code&gt;modules/&lt;/code&gt; and &lt;code&gt;stacks/&lt;/code&gt; are adjacent siblings and team conventions keep related things together.&lt;/p&gt;

&lt;p&gt;What doesn't work: &lt;code&gt;paths-filter&lt;/code&gt; or the &lt;code&gt;changed-files&lt;/code&gt; action scoped to the stack directory. It sees no diff, skips the plan, CI stays green, and the module change is invisible to reviewers until apply runs post-merge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three supporting fixes that complete the picture
&lt;/h2&gt;

&lt;p&gt;The module consumer problem is the silent failure mode — it requires a deliberate fix to CI architecture. But there are three other common issues that are cheaper to address and eliminate most of the remaining review friction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Put the plan in the PR comment, not in the logs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A plan that lives in the Actions logs requires a reviewer to click through to the workflow run, find the right job, scroll to the plan output, and read it in isolation from the PR diff. Most reviewers don't. They check whether CI is green and click approve.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Post plan to PR&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;PLAN=$(terraform show -no-color tfplan 2&amp;gt;&amp;amp;1 | head -200)&lt;/span&gt;
    &lt;span class="s"&gt;gh pr comment ${{ github.event.pull_request.number }} \&lt;/span&gt;
      &lt;span class="s"&gt;--body "### Terraform Plan&lt;/span&gt;
    &lt;span class="s"&gt;\`\`\`&lt;/span&gt;
    &lt;span class="s"&gt;${PLAN}&lt;/span&gt;
    &lt;span class="s"&gt;\`\`\`"&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;GH_TOKEN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.GITHUB_TOKEN }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A reviewer who sees the plan inline — showing N resources to add, M to change, 0 to destroy — can make a real decision before clicking approve. The plan comment also becomes a lightweight audit trail: what did we expect to happen, and what actually happened.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enable &lt;code&gt;terraform fmt -check&lt;/code&gt;. For real this time.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most codebases have it disabled. The comment is usually &lt;code&gt;# TODO: fix formatting first&lt;/code&gt;. The fix is a one-time operation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terraform &lt;span class="nb"&gt;fmt&lt;/span&gt; &lt;span class="nt"&gt;-recursive&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
git commit &lt;span class="nt"&gt;-am&lt;/span&gt; &lt;span class="s2"&gt;"terraform fmt: normalize formatting before enforcing check"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then enable the check as a separate fast job. It runs in under 10 seconds, has no false positives, and eliminates the category of review comments that are pure style — freeing reviewers to focus on substance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add &lt;code&gt;tflint&lt;/code&gt; with the AWS ruleset.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;terraform validate&lt;/code&gt; catches syntax errors and internal inconsistencies, including an argument passed to a module that no longer declares it. It does not catch deprecated resource types, instance types that no longer exist, or missing &lt;code&gt;required_providers&lt;/code&gt; entries. Those surface at apply time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .tflint.hcl&lt;/span&gt;
&lt;span class="nx"&gt;plugin&lt;/span&gt; &lt;span class="s2"&gt;"aws"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.32.0"&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"github.com/terraform-linters/tflint-ruleset-aws"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The practical value is catching things that Terraform itself won't catch until it's talking to the AWS API — like an &lt;code&gt;instance_type&lt;/code&gt; that was deprecated, or a &lt;code&gt;required_providers&lt;/code&gt; block that's incomplete after a module upgrade.&lt;/p&gt;
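
&lt;p&gt;The &lt;code&gt;required_providers&lt;/code&gt; check comes from TFLint's bundled Terraform ruleset rather than the AWS plugin. A minimal sketch of enabling it in the same &lt;code&gt;.tflint.hcl&lt;/code&gt;, assuming a TFLint release recent enough to bundle that ruleset:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# .tflint.hcl (addition, alongside the "aws" plugin block)
plugin "terraform" {
  enabled = true
  preset  = "recommended"
}

# Listed explicitly so the intent is visible in review
rule "terraform_required_providers" {
  enabled = true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;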

&lt;h2&gt;
  
  
  What good Terraform CI looks like end-to-end
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zyxa88x0mdtlgfpbday.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zyxa88x0mdtlgfpbday.png" alt="Terraform CI pipeline: fmt check → tflint → plan all stacks → post plan to PR comment → apply on merge" width="800" height="93"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PR opened
  → terraform fmt -check         (fast; fails on style)
  → tflint                       (fast; catches deprecated/missing config)
  → terraform plan (all affected stacks)
  → plan posted to PR comment

PR merged
  → terraform apply (gated on approval + merge)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical constraint is the last line: apply never runs on open PRs. Plan runs freely and often; apply runs exactly once per PR, after merge, and only on approved changes.&lt;/p&gt;

&lt;p&gt;On a monorepo with Terragrunt, &lt;code&gt;run-all plan&lt;/code&gt; handles the multi-stack case. The plan comment step posts one comment per stack with a summary header, so reviewers can scan affected stacks without opening each workflow run.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "CI is green" actually means
&lt;/h2&gt;

&lt;p&gt;Green CI on a Terraform PR means syntax is valid, the workflow ran, and the specific stacks with changed files produced a plan. It does not mean the change is safe. It does not mean the full blast radius is visible.&lt;/p&gt;

&lt;p&gt;The module consumer problem is the clearest example of this gap, but it's not the only one. Infrastructure review requires actually reading the plan — which requires the plan to be somewhere reviewers will look. Green CI that nobody reads is a false signal, and a fast-moving codebase will eventually prove that.&lt;/p&gt;

&lt;p&gt;The four fixes here don't require new tools or platform investment. They require deciding that CI should actually help reviewers make decisions, not just confirm the workflow completed.&lt;/p&gt;

&lt;p&gt;Working through Terraform CI gaps in a fast-moving monorepo? This is the kind of platform work I do regularly. &lt;a href="https://graycloudarch.com/contact/" rel="noopener noreferrer"&gt;Get in touch.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>cicd</category>
      <category>githubactions</category>
      <category>infrastructureascode</category>
    </item>
    <item>
      <title>S3 Table Buckets in Terraform: What Nobody Warned Me About</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Wed, 27 May 2026 12:45:15 +0000</pubDate>
      <link>https://dev.to/tallgray1/s3-table-buckets-in-terraform-what-nobody-warned-me-about-52gj</link>
      <guid>https://dev.to/tallgray1/s3-table-buckets-in-terraform-what-nobody-warned-me-about-52gj</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/s3-table-buckets-terraform-gotchas/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The table bucket created without errors. The Terraform apply was clean. The KMS key was attached. The task definition had the right IAM role.&lt;/p&gt;

&lt;p&gt;Then the first read operation hit &lt;code&gt;AccessDenied&lt;/code&gt; on a KMS decrypt call, and the error gave no hint about what was actually wrong.&lt;/p&gt;

&lt;p&gt;We'd added S3 Table Buckets to a data lake architecture — &lt;a href="https://graycloudarch.com/blog/apache-iceberg-lakehouse-s3-table-buckets/" rel="noopener noreferrer"&gt;purpose-built Iceberg storage with 10x faster metadata operations&lt;/a&gt; compared to standard S3. The architecture decision was straightforward. The Terraform implementation had five gotchas that weren't in any documentation I found before running into them. This post documents all of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The KMS Key Policy Needs a Service Principal You Won't Guess
&lt;/h2&gt;

&lt;p&gt;The KMS principal for S3 Table Buckets is &lt;code&gt;s3tables.amazonaws.com&lt;/code&gt; — not &lt;code&gt;s3.amazonaws.com&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This distinction matters because most teams already have a KMS key for their S3 buckets with a key policy that includes &lt;code&gt;s3.amazonaws.com&lt;/code&gt; as a service principal. When you encrypt a table bucket with that same key without updating the policy, the bucket creates successfully. ACLs, tags, Terraform state — everything looks right. The failure happens on the first metadata read, when the &lt;code&gt;s3tables&lt;/code&gt; service tries to decrypt and the key policy doesn't authorize it.&lt;/p&gt;

&lt;p&gt;The error is &lt;code&gt;AccessDenied&lt;/code&gt; from KMS, which sends you looking at IAM policies on your application role before you think to check the key policy. The required addition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AllowS3TableBuckets"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Service"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3tables.amazonaws.com"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"kms:GenerateDataKey"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"kms:Decrypt"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Terraform, this goes in your KMS key resource's &lt;code&gt;policy&lt;/code&gt; document — either as an additional &lt;code&gt;statement&lt;/code&gt; block in an &lt;code&gt;aws_iam_policy_document&lt;/code&gt; data source, or as a JSON merge if your key policy is managed elsewhere. The fix took about 10 minutes once we knew what to look for. Finding it took most of an afternoon.&lt;/p&gt;

&lt;p&gt;If you're using a customer-managed key (and you should be for any production data lake), add this statement before you create the table bucket, not after. The bucket will create cleanly either way — the failure only appears at access time.&lt;/p&gt;
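
&lt;p&gt;As a sketch of the &lt;code&gt;aws_iam_policy_document&lt;/code&gt; route, the same statement might look like this in HCL. The resource names and key description are illustrative, and the account-root statement is an assumption standing in for whatever administration statements your key policy already carries:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;data "aws_caller_identity" "current" {}

data "aws_iam_policy_document" "lake_key" {
  # Assumed baseline: keeps the account able to administer the key
  statement {
    sid       = "EnableAccountAdministration"
    effect    = "Allow"
    actions   = ["kms:*"]
    resources = ["*"]

    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"]
    }
  }

  # The statement from the JSON above, expressed in HCL
  statement {
    sid       = "AllowS3TableBuckets"
    effect    = "Allow"
    actions   = ["kms:GenerateDataKey", "kms:Decrypt"]
    resources = ["*"]

    principals {
      type        = "Service"
      identifiers = ["s3tables.amazonaws.com"]
    }
  }
}

resource "aws_kms_key" "lake" {
  description = "Data lake encryption key" # illustrative
  policy      = data.aws_iam_policy_document.lake_key.json
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;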

&lt;h2&gt;
  
  
  The IAM Permissions Are in a Different Namespace
&lt;/h2&gt;

&lt;p&gt;Once the KMS issue was resolved, the next &lt;code&gt;AccessDenied&lt;/code&gt; came from a different place: table operations.&lt;/p&gt;

&lt;p&gt;Standard S3 permissions — &lt;code&gt;s3:GetObject&lt;/code&gt;, &lt;code&gt;s3:PutObject&lt;/code&gt;, &lt;code&gt;s3:ListBucket&lt;/code&gt; — don't apply to Table Bucket operations. Table Bucket operations live in the &lt;code&gt;s3tables:*&lt;/code&gt; namespace: &lt;code&gt;s3tables:GetTableBucket&lt;/code&gt;, &lt;code&gt;s3tables:CreateTable&lt;/code&gt;, &lt;code&gt;s3tables:GetTableData&lt;/code&gt;, &lt;code&gt;s3tables:PutTableData&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Roles with full S3 access have no access to table buckets. Roles with table bucket permissions still need standard S3 for underlying object operations. You need both, explicitly granted.&lt;/p&gt;

&lt;p&gt;The minimal IAM policy for a role that reads from table buckets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_policy_document"&lt;/span&gt; &lt;span class="s2"&gt;"table_bucket_read"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;actions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="s2"&gt;"s3tables:GetTableBucket"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"s3tables:ListTables"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"s3tables:GetTable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"s3tables:GetTableData"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;resources&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="nx"&gt;aws_s3tables_table_bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"${aws_s3tables_table_bucket.this.arn}/*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;# Underlying S3 object access — required in addition to s3tables:* permissions&lt;/span&gt;
  &lt;span class="nx"&gt;statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;actions&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3:GetObject"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"s3:ListBucket"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;resources&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="s2"&gt;"arn:aws:s3:::${aws_s3tables_table_bucket.this.name}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"arn:aws:s3:::${aws_s3tables_table_bucket.this.name}/*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For write access, add &lt;code&gt;s3tables:PutTableData&lt;/code&gt;, &lt;code&gt;s3tables:CreateTable&lt;/code&gt;, and &lt;code&gt;s3tables:DeleteTable&lt;/code&gt; as needed. The principle of least privilege applies here more strictly than with standard S3 — there's no &lt;code&gt;s3tables:*&lt;/code&gt; wildcard shortcut that's safe to use in production.&lt;/p&gt;
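
&lt;p&gt;A write policy can layer on top of the read document rather than duplicate it. This is a minimal sketch under assumptions: &lt;code&gt;aws_iam_role.writer&lt;/code&gt; and the policy name are hypothetical, and the action list should be trimmed to what the role actually needs.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;data "aws_iam_policy_document" "table_bucket_write" {
  # Reuse the read statements instead of copying them
  source_policy_documents = [data.aws_iam_policy_document.table_bucket_read.json]

  statement {
    actions = [
      "s3tables:PutTableData",
      "s3tables:CreateTable",
      "s3tables:DeleteTable", # drop this if the role should never remove tables
    ]
    resources = [
      aws_s3tables_table_bucket.this.arn,
      "${aws_s3tables_table_bucket.this.arn}/*",
    ]
  }
}

# Hypothetical role: attach to whichever role your writer task assumes
resource "aws_iam_role_policy" "table_bucket_write" {
  name   = "table-bucket-write"
  role   = aws_iam_role.writer.id
  policy = data.aws_iam_policy_document.table_bucket_write.json
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;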

&lt;h2&gt;
  
  
  Don't Mix Table Buckets and Standard S3 in the Same Terraform Component
&lt;/h2&gt;

&lt;p&gt;This one is subtle and doesn't always bite you immediately.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;aws_s3tables_table_bucket&lt;/code&gt; resource uses a different API endpoint from &lt;code&gt;aws_s3_bucket&lt;/code&gt;. When both resource types are in the same Terraform root module or Terragrunt component, the AWS provider's resource graph can produce ordering conflicts on concurrent applies. The symptom isn't usually an apply error — it's unexpected diffs on subsequent plans, where a table bucket resource shows changes that shouldn't be there based on the configuration.&lt;/p&gt;

&lt;p&gt;The fix is isolation: one Terragrunt component for table buckets, one for standard S3 buckets, both pulling encryption keys from a separate KMS component. The dependency chain is explicit and clean:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk000va2p2hvbg2cuqgxc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk000va2p2hvbg2cuqgxc.png" alt="Isolated component structure for S3 Table Buckets, KMS, and standard S3 in separate Terraform state" width="799" height="574"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform/
└── data-lake/
    ├── kms/                  # KMS key — created first
    │   └── main.tf
    ├── s3-standard/          # landing, raw, curated S3 buckets
    │   ├── main.tf
    │   └── terragrunt.hcl    # depends_on kms/
    └── s3-table-buckets/     # table buckets in isolated state
        ├── main.tf
        └── terragrunt.hcl    # depends_on kms/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The standard S3 and table bucket components have no dependency on each other — they both depend on the KMS component and nothing else. If a table bucket apply fails, it doesn't touch standard S3 state. &lt;code&gt;terraform plan&lt;/code&gt; for one doesn't show noise from the other.&lt;/p&gt;
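
&lt;p&gt;In Terragrunt, that dependency is declared with a &lt;code&gt;dependency&lt;/code&gt; block. A minimal sketch for the table bucket component, assuming the &lt;code&gt;kms/&lt;/code&gt; directory is itself a Terragrunt unit; the &lt;code&gt;key_arn&lt;/code&gt; output and &lt;code&gt;kms_key_arn&lt;/code&gt; input are hypothetical names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# terraform/data-lake/s3-table-buckets/terragrunt.hcl
dependency "kms" {
  config_path = "../kms"

  # Placeholder so plan works before the KMS component has been applied
  mock_outputs = {
    key_arn = "arn:aws:kms:us-east-1:111111111111:key/00000000-0000-0000-0000-000000000000"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  # Hypothetical variable name on the table bucket module
  kms_key_arn = dependency.kms.outputs.key_arn
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;s3-standard/&lt;/code&gt; component would carry the same block, and neither component references the other.&lt;/p&gt;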

&lt;h2&gt;
  
  
  Check Region Availability Before Designing Around Table Buckets
&lt;/h2&gt;

&lt;p&gt;S3 Table Buckets launched in a limited set of regions and have been expanding, but as of mid-2026 they're still not available everywhere. The list includes &lt;code&gt;us-east-1&lt;/code&gt;, &lt;code&gt;us-west-2&lt;/code&gt;, &lt;code&gt;eu-west-1&lt;/code&gt;, and a handful of others — but not all regions where you might run a data platform.&lt;/p&gt;

&lt;p&gt;The check is fast:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3tables list-table-buckets &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1
&lt;span class="c"&gt;# Returns an empty list if available, an endpoint error if not&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If Table Buckets aren't available in your required region, the fallback is the pre-Table Buckets architecture: standard S3 plus Glue Data Catalog for Iceberg metadata management. That architecture works well and is broadly available. The &lt;a href="https://graycloudarch.com/blog/apache-iceberg-lakehouse-s3-table-buckets/" rel="noopener noreferrer"&gt;Iceberg lakehouse post&lt;/a&gt; covers it.&lt;/p&gt;

&lt;p&gt;Don't let region availability be a late discovery. Run this check before the architecture is committed.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;terraform import&lt;/code&gt; for Existing Table Buckets Doesn't Work Cleanly
&lt;/h2&gt;

&lt;p&gt;If a table bucket was created manually — console, CLI, a one-off script — before your Terraform module existed, bringing it under IaC management is messy.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;terraform import&lt;/code&gt; command for &lt;code&gt;aws_s3tables_table_bucket&lt;/code&gt; expects a resource ID format that's different from what you'd derive from the ARN. The exact format is the table bucket name, not the ARN, not the resource ID from the console. AWS documentation is inconsistent about this.&lt;/p&gt;

&lt;p&gt;Even when the import runs without errors, the resulting state may show plan diffs for attributes like &lt;code&gt;created_at&lt;/code&gt; and &lt;code&gt;arn&lt;/code&gt; that Terraform can't manage but includes in the resource schema. These show up as perpetual diffs that you can't suppress cleanly.&lt;/p&gt;

&lt;p&gt;The safer path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Reference the existing bucket with a data source — don't try to import it&lt;/span&gt;
&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_s3tables_table_bucket"&lt;/span&gt; &lt;span class="s2"&gt;"existing"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"my-existing-table-bucket"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Reference it from other resources&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_s3tables_namespace"&lt;/span&gt; &lt;span class="s2"&gt;"raw"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;table_bucket_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_s3tables_table_bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;namespace&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"raw"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the data source to reference the existing bucket, manage new namespaces and tables via Terraform, and defer full ownership transfer (replacing the manually-created bucket with a Terraform-managed one) to a planned migration window. It's more work than import, but the state stays clean.&lt;/p&gt;

&lt;p&gt;The architecture itself — Table Buckets replacing Glue Data Catalog for Iceberg metadata — is solid. These are operational details that mostly show up after the design decision is made. Better to find them here than at 2am during a data pipeline deployment.&lt;/p&gt;

&lt;p&gt;Building out a data lake architecture on AWS and running into Table Bucket or Iceberg issues? This is the kind of platform work I do regularly. &lt;a href="https://graycloudarch.com/contact/" rel="noopener noreferrer"&gt;Get in touch.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>s3</category>
      <category>terraform</category>
      <category>apacheiceberg</category>
    </item>
    <item>
      <title>ECS vs EKS in 2026: The Decision Framework—Including ECS Anywhere</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Tue, 19 May 2026 14:19:34 +0000</pubDate>
      <link>https://dev.to/tallgray1/ecs-vs-eks-in-2026-the-decision-framework-including-ecs-anywhere-47op</link>
      <guid>https://dev.to/tallgray1/ecs-vs-eks-in-2026-the-decision-framework-including-ecs-anywhere-47op</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/ecs-vs-eks-decision-framework/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The CTO wanted to know why the platform team had picked EKS for their new environment. They'd been running ECS for two years without issues. The team lead explained they needed GitOps, better autoscaling, and "industry-standard tooling."&lt;/p&gt;

&lt;p&gt;Three months later, they were debugging a cert-manager webhook failure at 11am. Two engineers had spent 30 hours the previous month on cluster operations. They hadn't shipped a net-new feature in six weeks.&lt;/p&gt;

&lt;p&gt;EKS wasn't wrong for them. The timing was. They had three engineers, twelve services, and no one who'd operated a Kubernetes cluster in production before. The ecosystem they wanted required them to operate it first.&lt;/p&gt;

&lt;p&gt;This is the ECS vs EKS conversation most teams don't have until after they've made the choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Actual Decision Axis
&lt;/h2&gt;

&lt;p&gt;Feature comparisons miss the point. Both ECS and EKS run containers reliably. The real question is: what does your team have to operate to make that happen — and what's the cost of getting it wrong?&lt;/p&gt;

&lt;p&gt;Two axes matter:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational capacity&lt;/strong&gt;: How much complexity can your team absorb while still shipping product? A 3-engineer platform team and a 15-engineer platform team are not playing the same game.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes maturity&lt;/strong&gt;: Have your engineers operated k8s in production under pressure? "We've done some k8s" and "we've debugged etcd under load" are not the same thing.&lt;/p&gt;

&lt;p&gt;The answer to which one you should use today often changes in 18 months. A team that's right for ECS now may be right for EKS after their platform engineers have shipped 6 months of Kubernetes work. Building with that arc in mind matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  What ECS Actually Gives You
&lt;/h2&gt;

&lt;p&gt;No control plane to run. That's the headline. With Fargate, there are no nodes to patch, no node groups to right-size, no kubelet to troubleshoot. AWS manages the underlying compute entirely.&lt;/p&gt;

&lt;p&gt;The IAM model is simpler by design. Task roles attach directly to task definitions — no service accounts, no IRSA, no Web Identity tokens to wire up. For engineers coming from EC2-era IAM, this maps cleanly to what they already know.&lt;/p&gt;
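
&lt;p&gt;A minimal Terraform sketch of that model, assuming a hypothetical service named &lt;code&gt;orders&lt;/code&gt;, a placeholder image, and an execution role passed in as a variable. None of these names come from a real deployment:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;variable "execution_role_arn" {
  type        = string
  description = "Hypothetical: role ECS uses to pull the image and write logs"
}

# Trust policy: only ECS tasks may assume this role
data "aws_iam_policy_document" "task_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "orders_task" {
  name               = "orders-task" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.task_assume.json
}

resource "aws_ecs_task_definition" "orders" {
  family                   = "orders"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256
  memory                   = 512
  execution_role_arn       = var.execution_role_arn
  task_role_arn            = aws_iam_role.orders_task.arn # the entire IAM wiring

  container_definitions = jsonencode([{
    name      = "orders"
    image     = "public.ecr.aws/docker/library/nginx:latest" # placeholder image
    essential = true
  }])
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Permissions for the application then attach to &lt;code&gt;orders_task&lt;/code&gt; like any other IAM role.&lt;/p&gt;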

&lt;p&gt;ECS Fargate has no cluster fixed cost. EKS charges $0.10/hr per cluster — $72/month whether you're running one service or fifty. At low service counts or in non-production environments, that difference is real.&lt;/p&gt;

&lt;p&gt;AWS integrations are first-class rather than plugged in. ALB target group registration, CloudMap service discovery, Secrets Manager injection via ECS container secrets — these work without Helm charts or CRDs. The AWS API surface and the ECS API surface are the same surface.&lt;/p&gt;

&lt;p&gt;The internal tools team: 3 engineers, zero Kubernetes background, 8 services. ECS Fargate with a shared Terraform module got them to production in three weeks. No platform team required.&lt;/p&gt;

&lt;h2&gt;
  
  
  What EKS Actually Gives You
&lt;/h2&gt;

&lt;p&gt;Ecosystem depth that ECS simply doesn't have. Karpenter for bin-packing and just-in-time node provisioning. KEDA for event-driven autoscaling off SQS, Kafka, or custom metrics. Argo CD or Flux for GitOps with real reconciliation loops. External Secrets Operator, cert-manager, Prometheus Operator: the tooling is mature, battle-tested, and actively maintained.&lt;/p&gt;

&lt;p&gt;ECS has no equivalent. The closest alternatives are either AWS-native (EventBridge Pipes, Application Auto Scaling) and less flexible, or custom-built and unmaintained after the engineer who wrote them leaves.&lt;/p&gt;

&lt;p&gt;Karpenter in particular changes the EC2 cost math at scale. Intelligent bin-packing and spot interruption handling can cut compute costs 30-50% compared to fixed node groups. Below 20-30 nodes the savings often don't justify the operational overhead. Above that, it's hard to ignore.&lt;/p&gt;

&lt;p&gt;Multi-cloud portability is real if you actually need it. Kubernetes manifests transfer to GKE or AKS. ECS task definitions do not. If "running this workload outside AWS" is a real scenario — not just theoretical — that matters.&lt;/p&gt;

&lt;p&gt;The data platform I worked on: mixed batch and streaming workloads, KEDA scaling on SQS queue depth. ECS autoscaling would have required custom CloudWatch metrics and polling-based triggers. KEDA handled it natively in 20 lines of YAML. That alone settled the decision.&lt;/p&gt;
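
&lt;p&gt;For a sense of scale, here is roughly what a KEDA SQS trigger looks like. This is a sketch, not the manifest from that platform: it is written as a Terraform &lt;code&gt;kubernetes_manifest&lt;/code&gt; resource rather than raw YAML, it assumes the &lt;code&gt;hashicorp/kubernetes&lt;/code&gt; provider is configured and KEDA's CRDs are already installed, and the deployment name, namespace, queue URL, and thresholds are all hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "kubernetes_manifest" "worker_scaler" {
  manifest = {
    apiVersion = "keda.sh/v1alpha1"
    kind       = "ScaledObject"

    metadata = {
      name      = "worker-sqs-scaler" # hypothetical
      namespace = "data-platform"     # hypothetical
    }

    spec = {
      scaleTargetRef  = { name = "worker" } # Deployment to scale (hypothetical)
      minReplicaCount = 0
      maxReplicaCount = 50

      triggers = [{
        type = "aws-sqs-queue"
        metadata = {
          # Placeholder account ID and queue name
          queueURL      = "https://sqs.us-east-1.amazonaws.com/123456789012/ingest-queue"
          queueLength   = "5" # target messages per replica
          awsRegion     = "us-east-1"
          identityOwner = "operator" # use the KEDA operator's own AWS identity
        }
      }]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
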

&lt;h2&gt;
  
  
  The Decision Tree
&lt;/h2&gt;

&lt;p&gt;Walk through these in order. First yes wins.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3zch5zvciylkyaq6lyc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3zch5zvciylkyaq6lyc.png" alt="ECS vs EKS decision framework" width="800" height="1911"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero Kubernetes experience on the team?&lt;/strong&gt; → ECS. The operational cost of learning k8s while building product is real and usually underestimated. The 30-hour/month cluster ops tax from the story above was paid by a team that had some k8s experience. Zero experience is worse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Migrating from an existing ECS platform?&lt;/strong&gt; → ECS. Rewrite and replatform simultaneously fails more often than it succeeds. Stabilize on ECS, migrate later when the workload is boring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Need KEDA, custom-metric HPA, or Karpenter?&lt;/strong&gt; → EKS. ECS autoscaling is Application Auto Scaling against CloudWatch metrics. It works, but the ceiling is lower and the custom metric path is significantly more work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Need GitOps with Argo CD or Flux?&lt;/strong&gt; → EKS. ECS has no native GitOps story. You can build one — CodePipeline + ECS deployment, Terraform-driven deployments — but you're building it. The operational difference is significant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Five or more services sharing infrastructure?&lt;/strong&gt; → EKS. The fixed cost justifies it; shared node pools improve utilization; the per-service overhead of ECS task definitions multiplies fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Default&lt;/strong&gt; → ECS Fargate. Simpler, cheaper to start, and the migration path to EKS is well-understood.&lt;/p&gt;
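
&lt;p&gt;To make the autoscaling branch concrete, this is approximately what "Application Auto Scaling against CloudWatch metrics" looks like in Terraform for an SQS-driven ECS service. Treat it as a sketch: the variables, capacity limits, and target value are assumptions for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;variable "cluster_name" { type = string }
variable "service_name" { type = string }
variable "queue_name" { type = string }

# Register the ECS service's desired count as a scalable target
resource "aws_appautoscaling_target" "service" {
  service_namespace  = "ecs"
  resource_id        = "service/${var.cluster_name}/${var.service_name}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 1  # hypothetical limits
  max_capacity       = 20
}

# Target tracking against the queue's visible message count
resource "aws_appautoscaling_policy" "queue_depth" {
  name               = "${var.service_name}-queue-depth"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.service.service_namespace
  resource_id        = aws_appautoscaling_target.service.resource_id
  scalable_dimension = aws_appautoscaling_target.service.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value = 100 # hypothetical: visible messages to hold the queue near

    customized_metric_specification {
      namespace   = "AWS/SQS"
      metric_name = "ApproximateNumberOfMessagesVisible"
      statistic   = "Average"

      dimensions {
        name  = "QueueName"
        value = var.queue_name
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note what this tracks: total queue depth, not backlog per task. Scaling on backlog per task means publishing your own metric, which is where the extra work on the ECS side comes from.&lt;/p&gt;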

&lt;h2&gt;
  
  
  ECS Anywhere: The Third Option
&lt;/h2&gt;

&lt;p&gt;ECS Anywhere gets overlooked in most comparisons because it doesn't fit neatly into a "cloud vs cloud" framing. It belongs in the decision tree.&lt;/p&gt;

&lt;p&gt;ECS Anywhere lets you register non-AWS compute — on-premises servers, VMs in other clouds, edge devices — as ECS external instances. Your task definitions, IAM roles, and tooling stay the same. The ECS control plane in AWS manages scheduling. The compute runs wherever you've registered it.&lt;/p&gt;

&lt;p&gt;Where this actually wins:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regulated environments with data residency requirements.&lt;/strong&gt; If certain workloads must stay on-premises for compliance, ECS Anywhere lets you run them with the same tooling as your AWS workloads. On the GovCloud platform I built, we had ground system software that had to process flight data on local hardware before transmission. ECS Anywhere would have let us manage those workloads from the same ECS cluster as our cloud services — same Terraform modules, same IAM patterns, same observability pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Brownfield migration.&lt;/strong&gt; If you're moving workloads from on-premises to AWS and want a consistent deployment target during the migration, ECS Anywhere gives you that. Register the on-prem servers, migrate task by task, deregister when done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Edge compute.&lt;/strong&gt; Consistent deployment tooling across dozens of edge nodes without running a k8s control plane at each site.&lt;/p&gt;

&lt;p&gt;The constraint: ECS Anywhere instances are external infrastructure you own and patch. Fargate's "no nodes to manage" advantage disappears. The tradeoff is deliberate — you're accepting node management in exchange for placement control.&lt;/p&gt;
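
&lt;p&gt;On the Terraform side, an ECS Anywhere workload differs from a Fargate one mainly in its launch type. A minimal sketch, assuming the external instances are already registered to the cluster out of band, with hypothetical names and a placeholder image:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;variable "cluster_arn" { type = string }
variable "task_role_arn" { type = string }

resource "aws_ecs_task_definition" "ground" {
  family                   = "ground-processor" # hypothetical
  requires_compatibilities = ["EXTERNAL"]
  network_mode             = "bridge" # awsvpc is not available on external instances
  task_role_arn            = var.task_role_arn

  container_definitions = jsonencode([{
    name      = "processor"
    image     = "public.ecr.aws/docker/library/busybox:latest" # placeholder image
    command   = ["sleep", "3600"]
    memory    = 512
    essential = true
  }])
}

resource "aws_ecs_service" "ground" {
  name            = "ground-processor"
  cluster         = var.cluster_arn
  task_definition = aws_ecs_task_definition.ground.arn
  desired_count   = 1
  launch_type     = "EXTERNAL" # schedule onto registered on-prem instances
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
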

&lt;h2&gt;
  
  
  The Migration Path
&lt;/h2&gt;

&lt;p&gt;ECS → EKS migration is well-understood and not particularly risky if the IaC is clean.&lt;/p&gt;

&lt;p&gt;Containerized workloads move without changes. The two meaningful changes are IAM (task roles → IRSA service accounts — mechanical, not complex) and networking (ALB target group registration → Ingress or Service — also mechanical).&lt;/p&gt;
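
&lt;p&gt;The IAM half of that change can be sketched as follows. The role keeps its permissions; only the trust policy and the binding move. This assumes the cluster's OIDC provider already exists and that the &lt;code&gt;hashicorp/kubernetes&lt;/code&gt; provider is configured. The &lt;code&gt;orders&lt;/code&gt; namespace and service account are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;variable "oidc_provider_arn" {
  type        = string
  description = "ARN of the cluster's IAM OIDC provider"
}

variable "oidc_issuer" {
  type        = string
  description = "Issuer URL of the cluster without the https:// prefix"
}

# Trust policy: only the orders/orders service account may assume the role
data "aws_iam_policy_document" "irsa_assume" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.oidc_provider_arn]
    }

    condition {
      test     = "StringEquals"
      variable = "${var.oidc_issuer}:sub"
      values   = ["system:serviceaccount:orders:orders"]
    }
  }
}

resource "aws_iam_role" "orders_irsa" {
  name               = "orders-irsa" # hypothetical
  assume_role_policy = data.aws_iam_policy_document.irsa_assume.json
}

# The annotation is what replaces task_role_arn on the task definition
resource "kubernetes_service_account" "orders" {
  metadata {
    name      = "orders"
    namespace = "orders"
    annotations = {
      "eks.amazonaws.com/role-arn" = aws_iam_role.orders_irsa.arn
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
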

&lt;p&gt;What breaks the migration is task definitions in CloudFormation or hand-managed console resources. If your ECS deployment is 100% Terraform with a module per service, the migration is boring. If it's six engineers' worth of one-off console configurations, it's archaeology.&lt;/p&gt;

&lt;p&gt;Build ECS as if you'll migrate it. Keep task definitions in Terraform modules, service definitions composable, networking configuration explicit. The Jira ticket for "migrate from ECS to EKS" should feel like plumbing work, not a project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistakes I See Repeatedly
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Choosing EKS because it's "industry standard."&lt;/strong&gt; Industry standard at Stripe is not industry standard at a 40-person SaaS company. The operational tax is the same either way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choosing ECS without accounting for the autoscaling ceiling.&lt;/strong&gt; For workloads with bursty, event-driven traffic patterns, ECS autoscaling requires CloudWatch custom metrics and Application Auto Scaling policies that are genuinely annoying to tune. Know the ceiling before you hit it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-cluster EKS for two services.&lt;/strong&gt; The fixed cost of the control plane ($72/month), the operational overhead of running Kubernetes, and the learning curve are all real. For two or three services, this almost never makes sense.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Underestimating the Helm/CRD surface area.&lt;/strong&gt; When a Helm-managed CRD conflicts with another controller at 2am, you need someone on the team who can debug it. "We'll figure it out" is not a plan.&lt;/p&gt;

&lt;p&gt;Building a new platform or rearchitecting an existing container environment? The choice between ECS, EKS, and ECS Anywhere usually comes down to where your team is on the Kubernetes maturity curve and what your autoscaling requirements actually are — not which technology is more capable. &lt;a href="https://graycloudarch.com/contact/" rel="noopener noreferrer"&gt;Get in touch&lt;/a&gt; if you're working through this decision — it's a conversation I have with platform teams regularly, and the right answer depends on specifics that don't fit in a blog post.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ecs</category>
      <category>eks</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Building Apache Iceberg Lakehouse Storage with S3 Table Buckets</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Mon, 18 May 2026 17:54:46 +0000</pubDate>
      <link>https://dev.to/tallgray1/building-apache-iceberg-lakehouse-storage-with-s3-table-buckets-42oo</link>
      <guid>https://dev.to/tallgray1/building-apache-iceberg-lakehouse-storage-with-s3-table-buckets-42oo</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/apache-iceberg-lakehouse-s3-table-buckets/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The data platform team had a deadline and a storage decision to make. They'd committed to Apache Iceberg as the table format — open standard, time travel, schema evolution, the usual reasons. What they hadn't locked down was where the data was actually going to live, and whether the storage layer would hold up under the metadata-heavy access patterns Iceberg requires.&lt;/p&gt;

&lt;p&gt;The default answer is regular S3. It works. Most Iceberg deployments run on it. But AWS launched S3 Table Buckets in late 2024, and they're purpose-built for exactly this workload: Iceberg metadata operations. The numbers made the decision easy — 10x faster metadata queries, 50% or more improvement in query planning time compared to standard S3. The gotcha worth knowing upfront: S3 Table Bucket support requires AWS Provider 5.70 or later. If your Terraform modules are pinned to an older provider version, that's your first upgrade.&lt;/p&gt;

&lt;p&gt;We built the storage layer as a three-zone medallion architecture, fully managed with Terraform. Here's how we did it — including a few things about Table Buckets that don't show up in most writeups.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Medallion Architecture
&lt;/h2&gt;

&lt;p&gt;One table bucket per environment. Zones are namespaces inside the bucket — not separate buckets, not separate Glue databases in the legacy sense:&lt;/p&gt;

&lt;p&gt;&lt;a href="/diagrams/diag-apache-iceberg-medallion.png" class="article-body-image-wrapper"&gt;&lt;img src="/diagrams/diag-apache-iceberg-medallion.png" alt="Medallion architecture — one S3 Table Bucket per environment with raw, clean, and curated namespaces inside. DMS ingests from source systems into raw. EMR Serverless Spark transforms raw to clean and clean to curated. Glue exposes a federated s3tablescatalog integration layer. Athena queries through Glue. BI layer (Superset) sits on top of Athena."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Raw is immutable. Once data lands there, it doesn't change — ETL failures don't corrupt the source record because the source record is untouched. Clean is normalized and domain-aligned, produced by Spark transforms. Curated is the analytics layer that Athena queries and BI dashboards read from.&lt;/p&gt;

&lt;p&gt;The namespace naming convention we used was &lt;code&gt;{zone}_{domain}&lt;/code&gt; — &lt;code&gt;raw_crm&lt;/code&gt;, &lt;code&gt;clean_customer&lt;/code&gt;, &lt;code&gt;curated_sales_metrics&lt;/code&gt;. When you're looking at a table in Athena or debugging a failed transform job, the namespace name tells you exactly what tier you're in and what domain you're touching. Data lineage is readable from table names alone.&lt;/p&gt;
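
&lt;p&gt;One way to keep that convention from drifting is to generate the names instead of typing them. A small sketch in plain HCL; the zone-to-domain map is hypothetical and only mirrors the examples above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;locals {
  # Hypothetical zone-to-domain map
  zone_domains = {
    raw     = ["crm"]
    clean   = ["customer"]
    curated = ["sales_metrics"]
  }

  # Build "{zone}_{domain}" for every pair
  namespaces = flatten([
    for zone, domains in local.zone_domains : [
      for domain in domains : "${zone}_${domain}"
    ]
  ])
}

output "namespaces" {
  value = local.namespaces
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The resulting list can feed a &lt;code&gt;for_each&lt;/code&gt; on whatever resource declares the namespaces, so adding a domain is a one-line change to the map.&lt;/p&gt;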

&lt;h2&gt;
  
  
  Why Two Modules Instead of One
&lt;/h2&gt;

&lt;p&gt;The first design question was whether to build a single composite module that creates the KMS key and the S3 Table Bucket together, or split them into separate modules. We split them.&lt;/p&gt;

&lt;p&gt;The KMS key isn't just for the lake. It's used by five downstream services: Athena for query results, EMR for cluster encryption, MWAA for DAG storage, Kinesis for stream encryption, and Glue for transform outputs. If we bundled the key into the lake storage module, every one of those services would need a dependency chain that eventually resolves back through lake storage just to get a KMS key ARN. Separate modules mean the key has one owner, and everything else declares a dependency on it independently.&lt;/p&gt;

&lt;p&gt;The KMS module:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="c1"&gt;# kms-key/main.tf&lt;/span&gt;
&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_kms_key"&lt;/span&gt; &lt;span class="s2"&gt;"this"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;description&lt;/span&gt;
  &lt;span class="nx"&gt;enable_key_rotation&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;enable_key_rotation&lt;/span&gt;
  &lt;span class="nx"&gt;deletion_window_in_days&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;deletion_window_in_days&lt;/span&gt;

  &lt;span class="nx"&gt;policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Enable IAM User Permissions"&lt;/span&gt;
        &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
        &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AWS&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::&lt;/span&gt;&lt;span class="k"&gt;${data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_caller_identity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;account_id&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:root"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nx"&gt;Action&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"kms:*"&lt;/span&gt;
        &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"*"&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow Service Access"&lt;/span&gt;
        &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
        &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Service&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;service_principals&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"kms:Decrypt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"kms:GenerateDataKey"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"kms:CreateGrant"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"*"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;service_principals&lt;/code&gt; variable takes a list of service principal strings — &lt;code&gt;["athena.amazonaws.com", "glue.amazonaws.com", "emr-serverless.amazonaws.com"]&lt;/code&gt; and so on. Adding a new service that needs key access is one line in the Terragrunt config, no module change required.&lt;/p&gt;
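
&lt;p&gt;The two sides of that contract might look like this. It is a sketch: the file paths are assumed, and the validation rule is an addition for illustration that relies on &lt;code&gt;endswith&lt;/code&gt;, available in Terraform 1.3 and later:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# kms-key/variables.tf (assumed path)
variable "service_principals" {
  type        = list(string)
  description = "AWS service principals allowed to use the key"

  validation {
    condition = alltrue([
      for p in var.service_principals : endswith(p, ".amazonaws.com")
    ])
    error_message = "Each entry must be a service principal such as athena.amazonaws.com."
  }
}

# kms-key/terragrunt.hcl (assumed path, separate file)
inputs = {
  service_principals = [
    "athena.amazonaws.com",
    "glue.amazonaws.com",
    "emr-serverless.amazonaws.com",
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
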

&lt;h2&gt;
  
  
  The S3 Table Bucket Module
&lt;/h2&gt;

&lt;p&gt;The table bucket itself is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="c1"&gt;# s3-table-bucket/main.tf&lt;/span&gt;
&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_s3tables_table_bucket"&lt;/span&gt; &lt;span class="s2"&gt;"this"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bucket_name&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One important thing that trips people up: &lt;strong&gt;S3 Table Buckets are not standard S3 buckets.&lt;/strong&gt; They use the S3 Tables API, not the standard S3 API. Several standard S3 resources will fail with &lt;code&gt;NoSuchBucket (404)&lt;/code&gt; if you try to attach them to a Table Bucket:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;aws_s3_bucket_versioning&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;aws_s3_bucket_server_side_encryption_configuration&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;aws_s3_bucket_public_access_block&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;aws_s3_bucket_intelligent_tiering_configuration&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Encryption is managed internally — AES256 is applied on creation automatically. You'll want &lt;code&gt;ignore_changes = [encryption_configuration]&lt;/code&gt; in your lifecycle block or Terraform will constantly detect drift.&lt;/p&gt;
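
&lt;p&gt;In the module, that looks roughly like the following. It assumes a provider version whose &lt;code&gt;aws_s3tables_table_bucket&lt;/code&gt; resource exposes &lt;code&gt;encryption_configuration&lt;/code&gt;; check the schema for the version you have pinned:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# s3-table-bucket/main.tf
resource "aws_s3tables_table_bucket" "this" {
  name = var.bucket_name

  lifecycle {
    # Encryption is applied by the service; don't treat it as drift
    ignore_changes = [encryption_configuration]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
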

&lt;p&gt;The Terragrunt dependency chain wires the KMS key ARN into the table bucket configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# lake-storage/terragrunt.hcl&lt;/span&gt;
&lt;span class="nx"&gt;dependency&lt;/span&gt; &lt;span class="s2"&gt;"kms"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;config_path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../kms-key"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;inputs&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;bucket_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"company-lake-${local.environment}"&lt;/span&gt;
  &lt;span class="nx"&gt;kms_key_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dependency&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;kms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;key_arn&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Glue Is Not the Catalog
&lt;/h2&gt;

&lt;p&gt;This is the part that most S3 Table Bucket writeups get wrong, and it matters for how you structure the rest of your Terraform.&lt;/p&gt;

&lt;p&gt;S3 Tables is the metadata source of truth. Glue is the integration layer. When you enable the S3 Tables analytics integration, AWS creates a federated catalog named &lt;code&gt;s3tablescatalog&lt;/code&gt; in your Glue Data Catalog. Table buckets, namespaces, and tables are surfaced through that catalog hierarchy — Athena and EMR see them through Glue, but Glue doesn't own them.&lt;/p&gt;

&lt;p&gt;This means you should not be creating &lt;code&gt;aws_glue_catalog_database&lt;/code&gt; resources with &lt;code&gt;location_uri&lt;/code&gt; S3 paths and trying to wire Iceberg metadata parameters onto them. That's the legacy Glue-over-S3-prefixes model. For S3 Tables, the catalog structure comes from the table bucket integration, not from manual Glue database provisioning.&lt;/p&gt;

&lt;p&gt;In Terraform, access control for the bucket is the &lt;code&gt;aws_s3tables_table_bucket_policy&lt;/code&gt; resource, while the analytics integration is enabled separately at the account level. Once it is enabled, Athena queries S3 Tables through the &lt;code&gt;s3tablescatalog&lt;/code&gt; catalog automatically.&lt;/p&gt;
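
&lt;p&gt;A sketch of the access-control piece, assuming a hypothetical reader role and a read-only action set. The action list is illustrative; verify it against the S3 Tables IAM reference before relying on it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;variable "table_bucket_arn" { type = string }
variable "reader_role_arn" { type = string } # hypothetical consumer role

data "aws_iam_policy_document" "lake_read" {
  statement {
    sid    = "AllowTableReads"
    effect = "Allow"

    # Illustrative read-only actions
    actions = [
      "s3tables:GetTable",
      "s3tables:GetTableData",
      "s3tables:GetTableMetadataLocation",
    ]

    resources = [
      var.table_bucket_arn,
      "${var.table_bucket_arn}/*",
    ]

    principals {
      type        = "AWS"
      identifiers = [var.reader_role_arn]
    }
  }
}

resource "aws_s3tables_table_bucket_policy" "this" {
  table_bucket_arn = var.table_bucket_arn
  resource_policy  = data.aws_iam_policy_document.lake_read.json
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
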

&lt;p&gt;The namespace naming convention (&lt;code&gt;raw&lt;/code&gt;, &lt;code&gt;clean&lt;/code&gt;, &lt;code&gt;curated&lt;/code&gt; with domain suffixes) is defined in the table bucket itself, not in Glue. Glue reflects it — it doesn't own it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost Model
&lt;/h2&gt;

&lt;p&gt;For a 100TB lake, the comparison against standard S3 holds:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Storage Class&lt;/th&gt;
&lt;th&gt;When&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Standard&lt;/td&gt;
&lt;td&gt;Active data&lt;/td&gt;
&lt;td&gt;~$2,300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Standard-IA equivalent&lt;/td&gt;
&lt;td&gt;Less-accessed data&lt;/td&gt;
&lt;td&gt;~$400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Glacier equivalent&lt;/td&gt;
&lt;td&gt;Archive&lt;/td&gt;
&lt;td&gt;~$100&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The metadata acceleration charge for Table Buckets is $0.00025 per 1,000 requests. On a 100TB lake with typical Iceberg file sizes, that's a few dollars a month. The performance improvement compounds the cost picture: faster query planning means less Athena scan time, which means lower query costs as data volume grows.&lt;/p&gt;

&lt;p&gt;One note: you cannot attach &lt;code&gt;aws_s3_bucket_intelligent_tiering_configuration&lt;/code&gt; to a Table Bucket — it's a standard S3 resource and will fail. Storage cost optimization for Table Buckets happens through compaction and retention maintenance jobs (typically run on a schedule via MWAA or EMR), not through lifecycle policies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment Sequence
&lt;/h2&gt;

&lt;p&gt;The deployment order is driven by dependencies: KMS must exist before S3 (bucket encryption needs the key ARN), and both must exist before the S3 Tables analytics integration (which creates the federated Glue catalog surface).&lt;/p&gt;

&lt;p&gt;&lt;a href="/diagrams/diag-apache-iceberg-pipeline.png" class="article-body-image-wrapper"&gt;&lt;img src="/diagrams/diag-apache-iceberg-pipeline.png" alt="Deployment sequence: KMS Key must be created first, then S3 Table Bucket (which uses the key ARN), then S3 Tables Analytics Integration which creates the s3tablescatalog federated view in Glue Data Catalog"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In practice, across three environments (dev, nonprod, prod), the full deployment took about four hours. Most of that was process rather than provisioning: the actual resource creation for each component is fast, but we ran plan, reviewed, applied, and verified before moving to the next environment.&lt;/p&gt;

&lt;p&gt;One deployment note: if you're using Athena and haven't enabled the S3 Tables analytics integration in the account before, do that before the apply. Athena can query S3 Tables only after the integration is enabled and the &lt;code&gt;s3tablescatalog&lt;/code&gt; catalog is visible in the Glue Data Catalog.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Data Team Inherited
&lt;/h2&gt;

&lt;p&gt;When we handed this over to the data engineering team, they had a fully provisioned storage foundation — one table bucket per environment, three namespaces per bucket, encryption enabled, and Athena wired to query through the &lt;code&gt;s3tablescatalog&lt;/code&gt; integration. They could start writing Spark jobs and creating tables immediately without worrying about storage configuration or catalog wiring after the fact.&lt;/p&gt;

&lt;p&gt;The Terraform modules are reusable. Adding a new environment is one Terragrunt leaf config. Adding a new domain namespace is a namespace declaration on the existing bucket. The KMS key and integration configuration don't change.&lt;/p&gt;
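
&lt;p&gt;For illustration, a new environment's leaf config could look like this. The directory layout, the root file name, and the &lt;code&gt;staging&lt;/code&gt; environment are assumptions; the inputs match the lake-storage example earlier in the post:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# live/staging/lake-storage/terragrunt.hcl (hypothetical path)
include "root" {
  path = find_in_parent_folders("root.hcl") # assumed root config name
}

terraform {
  source = "../../../modules/s3-table-bucket" # hypothetical module path
}

dependency "kms" {
  config_path = "../kms-key"
}

inputs = {
  bucket_name = "company-lake-staging"
  kms_key_arn = dependency.kms.outputs.key_arn
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
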

&lt;p&gt;S3 Table Buckets are still relatively new, and the Terraform provider support came together in late 2024. If your team is planning an Iceberg migration and hasn't evaluated Table Buckets yet, the metadata performance gains make a strong case for starting there rather than retrofitting later — just go in knowing they're a different API surface than standard S3, and structure your modules accordingly.&lt;/p&gt;




&lt;p&gt;Building out a data platform and figuring out the storage and catalog architecture? &lt;a href="https://graycloudarch.com/contact/" rel="noopener noreferrer"&gt;Get in touch&lt;/a&gt; — this kind of infrastructure design work is something I do regularly, whether you're starting from scratch or migrating an existing lake.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>terraform</category>
      <category>apacheiceberg</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>The 5-Minute Tax I Killed With GitHub Actions</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Mon, 18 May 2026 17:43:31 +0000</pubDate>
      <link>https://dev.to/tallgray1/the-5-minute-tax-i-killed-with-github-actions-1gpe</link>
      <guid>https://dev.to/tallgray1/the-5-minute-tax-i-killed-with-github-actions-1gpe</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/zero-touch-deployments/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Every time I finished writing a blog post, I had to do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;sites/graycloudarch
hugo &lt;span class="nt"&gt;--minify&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;sync &lt;/span&gt;public/ s3://graycloudarch-website &lt;span class="nt"&gt;--delete&lt;/span&gt;
aws cloudfront create-invalidation &lt;span class="nt"&gt;--distribution-id&lt;/span&gt; E1234ABCDEF &lt;span class="nt"&gt;--paths&lt;/span&gt; &lt;span class="s2"&gt;"/*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five minutes. Doesn't sound like much.&lt;/p&gt;

&lt;p&gt;But when you're trying to publish 2-3 posts per week while working full-time, those 5 minutes add up. Not just in time—in &lt;em&gt;friction&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;"I just finished writing. Now I need to context-switch to deployment mode. What was that CloudFront ID again?"&lt;/p&gt;

&lt;p&gt;Friction kills momentum.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Wanted
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;git push&lt;/code&gt; → site updates automatically → I move on to the next thing.&lt;/p&gt;

&lt;p&gt;Zero thinking. Zero context switching. Zero "oh crap, I forgot to invalidate CloudFront."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: GitHub Actions
&lt;/h2&gt;

&lt;p&gt;GitHub Actions can build and deploy your site every time you push to &lt;code&gt;main&lt;/code&gt;. For free.&lt;/p&gt;

&lt;p&gt;Here's the whole workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy graycloudarch.com&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sites/graycloudarch/**'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content/graycloudarch/**'&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;submodules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;peaceiris/actions-hugo@v2&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;hugo-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;latest'&lt;/span&gt;
          &lt;span class="na"&gt;extended&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build site&lt;/span&gt;
        &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./sites/graycloudarch&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hugo --minify&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure AWS&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/configure-aws-credentials@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;aws-access-key-id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.AWS_ACCESS_KEY_ID }}&lt;/span&gt;
          &lt;span class="na"&gt;aws-secret-access-key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.AWS_SECRET_ACCESS_KEY }}&lt;/span&gt;
          &lt;span class="na"&gt;aws-region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy&lt;/span&gt;
        &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./sites/graycloudarch&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;aws s3 sync public/ s3://graycloudarch-website --delete&lt;/span&gt;
          &lt;span class="s"&gt;aws cloudfront create-invalidation \&lt;/span&gt;
            &lt;span class="s"&gt;--distribution-id ${{ secrets.CLOUDFRONT_DISTRIBUTION }} \&lt;/span&gt;
            &lt;span class="s"&gt;--paths "/*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Push to &lt;code&gt;main&lt;/code&gt;, GitHub Actions handles the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Part That Tripped Me Up
&lt;/h2&gt;

&lt;p&gt;Hugo themes are usually Git submodules. If you don't check them out, your build fails with cryptic errors about missing layouts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;submodules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  &lt;span class="c1"&gt;# Don't forget this&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cost me 20 minutes of debugging before I realized. Now it's documented in code, not lost in my bash history.&lt;/p&gt;

&lt;h2&gt;
  
  
  Path Filtering: The Secret Sauce
&lt;/h2&gt;

&lt;p&gt;I run two sites in one repo: graycloudarch.com and cloudpatterns.io.&lt;/p&gt;

&lt;p&gt;Without path filtering, every push rebuilds &lt;em&gt;both&lt;/em&gt; sites, even if I only changed one. Wasted build minutes, unnecessary CloudFront invalidations, slower feedback.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sites/graycloudarch/**'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content/graycloudarch/**'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now GitHub Actions only runs when files for &lt;em&gt;that site&lt;/em&gt; change. Fast, efficient, no waste.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;I'm trying to hit $3K/month by March 31. That's 9 weeks.&lt;/p&gt;

&lt;p&gt;Every minute I spend deploying is a minute I'm not writing, not reaching out to clients, not building the course I want to sell.&lt;/p&gt;

&lt;p&gt;Manual deployments are a tax on my time. This workflow eliminated that tax.&lt;/p&gt;

&lt;p&gt;Now when I finish writing, I commit and push. Two minutes later, it's live. I'm already working on the next post.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Win
&lt;/h2&gt;

&lt;p&gt;It's not the 5 minutes per deployment.&lt;/p&gt;

&lt;p&gt;It's the &lt;em&gt;mental overhead&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Before: "Okay, post is done. Now I need to switch gears, build Hugo, sync to S3, remember that CloudFront command..."&lt;/p&gt;

&lt;p&gt;After: "Post is done. &lt;code&gt;git push&lt;/code&gt;. What's next?"&lt;/p&gt;

&lt;p&gt;No context switch. No friction. Just ship and move on.&lt;/p&gt;

&lt;p&gt;That's worth way more than 5 minutes.&lt;/p&gt;

&lt;p&gt;Want to set this up for your site? The workflow above works for any Hugo + S3 + CloudFront setup. Swap in your own bucket name and path filters, then store your AWS credentials and CloudFront distribution ID in GitHub Secrets.&lt;/p&gt;

&lt;p&gt;Or &lt;a href="https://graycloudarch.com/contact" rel="noopener noreferrer"&gt;reach out&lt;/a&gt; if you want help automating your deployments. I do this for a living.&lt;/p&gt;

</description>
      <category>githubactions</category>
      <category>cicd</category>
      <category>hugo</category>
      <category>automation</category>
    </item>
    <item>
      <title>I Spent 6 Hours Automating a 30-Minute Task (And I'd Do It Again)</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Mon, 18 May 2026 17:43:30 +0000</pubDate>
      <link>https://dev.to/tallgray1/i-spent-6-hours-automating-a-30-minute-task-and-id-do-it-again-14ee</link>
      <guid>https://dev.to/tallgray1/i-spent-6-hours-automating-a-30-minute-task-and-id-do-it-again-14ee</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/automated-infrastructure/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Look, I know what you're thinking. "Glenn, you could've just clicked through the AWS console and had both sites live in an hour."&lt;/p&gt;

&lt;p&gt;You're not wrong.&lt;/p&gt;

&lt;p&gt;But here's the thing: I'm allergic to clicking through consoles. It's an occupational hazard from spending the last 5 years building enterprise platforms where "just do it manually" gets you fired.&lt;/p&gt;

&lt;p&gt;So when I sat down to launch graycloudarch.com and cloudpatterns.io, I did what any reasonable person would do: I spent 6 hours writing Terraform to automate a 30-minute task.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Manual Way (aka Hell)
&lt;/h2&gt;

&lt;p&gt;If I'd done this the normal way:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AWS Console → ACM → Request Certificate&lt;/li&gt;
&lt;li&gt;Copy the DNS validation CNAME&lt;/li&gt;
&lt;li&gt;Cloudflare → Add DNS record&lt;/li&gt;
&lt;li&gt;Wait. Refresh. Wait more.&lt;/li&gt;
&lt;li&gt;AWS Console → CloudFront → Create Distribution&lt;/li&gt;
&lt;li&gt;Copy CloudFront domain&lt;/li&gt;
&lt;li&gt;Cloudflare → Add another DNS record&lt;/li&gt;
&lt;li&gt;Test. Find typo. Fix typo. Test again.&lt;/li&gt;
&lt;li&gt;Repeat for second domain.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Time: 30 minutes if nothing breaks (it always breaks).&lt;/p&gt;

&lt;p&gt;Chance I'd screw up a DNS record: 80%.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Automated Way (aka Overkill)
&lt;/h2&gt;

&lt;p&gt;One Terraform apply. That's it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terraform apply
&lt;span class="c"&gt;# Go make coffee&lt;/span&gt;
&lt;span class="c"&gt;# Come back to two working sites&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But the real magic isn't the deployment—it's what happens when AWS generates those ACM validation records:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"cloudflare_record"&lt;/span&gt; &lt;span class="s2"&gt;"cert_validation"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt; &lt;span class="nx"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;aws_acm_certificate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;site&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;domain_validation_options&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt;
      &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;domain_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;name&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_record_name&lt;/span&gt;
        &lt;span class="nx"&gt;record&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_record_value&lt;/span&gt;
        &lt;span class="nx"&gt;type&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_record_type&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;zone_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cloudflare_zone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;site&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;value&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;record&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform reads the validation records from AWS, creates them in Cloudflare, and waits for validation to complete. Zero copy-paste. Zero switching between browser tabs. Zero forgetting which CNAME goes where.&lt;/p&gt;
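
&lt;p&gt;The waiting step is not part of the record resource itself. A minimal sketch of how it is commonly wired up, assuming the certificate is &lt;code&gt;aws_acm_certificate.site&lt;/code&gt; as in the snippet above; the label &lt;code&gt;site&lt;/code&gt; on the validation resource is an illustrative choice, not taken from the original module:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Sketch only: blocks the apply until ACM reports the certificate as issued.
resource "aws_acm_certificate_validation" "site" {
  certificate_arn = aws_acm_certificate.site.arn

  # Make sure the Cloudflare validation records exist before polling ACM.
  depends_on = [cloudflare_record.cert_validation]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Pointing the CloudFront distribution at &lt;code&gt;aws_acm_certificate_validation.site.certificate_arn&lt;/code&gt; rather than the certificate's own ARN should keep Terraform from attaching a certificate that has not been issued yet.&lt;/p&gt;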

&lt;p&gt;I don't touch Cloudflare. I don't touch AWS Console. I just run terraform apply and go do something useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters (Spoiler: It's Not About Terraform)
&lt;/h2&gt;

&lt;p&gt;I'm trying to hit $3K/month by March 31. That's 9 weeks away.&lt;/p&gt;

&lt;p&gt;Every hour I spend clicking through AWS is an hour I'm not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing blog posts&lt;/li&gt;
&lt;li&gt;Reaching out to potential clients on LinkedIn&lt;/li&gt;
&lt;li&gt;Building the course I want to sell&lt;/li&gt;
&lt;li&gt;Actually making money&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Manual infrastructure doesn't generate revenue. Published content generates revenue.&lt;/p&gt;

&lt;p&gt;So yeah, I spent 6 hours automating something I could've done in 30 minutes. But now when I launch my third brand (and I will), it takes 10 minutes and one terraform apply.&lt;/p&gt;

&lt;p&gt;That's the bet: upfront investment for long-term velocity.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Actually Built
&lt;/h2&gt;

&lt;p&gt;The module is dead simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ACM certificate with DNS validation&lt;/li&gt;
&lt;li&gt;S3 bucket for static hosting&lt;/li&gt;
&lt;li&gt;CloudFront distribution&lt;/li&gt;
&lt;li&gt;Cloudflare DNS records (both root and www)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Call it twice (once per brand), different inputs, same code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"graycloudarch"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../../modules/static-site"&lt;/span&gt;
  &lt;span class="nx"&gt;domain_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"graycloudarch.com"&lt;/span&gt;
  &lt;span class="nx"&gt;bucket_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"graycloudarch-website"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"cloudpatterns"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../../modules/static-site"&lt;/span&gt;
  &lt;span class="nx"&gt;domain_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"cloudpatterns.io"&lt;/span&gt;
  &lt;span class="nx"&gt;bucket_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"cloudpatterns-website"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No duplication. No drift. No "wait, which CloudFront ID goes with which domain?"&lt;/p&gt;

&lt;h2&gt;
  
  
  The Part Where I Screwed Up
&lt;/h2&gt;

&lt;p&gt;Of course it didn't work perfectly the first time.&lt;/p&gt;

&lt;p&gt;Turns out when you register a domain through Cloudflare, they helpfully create a default parking page DNS record. When Terraform tried to create my root CNAME, it failed with "record already exists."&lt;/p&gt;

&lt;p&gt;Took me 20 minutes to figure out I needed &lt;code&gt;allow_overwrite = true&lt;/code&gt; in the Cloudflare resource.&lt;/p&gt;
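
&lt;p&gt;For reference, a sketch of where that flag sits. It assumes a v4-era Cloudflare provider, where &lt;code&gt;cloudflare_record&lt;/code&gt; and &lt;code&gt;allow_overwrite&lt;/code&gt; are available, and a hypothetical &lt;code&gt;aws_cloudfront_distribution.site&lt;/code&gt; resource inside the module:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Sketch only: root record pointing the apex at CloudFront.
resource "cloudflare_record" "root" {
  zone_id = data.cloudflare_zone.site.id
  name    = "@"
  type    = "CNAME"
  value   = aws_cloudfront_distribution.site.domain_name  # hypothetical resource name

  # Take over the parking record Cloudflare created at registration
  # instead of failing with "record already exists".
  allow_overwrite = true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;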

&lt;p&gt;20 minutes I'll never get back. But at least it's documented in Git now, not lost in my bash history.&lt;/p&gt;

&lt;h2&gt;
  
  
  Would I Do This Again?
&lt;/h2&gt;

&lt;p&gt;Absolutely.&lt;/p&gt;

&lt;p&gt;Not because it's faster (it's not, the first time).&lt;/p&gt;

&lt;p&gt;Not because it's easier (it's definitely not).&lt;/p&gt;

&lt;p&gt;Because when I'm sitting at 2am writing my fifth blog post of the week and I realize I need to spin up a third site for a new product line, I can do it in 10 minutes instead of canceling my writing session to spend 45 minutes in AWS console.&lt;/p&gt;

&lt;p&gt;Automation is a bet on future you. I'm betting future Glenn will appreciate not having to remember how SSL validation works.&lt;/p&gt;

&lt;p&gt;Want the code? It's not open source (yet), but if you're building something similar and want to talk through the architecture, &lt;a href="https://graycloudarch.com/contact" rel="noopener noreferrer"&gt;hit me up&lt;/a&gt;. I'm always down to talk Terraform.&lt;/p&gt;

&lt;p&gt;Or if you just want to tell me I'm insane for spending 6 hours on this, that's cool too. My DMs are open.&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>aws</category>
      <category>cloudfront</category>
      <category>automation</category>
    </item>
    <item>
      <title>The IAM Trust Policy Chicken-and-Egg (That Isn't)</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Wed, 13 May 2026 17:55:53 +0000</pubDate>
      <link>https://dev.to/tallgray1/the-iam-trust-policy-chicken-and-egg-that-isnt-2ba5</link>
      <guid>https://dev.to/tallgray1/the-iam-trust-policy-chicken-and-egg-that-isnt-2ba5</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/iam-trust-policy-chicken-and-egg/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The pipeline role needed to trust the deployment role. The deployment role needed to trust the pipeline role. When I wrote both in Terraform and ran plan, it stopped:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: Cycle: module.pipeline.aws_iam_role.exec → module.deploy.aws_iam_role.target → module.pipeline.aws_iam_role.exec
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The instinct is to create one role first, then go back and edit the trust policy of the other after it exists. A manual bootstrap step. It works. It also means you can't &lt;code&gt;terraform apply&lt;/code&gt; from a clean state and get a working result — someone has to remember the second pass. The IaC tells half the story.&lt;/p&gt;

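&lt;p&gt;There's a better answer. IAM does check role ARNs placed directly in a trust policy's &lt;code&gt;Principal&lt;/code&gt; element, but it does not check ARN strings in the &lt;code&gt;Condition&lt;/code&gt; block: a constructed ARN matched through &lt;code&gt;aws:PrincipalArn&lt;/code&gt; is stored as-is, whether or not the role exists yet. The cycle Terraform sees is real; it's a real edge in its dependency graph. The underlying constraint that dependency represents is not.&lt;/p&gt;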

&lt;h2&gt;
  
  
  ARNs are deterministic before creation
&lt;/h2&gt;

&lt;p&gt;IAM role ARNs follow a fixed format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;arn:aws:iam::&amp;lt;account-id&amp;gt;:role/&amp;lt;role-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The account ID is fixed. The role name is chosen at definition time. Which means the full ARN is computable before &lt;code&gt;terraform apply&lt;/code&gt; runs — before the resource exists — as long as the name is stable.&lt;/p&gt;

&lt;p&gt;There is one catch. IAM validates a role or user ARN placed directly in the &lt;code&gt;Principal&lt;/code&gt; element: if the role doesn't exist yet, the call fails with &lt;code&gt;MalformedPolicyDocument: Invalid principal in policy&lt;/code&gt;. ARN strings in a &lt;code&gt;Condition&lt;/code&gt; block are not validated. Name the account root as the principal, match the constructed ARN with &lt;code&gt;aws:PrincipalArn&lt;/code&gt;, and IAM stores the JSON. The role becomes assumable once both sides exist, regardless of which one was created first.&lt;/p&gt;

&lt;p&gt;A direct &lt;code&gt;Principal&lt;/code&gt; reference behaves like referencing a nonexistent IAM role in an &lt;code&gt;aws_iam_role_policy_attachment&lt;/code&gt;: it fails at apply time because Terraform calls the API and gets an error. A condition value is just a string in the JSON document stored against the role. If the ARN in the condition doesn't resolve to an existing entity yet, IAM doesn't complain. It just doesn't match anything. Yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cycle Terraform sees
&lt;/h2&gt;

&lt;p&gt;The dependency graph problem is real. Here's the code that creates it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role"&lt;/span&gt; &lt;span class="s2"&gt;"role_a"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;assume_role_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AWS&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role_b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# depends on role_b&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role"&lt;/span&gt; &lt;span class="s2"&gt;"role_b"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;assume_role_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AWS&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role_a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# depends on role_a&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform resolves: &lt;code&gt;role_a&lt;/code&gt; needs &lt;code&gt;role_b&lt;/code&gt;'s ARN before creation → &lt;code&gt;role_b&lt;/code&gt; needs &lt;code&gt;role_a&lt;/code&gt;'s ARN before creation → cycle. It stops before creating either resource.&lt;/p&gt;

&lt;p&gt;The fix removes the dependency by computing what you already know:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_caller_identity"&lt;/span&gt; &lt;span class="s2"&gt;"current"&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="nx"&gt;locals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;account_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_caller_identity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;account_id&lt;/span&gt;

  &lt;span class="nx"&gt;role_a_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::${local.account_id}:role/${var.role_a_name}"&lt;/span&gt;
  &lt;span class="nx"&gt;role_b_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::${local.account_id}:role/${var.role_b_name}"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

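&lt;span class="c1"&gt;# IAM rejects a Principal role ARN that doesn't exist yet, so match it in a Condition instead&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role"&lt;/span&gt; &lt;span class="s2"&gt;"role_a"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role_a_name&lt;/span&gt;
  &lt;span class="nx"&gt;assume_role_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sts:AssumeRole"&lt;/span&gt;
      &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AWS&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::${local.account_id}:root"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# account root: always exists&lt;/span&gt;
      &lt;span class="nx"&gt;Condition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ArnEquals&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"aws:PrincipalArn"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role_b_arn&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# string, no Terraform dependency&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role"&lt;/span&gt; &lt;span class="s2"&gt;"role_b"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role_b_name&lt;/span&gt;
  &lt;span class="nx"&gt;assume_role_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sts:AssumeRole"&lt;/span&gt;
      &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AWS&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::${local.account_id}:root"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# account root: always exists&lt;/span&gt;
      &lt;span class="nx"&gt;Condition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ArnEquals&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"aws:PrincipalArn"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role_a_arn&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# string, no Terraform dependency&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;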
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No cycle. Both roles are created in a single apply. The trust relationship is live as soon as both resources exist, which they will after that same apply.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraycloudarch.com%2Fdiagrams%2Fdiag-iam-chicken-egg-cross-account.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgraycloudarch.com%2Fdiagrams%2Fdiag-iam-chicken-egg-cross-account.png" alt="Two-account cross-account IAM trust relationship. Both role ARNs are constructed from known values at plan time — no deploy order required."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this pattern appears in practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cross-account deployment pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CodePipeline execution role in account A assumes a deployment role in account B. The deployment role's trust policy needs to reference the pipeline role's ARN. Each Terraform root manages its own account's roles. The ARN construction pattern resolves the cross-account dependency: each module constructs the other account's role ARN from &lt;code&gt;var.pipeline_account_id&lt;/code&gt; and a known role name — values passed in at plan time from tfvars or remote state outputs.&lt;/p&gt;
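
&lt;p&gt;A rough sketch of the deployment-account side, under a few assumptions: &lt;code&gt;var.pipeline_role_name&lt;/code&gt; and the role name &lt;code&gt;deploy-target&lt;/code&gt; are hypothetical, and the trust policy names the pipeline account's root as the principal and matches the constructed ARN in a condition, because IAM rejects a role ARN in &lt;code&gt;Principal&lt;/code&gt; when that role does not exist yet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Sketch only: lives in the deployment account's Terraform root.
variable "pipeline_account_id" {
  type = string
}

variable "pipeline_role_name" {
  type = string # hypothetical variable name
}

locals {
  # Computed from inputs, so no dependency on the other account's state.
  pipeline_role_arn = "arn:aws:iam::${var.pipeline_account_id}:role/${var.pipeline_role_name}"
}

resource "aws_iam_role" "deploy" {
  name = "deploy-target" # placeholder name

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = "arn:aws:iam::${var.pipeline_account_id}:root" }
      Condition = { ArnEquals = { "aws:PrincipalArn" = local.pipeline_role_arn } }
    }]
  })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The pipeline role still needs its own identity policy allowing &lt;code&gt;sts:AssumeRole&lt;/code&gt; on the deployment role. That ARN can be constructed the same way, since resource ARNs in identity policies are plain strings too.&lt;/p&gt;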

&lt;p&gt;&lt;strong&gt;ECS task role and execution role&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Whoever registers or launches the task (a CI role, for example) needs &lt;code&gt;iam:PassRole&lt;/code&gt; to hand the task role and the execution role to ECS. Some teams want the task role's trust policy to explicitly list the execution role's ARN as the allowed principal. You don't need to: &lt;code&gt;ecs-tasks.amazonaws.com&lt;/code&gt; as the service principal removes the dependency entirely. But if your security posture requires explicit principal ARNs rather than the service principal, ARN construction handles it without a two-pass apply.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Permission boundary bootstrap with an SCP&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An SCP requires that all new IAM roles include a specific permission boundary policy. The boundary is a managed policy that must exist before any roles referencing it can be created. This isn't a circular dependency — it's a sequential one. The boundary policy must be applied first, separately. Construct its ARN deterministically (&lt;code&gt;arn:aws:iam::${var.account_id}:policy/${var.boundary_name}&lt;/code&gt;) and pass it in wherever roles are created. Document the bootstrap order with a Terraform &lt;code&gt;precondition&lt;/code&gt; block or a clear README section. Different problem, different fix.&lt;/p&gt;
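
&lt;p&gt;One way to make that ordering visible in code is sketched below. It swaps the &lt;code&gt;precondition&lt;/code&gt; block for a data source lookup, which should fail at plan time if the boundary has not been bootstrapped. The role name and the &lt;code&gt;ec2.amazonaws.com&lt;/code&gt; principal are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;variable "account_id" {
  type = string
}

variable "boundary_name" {
  type = string
}

locals {
  boundary_arn = "arn:aws:iam::${var.account_id}:policy/${var.boundary_name}"
}

# Looks the boundary up by its constructed ARN; errors if it is missing.
data "aws_iam_policy" "boundary" {
  arn = local.boundary_arn
}

resource "aws_iam_role" "workload" {
  name                 = "workload-role" # placeholder name
  permissions_boundary = data.aws_iam_policy.boundary.arn

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" } # placeholder principal
    }]
  })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;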

&lt;h2&gt;
  
  
  When the dependency is genuine
&lt;/h2&gt;

&lt;p&gt;There's a scenario that looks identical to this but isn't: when a Terraform provisioner or data source needs to actually &lt;em&gt;call&lt;/em&gt; a role — not just reference its ARN — during resource creation.&lt;/p&gt;

&lt;p&gt;Example: a &lt;code&gt;null_resource&lt;/code&gt; provisioner that runs &lt;code&gt;aws sts assume-role&lt;/code&gt; and then operates in the target account. Here you need the role to exist and be assumable before the provisioner fires. ARN construction doesn't help — you need the resource active at execution time, not just its string value known at plan time. The correct fix is explicit &lt;code&gt;depends_on&lt;/code&gt;, not local string construction.&lt;/p&gt;

&lt;p&gt;The distinction: static JSON referencing an ARN string (solvable with ARN construction) vs. a runtime API call that needs the resource actually live (solvable with &lt;code&gt;depends_on&lt;/code&gt;). If your code needs to &lt;em&gt;assume&lt;/em&gt; the role during apply, you need ordering. If it just needs to &lt;em&gt;name&lt;/em&gt; the role in a policy document, you don't.&lt;/p&gt;
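
&lt;p&gt;Here is one way the runtime case might look, assuming the role and the provisioner live in the same root and the AWS CLI is available where Terraform runs. The role name and session name are made up for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;variable "account_id" {
  type = string
}

locals {
  target_role_name = "platform-bootstrap" # hypothetical
  target_role_arn  = "arn:aws:iam::${var.account_id}:role/${local.target_role_name}"
}

resource "aws_iam_role" "target" {
  name = local.target_role_name

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = "arn:aws:iam::${var.account_id}:root" }
    }]
  })
}

resource "null_resource" "bootstrap" {
  # The command uses the constructed ARN, so Terraform sees no reference to
  # aws_iam_role.target. depends_on restores the ordering the call needs.
  depends_on = [aws_iam_role.target]

  provisioner "local-exec" {
    command = "aws sts assume-role --role-arn ${local.target_role_arn} --role-session-name tf-bootstrap"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;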

&lt;h2&gt;
  
  
  The trap in the fix
&lt;/h2&gt;

&lt;p&gt;Once you've internalized "construct ARNs deterministically," the next failure mode is &lt;strong&gt;role names that include Terraform-generated suffixes&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role"&lt;/span&gt; &lt;span class="s2"&gt;"role_a"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.prefix}-role-${random_id.suffix.hex}"&lt;/span&gt;  &lt;span class="c1"&gt;# ARN not deterministic until random_id exists&lt;/span&gt;
  &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the role name includes &lt;code&gt;random_id.suffix.hex&lt;/code&gt;, the ARN can't be computed until the &lt;code&gt;random_id&lt;/code&gt; resource is created. That brings the dependency back — you're back to needing a resource output to construct the name, and the cycle re-forms if any of those names are referenced in another role's trust policy.&lt;/p&gt;

&lt;p&gt;The fix is stable, predictable role names: &lt;code&gt;"${var.prefix}-${var.env}-pipeline"&lt;/code&gt; rather than generated suffixes. IAM role names are unique per account, not globally. The habit of appending random suffixes comes from S3 bucket naming, where global uniqueness is required. IAM doesn't have that constraint. There's no reason to make the name unpredictable.&lt;/p&gt;

&lt;p&gt;If you have existing roles with generated names and need their ARNs, they're deterministic &lt;em&gt;after&lt;/em&gt; the first apply — stored in state and readable via &lt;code&gt;aws_iam_role.role_a.arn&lt;/code&gt;. The construction approach is for cases where you control the naming and are defining the role name yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  What generalizes
&lt;/h2&gt;

&lt;p&gt;The IAM trust policy deadlock is the most common place engineers hit this pattern, but it's not the only one. Wherever you encounter a Terraform circular dependency involving a predictable string — ARNs, resource names, account IDs, region names — ask whether you actually need the resource output or whether you can compute the value from what you already know.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;data.aws_caller_identity.current.account_id&lt;/code&gt; gives you the account without creating a dependency on any resource. A stable name gives you the ARN. The dependency graph edge exists only because you referenced the resource — remove the reference by computing the value directly, and the cycle disappears.&lt;/p&gt;
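
&lt;p&gt;Put together, a sketch of the whole idea could look like the following. &lt;code&gt;var.prefix&lt;/code&gt; and &lt;code&gt;var.env&lt;/code&gt; follow the naming example from earlier; the &lt;code&gt;aws_partition&lt;/code&gt; lookup and the policy document are additions for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;variable "prefix" {
  type = string
}

variable "env" {
  type = string
}

data "aws_caller_identity" "current" {}
data "aws_partition" "current" {}

locals {
  # Stable name, no generated suffix.
  pipeline_role_name = "${var.prefix}-${var.env}-pipeline"

  # Computed from data sources and strings, not from the role resource.
  pipeline_role_arn = "arn:${data.aws_partition.current.partition}:iam::${data.aws_caller_identity.current.account_id}:role/${local.pipeline_role_name}"
}

# Names the role without referencing aws_iam_role.*, so this document
# adds no edge to the dependency graph.
data "aws_iam_policy_document" "allow_assume_pipeline" {
  statement {
    effect    = "Allow"
    actions   = ["sts:AssumeRole"]
    resources = [local.pipeline_role_arn]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;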

&lt;p&gt;The broader principle: Terraform's graph is built from references. References that aren't necessary are constraints that aren't necessary.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Untangling IAM architecture across multiple accounts — trust policies, permission boundaries, SCPs, cross-account assume-role chains — is where subtle errors compound quietly and the blast radius is real. &lt;a href="https://graycloudarch.com/contact/" rel="noopener noreferrer"&gt;I work on this regularly&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>iam</category>
      <category>terraform</category>
      <category>aws</category>
      <category>security</category>
    </item>
    <item>
      <title>What the first 24 hours of production CloudWatch data told us</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Mon, 04 May 2026 18:43:32 +0000</pubDate>
      <link>https://dev.to/tallgray1/what-the-first-24-hours-of-production-cloudwatch-data-told-us-1140</link>
      <guid>https://dev.to/tallgray1/what-the-first-24-hours-of-production-cloudwatch-data-told-us-1140</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/cloudwatch-go-live-24h/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The morning after go-live, the first thing I looked at was CPU. One of the two delivery services was sitting at 99.8% average utilization across 9 tasks. P50 latency: 1,010ms.&lt;/p&gt;

&lt;p&gt;We'd launched deliberately without autoscaling. The plan was to observe real traffic patterns before configuring a scaling policy: you can't tune a policy for a workload whose demand you haven't seen yet. What we didn't know was that the workload would reveal something about the task itself before we'd had a chance to watch it for a week.&lt;/p&gt;

&lt;p&gt;Thirty-six hours after go-live, we'd shipped right-sizing changes, a working autoscaling configuration, and a new observability source for ALB-layer signals. All of it came directly from what the first day of production data said. Here's how we read it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What 99.8% CPU means at 0.5 vCPU
&lt;/h2&gt;

&lt;p&gt;The service was allocated 512 ECS CPU units per task — half a vCPU. CloudWatch was telling us the tasks were spending essentially all of their scheduled CPU time working.&lt;/p&gt;

&lt;p&gt;The first instinct in this situation is to add tasks. Scale out horizontally. But adding more 0.5 vCPU containers when each one is already saturated doesn't change the constraint. In ECS, the scheduler distributes tasks across hosts, but the per-task CPU ceiling is set in the task definition. More tasks at ceiling is not materially different from fewer tasks at ceiling — you're distributing the same undersized unit more widely.&lt;/p&gt;

&lt;p&gt;The signal wasn't about count. It was about the unit itself.&lt;/p&gt;

&lt;p&gt;At 99.8% utilization, any burst in per-request processing time (a slow downstream API call, a cache miss, a spike in concurrent requests) turns into queuing, because the task has no headroom to absorb it. That's where the 1,010ms p50 comes from: not that individual requests are slow, but that the tasks are busy enough that requests wait before they even start processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Right-sizing the task before configuring the autoscaler
&lt;/h2&gt;

&lt;p&gt;We doubled the CPU allocation: 512 → 1,024 units. The rationale is mechanical once you see it: you can't configure a useful CPU-based autoscaling policy on a task that's already running at ceiling. If 100% CPU is the baseline, the autoscaler has nothing to respond to — it would scale out immediately on creation and never scale in.&lt;/p&gt;

&lt;p&gt;Target tracking at 70% CPU requires headroom. A 1 vCPU task running the same workload that previously pinned a 0.5 vCPU task should land around 50% utilization, provided actual demand was not far above the old ceiling. That sits below the target, leaves room to absorb variance before triggering a scale-out, and gives scale-in enough signal to be meaningful rather than noise.&lt;/p&gt;

&lt;p&gt;The second service had a different profile: 12 tasks, 1 vCPU each, hitting 92% at peak. Not saturated the same way, but thin on headroom. We went to 2 vCPU there.&lt;/p&gt;

&lt;p&gt;Two other services in the platform had the opposite problem: they were allocated more memory than they'd ever used. Those went the other direction, with overprovisioned memory cut back based on observed peaks. The same 24-hour data window showed both problems at once.&lt;/p&gt;

&lt;p&gt;Sequencing matters: &lt;strong&gt;right-size the task before you configure the autoscaler.&lt;/strong&gt; Otherwise you're teaching a scaling policy to respond to a signal that's already maxed out, and the first thing it does is scale out to a floor that's still running on undersized tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why we chose CPU tracking instead of request count
&lt;/h2&gt;

&lt;p&gt;The obvious autoscaling metric for an HTTP service is &lt;code&gt;ALBRequestCountPerTarget&lt;/code&gt;. The ALB knows the request rate per target group; scaling on that metric tracks load linearly and is highly predictable.&lt;/p&gt;

&lt;p&gt;We couldn't use it.&lt;/p&gt;

&lt;p&gt;The platform uses a cross-account Lambda to register ECS tasks with ALB target groups at boot. Because of how the registration bridge works, the ECS service resource is provisioned with &lt;code&gt;target_group_arn = null&lt;/code&gt; — the target group lives in a different account, and the service module doesn't know its ARN. &lt;code&gt;ALBRequestCountPerTarget&lt;/code&gt; requires the target group ARN to be known to the Application Auto Scaling policy. Without it, there's no way to wire the metric across accounts without building additional dependency plumbing.&lt;/p&gt;
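
&lt;p&gt;To make the constraint concrete, here is roughly what a request-count policy would need. This is a sketch, not our configuration: the two suffix variables, the cluster and service names, and the target value are all hypothetical. The point is the &lt;code&gt;resource_label&lt;/code&gt;, which is built from the load balancer and target group ARN suffixes that the service module doesn't have.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;variable "alb_arn_suffix" {
  type        = string
  description = "Form: app/NAME/ID (lives in the other account)"
}

variable "target_group_arn_suffix" {
  type        = string
  description = "Form: targetgroup/NAME/ID (lives in the other account)"
}

resource "aws_appautoscaling_target" "svc" {
  service_namespace  = "ecs"
  resource_id        = "service/example-cluster/example-service" # hypothetical
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 9
  max_capacity       = 20
}

resource "aws_appautoscaling_policy" "requests" {
  name               = "requests-per-target"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.svc.service_namespace
  resource_id        = aws_appautoscaling_target.svc.resource_id
  scalable_dimension = aws_appautoscaling_target.svc.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value = 1000 # hypothetical requests per target

    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      # Without both suffixes there is nothing to put here.
      resource_label = "${var.alb_arn_suffix}/${var.target_group_arn_suffix}"
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;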

&lt;p&gt;CPU target tracking at 70% was the correct second choice. For a CPU-bound workload — which 99.8% utilization confirms this is — CPU is a meaningful proxy for load. The metric was there, it was clean, and the task was now sized to make it useful.&lt;/p&gt;

&lt;p&gt;One thing worth noting: the cross-account registration bridge was the right architectural decision for the problem it solved. But it created a constraint three layers away in a scaling configuration we hadn't designed yet. Architecture decisions compound downstream. The fix here was straightforward; I've seen the same pattern take longer to untangle when the constraint wasn't recognized.&lt;/p&gt;

&lt;h2&gt;
  
  
  The observability gap app logs can't fill
&lt;/h2&gt;

&lt;p&gt;Application logs were already flowing to BetterStack from both services. We had route-level latency, HTTP status codes, request counts, error breakdowns — everything that happens inside a container.&lt;/p&gt;

&lt;p&gt;What the logs couldn't tell us was what happens above them. The ALB generates its own error signals: &lt;code&gt;HTTPCode_ELB_5XX_Count&lt;/code&gt; for errors the load balancer generates before a request reaches a container, &lt;code&gt;RejectedConnectionCount&lt;/code&gt; for connections the ALB rejects because it has reached its maximum number of connections, and &lt;code&gt;ActiveConnectionCount&lt;/code&gt; as a proxy for in-flight load on the load balancer. None of this appears in application logs. If the ALB had been dropping connections during the 99.8% CPU period, we would have had no signal in our observability platform.&lt;/p&gt;

&lt;p&gt;CloudWatch had the data. The gap was getting it into the same place as everything else.&lt;/p&gt;

&lt;p&gt;A Lambda in the infrastructure account, where the ALB lives, runs every 60 seconds, calls &lt;code&gt;GetMetricData&lt;/code&gt;, and ships structured JSON to BetterStack. One EventBridge rule, no ECS changes, effectively zero cost (one CloudWatch API call per minute against Lambda's free tier). The metrics land alongside the application data and show the ALB layer that the app logs are blind to.&lt;/p&gt;

&lt;p&gt;The design decision here was Lambda over an ECS sidecar. A sidecar would have run per-service, per-task, 24 hours a day, and required task definition changes across the platform. A single Lambda running once per minute in the account that owns the ALB costs nothing and touches no ECS configuration.&lt;/p&gt;
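
&lt;p&gt;The scheduling side of that design can be sketched in a few resources. This assumes the function itself is defined elsewhere and passed in by ARN and name; the variable and resource names are illustrative, and the shipping code is not shown.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;variable "metrics_lambda_arn" {
  type = string
}

variable "metrics_lambda_name" {
  type = string
}

# Fires once per minute.
resource "aws_cloudwatch_event_rule" "every_minute" {
  name                = "alb-metrics-every-minute"
  schedule_expression = "rate(1 minute)"
}

resource "aws_cloudwatch_event_target" "metrics_lambda" {
  rule = aws_cloudwatch_event_rule.every_minute.name
  arn  = var.metrics_lambda_arn
}

# Lets EventBridge invoke the function, scoped to this one rule.
resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowEventBridgeInvoke"
  action        = "lambda:InvokeFunction"
  function_name = var.metrics_lambda_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.every_minute.arn
}

# Attach to the function's execution role so it can read the metrics.
data "aws_iam_policy_document" "read_metrics" {
  statement {
    effect    = "Allow"
    actions   = ["cloudwatch:GetMetricData"]
    resources = ["*"]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;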

&lt;h2&gt;
  
  
  Autoscaling parameters worth explaining
&lt;/h2&gt;

&lt;p&gt;For the higher-load service: min=9, max=20, CPU target=70%, scale-out cooldown=60s, scale-in cooldown=300s.&lt;/p&gt;

&lt;p&gt;Setting &lt;code&gt;min_capacity&lt;/code&gt; to 9 — the current running task count — was deliberate. We'd just established that 9 tasks was a functional floor for this workload at current traffic levels. An autoscaler configured with min=2 or min=4 would have attempted to scale in on the first quiet period, bringing the service back to a state we knew was already under-provisioned. Anchoring the floor to the observed stable-state count prevents that while we accumulate enough autoscaling history to set a meaningful long-term floor.&lt;/p&gt;

&lt;p&gt;The asymmetric cooldowns — 60 seconds for scale-out, 5 minutes for scale-in — reflect the cost asymmetry of being wrong in each direction. Scaling out too slowly during a load spike means requests queue. Scaling in too aggressively during a brief quiet period means tasks are killed and restarted unnecessarily. The 5-minute scale-in cooldown is conservative; we'll revisit it once we have a week of data showing where the service naturally stabilizes.&lt;/p&gt;
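
&lt;p&gt;Expressed in Terraform, those parameters might look like this. The numbers are the ones listed above; the cluster name, service name, and policy name are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "aws_appautoscaling_target" "delivery" {
  service_namespace  = "ecs"
  resource_id        = "service/example-cluster/delivery" # hypothetical
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 9  # observed stable-state task count
  max_capacity       = 20
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "cpu-target-70"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.delivery.service_namespace
  resource_id        = aws_appautoscaling_target.delivery.resource_id
  scalable_dimension = aws_appautoscaling_target.delivery.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value       = 70
    scale_out_cooldown = 60  # seconds: react quickly to load
    scale_in_cooldown  = 300 # seconds: be slow to remove capacity

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;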

&lt;h2&gt;
  
  
  What 24 hours of data drove
&lt;/h2&gt;

&lt;p&gt;We launched expecting to spend the first week observing. What the data delivered instead was a complete picture of three distinct problems: a task sizing issue that was causing queuing, a scaling policy that needed the right foundation before it could be configured, and an observability gap for a class of signals that app logs fundamentally can't surface.&lt;/p&gt;

&lt;p&gt;All three were solved from the same 24-hour data window. The pre-launch load testing hadn't revealed any of them — synthetic traffic and production ad-bidding traffic have different CPU profiles, and you don't know which until the real thing runs.&lt;/p&gt;

&lt;p&gt;The thing I'd change if running this again: put a structured post-launch data review into the go-live plan, not the next morning's to-do list. Not a formal incident review — a deliberate hour with CloudWatch after the first day's traffic has run through. The data is there. The question is whether you've planned to look at it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you're planning a production go-live and want a structured approach to post-launch data review and stabilization — or you're staring at a service running at ceiling with no autoscaling — &lt;a href="https://graycloudarch.com/contact/" rel="noopener noreferrer"&gt;get in touch&lt;/a&gt;. This is the kind of platform work I do regularly, and the pattern here applies well beyond ad delivery.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ecs</category>
      <category>cloudwatch</category>
      <category>autoscaling</category>
      <category>rightsizing</category>
    </item>
    <item>
      <title>DNS Validation: From 15 Steps to Zero</title>
      <dc:creator>Glenn Gray</dc:creator>
      <pubDate>Sat, 04 Apr 2026 22:30:30 +0000</pubDate>
      <link>https://dev.to/tallgray1/dns-validation-from-15-steps-to-zero-1nng</link>
      <guid>https://dev.to/tallgray1/dns-validation-from-15-steps-to-zero-1nng</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://graycloudarch.com/blog/dns-hell-to-automated/" rel="noopener noreferrer"&gt;graycloudarch.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You know what's the worst part of launching a new site?&lt;/p&gt;

&lt;p&gt;SSL certificate validation.&lt;/p&gt;

&lt;p&gt;Not creating the cert—that's one click in AWS ACM. It's the validation dance:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AWS gives you a CNAME record: &lt;code&gt;_abc123extremely-long-string-here.graycloudarch.com&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The value is equally ridiculous: &lt;code&gt;_xyz789another-massive-string.acm-validations.aws.&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;You copy it (pray you don't miss a character)&lt;/li&gt;
&lt;li&gt;Switch to Cloudflare (or Route 53, or wherever)&lt;/li&gt;
&lt;li&gt;Paste it in&lt;/li&gt;
&lt;li&gt;Wait 5-10 minutes&lt;/li&gt;
&lt;li&gt;Refresh AWS console&lt;/li&gt;
&lt;li&gt;Still pending...&lt;/li&gt;
&lt;li&gt;Refresh again&lt;/li&gt;
&lt;li&gt;Finally validated!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now do it again for &lt;code&gt;www.graycloudarch.com&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;And then repeat the whole thing for your second domain.&lt;/p&gt;

&lt;p&gt;This is "DNS hell."&lt;/p&gt;

&lt;h2&gt;
  
  
  There's a Better Way
&lt;/h2&gt;

&lt;p&gt;Terraform can read AWS validation records and create them in Cloudflare automatically.&lt;/p&gt;

&lt;p&gt;Zero copy-paste. Zero browser tab switching. Zero waiting and refreshing.&lt;/p&gt;

&lt;p&gt;Here's the whole thing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Request certificate&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_acm_certificate"&lt;/span&gt; &lt;span class="s2"&gt;"site"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;domain_name&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"graycloudarch.com"&lt;/span&gt;
  &lt;span class="nx"&gt;validation_method&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DNS"&lt;/span&gt;
  &lt;span class="nx"&gt;subject_alternative_names&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"www.graycloudarch.com"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Create validation records in Cloudflare&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"cloudflare_record"&lt;/span&gt; &lt;span class="s2"&gt;"cert_validation"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt; &lt;span class="nx"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;aws_acm_certificate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;site&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;domain_validation_options&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt;
    &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;domain_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;name&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_record_name&lt;/span&gt;
      &lt;span class="nx"&gt;value&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_record_value&lt;/span&gt;
      &lt;span class="nx"&gt;type&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_record_type&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;zone_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cloudflare_zone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;site&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;value&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt;
  &lt;span class="nx"&gt;proxied&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;  &lt;span class="c1"&gt;# Critical - ACM validation breaks with proxy&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Wait for validation&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_acm_certificate_validation"&lt;/span&gt; &lt;span class="s2"&gt;"site"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;certificate_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_acm_certificate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;site&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;validation_record_fqdns&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nx"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;record&lt;/span&gt; &lt;span class="nx"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;cloudflare_record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cert_validation&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hostname&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run &lt;code&gt;terraform apply&lt;/code&gt;. Go make coffee. Come back to a validated certificate.&lt;/p&gt;
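
&lt;p&gt;The snippet references &lt;code&gt;data.cloudflare_zone.site&lt;/code&gt; without showing it. A plausible declaration, assuming version 4 of the Cloudflare provider (where &lt;code&gt;cloudflare_record&lt;/code&gt; and its &lt;code&gt;hostname&lt;/code&gt; attribute exist), would be:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;terraform {
  required_providers {
    cloudflare = {
      source  = "cloudflare/cloudflare"
      version = "~&amp;gt; 4.0" # assumption: the 4.x resource names used above
    }
  }
}

# Looks up the zone ID by domain name.
data "cloudflare_zone" "site" {
  name = "graycloudarch.com"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;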

&lt;h2&gt;
  
  
  The Magic: for_each
&lt;/h2&gt;

&lt;p&gt;The key is this part:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt; &lt;span class="nx"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;aws_acm_certificate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;site&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;domain_validation_options&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt;
  &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;domain_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dvo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_record_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AWS generates validation records dynamically (one for apex domain, one for www). Terraform reads them, loops over them, and creates each one in Cloudflare.&lt;/p&gt;

&lt;p&gt;You never see the records. You never copy anything. It just works.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Screwed Up
&lt;/h2&gt;

&lt;p&gt;First time I ran this, ACM validation timed out after 30 minutes.&lt;/p&gt;

&lt;p&gt;The problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;proxied&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  &lt;span class="c1"&gt;# Wrong!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cloudflare's proxy rewrites DNS responses. With the proxy on, a CNAME is answered with Cloudflare's own IP addresses instead of the CNAME target, so ACM's validation lookup never sees the record it is looking for.&lt;/p&gt;

&lt;p&gt;The fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;proxied&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;  &lt;span class="c1"&gt;# Correct&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DNS-only mode. No proxy. ACM validation works.&lt;/p&gt;

&lt;p&gt;Cost me 30 minutes of debugging. Now it's in code so I never hit it again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;I'm running two brands: graycloudarch.com and cloudpatterns.io.&lt;/p&gt;

&lt;p&gt;Manual approach: 15 steps per domain = 30 steps total. 30 minutes minimum. High chance of typos.&lt;/p&gt;

&lt;p&gt;Terraform approach: One &lt;code&gt;terraform apply&lt;/code&gt;. 5 minutes to write the code (once), 10 minutes for AWS to validate. Then copy-paste the pattern for the second domain.&lt;/p&gt;

&lt;p&gt;When I launch my third brand (and I will), it'll take 5 minutes and one terraform apply.&lt;/p&gt;

&lt;p&gt;That's the bet: upfront automation for long-term velocity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Part People Miss
&lt;/h2&gt;

&lt;p&gt;Most Terraform tutorials stop at requesting the certificate. They don't show you the validation loop or the waiting resource.&lt;/p&gt;

&lt;p&gt;Without &lt;code&gt;aws_acm_certificate_validation&lt;/code&gt;, the apply finishes as soon as the certificate is requested, while the certificate is still "Pending Validation" in AWS. Try to use it in CloudFront at that point and it fails.&lt;/p&gt;

&lt;p&gt;You'd have to run &lt;code&gt;terraform apply&lt;/code&gt; again later, after manually checking that validation completed.&lt;/p&gt;

&lt;p&gt;That's not automation—that's just documentation.&lt;/p&gt;

&lt;p&gt;The waiting resource makes it truly hands-off.&lt;/p&gt;
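
&lt;p&gt;The waiting only helps if downstream resources reference the validation resource rather than the certificate. A minimal sketch of a CloudFront distribution doing that, assuming the certificate was requested in us-east-1 as CloudFront requires; the origin hostname and IDs are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;data "aws_cloudfront_cache_policy" "optimized" {
  name = "Managed-CachingOptimized"
}

resource "aws_cloudfront_distribution" "site" {
  enabled = true
  aliases = ["graycloudarch.com", "www.graycloudarch.com"]

  origin {
    domain_name = "origin.example.com" # hypothetical origin
    origin_id   = "site-origin"

    custom_origin_config {
      http_port              = 80
      https_port             = 443
      origin_protocol_policy = "https-only"
      origin_ssl_protocols   = ["TLSv1.2"]
    }
  }

  default_cache_behavior {
    target_origin_id       = "site-origin"
    viewer_protocol_policy = "redirect-to-https"
    allowed_methods        = ["GET", "HEAD"]
    cached_methods         = ["GET", "HEAD"]
    cache_policy_id        = data.aws_cloudfront_cache_policy.optimized.id
  }

  restrictions {
    geo_restriction {
      restriction_type = "none"
    }
  }

  viewer_certificate {
    # Referencing the validation resource makes Terraform wait until the
    # certificate is issued before creating the distribution.
    acm_certificate_arn      = aws_acm_certificate_validation.site.certificate_arn
    ssl_support_method       = "sni-only"
    minimum_protocol_version = "TLSv1.2_2021"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;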

&lt;h2&gt;
  
  
  Scaling It
&lt;/h2&gt;

&lt;p&gt;Adding a second domain is 10 lines of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_acm_certificate"&lt;/span&gt; &lt;span class="s2"&gt;"cloudpatterns"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;domain_name&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"cloudpatterns.io"&lt;/span&gt;
  &lt;span class="nx"&gt;validation_method&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DNS"&lt;/span&gt;
  &lt;span class="nx"&gt;subject_alternative_names&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"www.cloudpatterns.io"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"cloudflare_record"&lt;/span&gt; &lt;span class="s2"&gt;"cloudpatterns_validation"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="cm"&gt;/* same pattern */&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_acm_certificate_validation"&lt;/span&gt; &lt;span class="s2"&gt;"cloudpatterns"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same pattern, different names. No clicking. No switching between consoles. No remembering which validation record goes where.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Win
&lt;/h2&gt;

&lt;p&gt;It's not the time savings (though 30 minutes per deployment adds up).&lt;/p&gt;

&lt;p&gt;It's the mental overhead.&lt;/p&gt;

&lt;p&gt;Manual DNS configuration requires focus. "Did I copy the whole string? Did I add the trailing dot? Is it DNS-only mode?"&lt;/p&gt;

&lt;p&gt;Terraform requires running one command. That's it.&lt;/p&gt;

&lt;p&gt;I get my focus back. I can write this blog post while Terraform validates certificates.&lt;/p&gt;

&lt;p&gt;Want the full code? It's not open source (yet), but if you're building something similar and want to talk through it, &lt;a href="https://graycloudarch.com/contact" rel="noopener noreferrer"&gt;reach out&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Or if you just want to tell me I'm overthinking this and should've clicked through Cloudflare like a normal person, that's cool too.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>terraform</category>
      <category>cloudflare</category>
      <category>dns</category>
    </item>
  </channel>
</rss>
