<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: hugolesta</title>
    <description>The latest articles on DEV Community by hugolesta (@hugolesta).</description>
    <link>https://dev.to/hugolesta</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F199721%2Fb7aa0261-c82c-4a48-9701-2c03e14d3102.jpeg</url>
      <title>DEV Community: hugolesta</title>
      <link>https://dev.to/hugolesta</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hugolesta"/>
    <language>en</language>
    <item>
      <title>How to Upgrade EKS 1.32: Making the Switch from bootstrap.sh to nodeadm</title>
      <dc:creator>hugolesta</dc:creator>
      <pubDate>Tue, 28 Oct 2025 17:10:23 +0000</pubDate>
      <link>https://dev.to/hugolesta/how-to-upgrade-eks-132-making-the-switch-from-bootstrapsh-to-nodeadm-9lf</link>
      <guid>https://dev.to/hugolesta/how-to-upgrade-eks-132-making-the-switch-from-bootstrapsh-to-nodeadm-9lf</guid>
      <description>&lt;p&gt;Did you know staying on a deprecated version of Kubernetes in EKS can cost you six times more? At $0.60 per hour instead of the standard $0.10 per hour, you could end up paying nearly $500 per month just for running outdated clusters.&lt;/p&gt;

&lt;p&gt;However, the EKS upgrade from version 1.31 to 1.32 isn't just about avoiding extended support fees—it introduces one of the biggest under-the-hood changes in recent EKS history. Specifically, the traditional bootstrap.sh script used for years to configure worker nodes is now replaced by a new tool called &lt;code&gt;nodeadm&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This architectural shift coincides with another critical change: after November 26, 2025, Amazon EKS will no longer publish EKS-optimized Amazon Linux 2 (AL2) AMIs. Furthermore, Kubernetes 1.32 will be the final version with AL2 AMI support; from version 1.33 onwards, only Amazon Linux 2023 (AL2023) and Bottlerocket-based AMIs will be supported.&lt;/p&gt;

&lt;p&gt;If your clusters currently run EKS 1.31 or earlier on Amazon Linux 2 AMIs, upgrading to 1.32 will break your node initialization unless you adapt your user-data scripts and switch to AL2023. Throughout this article, we'll walk through exactly what changes between versions 1.31 and 1.32, why &lt;code&gt;nodeadm&lt;/code&gt; is now required, and how to rewrite your user-data and Terraform templates to ensure a smooth transition.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why EKS 1.32 Requires a New Bootstrap Approach
&lt;/h2&gt;

&lt;p&gt;Amazon's evolution of EKS introduces significant architectural changes with version 1.32. The most critical change affects how worker nodes bootstrap and join your clusters, requiring careful attention during upgrades.&lt;/p&gt;

&lt;h2&gt;
  
  
  End of Support for bootstrap.sh in AL2023
&lt;/h2&gt;

&lt;p&gt;EKS 1.32 marks a pivotal shift as it's the &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions-standard.html" rel="noopener noreferrer"&gt;final version for which Amazon will release Amazon Linux 2 (AL2) AMIs&lt;/a&gt;. AL2023 uses a completely different node initialization process that abandons the traditional &lt;code&gt;/etc/eks/bootstrap.sh&lt;/code&gt; script. That script has been the foundation of EKS node bootstrapping since the service launched, but it is absent entirely from the AL2023 operating system, along with the bash-based bootstrap approach that many DevOps teams have built automation around.&lt;/p&gt;

&lt;h2&gt;
  
  
  nodeadm as the New Default Bootstrap Tool
&lt;/h2&gt;

&lt;p&gt;AL2023 replaces bootstrap.sh with &lt;code&gt;nodeadm&lt;/code&gt;, a tool that uses a YAML configuration schema. Unlike the previous approach where metadata was discovered automatically through the Amazon EKS &lt;code&gt;DescribeCluster&lt;/code&gt; API call, &lt;code&gt;nodeadm&lt;/code&gt; requires explicit provision of cluster information. This fundamental change means you must now specify three critical parameters that were previously auto-discovered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;apiServerEndpoint&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;certificateAuthority&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;the service CIDR (&lt;code&gt;cidr&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additionally, the way kubelet parameters are applied has changed. What was previously passed via &lt;code&gt;--kubelet-extra-args&lt;/code&gt; is now declared in the &lt;code&gt;kubelet&lt;/code&gt; section of the &lt;code&gt;NodeConfig&lt;/code&gt; spec. Because each node no longer calls the EKS API to discover its cluster metadata, this shift also reduces the risk of API throttling during large-scale node deployments.&lt;/p&gt;
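
&lt;p&gt;As a minimal sketch (every value below is a placeholder you'd replace with your own cluster's data), the resulting &lt;code&gt;nodeadm&lt;/code&gt; configuration looks like this:&lt;/p&gt;

```yaml
# Hypothetical minimal NodeConfig; all values are placeholders.
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  cluster:
    name: my-cluster
    apiServerEndpoint: https://EXAMPLE1234567890.gr7.us-east-1.eks.amazonaws.com
    certificateAuthority: LS0tLS1CRUdJTi...   # base64-encoded CA bundle
    cidr: 172.20.0.0/16
  kubelet:
    flags:
      - --node-labels=team=platform   # replaces --kubelet-extra-args
```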

&lt;h2&gt;
  
  
  Impact on Existing AL2-Based Clusters
&lt;/h2&gt;

&lt;p&gt;The elimination of bootstrap.sh creates immediate backward compatibility issues. When upgrading an EKS cluster to version 1.32 while still using AL2-based node groups, &lt;a href="https://medium.com/@kalidindi-naveen/eks-ami-upgradation-journey-from-amazon-linux-2-to-amazon-linux-2023-385c1d958b27" rel="noopener noreferrer"&gt;nodes will fail to join the cluster&lt;/a&gt;. Any automation depending on bootstrap.sh will break, as files like &lt;code&gt;/etc/eks/bootstrap.sh&lt;/code&gt; and &lt;code&gt;/etc/eks/eni-max-pods.txt&lt;/code&gt; no longer exist.&lt;/p&gt;

&lt;p&gt;For organizations with self-managed node groups or custom AMI configurations, this requires extracting and explicitly providing cluster metadata that was formerly obtained automatically. Consequently, deployment scripts, Terraform modules, and CloudFormation templates must be rewritten to align with the new declarative approach before successfully migrating to EKS 1.32.&lt;/p&gt;
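
&lt;p&gt;As a sketch of what that rewrite can look like in Terraform (the cluster name and resource names are illustrative), the metadata &lt;code&gt;nodeadm&lt;/code&gt; needs can be read from a data source and rendered into the launch template's user data:&lt;/p&gt;

```hcl
# Hypothetical sketch: read the cluster metadata that nodeadm needs explicitly,
# then render it as a NodeConfig document for the launch template user data.
data "aws_eks_cluster" "this" {
  name = "my-cluster" # illustrative name
}

locals {
  node_config = yamlencode({
    apiVersion = "node.eks.aws/v1alpha1"
    kind       = "NodeConfig"
    spec = {
      cluster = {
        name                 = data.aws_eks_cluster.this.name
        apiServerEndpoint    = data.aws_eks_cluster.this.endpoint
        certificateAuthority = data.aws_eks_cluster.this.certificate_authority[0].data
        cidr                 = data.aws_eks_cluster.this.kubernetes_network_config[0].service_ipv4_cidr
      }
    }
  })
}

resource "aws_launch_template" "al2023_nodes" {
  name_prefix = "eks-al2023-"
  user_data   = base64encode(local.node_config)
}
```

&lt;p&gt;Depending on how the node group consumes the AMI, the NodeConfig may need to be wrapped in a MIME multipart user-data document; verify the expected format for your node group type before rolling it out.&lt;/p&gt;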

&lt;h2&gt;
  
  
  Preparing for the Migration to nodeadm
&lt;/h2&gt;

&lt;p&gt;Before migrating to EKS 1.32 with &lt;code&gt;nodeadm&lt;/code&gt;, careful preparation is essential to ensure a smooth transition from the traditional bootstrap approach to the new paradigm.&lt;/p&gt;

&lt;h2&gt;
  
  
  Identifying Affected Clusters Running AL2
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/eks-ami-deprecation-faqs.html" rel="noopener noreferrer"&gt;After November 26, 2025&lt;/a&gt;, AWS will end support for EKS AL2-optimized AMIs. Kubernetes version 1.32 represents the final release where Amazon EKS will provide AL2 AMIs. This deadline necessitates prompt action, especially for organizations with multiple clusters.&lt;/p&gt;

&lt;p&gt;To identify affected clusters:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check the AMI type for each node group using:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws eks describe-nodegroup --cluster-name &amp;lt;cluster-name&amp;gt; --nodegroup-name &amp;lt;nodegroup-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Examine existing user-data scripts that reference &lt;code&gt;/etc/eks/bootstrap.sh&lt;/code&gt;, which won't exist in AL2023.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Choosing Between AL2023 and Bottlerocket AMIs
&lt;/h2&gt;

&lt;p&gt;Upon identifying clusters requiring migration, you must decide between AL2023 and Bottlerocket as your future node operating system.&lt;/p&gt;

&lt;p&gt;AL2023 offers several advantages over AL2:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Secure-by-default approach with preconfigured security policies&lt;/li&gt;
&lt;li&gt;SELinux in permissive mode&lt;/li&gt;
&lt;li&gt;IMDSv2-only mode enabled by default&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/eks-ami-deprecation-faqs.html" rel="noopener noreferrer"&gt;Optimized boot times and improved package management&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Alternatively, Bottlerocket provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Purpose-built container-optimized design with minimal attack surface&lt;/li&gt;
&lt;li&gt;Enhanced security with read-only file systems&lt;/li&gt;
&lt;li&gt;Automatic updates&lt;/li&gt;
&lt;li&gt;Improved compliance with security standards like CIS benchmarks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Choose AL2023 when you need significant customizations with direct OS-level access. Opt for Bottlerocket if you prefer a container-native approach with minimal node customization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scanning for Deprecated APIs with kubent or pluto
&lt;/h2&gt;

&lt;p&gt;Prior to upgrading, scan for deprecated APIs that might break during the transition. Kubernetes frequently removes beta APIs with each new version, potentially disrupting your workloads.&lt;/p&gt;

&lt;p&gt;The kube-no-trouble tool (kubent) efficiently identifies resources using deprecated APIs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This scans all accessible namespaces and lists APIs that will be deprecated compared to your current Kubernetes version. For clusters with hundreds of applications across multiple namespaces, this tool proves invaluable in detecting potential upgrade issues beforehand.&lt;/p&gt;

&lt;h2&gt;
  
  
  Backing Up Cluster State and Node Configurations
&lt;/h2&gt;

&lt;p&gt;A comprehensive migration plan should always include thorough backups. Given that you cannot downgrade an EKS cluster after upgrading, backups become crucial.&lt;/p&gt;

&lt;p&gt;Recommended backup steps include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backing up Kubernetes objects and persistent volumes with tools like Velero (EKS manages etcd for you, so direct etcd snapshots aren't available)&lt;/li&gt;
&lt;li&gt;Documenting node-specific configurations, especially custom user-data scripts&lt;/li&gt;
&lt;li&gt;Preserving IAM role configurations that will need migration to the new nodeadm format&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additionally, establish a documented rollback procedure with well-defined testing protocols before proceeding with the upgrade.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing nodeadm and AL2023
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuelrvrgw68lne8v0iqea.WEBP" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuelrvrgw68lne8v0iqea.WEBP" alt="How EKS cluster works" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Image Source: AWS Documentation&lt;/p&gt;
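
&lt;p&gt;For self-managed node groups, the user data handed to AL2023 nodes is typically a MIME multipart document whose &lt;code&gt;application/node.eks.aws&lt;/code&gt; part carries the NodeConfig. A sketch, with placeholder values throughout:&lt;/p&gt;

```
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="BOUNDARY"

--BOUNDARY
Content-Type: application/node.eks.aws

---
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  cluster:
    name: my-cluster
    apiServerEndpoint: https://EXAMPLE1234567890.gr7.us-east-1.eks.amazonaws.com
    certificateAuthority: LS0tLS1CRUdJTi...
    cidr: 172.20.0.0/16

--BOUNDARY--
```

&lt;p&gt;&lt;code&gt;nodeadm&lt;/code&gt; reads this at boot and performs the initialization that &lt;code&gt;bootstrap.sh&lt;/code&gt; used to handle.&lt;/p&gt;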

&lt;h2&gt;
  
  
  Debugging and Validating the Upgrade
&lt;/h2&gt;

&lt;p&gt;After implementing &lt;code&gt;nodeadm&lt;/code&gt; and upgrading to EKS 1.32, troubleshooting becomes essential as new components may not function perfectly on the first try. The migration introduces different debugging approaches that we need to master.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common nodeadm Errors and How to Fix Them
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;nodeadm&lt;/code&gt; debug command serves as our first line of defense when troubleshooting unhealthy or misconfigured nodes. It validates critical requirements including network access to AWS APIs, credentials for the IAM role, connectivity to the EKS Kubernetes API endpoint, and authentication with the EKS cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nodeadm debug -c file://nodeConfig.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For configuration validation before implementation, the &lt;code&gt;nodeadm config check&lt;/code&gt; command proves invaluable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nodeadm config check -c file://nodeConfig.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most permission-related issues arise from the Hybrid Nodes IAM role &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/hybrid-nodes-troubleshooting.html" rel="noopener noreferrer"&gt;missing the necessary eks:DescribeCluster action&lt;/a&gt;. Other common errors include network connectivity problems, incorrect node IP configuration, and timeout issues which can be remedied by extending timeouts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nodeadm install K8S_VERSION --credential-provider CREDS_PROVIDER --timeout 20m0s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Verifying kubelet and containerd Startup Logs
&lt;/h2&gt;

&lt;p&gt;Examining kubelet logs offers visibility into node initialization problems. For AL2023 nodes, we can use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;systemctl status kubelet
journalctl -u kubelet -o cat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Moreover, checking the status helps verify successful restarts after upgrades. For deeper troubleshooting, we can connect to the node using SSH or &lt;code&gt;kubectl debug&lt;/code&gt; and inspect the logs. My personal preference is &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager.html" rel="noopener noreferrer"&gt;AWS Systems Manager Session Manager&lt;/a&gt;, which avoids opening SSH ports entirely. When using &lt;code&gt;kubectl debug node&lt;/code&gt;, the host filesystem is mounted at &lt;code&gt;/host&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;chroot /host journalctl -u kubelet -o cat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Checking Node Readiness and Taints
&lt;/h2&gt;

&lt;p&gt;To verify node status after migration, we use standard kubectl commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get nodes -o wide
kubectl describe node NODE_NAME
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The STATUS column should display "Ready" for all nodes, with the updated version number visible. Nodes should also carry no unexpected taints; otherwise workloads won't schedule onto them. Furthermore, checking node-related events can reveal issues with node registration or initialization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Validating Add-on Compatibility Post-Upgrade
&lt;/h2&gt;

&lt;p&gt;Post-upgrade add-on validation requires checking deployment versions and ensuring pods are running correctly. For example, to verify the CoreDNS version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe deployment coredns -n kube-system | grep Image | cut -d ':' -f 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To inspect add-on logs for errors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs -n kube-system -l k8s-app=kube-dns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For networking add-ons like Amazon VPC CNI, we should create test pods to validate IP assignment. Additionally, testing CoreDNS functionality using tools like nslookup ensures proper DNS resolution. Finally, checking if the number of replicas equals the number of nodes for daemonset add-ons like vpc-cni and kube-proxy confirms proper deployment.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Migrating to EKS 1.32 represents a significant paradigm shift for Kubernetes operations on AWS. The transition from &lt;code&gt;bootstrap.sh&lt;/code&gt; to &lt;code&gt;nodeadm&lt;/code&gt; fundamentally changes how worker nodes join your clusters, requiring careful planning and execution. Additionally, the impending deprecation of Amazon Linux 2 AMIs after November 26, 2025, creates urgency for organizations to adapt their infrastructure.&lt;/p&gt;

&lt;p&gt;Throughout this upgrade journey, you must remember that &lt;code&gt;nodeadm&lt;/code&gt; demands explicit configuration through YAML rather than command-line arguments. This declarative approach actually offers better consistency and reproducibility for node configurations once implemented correctly. Undoubtedly, the initial migration might seem daunting, especially when rewriting existing automation scripts or Terraform modules.&lt;/p&gt;

&lt;p&gt;The choice between AL2023 and Bottlerocket AMIs depends largely on your specific operational requirements. AL2023 provides a familiar environment with improved security features, whereas Bottlerocket offers a container-optimized approach with minimal attack surface. Regardless of your choice, both options eliminate the traditional bootstrap files and require adaptation to the new &lt;code&gt;nodeadm&lt;/code&gt; paradigm.&lt;/p&gt;

&lt;p&gt;Before initiating any upgrade, thorough preparation becomes essential. First, identify affected clusters running AL2, then scan for deprecated APIs, and finally back up your cluster state. After implementing the necessary changes, debugging tools like &lt;code&gt;nodeadm debug&lt;/code&gt; and &lt;code&gt;nodeadm config check&lt;/code&gt; help troubleshoot any issues that arise during the migration process.&lt;/p&gt;

&lt;p&gt;The EKS 1.32 upgrade, therefore, presents both challenges and opportunities. While it requires significant changes to existing workflows, it also aligns your infrastructure with AWS's future direction. Consequently, organizations that proactively adapt their node initialization processes will avoid extended support fees and benefit from improved security and performance features of newer operating systems.&lt;/p&gt;

&lt;p&gt;Ultimately, staying current with EKS versions not only saves operational costs but also ensures compatibility with the evolving Kubernetes ecosystem. Though this particular upgrade demands more effort than typical version bumps, the long-term benefits of embracing &lt;code&gt;nodeadm&lt;/code&gt; and AL2023 far outweigh the initial investment required for migration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;EKS 1.32 introduces the most significant architectural change in recent history, requiring organizations to abandon the traditional bootstrap.sh script in favor of nodeadm for worker node initialization.&lt;/p&gt;

&lt;p&gt;• &lt;strong&gt;Critical deadline approaching:&lt;/strong&gt; Amazon ends AL2 AMI support on November 26, 2025, making EKS 1.32 the final version supporting Amazon Linux 2.&lt;/p&gt;

&lt;p&gt;• &lt;strong&gt;Bootstrap method completely changes:&lt;/strong&gt; nodeadm replaces bootstrap.sh and requires explicit YAML configuration instead of automatic cluster metadata discovery.&lt;/p&gt;

&lt;p&gt;• &lt;strong&gt;Migration requires careful preparation:&lt;/strong&gt; Identify AL2-based clusters, scan for deprecated APIs with kubent, and backup cluster state before upgrading.&lt;/p&gt;

&lt;p&gt;• &lt;strong&gt;Choose your AMI strategy:&lt;/strong&gt; Select between AL2023 for familiar environments with enhanced security or Bottlerocket for container-optimized minimal attack surface.&lt;/p&gt;

&lt;p&gt;• &lt;strong&gt;Debug with new tools:&lt;/strong&gt; Use nodeadm debug and nodeadm config check commands to troubleshoot configuration issues and validate node readiness.&lt;/p&gt;

&lt;p&gt;• &lt;strong&gt;Avoid costly extended support:&lt;/strong&gt; Staying on deprecated Kubernetes versions costs 6x more ($0.60/hour vs $0.10/hour), potentially adding $500+ monthly per cluster.&lt;/p&gt;

&lt;p&gt;The shift to nodeadm represents AWS's commitment to more secure, declarative infrastructure management. While the initial migration requires significant effort in rewriting automation scripts and Terraform modules, organizations that proactively adapt will benefit from improved security, performance, and long-term cost savings.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q1. What is the main change in EKS 1.32 that affects node initialization?&lt;/strong&gt; EKS 1.32 replaces the traditional bootstrap.sh script with a new tool called nodeadm for worker node initialization. This change requires explicit YAML configuration instead of automatic cluster metadata discovery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q2. When will Amazon stop supporting Amazon Linux 2 (AL2) AMIs for EKS?&lt;/strong&gt; Amazon will end support for EKS AL2-optimized AMIs after November 26, 2025. Kubernetes version 1.32 is the final release where Amazon EKS will provide AL2 AMIs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q3. What are the alternatives to Amazon Linux 2 for EKS nodes?&lt;/strong&gt; The main alternatives are Amazon Linux 2023 (AL2023) and Bottlerocket AMIs. AL2023 offers improved security features and a familiar environment, while Bottlerocket provides a container-optimized design with a minimal attack surface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q4. How can I identify and fix common nodeadm errors?&lt;/strong&gt; You can use the 'nodeadm debug' command to validate critical requirements and the 'nodeadm config check' command to validate configurations. Common issues include missing IAM permissions, network connectivity problems, and incorrect node IP configurations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q5. What are the cost implications of staying on a deprecated version of EKS?&lt;/strong&gt; Running a deprecated version of EKS can cost six times more than the standard rate. Instead of $0.10 per hour, you could end up paying $0.60 per hour, potentially adding over $500 per month for each outdated cluster.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>kubernetes</category>
      <category>terraform</category>
      <category>devops</category>
    </item>
    <item>
      <title>The $10,000 Label: How We Used Go, Clean Architecture, and AWS to Build a FinOps-Driven Cloud Tagging Engine 🏷️</title>
      <dc:creator>hugolesta</dc:creator>
      <pubDate>Mon, 29 Sep 2025 16:01:54 +0000</pubDate>
      <link>https://dev.to/hugolesta/the-10000-label-how-we-used-go-clean-architecture-and-aws-to-build-a-finops-driven-cloud-2988</link>
      <guid>https://dev.to/hugolesta/the-10000-label-how-we-used-go-clean-architecture-and-aws-to-build-a-finops-driven-cloud-2988</guid>
      <description>&lt;p&gt;&lt;strong&gt;Why Consistent Tagging is Your Company’s Most Underrated FinOps Tool:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Business Problem:&lt;/strong&gt; Imagine your cloud bill is a massive corporate expense report. Without proper tagging—simple key-value labels like &lt;code&gt;project: crm-migration&lt;/code&gt; or &lt;code&gt;owner: finance-team&lt;/code&gt;—you're paying thousands every month for line items labeled simply “Server.” This isn't just an accounting headache; it's a direct threat to cost control and security.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Bloat:&lt;/strong&gt; Orphaned or forgotten AWS resources (Shadow IT) continue to generate costs because no one is accountable for terminating them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Billing Disputes:&lt;/strong&gt; Finance teams struggle to attribute costs accurately, leading to friction and delayed chargebacks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Risks:&lt;/strong&gt; Unmanaged resources often fall outside compliance or patch cycles.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We decided to solve this with &lt;strong&gt;sys-tag-manager&lt;/strong&gt;, a powerful, automated system built in Golang that acts as our centralized "Cloud Label Printer," ensuring every AWS resource is correctly accounted for, compliant, and cost-trackable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Is a Tagging Strategy?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A tagging strategy is a &lt;strong&gt;structured approach to applying metadata (tags)&lt;/strong&gt; to cloud resources. Tags are simple key–value pairs like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;owner: finance-team
project: crm-migration
environment: production

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On their own, tags look trivial. But when applied consistently across an entire cloud estate, they form the backbone of organization, governance, and cost management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A tagging strategy defines:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Which tags are required&lt;/strong&gt; (e.g. owner, project, environment).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How tags should be formatted&lt;/strong&gt; (naming conventions, lowercase vs camelCase, separators).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When tags should be applied&lt;/strong&gt; (at creation time vs automated correction).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Who is responsible&lt;/strong&gt; for maintaining them.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For further guidance, consult the AWS documentation: &lt;a href="https://docs.aws.amazon.com/whitepapers/latest/tagging-best-practices/tagging-best-practices.html" rel="noopener noreferrer"&gt;Best Practices for Tagging AWS Resources&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical Foundation: Go, Clean Architecture, and AWS Cost Savings&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wzxfdb1qzcfncwiokda.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wzxfdb1qzcfncwiokda.png" alt="Pretty gopher doing Thumb-up" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Go Advantage: Performance Meets FinOps:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While prototyping in Python is quick, a core, mission-critical tool demands performance and reliability. We chose Golang for &lt;strong&gt;sys-tag-manager&lt;/strong&gt; because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cost-Efficient Execution:&lt;/strong&gt; Go's minimal memory footprint and extremely fast startup time are critical when running as AWS Lambda functions or Kubernetes CronJobs. This translates directly into lower AWS compute costs (less time billed for execution) compared to resource-heavier languages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability:&lt;/strong&gt; Static typing and robust concurrency let the system handle a rapidly growing number of AWS API calls without failure, keeping compliance coverage at 100%.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0m618venq5pdoydk5g1g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0m618venq5pdoydk5g1g.png" alt="The clean architecture, Domain, Adapter and User Case" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Clean Architecture: Decoupling Logic from the AWS SDK&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To ensure the system remains maintainable as our cloud estate scales, we invested in &lt;strong&gt;Clean Architecture&lt;/strong&gt;. This strategic separation is key to our long-term technical debt reduction:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Domain (Core Business Logic):&lt;/strong&gt; Pure, independent rules (e.g., "A resource is compliant if it has the required tags: &lt;code&gt;owner&lt;/code&gt; and &lt;code&gt;project&lt;/code&gt;"). This is highly &lt;strong&gt;testable&lt;/strong&gt; and knows nothing about AWS.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Cases (Application Logic):&lt;/strong&gt; Defines the "what" (e.g., "Check compliance and apply tags").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adapters (AWS Implementation):&lt;/strong&gt; Isolated logic that interacts directly with specific AWS SDK services (&lt;code&gt;EC2Tagger&lt;/code&gt;, &lt;code&gt;RDSTagger&lt;/code&gt;). This prevents vendor lock-in and allows us to add new services (S3, Lambda) without touching the core business rules.&lt;/li&gt;
&lt;/ol&gt;
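
&lt;p&gt;A minimal sketch of that Domain layer, as pure Go with no AWS SDK dependency (the names &lt;code&gt;RequiredTags&lt;/code&gt;, &lt;code&gt;MissingTags&lt;/code&gt; and &lt;code&gt;IsCompliant&lt;/code&gt; are illustrative, not sys-tag-manager's actual API):&lt;/p&gt;

```go
// Hypothetical sketch of the Domain layer: a pure compliance rule that is
// trivially testable and knows nothing about AWS.
package main

import "fmt"

// RequiredTags lists the keys every resource must carry.
var RequiredTags = []string{"owner", "project"}

// MissingTags returns the required keys absent from a resource's tag set.
func MissingTags(tags map[string]string) []string {
	missing := []string{}
	for _, key := range RequiredTags {
		if _, ok := tags[key]; !ok {
			missing = append(missing, key)
		}
	}
	return missing
}

// IsCompliant reports whether a resource satisfies the tagging rule.
func IsCompliant(tags map[string]string) bool {
	return len(MissingTags(tags)) == 0
}

func main() {
	tags := map[string]string{"owner": "finance-team"}
	fmt.Println("compliant:", IsCompliant(tags))
	fmt.Println("missing:", MissingTags(tags))
}
```

&lt;p&gt;Because the rule is a pure function, an Adapter such as &lt;code&gt;EC2Tagger&lt;/code&gt; only has to fetch tags and act on the result, which keeps the AWS SDK out of the core logic.&lt;/p&gt;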

&lt;p&gt;&lt;strong&gt;The Compliance Engine: Terraform, SSM, and Metadata Management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1dmb67cnjqdn7m87s5jz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1dmb67cnjqdn7m87s5jz.png" alt="AWS Resource explorer is the discovery layer" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Discovery Layer: Leveraging AWS Resource Explorer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before &lt;strong&gt;sys-tag-manager&lt;/strong&gt; can fix untagged resources, it must efficiently find them across accounts and Regions. We achieved this by using &lt;strong&gt;AWS Resource Explorer&lt;/strong&gt; as our primary discovery and inventory layer.&lt;/p&gt;

&lt;p&gt;Instead of writing complex API calls to list every resource type across every Region, &lt;strong&gt;sys-tag-manager&lt;/strong&gt; utilizes Resource Explorer's unified search capabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Workflow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Discovery:&lt;/strong&gt; &lt;strong&gt;sys-tag-manager&lt;/strong&gt; uses the Resource Explorer API to query the entire cloud estate for resources that are missing required tags (e.g., &lt;code&gt;tag:owner is absent&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validation:&lt;/strong&gt; For each untagged resource found, &lt;strong&gt;sys-tag-manager&lt;/strong&gt; checks its metadata against the centralized, correct rules stored in &lt;strong&gt;AWS SSM Parameter Store&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Correction:&lt;/strong&gt; The system then applies the right tags, assigning the resource to the correct owner or project, ensuring immediate compliance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This design significantly streamlined the core tagging loop: we are efficient not just in applying tags (Golang) but also in finding the resources that need them (Resource Explorer), saving API call costs and latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Harnessing Shared Infrastructure: The Fallback Mechanism&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While our primary goal is to enforce resource-specific tagging, we recognized that some resources are &lt;strong&gt;"Shared Infrastructure"&lt;/strong&gt; (e.g., core networking components, centralized security groups) that don't belong to a single owner. Addressing this was a critical design challenge.&lt;/p&gt;

&lt;p&gt;Our solution was a &lt;strong&gt;smart fallback mechanism&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The tagging engine first checks for the required resource-specific tags.&lt;/li&gt;
&lt;li&gt;If the tags are missing, it then checks a predefined list of &lt;strong&gt;AWS ARNs&lt;/strong&gt; (Amazon Resource Names) that are designated as shared infrastructure.&lt;/li&gt;
&lt;li&gt;If an ARN matches, the system &lt;strong&gt;applies a generic, shared set of tags&lt;/strong&gt; (e.g., &lt;code&gt;owner: platform-team&lt;/code&gt;, &lt;code&gt;charge-code: shared-infra&lt;/code&gt;) instead of flagging it as non-compliant. This prevents false positives and ensures accurate cost attribution for common resources.&lt;/li&gt;
&lt;/ol&gt;
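&lt;p&gt;The fallback decision can be sketched as a single function. The shared ARN set and the generic tag values below are illustrative examples, not the production list.&lt;/p&gt;

```python
# Sketch of the shared-infrastructure fallback decision. The shared ARN
# set and the generic tag values are illustrative, not production data.

SHARED_ARNS = {
    "arn:aws:ec2:eu-west-1:111122223333:transit-gateway/tgw-0abc",
}
SHARED_TAGS = {"owner": "platform-team", "charge-code": "shared-infra"}
NON_COMPLIANT = "non-compliant"

def resolve_tags(arn, current_tags, required_keys):
    """Return None if compliant, the generic shared tag set if the ARN is
    designated shared infrastructure, or a non-compliant marker so the
    normal correction path can take over."""
    # 1. Resource-specific tags already present: nothing to do.
    if all(k in current_tags for k in required_keys):
        return None
    # 2. Designated shared infrastructure: apply the generic tags
    #    instead of flagging a false positive.
    if arn in SHARED_ARNS:
        return dict(SHARED_TAGS)
    # 3. Otherwise, hand off to the regular correction flow.
    return NON_COMPLIANT
```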

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw7mnmtvltdw9tm2yeymp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw7mnmtvltdw9tm2yeymp.png" alt="HashiCorp and AWS hugging" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Terraform + SSM Parameter Store Synergy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The true power of &lt;strong&gt;sys-tag-manager&lt;/strong&gt; lies in its ability to dynamically enforce tagging rules based on centralized, auditable metadata.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Centralized Rule Source:&lt;/strong&gt; We leverage &lt;strong&gt;AWS Systems Manager (SSM) Parameter Store&lt;/strong&gt; to store the required tag keys, values, and compliance rules.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terraform as the Single Source of Truth:&lt;/strong&gt; The compliance rules in SSM are managed exclusively by Terraform. This means:

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Immutability:&lt;/strong&gt; Every rule change is tracked, reviewed, and deployed via a GitOps workflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation:&lt;/strong&gt; When a new project is created in Terraform, the required tag values for that project are automatically pushed to SSM, immediately making those tags valid in &lt;strong&gt;sys-tag-manager&lt;/strong&gt;'s checks.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;This integration ensures that the FinOps rules are always aligned with the deployed infrastructure definitions, creating a clean, traceable metadata loop.&lt;/p&gt;
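&lt;p&gt;As a rough illustration, the rule document Terraform publishes to SSM might look like the JSON below, with &lt;strong&gt;sys-tag-manager&lt;/strong&gt; validating tag sets against it. The parameter schema and field names here are hypothetical, not our production format.&lt;/p&gt;

```python
import json

# Hypothetical shape of a compliance-rule document as Terraform might
# publish it to SSM Parameter Store. The schema and the field names are
# illustrative, not the production format.

rule_document = json.loads("""
{
  "project": "checkout",
  "required_tags": {
    "owner": ["team-a", "team-b"],
    "charge-code": ["cc-1001", "cc-1002"]
  }
}
""")

def is_compliant(tags, rules):
    """A tag set is compliant when every required key is present and its
    value is one of the allowed values for that key."""
    for key, allowed in rules["required_tags"].items():
        if tags.get(key) not in allowed:
            return False
    return True
```

&lt;p&gt;Because Terraform owns the document, adding an allowed value is a reviewed pull request rather than an out-of-band edit, which is what keeps the rules auditable.&lt;/p&gt;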




&lt;p&gt;&lt;strong&gt;Business Impact: Quantifiable Results for FinOps&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjw9s6sh98t6w6is5l7z6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjw9s6sh98t6w6is5l7z6.png" alt="Descriptive pie charts" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before sys-tag-manager&lt;/th&gt;
&lt;th&gt;After sys-tag-manager&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Value Proposition&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compliance Time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Weeks (Manual Audits)&lt;/td&gt;
&lt;td&gt;Minutes (Automated Correction)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Faster cost allocation &amp;amp; reduced risk.&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Orphaned Resources&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~12% (Estimate)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt;1%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Direct savings on wasted AWS spend.&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FinOps Accuracy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High Friction/Disputes&lt;/td&gt;
&lt;td&gt;High Trust/Automated Showback&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Enables accurate, automated chargeback.&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;By investing in this system, we have fundamentally shifted from reactive tag auditing to &lt;strong&gt;proactive, automated compliance enforcement&lt;/strong&gt;. This not only saves engineering hours but directly enables our finance team to confidently utilize &lt;strong&gt;AWS Cost and Usage Reports (CUR)&lt;/strong&gt; for accurate showback and chargeback, making our entire cloud operation more accountable and &lt;strong&gt;financially efficient&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Wrapping Up: Compliance as Code, Savings as the Result&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;sys-tag-manager&lt;/strong&gt; is more than just an automation script; it's the enforcement layer for our FinOps and security policies, ensuring that our cloud environment is self-healing and financially accountable.&lt;/p&gt;

&lt;p&gt;By embracing &lt;strong&gt;Golang&lt;/strong&gt; for performance, &lt;strong&gt;Clean Architecture&lt;/strong&gt; for maintainability, and the &lt;strong&gt;Terraform + SSM&lt;/strong&gt; synergy for centralized metadata management, we've transformed tagging from a manual burden into an automated, cost-saving asset. This shift has given us the confidence that every dollar spent on AWS is trackable, auditable, and directly attributed to a business owner or project.&lt;/p&gt;

&lt;p&gt;The result is a culture of &lt;strong&gt;Compliance as Code&lt;/strong&gt; where engineers can focus on feature delivery, knowing that the foundational governance—the tagging—is handled automatically and efficiently by &lt;strong&gt;sys-tag-manager&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let's Keep the Conversation Going&lt;/strong&gt; 🗣️&lt;br&gt;
We've focused on the technical core of &lt;strong&gt;sys-tag-manager&lt;/strong&gt;, but the true organizational victory was how we scaled this system across dozens of teams without friction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Would you be interested in learning more about how we automated the communication of compliance status and tagging fixes to developers, FinOps, and management?&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;I'd love to hear your thoughts!&lt;/strong&gt; What's the biggest tagging challenge you face in your organization? &lt;strong&gt;Share your experiences and suggestions in the comments below!&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;Feel free to connect with me on LinkedIn to discuss our approach to automated communication and team onboarding:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LinkedIn:&lt;/strong&gt; &lt;a href="https://www.linkedin.com/in/hugo-lesta-5a058138/" rel="noopener noreferrer"&gt;Hugo Lesta&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/hugolesta" rel="noopener noreferrer"&gt;Hugo Lesta's GitHub&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>finops</category>
      <category>terraform</category>
      <category>go</category>
    </item>
  </channel>
</rss>
