<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Matthew</title>
    <description>The latest articles on DEV Community by Matthew (@matthewdipo).</description>
    <link>https://dev.to/matthewdipo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2968204%2Fd722f79d-c63b-4895-81c0-b4a80d203dfd.jpeg</url>
      <title>DEV Community: Matthew</title>
      <link>https://dev.to/matthewdipo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/matthewdipo"/>
    <language>en</language>
    <item>
      <title>Appendix: Live System Output</title>
      <dc:creator>Matthew</dc:creator>
      <pubDate>Thu, 28 May 2026 18:51:02 +0000</pubDate>
      <link>https://dev.to/matthewdipo/appendix-live-system-output-4p3e</link>
      <guid>https://dev.to/matthewdipo/appendix-live-system-output-4p3e</guid>
      <description>&lt;h2&gt;
  
  
  Appendix: Live System Output — Real Pipeline in Production
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;All output below was captured live from the running pipeline on 2026-03-08.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;These are not mock outputs — they come from actual AWS infrastructure and Kubernetes clusters.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  ArgoCD — All 50 Applications Across 6 Clusters
&lt;/h2&gt;

&lt;p&gt;The following is the live output of &lt;code&gt;argocd app list&lt;/code&gt; from the hub cluster (&lt;code&gt;myapp-production-use1&lt;/code&gt;).&lt;br&gt;
Every component of the pipeline is represented — security, logging, monitoring, backups, and the application itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;argocd app list &lt;span class="nt"&gt;--output&lt;/span&gt; wide
&lt;span class="go"&gt;
NAME                                           CLUSTER                NAMESPACE         PROJECT     STATUS     HEALTH
argocd/argo-rollouts-myapp-production-use1     myapp-production-use1  argo-rollouts     production  Synced     Healthy
argocd/argo-rollouts-myapp-production-usw2     myapp-production-usw2  argo-rollouts     production  Synced     Healthy
argocd/aws-lbc-myapp-production-use1           myapp-production-use1  kube-system       production  Synced     Healthy
argocd/eso-myapp-production-use1               myapp-production-use1  external-secrets  production  OutOfSync  Healthy   ← known false positive
argocd/eso-myapp-production-usw2               myapp-production-usw2  external-secrets  production  OutOfSync  Healthy   ← known false positive
argocd/falco-myapp-dev-use1                    myapp-dev-use1         falco             production  Synced     Healthy
argocd/falco-myapp-dev-usw2                    myapp-dev-usw2         falco             production  Synced     Healthy
argocd/falco-myapp-production-use1             myapp-production-use1  falco             production  Synced     Healthy
argocd/falco-myapp-production-usw2             myapp-production-usw2  falco             production  Synced     Healthy
argocd/falco-myapp-staging-use1                myapp-staging-use1     falco             production  Synced     Healthy
argocd/falco-myapp-staging-usw2                myapp-staging-usw2     falco             production  Synced     Healthy
argocd/fluent-bit-myapp-dev-use1               myapp-dev-use1         logging           production  Synced     Healthy
argocd/fluent-bit-myapp-dev-usw2               myapp-dev-usw2         logging           production  Synced     Healthy
argocd/fluent-bit-myapp-production-use1        myapp-production-use1  logging           production  Synced     Healthy
argocd/fluent-bit-myapp-production-usw2        myapp-production-usw2  logging           production  Synced     Healthy
argocd/fluent-bit-myapp-staging-use1           myapp-staging-use1     logging           production  Synced     Healthy
argocd/fluent-bit-myapp-staging-usw2           myapp-staging-usw2     logging           production  Synced     Healthy
argocd/karpenter-myapp-production-use1         myapp-production-use1  karpenter         production  Synced     Healthy
argocd/karpenter-myapp-production-usw2         myapp-production-usw2  karpenter         production  Synced     Healthy
argocd/kyverno-myapp-dev-use1                  myapp-dev-use1         kyverno           production  Synced     Healthy
argocd/kyverno-myapp-dev-usw2                  myapp-dev-usw2         kyverno           production  Synced     Healthy
argocd/kyverno-myapp-production-use1           myapp-production-use1  kyverno           production  Synced     Healthy
argocd/kyverno-myapp-production-usw2           myapp-production-usw2  kyverno           production  Synced     Healthy
argocd/kyverno-myapp-staging-use1              myapp-staging-use1     kyverno           production  Synced     Healthy
argocd/kyverno-myapp-staging-usw2              myapp-staging-usw2     kyverno           production  Synced     Healthy
argocd/kyverno-policies-myapp-dev-use1         myapp-dev-use1         kyverno           production  Synced     Healthy
argocd/kyverno-policies-myapp-dev-usw2         myapp-dev-usw2         kyverno           production  Synced     Healthy
argocd/kyverno-policies-myapp-production-use1  myapp-production-use1  kyverno           production  Synced     Healthy
argocd/kyverno-policies-myapp-production-usw2  myapp-production-usw2  kyverno           production  Synced     Healthy
argocd/kyverno-policies-myapp-staging-use1     myapp-staging-use1     kyverno           production  Synced     Healthy
argocd/kyverno-policies-myapp-staging-usw2     myapp-staging-usw2     kyverno           production  Synced     Healthy
argocd/myapp-dev-myapp-dev-use1                myapp-dev-use1         dev               dev         OutOfSync  Healthy   ← ESO drift (expected)
argocd/myapp-dev-myapp-dev-usw2                myapp-dev-usw2         dev               dev         OutOfSync  Healthy   ← ESO drift (expected)
argocd/myapp-production-myapp-production-use1  myapp-production-use1  production        production  OutOfSync  Healthy   ← ESO drift (expected)
argocd/myapp-production-myapp-production-usw2  myapp-production-usw2  production        production  OutOfSync  Healthy   ← ESO drift (expected)
argocd/myapp-staging-myapp-staging-use1        myapp-staging-use1     staging           staging     Synced     Healthy
argocd/myapp-staging-myapp-staging-usw2        myapp-staging-usw2     staging           staging     Synced     Healthy
argocd/prometheus-myapp-production-use1        myapp-production-use1  monitoring        production  OutOfSync  Degraded  ← webhook job timeout (expected)
argocd/prometheus-myapp-production-usw2        myapp-production-usw2  monitoring        production  OutOfSync  Healthy   ← prometheus webhook drift (expected)
argocd/prometheus-myapp-staging-use1           myapp-staging-use1     monitoring        production  OutOfSync  Healthy
argocd/prometheus-myapp-staging-usw2           myapp-staging-usw2     monitoring        production  OutOfSync  Healthy
argocd/velero-myapp-dev-use1                   myapp-dev-use1         velero            production  Synced     Healthy
argocd/velero-myapp-dev-usw2                   myapp-dev-usw2         velero            production  Synced     Healthy
argocd/velero-myapp-production-use1            myapp-production-use1  velero            production  Synced     Healthy
argocd/velero-myapp-production-usw2            myapp-production-usw2  velero            production  Synced     Healthy
argocd/velero-myapp-staging-use1               myapp-staging-use1     velero            production  Synced     Healthy
argocd/velero-myapp-staging-usw2               myapp-staging-usw2     velero            production  Synced     Healthy
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
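&lt;p&gt;To spot-check a listing like this without reading every row, the status columns can be tallied with a short script. The following is a minimal sketch, not part of the pipeline: it assumes the whitespace-separated column layout shown above, and &lt;code&gt;tally_status&lt;/code&gt; and the three sample rows are illustrative.&lt;/p&gt;

```python
from collections import Counter

# Three rows copied from the listing above; a real run would read the
# full text of `argocd app list --output wide` instead.
SAMPLE = """\
NAME                                     CLUSTER                NAMESPACE         PROJECT     STATUS     HEALTH
argocd/falco-myapp-dev-use1              myapp-dev-use1         falco             production  Synced     Healthy
argocd/eso-myapp-production-use1         myapp-production-use1  external-secrets  production  OutOfSync  Healthy
argocd/prometheus-myapp-production-use1  myapp-production-use1  monitoring        production  OutOfSync  Degraded
"""


def tally_status(listing):
    """Count (sync status, health) pairs, skipping the header row."""
    counts = Counter()
    for line in listing.splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 6:
            # Fields 5 and 6 are STATUS and HEALTH in the layout above.
            counts[(fields[4], fields[5])] += 1
    return counts


if __name__ == "__main__":
    for (sync, health), n in sorted(tally_status(SAMPLE).items()):
        print(sync, health, n)
```

&lt;p&gt;Any pair other than Synced/Healthy can then be compared against the list of expected exceptions annotated in the output.&lt;/p&gt;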






&lt;h2&gt;
  
  
  ArgoCD — All 6 Clusters Registered and Reachable
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;argocd cluster list
&lt;span class="go"&gt;
SERVER                                                                    NAME                   VERSION  STATUS
https://3C0575BCE3279BAFF3BB2D5B8444226A.gr7.us-west-2.eks.amazonaws.com  myapp-dev-usw2         1.29+    Successful
https://5079196FCF4ED5112E09CA85D7B8650F.gr7.us-west-2.eks.amazonaws.com  myapp-staging-usw2     1.29+    Successful
https://EA3C5197A0C39EA32557D04B8A2240EA.gr7.us-west-2.eks.amazonaws.com  myapp-production-usw2  1.29+    Successful
https://654498BA82E54D67E79FE325057C464B.gr7.us-east-1.eks.amazonaws.com  myapp-dev-use1         1.29+    Successful
https://6C4AB3A81EFDB980A8356D40C1590263.gr7.us-east-1.eks.amazonaws.com  myapp-staging-use1     1.29+    Successful
https://kubernetes.default.svc                                            myapp-production-use1  1.29+    Successful
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All 6 clusters show &lt;code&gt;Successful&lt;/code&gt; — the ArgoCD hub on &lt;code&gt;myapp-production-use1&lt;/code&gt; can communicate with all spoke clusters via VPC peering (private endpoints) and public endpoints (dev).&lt;/p&gt;




&lt;h2&gt;
  
  
  EKS Nodes — Live Cluster Status
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Dev Clusters (public endpoints, Kubernetes 1.29)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; dev-use1 get nodes
&lt;span class="go"&gt;
ip-10-0-15-182.ec2.internal    Ready   v1.29.15-eks-ecaa3a6
ip-10-0-22-241.ec2.internal    Ready   v1.29.15-eks-ecaa3a6
ip-10-0-27-28.ec2.internal     Ready   v1.29.15-eks-ecaa3a6

&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; dev-usw2 get nodes
&lt;span class="go"&gt;
ip-10-1-28-16.us-west-2.compute.internal    Ready   v1.29.15-eks-ecaa3a6
ip-10-1-3-187.us-west-2.compute.internal    Ready   v1.29.15-eks-ecaa3a6
ip-10-1-7-181.us-west-2.compute.internal    Ready   v1.29.15-eks-ecaa3a6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Production Cluster — myapp-production-use1 (private endpoint)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 get nodes &lt;span class="nt"&gt;-o&lt;/span&gt; wide
&lt;span class="go"&gt;
NAME                           STATUS  VERSION               INTERNAL-IP     INSTANCE-TYPE
ip-10-20-2-113.ec2.internal    Ready   v1.29.15-eks-ecaa3a6  10.20.2.113     t3.medium
ip-10-20-24-200.ec2.internal   Ready   v1.29.15-eks-ecaa3a6  10.20.24.200    t3.medium
ip-10-20-26-204.ec2.internal   Ready   v1.29.15-eks-ecaa3a6  10.20.26.204    t3.medium
ip-10-20-7-170.ec2.internal    Ready   v1.29.15-eks-ecaa3a6  10.20.7.170     t3.medium
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All nodes shown are &lt;code&gt;Ready&lt;/code&gt; and run Kubernetes &lt;code&gt;v1.29.15-eks-ecaa3a6&lt;/code&gt;. The production-use1 nodes have private IPs in the VPC's &lt;code&gt;10.20.0.0/16&lt;/code&gt; CIDR.&lt;/p&gt;




&lt;h2&gt;
  
  
  Application — Running Pods in Production
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 get pods &lt;span class="nt"&gt;-n&lt;/span&gt; production
&lt;span class="go"&gt;
NAME                                                              READY   STATUS    RESTARTS   AGE
myapp-production-myapp-production-use1-myapp-9985ccc88-f7rxj     1/1     Running   1          11d
myapp-production-myapp-production-use1-myapp-9985ccc88-l2sh2     1/1     Running   0          11d
myapp-production-myapp-production-use1-myapp-9985ccc88-vtwsm     1/1     Running   0          11d
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3 replicas running (matches &lt;code&gt;minReplicas: 3&lt;/code&gt; in values-production.yaml).&lt;br&gt;
The pod naming convention shows the ArgoCD release name (&lt;code&gt;myapp-production-myapp-production-use1&lt;/code&gt;) and the Helm chart (&lt;code&gt;myapp&lt;/code&gt;).&lt;/p&gt;




&lt;h2&gt;
  
  
  Argo Rollouts — Canary Controller Active
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 get rollouts &lt;span class="nt"&gt;-n&lt;/span&gt; production
&lt;span class="go"&gt;
NAME                                           DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
myapp-production-myapp-production-use1-myapp   3         3         3            3           11d
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 get hpa &lt;span class="nt"&gt;-n&lt;/span&gt; production
&lt;span class="go"&gt;
NAME                                           REFERENCE                                              TARGETS     MINPODS   MAXPODS   REPLICAS
myapp-production-myapp-production-use1-myapp   Rollout/myapp-production-myapp-production-use1-myapp   &amp;lt;unknown&amp;gt;/60%   3         10        3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The HPA targets the &lt;code&gt;Rollout&lt;/code&gt; resource (not a &lt;code&gt;Deployment&lt;/code&gt;), which is the correct configuration for Argo Rollouts. &lt;code&gt;&amp;lt;unknown&amp;gt;/60%&lt;/code&gt; means the HPA has no current CPU reading for its target, typically because metrics-server has not yet reported usage for the pods. The HPA cannot make scaling decisions while the value is unknown; once a reading is available, it scales out when average CPU utilization crosses 60%.&lt;/p&gt;
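&lt;p&gt;Once a CPU reading is available, the replica count follows the proportional rule described in the Kubernetes HPA documentation: the desired count is the current count scaled by the ratio of observed to target utilization, rounded up and bounded by the configured minimum and maximum. The sketch below applies that rule with the 60% target and the 3 to 10 replica bounds from the output above. It is a simplification: &lt;code&gt;desired_replicas&lt;/code&gt; is an illustrative name, and the controller's tolerance band and stabilization windows are ignored.&lt;/p&gt;

```python
import math


def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct=60,
                     min_replicas=3, max_replicas=10):
    """Simplified HPA rule: scale in proportion to utilization, then clamp."""
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Observed average CPU utilization values are made up for illustration.
    for cpu in (30, 60, 90, 300):
        print(cpu, desired_replicas(3, cpu))
```

&lt;p&gt;With three replicas at 90% average CPU, the rule asks for five replicas; at very high load the result is capped at the maximum of ten, and at low load it never drops below the minimum of three.&lt;/p&gt;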




&lt;h2&gt;
  
  
  Kyverno — Admission Policies Enforced
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 get clusterpolicies
&lt;span class="go"&gt;
NAME                          ADMISSION   BACKGROUND   VALIDATE ACTION   READY   AGE
disallow-latest-tag           true        true         Enforce           True    34h
require-non-root              true        true         Enforce           True    34h
require-readonly-filesystem   true        true         Enforce           True    34h
require-resource-limits       true        true         Enforce           True    34h
restrict-image-registry       true        true         Enforce           True    34h
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5 cluster-wide policies active, all in &lt;code&gt;Enforce&lt;/code&gt; mode (not &lt;code&gt;Audit&lt;/code&gt;) — violations are blocked, not just logged. &lt;code&gt;Ready: True&lt;/code&gt; means each policy's webhook is registered and functioning.&lt;/p&gt;
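&lt;p&gt;The policy definitions themselves are not shown in this appendix, so the following is only a rough sketch of the kind of check a rule named &lt;code&gt;disallow-latest-tag&lt;/code&gt; performs. It is plain Python, not Kyverno's implementation; &lt;code&gt;uses_latest_tag&lt;/code&gt; is a hypothetical helper and the registry host in the usage is a placeholder.&lt;/p&gt;

```python
def uses_latest_tag(image):
    """True if the image reference resolves to the mutable 'latest' tag."""
    if "@" in image:
        # Pinned by digest, so the tag (if any) does not decide what runs.
        return False
    # Only the last path segment can carry a tag; a colon earlier in the
    # reference belongs to the registry port.
    last_segment = image.rsplit("/", 1)[-1]
    name, sep, tag = last_segment.partition(":")
    # No explicit tag means the container runtime defaults to 'latest'.
    return (not sep) or tag == "latest"


if __name__ == "__main__":
    for ref in ("myapp:latest", "myapp",
                "registry.example.com:5000/myapp:sha-f72053d"):
        print(ref, uses_latest_tag(ref))
```

&lt;p&gt;In &lt;code&gt;Enforce&lt;/code&gt; mode, a pod whose image fails a check of this kind is rejected at admission instead of being reported after the fact.&lt;/p&gt;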




&lt;h2&gt;
  
  
  Falco — Runtime Security DaemonSet Running
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 get pods &lt;span class="nt"&gt;-n&lt;/span&gt; falco
&lt;span class="go"&gt;
NAME                                                         READY   STATUS    RESTARTS
falco-myapp-production-use1-5d6dr                            1/1     Running   29
falco-myapp-production-use1-jql5r                            1/1     Running   0
falco-myapp-production-use1-pbhdr                            1/1     Running   0
falco-myapp-production-use1-sjlkh                            1/1     Running   0
falco-myapp-production-use1-falcosidekick-7c56844569-h4vvk   1/1     Running   1
falco-myapp-production-use1-falcosidekick-7c56844569-qcnjh   1/1     Running   2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Falco runs as a DaemonSet with one pod per node (4 nodes = 4 Falco pods), and all pods are &lt;code&gt;Running&lt;/code&gt;. The 29 restarts on one pod are from the initial eBPF driver loading, which is normal behaviour on kernel version changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  External Secrets — Synced from AWS Secrets Manager
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 get externalsecret &lt;span class="nt"&gt;-n&lt;/span&gt; production
&lt;span class="go"&gt;
NAME                                                   STORE                                    REFRESH   STATUS         READY
myapp-production-myapp-production-use1-myapp-secrets   myapp-production-myapp-production-use1-  1h        SecretSynced   True
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;SecretSynced&lt;/code&gt; with &lt;code&gt;READY: True&lt;/code&gt; means ESO has successfully fetched &lt;code&gt;production/myapp/db-password&lt;/code&gt; from AWS Secrets Manager and created the Kubernetes Secret. The IRSA authentication chain (OIDC token → STS → Secrets Manager) is working correctly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Velero — Scheduled Backups Running
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 get schedules &lt;span class="nt"&gt;-n&lt;/span&gt; velero
&lt;span class="go"&gt;
NAME                                        STATUS    SCHEDULE    LASTBACKUP   AGE
velero-myapp-production-use1-daily-backup   Enabled   0 2 * * *   16h          34h
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Daily backup schedule active. Last backup ran 16 hours ago (2 AM UTC). Backups stored in S3.&lt;/p&gt;
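&lt;p&gt;The cron expression &lt;code&gt;0 2 * * *&lt;/code&gt; fires once a day at 02:00, so the &lt;code&gt;LASTBACKUP&lt;/code&gt; age can be sanity-checked against the clock. The sketch below is a minimal illustration that handles only this daily pattern, not general cron syntax; &lt;code&gt;previous_daily_run&lt;/code&gt; is an illustrative name and the 18:00 UTC capture time is an assumption chosen to be consistent with the 16h shown above.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone


def previous_daily_run(now, hour=2):
    """Return the most recent daily run at `hour`:00 at or before `now`."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate > now:
        # Today's run has not happened yet, so the last one was yesterday.
        candidate -= timedelta(days=1)
    return candidate


if __name__ == "__main__":
    # Hypothetical capture time; the appendix does not state the exact hour.
    captured = datetime(2026, 3, 8, 18, 0, tzinfo=timezone.utc)
    last = previous_daily_run(captured)
    print(last.isoformat(), (captured - last) // timedelta(hours=1))
```

&lt;p&gt;An age much larger than 24 hours for a daily schedule would be the signal to inspect the Velero backup logs.&lt;/p&gt;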




&lt;h2&gt;
  
  
  ECR — Signed Images in Registry
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;aws ecr describe-images &lt;span class="nt"&gt;--repository-name&lt;/span&gt; myapp &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-mgmt
&lt;span class="go"&gt;
IMAGE TAG                                                          PUSHED AT              SIZE
sha-f72053d0d5fb765bc08d8b5a8374119655997784                      2026-02-22T17:37:27Z   48.5 MB
sha256-f01790daf982...956be0c.sig                                  2026-02-22T17:40:48Z   499 B   ← Cosign signature
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two OCI artifacts per image push:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The application image tagged &lt;code&gt;sha-&amp;lt;full-git-sha&amp;gt;&lt;/code&gt; (48.5 MB)&lt;/li&gt;
&lt;li&gt;The Cosign signature artifact tagged &lt;code&gt;sha256-&amp;lt;digest&amp;gt;.sig&lt;/code&gt; (499 bytes): the cryptographic signature stored in ECR alongside the image and verified by Kyverno at admission time&lt;/li&gt;
&lt;/ol&gt;
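&lt;p&gt;The signature tag is not arbitrary: with Cosign's default tag-based storage, it is derived from the image digest by replacing the colon with a hyphen and appending &lt;code&gt;.sig&lt;/code&gt;. The sketch below shows that mapping; &lt;code&gt;cosign_signature_tag&lt;/code&gt; is an illustrative name and the digest in the usage is a placeholder, not the image listed above.&lt;/p&gt;

```python
def cosign_signature_tag(image_digest):
    """Map an image digest such as 'sha256:abc...' to Cosign's signature tag."""
    algorithm, sep, hex_digest = image_digest.partition(":")
    if not sep or not hex_digest:
        raise ValueError("expected a digest of the form algorithm:hex")
    return f"{algorithm}-{hex_digest}.sig"


if __name__ == "__main__":
    # Placeholder digest for illustration only.
    print(cosign_signature_tag("sha256:" + "0" * 64))
```

&lt;p&gt;Because the tag depends only on the digest, a verifier can locate the signature for any image it is asked to admit without extra metadata.&lt;/p&gt;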




&lt;h2&gt;
  
  
  AWS GuardDuty — Threat Detection Active
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;aws guardduty list-detectors + get-detector &lt;span class="o"&gt;(&lt;/span&gt;production account&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;
Status: ENABLED
DataSources:
  - S3 Logs:          ENABLED
  - Kubernetes Audit: ENABLED
  - Malware Protection (EBS): ENABLED

&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;aws guardduty list-detectors &lt;span class="o"&gt;(&lt;/span&gt;staging account&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;
Status: ENABLED
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GuardDuty enabled in both production and staging accounts with EKS audit log monitoring. Any &lt;code&gt;kubectl exec&lt;/code&gt; into production pods, unusual API call patterns, or crypto mining activity will generate findings.&lt;/p&gt;




&lt;h2&gt;
  
  
  AWS CloudWatch — Log Groups from Fluent Bit
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;aws logs describe-log-groups &lt;span class="nt"&gt;--log-group-name-prefix&lt;/span&gt; &lt;span class="s2"&gt;"/eks/"&lt;/span&gt; &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="o"&gt;(&lt;/span&gt;production account&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;
/eks/myapp-production-use1
/eks/myapp-production-use1/argocd
/eks/myapp-production-use1/external-secrets
/eks/myapp-production-use1/falco
/eks/myapp-production-use1/karpenter
/eks/myapp-production-use1/kyverno
/eks/myapp-production-use1/logging
/eks/myapp-production-use1/monitoring
/eks/myapp-production-use1/production
/eks/myapp-production-use1/velero
/eks/myapp-production-use1/argo-rollouts
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One CloudWatch Log Group per Kubernetes namespace, all prefixed &lt;code&gt;/eks/myapp-production-use1/&lt;/code&gt;. Fluent Bit DaemonSet ships logs from every container in every namespace to the corresponding log group.&lt;/p&gt;




&lt;h2&gt;
  
  
  Live Health Check — Public Endpoint
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; https://www.matthewoladipupo.dev/health | python3 &lt;span class="nt"&gt;-m&lt;/span&gt; json.tool
&lt;span class="go"&gt;
{
    "status": "healthy",
    "region": "us-east-1"
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Production application serving HTTPS traffic. Route53 latency routing directs users to the nearest healthy region. AWS WAF WebACL inspects every request before it reaches the ALB.&lt;/p&gt;




&lt;h2&gt;
  
  
  Grafana — Public Dashboard
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;URL:      https://grafana.matthewoladipupo.dev
Username: admin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Grafana deployed on &lt;code&gt;myapp-production-use1&lt;/code&gt; with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;50 GiB Prometheus TSDB (15-day retention)&lt;/li&gt;
&lt;li&gt;10 GiB Grafana persistent volume&lt;/li&gt;
&lt;li&gt;ACM wildcard TLS certificate (&lt;code&gt;*.matthewoladipupo.dev&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;ALB internet-facing ingress provisioned by AWS Load Balancer Controller&lt;/li&gt;
&lt;/ul&gt;





&lt;h2&gt;
  
  
  Summary Table — Component Health at Time of Writing
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Clusters&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ArgoCD hub&lt;/td&gt;
&lt;td&gt;prod-use1&lt;/td&gt;
&lt;td&gt;✅ Running, all 6 clusters registered&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kyverno policies&lt;/td&gt;
&lt;td&gt;All 6&lt;/td&gt;
&lt;td&gt;✅ 5 ClusterPolicies, Enforce mode, Ready&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Falco DaemonSet&lt;/td&gt;
&lt;td&gt;All 6&lt;/td&gt;
&lt;td&gt;✅ One pod per node, all Running&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fluent Bit DaemonSet&lt;/td&gt;
&lt;td&gt;All 6&lt;/td&gt;
&lt;td&gt;✅ Synced/Healthy, CloudWatch log groups created&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External Secrets&lt;/td&gt;
&lt;td&gt;All 6&lt;/td&gt;
&lt;td&gt;✅ SecretSynced: True&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Velero schedules&lt;/td&gt;
&lt;td&gt;All 6&lt;/td&gt;
&lt;td&gt;✅ Daily backup at 02:00 UTC, last run 16h ago&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Karpenter&lt;/td&gt;
&lt;td&gt;prod-use1, prod-usw2&lt;/td&gt;
&lt;td&gt;✅ Synced/Healthy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Argo Rollouts&lt;/td&gt;
&lt;td&gt;prod-use1, prod-usw2&lt;/td&gt;
&lt;td&gt;✅ Synced/Healthy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kube-prometheus-stack&lt;/td&gt;
&lt;td&gt;staging+prod (4)&lt;/td&gt;
&lt;td&gt;✅ Running (OutOfSync is a known false positive; prod-use1 reports Degraded from an expected webhook job timeout)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GuardDuty&lt;/td&gt;
&lt;td&gt;prod + staging&lt;/td&gt;
&lt;td&gt;✅ ENABLED with EKS audit logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECR images&lt;/td&gt;
&lt;td&gt;mgmt account&lt;/td&gt;
&lt;td&gt;✅ Immutable tags, Cosign signatures present&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNS + TLS&lt;/td&gt;
&lt;td&gt;Route53 + ACM&lt;/td&gt;
&lt;td&gt;✅ &lt;code&gt;www.matthewoladipupo.dev&lt;/code&gt; → healthy&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;All data captured live on 2026-03-08 from the running AWS infrastructure.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cicd</category>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Production DevSecOps Pipeline — The Complete Day-2 Operations Runbook</title>
      <dc:creator>Matthew</dc:creator>
      <pubDate>Thu, 28 May 2026 18:50:54 +0000</pubDate>
      <link>https://dev.to/matthewdipo/production-devsecops-pipeline-the-complete-day-2-operations-runbook-5bj2</link>
      <guid>https://dev.to/matthewdipo/production-devsecops-pipeline-the-complete-day-2-operations-runbook-5bj2</guid>
      <description>&lt;h2&gt;
  
  
  DevSecOps Pipeline — Completion Runbook
&lt;/h2&gt;

&lt;p&gt;All code is written and pushed to GitHub. This runbook covers the remaining operational steps: Terraform applies, GitOps ARN updates, and ArgoCD deployment.&lt;/p&gt;


&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Install these tools if not already present:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# AWS CLI v2&lt;/span&gt;
winget &lt;span class="nb"&gt;install &lt;/span&gt;Amazon.AWSCLI

&lt;span class="c"&gt;# Terraform 1.6+&lt;/span&gt;
winget &lt;span class="nb"&gt;install &lt;/span&gt;HashiCorp.Terraform

&lt;span class="c"&gt;# Terragrunt&lt;/span&gt;
&lt;span class="c"&gt;# Download from https://github.com/gruntwork-io/terragrunt/releases&lt;/span&gt;
&lt;span class="c"&gt;# Place in C:\Windows\System32\ or add to PATH&lt;/span&gt;

&lt;span class="c"&gt;# kubectl&lt;/span&gt;
winget &lt;span class="nb"&gt;install &lt;/span&gt;Kubernetes.kubectl

&lt;span class="c"&gt;# ArgoCD CLI&lt;/span&gt;
winget &lt;span class="nb"&gt;install &lt;/span&gt;argoproj.argocd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  AWS Profile Setup
&lt;/h2&gt;

&lt;p&gt;The root &lt;code&gt;terragrunt.hcl&lt;/code&gt; uses profiles named &lt;code&gt;myapp-{env}-{region_alias}&lt;/code&gt;.&lt;br&gt;
Configure them in &lt;code&gt;~/.aws/config&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[profile myapp-production-use1]&lt;/span&gt;
&lt;span class="py"&gt;region&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
&lt;span class="py"&gt;role_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::591120834781:role/AdministratorAccess&lt;/span&gt;
&lt;span class="py"&gt;source_profile&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;

&lt;span class="nn"&gt;[profile myapp-production-usw2]&lt;/span&gt;
&lt;span class="py"&gt;region&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;us-west-2&lt;/span&gt;
&lt;span class="py"&gt;role_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::591120834781:role/AdministratorAccess&lt;/span&gt;
&lt;span class="py"&gt;source_profile&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;

&lt;span class="nn"&gt;[profile myapp-staging-use1]&lt;/span&gt;
&lt;span class="py"&gt;region&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
&lt;span class="py"&gt;role_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::690687753178:role/AdministratorAccess&lt;/span&gt;
&lt;span class="py"&gt;source_profile&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;

&lt;span class="nn"&gt;[profile myapp-staging-usw2]&lt;/span&gt;
&lt;span class="py"&gt;region&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;us-west-2&lt;/span&gt;
&lt;span class="py"&gt;role_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::690687753178:role/AdministratorAccess&lt;/span&gt;
&lt;span class="py"&gt;source_profile&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;

&lt;span class="nn"&gt;[profile myapp-dev-use1]&lt;/span&gt;
&lt;span class="py"&gt;region&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
&lt;span class="py"&gt;role_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::557702566877:role/AdministratorAccess&lt;/span&gt;
&lt;span class="py"&gt;source_profile&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;

&lt;span class="nn"&gt;[profile myapp-dev-usw2]&lt;/span&gt;
&lt;span class="py"&gt;region&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;us-west-2&lt;/span&gt;
&lt;span class="py"&gt;role_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::557702566877:role/AdministratorAccess&lt;/span&gt;
&lt;span class="py"&gt;source_profile&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  PHASE 1 — Terraform Applies
&lt;/h2&gt;

&lt;p&gt;Work from the &lt;code&gt;myapp-infra/&lt;/code&gt; directory. Run the applies in the order shown and capture&lt;br&gt;
the outputs; Phase 2 uses them to update the GitOps files.&lt;/p&gt;
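
&lt;p&gt;The working directories in this phase follow a &lt;code&gt;live/{environment}/{region}/{component}&lt;/code&gt; layout, so the repeated applies can be generated with a loop. The sketch below is one way to do that. &lt;code&gt;tg_dirs&lt;/code&gt; is a hypothetical helper, and it assumes every component uses the same two regions.&lt;/p&gt;

```shell
# Hypothetical helper (not part of the original repo): print one Terragrunt
# working directory per environment/region pair for a given component.
tg_dirs() {
  component="$1"; shift
  for env in "$@"; do
    for region in us-east-1 us-west-2; do
      printf 'live/%s/%s/%s\n' "$env" "$region" "$component"
    done
  done
}

# Example: list the WAF units in the same order as step 1.1.
tg_dirs waf production staging
```

&lt;p&gt;Each printed path can then be passed to &lt;code&gt;terragrunt apply --terragrunt-working-dir&lt;/code&gt; one at a time, so the outputs are still captured per unit.&lt;/p&gt;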
&lt;h3&gt;
  
  
  1.1 WAF (production + staging)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Production us-east-1&lt;/span&gt;
terragrunt apply &lt;span class="nt"&gt;--terragrunt-working-dir&lt;/span&gt; live/production/us-east-1/waf
&lt;span class="c"&gt;# Output → webacl_arn  (copy this value)&lt;/span&gt;

&lt;span class="c"&gt;# Production us-west-2&lt;/span&gt;
terragrunt apply &lt;span class="nt"&gt;--terragrunt-working-dir&lt;/span&gt; live/production/us-west-2/waf
&lt;span class="c"&gt;# Output → webacl_arn  (copy this value)&lt;/span&gt;

&lt;span class="c"&gt;# Staging (no GitOps ARN needed, but good to have)&lt;/span&gt;
terragrunt apply &lt;span class="nt"&gt;--terragrunt-working-dir&lt;/span&gt; live/staging/us-east-1/waf
terragrunt apply &lt;span class="nt"&gt;--terragrunt-working-dir&lt;/span&gt; live/staging/us-west-2/waf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  1.2 GuardDuty (all regions — no outputs needed)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terragrunt apply &lt;span class="nt"&gt;--terragrunt-working-dir&lt;/span&gt; live/production/us-east-1/guardduty
terragrunt apply &lt;span class="nt"&gt;--terragrunt-working-dir&lt;/span&gt; live/production/us-west-2/guardduty
terragrunt apply &lt;span class="nt"&gt;--terragrunt-working-dir&lt;/span&gt; live/staging/us-east-1/guardduty
terragrunt apply &lt;span class="nt"&gt;--terragrunt-working-dir&lt;/span&gt; live/staging/us-west-2/guardduty
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;GuardDuty has no GitOps dependency. Alerts appear in the AWS console and&lt;br&gt;
optionally in CloudWatch.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  1.3 ESO IRSA for Staging
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Staging us-east-1&lt;/span&gt;
terragrunt apply &lt;span class="nt"&gt;--terragrunt-working-dir&lt;/span&gt; live/staging/us-east-1/eso-irsa
&lt;span class="c"&gt;# Output → role_arn  (copy → used in environments/staging/applicationset.yaml)&lt;/span&gt;

&lt;span class="c"&gt;# Staging us-west-2&lt;/span&gt;
terragrunt apply &lt;span class="nt"&gt;--terragrunt-working-dir&lt;/span&gt; live/staging/us-west-2/eso-irsa
&lt;span class="c"&gt;# Output → role_arn  (copy → used in environments/staging/applicationset.yaml)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;NOTE: The ESO operator ApplicationSet (&lt;code&gt;infrastructure/eso/applicationset.yaml&lt;/code&gt;)&lt;br&gt;
already includes staging clusters. Once ESO is running on staging and the&lt;br&gt;
ExternalSecret IRSA role is set, ExternalSecrets will sync automatically.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  1.4 Fluent Bit IRSA (all 6 clusters)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terragrunt apply &lt;span class="nt"&gt;--terragrunt-working-dir&lt;/span&gt; live/production/us-east-1/fluent-bit-irsa
&lt;span class="c"&gt;# → role_arn for myapp-production-use1&lt;/span&gt;

terragrunt apply &lt;span class="nt"&gt;--terragrunt-working-dir&lt;/span&gt; live/production/us-west-2/fluent-bit-irsa
&lt;span class="c"&gt;# → role_arn for myapp-production-usw2&lt;/span&gt;

terragrunt apply &lt;span class="nt"&gt;--terragrunt-working-dir&lt;/span&gt; live/staging/us-east-1/fluent-bit-irsa
&lt;span class="c"&gt;# → role_arn for myapp-staging-use1&lt;/span&gt;

terragrunt apply &lt;span class="nt"&gt;--terragrunt-working-dir&lt;/span&gt; live/staging/us-west-2/fluent-bit-irsa
&lt;span class="c"&gt;# → role_arn for myapp-staging-usw2&lt;/span&gt;

terragrunt apply &lt;span class="nt"&gt;--terragrunt-working-dir&lt;/span&gt; live/dev/us-east-1/fluent-bit-irsa
&lt;span class="c"&gt;# → role_arn for myapp-dev-use1&lt;/span&gt;

terragrunt apply &lt;span class="nt"&gt;--terragrunt-working-dir&lt;/span&gt; live/dev/us-west-2/fluent-bit-irsa
&lt;span class="c"&gt;# → role_arn for myapp-dev-usw2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  1.5 Karpenter (production only)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terragrunt apply &lt;span class="nt"&gt;--terragrunt-working-dir&lt;/span&gt; live/production/us-east-1/karpenter
&lt;span class="c"&gt;# Outputs:&lt;/span&gt;
&lt;span class="c"&gt;#   controller_role_arn   → for karpenter applicationset.yaml&lt;/span&gt;
&lt;span class="c"&gt;#   node_role_arn         → for verification (name = myapp-production-use1-karpenter-node)&lt;/span&gt;
&lt;span class="c"&gt;#   node_instance_profile → for verification&lt;/span&gt;
&lt;span class="c"&gt;#   interruption_queue_name → should be "myapp-production-use1-karpenter"&lt;/span&gt;

terragrunt apply &lt;span class="nt"&gt;--terragrunt-working-dir&lt;/span&gt; live/production/us-west-2/karpenter
&lt;span class="c"&gt;# Outputs same structure for usw2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;The &lt;code&gt;nodeRoleName&lt;/code&gt; values in &lt;code&gt;karpenter/nodepool-applicationset.yaml&lt;/code&gt; are&lt;br&gt;
pre-set to &lt;code&gt;myapp-production-use1-karpenter-node&lt;/code&gt; and &lt;code&gt;myapp-production-usw2-karpenter-node&lt;/code&gt;.&lt;br&gt;
These match what Terraform creates, so no update is needed there.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  1.6 Velero (all 6 clusters)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Production&lt;/span&gt;
terragrunt apply &lt;span class="nt"&gt;--terragrunt-working-dir&lt;/span&gt; live/production/us-east-1/velero
&lt;span class="c"&gt;# → role_arn for myapp-production-use1&lt;/span&gt;

terragrunt apply &lt;span class="nt"&gt;--terragrunt-working-dir&lt;/span&gt; live/production/us-west-2/velero
&lt;span class="c"&gt;# → role_arn for myapp-production-usw2&lt;/span&gt;

&lt;span class="c"&gt;# Staging&lt;/span&gt;
terragrunt apply &lt;span class="nt"&gt;--terragrunt-working-dir&lt;/span&gt; live/staging/us-east-1/velero
&lt;span class="c"&gt;# → role_arn for myapp-staging-use1&lt;/span&gt;

terragrunt apply &lt;span class="nt"&gt;--terragrunt-working-dir&lt;/span&gt; live/staging/us-west-2/velero
&lt;span class="c"&gt;# → role_arn for myapp-staging-usw2&lt;/span&gt;

&lt;span class="c"&gt;# Dev&lt;/span&gt;
terragrunt apply &lt;span class="nt"&gt;--terragrunt-working-dir&lt;/span&gt; live/dev/us-east-1/velero
&lt;span class="c"&gt;# → role_arn for myapp-dev-use1&lt;/span&gt;

terragrunt apply &lt;span class="nt"&gt;--terragrunt-working-dir&lt;/span&gt; live/dev/us-west-2/velero
&lt;span class="c"&gt;# → role_arn for myapp-dev-usw2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  PHASE 2 — Update GitOps ARNs
&lt;/h2&gt;

&lt;p&gt;After collecting all outputs from Phase 1, update the GitOps repo&lt;br&gt;
(&lt;code&gt;myapp-gitops/&lt;/code&gt;) and push.&lt;/p&gt;
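
&lt;p&gt;If an output was not copied during Phase 1, it can be read back without re-applying. The sketch below wraps the &lt;code&gt;terragrunt output&lt;/code&gt; command for several directories at once. &lt;code&gt;print_outputs&lt;/code&gt; is a hypothetical helper name, and it assumes the command is run from &lt;code&gt;myapp-infra/&lt;/code&gt;.&lt;/p&gt;

```shell
# Hypothetical helper: print a named Terraform output for each working
# directory, prefixed with the directory it came from.
print_outputs() {
  output_name="$1"; shift
  for dir in "$@"; do
    printf '%s: ' "$dir"
    terragrunt output "$output_name" --terragrunt-working-dir "$dir"
  done
}

# Example usage (run from myapp-infra/):
# print_outputs role_arn \
#   live/staging/us-east-1/eso-irsa live/staging/us-west-2/eso-irsa
```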
&lt;h3&gt;
  
  
  2.1 Production WAF ARNs
&lt;/h3&gt;

&lt;p&gt;Edit &lt;code&gt;environments/production/applicationset.yaml&lt;/code&gt; — replace &lt;code&gt;"PENDING"&lt;/code&gt; with&lt;br&gt;
real WAF ACL ARNs from Step 1.1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;elements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp-production-use1&lt;/span&gt;
    &lt;span class="s"&gt;...&lt;/span&gt;
    &lt;span class="na"&gt;wafAclArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:wafv2:us-east-1:591120834781:regional/webacl/myapp-production-use1-waf/XXXXXXXX"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp-production-usw2&lt;/span&gt;
    &lt;span class="s"&gt;...&lt;/span&gt;
    &lt;span class="na"&gt;wafAclArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:wafv2:us-west-2:591120834781:regional/webacl/myapp-production-usw2-waf/XXXXXXXX"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2.2 Staging ESO IRSA ARNs
&lt;/h3&gt;

&lt;p&gt;Edit &lt;code&gt;environments/staging/applicationset.yaml&lt;/code&gt; — replace &lt;code&gt;"PENDING"&lt;/code&gt; with&lt;br&gt;
role ARNs from Step 1.3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;elements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp-staging-use1&lt;/span&gt;
    &lt;span class="s"&gt;...&lt;/span&gt;
    &lt;span class="na"&gt;irsaRoleArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::690687753178:role/myapp-staging-use1-eso"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp-staging-usw2&lt;/span&gt;
    &lt;span class="s"&gt;...&lt;/span&gt;
    &lt;span class="na"&gt;irsaRoleArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::690687753178:role/myapp-staging-usw2-eso"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2.3 Fluent Bit IRSA ARNs
&lt;/h3&gt;

&lt;p&gt;Edit &lt;code&gt;infrastructure/logging/applicationset.yaml&lt;/code&gt; — replace all 6 &lt;code&gt;"PENDING"&lt;/code&gt; values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;elements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster: myapp-production-use1  roleArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::591120834781:role/myapp-production-use1-fluent-bit"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster: myapp-production-usw2  roleArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::591120834781:role/myapp-production-usw2-fluent-bit"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster: myapp-staging-use1     roleArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::690687753178:role/myapp-staging-use1-fluent-bit"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster: myapp-staging-usw2     roleArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::690687753178:role/myapp-staging-usw2-fluent-bit"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster: myapp-dev-use1         roleArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::557702566877:role/myapp-dev-use1-fluent-bit"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster: myapp-dev-usw2         roleArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::557702566877:role/myapp-dev-usw2-fluent-bit"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;TIP: Role names follow the pattern &lt;code&gt;{cluster_name}-fluent-bit&lt;/code&gt;. Verify with&lt;br&gt;
&lt;code&gt;terragrunt output role_arn&lt;/code&gt; in each fluent-bit-irsa directory.&lt;/p&gt;
&lt;/blockquote&gt;
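
&lt;p&gt;Because the role names follow a fixed pattern, the ARN that belongs in the ApplicationSet can be predicted and compared with the value Terragrunt reports. The following is a minimal sketch of that comparison; both function names are hypothetical, and the account ID is passed in by hand.&lt;/p&gt;

```shell
# Hypothetical helper: build the ARN that the naming pattern predicts.
expected_role_arn() {
  account_id="$1"; cluster="$2"; component="$3"
  printf 'arn:aws:iam::%s:role/%s-%s\n' "$account_id" "$cluster" "$component"
}

# Compare the predicted ARN with a value copied from `terragrunt output role_arn`.
# Arguments: account ID, cluster name, component suffix, actual ARN.
check_role_arn() {
  expected="$(expected_role_arn "$1" "$2" "$3")"
  if [ "$expected" = "$4" ]; then
    echo "OK $2"
  else
    echo "MISMATCH $2: expected $expected, got $4"
    return 1
  fi
}

# Example with the dev us-east-1 Fluent Bit role shown above.
check_role_arn 557702566877 myapp-dev-use1 fluent-bit \
  "arn:aws:iam::557702566877:role/myapp-dev-use1-fluent-bit"
```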

&lt;h3&gt;
  
  
  2.4 Karpenter Controller Role ARNs
&lt;/h3&gt;

&lt;p&gt;Edit &lt;code&gt;infrastructure/karpenter/applicationset.yaml&lt;/code&gt; — replace 2 &lt;code&gt;"PENDING"&lt;/code&gt; values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;elements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster: myapp-production-use1  controllerRole&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::591120834781:role/myapp-production-use1-karpenter"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster: myapp-production-usw2  controllerRole&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::591120834781:role/myapp-production-usw2-karpenter"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2.5 Velero Role ARNs
&lt;/h3&gt;

&lt;p&gt;Edit &lt;code&gt;infrastructure/velero/applicationset.yaml&lt;/code&gt; — replace all 6 &lt;code&gt;"PENDING"&lt;/code&gt; values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;elements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster: myapp-production-use1  roleArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::591120834781:role/myapp-production-use1-velero"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster: myapp-production-usw2  roleArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::591120834781:role/myapp-production-usw2-velero"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster: myapp-staging-use1     roleArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::690687753178:role/myapp-staging-use1-velero"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster: myapp-staging-usw2     roleArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::690687753178:role/myapp-staging-usw2-velero"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster: myapp-dev-use1         roleArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::557702566877:role/myapp-dev-use1-velero"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster: myapp-dev-usw2         roleArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::557702566877:role/myapp-dev-usw2-velero"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2.6 Slack Webhooks + Grafana Password
&lt;/h3&gt;

&lt;p&gt;Edit &lt;code&gt;infrastructure/monitoring/prometheus-values.yaml&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replace both &lt;code&gt;https://hooks.slack.com/services/CHANGE_ME&lt;/code&gt; with real Slack
incoming webhook URLs&lt;/li&gt;
&lt;li&gt;Replace &lt;code&gt;change-me-grafana&lt;/code&gt; with a real password (or use an ExternalSecret)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2.7 Commit + Push GitOps changes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;myapp-gitops
git add environments/ infrastructure/
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"chore: fill in real ARNs from terraform outputs"&lt;/span&gt;
git push origin HEAD:main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
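
&lt;p&gt;Before committing, it may be worth confirming that no placeholder from steps 2.1 to 2.6 is left behind. The sketch below searches for the three placeholder strings named in this phase; &lt;code&gt;check_placeholders&lt;/code&gt; is a hypothetical helper, not something that ships with the repo.&lt;/p&gt;

```shell
# Hypothetical pre-commit check: list files that still contain one of the
# placeholder strings used in Phase 2 ("PENDING", CHANGE_ME, change-me-grafana).
check_placeholders() {
  status=0
  grep -rlE '"PENDING"|CHANGE_ME|change-me-grafana' "$@" || status=$?
  case "$status" in
    0) echo "Placeholders remain in the files listed above."; return 1 ;;
    1) echo "No placeholders found."; return 0 ;;
    *) echo "grep failed; check the directory arguments."; return 2 ;;
  esac
}

# Example usage (run from the myapp-gitops checkout):
# check_placeholders environments infrastructure
```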



&lt;h3&gt;
  
  
  2.8 Create staging Secrets Manager secret
&lt;/h3&gt;

&lt;p&gt;Run this once to seed the staging ExternalSecret:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;AWS_PROFILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;myapp-staging-use1 aws secretsmanager create-secret &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; staging/myapp/db-password &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--secret-string&lt;/span&gt; &lt;span class="s1"&gt;'{"password":"change-me-staging"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1

&lt;span class="nv"&gt;AWS_PROFILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;myapp-staging-usw2 aws secretsmanager create-secret &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; staging/myapp/db-password &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--secret-string&lt;/span&gt; &lt;span class="s1"&gt;'{"password":"change-me-staging"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-west-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
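
&lt;p&gt;&lt;code&gt;create-secret&lt;/code&gt; fails if the secret already exists, so a rerun of the step above will error. One way to make it safe to repeat is to look the secret up first. This is only a sketch: &lt;code&gt;ensure_secret&lt;/code&gt; is a hypothetical wrapper, and it treats any &lt;code&gt;describe-secret&lt;/code&gt; failure as "secret missing", which is an assumption that will not hold for permission or network errors.&lt;/p&gt;

```shell
# Hypothetical wrapper: create the secret only when describe-secret cannot
# find it. Arguments: AWS profile, region, secret name, secret string.
ensure_secret() {
  profile="$1"; region="$2"; name="$3"; value="$4"
  if AWS_PROFILE="$profile" aws secretsmanager describe-secret \
       --secret-id "$name" --region "$region" --query Name --output text; then
    echo "exists: $name ($region)"
  else
    AWS_PROFILE="$profile" aws secretsmanager create-secret \
      --name "$name" --secret-string "$value" --region "$region"
  fi
}

# Example usage:
# ensure_secret myapp-staging-use1 us-east-1 staging/myapp/db-password \
#   '{"password":"change-me-staging"}'
```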






&lt;h2&gt;
  
  
  PHASE 3 — ArgoCD Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 Bootstrap ArgoCD (App of Apps)
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;argocd/&lt;/code&gt; directory in &lt;code&gt;myapp-gitops&lt;/code&gt; now contains the AppProject and a&lt;br&gt;
bootstrap Application. Apply the bootstrap once — after that ArgoCD manages&lt;br&gt;
itself and will also pick up the AppProject automatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Point kubectl at production cluster (where ArgoCD runs)&lt;/span&gt;
kubectl config use-context myapp-production-use1

&lt;span class="nb"&gt;cd &lt;/span&gt;myapp-gitops

&lt;span class="c"&gt;# One-time bootstrap — creates the self-managing Application&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; argocd/bootstrap.yaml &lt;span class="nt"&gt;-n&lt;/span&gt; argocd

&lt;span class="c"&gt;# ArgoCD will now sync argocd/project-production.yaml automatically.&lt;/span&gt;
&lt;span class="c"&gt;# Watch until it's healthy:&lt;/span&gt;
argocd app &lt;span class="nb"&gt;wait &lt;/span&gt;bootstrap &lt;span class="nt"&gt;--health&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;The &lt;code&gt;argocd/project-production.yaml&lt;/code&gt; AppProject already includes every&lt;br&gt;
namespace and source repo needed by all components. No &lt;code&gt;kubectl patch&lt;/code&gt; needed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  3.2 Apply new ApplicationSets to ArgoCD
&lt;/h3&gt;

&lt;p&gt;After the bootstrap Application syncs (it only manages the &lt;code&gt;argocd/&lt;/code&gt; directory),&lt;br&gt;
apply the infrastructure ApplicationSets manually once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;myapp-gitops

kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/eso/applicationset.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/monitoring/applicationset.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/monitoring/alert-rules-applicationset.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/logging/applicationset.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/karpenter/applicationset.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/karpenter/nodepool-applicationset.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/velero/applicationset.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/falco/applicationset.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/argo-rollouts/applicationset.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;After this, ArgoCD self-manages all ApplicationSets via the automated sync&lt;br&gt;
on the generated Applications.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  PHASE 4 — ArgoCD Sync Order (Production)
&lt;/h2&gt;

&lt;p&gt;Sync in this exact order to respect CRD dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Step 1: Prometheus stack (creates CRDs for PrometheusRule, ServiceMonitor, etc.)&lt;/span&gt;
argocd app &lt;span class="nb"&gt;sync &lt;/span&gt;prometheus-myapp-production-use1 prometheus-myapp-production-usw2
argocd app &lt;span class="nb"&gt;wait &lt;/span&gt;prometheus-myapp-production-use1 &lt;span class="nt"&gt;--health&lt;/span&gt;
argocd app &lt;span class="nb"&gt;wait &lt;/span&gt;prometheus-myapp-production-usw2 &lt;span class="nt"&gt;--health&lt;/span&gt;

&lt;span class="c"&gt;# Step 2: Alert rules (needs Prometheus CRDs)&lt;/span&gt;
argocd app &lt;span class="nb"&gt;sync &lt;/span&gt;alert-rules-myapp-production-use1 alert-rules-myapp-production-usw2

&lt;span class="c"&gt;# Step 3: Parallel infra components (no inter-dependency)&lt;/span&gt;
argocd app &lt;span class="nb"&gt;sync&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  fluent-bit-myapp-production-use1 fluent-bit-myapp-production-usw2 &lt;span class="se"&gt;\&lt;/span&gt;
  velero-myapp-production-use1 velero-myapp-production-usw2 &lt;span class="se"&gt;\&lt;/span&gt;
  falco-myapp-production-use1 falco-myapp-production-usw2

&lt;span class="c"&gt;# Step 4: Karpenter controller (needs ECR access to pull image from public.ecr.aws)&lt;/span&gt;
argocd app &lt;span class="nb"&gt;sync &lt;/span&gt;karpenter-myapp-production-use1 karpenter-myapp-production-usw2
argocd app &lt;span class="nb"&gt;wait &lt;/span&gt;karpenter-myapp-production-use1 &lt;span class="nt"&gt;--health&lt;/span&gt;

&lt;span class="c"&gt;# Step 5: Karpenter NodePools (needs Karpenter CRDs installed by Step 4)&lt;/span&gt;
argocd app &lt;span class="nb"&gt;sync &lt;/span&gt;karpenter-nodepool-myapp-production-use1 karpenter-nodepool-myapp-production-usw2

&lt;span class="c"&gt;# Step 6: Argo Rollouts controller&lt;/span&gt;
argocd app &lt;span class="nb"&gt;sync &lt;/span&gt;argo-rollouts-myapp-production-use1 argo-rollouts-myapp-production-usw2
argocd app &lt;span class="nb"&gt;wait &lt;/span&gt;argo-rollouts-myapp-production-use1 &lt;span class="nt"&gt;--health&lt;/span&gt;

&lt;span class="c"&gt;# Step 7: App (uses Rollout CR — needs argo-rollouts controller running)&lt;/span&gt;
argocd app &lt;span class="nb"&gt;sync &lt;/span&gt;myapp-production-myapp-production-use1 myapp-production-myapp-production-usw2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
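
&lt;p&gt;The ordering above repeats one pattern: sync a group, then wait for it to become healthy before starting the group that depends on it. A small wrapper can keep that pairing consistent across both regions. &lt;code&gt;sync_and_wait&lt;/code&gt; is a hypothetical name; the sketch uses only the &lt;code&gt;argocd app sync&lt;/code&gt; and &lt;code&gt;argocd app wait --health&lt;/code&gt; commands already shown.&lt;/p&gt;

```shell
# Hypothetical wrapper: sync a group of Applications, then block until every
# one of them reports Healthy. Stops early if the sync itself fails.
sync_and_wait() {
  argocd app sync "$@" || return 1
  argocd app wait "$@" --health
}

# Example usage, mirroring Steps 1 and 4:
# sync_and_wait prometheus-myapp-production-use1 prometheus-myapp-production-usw2
# sync_and_wait karpenter-myapp-production-use1 karpenter-myapp-production-usw2
```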



&lt;h3&gt;
  
  
  Staging sync (can run in parallel with production steps 3+)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;argocd app &lt;span class="nb"&gt;sync&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  eso-myapp-staging-use1 eso-myapp-staging-usw2 &lt;span class="se"&gt;\&lt;/span&gt;
  fluent-bit-myapp-staging-use1 fluent-bit-myapp-staging-usw2 &lt;span class="se"&gt;\&lt;/span&gt;
  velero-myapp-staging-use1 velero-myapp-staging-usw2 &lt;span class="se"&gt;\&lt;/span&gt;
  falco-myapp-staging-use1 falco-myapp-staging-usw2 &lt;span class="se"&gt;\&lt;/span&gt;
  prometheus-myapp-staging-use1 prometheus-myapp-staging-usw2

&lt;span class="c"&gt;# After staging ESO is healthy, ExternalSecrets will sync automatically&lt;/span&gt;
argocd app &lt;span class="nb"&gt;sync &lt;/span&gt;myapp-staging-myapp-staging-use1 myapp-staging-myapp-staging-usw2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  PHASE 5 — Verification
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Monitoring
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; monitoring &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1
kubectl get prometheusrule &lt;span class="nt"&gt;-n&lt;/span&gt; monitoring &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1
kubectl get alertmanager &lt;span class="nt"&gt;-n&lt;/span&gt; monitoring &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1
&lt;span class="c"&gt;# Access Grafana: kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Logging
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; logging &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1
&lt;span class="c"&gt;# Verify log groups were created:&lt;/span&gt;
&lt;span class="nv"&gt;AWS_PROFILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;myapp-production-use1 aws logs describe-log-groups &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--log-group-name-prefix&lt;/span&gt; /eks/myapp-production-use1 &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Karpenter
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; karpenter &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1
kubectl get nodepool &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1
kubectl get ec2nodeclass &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1
&lt;span class="c"&gt;# Trigger a scale test:&lt;/span&gt;
kubectl scale deploy/stress &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50 &lt;span class="nt"&gt;-n&lt;/span&gt; default &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1
kubectl get nodes &lt;span class="nt"&gt;-w&lt;/span&gt; &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Velero
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; velero &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1
kubectl get schedule &lt;span class="nt"&gt;-n&lt;/span&gt; velero &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1
&lt;span class="c"&gt;# Trigger manual backup:&lt;/span&gt;
velero backup create manual-test &lt;span class="nt"&gt;--kubecontext&lt;/span&gt; myapp-production-use1
velero backup describe manual-test &lt;span class="nt"&gt;--kubecontext&lt;/span&gt; myapp-production-use1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Falco
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; falco &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1
&lt;span class="c"&gt;# Check CloudWatch for events:&lt;/span&gt;
&lt;span class="nv"&gt;AWS_PROFILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;myapp-production-use1 aws logs describe-log-groups &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--log-group-name-prefix&lt;/span&gt; /falco &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Argo Rollouts (canary deploy)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get rollout &lt;span class="nt"&gt;-n&lt;/span&gt; production &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1
kubectl argo rollouts get rollout myapp-production-use1-myapp &lt;span class="nt"&gt;-n&lt;/span&gt; production &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1 &lt;span class="nt"&gt;--watch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ESO Staging
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get externalsecret &lt;span class="nt"&gt;-n&lt;/span&gt; staging &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-staging-use1
kubectl describe externalsecret myapp-staging-use1-myapp-secrets &lt;span class="nt"&gt;-n&lt;/span&gt; staging &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-staging-use1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  WAF
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;AWS_PROFILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;myapp-production-use1 aws wafv2 list-web-acls &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--scope&lt;/span&gt; REGIONAL &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 | &lt;span class="nb"&gt;grep &lt;/span&gt;myapp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  GuardDuty
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;AWS_PROFILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;myapp-production-use1 aws guardduty list-detectors &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1
&lt;span class="nv"&gt;AWS_PROFILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;myapp-production-usw2 aws guardduty list-detectors &lt;span class="nt"&gt;--region&lt;/span&gt; us-west-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
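
&lt;p&gt;The per-component &lt;code&gt;kubectl get pods&lt;/code&gt; checks above can be folded into a single sweep that only prints pods needing attention. A minimal sketch, assuming the default &lt;code&gt;kubectl get pods&lt;/code&gt; column order (name, ready, status); &lt;code&gt;not_ready_pods&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```shell
# Hypothetical helper: print "name status" for every pod in a namespace whose
# STATUS column is neither Running nor Completed.
# Arguments: namespace, kubectl context.
not_ready_pods() {
  kubectl get pods -n "$1" --context "$2" --no-headers |
    awk '$3 !~ /^(Running|Completed)$/ { print $1, $3 }'
}

# Example usage across the namespaces checked above:
# for ns in monitoring logging karpenter velero falco; do
#   not_ready_pods "$ns" myapp-production-use1
# done
```

&lt;p&gt;Empty output means every pod in that namespace is Running or Completed. It does not check readiness counts, so the individual commands above are still useful for a closer look.&lt;/p&gt;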






&lt;h2&gt;
  
  
  Troubleshooting Notes
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Karpenter fails to pull image&lt;/td&gt;
&lt;td&gt;The Karpenter controller image is hosted at &lt;code&gt;public.ecr.aws/karpenter/karpenter&lt;/code&gt;. Either pull from &lt;code&gt;public.ecr.aws&lt;/code&gt; directly, or configure an ECR pull-through cache and make sure the node IAM role is allowed to pull from it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Falco &lt;code&gt;modern_ebpf&lt;/code&gt; not supported&lt;/td&gt;
&lt;td&gt;Some EKS AMIs and kernel versions don't support the modern eBPF probe. Fall back to &lt;code&gt;driver.kind: ebpf&lt;/code&gt; or &lt;code&gt;driver.kind: module&lt;/code&gt; in &lt;code&gt;infrastructure/falco/values.yaml&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Velero backup fails&lt;/td&gt;
&lt;td&gt;Ensure the S3 bucket lifecycle rule and encryption config have been applied. Check that the IRSA trust policy &lt;code&gt;sub&lt;/code&gt; matches &lt;code&gt;system:serviceaccount:velero:velero&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alert rules not picked up&lt;/td&gt;
&lt;td&gt;The PrometheusRule must have label &lt;code&gt;release: kube-prometheus-stack&lt;/code&gt; (already set in &lt;code&gt;alert-rules.yaml&lt;/code&gt;). Verify with &lt;code&gt;kubectl get prometheusrule -n monitoring -o yaml&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rollout stuck at 20%&lt;/td&gt;
&lt;td&gt;Check AnalysisTemplate — if &lt;code&gt;myapp_http_requests_total&lt;/code&gt; metric doesn't exist yet (app not instrumented), the analysis will fail. Set &lt;code&gt;failureLimit: 3&lt;/code&gt; or temporarily disable analysis by removing the &lt;code&gt;analysis&lt;/code&gt; step from the canary steps.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Karpenter NodePool not scheduling&lt;/td&gt;
&lt;td&gt;Verify subnet and SG tags: &lt;code&gt;aws ec2 describe-subnets --filters "Name=tag:karpenter.sh/discovery,Values=myapp-production-use1"&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;







&lt;h2&gt;
  
  
  Day-2 Operations Runbook
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;For:&lt;/strong&gt; Anyone operating this pipeline after initial setup is complete&lt;br&gt;
&lt;strong&gt;Live system:&lt;/strong&gt; &lt;code&gt;https://api.matthewoladipupo.dev/health&lt;/code&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;
&lt;h3&gt;
  
  
  URLs and Credentials
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;URL&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Application&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://api.matthewoladipupo.dev/health&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Public&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ArgoCD UI&lt;/td&gt;
&lt;td&gt;&lt;code&gt;http://a0c3c1ea43b294c4d8f5c2a7c514f6f2-1678928976.us-east-1.elb.amazonaws.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;admin / see Secrets Manager&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grafana&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://grafana.matthewoladipupo.dev&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;admin / see Secrets Manager&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS SSO Portal&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://d-9a6757fb3c.awsapps.com/start&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;IAM Identity Center&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Cluster → Profile Map
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cluster&lt;/th&gt;
&lt;th&gt;kubectl context&lt;/th&gt;
&lt;th&gt;AWS Profile&lt;/th&gt;
&lt;th&gt;Endpoint&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;myapp-production-use1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;myapp-production-use1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;myapp-prod-use1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;private&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;myapp-production-usw2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;myapp-production-usw2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;myapp-prod-usw2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;private&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;myapp-staging-use1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;myapp-staging-use1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;myapp-staging-use1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;private&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;myapp-staging-usw2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;myapp-staging-usw2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;myapp-staging-usw2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;private&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;myapp-dev-use1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;myapp-dev-use1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;myapp-dev-use1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;public&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;myapp-dev-usw2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;myapp-dev-usw2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;myapp-dev-usw2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;public&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  OPS-1: Start of Session
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Run every time you open a new terminal.&lt;/strong&gt; SSO tokens last 8 hours.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Authenticate&lt;/span&gt;
aws sso login &lt;span class="nt"&gt;--sso-session&lt;/span&gt; admin &lt;span class="nt"&gt;--no-browser&lt;/span&gt;
&lt;span class="c"&gt;# → with --no-browser the CLI prints a URL and code → open the URL yourself → click Allow → wait for "Successfully logged in"&lt;/span&gt;

&lt;span class="c"&gt;# 2. Verify&lt;/span&gt;
aws sts get-caller-identity &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-prod-use1
&lt;span class="c"&gt;# → should return Account: 591120834781&lt;/span&gt;

&lt;span class="c"&gt;# 3. Quick health check&lt;/span&gt;
curl https://api.matthewoladipupo.dev/health
&lt;span class="c"&gt;# → {"status":"healthy","region":"us-east-1"}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;If you see &lt;code&gt;Token has expired and refresh failed&lt;/code&gt; at any point, re-run step 1.&lt;/p&gt;
&lt;/blockquote&gt;
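&lt;p&gt;Step 3 can be scripted so that an unexpected response fails loudly instead of being checked by eye. The sketch below is illustrative only: &lt;code&gt;is_healthy&lt;/code&gt; is a hypothetical helper name, and it assumes the endpoint keeps returning the compact JSON shown in step 3.&lt;/p&gt;

```shell
# Hypothetical helper: succeed only if the response body reports "healthy".
# Assumes the compact JSON form {"status":"healthy",...} with no extra spaces.
is_healthy() {
  case "$1" in
    *'"status":"healthy"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example with a captured response body (no network call is made here).
body='{"status":"healthy","region":"us-east-1"}'
if is_healthy "$body"; then
  echo "health check passed"
else
  echo "health check FAILED"
fi
```

&lt;p&gt;In a live session the body would come from &lt;code&gt;curl -fsS https://api.matthewoladipupo.dev/health&lt;/code&gt;. The pattern match is deliberately strict, so a response with different spacing would need a real JSON parser instead.&lt;/p&gt;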




&lt;h2&gt;
  
  
  OPS-2: Accessing Private Clusters (Production + Staging)
&lt;/h2&gt;

&lt;p&gt;Production and staging clusters have &lt;code&gt;endpointPublicAccess: false&lt;/code&gt;. You must temporarily enable public access, do your work, then lock it back. &lt;strong&gt;Never leave production with a public endpoint.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
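&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable (wait ~3 minutes after running this)&lt;/span&gt;
&lt;span class="c"&gt;# Safer: replace 0.0.0.0/0 below with your own egress IP as a /32&lt;/span&gt;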
aws eks update-cluster-config &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; myapp-production-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-prod-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resources-vpc-config&lt;/span&gt; &lt;span class="nv"&gt;endpointPublicAccess&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;,endpointPrivateAccess&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;,publicAccessCidrs&lt;span class="o"&gt;=&lt;/span&gt;0.0.0.0/0

&lt;span class="c"&gt;# Confirm it's ready&lt;/span&gt;
aws eks describe-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; myapp-production-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-prod-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'cluster.resourcesVpcConfig.endpointPublicAccess'&lt;/span&gt;
&lt;span class="c"&gt;# Must return: true&lt;/span&gt;

&lt;span class="c"&gt;# --- Do your kubectl work here ---&lt;/span&gt;

&lt;span class="c"&gt;# Lock back immediately after&lt;/span&gt;
aws eks update-cluster-config &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; myapp-production-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-prod-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resources-vpc-config&lt;/span&gt; &lt;span class="nv"&gt;endpointPublicAccess&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;,endpointPrivateAccess&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;myapp-production-use1&lt;/code&gt; / &lt;code&gt;myapp-prod-use1&lt;/code&gt; / &lt;code&gt;us-east-1&lt;/code&gt; with the appropriate values for other private clusters.&lt;/p&gt;
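&lt;p&gt;Because forgetting the lock-back step is the main risk here, one option is to wrap the work in a function that re-locks the endpoint on exit, whether the work succeeds or not. This is a minimal sketch, not part of the pipeline: &lt;code&gt;set_public_access&lt;/code&gt;, &lt;code&gt;with_public_endpoint&lt;/code&gt; and the &lt;code&gt;AWS_CMD&lt;/code&gt; override are hypothetical names, it assumes &lt;code&gt;aws eks wait cluster-active&lt;/code&gt; is an acceptable way to wait for the config update, and it leaves &lt;code&gt;publicAccessCidrs&lt;/code&gt; at its current value.&lt;/p&gt;

```shell
# Hypothetical wrapper: enable the public endpoint, run a command,
# and always lock the endpoint again afterwards.
AWS_CMD="${AWS_CMD:-aws}"   # set AWS_CMD=echo for a dry run

set_public_access() {
  # $1 = cluster, $2 = region, $3 = profile, $4 = true or false
  "$AWS_CMD" eks update-cluster-config \
    --name "$1" --region "$2" --profile "$3" \
    --resources-vpc-config "endpointPublicAccess=$4,endpointPrivateAccess=true"
}

with_public_endpoint() {
  cluster="$1"; region="$2"; profile="$3"
  shift 3
  (
    # The EXIT trap runs when this subshell ends, whatever the exit status.
    trap 'set_public_access "$cluster" "$region" "$profile" false' EXIT
    set_public_access "$cluster" "$region" "$profile" true
    # Block until the cluster reports ACTIVE again after the config update.
    "$AWS_CMD" eks wait cluster-active \
      --name "$cluster" --region "$region" --profile "$profile"
    "$@"
  )
}
```

&lt;p&gt;A call would look like &lt;code&gt;with_public_endpoint myapp-production-use1 us-east-1 myapp-prod-use1 kubectl get nodes --context myapp-production-use1&lt;/code&gt;. Running it first with &lt;code&gt;AWS_CMD=echo&lt;/code&gt; prints the AWS calls without executing them.&lt;/p&gt;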




&lt;h2&gt;
  
  
  OPS-3: Standard Deployment
&lt;/h2&gt;

&lt;p&gt;Normal deployments are fully automated — zero manual steps required:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Developer pushes to &lt;code&gt;main&lt;/code&gt; branch of &lt;code&gt;MatthewDipo/myapp&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;GitHub Actions: test → Trivy scan → build → push ECR → Cosign sign → update gitops values&lt;/li&gt;
&lt;li&gt;ArgoCD detects gitops change within 3 minutes → syncs&lt;/li&gt;
&lt;li&gt;In production: Argo Rollout starts canary (20% traffic for 5 minutes → analysis → 100%)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Monitor progress:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Watch rollout steps (enable public endpoint first)&lt;/span&gt;
kubectl get rollouts &lt;span class="nt"&gt;-n&lt;/span&gt; production &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1 &lt;span class="nt"&gt;-w&lt;/span&gt;

&lt;span class="c"&gt;# Detailed view (requires kubectl-argo-rollouts plugin)&lt;/span&gt;
kubectl argo rollouts get rollout myapp &lt;span class="nt"&gt;-n&lt;/span&gt; production &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1 &lt;span class="nt"&gt;--watch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  OPS-4: Promote or Abort a Canary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Promote immediately&lt;/strong&gt; (skip the 5-minute pause):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl argo rollouts promote myapp &lt;span class="nt"&gt;-n&lt;/span&gt; production &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Abort&lt;/strong&gt; (shift all traffic back to stable version instantly):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl argo rollouts abort myapp &lt;span class="nt"&gt;-n&lt;/span&gt; production &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1

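&lt;span class="c"&gt;# After abort the rollout shows Degraded. "retry" re-attempts the same update;&lt;/span&gt;
&lt;span class="c"&gt;# to stay on the stable version instead, revert the change in gitops (OPS-5)&lt;/span&gt;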
kubectl argo rollouts retry rollout myapp &lt;span class="nt"&gt;-n&lt;/span&gt; production &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  OPS-5: Roll Back a Deployment
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Option A — GitOps revert (preferred, keeps git history clean):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; /path/to/myapp-gitops
git revert HEAD &lt;span class="nt"&gt;--no-edit&lt;/span&gt;
git push origin main
&lt;span class="c"&gt;# ArgoCD auto-syncs the revert within 3 minutes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Option B — ArgoCD rollback to a previous revision:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
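&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Note: argocd refuses a rollback while automated sync is enabled on the app,&lt;/span&gt;
&lt;span class="c"&gt;# so auto-sync has to be paused first&lt;/span&gt;
argocd app &lt;span class="nb"&gt;history &lt;/span&gt;myapp-production-myapp-production-use1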
argocd app rollback myapp-production-myapp-production-use1 &amp;lt;revision-number&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Option C — Emergency direct image update (bypasses GitOps, use only in outage):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Get previous image tag from ECR&lt;/span&gt;
aws ecr describe-images &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--repository-name&lt;/span&gt; myapp &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-prod-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'sort_by(imageDetails,&amp;amp;imagePushedAt)[-2].imageTags'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; json

&lt;span class="c"&gt;# Force update the rollout&lt;/span&gt;
kubectl argo rollouts &lt;span class="nb"&gt;set &lt;/span&gt;image myapp &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;myapp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;206617159586.dkr.ecr.us-east-1.amazonaws.com/myapp:sha-&amp;lt;previous-sha&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; production &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1

&lt;span class="c"&gt;# After incident: update values-production.yaml in gitops to match, then push&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  OPS-6: Rotate a Secret
&lt;/h2&gt;

&lt;p&gt;The External Secrets Operator (ESO) syncs from AWS Secrets Manager on a 1-hour cycle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — Update value in Secrets Manager:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws secretsmanager put-secret-value &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--secret-id&lt;/span&gt; production/myapp/db-password &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--secret-string&lt;/span&gt; &lt;span class="s1"&gt;'{"password":"new-value-here"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-prod-use1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2 — Force immediate ESO refresh:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable public endpoint first (OPS-2)&lt;/span&gt;
kubectl annotate externalsecret myapp-db-password &lt;span class="se"&gt;\&lt;/span&gt;
  force-sync&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; production &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1 &lt;span class="nt"&gt;--overwrite&lt;/span&gt;

&lt;span class="c"&gt;# Verify&lt;/span&gt;
kubectl get externalsecret myapp-db-password &lt;span class="nt"&gt;-n&lt;/span&gt; production &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1
&lt;span class="c"&gt;# STATUS: SecretSynced&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
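&lt;p&gt;The sync status may take a few seconds to flip after the annotation is applied, so the verify step sometimes needs repeating. A small polling helper can do that; the sketch below is an assumption-laden example (&lt;code&gt;retry_until&lt;/code&gt; is a hypothetical name, not an existing tool) that re-runs any command until it succeeds or the attempts run out.&lt;/p&gt;

```shell
# Hypothetical helper: retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND up to ATTEMPTS times, sleeping DELAY_SECONDS after each failure.
# Returns 0 on the first success, 1 if every attempt fails.
retry_until() {
  attempts="$1"; delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

&lt;p&gt;For the verify step this could be used as &lt;code&gt;retry_until 12 5 sh -c 'kubectl get externalsecret myapp-db-password -n production --context myapp-production-use1 | grep -q SecretSynced'&lt;/code&gt;, which gives the sync roughly a minute before giving up.&lt;/p&gt;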



&lt;p&gt;&lt;strong&gt;Step 3 — Restart pods to pick up new value:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl argo rollouts restart myapp &lt;span class="nt"&gt;-n&lt;/span&gt; production &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  OPS-7: Incident Response
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Falco Security Alert
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Identify what triggered the alert&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; falco daemonset/falco &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s2"&gt;"Warning|Critical"&lt;/span&gt; | &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-20&lt;/span&gt;

&lt;span class="c"&gt;# 2. Inspect the affected pod&lt;/span&gt;
kubectl describe pod &amp;lt;pod-name&amp;gt; &lt;span class="nt"&gt;-n&lt;/span&gt; production &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1
kubectl logs &amp;lt;pod-name&amp;gt; &lt;span class="nt"&gt;-n&lt;/span&gt; production &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1 &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100

&lt;span class="c"&gt;# 3. Preserve logs before deleting the pod&lt;/span&gt;
kubectl logs &amp;lt;pod-name&amp;gt; &lt;span class="nt"&gt;-n&lt;/span&gt; production &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/incident-&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d-%H%M%S&lt;span class="si"&gt;)&lt;/span&gt;.log

&lt;span class="c"&gt;# 4. Contain if confirmed malicious: delete pod (Rollout replaces with clean copy)&lt;/span&gt;
kubectl delete pod &amp;lt;pod-name&amp;gt; &lt;span class="nt"&gt;-n&lt;/span&gt; production &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1

&lt;span class="c"&gt;# 5. Check GuardDuty for correlated findings&lt;/span&gt;
aws guardduty list-findings &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--detector-id&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;aws guardduty list-detectors &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-prod-use1 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'DetectorIds[0]'&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; text&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-prod-use1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Kyverno Blocked a Pod
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
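&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# See the rejection reason. A pod blocked at admission is never created,&lt;/span&gt;
&lt;span class="c"&gt;# so search the namespace events instead of describing the pod&lt;/span&gt;
kubectl get events &lt;span class="nt"&gt;-n&lt;/span&gt; &amp;lt;namespace&amp;gt; &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"admission webhook"&lt;/span&gt;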

&lt;span class="c"&gt;# List policy violations&lt;/span&gt;
kubectl get policyreport &lt;span class="nt"&gt;-n&lt;/span&gt; &amp;lt;namespace&amp;gt; &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common violations and fixes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Policy&lt;/th&gt;
&lt;th&gt;Violation&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;block-privileged&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;privileged: true&lt;/code&gt; in spec&lt;/td&gt;
&lt;td&gt;Remove the privileged flag&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;require-non-root&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Running as root&lt;/td&gt;
&lt;td&gt;Add &lt;code&gt;runAsNonRoot: true&lt;/code&gt;, &lt;code&gt;runAsUser: 1000&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;block-host-path&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;hostPath volume&lt;/td&gt;
&lt;td&gt;Replace with PVC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;require-resource-limits&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No CPU/memory limits&lt;/td&gt;
&lt;td&gt;Add &lt;code&gt;resources.limits&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;verify-image-signature&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Image not Cosign-signed&lt;/td&gt;
&lt;td&gt;Must go through CI/CD pipeline&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Application Down
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check pods&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; production &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1

&lt;span class="c"&gt;# Check pod logs&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; production &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;myapp &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1 &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50

&lt;span class="c"&gt;# Check rollout&lt;/span&gt;
kubectl get rollouts &lt;span class="nt"&gt;-n&lt;/span&gt; production &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1

&lt;span class="c"&gt;# Check HPA — is it maxed out?&lt;/span&gt;
kubectl get hpa &lt;span class="nt"&gt;-n&lt;/span&gt; production &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1

&lt;span class="c"&gt;# Check if Karpenter is provisioning nodes&lt;/span&gt;
kubectl get nodes &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1 &lt;span class="nt"&gt;-w&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  OPS-8: Routine Health Check
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Application&lt;/span&gt;
curl https://api.matthewoladipupo.dev/health

&lt;span class="c"&gt;# ArgoCD: hide apps that are both Synced and Healthy (only the header row left = all good)&lt;/span&gt;
argocd app list | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"Synced.*Healthy"&lt;/span&gt;

&lt;span class="c"&gt;# Enable public endpoint, then:&lt;/span&gt;
kubectl get nodes &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; production &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1
kubectl get hpa &lt;span class="nt"&gt;-n&lt;/span&gt; production &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1
kubectl get externalsecret &lt;span class="nt"&gt;-n&lt;/span&gt; production &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1
kubectl get schedule &lt;span class="nt"&gt;-n&lt;/span&gt; velero &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1

&lt;span class="c"&gt;# Lock endpoint back&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
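&lt;p&gt;The &lt;code&gt;grep -v&lt;/code&gt; check always leaves the header row behind, which makes it awkward to use in a script. As a rough alternative, the filter below skips the header and prints only apps that are not both Synced and Healthy. It is a sketch under stated assumptions: &lt;code&gt;not_green&lt;/code&gt; is a hypothetical name, and it assumes the default &lt;code&gt;argocd app list&lt;/code&gt; column order (NAME, CLUSTER, NAMESPACE, PROJECT, STATUS, HEALTH).&lt;/p&gt;

```shell
# Hypothetical filter for `argocd app list` output read from stdin.
# Assumes the default column order: NAME CLUSTER NAMESPACE PROJECT STATUS HEALTH ...
not_green() {
  awk 'NR != 1 { if ($5 != "Synced" || $6 != "Healthy") print $1, $5, $6 }'
}

# Example with canned rows instead of a live argocd call.
printf '%s\n' \
  'NAME CLUSTER NAMESPACE PROJECT STATUS HEALTH' \
  'falco in-cluster falco default Synced Healthy' \
  'velero in-cluster velero default OutOfSync Degraded' \
  | not_green
```

&lt;p&gt;Against the live hub this would be &lt;code&gt;argocd app list | not_green&lt;/code&gt;; empty output then really does mean every app is Synced and Healthy.&lt;/p&gt;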






&lt;h2&gt;
  
  
  OPS-9: Restore from Velero Backup
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable public endpoint first (OPS-2), then:&lt;/span&gt;

&lt;span class="c"&gt;# List available backups&lt;/span&gt;
kubectl get backups &lt;span class="nt"&gt;-n&lt;/span&gt; velero &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1

&lt;span class="c"&gt;# Restore a namespace&lt;/span&gt;
kubectl create &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1 &lt;span class="nt"&gt;-f&lt;/span&gt; - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d-%H%M&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;
  namespace: velero
spec:
  backupName: &amp;lt;backup-name-from-list&amp;gt;
  includedNamespaces:
    - production
  restorePVs: true
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Watch progress&lt;/span&gt;
kubectl get restore &lt;span class="nt"&gt;-n&lt;/span&gt; velero &lt;span class="nt"&gt;--context&lt;/span&gt; myapp-production-use1 &lt;span class="nt"&gt;-w&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  OPS-10: Common Error Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Token has expired and refresh failed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SSO session expired&lt;/td&gt;
&lt;td&gt;&lt;code&gt;aws sso login --sso-session admin --no-browser&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dial tcp 10.x.x.x:443: i/o timeout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Private cluster endpoint&lt;/td&gt;
&lt;td&gt;Enable public access temporarily (OPS-2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;config profile (X) could not be found&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Wrong profile name&lt;/td&gt;
&lt;td&gt;Use &lt;code&gt;myapp-prod-use1&lt;/code&gt; not &lt;code&gt;production&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;unknown command "argo" for "kubectl"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Plugin not installed&lt;/td&gt;
&lt;td&gt;Install &lt;code&gt;kubectl-argo-rollouts&lt;/code&gt; binary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;SecretSynced: False&lt;/code&gt; on ExternalSecret&lt;/td&gt;
&lt;td&gt;IRSA role or secret missing&lt;/td&gt;
&lt;td&gt;Check IAM role exists, check secret path in Secrets Manager&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pod stuck in &lt;code&gt;Pending&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Karpenter provisioning&lt;/td&gt;
&lt;td&gt;Wait 2 min; check &lt;code&gt;kubectl get nodeclaims&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ArgoCD app &lt;code&gt;OutOfSync&lt;/code&gt; after ESO sync&lt;/td&gt;
&lt;td&gt;ESO writes &lt;code&gt;status.refreshTime&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Known false positive — safe to ignore or force-sync&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>cicd</category>
      <category>devops</category>
      <category>kubernetes</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Part 10: Resilience — Karpenter, HPA, Argo Rollouts, and Velero</title>
      <dc:creator>Matthew</dc:creator>
      <pubDate>Wed, 20 May 2026 08:00:00 +0000</pubDate>
      <link>https://dev.to/matthewdipo/part-10-resilience-karpenter-hpa-argo-rollouts-and-velero-3odg</link>
      <guid>https://dev.to/matthewdipo/part-10-resilience-karpenter-hpa-argo-rollouts-and-velero-3odg</guid>
      <description>&lt;p&gt;&lt;em&gt;Part of the series: Building a Production-Grade DevSecOps Pipeline on AWS&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Resilience engineering is about building systems that degrade gracefully, recover automatically, and deploy safely. This final part brings together four capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Karpenter&lt;/strong&gt; — automatically provisions the right EC2 instances for pending pods, and removes them when no longer needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HPA&lt;/strong&gt; — scales pod replicas based on CPU/memory pressure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Argo Rollouts&lt;/strong&gt; — deploys new versions with controlled canary traffic and automatic rollback on errors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Velero&lt;/strong&gt; — backs up Kubernetes resources and PVC data to S3 for disaster recovery
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────────┐
│  RESILIENCE LAYERS                                                  │
│                                                                     │
│  Traffic Spike                                                      │
│  └─► HPA: scale pods from 3 → 8                                     │
│       └─► Karpenter: provision new nodes to fit pending pods        │
│                                                                     │
│  New Deployment                                                     │
│  └─► Argo Rollouts: 20% canary → analysis → promote or rollback     │
│                                                                     │
│  Cluster Disaster                                                   │
│  └─► Velero: restore from S3 backup to new cluster                  │
└─────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Karpenter — Next-Generation Node Autoscaler
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Karpenter over Cluster Autoscaler?
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Cluster Autoscaler&lt;/th&gt;
&lt;th&gt;Karpenter&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Node selection&lt;/td&gt;
&lt;td&gt;Pre-defined ASG instance types&lt;/td&gt;
&lt;td&gt;Picks cheapest EC2 that fits pod requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spot support&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Native, with fallback ordering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;3–5 minutes&lt;/td&gt;
&lt;td&gt;30–90 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Consolidation&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;Active: moves pods to fewer nodes, terminates unused&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration&lt;/td&gt;
&lt;td&gt;Per-ASG scaling groups&lt;/td&gt;
&lt;td&gt;Declarative NodePool CRD&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Karpenter provisions EC2 instances directly via the EC2 Fleet API, so no Auto Scaling Group is involved in scaling decisions. It picks an instance type sized to your pending pods' resource requests, which keeps spend close to what the workload actually needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installation (Production Only)
&lt;/h3&gt;

&lt;p&gt;Karpenter is deployed only on production clusters where cost optimization and fast scaling matter. Dev/staging use fixed 2-node groups.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/karpenter/applicationset.yaml&lt;/span&gt;
&lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
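  &lt;span class="c1"&gt;# Karpenter charts newer than v0.16 are published only as OCI artifacts&lt;/span&gt;
  &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;public.ecr.aws/karpenter&lt;/span&gt;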
  &lt;span class="na"&gt;chart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;          &lt;span class="s"&gt;karpenter&lt;/span&gt;
  &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.37.0"&lt;/span&gt;
  &lt;span class="na"&gt;helm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;serviceAccount:&lt;/span&gt;
        &lt;span class="s"&gt;annotations:&lt;/span&gt;
          &lt;span class="s"&gt;eks.amazonaws.com/role-arn: "{{karpenterRoleArn}}"&lt;/span&gt;
      &lt;span class="s"&gt;settings:&lt;/span&gt;
        &lt;span class="s"&gt;clusterName: "{{cluster}}"&lt;/span&gt;
        &lt;span class="s"&gt;clusterEndpoint: "{{clusterEndpoint}}"&lt;/span&gt;
        &lt;span class="s"&gt;interruptionQueue: "{{cluster}}-interruption"&lt;/span&gt;
      &lt;span class="s"&gt;controller:&lt;/span&gt;
        &lt;span class="s"&gt;resources:&lt;/span&gt;
          &lt;span class="s"&gt;requests:&lt;/span&gt;
            &lt;span class="s"&gt;cpu: 1&lt;/span&gt;
            &lt;span class="s"&gt;memory: 1Gi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Karpenter IAM
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# _modules/karpenter/main.tf&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role_policy"&lt;/span&gt; &lt;span class="s2"&gt;"karpenter"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"karpenter-policy"&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;karpenter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"EC2Management"&lt;/span&gt;
        &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
        &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="s2"&gt;"ec2:RunInstances"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"ec2:TerminateInstances"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"ec2:DescribeInstances"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"ec2:DescribeInstanceTypes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"ec2:DescribeSubnets"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"ec2:DescribeSecurityGroups"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"ec2:DescribeLaunchTemplates"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"ec2:CreateLaunchTemplate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"ec2:CreateFleet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"ec2:CreateTags"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"ec2:DescribeSpotPriceHistory"&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"*"&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"EKSDescribe"&lt;/span&gt;
        &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
        &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="c1"&gt;# REQUIRED: Karpenter needs this to discover the cluster endpoint and CA&lt;/span&gt;
          &lt;span class="c1"&gt;# Without it Karpenter cannot configure new nodes to join the cluster&lt;/span&gt;
          &lt;span class="s2"&gt;"eks:DescribeCluster"&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:eks:*:${var.account_id}:cluster/${var.cluster_name}"&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"IAMPassRole"&lt;/span&gt;
        &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
        &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"iam:PassRole"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;node_role_arn&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"SQSInterruption"&lt;/span&gt;
        &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
        &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="s2"&gt;"sqs:DeleteMessage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"sqs:GetQueueUrl"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"sqs:ReceiveMessage"&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_sqs_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;interruption&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Lesson learned:&lt;/strong&gt; &lt;code&gt;eks:DescribeCluster&lt;/code&gt; is not optional. Without it, Karpenter cannot discover the cluster endpoint and certificate authority, so newly provisioned nodes cannot join the cluster. Pods remain &lt;code&gt;Pending&lt;/code&gt; indefinitely. Always include this permission.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  NodePool CRD
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/karpenter/nodepools/templates/nodepool.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.sh/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NodePool&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;general&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;provisioner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;nodeClassRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.k8s.aws/v1beta1&lt;/span&gt;
        &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;EC2NodeClass&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;general&lt;/span&gt;

      &lt;span class="na"&gt;requirements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubernetes.io/arch&lt;/span&gt;
          &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
          &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;amd64&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.sh/capacity-type&lt;/span&gt;
          &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
          &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;spot&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;on-demand&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# Prefer spot, fall back to on-demand&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.k8s.aws/instance-category&lt;/span&gt;
          &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
          &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;c&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;m&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;r&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;           &lt;span class="c1"&gt;# Compute-optimized, general-purpose, memory-optimized families&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.k8s.aws/instance-size&lt;/span&gt;
          &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NotIn&lt;/span&gt;
          &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;nano&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;micro&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;small&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# Exclude sizes below medium&lt;/span&gt;

  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;100"&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;400Gi&lt;/span&gt;

  &lt;span class="na"&gt;disruption&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;consolidationPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;WhenUnderutilized&lt;/span&gt;   &lt;span class="c1"&gt;# Actively consolidate underutilized nodes&lt;/span&gt;
    &lt;span class="c1"&gt;# consolidateAfter is omitted: karpenter.sh/v1beta1 only accepts it with consolidationPolicy: WhenEmpty&lt;/span&gt;
    &lt;span class="na"&gt;expireAfter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;720h&lt;/span&gt;   &lt;span class="c1"&gt;# Rotate nodes every 30 days (security hygiene)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
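
&lt;p&gt;To make the &lt;code&gt;requirements&lt;/code&gt; block concrete, here is a minimal Python sketch of how &lt;code&gt;In&lt;/code&gt; and &lt;code&gt;NotIn&lt;/code&gt; narrow a list of candidate instance types. It is a simplified model, not Karpenter's implementation: the &lt;code&gt;labels_for&lt;/code&gt; and &lt;code&gt;matches&lt;/code&gt; helpers and the candidate list are assumptions made for illustration, and the category is naively taken as the first letter of the instance family.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    key: str
    operator: str  # "In" or "NotIn"
    values: tuple


def labels_for(instance_type, arch):
    """Derive simplified labels from a name such as 'm5.large' (hypothetical helper)."""
    family, size = instance_type.split(".")
    return {
        "kubernetes.io/arch": arch,
        "karpenter.k8s.aws/instance-category": family[0],
        "karpenter.k8s.aws/instance-size": size,
    }


def matches(labels, requirements):
    """Return True only if the labels satisfy every requirement."""
    for req in requirements:
        present = labels.get(req.key) in req.values
        if req.operator == "In" and not present:
            return False
        if req.operator == "NotIn" and present:
            return False
    return True


# Mirrors the arch, category, and size requirements of the NodePool above.
REQUIREMENTS = [
    Requirement("kubernetes.io/arch", "In", ("amd64",)),
    Requirement("karpenter.k8s.aws/instance-category", "In", ("c", "m", "r")),
    Requirement("karpenter.k8s.aws/instance-size", "NotIn", ("nano", "micro", "small")),
]

# Illustrative candidates only; the real list comes from the EC2 API.
CANDIDATES = [
    ("m5.large", "amd64"),
    ("t3.medium", "amd64"),   # rejected: category t
    ("c5.xlarge", "amd64"),
    ("m6g.large", "arm64"),   # rejected: arm64
    ("m1.small", "amd64"),    # rejected: size small
    ("r5.2xlarge", "amd64"),
]

allowed = [name for name, arch in CANDIDATES
           if matches(labels_for(name, arch), REQUIREMENTS)]
print(allowed)  # ['m5.large', 'c5.xlarge', 'r5.2xlarge']
```

&lt;p&gt;Every requirement must hold at once, so each extra entry can only shrink the candidate set. Keeping the set broad gives Karpenter more room to find spot capacity.&lt;/p&gt;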



&lt;h3&gt;
  
  
  EC2NodeClass CRD
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.k8s.aws/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;EC2NodeClass&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;general&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;amiFamily&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AL2&lt;/span&gt;   &lt;span class="c1"&gt;# Amazon Linux 2 — EKS-optimized AMI&lt;/span&gt;

  &lt;span class="c1"&gt;# Karpenter discovers subnets and security groups by tags&lt;/span&gt;
  &lt;span class="na"&gt;subnetSelectorTerms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;karpenter.sh/discovery&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{cluster}}"&lt;/span&gt;

  &lt;span class="na"&gt;securityGroupSelectorTerms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;karpenter.sh/discovery&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{cluster}}"&lt;/span&gt;

  &lt;span class="na"&gt;instanceProfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{cluster}}-karpenter-node"&lt;/span&gt;

  &lt;span class="na"&gt;blockDeviceMappings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;deviceName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/dev/xvda&lt;/span&gt;
      &lt;span class="na"&gt;ebs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;volumeSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100Gi&lt;/span&gt;
        &lt;span class="na"&gt;volumeType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gp3&lt;/span&gt;
        &lt;span class="na"&gt;encrypted&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;kmsKeyID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kmsKeyArn}}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Verifying Karpenter
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Watch for new NodeClaims as pods scale up&lt;/span&gt;
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 get nodeclaims &lt;span class="nt"&gt;-w&lt;/span&gt;

&lt;span class="c"&gt;# Check NodePool status&lt;/span&gt;
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 get nodepool general

&lt;span class="c"&gt;# See which nodes Karpenter provisioned vs the initial managed node group&lt;/span&gt;
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 get nodes &lt;span class="nt"&gt;-L&lt;/span&gt; karpenter.sh/capacity-type,node.kubernetes.io/instance-type
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  HPA — Horizontal Pod Autoscaler
&lt;/h2&gt;

&lt;p&gt;HPA reads CPU and memory metrics from metrics-server and adjusts the replica count to keep average utilization near the configured targets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# apps/myapp/templates/hpa.yaml&lt;/span&gt;
&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- if .Values.autoscaling.enabled&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;autoscaling/v2&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HorizontalPodAutoscaler&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;include "myapp.fullname" .&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Release.Namespace&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleTargetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;if .Values.rollout.enabled&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;&lt;span class="s"&gt;argoproj.io/v1alpha1{{ else }}apps/v1{{ end }}&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;if .Values.rollout.enabled&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;&lt;span class="s"&gt;Rollout{{ else }}Deployment{{ end }}&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;include "myapp.fullname" .&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
  &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Values.autoscaling.minReplicas&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Values.autoscaling.maxReplicas&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Resource&lt;/span&gt;
      &lt;span class="na"&gt;resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cpu&lt;/span&gt;
        &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Utilization&lt;/span&gt;
          &lt;span class="na"&gt;averageUtilization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Values.autoscaling.targetCPUUtilizationPercentage | default 70&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Resource&lt;/span&gt;
      &lt;span class="na"&gt;resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;memory&lt;/span&gt;
        &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Utilization&lt;/span&gt;
          &lt;span class="na"&gt;averageUtilization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- end&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Production values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# values-production.yaml&lt;/span&gt;
&lt;span class="na"&gt;autoscaling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;targetCPUUtilizationPercentage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
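
&lt;p&gt;As a rough illustration of what these values mean at runtime, the sketch below applies the scaling formula from the Kubernetes HPA documentation, &lt;code&gt;ceil(currentReplicas * currentMetric / targetMetric)&lt;/code&gt;, to a single metric and clamps the result to the replica bounds. The function name and the utilization figures are assumptions for illustration; the bounds and the 60% target come from the production values above, and the 10% tolerance is the Kubernetes default.&lt;/p&gt;

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas, max_replicas, tolerance=0.1):
    """Apply the documented HPA formula for one metric, then clamp to the bounds."""
    ratio = current_utilization / target_utilization
    # Inside the tolerance band the HPA leaves the replica count alone.
    if tolerance >= abs(ratio - 1.0):
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# Bounds and CPU target from values-production.yaml; utilization figures are made up.
spike = desired_replicas(3, 150, 60, min_replicas=3, max_replicas=10)
calm = desired_replicas(8, 20, 60, min_replicas=3, max_replicas=10)
print(spike, calm)  # 8 3
```

&lt;p&gt;With both a CPU and a memory metric configured, the HPA evaluates each one separately and uses the larger replica count. Scale-down is also delayed by a stabilization window, which this sketch leaves out.&lt;/p&gt;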



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note the conditional &lt;code&gt;scaleTargetRef&lt;/code&gt;:&lt;/strong&gt; In production, HPA targets a &lt;code&gt;Rollout&lt;/code&gt; resource (Argo Rollouts). In dev/staging, it targets a standard &lt;code&gt;Deployment&lt;/code&gt;. The Helm template handles both cases via &lt;code&gt;.Values.rollout.enabled&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  HPA + Karpenter Interaction
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traffic spike arrives
      │
      ▼
HPA: CPU &amp;gt; 60% → scale pods from 3 to 8
      │
      ▼
5 new pods: Pending (not enough node capacity)
      │
      ▼
Karpenter: detects Pending pods → evaluates requests
           → finds cheapest EC2 that fits → provisions 2x m5.large
      │
      ▼ ~60 seconds
New nodes join cluster → pods schedule → Running
      │
Traffic normalizes
      │
HPA: CPU &amp;lt; 60% → scale pods from 8 back to 3
      │
5 pods terminate
      │
Karpenter: 2 nodes underutilized → consolidate → terminate nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
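
&lt;p&gt;The last step of the flow above, deciding that a node can be removed, can be pictured with a toy check: a node is a consolidation candidate if all of its pods fit into the spare capacity of the remaining nodes. The Python sketch below is a deliberately simplified model (CPU only, first-fit placement, hypothetical node data) and is not Karpenter's actual algorithm, which also weighs price, disruption budgets, and scheduling constraints.&lt;/p&gt;

```python
def can_drain(node, others):
    """Return True if every pod on node fits into spare CPU on the other nodes.

    First-fit placement, largest pod first. CPU only, in millicores.
    """
    free = [other["capacity_m"] - sum(other["pods_m"]) for other in others]
    for pod in sorted(node["pods_m"], reverse=True):
        for index, room in enumerate(free):
            if room >= pod:
                free[index] = room - pod
                break
        else:
            # No remaining node has room for this pod.
            return False
    return True


# Hypothetical cluster state after the HPA has scaled back down.
node_a = {"capacity_m": 2000, "pods_m": [500, 500, 500]}
node_b = {"capacity_m": 2000, "pods_m": [500, 500]}
node_c = {"capacity_m": 2000, "pods_m": [1200]}

print(can_drain(node_b, [node_a, node_c]))  # True
print(can_drain(node_c, [node_a, node_b]))  # False
```

&lt;p&gt;In this example &lt;code&gt;node_b&lt;/code&gt; can be emptied and terminated, while &lt;code&gt;node_c&lt;/code&gt; cannot because no other node has 1200m of spare CPU.&lt;/p&gt;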






&lt;h2&gt;
  
  
  Argo Rollouts — Canary Deployments
&lt;/h2&gt;

&lt;p&gt;Argo Rollouts replaces the standard Kubernetes &lt;code&gt;Deployment&lt;/code&gt; with a &lt;code&gt;Rollout&lt;/code&gt; resource that supports progressive delivery strategies. In production, every deployment goes through a canary phase.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installation (Production Only)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/argo-rollouts/applicationset.yaml&lt;/span&gt;
&lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;https://argoproj.github.io/argo-helm&lt;/span&gt;
  &lt;span class="na"&gt;chart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;          &lt;span class="s"&gt;argo-rollouts&lt;/span&gt;
  &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2.35.3"&lt;/span&gt;
  &lt;span class="na"&gt;helm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;installCRDs: true&lt;/span&gt;
      &lt;span class="s"&gt;dashboard:&lt;/span&gt;
        &lt;span class="s"&gt;enabled: true&lt;/span&gt;
        &lt;span class="s"&gt;service:&lt;/span&gt;
          &lt;span class="s"&gt;type: ClusterIP&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Rollout CRD (replaces Deployment in production)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# apps/myapp/templates/deployment.yaml&lt;/span&gt;
&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- if .Values.rollout.enabled&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Rollout&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;include "myapp.fullname" .&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Release.Namespace&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
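  &lt;span class="c1"&gt;# Leave replicas unset when the HPA is enabled so Argo CD and the HPA do not fight over the count&lt;/span&gt;
  &lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- if not .Values.autoscaling.enabled&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Values.replicaCount&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
  &lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- end&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;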
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- include "myapp.selectorLabels" . | nindent 6&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- include "myapp.selectorLabels" . | nindent 8&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;.Values.image.repository&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}:{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;.Values.image.tag&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
              &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
          &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100m&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;128Mi&lt;/span&gt;
            &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;500m&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;256Mi&lt;/span&gt;

  &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;canary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;canaryService&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;include "myapp.fullname" .&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;&lt;span class="s"&gt;-canary&lt;/span&gt;
      &lt;span class="na"&gt;stableService&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;include "myapp.fullname" .&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
      &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;          &lt;span class="c1"&gt;# Step 1: 20% of traffic to canary&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pause&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;5m&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt; &lt;span class="c1"&gt;# Step 2: Bake for 5 minutes&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;analysis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;               &lt;span class="c1"&gt;# Step 3: Automated metric check&lt;/span&gt;
            &lt;span class="na"&gt;templates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;templateName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;success-rate&lt;/span&gt;
            &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service-name&lt;/span&gt;
                &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;include "myapp.fullname" .&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;&lt;span class="s"&gt;-canary&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;         &lt;span class="c1"&gt;# Step 4: Promote to 100%&lt;/span&gt;

&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- else&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="c1"&gt;# Standard Deployment for dev/staging&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- end&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  AnalysisTemplate — Automated Promotion Gate
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# apps/myapp/templates/analysis-template.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AnalysisTemplate&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;success-rate&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Release.Namespace&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service-name&lt;/span&gt;

  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;success-rate&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;60s&lt;/span&gt;
      &lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;             &lt;span class="c1"&gt;# Run 5 measurements (5 minutes total)&lt;/span&gt;
      &lt;span class="na"&gt;failureLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;      &lt;span class="c1"&gt;# Tolerates 1 failed measurement; a 2nd aborts the rollout&lt;/span&gt;

      &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://prometheus-operated.monitoring.svc.cluster.local:9090&lt;/span&gt;
          &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;sum(&lt;/span&gt;
              &lt;span class="s"&gt;rate(myapp_http_requests_total{&lt;/span&gt;
                &lt;span class="s"&gt;service="{{`{{args.service-name}}`}}",&lt;/span&gt;
                &lt;span class="s"&gt;status_code!~"5.."&lt;/span&gt;
              &lt;span class="s"&gt;}[5m])&lt;/span&gt;
            &lt;span class="s"&gt;)&lt;/span&gt;
            &lt;span class="s"&gt;/&lt;/span&gt;
            &lt;span class="s"&gt;sum(&lt;/span&gt;
              &lt;span class="s"&gt;rate(myapp_http_requests_total{&lt;/span&gt;
                &lt;span class="s"&gt;service="{{`{{args.service-name}}`}}"&lt;/span&gt;
              &lt;span class="s"&gt;}[5m])&lt;/span&gt;
            &lt;span class="s"&gt;)&lt;/span&gt;

      &lt;span class="na"&gt;successCondition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;result[0] &amp;gt;= &lt;/span&gt;&lt;span class="m"&gt;0.99&lt;/span&gt;   &lt;span class="c1"&gt;# 99%+ success rate required&lt;/span&gt;
      &lt;span class="na"&gt;failureCondition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;result[0] &amp;lt; &lt;/span&gt;&lt;span class="m"&gt;0.99&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
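
&lt;p&gt;To make the gate's behaviour concrete, here is a minimal Python sketch of how the five measurements are judged. It is an illustration only, not Argo Rollouts code: the names &lt;code&gt;evaluate&lt;/code&gt;, &lt;code&gt;success_rate&lt;/code&gt;, and &lt;code&gt;AnalysisResult&lt;/code&gt; are hypothetical, and it assumes (as I read the Argo Rollouts documentation) that &lt;code&gt;failureLimit&lt;/code&gt; is the number of failed measurements tolerated before the run is marked Failed.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class AnalysisResult:
    phase: str          # "Successful" or "Failed"
    failures: int       # measurements that fell below the threshold
    measurements: int   # measurements evaluated before the run ended


def success_rate(non_5xx_rate: float, total_rate: float) -> float:
    """Ratio computed by the PromQL query: non-5xx request rate / total request rate."""
    return non_5xx_rate / total_rate


def evaluate(rates: list[float], threshold: float = 0.99, failure_limit: int = 1) -> AnalysisResult:
    """Walk the measurements in order and stop once failures exceed failure_limit."""
    failures = 0
    for taken, rate in enumerate(rates, start=1):
        if threshold > rate:          # mirrors failureCondition: result[0] below 0.99
            failures += 1
        if failures > failure_limit:  # assumption: the limit is "tolerated failures"
            return AnalysisResult("Failed", failures, taken)
    return AnalysisResult("Successful", failures, len(rates))


if __name__ == "__main__":
    print(evaluate([0.997, 0.995, 0.998, 0.996, 0.997]))  # clean run
    print(evaluate([0.997, 0.970, 0.998, 0.996, 0.997]))  # one dip is tolerated
    print(evaluate([0.997, 0.970, 0.960, 0.996, 0.997]))  # second dip fails the run
```

&lt;p&gt;If a single bad measurement should abort the rollout, setting &lt;code&gt;failureLimit: 0&lt;/code&gt; (the default, to my knowledge) is the stricter choice.&lt;/p&gt;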



&lt;h3&gt;
  
  
  Canary Service
&lt;/h3&gt;

&lt;p&gt;Traffic splitting requires two Services: &lt;code&gt;stable&lt;/code&gt; (regular Service) and &lt;code&gt;canary&lt;/code&gt; (routes only to canary pods).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# apps/myapp/templates/service-canary.yaml&lt;/span&gt;
&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- if .Values.rollout.enabled&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;include "myapp.fullname" .&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;&lt;span class="s"&gt;-canary&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Release.Namespace&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- include "myapp.selectorLabels" . | nindent 4&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
      &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- end&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Canary Deployment Walkthrough
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. CI pushes new image: sha-abc123 → myapp-gitops updated
2. ArgoCD detects diff → triggers Rollout update
3. Argo Rollouts creates new ReplicaSet with sha-abc123

   Traffic: 80% → stable (sha-xyz789), 20% → canary (sha-abc123)

4. 5-minute pause
   → Monitor Grafana: error rate on canary service?
   → Logs in CloudWatch: any exceptions?

5. AnalysisRun queries Prometheus (5 measurements × 1 min)
   → success_rate = 0.997 (99.7% success) ✓ PASS

6. setWeight: 100% → all traffic to sha-abc123
7. Old ReplicaSet (sha-xyz789) scaled to 0 after stability window
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
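
&lt;p&gt;The walkthrough above can also be traced step by step. The sketch below is a toy model, not the Argo Rollouts controller: &lt;code&gt;traffic_split&lt;/code&gt; and &lt;code&gt;STEPS&lt;/code&gt; are hypothetical names, and the step list simply mirrors the four canary steps from the Rollout spec in this article.&lt;/p&gt;

```python
def traffic_split(steps: list[dict]) -> list[tuple[str, int, int]]:
    """Return (step label, stable %, canary %) as it stands after each canary step."""
    canary = 0
    timeline = []
    for step in steps:
        if "setWeight" in step:
            canary = step["setWeight"]
            label = f"setWeight {canary}"
        elif "pause" in step:
            label = f"pause {step['pause']['duration']}"
        else:
            label = "analysis"
        # Pause and analysis steps leave the weights untouched.
        timeline.append((label, 100 - canary, canary))
    return timeline


# Steps mirrored from the Rollout spec in this article.
STEPS = [
    {"setWeight": 20},
    {"pause": {"duration": "5m"}},
    {"analysis": {"templates": [{"templateName": "success-rate"}]}},
    {"setWeight": 100},
]

if __name__ == "__main__":
    for label, stable, canary in traffic_split(STEPS):
        print(f"{label:14} stable={stable:3}% canary={canary:3}%")
```

&lt;p&gt;The point of the model is that only &lt;code&gt;setWeight&lt;/code&gt; moves traffic: the canary stays at 20% through both the pause and the analysis, so a failed analysis affects at most that slice of requests.&lt;/p&gt;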



&lt;h3&gt;
  
  
  Manual Intervention
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Watch rollout progress&lt;/span&gt;
kubectl argo rollouts status myapp &lt;span class="nt"&gt;-n&lt;/span&gt; myapp &lt;span class="nt"&gt;--watch&lt;/span&gt;

&lt;span class="c"&gt;# Promote past the current pause step (add --full to skip all remaining steps and analysis)&lt;/span&gt;
kubectl argo rollouts promote myapp &lt;span class="nt"&gt;-n&lt;/span&gt; myapp

&lt;span class="c"&gt;# Manually abort (rollback to stable immediately)&lt;/span&gt;
kubectl argo rollouts abort myapp &lt;span class="nt"&gt;-n&lt;/span&gt; myapp

&lt;span class="c"&gt;# Access the Argo Rollouts dashboard&lt;/span&gt;
kubectl port-forward svc/argo-rollouts-dashboard &lt;span class="nt"&gt;-n&lt;/span&gt; argo-rollouts 3100:3100
&lt;span class="c"&gt;# Open http://localhost:3100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Velero — Backup and Disaster Recovery
&lt;/h2&gt;

&lt;p&gt;Velero backs up Kubernetes resource definitions to S3 and takes EBS snapshots of the volumes behind your PersistentVolumes. If a cluster is accidentally deleted or corrupted, you can restore both into a new cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Velero Backs Up
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;All Kubernetes objects (Deployments, Services, ConfigMaps, Secrets, etc.)&lt;/li&gt;
&lt;li&gt;PersistentVolumeClaim snapshots (Prometheus data, Grafana dashboards, etc.)&lt;/li&gt;
&lt;li&gt;Namespace structure&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/velero/applicationset.yaml&lt;/span&gt;
&lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;https://vmware-tanzu.github.io/helm-charts&lt;/span&gt;
  &lt;span class="na"&gt;chart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;          &lt;span class="s"&gt;velero&lt;/span&gt;
  &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;6.4.0"&lt;/span&gt;
  &lt;span class="na"&gt;helm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;serviceAccount:&lt;/span&gt;
        &lt;span class="s"&gt;server:&lt;/span&gt;
          &lt;span class="s"&gt;annotations:&lt;/span&gt;
            &lt;span class="s"&gt;eks.amazonaws.com/role-arn: "{{veleroRoleArn}}"&lt;/span&gt;
      &lt;span class="s"&gt;configuration:&lt;/span&gt;
        &lt;span class="s"&gt;backupStorageLocation:&lt;/span&gt;
          &lt;span class="s"&gt;- name: default&lt;/span&gt;
            &lt;span class="s"&gt;provider: aws&lt;/span&gt;
            &lt;span class="s"&gt;bucket: "myapp-velero-{{cluster}}"&lt;/span&gt;
            &lt;span class="s"&gt;config:&lt;/span&gt;
              &lt;span class="s"&gt;region: "{{region}}"&lt;/span&gt;
        &lt;span class="s"&gt;volumeSnapshotLocation:&lt;/span&gt;
          &lt;span class="s"&gt;- name: default&lt;/span&gt;
            &lt;span class="s"&gt;provider: aws&lt;/span&gt;
            &lt;span class="s"&gt;config:&lt;/span&gt;
              &lt;span class="s"&gt;region: "{{region}}"&lt;/span&gt;
      &lt;span class="s"&gt;initContainers:&lt;/span&gt;
        &lt;span class="s"&gt;- name: velero-plugin-for-aws&lt;/span&gt;
          &lt;span class="s"&gt;image: velero/velero-plugin-for-aws:v1.9.0&lt;/span&gt;
          &lt;span class="s"&gt;volumeMounts:&lt;/span&gt;
            &lt;span class="s"&gt;- mountPath: /target&lt;/span&gt;
              &lt;span class="s"&gt;name: plugins&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scheduled Backup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/velero/schedule.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;velero.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Schedule&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;daily-backup&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;velero&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;   &lt;span class="c1"&gt;# 2 AM UTC daily&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;includedNamespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;monitoring&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
    &lt;span class="na"&gt;excludedResources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;events&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;events.events.k8s.io&lt;/span&gt;
    &lt;span class="na"&gt;snapshotVolumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;ttl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;720h&lt;/span&gt;   &lt;span class="c1"&gt;# 30 days retention&lt;/span&gt;
    &lt;span class="na"&gt;storageLocation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
    &lt;span class="na"&gt;volumeSnapshotLocations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
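
&lt;p&gt;As a quick sanity check on the schedule and retention values, the following Python sketch works out when the next backup runs, what it will be called, and when it expires. The helper names are hypothetical, and the naming pattern (schedule name plus a &lt;code&gt;YYYYMMDDhhmmss&lt;/code&gt; timestamp) is an assumption based on how Velero names scheduled backups.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone


def next_run(now: datetime, hour: int = 2, minute: int = 0) -> datetime:
    """Next daily run at hour:minute UTC, i.e. the cron expression '0 2 * * *'."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if now >= candidate:
        candidate += timedelta(days=1)  # today's slot has passed, use tomorrow's
    return candidate


def backup_name(schedule: str, created: datetime) -> str:
    """Assumed naming pattern: schedule name plus a UTC timestamp."""
    return f"{schedule}-{created:%Y%m%d%H%M%S}"


def expires_at(created: datetime, ttl_hours: int = 720) -> datetime:
    """A ttl of 720h keeps each backup for 30 days after creation."""
    return created + timedelta(hours=ttl_hours)


if __name__ == "__main__":
    now = datetime(2026, 3, 7, 18, 30, tzinfo=timezone.utc)
    run = next_run(now)
    print(backup_name("daily-backup", run))
    print(expires_at(run).isoformat())
```

&lt;p&gt;With daily runs and a 720-hour TTL, roughly 30 backups exist at any time, which is worth keeping in mind when estimating S3 and EBS snapshot costs.&lt;/p&gt;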



&lt;h3&gt;
  
  
  Restore Procedure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List available backups&lt;/span&gt;
velero backup get

&lt;span class="c"&gt;# Restore from a specific backup&lt;/span&gt;
velero restore create &lt;span class="nt"&gt;--from-backup&lt;/span&gt; daily-backup-20260308020000

&lt;span class="c"&gt;# Watch restore progress&lt;/span&gt;
velero restore describe &amp;lt;restore-name&amp;gt; &lt;span class="nt"&gt;--details&lt;/span&gt;

&lt;span class="c"&gt;# Verify restored resources&lt;/span&gt;
kubectl get all &lt;span class="nt"&gt;-n&lt;/span&gt; myapp
kubectl get pvc &lt;span class="nt"&gt;-n&lt;/span&gt; monitoring
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Critical reminder:&lt;/strong&gt; An untested backup is not a backup. Run a restore drill at least quarterly into a temporary cluster. The restore procedure should be documented and rehearsed so it is not being learned for the first time during an actual outage.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  PodDisruptionBudget
&lt;/h2&gt;

&lt;p&gt;Karpenter's consolidation can evict pods to move them to fewer nodes. Without a PDB, it might evict too many pods at once and cause downtime.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# apps/myapp/templates/pdb.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;policy/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodDisruptionBudget&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;include "myapp.fullname" .&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Release.Namespace&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;minAvailable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;   &lt;span class="c1"&gt;# Always keep at least 2 pods running during disruption&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- include "myapp.selectorLabels" . | nindent 6&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



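&lt;p&gt;With &lt;code&gt;minReplicas: 3&lt;/code&gt; and &lt;code&gt;minAvailable: 2&lt;/code&gt;, Karpenter can evict only one pod at a time while the workload sits at its minimum of three replicas. The remaining two continue serving traffic while the evicted pod reschedules on a new node. When the HPA has scaled the workload up, the same budget allows proportionally more evictions.&lt;/p&gt;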
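
&lt;p&gt;The arithmetic behind that budget is small enough to sketch. This is a simplified model with a hypothetical function name; it assumes every counted pod is healthy and ignores pods that are already being disrupted.&lt;/p&gt;

```python
def allowed_disruptions(healthy_pods: int, min_available: int = 2) -> int:
    """Voluntary evictions a minAvailable PDB permits at this moment."""
    return max(0, healthy_pods - min_available)


if __name__ == "__main__":
    for pods in (2, 3, 8, 10):
        print(f"{pods} healthy pods: {allowed_disruptions(pods)} eviction(s) allowed")
```

&lt;p&gt;Note the edge case: if only two pods are healthy, the budget drops to zero and node consolidation has to wait until a third pod is ready.&lt;/p&gt;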




&lt;h2&gt;
  
  
  Complete Resilience Picture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NORMAL OPERATION
─────────────────
3 pods (minReplicas) on 2 managed nodes
HPA watching CPU (target: 60%)
Karpenter watching for pending pods
Velero backing up daily at 2 AM UTC

TRAFFIC SPIKE
─────────────────
CPU &amp;gt; 60% → HPA scales to 8 pods
2 pods pending (no capacity) → Karpenter provisions m5.large
All 8 pods running → serving traffic
Spike ends → HPA scales to 3 pods
2 pods terminate → Karpenter consolidates → terminates extra node

NEW DEPLOYMENT (production)
─────────────────
ArgoCD syncs new image → Rollout starts
20% canary → 5min bake → AnalysisRun → 100% promote
OR: error rate &amp;gt; 1% → automatic rollback to previous version

DISASTER RECOVERY
─────────────────
Cluster accidentally deleted → restore from Velero backup
velero restore create --from-backup &amp;lt;last-good-backup&amp;gt;
Kubernetes objects restored → EBS snapshots attached → running in ~15 min
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
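
&lt;p&gt;The traffic-spike branch of that picture follows the standard HPA scaling rule, which can be sketched as below. The function name and the 160% CPU figure are illustrative assumptions; the 60% target and the 3 to 10 replica bounds come from this series. The real controller also applies a tolerance band and stabilization windows, which this sketch leaves out.&lt;/p&gt;

```python
import math


def desired_replicas(current: int, cpu_percent: int, target_percent: int = 60,
                     min_replicas: int = 3, max_replicas: int = 10) -> int:
    """Core HPA rule: ceil(current * currentMetric / targetMetric), clamped to min/max."""
    raw = math.ceil(current * cpu_percent / target_percent)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    print(desired_replicas(3, 160))  # spike: 3 pods at 160% of requested CPU
    print(desired_replicas(8, 20))   # spike over: 8 pods idling at 20%
    print(desired_replicas(3, 300))  # extreme load is capped at maxReplicas
```

&lt;p&gt;Under these assumptions, three pods averaging 160% of their CPU request are scaled to eight, and once load falls away the floor of three replicas takes over again.&lt;/p&gt;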






&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;By the end of Part 10 — and the entire series — you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Karpenter&lt;/strong&gt; provisioning right-sized EC2 instances on demand, consolidating when idle&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;HPA&lt;/strong&gt; scaling pods 3→10 based on CPU utilization, targeting a Rollout in production&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Argo Rollouts&lt;/strong&gt; deploying every production change as a canary with automated Prometheus-based promotion gates&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Velero&lt;/strong&gt; running scheduled daily backups with 30-day retention to S3&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;PodDisruptionBudget&lt;/strong&gt; preventing Karpenter from evicting too many pods at once&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Series Conclusion
&lt;/h2&gt;

&lt;p&gt;You have now built a complete production-grade DevSecOps platform:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What You Built&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Foundation&lt;/td&gt;
&lt;td&gt;AWS Organizations, 4 accounts, SSO, SCPs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;Terraform modules, Terragrunt DRY configs, 6 VPCs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compute&lt;/td&gt;
&lt;td&gt;6 EKS clusters (k8s 1.29) across 3 environments and 2 regions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitOps&lt;/td&gt;
&lt;td&gt;ArgoCD hub-spoke, 35+ ApplicationSets, automated sync&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI/CD&lt;/td&gt;
&lt;td&gt;GitHub Actions + OIDC + Trivy + Cosign + ECR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secrets&lt;/td&gt;
&lt;td&gt;AWS Secrets Manager + ESO + IRSA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Kyverno policies + Falco runtime detection + WAF + GuardDuty&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Prometheus + Grafana + Fluent Bit + CloudWatch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resilience&lt;/td&gt;
&lt;td&gt;Karpenter + HPA + Argo Rollouts canary + Velero DR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Networking&lt;/td&gt;
&lt;td&gt;Route53 latency routing + ACM TLS + ALB + NetworkPolicies&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is the platform that a growing engineering team with 50–500 developers would build and operate. Each component was chosen for a reason, wired to the others, and tested against real failures.&lt;/p&gt;




&lt;h3&gt;
  
  
  Screenshots
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;kubectl get hpa&lt;/code&gt; showing current/desired replicas&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0gt1hbgiku4vhgqcu7p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0gt1hbgiku4vhgqcu7p.png" alt="kubectl get hpa output showing TARGETS 15%/70%, MIN 3, MAX 10, REPLICAS 3, confirming autoscaling is wired up" width="800" height="734"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ArgoCD: full applications view showing all 35+ apps across 6 clusters&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopovny026913cge0awjf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopovny026913cge0awjf.png" alt="ArgoCD: full applications view showing all 35+ apps across 6 clusters" width="799" height="443"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Thank you for following this series. Source code:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Infrastructure: &lt;a href="https://github.com/MatthewDipo/myapp-infra" rel="noopener noreferrer"&gt;github.com/MatthewDipo/myapp-infra&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;GitOps manifests: &lt;a href="https://github.com/MatthewDipo/myapp-gitops" rel="noopener noreferrer"&gt;github.com/MatthewDipo/myapp-gitops&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Application: &lt;a href="https://github.com/MatthewDipo/myapp" rel="noopener noreferrer"&gt;github.com/MatthewDipo/myapp&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>aws</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Part 9: Observability — Prometheus, Grafana, Fluent Bit, and CloudWatch</title>
      <dc:creator>Matthew</dc:creator>
      <pubDate>Wed, 13 May 2026 08:00:00 +0000</pubDate>
      <link>https://dev.to/matthewdipo/part-9-observability-prometheus-grafana-fluent-bit-and-cloudwatch-4h44</link>
      <guid>https://dev.to/matthewdipo/part-9-observability-prometheus-grafana-fluent-bit-and-cloudwatch-4h44</guid>
      <description>&lt;p&gt;&lt;em&gt;Part of the series: Building a Production-Grade DevSecOps Pipeline on AWS&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Observability answers the question: &lt;em&gt;what is my system doing right now, and why?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The three pillars:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt; — numerical measurements over time (CPU%, request rate, error rate, latency)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logs&lt;/strong&gt; — structured event records from every container&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traces&lt;/strong&gt; — request flows across services (not covered in this series, but Grafana Tempo is the natural next step)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This pipeline implements metrics with &lt;strong&gt;Prometheus + Grafana&lt;/strong&gt; and logs with &lt;strong&gt;Fluent Bit → CloudWatch&lt;/strong&gt;. Together they give you both real-time dashboards and historical log search without leaving the AWS ecosystem.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────────────┐
│  OBSERVABILITY ARCHITECTURE                                              │
│                                                                          │
│  Application Pods                                                        │
│  ├─ /metrics endpoint → ServiceMonitor → Prometheus scrape               │
│  └─ stdout/stderr → Fluent Bit DaemonSet → CloudWatch Logs               │
│                                                                          │
│  Infrastructure                                                          │
│  ├─ node-exporter (CPU, memory, disk, network per node) → Prometheus     │
│  └─ kube-state-metrics (pod state, deployment state) → Prometheus        │
│                                                                          │
│  Prometheus → Grafana (dashboards + alert rules)                         │
│  Prometheus → Alertmanager (notifications)                               │
│                                                                          │
│  Falco (security events) → stdout → Fluent Bit → CloudWatch              │
│  All containers → stdout → Fluent Bit → CloudWatch                       │
└──────────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  kube-prometheus-stack
&lt;/h2&gt;

&lt;p&gt;Rather than installing Prometheus, Grafana, and Alertmanager separately, we use the &lt;code&gt;kube-prometheus-stack&lt;/code&gt; Helm chart. It bundles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus Operator&lt;/strong&gt; — manages Prometheus, Alertmanager, and PrometheusRule CRDs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus&lt;/strong&gt; — the metrics database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alertmanager&lt;/strong&gt; — routes alerts to notification channels&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana&lt;/strong&gt; — dashboards and visualization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;kube-state-metrics&lt;/strong&gt; — exposes Kubernetes object state as metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;node-exporter&lt;/strong&gt; — exposes node-level metrics (CPU, memory, disk, network)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We install it only on &lt;strong&gt;staging and production&lt;/strong&gt; clusters (4 of 6) — dev clusters skip monitoring to reduce cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installation via ArgoCD
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/monitoring/applicationset.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ApplicationSet&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-prometheus-stack&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;generators&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;elements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;myapp-production-use1&lt;/span&gt;
            &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;         &lt;span class="s"&gt;us-east-1&lt;/span&gt;
            &lt;span class="na"&gt;grafanaIngress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
            &lt;span class="na"&gt;certArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:acm:us-east-1:591120834781:certificate/9ab022c9-..."&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;myapp-production-usw2&lt;/span&gt;
            &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;         &lt;span class="s"&gt;us-west-2&lt;/span&gt;
            &lt;span class="na"&gt;grafanaIngress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false"&lt;/span&gt;
            &lt;span class="na"&gt;certArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;myapp-staging-use1&lt;/span&gt;
            &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;         &lt;span class="s"&gt;us-east-1&lt;/span&gt;
            &lt;span class="na"&gt;grafanaIngress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false"&lt;/span&gt;
            &lt;span class="na"&gt;certArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;myapp-staging-usw2&lt;/span&gt;
            &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;         &lt;span class="s"&gt;us-west-2&lt;/span&gt;
            &lt;span class="na"&gt;grafanaIngress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false"&lt;/span&gt;
            &lt;span class="na"&gt;certArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prometheus-{{cluster}}"&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
      &lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;https://prometheus-community.github.io/helm-charts&lt;/span&gt;
          &lt;span class="na"&gt;chart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;          &lt;span class="s"&gt;kube-prometheus-stack&lt;/span&gt;
          &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;61.9.0"&lt;/span&gt;
          &lt;span class="na"&gt;helm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;valueFiles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;$gitopsValues/infrastructure/monitoring/prometheus-values.yaml&lt;/span&gt;
            &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grafana.ingress.enabled"&lt;/span&gt;
                &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{grafanaIngress}}"&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grafana.ingress.annotations.alb&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;.ingress&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;.kubernetes&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;.io/certificate-arn"&lt;/span&gt;
                &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{certArn}}"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;https://github.com/MatthewDipo/myapp-gitops.git&lt;/span&gt;
          &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
          &lt;span class="na"&gt;ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gitopsValues&lt;/span&gt;
      &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{cluster}}"&lt;/span&gt;
        &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monitoring&lt;/span&gt;
      &lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;syncOptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;CreateNamespace=true&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;ServerSideApply=true&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;retry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
          &lt;span class="na"&gt;backoff&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;30s&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;maxDuration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;10m&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;factor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;2&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important install flag:&lt;/strong&gt; Use &lt;code&gt;--no-hooks --timeout 10m&lt;/code&gt; (without &lt;code&gt;--wait&lt;/code&gt;). Pre-upgrade admission webhook Jobs consistently time out in ArgoCD, causing the sync phase to show &lt;code&gt;Failed&lt;/code&gt; — but the actual resources (Prometheus, Grafana, Alertmanager) deploy correctly. This is a known false positive. Do not let it alarm you.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  prometheus-values.yaml
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;prometheusSpec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;retention&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15d&lt;/span&gt;
    &lt;span class="na"&gt;storageSpec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;volumeClaimTemplate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gp2&lt;/span&gt;
          &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;ReadWriteOnce&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
          &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;50Gi&lt;/span&gt;
    &lt;span class="c1"&gt;# Auto-discover ALL ServiceMonitors and PodMonitors across all namespaces&lt;/span&gt;
    &lt;span class="c1"&gt;# Without these settings Prometheus only scrapes resources with matching Helm labels&lt;/span&gt;
    &lt;span class="na"&gt;podMonitorSelectorNilUsesHelmValues&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;     &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;serviceMonitorSelectorNilUsesHelmValues&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;ruleSelectorNilUsesHelmValues&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;           &lt;span class="kc"&gt;false&lt;/span&gt;

&lt;span class="na"&gt;alertmanager&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;alertmanagerSpec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;volumeClaimTemplate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gp2&lt;/span&gt;
          &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;ReadWriteOnce&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
          &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10Gi&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;global&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;resolve_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
    &lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;group_by&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;alertname&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;group_wait&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
      &lt;span class="na"&gt;group_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
      &lt;span class="na"&gt;repeat_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;12h&lt;/span&gt;
      &lt;span class="na"&gt;receiver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;null"&lt;/span&gt;    &lt;span class="c1"&gt;# Placeholder — replace with Slack/PagerDuty&lt;/span&gt;
    &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;null"&lt;/span&gt;

&lt;span class="na"&gt;grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;admin&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;existingSecret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana-admin-secret&lt;/span&gt;
    &lt;span class="na"&gt;userKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;admin-user&lt;/span&gt;
    &lt;span class="na"&gt;passwordKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    &lt;span class="s"&gt;admin-password&lt;/span&gt;
  &lt;span class="na"&gt;persistence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;          &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gp2&lt;/span&gt;
    &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;             &lt;span class="s"&gt;10Gi&lt;/span&gt;
  &lt;span class="na"&gt;sidecar&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;dashboards&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;          &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;searchNamespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;ALL&lt;/span&gt;   &lt;span class="c1"&gt;# Pick up dashboards from all namespaces&lt;/span&gt;
  &lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;          &lt;span class="kc"&gt;false&lt;/span&gt;   &lt;span class="c1"&gt;# Overridden per-cluster via ApplicationSet parameter&lt;/span&gt;
    &lt;span class="na"&gt;ingressClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;alb&lt;/span&gt;
    &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;alb.ingress.kubernetes.io/scheme&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;internet-facing&lt;/span&gt;
      &lt;span class="na"&gt;alb.ingress.kubernetes.io/target-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;   &lt;span class="s"&gt;ip&lt;/span&gt;
      &lt;span class="na"&gt;alb.ingress.kubernetes.io/listen-ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[{"HTTP":80},{"HTTPS":443}]'&lt;/span&gt;  &lt;span class="c1"&gt;# ssl-redirect only acts on an HTTP listener&lt;/span&gt;
      &lt;span class="na"&gt;alb.ingress.kubernetes.io/ssl-redirect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;443"&lt;/span&gt;
      &lt;span class="na"&gt;alb.ingress.kubernetes.io/certificate-arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Injected per-region&lt;/span&gt;
    &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;grafana.matthewoladipupo.dev&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
    &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;

&lt;span class="na"&gt;kubeStateMetrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;nodeExporter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Grafana Public Access
&lt;/h2&gt;

&lt;p&gt;Grafana is exposed publicly only on &lt;code&gt;myapp-production-use1&lt;/code&gt;, for three reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grafana stores its state on an EBS-backed &lt;code&gt;ReadWriteOnce&lt;/code&gt; PVC, which only one node can mount at a time, so this deployment is effectively single-instance&lt;/li&gt;
&lt;li&gt;EBS volumes are bound to a single Availability Zone and cannot be shared across regions, so a second public Grafana in usw2 would run on its own volume and show different historical data&lt;/li&gt;
&lt;li&gt;One public Grafana that federates data from all clusters is cleaner than four separate Grafana instances&lt;/li&gt;
&lt;/ul&gt;
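
&lt;p&gt;The post does not show how the single Grafana reaches the other clusters. As a minimal sketch, assuming each cluster's Prometheus is reachable from the hub at a URL you control, extra data sources could be declared through the chart's &lt;code&gt;grafana.additionalDataSources&lt;/code&gt; value. The data source name and the URL below are hypothetical placeholders, not values from this pipeline:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;grafana:
  additionalDataSources:
    - name: prometheus-staging-usw2     # hypothetical display name
      type: prometheus
      access: proxy                     # Grafana's backend issues the queries
      url: https://prometheus.staging-usw2.example.internal   # hypothetical URL
      isDefault: false                  # keep the local Prometheus as the default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;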

&lt;p&gt;&lt;strong&gt;Access URL:&lt;/strong&gt; &lt;code&gt;https://grafana.matthewoladipupo.dev&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The Route53 A record points to the ALB provisioned by AWS LBC when the Ingress is applied.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting the Grafana Password
&lt;/h3&gt;

&lt;p&gt;The Grafana admin credentials are stored in &lt;code&gt;grafana-admin-secret&lt;/code&gt; in the &lt;code&gt;monitoring&lt;/code&gt; namespace. &lt;strong&gt;Important caveat:&lt;/strong&gt; Grafana reads this secret only when it first initializes its database. If the secret value changes after that, the new value is ignored, and you must reset the password with &lt;code&gt;grafana-cli&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; monitoring &amp;lt;grafana-pod&amp;gt; &lt;span class="nt"&gt;-c&lt;/span&gt; grafana &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  grafana-cli admin reset-admin-password &lt;span class="s1"&gt;'YourNewPassword'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
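
&lt;p&gt;For reference, &lt;code&gt;prometheus-values.yaml&lt;/code&gt; expects &lt;code&gt;grafana-admin-secret&lt;/code&gt; to carry two keys, &lt;code&gt;admin-user&lt;/code&gt; and &lt;code&gt;admin-password&lt;/code&gt;. The following is a minimal sketch of a Secret with that shape. The credential values are placeholders, and how the secret is actually provisioned in this pipeline is not shown here:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Secret
metadata:
  name: grafana-admin-secret     # matches grafana.admin.existingSecret
  namespace: monitoring
type: Opaque
stringData:
  admin-user: admin              # placeholder; key name matches grafana.admin.userKey
  admin-password: change-me      # placeholder; key name matches grafana.admin.passwordKey
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This only illustrates the key names. Do not commit real credentials to Git in this form.&lt;/p&gt;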






&lt;h2&gt;
  
  
  ServiceMonitor for the Application
&lt;/h2&gt;

&lt;p&gt;By default, Prometheus scrapes only the cluster components that kube-prometheus-stack ships monitors for. To scrape your application, add a &lt;code&gt;ServiceMonitor&lt;/code&gt; resource (a custom resource defined by the Prometheus Operator):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# apps/myapp/templates/servicemonitor.yaml&lt;/span&gt;
&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- if .Values.serviceMonitor.enabled&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monitoring.coreos.com/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceMonitor&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;include "myapp.fullname" .&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Release.Namespace&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- include "myapp.labels" . | nindent 4&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- include "myapp.selectorLabels" . | nindent 6&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
  &lt;span class="na"&gt;endpoints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;          &lt;span class="c1"&gt;# Named port on the Service&lt;/span&gt;
      &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/metrics&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Values.serviceMonitor.interval | default "30s"&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
      &lt;span class="na"&gt;scrapeTimeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Values.serviceMonitor.scrapeTimeout | default "10s"&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
  &lt;span class="na"&gt;namespaceSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchNames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Release.Namespace&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- end&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
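
&lt;p&gt;The template reads three values under &lt;code&gt;serviceMonitor&lt;/code&gt;. A minimal sketch of the matching block in the chart's values file could look like this. The file path in the comment is an assumption, and the interval and timeout simply repeat the template's own defaults:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# apps/myapp/values.yaml (assumed location)
serviceMonitor:
  enabled: true        # false skips rendering the ServiceMonitor entirely
  interval: 30s        # same as the template default
  scrapeTimeout: 10s   # same as the template default; must not exceed interval
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;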



&lt;p&gt;The application's &lt;code&gt;/metrics&lt;/code&gt; endpoint returns Prometheus exposition format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="c"&gt;# HELP myapp_http_requests_total Total HTTP requests&lt;/span&gt;
&lt;span class="c"&gt;# TYPE myapp_http_requests_total counter&lt;/span&gt;
&lt;span class="n"&gt;myapp_http_requests_total&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/health"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="na"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"200"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mi"&gt;1423&lt;/span&gt;
&lt;span class="n"&gt;myapp_http_requests_total&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/metrics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="na"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"200"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mi"&gt;89&lt;/span&gt;
&lt;span class="c"&gt;# HELP process_cpu_seconds_total Total user and system CPU time&lt;/span&gt;
&lt;span class="p"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key setting that makes auto-discovery work is &lt;code&gt;serviceMonitorSelectorNilUsesHelmValues: false&lt;/code&gt; in prometheus-values.yaml. Without it, the chart configures Prometheus to select only ServiceMonitors labeled with the kube-prometheus-stack Helm release name (&lt;code&gt;release: &amp;lt;release-name&amp;gt;&lt;/code&gt;), so your application's ServiceMonitor is ignored.&lt;/p&gt;
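
&lt;p&gt;As a rough sketch of what the flag changes, these are the relevant fields the chart renders into the Prometheus custom resource in each case. Only the selector is shown, and &lt;code&gt;&amp;lt;release-name&amp;gt;&lt;/code&gt; stands for whatever Helm release name the stack is installed under:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# serviceMonitorSelectorNilUsesHelmValues: true (chart default)
spec:
  serviceMonitorSelector:
    matchLabels:
      release: &amp;lt;release-name&amp;gt;
---
# serviceMonitorSelectorNilUsesHelmValues: false
spec:
  serviceMonitorSelector: {}   # empty selector matches every ServiceMonitor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;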




&lt;h2&gt;
  
  
  PrometheusRule — Alert Rules
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/monitoring/alert-rules/myapp-alerts.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monitoring.coreos.com/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PrometheusRule&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp-alerts&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monitoring&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;release&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;   &lt;span class="c1"&gt;# Only needed if the Prometheus ruleSelector filters on this label; with ruleSelectorNilUsesHelmValues: false all rules are selected&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp.rules&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
      &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HighErrorRate&lt;/span&gt;
          &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;(&lt;/span&gt;
              &lt;span class="s"&gt;sum(rate(myapp_http_requests_total{status_code=~"5.."}[5m]))&lt;/span&gt;
              &lt;span class="s"&gt;/&lt;/span&gt;
              &lt;span class="s"&gt;sum(rate(myapp_http_requests_total[5m]))&lt;/span&gt;
            &lt;span class="s"&gt;) &amp;gt; 0.01&lt;/span&gt;
          &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
          &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
          &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;High&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;myapp&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;({{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$value&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;humanizePercentage&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}})"&lt;/span&gt;
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;More&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;than&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1%&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;requests&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;are&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;failing&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;5xx&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;errors&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;minutes."&lt;/span&gt;

        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodCrashLooping&lt;/span&gt;
          &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;increase(kube_pod_container_status_restarts_total{namespace="myapp"}[15m]) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;
          &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
          &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
          &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pod&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.pod&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;crash&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;looping"&lt;/span&gt;
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pod&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;has&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;restarted&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;more&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;than&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;times&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;15&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;minutes."&lt;/span&gt;

        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HighMemoryUsage&lt;/span&gt;
          &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;(&lt;/span&gt;
              &lt;span class="s"&gt;container_memory_working_set_bytes{namespace="myapp",container!=""}&lt;/span&gt;
              &lt;span class="s"&gt;/ (container_spec_memory_limit_bytes{namespace="myapp",container!=""} &amp;gt; 0)&lt;/span&gt;
            &lt;span class="s"&gt;) &amp;gt; 0.9&lt;/span&gt;
          &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
          &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
          &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Memory&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;usage&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;above&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;90%&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.pod&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Alertmanager — Adding Slack Notifications
&lt;/h2&gt;

&lt;p&gt;The current config uses a &lt;code&gt;null&lt;/code&gt; receiver (alerts fire but go nowhere). To wire up Slack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;alertmanager&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;global&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;resolve_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
      &lt;span class="na"&gt;slack_api_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'&lt;/span&gt;
    &lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;group_by&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;alertname&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;group_wait&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
      &lt;span class="na"&gt;group_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
      &lt;span class="na"&gt;repeat_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;4h&lt;/span&gt;
      &lt;span class="na"&gt;receiver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slack-critical&lt;/span&gt;
      &lt;span class="na"&gt;routes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
          &lt;span class="na"&gt;receiver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slack-critical&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
          &lt;span class="na"&gt;receiver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slack-warning&lt;/span&gt;
    &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slack-critical&lt;/span&gt;
        &lt;span class="na"&gt;slack_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;channel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;#alerts-critical'&lt;/span&gt;
            &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
              &lt;span class="s"&gt;*Alert:* {{ .GroupLabels.alertname }}&lt;/span&gt;
              &lt;span class="s"&gt;*Severity:* {{ .GroupLabels.severity }}&lt;/span&gt;
              &lt;span class="s"&gt;*Cluster:* {{ .GroupLabels.cluster }}&lt;/span&gt;
              &lt;span class="s"&gt;{{ range .Alerts }}*Description:* {{ .Annotations.description }}{{ end }}&lt;/span&gt;
            &lt;span class="na"&gt;send_resolved&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slack-warning&lt;/span&gt;
        &lt;span class="na"&gt;slack_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;channel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;#alerts-warning'&lt;/span&gt;
            &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;.GroupLabels.alertname&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;range&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;.Alerts&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;.Annotations.summary&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt;
            &lt;span class="na"&gt;send_resolved&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
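
&lt;p&gt;One way to check the routing tree above is a throwaway rule that always fires. This is a minimal sketch, not part of the original setup: the rule name, the alert name, and the &lt;code&gt;release&lt;/code&gt; label value are assumptions, and the label has to match whatever your Prometheus &lt;code&gt;ruleSelector&lt;/code&gt; expects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical test rule: fires immediately and carries severity=warning,
# so it should follow the second child route to the slack-warning receiver.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slack-routing-test            # assumed name, delete after testing
  namespace: monitoring
  labels:
    release: kube-prometheus-stack    # assumption: must match your ruleSelector
spec:
  groups:
    - name: routing-test
      rules:
        - alert: SlackRoutingTest
          expr: vector(1)             # always returns a sample, so the alert always fires
          labels:
            severity: warning
          annotations:
            summary: "Test alert to confirm the slack-warning route"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With &lt;code&gt;group_wait: 30s&lt;/code&gt;, the message should reach &lt;code&gt;#alerts-warning&lt;/code&gt; roughly half a minute after the rule is first evaluated. Changing the label to &lt;code&gt;severity: critical&lt;/code&gt; exercises the other route. Delete the rule afterwards; because &lt;code&gt;send_resolved&lt;/code&gt; is &lt;code&gt;true&lt;/code&gt;, a resolved notification should follow.&lt;/p&gt;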






&lt;h2&gt;
  
  
  Fluent Bit — Log Shipping to CloudWatch
&lt;/h2&gt;

&lt;p&gt;Fluent Bit runs as a DaemonSet — one pod per node — and reads all container logs from &lt;code&gt;/var/log/containers/*.log&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  IRSA for Fluent Bit
&lt;/h3&gt;

&lt;p&gt;Fluent Bit needs IAM permissions to write to CloudWatch. The key lesson: &lt;strong&gt;use wildcard ARNs for both log groups AND log streams&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# _modules/fluent-bit-irsa/main.tf&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role_policy"&lt;/span&gt; &lt;span class="s2"&gt;"fluent_bit"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"fluent-bit-cloudwatch"&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fluent_bit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"CloudWatchLogs"&lt;/span&gt;
        &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
        &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="s2"&gt;"logs:CreateLogGroup"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"logs:CreateLogStream"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"logs:PutLogEvents"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"logs:DescribeLogStreams"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"logs:DescribeLogGroups"&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="c1"&gt;# Log group operations (CreateLogGroup, DescribeLogGroups)&lt;/span&gt;
          &lt;span class="s2"&gt;"arn:aws:logs:${var.aws_region}:${var.account_id}:log-group:/eks/*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="c1"&gt;# Log stream operations (CreateLogStream, PutLogEvents)&lt;/span&gt;
          &lt;span class="c1"&gt;# The :* suffix is REQUIRED for stream-level permissions&lt;/span&gt;
          &lt;span class="s2"&gt;"arn:aws:logs:${var.aws_region}:${var.account_id}:log-group:/eks/*:*"&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Lesson learned:&lt;/strong&gt; Using only &lt;code&gt;log-group:/eks/*&lt;/code&gt; (without the &lt;code&gt;:*&lt;/code&gt; suffix) grants permissions on the log group resource but NOT on log streams within it. &lt;code&gt;CreateLogStream&lt;/code&gt; and &lt;code&gt;PutLogEvents&lt;/code&gt; operate on the log stream resource, which requires the &lt;code&gt;:*&lt;/code&gt; suffix. Without this, pods get &lt;code&gt;AccessDeniedException&lt;/code&gt; on &lt;code&gt;CreateLogStream&lt;/code&gt; even though &lt;code&gt;CreateLogGroup&lt;/code&gt; succeeds.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Fluent Bit Helm Values
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/logging/fluent-bit-values.yaml&lt;/span&gt;
&lt;span class="na"&gt;serviceAccount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;eks.amazonaws.com/role-arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{irsaRoleArn}}"&lt;/span&gt;   &lt;span class="c1"&gt;# Injected per-cluster&lt;/span&gt;

&lt;span class="na"&gt;cloudWatch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{region}}"&lt;/span&gt;
  &lt;span class="na"&gt;logGroupName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/eks/{{cluster}}/$(kubernetes['namespace_name'])"&lt;/span&gt;
  &lt;span class="na"&gt;logStreamName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$(kubernetes['pod_name'])/$(kubernetes['container_name'])"&lt;/span&gt;
  &lt;span class="na"&gt;autoCreateGroup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="c1"&gt;# Enrich log records with Kubernetes metadata&lt;/span&gt;
&lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;kubernetes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Merge_Log&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;On&lt;/span&gt;
    &lt;span class="na"&gt;Keep_Log&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Off&lt;/span&gt;
    &lt;span class="na"&gt;K8S-Logging.Parser&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;On&lt;/span&gt;
    &lt;span class="na"&gt;K8S-Logging.Exclude&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;On&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
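
&lt;p&gt;The &lt;code&gt;K8S-Logging.Parser&lt;/code&gt; and &lt;code&gt;K8S-Logging.Exclude&lt;/code&gt; options above let individual pods influence how Fluent Bit treats their logs through annotations. The sketch below shows the idea; the pod name, image, and command are placeholders, and the &lt;code&gt;json&lt;/code&gt; parser in the comment is assumed to be defined in your Fluent Bit parsers configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical pod that opts out of log shipping.
apiVersion: v1
kind: Pod
metadata:
  name: noisy-batch-job               # placeholder name
  namespace: myapp
  annotations:
    # Honoured because K8S-Logging.Exclude is On: Fluent Bit drops this pod's logs.
    fluentbit.io/exclude: "true"
    # Alternative, honoured because K8S-Logging.Parser is On:
    # fluentbit.io/parser: "json"     # assumes a parser named json is defined
spec:
  containers:
    - name: worker
      image: busybox:1.36             # placeholder image
      command: ["sh", "-c", "while true; do echo tick; sleep 1; done"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For workloads managed by a Deployment, the same annotations go on the pod template (&lt;code&gt;spec.template.metadata.annotations&lt;/code&gt;), not on the Deployment itself.&lt;/p&gt;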



&lt;h3&gt;
  
  
  CloudWatch Log Groups Created
&lt;/h3&gt;

&lt;p&gt;After Fluent Bit starts, these log groups appear in CloudWatch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/eks/myapp-production-use1/myapp
/eks/myapp-production-use1/monitoring
/eks/myapp-production-use1/argocd
/eks/myapp-production-use1/kyverno
/eks/myapp-production-use1/falco
... (one per namespace, all clusters)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Querying Logs with CloudWatch Insights
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;Find&lt;/span&gt; &lt;span class="k"&gt;all&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="n"&gt;xx&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="k"&gt;last&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;
&lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="n"&gt;kubernetes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;namespace_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"myapp"&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;sort&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;

&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;Find&lt;/span&gt; &lt;span class="n"&gt;Falco&lt;/span&gt; &lt;span class="k"&gt;security&lt;/span&gt; &lt;span class="n"&gt;alerts&lt;/span&gt;
&lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;output&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="n"&gt;kubernetes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;namespace_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"falco"&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"Warning"&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"Error"&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;sort&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;

&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="k"&gt;Count&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;pod&lt;/span&gt;
&lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;kubernetes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pod_name&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ERROR&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;sort&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Grafana Dashboards
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pre-built Dashboards (from kube-prometheus-stack)
&lt;/h3&gt;

&lt;p&gt;These are included automatically and show up in Grafana immediately after installation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes / Compute Resources / Cluster&lt;/strong&gt; — total CPU/memory across all nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes / Compute Resources / Namespace&lt;/strong&gt; — resource breakdown per namespace&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node Exporter / Nodes&lt;/strong&gt; — per-node CPU, memory, disk, network I/O&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alertmanager / Overview&lt;/strong&gt; — alert firing/resolved history&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Importing Community Dashboards
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;code&gt;https://grafana.matthewoladipupo.dev&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Dashboards → Import&lt;/li&gt;
&lt;li&gt;Enter the dashboard ID from grafana.com:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;1860&lt;/code&gt; — Node Exporter Full (very detailed node metrics)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;13332&lt;/code&gt; — Kubernetes Pods (pod-level resource view)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;15757&lt;/code&gt; — Kubernetes / Views / Global (cluster CPU, memory, pod counts)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
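
&lt;p&gt;In a GitOps setup, the same import can also be declared in the chart values instead of clicking through the UI. The fragment below is a sketch under assumptions: it relies on the Grafana subchart's &lt;code&gt;dashboardProviders&lt;/code&gt; and &lt;code&gt;dashboards&lt;/code&gt; values, the provider and folder names are made up, and the datasource is assumed to be called &lt;code&gt;Prometheus&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# kube-prometheus-stack values (sketch): provision a grafana.com dashboard by ID
grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: community               # hypothetical provider name
          orgId: 1
          folder: Community             # hypothetical Grafana folder
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/community
  dashboards:
    community:                          # must match the provider name above
      node-exporter-full:
        gnetId: 1860                    # Node Exporter Full, from the list above
        datasource: Prometheus          # assumption: your datasource name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Adding a &lt;code&gt;revision&lt;/code&gt; key next to &lt;code&gt;gnetId&lt;/code&gt; pins a specific dashboard revision, which keeps syncs reproducible.&lt;/p&gt;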

&lt;h3&gt;
  
  
  Application Dashboard
&lt;/h3&gt;

&lt;p&gt;With the ServiceMonitor installed, Grafana can display your app metrics. Create a panel with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Request rate (requests per second)
sum(rate(myapp_http_requests_total[5m])) by (route)

# Error rate (percentage of 5xx)
sum(rate(myapp_http_requests_total{status_code=~"5.."}[5m]))
/ sum(rate(myapp_http_requests_total[5m])) * 100

# 95th percentile latency (if using histogram metric)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;By the end of Part 9 you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ kube-prometheus-stack on 4 clusters (staging + production) with EBS persistent storage&lt;/li&gt;
&lt;li&gt;✅ Grafana publicly accessible at &lt;code&gt;https://grafana.matthewoladipupo.dev&lt;/code&gt; (production-use1 only)&lt;/li&gt;
&lt;li&gt;✅ ServiceMonitor scraping myapp's &lt;code&gt;/metrics&lt;/code&gt; endpoint every 30 seconds&lt;/li&gt;
&lt;li&gt;✅ PrometheusRule alert rules for high error rate, crash looping, high memory&lt;/li&gt;
&lt;li&gt;✅ Fluent Bit DaemonSet on all 6 clusters shipping logs to CloudWatch&lt;/li&gt;
&lt;li&gt;✅ CloudWatch log groups per namespace per cluster&lt;/li&gt;
&lt;li&gt;✅ CloudWatch Insights for ad-hoc log queries&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Screenshots
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Grafana: Kubernetes cluster overview dashboard showing node CPU and memory&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F80fld1ic5bi2llmxn84q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F80fld1ic5bi2llmxn84q.png" alt="Grafana Kubernetes cluster overview dashboard with CPU, memory, and pod count panels and the cluster selector dropdown visible" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grafana: Node Exporter dashboard showing per-node metrics&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdimxl0baoqd956a6hjjz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdimxl0baoqd956a6hjjz.png" alt="Node Exporter dashboard with CPU usage, memory usage, disk I/O, and network I/O graphs in one view" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS CloudWatch: log groups showing the /eks/ hierarchy from Fluent Bit&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rc0vs525hteex2n6hsu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rc0vs525hteex2n6hsu.png" alt="CloudWatch console listing the /eks/myapp-production-use1/ log groups (application, dataplane, host) created by Fluent Bit" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CloudWatch Insights: query result showing application logs&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgq4ww2vh40bjajtglor4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgq4ww2vh40bjajtglor4.png" alt="CloudWatch Logs Insights results for the query fields @timestamp, @message | sort @timestamp desc | limit 20, showing application log lines" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Next: Part 10 — Resilience: Karpenter, HPA, Argo Rollouts, and Velero&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Follow the series&lt;/strong&gt; — next part publishes next Wednesday.&lt;br&gt;
&lt;strong&gt;Live system:&lt;/strong&gt; &lt;a href="https://www.matthewoladipupo.dev/health" rel="noopener noreferrer"&gt;https://www.matthewoladipupo.dev/health&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Runbook:&lt;/strong&gt; &lt;a href="https://github.com/MatthewDipo/myapp-infra/blob/main/docs/runbook.md" rel="noopener noreferrer"&gt;Operations Guide&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Source code:&lt;/strong&gt; &lt;a href="https://github.com/MatthewDipo/myapp-infra" rel="noopener noreferrer"&gt;myapp-infra&lt;/a&gt; | &lt;a href="https://github.com/MatthewDipo/myapp-gitops" rel="noopener noreferrer"&gt;myapp-gitops&lt;/a&gt; | &lt;a href="https://github.com/MatthewDipo/myapp" rel="noopener noreferrer"&gt;myapp&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>monitoring</category>
      <category>devops</category>
      <category>aws</category>
    </item>
    <item>
      <title>Part 8: Security Stack</title>
      <dc:creator>Matthew</dc:creator>
      <pubDate>Wed, 06 May 2026 08:00:00 +0000</pubDate>
      <link>https://dev.to/matthewdipo/part-8-security-stack-48if</link>
      <guid>https://dev.to/matthewdipo/part-8-security-stack-48if</guid>
      <description>&lt;h2&gt;
  
  
  Security Stack — Kyverno, Falco, WAF, and GuardDuty
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Part of the series: Building a Production-Grade DevSecOps Pipeline on AWS&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction: Defense in Depth
&lt;/h2&gt;

&lt;p&gt;No single security tool is sufficient. A WAF blocks HTTP attacks but does nothing if an attacker exploits a container escape. Kyverno blocks bad pod configurations but can't stop an attacker who is already inside a running container. Each layer catches what the others miss.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88nfm95f8jp7bck9nlyd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88nfm95f8jp7bck9nlyd.png" alt="Security Defense in Depth — 5 concentric layers: Supply Chain (Trivy, &lt;br&gt;
Cosign, ECR), Runtime (Falco eBPF), Admission Control (Kyverno), Cloud &lt;br&gt;
Perimeter (GuardDuty, WAF), Network (VPC, Security Groups)" width="800" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;An attacker must get past all 5 layers. A container escape attempt can be caught at several of them: Kyverno at admission, Falco at runtime, and GuardDuty through API anomaly detection.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────────┐
│  SECURITY LAYERS — each catches different attack vectors            │
│                                                                     │
│  Layer 1: SUPPLY CHAIN (Part 6)                                     │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │ Trivy: no HIGH/CRITICAL CVEs in image                       │    │
│  │ Cosign: image cryptographically signed before push          │    │
│  │ Distroless: no shell/tools available post-compromise        │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                    ↓ (image passes, reaches cluster)                │
│  Layer 2: ADMISSION CONTROL (Kyverno)                               │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │ Blocks bad configs at kubectl apply / ArgoCD sync time      │    │
│  │ Pod never starts if it violates policy                      │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                    ↓ (pod starts, attacker gets RCE)                │
│  Layer 3: RUNTIME DETECTION (Falco)                                 │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │ eBPF syscall monitoring — detects attacks already running   │    │
│  │ Alerts within 1 second of suspicious activity               │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                    ↓ (attacker reaches HTTP layer)                  │
│  Layer 4: PERIMETER (AWS WAF)                                       │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │ Blocks SQLi, XSS, log4shell, rate limiting at ALB level     │    │
│  │ Attacker request never reaches your pod                     │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                    ↓ (account-level threats)                        │
│  Layer 5: THREAT INTELLIGENCE (GuardDuty)                           │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │ ML-based: crypto mining, C2 comms, compromised credentials  │    │
│  │ Monitors CloudTrail, VPC Flow Logs, DNS queries             │    │
│  └─────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Kyverno — Admission Control
&lt;/h2&gt;

&lt;p&gt;Kyverno is a Kubernetes-native policy engine. Policies are written in YAML, not a separate policy language, which makes them readable and maintainable by anyone who knows Kubernetes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Version Selection
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;This matters enormously.&lt;/strong&gt; The Kyverno Helm chart 3.7.x requires Kubernetes ≥ 1.30 because it uses the &lt;code&gt;ValidatingAdmissionPolicy&lt;/code&gt; v1 API, which only became generally available in Kubernetes 1.30. Our clusters run Kubernetes 1.29, so we pin an older chart version.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Kyverno 3.7.x → CrashLoopBackOff on k8s 1.29&lt;/li&gt;
&lt;li&gt;✅ Kyverno 3.2.6 (app version 1.12.5) → compatible with k8s 1.25–1.29
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/kyverno/applicationset.yaml&lt;/span&gt;
&lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;https://kyverno.github.io/kyverno&lt;/span&gt;
  &lt;span class="na"&gt;chart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;          &lt;span class="s"&gt;kyverno&lt;/span&gt;
  &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.2.6"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Installation Flags
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;kyverno kyverno/kyverno &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; kyverno &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--version&lt;/span&gt; 3.2.6 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-hooks&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;       &lt;span class="c"&gt;# REQUIRED: cleanup CronJobs get ImagePullBackOff&lt;/span&gt;
  &lt;span class="nt"&gt;--wait&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--timeout&lt;/span&gt; 10m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  System Namespace Exclusions
&lt;/h3&gt;

&lt;p&gt;Kyverno policies apply to all namespaces by default. System components like CoreDNS and kube-proxy run in &lt;code&gt;kube-system&lt;/code&gt; and don't follow application-level security policies (they need root, they need hostPath, etc.). Exclude system namespaces in every policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Applies to ALL policies below&lt;/span&gt;
&lt;span class="na"&gt;exclude&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;kyverno&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cert-manager&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;external-secrets&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;argo-rollouts&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;monitoring&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;logging&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;falco&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Policy 1: Block Privileged Containers
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kyverno.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;disallow-privileged-containers&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;validationFailureAction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enforce&lt;/span&gt;
  &lt;span class="na"&gt;background&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;check-privileged&lt;/span&gt;
      &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Pod&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exclude&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;kube-system&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;kyverno&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;cert-manager&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;external-secrets&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
                           &lt;span class="nv"&gt;argocd&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;argo-rollouts&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;monitoring&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;logging&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;falco&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;validate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Privileged&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;containers&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;are&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;not&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;allowed."&lt;/span&gt;
        &lt;span class="c1"&gt;# Use anyPattern in Kyverno v1.12.x (NOT validate.any)&lt;/span&gt;
        &lt;span class="na"&gt;anyPattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;=(securityContext)&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                    &lt;span class="na"&gt;=(privileged)&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;=(securityContext)&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;API change in v1.12.x:&lt;/strong&gt; Use &lt;code&gt;validate.anyPattern&lt;/code&gt; not &lt;code&gt;validate.any&lt;/code&gt;. The &lt;code&gt;validate.any&lt;/code&gt; syntax was removed in the 1.12 API version.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Policy 2: Require Non-Root Containers
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kyverno.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;require-non-root&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;validationFailureAction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enforce&lt;/span&gt;
  &lt;span class="na"&gt;background&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;check-runasnonroot&lt;/span&gt;
      &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Pod&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exclude&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;kube-system&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;kyverno&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;cert-manager&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;external-secrets&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
                           &lt;span class="nv"&gt;argocd&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;argo-rollouts&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;monitoring&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;logging&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;falco&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;validate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Containers&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;must&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;run&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;as&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;non-root&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(runAsNonRoot:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;true)."&lt;/span&gt;
        &lt;span class="na"&gt;anyPattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;securityContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;runAsNonRoot&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;securityContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                    &lt;span class="na"&gt;runAsNonRoot&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Policy 3: Block hostPath Volumes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kyverno.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;disallow-host-path&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;validationFailureAction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enforce&lt;/span&gt;
  &lt;span class="na"&gt;background&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;check-hostpath&lt;/span&gt;
      &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Pod&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exclude&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;kube-system&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;kyverno&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;logging&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;falco&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;validate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hostPath&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;volumes&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;are&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;not&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;allowed."&lt;/span&gt;
        &lt;span class="na"&gt;deny&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;conditions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;request.object.spec.volumes[].hostPath&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;length(@)&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
                &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GreaterThan&lt;/span&gt;
                &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Policy 4: Require Resource Limits
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kyverno.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;require-resource-limits&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;validationFailureAction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enforce&lt;/span&gt;
  &lt;span class="na"&gt;background&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;check-limits&lt;/span&gt;
      &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Pod&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exclude&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;kube-system&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;kyverno&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;cert-manager&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;external-secrets&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
                           &lt;span class="nv"&gt;argocd&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;argo-rollouts&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;monitoring&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;logging&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;falco&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;validate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CPU&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;memory&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;limits&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;are&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;containers."&lt;/span&gt;
        &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?*"&lt;/span&gt;
                    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
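
&lt;p&gt;For reference, here is a minimal sketch of a Pod that should pass Policies 1 through 4 as written above. The Pod name, the &lt;code&gt;demo&lt;/code&gt; namespace, the image, and the limit values are hypothetical placeholders, not taken from the pipeline.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical example: names, image, and limits are placeholders
apiVersion: v1
kind: Pod
metadata:
  name: compliant-demo
  namespace: demo
spec:
  securityContext:
    runAsNonRoot: true          # Policy 2 (pod-level alternative)
  containers:
    - name: app
      image: registry.example.com/demo/app:1.0.0
      securityContext:
        privileged: false       # Policy 1
        runAsNonRoot: true      # Policy 2 (container-level alternative)
      resources:
        limits:                 # Policy 4: both fields must be non-empty
          cpu: "500m"
          memory: "256Mi"
      volumeMounts:
        - name: scratch
          mountPath: /tmp
  volumes:
    - name: scratch
      emptyDir: {}              # Policy 3: emptyDir instead of hostPath
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Removing the &lt;code&gt;limits&lt;/code&gt; block, or swapping the &lt;code&gt;emptyDir&lt;/code&gt; volume for a &lt;code&gt;hostPath&lt;/code&gt; one, should cause the matching policy to reject the Pod at admission with the message configured above.&lt;/p&gt;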



&lt;h3&gt;
  
  
  Policy 5: Require Signed Images
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kyverno.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;require-signed-images&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;validationFailureAction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enforce&lt;/span&gt;
  &lt;span class="na"&gt;background&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;   &lt;span class="c1"&gt;# Must check at admission, not retroactively&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;check-image-signature&lt;/span&gt;
      &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Pod&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
              &lt;span class="na"&gt;namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;myapp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# Only enforce on application namespaces&lt;/span&gt;
      &lt;span class="na"&gt;verifyImages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;imageReferences&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;206617159586.dkr.ecr.us-east-1.amazonaws.com/myapp:*"&lt;/span&gt;
          &lt;span class="na"&gt;attestors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;entries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                    &lt;span class="na"&gt;kms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;awskms:///arn:aws:kms:us-east-1:206617159586:key/YOUR_KEY_ID"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Kyverno Circular Deadlock — How to Fix It
&lt;/h3&gt;

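&lt;p&gt;If Kyverno's webhook configurations are left in a broken state (for example, after a failed upgrade), the API server still sends pod creation requests to the Kyverno webhook for admission. With no healthy Kyverno pod to answer, those calls fail, and that includes the requests that would create the replacement Kyverno pods. The result is a circular deadlock: Kyverno cannot start because Kyverno is not running.&lt;/p&gt;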

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Symptom: Kyverno pods stuck in Pending or CrashLoopBackOff&lt;/span&gt;
&lt;span class="c"&gt;# Error: "failed calling webhook: the server is currently unable to handle the request"&lt;/span&gt;

&lt;span class="c"&gt;# Fix: Delete the broken webhook configs — this temporarily disables admission control&lt;/span&gt;
kubectl delete validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg
kubectl delete validatingwebhookconfiguration kyverno-policy-validating-webhook-cfg
kubectl delete mutatingwebhookconfiguration kyverno-resource-mutating-webhook-cfg

&lt;span class="c"&gt;# Kyverno pods can now start without passing their own webhooks&lt;/span&gt;
&lt;span class="c"&gt;# Once running, Kyverno recreates the webhook configs automatically&lt;/span&gt;
kubectl rollout restart deployment/kyverno &lt;span class="nt"&gt;-n&lt;/span&gt; kyverno
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Falco — Runtime Threat Detection
&lt;/h2&gt;

&lt;p&gt;Falco works at the Linux kernel level: a driver (here an eBPF probe) captures the system calls made by processes on each node, including every process inside a container, and hands them to the Falco engine in user space. When an event matches a rule's condition, Falco emits an alert in near real time.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────┐
│  HOW FALCO WORKS                                             │
│                                                              │
│  Kernel syscalls (open, exec, connect, read, write...)       │
│         │                                                    │
│         │  eBPF probe (kernel module or ebpf driver)         │
│         ▼                                                    │
│  Falco engine                                                │
│  ├── Checks each syscall against rule set                    │
│  ├── Rule: "exec of sh in container → ALERT"                 │
│  └── Rule: "read /etc/shadow → ALERT"                        │
│         │                                                    │
│         │  JSON alert output to stdout                       │
│         ▼                                                    │
│  Fluent Bit (DaemonSet) picks up stdout                      │
│         │                                                    │
│         ▼                                                    │
│  CloudWatch Logs: /eks/cluster-name/falco                    │
└──────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
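
&lt;p&gt;The last two hops in the diagram could look roughly like the Fluent Bit pipeline below. This is only a sketch under assumptions: the pipeline's actual Fluent Bit configuration is not shown in this article, the YAML configuration format needs a recent Fluent Bit release, and the path glob, tag, region, and stream prefix are illustrative values.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical Fluent Bit pipeline (YAML config format); all values are illustrative
pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/falco-*.log   # Falco stdout as written by the container runtime
      tag: falco.*
      multiline.parser: cri
  outputs:
    - name: cloudwatch_logs
      match: falco.*
      region: us-east-1
      log_group_name: /eks/cluster-name/falco
      log_stream_prefix: falco-
      auto_create_group: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because Falco is configured to write JSON, each alert arrives in CloudWatch as one structured record that can be filtered by rule name or priority.&lt;/p&gt;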



&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/falco/applicationset.yaml&lt;/span&gt;
&lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;https://falcosecurity.github.io/charts&lt;/span&gt;
  &lt;span class="na"&gt;chart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;          &lt;span class="s"&gt;falco&lt;/span&gt;
  &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.8.7"&lt;/span&gt;
  &lt;span class="na"&gt;helm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;driver:&lt;/span&gt;
        &lt;span class="s"&gt;kind: ebpf   # Modern eBPF driver (no kernel module compilation)&lt;/span&gt;
      &lt;span class="s"&gt;falco:&lt;/span&gt;
        &lt;span class="s"&gt;json_output: true     # JSON output for Fluent Bit parsing&lt;/span&gt;
        &lt;span class="s"&gt;log_stderr: true&lt;/span&gt;
        &lt;span class="s"&gt;log_level: info&lt;/span&gt;
      &lt;span class="s"&gt;falcosidekick:&lt;/span&gt;
        &lt;span class="s"&gt;enabled: false   # Using Fluent Bit for log shipping instead&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Rules That Fire by Default
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Trigger&lt;/th&gt;
&lt;th&gt;Severity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Terminal shell in container&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;exec&lt;/code&gt; of sh/bash/zsh in container&lt;/td&gt;
&lt;td&gt;NOTICE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read sensitive file untrusted&lt;/td&gt;
&lt;td&gt;Read of /etc/shadow, /etc/passwd&lt;/td&gt;
&lt;td&gt;WARNING&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write below root&lt;/td&gt;
&lt;td&gt;Any write to / or system dirs&lt;/td&gt;
&lt;td&gt;ERROR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Outbound Connection Not Expected&lt;/td&gt;
&lt;td&gt;Container connects to unexpected IP&lt;/td&gt;
&lt;td&gt;NOTICE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privilege Escalation via setuid&lt;/td&gt;
&lt;td&gt;setuid/setgid syscall&lt;/td&gt;
&lt;td&gt;WARNING&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Modify binary dirs&lt;/td&gt;
&lt;td&gt;Write to /bin, /usr/bin&lt;/td&gt;
&lt;td&gt;ERROR&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Testing Falco
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# In one terminal, watch Falco logs&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; falco &lt;span class="nt"&gt;-l&lt;/span&gt; app.kubernetes.io/name&lt;span class="o"&gt;=&lt;/span&gt;falco &lt;span class="nt"&gt;-c&lt;/span&gt; falco | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"Notice&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;Informational"&lt;/span&gt;

&lt;span class="c"&gt;# In another terminal, trigger a rule&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; myapp &amp;lt;pod-name&amp;gt; &lt;span class="nt"&gt;--&lt;/span&gt; sh
&lt;span class="c"&gt;# Falco fires: "Notice A shell was spawned in a container with an attached terminal"&lt;/span&gt;
&lt;span class="c"&gt;# (Note: distroless containers have no shell — this only works if you exec into a debug container)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Custom Rule: Alert on curl/wget
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/falco/custom-rules.yaml&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;rule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Unexpected curl or wget in container&lt;/span&gt;
  &lt;span class="na"&gt;desc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Detect curl or wget being used in a container (potential exfiltration)&lt;/span&gt;
  &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;spawned_process and&lt;/span&gt;
    &lt;span class="s"&gt;container and&lt;/span&gt;
    &lt;span class="s"&gt;proc.name in (curl, wget, python, python3) and&lt;/span&gt;
    &lt;span class="s"&gt;not proc.pname in (sh, bash)&lt;/span&gt;
  &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;Curl/wget detected in container&lt;/span&gt;
    &lt;span class="s"&gt;(user=%user.name command=%proc.cmdline container=%container.name&lt;/span&gt;
     &lt;span class="s"&gt;image=%container.image.repository)&lt;/span&gt;
  &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;WARNING&lt;/span&gt;
  &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;network&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;mitre_exfiltration&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
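
&lt;p&gt;How &lt;code&gt;custom-rules.yaml&lt;/code&gt; reaches the Falco pods is not shown above. One option, assuming the Falco Helm chart's &lt;code&gt;customRules&lt;/code&gt; value is used, is to embed the rule in the Helm values so the chart mounts it as an extra rules file. The fragment below is a sketch of that approach, not necessarily how this pipeline wires it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Sketch: Helm values for the falco chart (assumes the chart's customRules value).
# Each key becomes a rules file name; the key used here is a free choice.
customRules:
  custom-rules.yaml: |-
    - rule: Unexpected curl or wget in container
      desc: Detect curl or wget being used in a container (potential exfiltration)
      condition: &amp;gt;
        spawned_process and
        container and
        proc.name in (curl, wget, python, python3) and
        not proc.pname in (sh, bash)
      output: &amp;gt;
        Curl/wget detected in container
        (user=%user.name command=%proc.cmdline container=%container.name
         image=%container.image.repository)
      priority: WARNING
      tags: [network, mitre_exfiltration]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In the ArgoCD source shown under Installation, this fragment would sit inside the same &lt;code&gt;helm.values&lt;/code&gt; block as the &lt;code&gt;driver&lt;/code&gt; and &lt;code&gt;falco&lt;/code&gt; settings.&lt;/p&gt;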






&lt;h2&gt;
  
  
  AWS WAF — Web Application Firewall
&lt;/h2&gt;

&lt;p&gt;AWS WAF is associated with your ALB and inspects every HTTP request before the load balancer forwards it to your pods. Requests that match a blocking rule are rejected at the load balancer, so your application code never sees them.&lt;/p&gt;
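
&lt;p&gt;A web ACL only takes effect once it is attached to the load balancer. If the ALB is created by the AWS Load Balancer Controller, one way to attach it is the controller's &lt;code&gt;wafv2-acl-arn&lt;/code&gt; Ingress annotation, sketched below. The Ingress name, namespace, backend service, port, and the ARN are placeholders, and this pipeline may attach the ACL differently (for example, with an association resource in Terraform).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical Ingress: names, port, and the web ACL ARN are placeholders
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
  namespace: myapp
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/wafv2-acl-arn: arn:aws:wafv2:us-east-1:111122223333:regional/webacl/EXAMPLE_NAME/EXAMPLE_ID
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp
                port:
                  number: 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;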

&lt;h3&gt;
  
  
  Terraform Module
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# _modules/waf/main.tf&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_wafv2_web_acl"&lt;/span&gt; &lt;span class="s2"&gt;"main"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.env}-${var.region_alias}-web-acl"&lt;/span&gt;
  &lt;span class="nx"&gt;scope&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"REGIONAL"&lt;/span&gt;   &lt;span class="c1"&gt;# For ALB (not CloudFront)&lt;/span&gt;

  &lt;span class="nx"&gt;default_action&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;allow&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;   &lt;span class="c1"&gt;# Allow by default; rules below explicitly block&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;# Rule 1: AWS Managed — Common Rule Set (OWASP Top 10)&lt;/span&gt;
  &lt;span class="nx"&gt;rule&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"AWSManagedRulesCommonRuleSet"&lt;/span&gt;
    &lt;span class="nx"&gt;priority&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="nx"&gt;override_action&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;none&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;managed_rule_group_statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"AWSManagedRulesCommonRuleSet"&lt;/span&gt;
        &lt;span class="nx"&gt;vendor_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"AWS"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;visibility_config&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;cloudwatch_metrics_enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="nx"&gt;metric_name&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"CommonRuleSet"&lt;/span&gt;
      &lt;span class="nx"&gt;sampled_requests_enabled&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;# Rule 2: SQL Injection protection&lt;/span&gt;
  &lt;span class="nx"&gt;rule&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"AWSManagedRulesSQLiRuleSet"&lt;/span&gt;
    &lt;span class="nx"&gt;priority&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="nx"&gt;override_action&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;none&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;managed_rule_group_statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"AWSManagedRulesSQLiRuleSet"&lt;/span&gt;
        &lt;span class="nx"&gt;vendor_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"AWS"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;visibility_config&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;cloudwatch_metrics_enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="nx"&gt;metric_name&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"SQLiRuleSet"&lt;/span&gt;
      &lt;span class="nx"&gt;sampled_requests_enabled&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;# Rule 3: Known bad inputs (log4shell, Spring4Shell, etc.)&lt;/span&gt;
  &lt;span class="nx"&gt;rule&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"AWSManagedRulesKnownBadInputsRuleSet"&lt;/span&gt;
    &lt;span class="nx"&gt;priority&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
    &lt;span class="nx"&gt;override_action&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;none&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;managed_rule_group_statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"AWSManagedRulesKnownBadInputsRuleSet"&lt;/span&gt;
        &lt;span class="nx"&gt;vendor_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"AWS"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;visibility_config&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;cloudwatch_metrics_enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="nx"&gt;metric_name&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"KnownBadInputs"&lt;/span&gt;
      &lt;span class="nx"&gt;sampled_requests_enabled&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;# Rule 4: Rate limiting — 2000 req/5min per IP&lt;/span&gt;
  &lt;span class="nx"&gt;rule&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"RateLimitPerIP"&lt;/span&gt;
    &lt;span class="nx"&gt;priority&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
    &lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;block&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;rate_based_statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;limit&lt;/span&gt;              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;
        &lt;span class="nx"&gt;aggregate_key_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"IP"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;visibility_config&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;cloudwatch_metrics_enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="nx"&gt;metric_name&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"RateLimit"&lt;/span&gt;
      &lt;span class="nx"&gt;sampled_requests_enabled&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;visibility_config&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;cloudwatch_metrics_enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="nx"&gt;metric_name&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.env}-web-acl"&lt;/span&gt;
    &lt;span class="nx"&gt;sampled_requests_enabled&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="s2"&gt;"web_acl_arn"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_wafv2_web_acl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
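
&lt;p&gt;To make Rule 4 concrete, here is a minimal Python sketch of a per-IP trailing-window counter. It only illustrates the idea behind &lt;code&gt;limit = 2000&lt;/code&gt; over a 5-minute window; AWS WAF does its own accounting inside the service, and the class and method names here are hypothetical.&lt;/p&gt;

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Toy model of a rate-based rule: block an IP once it sends more than
    `limit` requests inside the trailing `window_seconds`."""

    def __init__(self, limit=2000, window_seconds=300):
        self.limit = limit
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now):
        """Record a request from `ip` at time `now` (in seconds).

        Returns True if the request passes, False if it would be blocked.
        """
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the trailing window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        # Blocked requests still count toward the rate.
        hits.append(now)
        blocked = len(hits) > self.limit
        return not blocked
```

&lt;p&gt;Each IP gets its own counter, so one noisy client does not affect others, and a blocked client is let through again once its request rate inside the window drops back under the limit.&lt;/p&gt;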



&lt;h3&gt;
  
  
  Associating WAF with Your Ingress
&lt;/h3&gt;

&lt;p&gt;The WAF ACL ARN is injected per-cluster via the ApplicationSet and added as an ALB annotation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# apps/myapp/templates/ingress.yaml&lt;/span&gt;
&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- if .Values.ingress.enabled&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;include "myapp.fullname" .&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kubernetes.io/ingress.class&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;alb&lt;/span&gt;
    &lt;span class="na"&gt;alb.ingress.kubernetes.io/scheme&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;internet-facing&lt;/span&gt;
    &lt;span class="na"&gt;alb.ingress.kubernetes.io/target-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ip&lt;/span&gt;
    &lt;span class="na"&gt;alb.ingress.kubernetes.io/listen-ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[{"HTTPS":443},{"HTTP":80}]'&lt;/span&gt;
    &lt;span class="na"&gt;alb.ingress.kubernetes.io/ssl-redirect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;443"&lt;/span&gt;
    &lt;span class="na"&gt;alb.ingress.kubernetes.io/certificate-arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Values.ingress.certArn&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
    &lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- if .Values.ingress.wafAclArn&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
    &lt;span class="na"&gt;alb.ingress.kubernetes.io/wafv2-acl-arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Values.ingress.wafAclArn&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
    &lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- end&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  AWS GuardDuty — Threat Intelligence
&lt;/h2&gt;

&lt;p&gt;GuardDuty is enabled per AWS account and per Region. It analyzes CloudTrail logs, VPC Flow Logs, and DNS query logs, combining machine learning, anomaly detection, and threat intelligence feeds to identify threats.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# _modules/guardduty/main.tf&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_guardduty_detector"&lt;/span&gt; &lt;span class="s2"&gt;"main"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;enable&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

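  &lt;span class="c1"&gt;# Note: recent AWS provider versions mark the datasources block as deprecated&lt;/span&gt;
  &lt;span class="c1"&gt;# in favor of aws_guardduty_detector_feature resources; check your provider version.&lt;/span&gt;
  &lt;span class="nx"&gt;datasources&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;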
    &lt;span class="nx"&gt;s3_logs&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;enable&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;   &lt;span class="c1"&gt;# Detect unusual S3 access patterns&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;kubernetes&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;audit_logs&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;enable&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  &lt;span class="c1"&gt;# Monitor EKS audit logs for suspicious API calls&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;malware_protection&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;scan_ec2_instance_with_findings&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;ebs_volumes&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;enable&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  &lt;span class="c1"&gt;# Scan EBS volumes when GuardDuty finds a threat&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What GuardDuty Detects
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Finding Type&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CryptoCurrency:EC2/BitcoinTool.B!DNS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;EC2 instance querying known crypto mining pools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;UnauthorizedAccess:IAMUser/TorIPCaller&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;API calls originating from Tor exit nodes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CredentialAccess:Kubernetes/SuccessfulAnonymousAccess&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Anonymous access to Kubernetes API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Execution:Kubernetes/ExecInKubeSystemPod&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;A command was executed inside a pod in the &lt;code&gt;kube-system&lt;/code&gt; namespace&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Exfiltration:S3/AnomalousBehavior&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Unusual S3 read patterns suggesting data theft&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
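
&lt;p&gt;The finding types in the table follow GuardDuty's documented naming format, &lt;code&gt;ThreatPurpose:ResourceTypeAffected/ThreatFamilyName.DetectionMechanism!Artifact&lt;/code&gt;, where the last two parts are optional. If you route findings to different channels, splitting the type into its parts is a handy first step. A minimal Python sketch, with hypothetical function and class names:&lt;/p&gt;

```python
import re
from typing import NamedTuple, Optional


class FindingType(NamedTuple):
    threat_purpose: str                  # e.g. "CryptoCurrency"
    resource_type: str                   # e.g. "EC2"
    threat_family: str                   # e.g. "BitcoinTool"
    detection_mechanism: Optional[str]   # e.g. "B" (may be absent)
    artifact: Optional[str]              # e.g. "DNS" (may be absent)


# purpose ":" resource "/" family ["." mechanism] ["!" artifact]
_FINDING_RE = re.compile(r"^([^:]+):([^/]+)/([^.!]+)(?:\.([^!]+))?(?:!(.+))?$")


def parse_finding_type(value):
    """Split a GuardDuty finding type string into its named parts."""
    match = _FINDING_RE.match(value)
    if match is None:
        raise ValueError(f"not a GuardDuty finding type: {value!r}")
    return FindingType(*match.groups())
```

&lt;p&gt;For example, &lt;code&gt;parse_finding_type("CryptoCurrency:EC2/BitcoinTool.B!DNS").resource_type&lt;/code&gt; is &lt;code&gt;"EC2"&lt;/code&gt;, which is enough to send EC2 findings to one team and Kubernetes findings to another.&lt;/p&gt;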




&lt;h2&gt;
  
  
  Attack Scenario: All Layers in Action
&lt;/h2&gt;

&lt;p&gt;Here's how the layers respond, step by step, to a realistic attack:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario: An attacker finds a dependency with an RCE vulnerability&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Attacker discovers CVE-XXXX-1234 in one of your npm packages

   → Trivy: if the CVE is HIGH/CRITICAL, the build FAILS — image never pushed
     (Layer 1: Supply Chain)

2. If Trivy missed it (unfixed CVE) and the image was pushed:
   Attacker triggers the RCE, gets command execution in the pod

   → Falco: "A shell was spawned in container myapp-abc123"
     Alert fires within 1 second to CloudWatch
     (Layer 3: Runtime Detection)

   → And the shell itself is a long shot: distroless has no /bin/sh to spawn
     Bringing in a binary needs a writable filesystem, which is also blocked
     (Layer 1: Distroless base)

3. Attacker tries to deploy a privileged pod to escape to the node:

   → Kyverno: BLOCKS the pod at admission — "Privileged containers not allowed"
     Pod never starts
     (Layer 2: Admission Control)

4. Attacker tries SQL injection via the public HTTP endpoint:

   → AWS WAF: blocks the request at the ALB
     Your pod code never executes the malicious query
     (Layer 4: Perimeter)

5. Attacker's stolen AWS key starts making API calls:

   → GuardDuty: unusual API call pattern detected
     Finding generated: "Discovery:IAMUser/AnomalousBehavior"
     (Layer 5: Threat Intelligence)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;By the end of Part 8 you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Kyverno (Helm chart 3.2.6) running on all 6 clusters (compatible with Kubernetes 1.29)&lt;/li&gt;
&lt;li&gt;✅ Five Kyverno policies enforcing: no privileged, no root, no hostPath, resource limits, signed images&lt;/li&gt;
&lt;li&gt;✅ Falco DaemonSet monitoring all syscalls with eBPF driver&lt;/li&gt;
&lt;li&gt;✅ Falco alerts flowing to CloudWatch via Fluent Bit&lt;/li&gt;
&lt;li&gt;✅ AWS WAF WebACL with OWASP Top 10, SQLi, known bad inputs, and rate limiting&lt;/li&gt;
&lt;li&gt;✅ GuardDuty enabled in all accounts with EKS audit log monitoring&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Screenshots
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;AWS WAF console showing the WebACL with managed rule groups and request metrics&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3mf667e3erwf3mkv8x1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3mf667e3erwf3mkv8x1.png" alt="AWS WAF console: WebACL rules list with managed rule groups, their Allow/Block actions, and request metrics" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS GuardDuty console showing the Findings summary (hopefully empty in production)&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjfai4xgrqdmf085twsfh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjfai4xgrqdmf085twsfh.png" alt="AWS GuardDuty console: Findings summary" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;kubectl get clusterpolicies showing all Kyverno policies as Ready&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjs54gsuu1e0m2ew21jwo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjs54gsuu1e0m2ew21jwo.png" alt="Terminal output of kubectl get clusterpolicies: all 5 Kyverno policies with READY True and BACKGROUND True, in Enforce mode" width="800" height="734"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Next: Part 9 — Observability: Prometheus, Grafana, Fluent Bit, and CloudWatch&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Follow the series&lt;/strong&gt; — next part publishes next Wednesday.&lt;br&gt;
&lt;strong&gt;Live system:&lt;/strong&gt; &lt;a href="https://www.matthewoladipupo.dev/health" rel="noopener noreferrer"&gt;https://www.matthewoladipupo.dev/health&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Runbook:&lt;/strong&gt; &lt;a href="https://github.com/MatthewDipo/myapp-infra/blob/main/docs/runbook.md" rel="noopener noreferrer"&gt;Operations Guide&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Source code:&lt;/strong&gt; &lt;a href="https://github.com/MatthewDipo/myapp-infra" rel="noopener noreferrer"&gt;myapp-infra&lt;/a&gt; | &lt;a href="https://github.com/MatthewDipo/myapp-gitops" rel="noopener noreferrer"&gt;myapp-gitops&lt;/a&gt; | &lt;a href="https://github.com/MatthewDipo/myapp" rel="noopener noreferrer"&gt;myapp&lt;/a&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>aws</category>
    </item>
    <item>
      <title>Part 7: Secrets Management</title>
      <dc:creator>Matthew</dc:creator>
      <pubDate>Wed, 29 Apr 2026 08:00:00 +0000</pubDate>
      <link>https://dev.to/matthewdipo/part-7-secrets-management-108k</link>
      <guid>https://dev.to/matthewdipo/part-7-secrets-management-108k</guid>
      <description>&lt;h2&gt;
  
  
  Part 7: Secrets Management — AWS Secrets Manager + External Secrets Operator + IRSA
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Part of the series: Building a Production-Grade DevSecOps Pipeline on AWS&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Secrets management is one of the areas where teams most commonly take shortcuts that later become security incidents. The three anti-patterns to avoid:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Secrets in Git&lt;/strong&gt;: even in a private repo, any developer with access can read them, and git history preserves them even after the file is deleted&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secrets baked into images&lt;/strong&gt;: anyone who can pull the image gets the secrets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secrets in Kubernetes ConfigMaps&lt;/strong&gt;: ConfigMaps are meant for non-sensitive configuration, are not encrypted by default, and can be mounted by any pod in the namespace&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ieyrvjniq5vie8ybbd7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ieyrvjniq5vie8ybbd7.png" alt="IRSA Secrets Flow: how Kubernetes pods access AWS Secrets Manager without static credentials using IAM Roles for Service Accounts and OIDC token exchange" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The pod never holds long-lived AWS credentials. It exchanges a short-lived Kubernetes JWT (1 hour) for temporary IAM credentials via AWS STS, then ESO fetches the secret and creates a Kubernetes Secret.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This pipeline uses a three-layer approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AWS Secrets Manager&lt;/strong&gt; — the source of truth; all secrets live here, encrypted with KMS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External Secrets Operator (ESO)&lt;/strong&gt; — a Kubernetes controller that fetches secrets from AWS and creates Kubernetes Secret objects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IRSA (IAM Roles for Service Accounts)&lt;/strong&gt; — pod-level AWS identity so ESO can call Secrets Manager without node-level credentials&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────────┐
│  SECRETS FLOW                                                        │
│                                                                      │
│  AWS Secrets Manager                                                 │
│  production/myapp/db-password: "s3cr3t-v@lue"                        │
│         │                                                            │
│         │  GetSecretValue (every 1h)                                 │
│         │  Authenticated via IRSA (OIDC token → STS → temp creds)    │
│         ▼                                                            │
│  ESO Operator Pod                                                    │
│  (external-secrets namespace)                                        │
│         │                                                            │
│         │  Creates/updates                                           │
│         ▼                                                            │
│  Kubernetes Secret: myapp-db-password                                │
│  (myapp namespace)                                                   │
│         │                                                            │
│         │  Exposed as env var                                        │
│         ▼                                                            │
│  myapp Pod: process.env.DB_PASSWORD = "s3cr3t-v@lue"                 │
└──────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
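
&lt;p&gt;The "Creates/updates" hop in the diagram can be pictured with a short Python sketch. This is not ESO's code, only an illustration of the shape of the object it ends up writing; the function name and the &lt;code&gt;password&lt;/code&gt; key are assumptions. Note that values under &lt;code&gt;data&lt;/code&gt; in a Kubernetes Secret are base64-encoded, which is an encoding, not encryption.&lt;/p&gt;

```python
import base64


def build_secret_manifest(name, namespace, values):
    """Build a Kubernetes Secret manifest (as a dict) from plain-text values.

    `values` maps key names to the strings fetched from the secret store.
    """
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        # Kubernetes expects every value under `data` to be base64-encoded.
        "data": {
            key: base64.b64encode(value.encode("utf-8")).decode("ascii")
            for key, value in values.items()
        },
    }
```

&lt;p&gt;Because anyone who can read the Secret can decode it, access to the namespace still needs to be restricted with RBAC; the source of truth stays in Secrets Manager.&lt;/p&gt;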



&lt;h2&gt;
  
  
  IRSA Deep Dive
&lt;/h2&gt;

&lt;p&gt;IRSA (IAM Roles for Service Accounts) is the mechanism that gives individual Kubernetes pods fine-grained AWS IAM permissions without sharing credentials at the node level.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────────────────────────────────────────────┐
│  HOW IRSA WORKS                                                    │
│                                                                    │
│  1. EKS creates an OIDC provider for the cluster                   │
│     URL: oidc.eks.us-east-1.amazonaws.com/id/CLUSTER_ID            │
│                                                                    │
│  2. Pod's ServiceAccount is annotated:                             │
│     eks.amazonaws.com/role-arn: arn:aws:iam::ACCT:role/eso-role    │
│                                                                    │
│  3. EKS projects a signed OIDC JWT into the pod at:                │
│     /var/run/secrets/eks.amazonaws.com/serviceaccount/token        │
│                                                                    │
│  4. ESO SDK calls sts:AssumeRoleWithWebIdentity with that token    │
│                                                                    │
│  5. AWS validates: token signed by trusted OIDC provider?          │
│                    sub matches trust policy condition?             │
│                                                                    │
│  6. AWS returns temporary creds (AccessKeyId + SecretKey + Token)  │
│     Valid for 1 hour, then automatically expire                    │
│                                                                    │
│  Result: only THIS pod in THIS namespace with THIS SA can          │
│  assume the role — not any other pod on the same node              │
└────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
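
&lt;p&gt;Step 5 is where most IRSA setups fail, so it helps to see the comparison spelled out. The Python sketch below models only the &lt;code&gt;StringEquals&lt;/code&gt; check on the token's &lt;code&gt;aud&lt;/code&gt; and &lt;code&gt;sub&lt;/code&gt; claims; it does not verify the token signature, and both function names are hypothetical, not part of any AWS SDK.&lt;/p&gt;

```python
def expected_sub(namespace, service_account):
    """Subject claim Kubernetes places in a service account token."""
    return f"system:serviceaccount:{namespace}:{service_account}"


def trust_policy_allows(policy, oidc_provider, token_claims):
    """Return True if some Allow statement's StringEquals conditions all
    match the token claims.

    `oidc_provider` is the issuer URL without https://, which prefixes the
    condition keys (for example "ISSUER:sub").
    """
    prefix = f"{oidc_provider}:"
    for statement in policy["Statement"]:
        if statement.get("Effect") != "Allow":
            continue
        if statement.get("Action") != "sts:AssumeRoleWithWebIdentity":
            continue
        conditions = statement.get("Condition", {}).get("StringEquals", {})
        # Every condition key must belong to this provider and match exactly.
        if all(
            key.startswith(prefix) and token_claims.get(key[len(prefix):]) == value
            for key, value in conditions.items()
        ):
            return True
    return False
```

&lt;p&gt;The match is an exact string comparison: a service account in a different namespace, or one with a slightly different name, produces a different &lt;code&gt;sub&lt;/code&gt; and the role cannot be assumed.&lt;/p&gt;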



&lt;p&gt;&lt;strong&gt;Why IRSA over node-level IAM?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Node-level IAM (an EC2 instance profile) gives every pod on the node the same permissions. If one pod is compromised, the attacker inherits everything the node role can do, including permissions intended for other workloads on that node. IRSA scopes permissions to a single service account, so the blast radius of a compromise is limited to the workloads that use it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Terraform: ESO IRSA Module
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# _modules/eso-irsa/main.tf&lt;/span&gt;

&lt;span class="nx"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"cluster_name"&lt;/span&gt;      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"env"&lt;/span&gt;               &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"oidc_provider_arn"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"oidc_provider"&lt;/span&gt;     &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# URL without https://&lt;/span&gt;
&lt;span class="nx"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"account_id"&lt;/span&gt;        &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"aws_region"&lt;/span&gt;        &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;locals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;# Helm fullname = {release}-{chart}&lt;/span&gt;
  &lt;span class="c1"&gt;# Release name = cluster name (e.g., myapp-production-use1)&lt;/span&gt;
  &lt;span class="c1"&gt;# Chart name = myapp&lt;/span&gt;
  &lt;span class="c1"&gt;# So the SA name = myapp-production-use1-myapp&lt;/span&gt;
  &lt;span class="c1"&gt;#&lt;/span&gt;
  &lt;span class="c1"&gt;# CRITICAL: The IRSA trust policy sub claim must match this exactly.&lt;/span&gt;
  &lt;span class="c1"&gt;# Getting this wrong = ESO pods cannot assume the role = secrets don't sync.&lt;/span&gt;
  &lt;span class="nx"&gt;sa_name&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.cluster_name}-myapp"&lt;/span&gt;
  &lt;span class="nx"&gt;namespace&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"external-secrets"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role"&lt;/span&gt; &lt;span class="s2"&gt;"eso"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.cluster_name}-eso"&lt;/span&gt;

  &lt;span class="nx"&gt;assume_role_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Federated&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;oidc_provider_arn&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sts:AssumeRoleWithWebIdentity"&lt;/span&gt;
      &lt;span class="nx"&gt;Condition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;StringEquals&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="s2"&gt;"${var.oidc_provider}:aud"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sts.amazonaws.com"&lt;/span&gt;
          &lt;span class="c1"&gt;# Must match: system:serviceaccount:{namespace}:{sa-name}&lt;/span&gt;
          &lt;span class="s2"&gt;"${var.oidc_provider}:sub"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"system:serviceaccount:${local.namespace}:${local.sa_name}"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role_policy"&lt;/span&gt; &lt;span class="s2"&gt;"eso"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"eso-secrets-policy"&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;eso&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"SecretsManagerRead"&lt;/span&gt;
        &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
        &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="s2"&gt;"secretsmanager:GetSecretValue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"secretsmanager:DescribeSecret"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"secretsmanager:ListSecretVersionIds"&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="c1"&gt;# Wildcard covers all secrets for this environment&lt;/span&gt;
        &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:secretsmanager:*:${var.account_id}:secret:${var.env}/myapp/*"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="s2"&gt;"role_arn"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;eso&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Creating Secrets in AWS Secrets Manager
&lt;/h2&gt;

&lt;p&gt;Use a structured naming convention: &lt;code&gt;{env}/myapp/{secret-name}&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Development secrets (dev account)&lt;/span&gt;
aws secretsmanager create-secret &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="s2"&gt;"dev/myapp/db-password"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--secret-string&lt;/span&gt; &lt;span class="s1"&gt;'{"password":"dev-db-secret-here"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-dev-use1

&lt;span class="c"&gt;# Production secrets (production account)&lt;/span&gt;
aws secretsmanager create-secret &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="s2"&gt;"production/myapp/db-password"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--secret-string&lt;/span&gt; &lt;span class="s1"&gt;'{"password":"prod-super-secret-here"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-prod-use1

&lt;span class="c"&gt;# To update a secret (rotation):&lt;/span&gt;
aws secretsmanager update-secret &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--secret-id&lt;/span&gt; &lt;span class="s2"&gt;"production/myapp/db-password"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--secret-string&lt;/span&gt; &lt;span class="s1"&gt;'{"password":"new-rotated-password"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-prod-use1
&lt;span class="c"&gt;# ESO picks up the new value within refreshInterval (1h default)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  ESO Installation via ArgoCD
&lt;/h2&gt;

&lt;p&gt;ESO runs as a controller in the &lt;code&gt;external-secrets&lt;/code&gt; namespace on every cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/eso/applicationset.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ApplicationSet&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external-secrets-operator&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;generators&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
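        &lt;span class="na"&gt;elements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="c1"&gt;# Each element also needs an irsaRoleArn key so that {{irsaRoleArn}} in the template resolves&lt;/span&gt;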
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp-dev-use1&lt;/span&gt;
            &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;us-east-1&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp-dev-usw2&lt;/span&gt;
            &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;us-west-2&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp-staging-use1&lt;/span&gt;
            &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;us-east-1&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp-staging-usw2&lt;/span&gt;
            &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;us-west-2&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp-production-use1&lt;/span&gt;
            &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;us-east-1&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp-production-usw2&lt;/span&gt;
            &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;us-west-2&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eso-{{cluster}}"&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
      &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;https://charts.external-secrets.io&lt;/span&gt;
        &lt;span class="na"&gt;chart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;          &lt;span class="s"&gt;external-secrets&lt;/span&gt;
        &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.9.13"&lt;/span&gt;
        &lt;span class="na"&gt;helm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;serviceAccount:&lt;/span&gt;
              &lt;span class="s"&gt;annotations:&lt;/span&gt;
                &lt;span class="s"&gt;eks.amazonaws.com/role-arn: "{{irsaRoleArn}}"&lt;/span&gt;
      &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{cluster}}"&lt;/span&gt;
        &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external-secrets&lt;/span&gt;
      &lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;syncOptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;CreateNamespace=true&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  SecretStore CRD
&lt;/h2&gt;

&lt;p&gt;The SecretStore tells ESO where to find secrets and how to authenticate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# apps/myapp/templates/secretstore.yaml&lt;/span&gt;
&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- if .Values.externalSecrets.enabled&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external-secrets.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SecretStore&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-secrets-manager&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Release.Namespace&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;aws&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SecretsManager&lt;/span&gt;
      &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;   &lt;span class="c1"&gt;# Always pull from us-east-1 (single source of truth)&lt;/span&gt;
      &lt;span class="na"&gt;auth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;jwt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;serviceAccountRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;include "myapp.serviceAccountName" .&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
            &lt;span class="c1"&gt;# DO NOT add namespace: field here.&lt;/span&gt;
            &lt;span class="c1"&gt;# Namespaced SecretStore rejects serviceAccountRef.namespace.&lt;/span&gt;
            &lt;span class="c1"&gt;# The SA is in the same namespace as the SecretStore.&lt;/span&gt;
&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- end&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Lesson learned:&lt;/strong&gt; A namespaced &lt;code&gt;SecretStore&lt;/code&gt; (not &lt;code&gt;ClusterSecretStore&lt;/code&gt;) will reject the config if &lt;code&gt;serviceAccountRef.namespace&lt;/code&gt; is specified. The namespace is implicit — it's the same namespace as the SecretStore itself. Only &lt;code&gt;ClusterSecretStore&lt;/code&gt; supports cross-namespace references.&lt;/p&gt;
&lt;/blockquote&gt;
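
&lt;p&gt;For contrast, here is a minimal sketch of the cluster-scoped variant, where &lt;code&gt;serviceAccountRef.namespace&lt;/code&gt; is allowed. The store name, service account name, and namespace are illustrative assumptions, not values from this pipeline:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical ClusterSecretStore (not part of this build)
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager-cluster   # assumed name; no namespace, the resource is cluster-scoped
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: myapp        # assumed service account name
            namespace: myapp   # assumed namespace; a cluster-scoped store has to say where the SA lives
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An &lt;code&gt;ExternalSecret&lt;/code&gt; would then reference it with &lt;code&gt;secretStoreRef.kind: ClusterSecretStore&lt;/code&gt;.&lt;/p&gt;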




&lt;h2&gt;
  
  
  ExternalSecret CRD
&lt;/h2&gt;

&lt;p&gt;The ExternalSecret tells ESO which secret to fetch and what Kubernetes Secret to create:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# apps/myapp/templates/external-secret.yaml&lt;/span&gt;
&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- if .Values.externalSecrets.enabled&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external-secrets.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ExternalSecret&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp-secrets&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Release.Namespace&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;refreshInterval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Values.externalSecrets.refreshInterval | default "1h"&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;

  &lt;span class="na"&gt;secretStoreRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-secrets-manager&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SecretStore&lt;/span&gt;

  &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp-db-credentials&lt;/span&gt;    &lt;span class="c1"&gt;# Name of the Kubernetes Secret to create&lt;/span&gt;
    &lt;span class="na"&gt;creationPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Owner&lt;/span&gt;         &lt;span class="c1"&gt;# ESO owns this secret — it will delete it if ExternalSecret is deleted&lt;/span&gt;

  &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;secretKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DB_PASSWORD&lt;/span&gt;       &lt;span class="c1"&gt;# Key in the Kubernetes Secret&lt;/span&gt;
      &lt;span class="na"&gt;remoteRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production/myapp/db-password&lt;/span&gt;   &lt;span class="c1"&gt;# Secret name in AWS Secrets Manager (production path; dev clusters need dev/myapp/db-password)&lt;/span&gt;
        &lt;span class="na"&gt;property&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;password&lt;/span&gt;                  &lt;span class="c1"&gt;# JSON key within the secret value&lt;/span&gt;
&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- end&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
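
&lt;p&gt;The template above hardcodes the &lt;code&gt;production/&lt;/code&gt; path, while the IAM policy only grants each environment access to its own &lt;code&gt;{env}/myapp/*&lt;/code&gt; prefix. One way to reuse the same template in every environment is to take the prefix from values. This is a sketch only: the &lt;code&gt;secretPrefix&lt;/code&gt; key is a hypothetical name, not something the chart currently defines.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# apps/myapp/values-dev.yaml (secretPrefix is a hypothetical key)
externalSecrets:
  enabled: true
  secretPrefix: dev

# apps/myapp/templates/external-secret.yaml (data section only)
spec:
  data:
    - secretKey: DB_PASSWORD
      remoteRef:
        # Renders to dev/myapp/db-password with the values above
        key: {{ .Values.externalSecrets.secretPrefix }}/myapp/db-password
        property: password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With &lt;code&gt;secretPrefix: production&lt;/code&gt; in the production values file, the rendered key matches the one shown in the original template.&lt;/p&gt;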






&lt;h2&gt;
  
  
  Helm Chart Integration
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# apps/myapp/values.yaml (defaults — disabled)&lt;/span&gt;
&lt;span class="na"&gt;externalSecrets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;         &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="na"&gt;refreshInterval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1h&lt;/span&gt;
  &lt;span class="na"&gt;irsaRoleArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;     &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# apps/myapp/values-production.yaml&lt;/span&gt;
&lt;span class="na"&gt;externalSecrets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;     &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;irsaRoleArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;   &lt;span class="c1"&gt;# Injected per-cluster via ApplicationSet parameter&lt;/span&gt;

&lt;span class="c1"&gt;# apps/myapp/values-dev.yaml&lt;/span&gt;
&lt;span class="na"&gt;externalSecrets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;     &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;irsaRoleArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::557702566877:role/myapp-dev-eso"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ApplicationSet injects the correct IRSA role ARN per cluster via parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;externalSecrets.irsaRoleArn"&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{irsaRoleArn}}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
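
&lt;p&gt;For &lt;code&gt;{{irsaRoleArn}}&lt;/code&gt; to resolve, each generator element has to carry that key, and the &lt;code&gt;parameters&lt;/code&gt; list sits under &lt;code&gt;source.helm&lt;/code&gt; in the ApplicationSet template. A sketch showing only the relevant keys follows; the role ARNs and the account ID &lt;code&gt;111111111111&lt;/code&gt; are placeholders, not real values from this build:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Sketch: only the keys involved in parameter injection are shown
spec:
  generators:
    - list:
        elements:
          - cluster: myapp-production-use1
            irsaRoleArn: arn:aws:iam::111111111111:role/myapp-production-use1-eso   # placeholder ARN
          - cluster: myapp-production-usw2
            irsaRoleArn: arn:aws:iam::111111111111:role/myapp-production-usw2-eso   # placeholder ARN
  template:
    spec:
      source:
        helm:
          parameters:
            - name: "externalSecrets.irsaRoleArn"
              value: "{{irsaRoleArn}}"   # filled in per element by the list generator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A parameter set this way overrides the empty &lt;code&gt;irsaRoleArn&lt;/code&gt; default from the values files.&lt;/p&gt;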






&lt;h2&gt;
  
  
  Deployment Mounting the Secret
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# apps/myapp/templates/deployment.yaml (relevant section)&lt;/span&gt;
&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DB_PASSWORD&lt;/span&gt;
    &lt;span class="na"&gt;valueFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;secretKeyRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp-db-credentials&lt;/span&gt;   &lt;span class="c1"&gt;# Created by ESO from ExternalSecret&lt;/span&gt;
        &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DB_PASSWORD&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Verification
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check ExternalSecret sync status&lt;/span&gt;
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 get externalsecret &lt;span class="nt"&gt;-n&lt;/span&gt; myapp
&lt;span class="c"&gt;# NAME            STORE                 REFRESH INTERVAL   STATUS          READY&lt;/span&gt;
&lt;span class="c"&gt;# myapp-secrets   aws-secrets-manager   1h                 SecretSynced    True&lt;/span&gt;

&lt;span class="c"&gt;# Check the Kubernetes Secret was created&lt;/span&gt;
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 get secret myapp-db-credentials &lt;span class="nt"&gt;-n&lt;/span&gt; myapp
&lt;span class="c"&gt;# NAME                   TYPE     DATA   AGE&lt;/span&gt;
&lt;span class="c"&gt;# myapp-db-credentials   Opaque   1      5m&lt;/span&gt;

&lt;span class="c"&gt;# View (base64 encoded — decode to see value)&lt;/span&gt;
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 get secret myapp-db-credentials &lt;span class="nt"&gt;-n&lt;/span&gt; myapp &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.data.DB_PASSWORD}'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ArgoCD false positive:&lt;/strong&gt; ArgoCD can show the myapp Application as &lt;code&gt;OutOfSync&lt;/code&gt; even when everything is working. ESO writes &lt;code&gt;status.refreshTime&lt;/code&gt; to the ExternalSecret on every refresh cycle, and ArgoCD reports that runtime write as a diff against the Git-defined manifest. This is a known, safe false positive: the secret is syncing correctly.&lt;/p&gt;
&lt;/blockquote&gt;
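
&lt;p&gt;If the &lt;code&gt;OutOfSync&lt;/code&gt; badge becomes noisy, ArgoCD's &lt;code&gt;ignoreDifferences&lt;/code&gt; setting can exclude the field from comparison. The fragment below is a sketch that assumes the diff really is limited to the &lt;code&gt;status&lt;/code&gt; block; it is not part of the original manifests.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical addition to the myapp Application spec (or the ApplicationSet template spec)
spec:
  ignoreDifferences:
    - group: external-secrets.io
      kind: ExternalSecret
      jsonPointers:
        - /status   # ignore everything ESO writes under status, including refreshTime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running &lt;code&gt;argocd app diff&lt;/code&gt; first shows which fields actually differ, so you do not hide a real drift by accident.&lt;/p&gt;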




&lt;h2&gt;
  
  
  Secret Rotation
&lt;/h2&gt;

&lt;p&gt;When you need to rotate a secret:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Update in AWS Secrets Manager&lt;/span&gt;
aws secretsmanager update-secret &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--secret-id&lt;/span&gt; &lt;span class="s2"&gt;"production/myapp/db-password"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--secret-string&lt;/span&gt; &lt;span class="s1"&gt;'{"password":"new-rotated-value"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-prod-use1

&lt;span class="c"&gt;# 2. ESO picks it up automatically within refreshInterval (1h)&lt;/span&gt;
&lt;span class="c"&gt;# OR force immediate refresh:&lt;/span&gt;
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 annotate externalsecret myapp-secrets &lt;span class="nt"&gt;-n&lt;/span&gt; myapp &lt;span class="se"&gt;\&lt;/span&gt;
  force-sync&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="nt"&gt;--overwrite&lt;/span&gt;

&lt;span class="c"&gt;# 3. Kubernetes Secret is updated&lt;/span&gt;
&lt;span class="c"&gt;# 4. Pods need restart to pick up new env var value:&lt;/span&gt;
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 rollout restart deployment/myapp &lt;span class="nt"&gt;-n&lt;/span&gt; myapp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
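
&lt;p&gt;The restart in step 4 is needed because environment variables are read once, when the container starts. An alternative, assuming the application can re-read a file at runtime, is to mount the Secret as a volume: the kubelet refreshes mounted Secret files after the Secret changes (this does not apply to &lt;code&gt;subPath&lt;/code&gt; mounts). The volume name and mount path below are illustrative assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical alternative for apps/myapp/templates/deployment.yaml (pod spec fragment)
spec:
  containers:
    - name: myapp
      volumeMounts:
        - name: db-credentials            # assumed volume name
          mountPath: /etc/myapp/secrets   # assumed path; DB_PASSWORD appears as a file here
          readOnly: true
  volumes:
    - name: db-credentials
      secret:
        secretName: myapp-db-credentials  # the Secret created by ESO
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The file update is not instant, and the application still has to reopen the file or its database connections, so a rollout restart remains the simpler option for this service.&lt;/p&gt;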






&lt;h2&gt;
  
  
  Grafana Admin Secret — A Real-World Lesson
&lt;/h2&gt;

&lt;p&gt;During this build, we used &lt;code&gt;existingSecret: grafana-admin-secret&lt;/code&gt; in the kube-prometheus-stack values. The secret was created with a placeholder password &lt;code&gt;YourStrongPassword123!&lt;/code&gt;. Grafana initialized its SQLite database with this value when the pod first started.&lt;/p&gt;

&lt;p&gt;Later, the secret value was changed — but Grafana's database still had the old value. The env var &lt;code&gt;GF_SECURITY_ADMIN_PASSWORD&lt;/code&gt; only sets the initial password; it cannot change an existing one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; monitoring &amp;lt;grafana-pod&amp;gt; &lt;span class="nt"&gt;-c&lt;/span&gt; grafana &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  grafana-cli admin reset-admin-password &lt;span class="s1"&gt;'NewPassword!'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; For any tool that reads credentials once at database initialization, changing the Kubernetes Secret is not enough. You must either reset via the tool's CLI or destroy and recreate the persistent volume.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;By the end of Part 7 you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ All secrets stored exclusively in AWS Secrets Manager (never in Git or images)&lt;/li&gt;
&lt;li&gt;✅ ESO installed on all 6 clusters via ArgoCD ApplicationSet&lt;/li&gt;
&lt;li&gt;✅ IRSA roles per cluster with least-privilege Secrets Manager read access&lt;/li&gt;
&lt;li&gt;✅ SecretStore + ExternalSecret CRDs syncing secrets into Kubernetes&lt;/li&gt;
&lt;li&gt;✅ Helm chart values.yaml integration with enabled/disabled toggle per environment&lt;/li&gt;
&lt;li&gt;✅ Secret rotation workflow (update in AWS Secrets Manager → ESO refreshes the Kubernetes Secret → restart pods)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Screenshots
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;AWS Secrets Manager console showing secrets per environment&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84ffm8fa2yj6y41scd7g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84ffm8fa2yj6y41scd7g.png" alt="AWS Secrets Manager console listing dev/myapp/db-password and production/myapp/db-password, illustrating the naming convention." width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;kubectl get externalsecret showing SecretSynced: True&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr1967wvwiz4p7euu747.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr1967wvwiz4p7euu747.png" alt="Terminal output of kubectl get externalsecret showing STATUS SecretSynced and READY True." width="800" height="734"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ArgoCD showing ESO app as OutOfSync/Healthy (expected false positive)&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9my555e58i7uz3z02ob.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9my555e58i7uz3z02ob.png" alt="ArgoCD showing ESO app as OutOfSync/Healthy (expected false positive)" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Next: Part 8 — Security Stack: Kyverno, Falco, WAF, and GuardDuty&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Follow the series&lt;/strong&gt; — next part publishes next Wednesday.&lt;br&gt;
&lt;strong&gt;Live system:&lt;/strong&gt; &lt;a href="https://www.matthewoladipupo.dev/health" rel="noopener noreferrer"&gt;https://www.matthewoladipupo.dev/health&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Runbook:&lt;/strong&gt; &lt;a href="https://github.com/MatthewDipo/myapp-infra/blob/main/docs/runbook.md" rel="noopener noreferrer"&gt;Operations Guide&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Source code:&lt;/strong&gt; &lt;a href="https://github.com/MatthewDipo/myapp-infra" rel="noopener noreferrer"&gt;myapp-infra&lt;/a&gt; | &lt;a href="https://github.com/MatthewDipo/myapp-gitops" rel="noopener noreferrer"&gt;myapp-gitops&lt;/a&gt; | &lt;a href="https://github.com/MatthewDipo/myapp" rel="noopener noreferrer"&gt;myapp&lt;/a&gt;&lt;/p&gt;

</description>
      <category>githubactions</category>
      <category>devops</category>
      <category>docker</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Part 6: CI/CD Pipeline</title>
      <dc:creator>Matthew</dc:creator>
      <pubDate>Wed, 15 Apr 2026 14:00:00 +0000</pubDate>
      <link>https://dev.to/matthewdipo/part-6-cicd-pipeline-5fde</link>
      <guid>https://dev.to/matthewdipo/part-6-cicd-pipeline-5fde</guid>
      <description>&lt;h2&gt;
  
  
  Part 6: CI/CD Pipeline — GitHub Actions, Trivy, Cosign, and ECR
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Part of the series: Building a Production-Grade DevSecOps Pipeline on AWS&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The CI/CD pipeline is the gateway between a developer's &lt;code&gt;git push&lt;/code&gt; and a running container in production. Every security control that can be automated should live here — not as an afterthought but as a first-class gate that blocks bad artifacts from ever reaching a cluster.&lt;/p&gt;

&lt;p&gt;This pipeline enforces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No HIGH or CRITICAL CVEs&lt;/strong&gt; — Trivy blocks the build before the image is pushed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No static AWS credentials&lt;/strong&gt; — GitHub OIDC exchanges JWT tokens for temporary STS creds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cryptographic image provenance&lt;/strong&gt; — Cosign signs every image with an AWS KMS key; Kyverno verifies the signature before admitting the pod&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Immutable image tags&lt;/strong&gt; — &lt;code&gt;sha-&amp;lt;full-commit-sha&amp;gt;&lt;/code&gt;, never &lt;code&gt;:latest&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit trail&lt;/strong&gt; — every push is logged to S3 with digest, timestamp, and caller identity&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Application (myapp)
&lt;/h2&gt;

&lt;p&gt;A minimal Node.js/Express API that represents any real production service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;myapp/
├── src/
│   └── index.js        # Express app with /health and /metrics
├── Dockerfile
├── package.json
└── .github/
    └── workflows/
        └── ci.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/index.js&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;express&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;express&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;promClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;prom-client&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;express&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;register&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;promClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Registry&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;promClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collectDefaultMetrics&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;register&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;httpRequestsTotal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;promClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;myapp_http_requests_total&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;help&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Total HTTP requests&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;labelNames&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;method&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;route&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;status_code&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;registers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;register&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;finish&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;httpRequestsTotal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;inc&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
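      &lt;span class="c1"&gt;// Label by the matched route pattern, not the raw path, to keep label cardinality bounded&lt;/span&gt;
      &lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;route&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;unmatched&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;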
      &lt;span class="na"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/health&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;healthy&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;AWS_REGION&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;unknown&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/metrics&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;register&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;contentType&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;register&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8080&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Listening on :8080&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Dockerfile — Distroless Nonroot
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Stage 1: Build dependencies&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;node:18-alpine&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; package*.json ./&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;npm ci &lt;span class="nt"&gt;--omit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;dev

&lt;span class="c"&gt;# Stage 2: Runtime — distroless (no shell, no package manager)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; gcr.io/distroless/nodejs18-debian12:nonroot&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="c"&gt;# Copy only what's needed — no node_modules dev dependencies, no source maps&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /app/node_modules ./node_modules&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; src/ ./src/&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; package.json ./&lt;/span&gt;

&lt;span class="c"&gt;# nonroot image runs as uid 65532 by default — no root, ever&lt;/span&gt;
&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 8080&lt;/span&gt;

&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["src/index.js"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why distroless?&lt;/strong&gt;&lt;br&gt;
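The distroless base image contains only the Node.js runtime and the minimal system libraries it needs, and the &lt;code&gt;nonroot&lt;/code&gt; tag runs it as an unprivileged user. There is no:&lt;/p&gt;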

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/bin/sh&lt;/code&gt; or &lt;code&gt;/bin/bash&lt;/code&gt;: an attacker with RCE cannot spawn an interactive shell&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;apt&lt;/code&gt;, &lt;code&gt;apk&lt;/code&gt;, &lt;code&gt;yum&lt;/code&gt;: no package manager to install additional tools&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;curl&lt;/code&gt;, &lt;code&gt;wget&lt;/code&gt;: no ready-made tools for exfiltrating data or downloading payloads (code running inside Node.js can still open network connections, so egress controls remain necessary)&lt;/li&gt;
&lt;li&gt;Any other utility that would help lateral movement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Falco still alerts on any unexpected syscalls, but the attack surface is dramatically reduced.&lt;/p&gt;


&lt;h2&gt;
  
  
  Pipeline Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmlcmtteraw33w16cqrau.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmlcmtteraw33w16cqrau.png" alt="GitHub Actions CI/CD Pipeline — 7 stages from git push to pods running, &lt;br&gt;
including OIDC auth, Trivy scanning, Cosign signing, and ArgoCD GitOps &lt;br&gt;
deployment" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The complete pipeline: no static AWS credentials anywhere. OIDC authenticates &lt;br&gt;
GitHub Actions to AWS. Every image is signed with Cosign before Kyverno will &lt;br&gt;
admit it to the cluster.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────────────┐
│  git push → main                                                        │
│                    │                                                    │
│                    ▼                                                    │
│         ┌──────────────────┐                                            │
│         │  Job 1: test     │  npm ci + npm test                         │
│         └────────┬─────────┘                                            │
│                  │ needs: test                                          │
│                  ▼                                                      │
│         ┌──────────────────┐                                            │
│         │  Job 2: scan     │  trivy image → fail on HIGH/CRITICAL       │
│         └────────┬─────────┘                                            │
│                  │ needs: scan                                          │
│                  ▼                                                      │
│         ┌──────────────────────────────────────────────┐                │
│         │  Job 3: build-push-sign                      │                │
│         │  ├─ OIDC → assume IAM role (no static keys)  │                │
│         │  ├─ docker build (distroless)                │                │
│         │  ├─ push → ECR us-east-1                     │                │
│         │  ├─ push → ECR us-west-2                     │                │
│         │  ├─ cosign sign (AWS KMS)                    │                │
│         │  └─ S3 audit log                             │                │
│         └────────┬─────────────────────────────────────┘                │
│                  │ needs: build-push-sign                               │
│                  ▼                                                      │
│         ┌──────────────────────────────────────────────┐                │
│         │  Job 4: update-gitops                        │                │
│         │  └─ patch image.tag in values-*.yaml → push  │                │
│         └──────────────────────────────────────────────┘                │
└─────────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
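&lt;p&gt;The OIDC step in the diagram only works if the IAM role's trust policy matches the &lt;code&gt;sub&lt;/code&gt; claim in the token GitHub issues. As a rough sketch, the subject for a branch push has the form &lt;code&gt;repo:OWNER/REPO:ref:refs/heads/BRANCH&lt;/code&gt;. The helper name &lt;code&gt;expectedOidcSubject&lt;/code&gt; and the repository &lt;code&gt;example-org/myapp&lt;/code&gt; below are hypothetical, used only to show the string a condition on &lt;code&gt;token.actions.githubusercontent.com:sub&lt;/code&gt; would need to match.&lt;/p&gt;

```javascript
// Hypothetical helper: returns the OIDC subject GitHub Actions uses for a
// push to a branch, so it can be compared with the IAM trust policy condition.
function expectedOidcSubject(repository, branch) {
  if (!/^[^\/\s]+\/[^\/\s]+$/.test(repository)) {
    throw new Error('repository must look like owner/name');
  }
  if (typeof branch !== 'string' || branch.length === 0) {
    throw new Error('branch is required');
  }
  return 'repo:' + repository + ':ref:refs/heads/' + branch;
}

// Illustrative repository name, not the real one.
const subject = expectedOidcSubject('example-org/myapp', 'main');
console.log(subject);
```

&lt;p&gt;Restricting the trust policy to the &lt;code&gt;main&lt;/code&gt; branch subject means tokens issued for pull requests, which carry a different subject, cannot assume the push role.&lt;/p&gt;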






&lt;h2&gt;
  
  
  Full GitHub Actions Workflow
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/ci.yaml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CI&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ECR_REGISTRY_USE1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;206617159586.dkr.ecr.us-east-1.amazonaws.com&lt;/span&gt;
  &lt;span class="na"&gt;ECR_REGISTRY_USW2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;206617159586.dkr.ecr.us-west-2.amazonaws.com&lt;/span&gt;
  &lt;span class="na"&gt;IMAGE_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
  &lt;span class="na"&gt;AWS_REGION_USE1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
  &lt;span class="na"&gt;AWS_REGION_USW2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-west-2&lt;/span&gt;

&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;id-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;   &lt;span class="c1"&gt;# Required for OIDC&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;    &lt;span class="c1"&gt;# Checkout only; the gitops push authenticates with GITOPS_PAT&lt;/span&gt;
  &lt;span class="na"&gt;packages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Setup Node.js&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-node@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;node-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;18'&lt;/span&gt;
          &lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;npm'&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install dependencies&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm ci&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run tests&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm test&lt;/span&gt;

  &lt;span class="na"&gt;scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build image for scanning&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;docker build -t ${{ env.IMAGE_NAME }}:scan-${{ github.sha }} .&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Trivy vulnerability scan&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@master&lt;/span&gt;  &lt;span class="c1"&gt;# Consider pinning to a release tag or commit SHA&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;image-ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ env.IMAGE_NAME }}:scan-${{ github.sha }}&lt;/span&gt;
          &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;table&lt;/span&gt;
          &lt;span class="na"&gt;exit-code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;              &lt;span class="c1"&gt;# Fail the pipeline on findings&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HIGH,CRITICAL'&lt;/span&gt;   &lt;span class="c1"&gt;# Only fail on HIGH and CRITICAL&lt;/span&gt;
          &lt;span class="na"&gt;ignore-unfixed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;        &lt;span class="c1"&gt;# Skip CVEs with no available fix&lt;/span&gt;

  &lt;span class="na"&gt;build-push-sign&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scan&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github.ref == 'refs/heads/main'&lt;/span&gt;  &lt;span class="c1"&gt;# Only push on main branch, not PRs&lt;/span&gt;
    &lt;span class="na"&gt;outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;image-digest-use1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.push-use1.outputs.digest }}&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure AWS credentials via OIDC&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/configure-aws-credentials@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;role-to-assume&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ vars.AWS_ROLE_ARN_USE1 }}&lt;/span&gt;
          &lt;span class="na"&gt;aws-region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ env.AWS_REGION_USE1 }}&lt;/span&gt;
          &lt;span class="c1"&gt;# No static access keys — OIDC exchanges GitHub JWT for STS temp creds&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Login to ECR us-east-1&lt;/span&gt;
        &lt;span class="c1"&gt;# amazon-ecr-login has no region input; it uses the region configured above&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/amazon-ecr-login@v2&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Login to ECR us-west-2&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;aws ecr get-login-password --region ${{ env.AWS_REGION_USW2 }} \&lt;/span&gt;
            &lt;span class="s"&gt;| docker login --username AWS --password-stdin ${{ env.ECR_REGISTRY_USW2 }}&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Docker Buildx&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/setup-buildx-action@v3&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build and push to us-east-1&lt;/span&gt;
        &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;push-use1&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/build-push-action@v5&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
          &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;${{ env.ECR_REGISTRY_USE1 }}/${{ env.IMAGE_NAME }}:sha-${{ github.sha }}&lt;/span&gt;
          &lt;span class="na"&gt;cache-from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;type=gha&lt;/span&gt;
          &lt;span class="na"&gt;cache-to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;type=gha,mode=max&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Copy image to us-west-2&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;DIGEST="${{ steps.push-use1.outputs.digest }}"&lt;/span&gt;
          &lt;span class="s"&gt;docker buildx imagetools create \&lt;/span&gt;
            &lt;span class="s"&gt;--tag ${{ env.ECR_REGISTRY_USW2 }}/${{ env.IMAGE_NAME }}:sha-${{ github.sha }} \&lt;/span&gt;
            &lt;span class="s"&gt;${{ env.ECR_REGISTRY_USE1 }}/${{ env.IMAGE_NAME }}@${DIGEST}&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Cosign&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sigstore/cosign-installer@v3&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Sign image with AWS KMS&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;COSIGN_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ vars.COSIGN_KMS_KEY_ARN }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;DIGEST="${{ steps.push-use1.outputs.digest }}"&lt;/span&gt;
          &lt;span class="s"&gt;IMAGE="${{ env.ECR_REGISTRY_USE1 }}/${{ env.IMAGE_NAME }}@${DIGEST}"&lt;/span&gt;

          &lt;span class="s"&gt;# Sign the image: cosign stores the signature as a separate OCI artifact in the same ECR repository&lt;/span&gt;
          &lt;span class="s"&gt;cosign sign --key awskms:///${COSIGN_KEY} \&lt;/span&gt;
            &lt;span class="s"&gt;--yes \&lt;/span&gt;
            &lt;span class="s"&gt;${IMAGE}&lt;/span&gt;

          &lt;span class="s"&gt;echo "Signed: ${IMAGE}"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Write S3 audit log&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;DIGEST="${{ steps.push-use1.outputs.digest }}"&lt;/span&gt;
          &lt;span class="s"&gt;TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ)&lt;/span&gt;
          &lt;span class="s"&gt;CALLER=$(aws sts get-caller-identity --query Arn --output text)&lt;/span&gt;

          &lt;span class="s"&gt;echo "{&lt;/span&gt;
            &lt;span class="s"&gt;\"timestamp\": \"${TIMESTAMP}\",&lt;/span&gt;
            &lt;span class="s"&gt;\"repo\": \"${{ github.repository }}\",&lt;/span&gt;
            &lt;span class="s"&gt;\"sha\": \"${{ github.sha }}\",&lt;/span&gt;
            &lt;span class="s"&gt;\"digest\": \"${DIGEST}\",&lt;/span&gt;
            &lt;span class="s"&gt;\"pushed_by\": \"${CALLER}\",&lt;/span&gt;
            &lt;span class="s"&gt;\"workflow\": \"${{ github.workflow }}\",&lt;/span&gt;
            &lt;span class="s"&gt;\"run_id\": \"${{ github.run_id }}\"&lt;/span&gt;
          &lt;span class="s"&gt;}" | aws s3 cp - \&lt;/span&gt;
            &lt;span class="s"&gt;s3://${{ vars.AUDIT_BUCKET }}/ci-push-audit/${{ github.sha }}.json&lt;/span&gt;

  &lt;span class="na"&gt;update-gitops&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;build-push-sign&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github.ref == 'refs/heads/main'&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout myapp-gitops&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;repository&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MatthewDipo/myapp-gitops&lt;/span&gt;
          &lt;span class="na"&gt;token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.GITOPS_PAT }}&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp-gitops&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Update image tags&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;cd myapp-gitops&lt;/span&gt;
          &lt;span class="s"&gt;NEW_TAG="sha-${{ github.sha }}"&lt;/span&gt;

          &lt;span class="s"&gt;# Update all environment value files&lt;/span&gt;
          &lt;span class="s"&gt;for FILE in apps/myapp/values-dev.yaml \&lt;/span&gt;
                      &lt;span class="s"&gt;apps/myapp/values-staging.yaml \&lt;/span&gt;
                      &lt;span class="s"&gt;apps/myapp/values-production.yaml; do&lt;/span&gt;
            &lt;span class="s"&gt;sed -i "s|tag: sha-[a-f0-9]*|tag: ${NEW_TAG}|g" $FILE&lt;/span&gt;
            &lt;span class="s"&gt;echo "Updated $FILE → ${NEW_TAG}"&lt;/span&gt;
          &lt;span class="s"&gt;done&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Commit and push&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;cd myapp-gitops&lt;/span&gt;
          &lt;span class="s"&gt;git config user.email "ci@github.com"&lt;/span&gt;
          &lt;span class="s"&gt;git config user.name "GitHub Actions"&lt;/span&gt;
          &lt;span class="s"&gt;git add apps/myapp/values-*.yaml&lt;/span&gt;
          &lt;span class="s"&gt;git commit -m "ci: update image tag to sha-${{ github.sha }}"&lt;/span&gt;
          &lt;span class="s"&gt;git push origin main&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
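
&lt;p&gt;The &lt;code&gt;sed&lt;/code&gt; substitution in the "Update image tags" step can be checked offline before it runs in CI. Below is a minimal Python sketch of the same rewrite, assuming a values file that contains a line of the form &lt;code&gt;tag: sha-...&lt;/code&gt;. The sample content and the function name &lt;code&gt;update_image_tag&lt;/code&gt; are illustrative and not taken from the repository.&lt;/p&gt;

```python
import re

# Same pattern as the sed expression: "tag: sha-" followed by lowercase hex digits.
TAG_PATTERN = re.compile(r"tag: sha-[a-f0-9]*")


def update_image_tag(values_text: str, git_sha: str) -> str:
    """Return values_text with every 'tag: sha-...' entry pointing at git_sha."""
    return TAG_PATTERN.sub(f"tag: sha-{git_sha}", values_text)


# Hypothetical values file content; the real file layout may differ.
sample = "image:\n  repository: myapp\n  tag: sha-0123abc\nreplicaCount: 2\n"
updated = update_image_tag(sample, "deadbeef" * 5)
print(updated)
```

&lt;p&gt;A file with no matching &lt;code&gt;tag: sha-&lt;/code&gt; line passes through unchanged, so a values file that uses a different tag format would be skipped silently by the &lt;code&gt;sed&lt;/code&gt; step as well.&lt;/p&gt;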






&lt;h2&gt;
  
  
  GitHub Repository Variables (not secrets)
&lt;/h2&gt;

&lt;p&gt;These are set in the GitHub UI under &lt;strong&gt;Settings → Secrets and variables → Actions → Variables&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variable&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Why not a secret?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AWS_ROLE_ARN_USE1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;arn:aws:iam::206617159586:role/myapp-dev-use1-github-ci&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Not sensitive: a role ARN is useless without a GitHub OIDC token that satisfies the role's trust policy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;COSIGN_KMS_KEY_ARN&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;arn:aws:kms:us-east-1:206617159586:key/...&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Not sensitive: a KMS key ARN alone grants nothing without IAM permission to use the key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AUDIT_BUCKET&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;myapp-ci-audit-206617159586&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Not sensitive&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Only &lt;code&gt;GITOPS_PAT&lt;/code&gt; (GitHub Personal Access Token to push to myapp-gitops) is a &lt;strong&gt;secret&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;No AWS_ACCESS_KEY_ID. No AWS_SECRET_ACCESS_KEY.&lt;/strong&gt; These should never appear in a modern CI pipeline.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Cosign Image Signing
&lt;/h2&gt;

&lt;p&gt;Cosign attaches a cryptographic signature to the image in the same ECR repository. The signature is stored as a separate OCI artifact whose tag is derived from the image digest, not from the git SHA tag.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ECR Repository: myapp

sha-&amp;lt;git-sha&amp;gt;                ← Your application image
sha256-&amp;lt;image-digest&amp;gt;.sig    ← Cosign signature (OCI artifact)
sha256-&amp;lt;image-digest&amp;gt;.att    ← Cosign attestation (SBOM, etc.)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
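
&lt;p&gt;In Cosign's tag-based layout, the signature and attestation tags come from the image digest: the &lt;code&gt;:&lt;/code&gt; becomes a &lt;code&gt;-&lt;/code&gt; and a suffix is appended. The Python sketch below shows that mapping, which helps when looking for a signature in the ECR console. The helper name &lt;code&gt;cosign_tag&lt;/code&gt; and the digest value are placeholders for illustration.&lt;/p&gt;

```python
def cosign_tag(image_digest: str, suffix: str = "sig") -> str:
    """Map an image digest such as 'sha256:ab12...' to its Cosign artifact tag."""
    algorithm, sep, hex_part = image_digest.partition(":")
    if sep != ":" or algorithm != "sha256" or len(hex_part) != 64:
        raise ValueError(f"not a sha256 digest: {image_digest!r}")
    return f"{algorithm}-{hex_part}.{suffix}"


digest = "sha256:" + "ab" * 32  # placeholder digest, not a real image
print(cosign_tag(digest))         # signature tag
print(cosign_tag(digest, "att"))  # attestation tag
```

&lt;p&gt;The input is the image digest reported by the registry, not the &lt;code&gt;sha-&lt;/code&gt; git tag, which is why the two identifiers in the listing never match.&lt;/p&gt;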



&lt;p&gt;&lt;strong&gt;Verification (what Kyverno does at admission time):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cosign verify &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--key&lt;/span&gt; awskms:///arn:aws:kms:us-east-1:206617159586:key/YOUR_KEY_ID &lt;span class="se"&gt;\&lt;/span&gt;
  206617159586.dkr.ecr.us-east-1.amazonaws.com/myapp:sha-abc123

&lt;span class="c"&gt;# Output if valid:&lt;/span&gt;
&lt;span class="c"&gt;# Verification for 206617159586.dkr.ecr.us-east-1.amazonaws.com/myapp:sha-abc123&lt;/span&gt;
&lt;span class="c"&gt;# The following checks were performed on each of these signatures:&lt;/span&gt;
&lt;span class="c"&gt;#   - The cosign claims were validated&lt;/span&gt;
&lt;span class="c"&gt;#   - The signatures were verified against the specified public key&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Image Naming Convention
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;206617159586.dkr.ecr.us-east-1.amazonaws.com/myapp:sha-&amp;lt;full-40-char-git-sha&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;sha-&amp;lt;full-sha&amp;gt;&lt;/code&gt; and not semantic versioning?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Traceability:&lt;/strong&gt; Given any running pod, you can run &lt;code&gt;git show &amp;lt;sha&amp;gt;&lt;/code&gt; to see the exact commit&lt;/li&gt;
&lt;li&gt;
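&lt;strong&gt;Immutability:&lt;/strong&gt; The tag is pushed once and, with ECR &lt;code&gt;IMMUTABLE&lt;/code&gt; tags, can never be repointed; rebuilding the same SHA should yield an equivalent image because the base image and dependencies are pinned&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No tag collisions:&lt;/strong&gt; A tag such as &lt;code&gt;v1.0.0&lt;/code&gt; can be moved to a different commit; a git SHA always identifies exactly one commit&lt;/li&gt;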
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;:latest&lt;/code&gt;:&lt;/strong&gt; &lt;code&gt;:latest&lt;/code&gt; is the devil — it means different things at different times and breaks reproducibility&lt;/li&gt;
&lt;/ul&gt;
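
&lt;p&gt;The convention is simple enough to enforce in code. Here is a small Python sketch that builds an image reference from a full commit SHA and recovers the SHA again for &lt;code&gt;git show&lt;/code&gt;. The function names are hypothetical; the registry host is the one from the naming convention in this section.&lt;/p&gt;

```python
import re

REGISTRY = "206617159586.dkr.ecr.us-east-1.amazonaws.com"
FULL_SHA = re.compile(r"[0-9a-f]{40}")
TAG_PREFIX = "sha-"


def image_ref(git_sha: str, repository: str = "myapp") -> str:
    """Build the image reference for a full 40-character commit SHA."""
    if FULL_SHA.fullmatch(git_sha) is None:
        raise ValueError("expected a full 40-character lowercase git SHA")
    return f"{REGISTRY}/{repository}:{TAG_PREFIX}{git_sha}"


def commit_from_ref(ref: str) -> str:
    """Recover the commit SHA from an image reference, e.g. to pass to git show."""
    tag = ref.rsplit(":", 1)[1]
    if not tag.startswith(TAG_PREFIX):
        raise ValueError(f"tag does not follow the sha- convention: {tag!r}")
    return tag[len(TAG_PREFIX):]


ref = image_ref("0123456789abcdef0123456789abcdef01234567")
print(ref)
print(commit_from_ref(ref))
```

&lt;p&gt;Rejecting abbreviated SHAs up front keeps every tag unambiguous, at the cost of longer image names.&lt;/p&gt;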




&lt;h2&gt;
  
  
  Pre-commit Hooks
&lt;/h2&gt;

&lt;p&gt;Before code ever reaches GitHub, pre-commit hooks catch common issues locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .pre-commit-config.yaml&lt;/span&gt;
&lt;span class="na"&gt;repos&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/pre-commit/pre-commit-hooks&lt;/span&gt;
    &lt;span class="na"&gt;rev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v4.5.0&lt;/span&gt;
    &lt;span class="na"&gt;hooks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trailing-whitespace&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;end-of-file-fixer&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;check-yaml&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;check-merge-conflict&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/hadolint/hadolint&lt;/span&gt;
    &lt;span class="na"&gt;rev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v2.12.0&lt;/span&gt;
    &lt;span class="na"&gt;hooks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hadolint-docker&lt;/span&gt;
        &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--ignore'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DL3006'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# DL3006: always tag FROM image&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/Yelp/detect-secrets&lt;/span&gt;
    &lt;span class="na"&gt;rev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1.5.0&lt;/span&gt;
    &lt;span class="na"&gt;hooks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;detect-secrets&lt;/span&gt;
        &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--baseline'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.secrets.baseline'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install pre-commit&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;pre-commit
pre-commit &lt;span class="nb"&gt;install&lt;/span&gt;

&lt;span class="c"&gt;# Run manually against all files&lt;/span&gt;
pre-commit run &lt;span class="nt"&gt;--all-files&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Pipeline Security Properties
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;How Achieved&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No secrets in Git&lt;/td&gt;
&lt;td&gt;detect-secrets pre-commit hook&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No static AWS credentials in CI&lt;/td&gt;
&lt;td&gt;OIDC replaces static AWS keys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No HIGH/CRITICAL CVEs&lt;/td&gt;
&lt;td&gt;Trivy scan blocks pipeline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No unverified images&lt;/td&gt;
&lt;td&gt;Kyverno admission webhook&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Immutable artifacts&lt;/td&gt;
&lt;td&gt;ECR &lt;code&gt;IMMUTABLE&lt;/code&gt; tag mutability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cryptographic provenance&lt;/td&gt;
&lt;td&gt;Cosign + AWS KMS signing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full audit trail&lt;/td&gt;
&lt;td&gt;S3 + CloudTrail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reproducible builds&lt;/td&gt;
&lt;td&gt;Pinned base image SHA, &lt;code&gt;npm ci&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;By the end of Part 6 you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ A secure Dockerfile using distroless/nonroot with multi-stage build&lt;/li&gt;
&lt;li&gt;✅ GitHub Actions pipeline with 4 jobs: test → scan → build/push/sign → gitops update&lt;/li&gt;
&lt;li&gt;✅ OIDC-based AWS authentication (zero static AWS credentials)&lt;/li&gt;
&lt;li&gt;✅ Trivy CVE scanning blocking HIGH/CRITICAL vulnerabilities&lt;/li&gt;
&lt;li&gt;✅ Cosign image signing with AWS KMS&lt;/li&gt;
&lt;li&gt;✅ Automatic image tag update in myapp-gitops triggering ArgoCD sync&lt;/li&gt;
&lt;li&gt;✅ S3 audit log for every image push&lt;/li&gt;
&lt;li&gt;✅ Pre-commit hooks catching issues before they reach CI&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Screenshots
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;GitHub Actions: workflow run showing all 4 jobs passing (green checkmarks)&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsp8eecy972exm49fze9h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsp8eecy972exm49fze9h.png" alt="Completed GitHub Actions workflow run showing all 4 jobs with green checkmarks" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ECR repository showing images with sha- tags and scan results (no HIGH/CRITICAL)&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9koskql8iqphgf5aldyt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9koskql8iqphgf5aldyt.png" alt="ECR repository showing images with sha- tags and scan results (no HIGH/CRITICAL)" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Actions: Job 3 logs showing "Signed: ..." cosign output&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy44tu0l190e6wvmh5myv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy44tu0l190e6wvmh5myv.png" alt="GitHub Actions Job 3 logs showing the cosign Signed output" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Next: Part 7 — Secrets Management: AWS Secrets Manager + External Secrets Operator + IRSA&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Follow the series&lt;/strong&gt; — next part publishes next Wednesday.&lt;br&gt;
&lt;strong&gt;Live system:&lt;/strong&gt; &lt;a href="https://www.matthewoladipupo.dev/health" rel="noopener noreferrer"&gt;https://www.matthewoladipupo.dev/health&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Runbook:&lt;/strong&gt; &lt;a href="https://github.com/MatthewDipo/myapp-infra/blob/main/docs/runbook.md" rel="noopener noreferrer"&gt;Operations Guide&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Source code:&lt;/strong&gt; &lt;a href="https://github.com/MatthewDipo/myapp-infra" rel="noopener noreferrer"&gt;myapp-infra&lt;/a&gt; | &lt;a href="https://github.com/MatthewDipo/myapp-gitops" rel="noopener noreferrer"&gt;myapp-gitops&lt;/a&gt; | &lt;a href="https://github.com/MatthewDipo/myapp" rel="noopener noreferrer"&gt;myapp&lt;/a&gt;&lt;/p&gt;

</description>
      <category>githubactions</category>
      <category>devops</category>
      <category>docker</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Part 5: GitOps with ArgoCD</title>
      <dc:creator>Matthew</dc:creator>
      <pubDate>Thu, 09 Apr 2026 23:50:20 +0000</pubDate>
      <link>https://dev.to/matthewdipo/part-5-gitops-with-argocd-4p36</link>
      <guid>https://dev.to/matthewdipo/part-5-gitops-with-argocd-4p36</guid>
      <description>&lt;p&gt;Part 5: GitOps with ArgoCD (Hub-Spoke Model)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Part of the series: Building a Production-Grade DevSecOps Pipeline on AWS&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;GitOps flips the traditional CI/CD model. Instead of a pipeline &lt;em&gt;pushing&lt;/em&gt; manifests into a cluster, the cluster &lt;em&gt;pulls&lt;/em&gt; its desired state from Git. The result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Audit trail built-in:&lt;/strong&gt; every cluster change is a Git commit with author, timestamp, and diff&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-healing:&lt;/strong&gt; ArgoCD continuously reconciles. With automated sync and &lt;code&gt;selfHeal&lt;/code&gt; enabled, if someone &lt;code&gt;kubectl apply&lt;/code&gt;s something manually, ArgoCD reverts it within minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rollback is &lt;code&gt;git revert&lt;/code&gt;:&lt;/strong&gt; no special tooling, no cluster access needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drift detection:&lt;/strong&gt; ArgoCD shows you exactly when a cluster diverges from what's in Git&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This pipeline uses ArgoCD in a &lt;strong&gt;hub-spoke&lt;/strong&gt; topology: one ArgoCD installation on &lt;code&gt;myapp-production-use1&lt;/code&gt; manages all six clusters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────┐
│          ArgoCD HUB: myapp-production-use1                      │
│                                                                 │
│  Watches: github.com/MatthewDipo/myapp-gitops (main branch)     │
│  Manages: 6 clusters via registered cluster credentials         │
│                                                                 │
│  ApplicationSets generate Applications per cluster:             │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  environments/production/applicationset.yaml             │   │
│  │  → prometheus-myapp-production-use1                      │   │
│  │  → prometheus-myapp-production-usw2                      │   │
│  │  → prometheus-myapp-staging-use1    (staging project)    │   │
│  │  → prometheus-myapp-staging-usw2                         │   │
│  └──────────────────────────────────────────────────────────┘   │
└────────────────────────────┬────────────────────────────────────┘
         VPC Peering (private │ endpoints)
    ┌────────────────────────┤
    │    ┌───────────────────┤
    │    │    ┌──────────────┤──────────────┐
    ▼    ▼    ▼              ▼              ▼
prod-usw2  staging-use1  staging-usw2  dev-use1  dev-usw2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
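
&lt;p&gt;To make the naming in the diagram concrete, here is a rough Python sketch of the fan-out: one component expands into one Application per registered cluster, named &lt;code&gt;component-cluster&lt;/code&gt;. This only illustrates the naming pattern. The real expansion happens inside the ApplicationSet controller through its generators and templates, and the &lt;code&gt;expand&lt;/code&gt; helper is hypothetical.&lt;/p&gt;

```python
# The six clusters managed by the hub, as named in this series.
CLUSTERS = [
    "myapp-production-use1",
    "myapp-production-usw2",
    "myapp-staging-use1",
    "myapp-staging-usw2",
    "myapp-dev-use1",
    "myapp-dev-usw2",
]


def expand(component: str, clusters: list) -> list:
    """Return one Application name per cluster, e.g. 'prometheus-myapp-dev-use1'."""
    return [f"{component}-{cluster}" for cluster in clusters]


for name in expand("prometheus", CLUSTERS):
    print(name)
```

&lt;p&gt;Adding a seventh cluster to the list yields a seventh Application with no per-cluster manifest to write, which is the main appeal of generating Applications instead of declaring them one by one.&lt;/p&gt;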



&lt;p&gt;&lt;strong&gt;Why hub-spoke over running ArgoCD in every cluster?&lt;/strong&gt;&lt;br&gt;
One ArgoCD = one UI, one set of secrets, one audit log. Running six ArgoCD instances means six independent control planes to maintain, upgrade, and monitor, so the operational overhead grows linearly with the number of clusters.&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 1: Install ArgoCD on the Hub
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable public endpoint on prod-use1 for bootstrapping&lt;/span&gt;
aws eks update-cluster-config &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; myapp-production-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-prod-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resources-vpc-config&lt;/span&gt; &lt;span class="nv"&gt;endpointPublicAccess&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;,endpointPrivateAccess&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;,publicAccessCidrs&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;

&lt;span class="nb"&gt;sleep &lt;/span&gt;180  &lt;span class="c"&gt;# Wait for propagation&lt;/span&gt;

&lt;span class="c"&gt;# Install ArgoCD&lt;/span&gt;
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 create namespace argocd

helm repo add argo https://argoproj.github.io/argo-helm
helm &lt;span class="nb"&gt;install &lt;/span&gt;argocd argo/argo-cd &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; argocd &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--version&lt;/span&gt; 6.7.3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; server.service.type&lt;span class="o"&gt;=&lt;/span&gt;LoadBalancer &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; configs.params.&lt;span class="s2"&gt;"server&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;insecure"&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-hooks&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--wait&lt;/span&gt; &lt;span class="nt"&gt;--timeout&lt;/span&gt; 5m

&lt;span class="c"&gt;# Get the LoadBalancer URL&lt;/span&gt;
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 get svc &lt;span class="nt"&gt;-n&lt;/span&gt; argocd argocd-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.status.loadBalancer.ingress[0].hostname}'&lt;/span&gt;

&lt;span class="c"&gt;# Get initial admin password&lt;/span&gt;
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 get secret &lt;span class="nt"&gt;-n&lt;/span&gt; argocd argocd-initial-admin-secret &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.data.password}'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt;

&lt;span class="c"&gt;# Login via CLI&lt;/span&gt;
argocd login &amp;lt;LB_HOSTNAME&amp;gt; &lt;span class="nt"&gt;--username&lt;/span&gt; admin &lt;span class="nt"&gt;--password&lt;/span&gt; &amp;lt;PASSWORD&amp;gt; &lt;span class="nt"&gt;--insecure&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Create AppProjects
&lt;/h2&gt;

&lt;p&gt;AppProjects define what each group of Applications is allowed to do — which source repos, which destination clusters, and which namespaces. They are the RBAC boundary in ArgoCD.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# argocd/project-production.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AppProject&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Production workloads and infrastructure&lt;/span&gt;

  &lt;span class="na"&gt;sourceRepos&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;https://github.com/MatthewDipo/myapp-gitops.git&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;https://prometheus-community.github.io/helm-charts&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;https://charts.external-secrets.io&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;https://aws.github.io/eks-charts&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;https://kyverno.github.io/kyverno&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;https://falcosecurity.github.io/charts&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;https://vmware-tanzu.github.io/helm-charts&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;https://charts.jetstack.io&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;https://grafana.github.io/helm-charts&lt;/span&gt;

  &lt;span class="na"&gt;destinations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;       &lt;span class="c1"&gt;# IMPORTANT: use server: "*" not name: "*"&lt;/span&gt;
      &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;    &lt;span class="c1"&gt;# ArgoCD resolves cluster name → server URL for permission checks&lt;/span&gt;

  &lt;span class="na"&gt;clusterResourceWhitelist&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
      &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;

  &lt;span class="na"&gt;namespaceResourceWhitelist&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
      &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Critical lesson:&lt;/strong&gt; Use &lt;code&gt;server: "*"&lt;/code&gt; not &lt;code&gt;name: "*"&lt;/code&gt; in AppProject destinations. ArgoCD resolves cluster names to server URLs when enforcing project permissions. With &lt;code&gt;name: "*"&lt;/code&gt; alone, syncs to named clusters fail with permission errors. With &lt;code&gt;server: "*"&lt;/code&gt;, it works correctly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Apply all three projects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 apply &lt;span class="nt"&gt;-f&lt;/span&gt; argocd/project-dev.yaml
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 apply &lt;span class="nt"&gt;-f&lt;/span&gt; argocd/project-staging.yaml
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 apply &lt;span class="nt"&gt;-f&lt;/span&gt; argocd/project-production.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 3: Add the Private GitOps Repo
&lt;/h2&gt;

&lt;p&gt;ArgoCD needs credentials to pull from a private repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;argocd repo add https://github.com/MatthewDipo/myapp-gitops.git &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--username&lt;/span&gt; MatthewDipo &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--password&lt;/span&gt; &amp;lt;GITHUB_PAT&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--insecure-skip-server-verification&lt;/span&gt;

&lt;span class="c"&gt;# Verify&lt;/span&gt;
argocd repo list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 4: Register Spoke Clusters
&lt;/h2&gt;

&lt;p&gt;Each spoke cluster is registered by running &lt;code&gt;argocd cluster add&lt;/code&gt; with the cluster's kubeconfig context. This creates a ServiceAccount and ClusterRoleBinding in the spoke cluster that ArgoCD uses to manage resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Dev clusters (public endpoints — accessible directly)&lt;/span&gt;
argocd cluster add dev-use1 &lt;span class="nt"&gt;--name&lt;/span&gt; myapp-dev-use1 &lt;span class="nt"&gt;--yes&lt;/span&gt;
argocd cluster add dev-usw2 &lt;span class="nt"&gt;--name&lt;/span&gt; myapp-dev-usw2 &lt;span class="nt"&gt;--yes&lt;/span&gt;

&lt;span class="c"&gt;# Private clusters — must temporarily enable public endpoint first&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;CLUSTER &lt;span class="k"&gt;in &lt;/span&gt;myapp-staging-use1 myapp-staging-usw2 myapp-production-usw2&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$CLUSTER&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s2"&gt;"use1"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"us-west-2"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
  &lt;span class="nv"&gt;ENV&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$CLUSTER&lt;/span&gt; | &lt;span class="nb"&gt;cut&lt;/span&gt; &lt;span class="nt"&gt;-d-&lt;/span&gt; &lt;span class="nt"&gt;-f2&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
  &lt;span class="nv"&gt;PROFILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"myapp-&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ENV&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="p"&gt;//us-east-1/use1&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="p"&gt;//us-west-2/usw2&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Enabling public endpoint on &lt;/span&gt;&lt;span class="nv"&gt;$CLUSTER&lt;/span&gt;&lt;span class="s2"&gt;..."&lt;/span&gt;
  aws eks update-cluster-config &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="nv"&gt;$CLUSTER&lt;/span&gt; &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ENV&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;-&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s/us-east-1/use1/;s/us-west-2/usw2/'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--resources-vpc-config&lt;/span&gt; &lt;span class="nv"&gt;endpointPublicAccess&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;,endpointPrivateAccess&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;,publicAccessCidrs&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;
  &lt;span class="nb"&gt;sleep &lt;/span&gt;180

  &lt;span class="nv"&gt;CTX&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$CLUSTER&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s/myapp-//;s/-use1/-use1/;s/-usw2/-usw2/'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
  argocd cluster add &lt;span class="nv"&gt;$CTX&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="nv"&gt;$CLUSTER&lt;/span&gt; &lt;span class="nt"&gt;--yes&lt;/span&gt;

  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Locking &lt;/span&gt;&lt;span class="nv"&gt;$CLUSTER&lt;/span&gt;&lt;span class="s2"&gt; back to private..."&lt;/span&gt;
  aws eks update-cluster-config &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="nv"&gt;$CLUSTER&lt;/span&gt; &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ENV&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;-&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s/us-east-1/use1/;s/us-west-2/usw2/'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--resources-vpc-config&lt;/span&gt; &lt;span class="nv"&gt;endpointPublicAccess&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;,endpointPrivateAccess&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true
&lt;/span&gt;&lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# Verify all clusters registered&lt;/span&gt;
argocd cluster list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;SERVER                          NAME                      VERSION  STATUS
https://kubernetes.default.svc  in-cluster                1.29     Successful
&lt;/span&gt;&lt;span class="gp"&gt;&amp;lt;spoke-endpoint&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;                &lt;/span&gt;myapp-dev-use1            1.29     Successful
&lt;span class="gp"&gt;&amp;lt;spoke-endpoint&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;                &lt;/span&gt;myapp-dev-usw2            1.29     Successful
&lt;span class="gp"&gt;&amp;lt;spoke-endpoint&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;                &lt;/span&gt;myapp-staging-use1        1.29     Successful
&lt;span class="gp"&gt;&amp;lt;spoke-endpoint&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;                &lt;/span&gt;myapp-staging-usw2        1.29     Successful
&lt;span class="gp"&gt;&amp;lt;spoke-endpoint&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;                &lt;/span&gt;myapp-production-usw2     1.29     Successful
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
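&lt;p&gt;Comparing this table by eye gets tedious with six clusters. The following is a minimal sketch of an automated check, not part of the original pipeline: the function name &lt;code&gt;unhealthy_clusters&lt;/code&gt; and the sample endpoint are hypothetical, and it assumes the four-column layout shown above (SERVER, NAME, VERSION, STATUS).&lt;/p&gt;

```shell
# Reads `argocd cluster list` output on stdin and prints the NAME of every
# cluster whose STATUS column is not "Successful".
# Assumption: columns are SERVER NAME VERSION STATUS, as in the table above.
unhealthy_clusters() {
  awk 'NR == 1 { next } $4 != "Successful" { print $2 }'
}

# Sample input for illustration; in practice pipe the real command:
#   argocd cluster list | unhealthy_clusters
printf '%s\n' \
  "SERVER NAME VERSION STATUS" \
  "https://kubernetes.default.svc in-cluster 1.29 Successful" \
  "https://spoke.example myapp-dev-use1 1.29 Failed" \
  | unhealthy_clusters
```

&lt;p&gt;An empty result means every registered cluster reported &lt;code&gt;Successful&lt;/code&gt;. Any name that is printed is a candidate for the re-registration step in the Troubleshooting section.&lt;/p&gt;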






&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdqx02fxnc4ybl2t3xltd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdqx02fxnc4ybl2t3xltd.png" alt="ArgoCD Hub-Spoke GitOps Architecture — production-use1 as hub managing 5 spoke clusters via VPC Peering, with dev clusters on public endpoints" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Solid lines = VPC Peering (private). Dashed lines = public endpoints (dev only). &lt;br&gt;
The hub cluster runs ArgoCD and manages 47 Applications across all 6 clusters.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 5: ApplicationSet — The Core of Hub-Spoke GitOps
&lt;/h2&gt;

&lt;p&gt;An ApplicationSet is a custom resource that the ApplicationSet controller expands into ArgoCD Applications by combining a template with one or more generators. The &lt;strong&gt;list generator&lt;/strong&gt; used here creates one Application per element in the list, which in this setup means one per cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# environments/production/applicationset.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ApplicationSet&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp-production&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;generators&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;elements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    &lt;span class="s"&gt;myapp-production-use1&lt;/span&gt;
            &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;     &lt;span class="s"&gt;us-east-1&lt;/span&gt;
            &lt;span class="na"&gt;certArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:acm:us-east-1:591120834781:certificate/9ab022c9-..."&lt;/span&gt;
            &lt;span class="na"&gt;wafAclArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:wafv2:us-east-1:591120834781:regional/webacl/..."&lt;/span&gt;
            &lt;span class="na"&gt;irsaRoleArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::591120834781:role/myapp-production-use1-eso"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    &lt;span class="s"&gt;myapp-production-usw2&lt;/span&gt;
            &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;     &lt;span class="s"&gt;us-west-2&lt;/span&gt;
            &lt;span class="na"&gt;certArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:acm:us-west-2:591120834781:certificate/171cac9d-..."&lt;/span&gt;
            &lt;span class="na"&gt;wafAclArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:wafv2:us-west-2:591120834781:regional/webacl/..."&lt;/span&gt;
            &lt;span class="na"&gt;irsaRoleArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::591120834781:role/myapp-production-usw2-eso"&lt;/span&gt;

  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myapp-production-{{cluster}}"&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;component&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;application&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
        &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{region}}"&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
      &lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Source 1: the Helm chart (from gitops repo)&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;https://github.com/MatthewDipo/myapp-gitops.git&lt;/span&gt;
          &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
          &lt;span class="na"&gt;ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gitopsValues&lt;/span&gt;   &lt;span class="c1"&gt;# Named reference — used in source 2&lt;/span&gt;

        &lt;span class="c1"&gt;# Source 2: Helm chart with values from the gitops repo&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;https://github.com/MatthewDipo/myapp-gitops.git&lt;/span&gt;
          &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;           &lt;span class="s"&gt;apps/myapp&lt;/span&gt;
          &lt;span class="na"&gt;helm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;valueFiles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;$gitopsValues/apps/myapp/values.yaml&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;$gitopsValues/apps/myapp/values-production.yaml&lt;/span&gt;
            &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image.tag"&lt;/span&gt;
                &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sha-45d92fc5ffd4555caf35b996ed1eec4e45152dce"&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;env.AWS_REGION"&lt;/span&gt;
                &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{region}}"&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingress.certArn"&lt;/span&gt;
                &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{certArn}}"&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingress.wafAclArn"&lt;/span&gt;
                &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{wafAclArn}}"&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;externalSecrets.irsaRoleArn"&lt;/span&gt;
                &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{irsaRoleArn}}"&lt;/span&gt;

      &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{cluster}}"&lt;/span&gt;
        &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;

      &lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;automated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;prune&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    &lt;span class="kc"&gt;true&lt;/span&gt;    &lt;span class="c1"&gt;# Remove resources deleted from Git&lt;/span&gt;
          &lt;span class="na"&gt;selfHeal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;    &lt;span class="c1"&gt;# Revert manual kubectl changes&lt;/span&gt;
        &lt;span class="na"&gt;syncOptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;CreateNamespace=true&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ServerSideApply=true&lt;/span&gt;
        &lt;span class="na"&gt;retry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
          &lt;span class="na"&gt;backoff&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    &lt;span class="s"&gt;30s&lt;/span&gt;
            &lt;span class="na"&gt;maxDuration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10m&lt;/span&gt;
            &lt;span class="na"&gt;factor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;      &lt;span class="m"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
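&lt;p&gt;The template name &lt;code&gt;myapp-production-{{cluster}}&lt;/code&gt; is filled in once per list element, which is why the Application names used in the Troubleshooting section repeat the prefix. The sketch below imitates that substitution with &lt;code&gt;sed&lt;/code&gt;. It is a simplified stand-in for the controller's templating, and &lt;code&gt;render_name&lt;/code&gt; is a hypothetical helper that handles only the &lt;code&gt;{{cluster}}&lt;/code&gt; placeholder.&lt;/p&gt;

```shell
# Simplified stand-in for ApplicationSet templating: replaces only the
# {{cluster}} placeholder in a name template with the given cluster name.
render_name() {
  echo "$1" | sed "s/{{cluster}}/$2/g"
}

# One rendered Application name per element of the list generator.
render_name 'myapp-production-{{cluster}}' myapp-production-use1
render_name 'myapp-production-{{cluster}}' myapp-production-usw2
```

&lt;p&gt;Because each &lt;code&gt;cluster&lt;/code&gt; value already starts with &lt;code&gt;myapp-production-&lt;/code&gt;, the rendered names come out as &lt;code&gt;myapp-production-myapp-production-use1&lt;/code&gt; and &lt;code&gt;myapp-production-myapp-production-usw2&lt;/code&gt;.&lt;/p&gt;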



&lt;h3&gt;
  
  
  Key ApplicationSet concepts:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;automated.prune: true&lt;/code&gt;&lt;/strong&gt; — If you delete a file from Git, ArgoCD deletes the corresponding Kubernetes resource. Without this, deleted manifests leave orphaned resources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;automated.selfHeal: true&lt;/code&gt;&lt;/strong&gt; — If someone runs &lt;code&gt;kubectl edit&lt;/code&gt; and changes something, ArgoCD detects the drift and reverts it within ~3 minutes. This enforces Git as the only source of truth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;ServerSideApply=true&lt;/code&gt;&lt;/strong&gt; — Needed for large CRDs, whose manifests exceed the annotation size limit that client-side apply relies on, and for resources that multiple controllers manage (e.g., the Prometheus operator patches its own webhook config). Without it, oversized annotations or field ownership conflicts cause sync failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;retry&lt;/code&gt;&lt;/strong&gt; — Network blips or transient API errors shouldn't fail a deployment permanently. The retry policy handles intermittent failures automatically.&lt;/p&gt;
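&lt;p&gt;To make the retry settings concrete, the sketch below prints the wait before each retry for the values used above (&lt;code&gt;duration: 30s&lt;/code&gt;, &lt;code&gt;factor: 2&lt;/code&gt;, &lt;code&gt;maxDuration: 10m&lt;/code&gt;, &lt;code&gt;limit: 3&lt;/code&gt;). It assumes the usual exponential backoff rule, where each wait is the previous one multiplied by the factor and capped at the maximum. &lt;code&gt;backoff_schedule&lt;/code&gt; is a hypothetical helper, not an ArgoCD command.&lt;/p&gt;

```shell
# Prints the wait in seconds before each retry, one per line.
# Args: initial duration (s), factor, max duration (s), retry limit.
# Assumption: wait = duration * factor^n, capped at the max duration.
backoff_schedule() {
  delay=$1
  attempt=1
  while [ "$attempt" -le "$4" ]; do
    if [ "$delay" -gt "$3" ]; then
      delay=$3
    fi
    echo "$delay"
    delay=$((delay * $2))
    attempt=$((attempt + 1))
  done
}

# duration: 30s, factor: 2, maxDuration: 10m (600s), limit: 3
backoff_schedule 30 2 600 3
```

&lt;p&gt;With these settings the waits are 30, 60, and 120 seconds, so a sync that keeps failing is abandoned after roughly three and a half minutes of waiting and the 10-minute cap is never reached.&lt;/p&gt;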




&lt;h2&gt;
  
  
  Step 6: Deploy the ApplicationSets
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Apply all ApplicationSets from the hub cluster&lt;/span&gt;
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 apply &lt;span class="nt"&gt;-f&lt;/span&gt; environments/dev/applicationset.yaml
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 apply &lt;span class="nt"&gt;-f&lt;/span&gt; environments/staging/applicationset.yaml
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 apply &lt;span class="nt"&gt;-f&lt;/span&gt; environments/production/applicationset.yaml

&lt;span class="c"&gt;# Infrastructure ApplicationSets&lt;/span&gt;
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 apply &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/eso/applicationset.yaml
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 apply &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/kyverno/applicationset.yaml
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 apply &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/falco/applicationset.yaml
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 apply &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/monitoring/applicationset.yaml
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 apply &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/logging/applicationset.yaml
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 apply &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/velero/applicationset.yaml
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 apply &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/karpenter/applicationset.yaml
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 apply &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/argo-rollouts/applicationset.yaml

&lt;span class="c"&gt;# Watch sync status&lt;/span&gt;
argocd app list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
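&lt;p&gt;The eleven commands above differ only in the file path, so they can be generated from two lists. This is a sketch under that assumption: &lt;code&gt;applicationset_files&lt;/code&gt; and &lt;code&gt;HUB_CONTEXT&lt;/code&gt; are hypothetical names, and the loop only prints the commands so they can be reviewed before anything is applied.&lt;/p&gt;

```shell
# Context of the hub cluster, as used in the commands above.
HUB_CONTEXT="prod-use1"

# Prints the path of every ApplicationSet manifest, one per line.
applicationset_files() {
  for env_name in dev staging production; do
    echo "environments/$env_name/applicationset.yaml"
  done
  for component in eso kyverno falco monitoring logging velero karpenter argo-rollouts; do
    echo "infrastructure/$component/applicationset.yaml"
  done
}

# Dry run: print one kubectl command per manifest instead of executing it.
applicationset_files | while read -r file; do
  echo "kubectl --context $HUB_CONTEXT apply -f $file"
done
```

&lt;p&gt;Piping the printed commands to &lt;code&gt;sh&lt;/code&gt; executes them in the same order as the manual version: environments first, then infrastructure components.&lt;/p&gt;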






&lt;h2&gt;
  
  
  Sync Status and Known False Positives
&lt;/h2&gt;

&lt;p&gt;After deploying, most apps show &lt;code&gt;Synced/Healthy&lt;/code&gt;. A few will permanently show &lt;code&gt;OutOfSync/Healthy&lt;/code&gt; — this is expected and safe:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;App&lt;/th&gt;
&lt;th&gt;Why OutOfSync&lt;/th&gt;
&lt;th&gt;Safe?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ESO (&lt;code&gt;external-secrets-*&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;ESO writes &lt;code&gt;status.refreshTime&lt;/code&gt; at runtime; ArgoCD sees this as drift&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;myapp-*&lt;/code&gt; (with ESO)&lt;/td&gt;
&lt;td&gt;Same ESO status drift propagates to the parent app&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;prometheus-*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prometheus operator patches its own ValidatingWebhookConfiguration at runtime&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Real sync failures&lt;/strong&gt; (need action):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SyncFailed&lt;/code&gt; — usually a YAML error or missing CRD&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Degraded&lt;/code&gt; — pods crashing; check &lt;code&gt;kubectl describe pod&lt;/code&gt;
&lt;/li&gt;
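&lt;li&gt;
&lt;code&gt;Missing&lt;/code&gt; — the resource is defined in Git but does not exist in the cluster; sync the app, and if the cluster itself is unreachable (status &lt;code&gt;Unknown&lt;/code&gt;) check connectivity first&lt;/li&gt;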
&lt;/ul&gt;
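&lt;p&gt;The rules above can be written down as a small triage function. This is a sketch, not part of the original pipeline: &lt;code&gt;triage_app&lt;/code&gt; is a hypothetical helper, its name patterns are copied from the table, and the three arguments are expected to be supplied by hand or parsed from &lt;code&gt;argocd app list&lt;/code&gt; output.&lt;/p&gt;

```shell
# Classifies one app from its name, sync status, and health status.
# Prints: ok, expected-drift, wait, or action.
triage_app() {
  name=$1
  sync=$2
  health=$3
  case "$health" in
    Healthy) ;;
    Progressing) echo "wait"; return ;;
    *) echo "action"; return ;;
  esac
  if [ "$sync" = "Synced" ]; then
    echo "ok"
    return
  fi
  # OutOfSync but Healthy: only the known false positives are acceptable.
  case "$name" in
    external-secrets-*|myapp-*|prometheus-*) echo "expected-drift" ;;
    *) echo "action" ;;
  esac
}

triage_app external-secrets-myapp-dev-use1 OutOfSync Healthy
triage_app karpenter-myapp-production-use1 OutOfSync Healthy
```

&lt;p&gt;The first call prints &lt;code&gt;expected-drift&lt;/code&gt; and the second prints &lt;code&gt;action&lt;/code&gt;, because Karpenter is not on the list of known false positives.&lt;/p&gt;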




&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Orphaned resources after a Helm release name change:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Force prune all resources not in current Git state&lt;/span&gt;
argocd app &lt;span class="nb"&gt;sync &lt;/span&gt;argocd/karpenter-myapp-production-use1 &lt;span class="nt"&gt;--prune&lt;/span&gt; &lt;span class="nt"&gt;--force&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;App stuck in Progressing:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;argocd app get argocd/myapp-production-myapp-production-use1
&lt;span class="c"&gt;# Check the "Conditions" and "Events" sections&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cluster shows Unknown:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Re-register the cluster (token may have expired)&lt;/span&gt;
argocd cluster add prod-usw2 &lt;span class="nt"&gt;--name&lt;/span&gt; myapp-production-usw2 &lt;span class="nt"&gt;--yes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;AppProject permission denied:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Ensure project destinations include server: "*"&lt;/span&gt;
&lt;span class="c"&gt;# Ensure sourceRepos includes the chart repo URL&lt;/span&gt;
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 edit appproject production &lt;span class="nt"&gt;-n&lt;/span&gt; argocd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  VPC Peering for Hub → Spoke Connectivity
&lt;/h2&gt;

&lt;p&gt;ArgoCD hub reaches spoke private endpoints via VPC peering. Three peering connections:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prod-use1 VPC (10.20.0.0/16) ◄──────► prod-usw2 VPC   (10.21.0.0/16)
prod-use1 VPC (10.20.0.0/16) ◄──────► staging-use1 VPC (10.10.0.0/16)
prod-use1 VPC (10.20.0.0/16) ◄──────► staging-usw2 VPC (10.11.0.0/16)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the peering is active, ArgoCD can communicate with spoke API servers on their private IPs (&lt;code&gt;10.x.x.x:443&lt;/code&gt;) without any traffic touching the internet.&lt;/p&gt;
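&lt;p&gt;VPC peering only works between VPCs whose CIDR blocks do not overlap, which is why each VPC above has its own &lt;code&gt;/16&lt;/code&gt;. The sketch below checks a list of &lt;code&gt;/16&lt;/code&gt; blocks for duplicates before a new peering is added. &lt;code&gt;duplicate_slash16&lt;/code&gt; is a hypothetical helper, and it is only valid for &lt;code&gt;/16&lt;/code&gt; blocks like the ones in this guide; other prefix lengths need real CIDR arithmetic.&lt;/p&gt;

```shell
# Prints every /16 network prefix (first two octets) that appears more
# than once in the argument list. No output means no duplicates.
duplicate_slash16() {
  for cidr in "$@"; do
    echo "$cidr" | cut -d. -f1-2
  done | sort | uniq -d
}

# The four VPCs from the peering diagram above: prints nothing.
duplicate_slash16 10.20.0.0/16 10.21.0.0/16 10.10.0.0/16 10.11.0.0/16
```

&lt;p&gt;If a new VPC reused &lt;code&gt;10.20.0.0/16&lt;/code&gt;, the function would print &lt;code&gt;10.20&lt;/code&gt;, and that peering connection could not be created.&lt;/p&gt;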




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;By the end of Part 5 you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ ArgoCD running on the hub cluster with a public LoadBalancer URL&lt;/li&gt;
&lt;li&gt;✅ Three AppProjects (dev, staging, production) with correct source and destination restrictions&lt;/li&gt;
&lt;li&gt;✅ All 5 spoke clusters registered (ArgoCD SA + ClusterRoleBinding installed in each)&lt;/li&gt;
&lt;li&gt;✅ ApplicationSets generating Applications for all environments and infrastructure components&lt;/li&gt;
&lt;li&gt;✅ GitOps loop closed: a commit to &lt;code&gt;myapp-gitops&lt;/code&gt; triggers automatic sync to all clusters&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Screenshots
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;SCREENSHOT: ArgoCD UI — Applications view showing all apps, most Synced/Healthy&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbk62osg4qfx71l3zlnim.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbk62osg4qfx71l3zlnim.png" alt="Applications view" width="800" height="444"&gt;&lt;/a&gt;## Part&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5h35ndmkjns6nx7zittk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5h35ndmkjns6nx7zittk.png" alt="Applications view" width="800" height="444"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ful8wrc5q11xqapyvpkcn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ful8wrc5q11xqapyvpkcn.png" alt="Applications view" width="800" height="443"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fguld8zxneyw3qvv19dyf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fguld8zxneyw3qvv19dyf.png" alt="Applications view" width="800" height="444"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdldjo262ufhw8v7gpm1l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdldjo262ufhw8v7gpm1l.png" alt="Applications view" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SCREENSHOT: ArgoCD UI — Cluster list showing all 6 clusters registered and Successful&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsbxyl1gli84tvj3irpki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsbxyl1gli84tvj3irpki.png" alt="Show in frame: All 6 cluster entries with status Successful, showing cluster names and server URLs." width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SCREENSHOT: ArgoCD UI — One Application detail view showing resource tree&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg6fswzlu2dqis9oi8kjj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg6fswzlu2dqis9oi8kjj.png" alt="Show in frame: The resource tree showing Rollout → ReplicaSet → Pods, Service, Ingress, HPA, ExternalSecret all connected." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SCREENSHOT: GitOps Repo Structure on GitHub&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvw08qoe1mq8y9sag5nqv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvw08qoe1mq8y9sag5nqv.png" alt="Show in frame: The folder tree showing environments/, infrastructure/, argocd/ with subfolders expanded." width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Next: Part 6 — CI/CD Pipeline: GitHub Actions, Trivy, Cosign, ECR&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Follow the series&lt;/strong&gt; — next part publishes next Wednesday.&lt;br&gt;
&lt;strong&gt;Live system:&lt;/strong&gt; &lt;a href="https://www.matthewoladipupo.dev/health" rel="noopener noreferrer"&gt;https://www.matthewoladipupo.dev/health&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Runbook:&lt;/strong&gt; &lt;a href="https://github.com/MatthewDipo/myapp-infra/blob/main/docs/runbook.md" rel="noopener noreferrer"&gt;Operations Guide&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Source code:&lt;/strong&gt; &lt;a href="https://github.com/MatthewDipo/myapp-infra" rel="noopener noreferrer"&gt;myapp-infra&lt;/a&gt; | &lt;a href="https://github.com/MatthewDipo/myapp-gitops" rel="noopener noreferrer"&gt;myapp-gitops&lt;/a&gt; | &lt;a href="https://github.com/MatthewDipo/myapp" rel="noopener noreferrer"&gt;myapp&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>gitops</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Part 4: EKS Multi-Cluster Setup</title>
      <dc:creator>Matthew</dc:creator>
      <pubDate>Wed, 01 Apr 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/matthewdipo/part-4-eks-multi-cluster-setup-2i3m</link>
      <guid>https://dev.to/matthewdipo/part-4-eks-multi-cluster-setup-2i3m</guid>
      <description>&lt;h2&gt;
  
  
  Part 4: EKS Multi-Cluster Setup — Six Clusters Across Two Regions
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Part of the series: Building a Production-Grade DevSecOps Pipeline on AWS&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Why six clusters instead of one? The answer is isolation and resilience:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────────────┐
│  ONE CLUSTER (anti-pattern)                                              │
│                                                                          │
│  Dev pods → same etcd as Production pods                                 │
│  A misconfigured dev deployment can consume all cluster resources        │
│  Cluster upgrade = every environment goes down simultaneously            │
│  Cost visibility: impossible to attribute spend per environment          │
└──────────────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────────────┐
│  SIX CLUSTERS (this guide)                                                 │
│                                                                            │
│  myapp-dev-use1      (us-east-1, public endpoint,  2 nodes)                │
│  myapp-dev-usw2      (us-west-2, public endpoint,  2 nodes)                │
│  myapp-staging-use1  (us-east-1, private endpoint, 2 nodes)                │
│  myapp-staging-usw2  (us-west-2, private endpoint, 2 nodes)                │
│  myapp-production-use1 (us-east-1, private endpoint, 2+ nodes + Karpenter) │
│  myapp-production-usw2 (us-west-2, private endpoint, 2+ nodes + Karpenter) │
│                                                                            │
│  Benefits:                                                                 │
│  ✓ Complete IAM isolation between environments                             │
│  ✓ Production upgrade independent of dev                                   │
│  ✓ Regional failover — us-east-1 outage → us-west-2 serves traffic         │
│  ✓ Clear cost attribution per cluster tag                                  │
└────────────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Cluster Overview
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cluster&lt;/th&gt;
&lt;th&gt;Region&lt;/th&gt;
&lt;th&gt;Endpoint&lt;/th&gt;
&lt;th&gt;Nodes&lt;/th&gt;
&lt;th&gt;Karpenter&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;myapp-dev-use1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;us-east-1&lt;/td&gt;
&lt;td&gt;Public&lt;/td&gt;
&lt;td&gt;2 × t3.medium&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;myapp-dev-usw2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;us-west-2&lt;/td&gt;
&lt;td&gt;Public&lt;/td&gt;
&lt;td&gt;2 × t3.medium&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;myapp-staging-use1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;us-east-1&lt;/td&gt;
&lt;td&gt;Private&lt;/td&gt;
&lt;td&gt;2 × t3.medium&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;myapp-staging-usw2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;us-west-2&lt;/td&gt;
&lt;td&gt;Private&lt;/td&gt;
&lt;td&gt;2 × t3.medium&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;myapp-production-use1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;us-east-1&lt;/td&gt;
&lt;td&gt;Private&lt;/td&gt;
&lt;td&gt;2+ × t3.medium&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;myapp-production-usw2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;us-west-2&lt;/td&gt;
&lt;td&gt;Private&lt;/td&gt;
&lt;td&gt;2+ × t3.medium&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  EKS Terraform Module
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# _modules/eks/main.tf&lt;/span&gt;

&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"eks"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform-aws-modules/eks/aws"&lt;/span&gt;
  &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"~&amp;gt; 20.0"&lt;/span&gt;

  &lt;span class="nx"&gt;cluster_name&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cluster_name&lt;/span&gt;
  &lt;span class="nx"&gt;cluster_version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"1.29"&lt;/span&gt;

  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;                   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_id&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_ids&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private_subnet_ids&lt;/span&gt;   &lt;span class="c1"&gt;# Nodes always in private subnets&lt;/span&gt;
  &lt;span class="nx"&gt;control_plane_subnet_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private_subnet_ids&lt;/span&gt;

  &lt;span class="c1"&gt;# Endpoint access: private always on, public only for dev&lt;/span&gt;
  &lt;span class="nx"&gt;cluster_endpoint_private_access&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;cluster_endpoint_public_access&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public_api&lt;/span&gt;
  &lt;span class="nx"&gt;cluster_endpoint_public_access_cidrs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public_api&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="c1"&gt;# Note: AWS rejects empty list when public is disabled — always set to 0.0.0.0/0&lt;/span&gt;

  &lt;span class="c1"&gt;# Without this the cluster creator IAM role (your Terragrunt role) can't kubectl&lt;/span&gt;
  &lt;span class="nx"&gt;enable_cluster_creator_admin_permissions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="c1"&gt;# Encrypt Kubernetes secrets in etcd with KMS&lt;/span&gt;
  &lt;span class="nx"&gt;cluster_encryption_config&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;provider_key_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;kms_key_arn&lt;/span&gt;
    &lt;span class="nx"&gt;resources&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"secrets"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;# EKS managed add-ons (AWS manages patching)&lt;/span&gt;
  &lt;span class="nx"&gt;cluster_addons&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;coredns&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;most_recent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;kube-proxy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;most_recent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;vpc-cni&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;most_recent&lt;/span&gt;              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="nx"&gt;service_account_role_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_cni_irsa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;iam_role_arn&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;aws-ebs-csi-driver&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;most_recent&lt;/span&gt;              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="nx"&gt;service_account_role_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ebs_csi_irsa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;iam_role_arn&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;eks_managed_node_groups&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;main&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;instance_types&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;instance_types&lt;/span&gt;   &lt;span class="c1"&gt;# ["t3.medium"]&lt;/span&gt;
      &lt;span class="nx"&gt;min_size&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;min_nodes&lt;/span&gt;
      &lt;span class="nx"&gt;max_size&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;max_nodes&lt;/span&gt;
      &lt;span class="nx"&gt;desired_size&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;desired_nodes&lt;/span&gt;

      &lt;span class="c1"&gt;# IMPORTANT: name_prefix has a 38 character limit.&lt;/span&gt;
      &lt;span class="c1"&gt;# "myapp-production-use1-eks-node-group-" = 39 chars → FAILS.&lt;/span&gt;
      &lt;span class="c1"&gt;# Fix: use explicit role name (IAM limit is 64 chars).&lt;/span&gt;
      &lt;span class="nx"&gt;iam_role_name&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.cluster_name}-node-group"&lt;/span&gt;
      &lt;span class="nx"&gt;iam_role_use_name_prefix&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

      &lt;span class="c1"&gt;# Nodes need these policies to pull from ECR, write to CloudWatch, etc.&lt;/span&gt;
      &lt;span class="nx"&gt;iam_role_additional_policies&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;AmazonSSMManagedInstanceCore&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="nx"&gt;labels&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"node-type"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"general"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="c1"&gt;# Karpenter discovery tag — only needed on production node groups&lt;/span&gt;
      &lt;span class="c1"&gt;# (Karpenter uses this to find the right security group)&lt;/span&gt;
      &lt;span class="nx"&gt;taints&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;karpenter_enabled&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;# Node security group — allow Karpenter to manage nodes&lt;/span&gt;
  &lt;span class="nx"&gt;node_security_group_tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;karpenter_enabled&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"karpenter.sh/discovery"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cluster_name&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# IRSA for VPC CNI (pod networking)&lt;/span&gt;
&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"vpc_cni_irsa"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"&lt;/span&gt;
  &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"~&amp;gt; 5.0"&lt;/span&gt;

  &lt;span class="nx"&gt;role_name&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.cluster_name}-vpc-cni"&lt;/span&gt;
  &lt;span class="nx"&gt;attach_vpc_cni_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_cni_enable_ipv4&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="nx"&gt;oidc_providers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;main&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;provider_arn&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;eks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;oidc_provider_arn&lt;/span&gt;
      &lt;span class="nx"&gt;namespace_service_accounts&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"kube-system:aws-node"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# IRSA for EBS CSI Driver (persistent volumes)&lt;/span&gt;
&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"ebs_csi_irsa"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"&lt;/span&gt;
  &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"~&amp;gt; 5.0"&lt;/span&gt;

  &lt;span class="nx"&gt;role_name&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.cluster_name}-ebs-csi"&lt;/span&gt;
  &lt;span class="nx"&gt;attach_ebs_csi_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="nx"&gt;oidc_providers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;main&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;provider_arn&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;eks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;oidc_provider_arn&lt;/span&gt;
      &lt;span class="nx"&gt;namespace_service_accounts&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"kube-system:ebs-csi-controller-sa"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
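
&lt;p&gt;The module above reads ten input variables. The author's &lt;code&gt;variables.tf&lt;/code&gt; is not shown, so the following is only a minimal sketch of what it might look like: the variable names come from &lt;code&gt;main.tf&lt;/code&gt;, while the types, descriptions, and defaults are assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# _modules/eks/variables.tf (hypothetical sketch; types and defaults are assumptions)

variable "cluster_name" {
  description = "EKS cluster name, e.g. myapp-dev-use1"
  type        = string
}

variable "vpc_id" {
  description = "VPC that hosts the cluster"
  type        = string
}

variable "private_subnet_ids" {
  description = "Private subnets used for both nodes and control plane ENIs"
  type        = list(string)
}

variable "kms_key_arn" {
  description = "KMS key used to encrypt Kubernetes secrets"
  type        = string
}

variable "public_api" {
  description = "Expose the API server endpoint publicly (true only for dev)"
  type        = bool
  default     = false # assumed default: private unless an environment opts in
}

variable "instance_types" {
  description = "Instance types for the managed node group"
  type        = list(string)
  default     = ["t3.medium"]
}

variable "min_nodes" {
  type = number
}

variable "max_nodes" {
  type = number
}

variable "desired_nodes" {
  type = number
}

variable "karpenter_enabled" {
  description = "Adds the karpenter.sh/discovery tag to the node security group"
  type        = bool
  default     = false
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Defaulting &lt;code&gt;public_api&lt;/code&gt; to &lt;code&gt;false&lt;/code&gt; means a new environment that forgets to set it would end up private rather than public.&lt;/p&gt;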





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# _modules/eks/outputs.tf&lt;/span&gt;

&lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="s2"&gt;"cluster_name"&lt;/span&gt;                &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;eks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cluster_name&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="s2"&gt;"cluster_endpoint"&lt;/span&gt;            &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;eks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cluster_endpoint&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="s2"&gt;"cluster_certificate_authority_data"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;eks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cluster_certificate_authority_data&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="s2"&gt;"oidc_provider_arn"&lt;/span&gt;           &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;eks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;oidc_provider_arn&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="s2"&gt;"oidc_provider"&lt;/span&gt;               &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;eks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;oidc_provider&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="s2"&gt;"node_security_group_id"&lt;/span&gt;      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;eks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;node_security_group_id&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="s2"&gt;"node_subnet_ids"&lt;/span&gt;             &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private_subnet_ids&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Per-Environment Terragrunt Configs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Dev (public endpoint):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# live/dev/us-east-1/eks/terragrunt.hcl&lt;/span&gt;

&lt;span class="nx"&gt;include&lt;/span&gt; &lt;span class="s2"&gt;"root"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;find_in_parent_folders&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;terraform&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../../../../_modules/eks"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;dependency&lt;/span&gt; &lt;span class="s2"&gt;"vpc"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;config_path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../vpc"&lt;/span&gt;
  &lt;span class="nx"&gt;mock_outputs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;vpc_id&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"vpc-mock"&lt;/span&gt;
    &lt;span class="nx"&gt;private_subnet_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"subnet-mock1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"subnet-mock2"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;dependency&lt;/span&gt; &lt;span class="s2"&gt;"kms"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;config_path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../kms"&lt;/span&gt;
  &lt;span class="nx"&gt;mock_outputs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;key_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:kms:us-east-1:123456789:key/mock"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;inputs&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;cluster_name&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"myapp-dev-use1"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dependency&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_id&lt;/span&gt;
  &lt;span class="nx"&gt;private_subnet_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dependency&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private_subnet_ids&lt;/span&gt;
  &lt;span class="nx"&gt;kms_key_arn&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dependency&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;kms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;key_arn&lt;/span&gt;
  &lt;span class="nx"&gt;public_api&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;    &lt;span class="c1"&gt;# Dev gets public endpoint for laptop + CI access&lt;/span&gt;
  &lt;span class="nx"&gt;min_nodes&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
  &lt;span class="nx"&gt;max_nodes&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
  &lt;span class="nx"&gt;desired_nodes&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
  &lt;span class="nx"&gt;instance_types&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"t3.medium"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;karpenter_enabled&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
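
&lt;p&gt;One caveat with &lt;code&gt;mock_outputs&lt;/code&gt;: unless restricted, Terragrunt substitutes the mock values for any command whenever the dependency has no real outputs yet, which includes &lt;code&gt;apply&lt;/code&gt;. Terragrunt's &lt;code&gt;mock_outputs_allowed_terraform_commands&lt;/code&gt; attribute limits mocks to the commands you list. The author's config does not use this attribute, so treat the following as a suggested variation of the &lt;code&gt;vpc&lt;/code&gt; dependency rather than part of the original setup.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# live/dev/us-east-1/eks/terragrunt.hcl (suggested variation, not the author's config)

dependency "vpc" {
  config_path = "../vpc"

  mock_outputs = {
    vpc_id             = "vpc-mock"
    private_subnet_ids = ["subnet-mock1", "subnet-mock2"]
  }

  # Mocks are used only for read-only commands.
  # An apply against a VPC with no real outputs fails instead of using "vpc-mock".
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;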



&lt;p&gt;&lt;strong&gt;Production (private endpoint + Karpenter):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# live/production/us-east-1/eks/terragrunt.hcl&lt;/span&gt;

&lt;span class="nx"&gt;include&lt;/span&gt; &lt;span class="s2"&gt;"root"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;find_in_parent_folders&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;terraform&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../../../../_modules/eks"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;dependency&lt;/span&gt; &lt;span class="s2"&gt;"vpc"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;config_path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../vpc"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;dependency&lt;/span&gt; &lt;span class="s2"&gt;"kms"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;config_path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../kms"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;inputs&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;cluster_name&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"myapp-production-use1"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dependency&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_id&lt;/span&gt;
  &lt;span class="nx"&gt;private_subnet_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dependency&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private_subnet_ids&lt;/span&gt;
  &lt;span class="nx"&gt;kms_key_arn&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dependency&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;kms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;key_arn&lt;/span&gt;
  &lt;span class="nx"&gt;public_api&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;   &lt;span class="c1"&gt;# Private endpoint only — no public internet access&lt;/span&gt;
  &lt;span class="nx"&gt;min_nodes&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
  &lt;span class="nx"&gt;max_nodes&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
  &lt;span class="nx"&gt;desired_nodes&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
  &lt;span class="nx"&gt;instance_types&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"t3.medium"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;karpenter_enabled&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;    &lt;span class="c1"&gt;# Karpenter manages additional nodes beyond the initial 2&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Bootstrapping Private Clusters
&lt;/h2&gt;

&lt;p&gt;Staging and production clusters have &lt;code&gt;endpointPublicAccess: false&lt;/code&gt;. This means &lt;code&gt;kubectl&lt;/code&gt; from your laptop or CI cannot reach the API server directly. You must temporarily enable public access, bootstrap the cluster (install ArgoCD, register spokes, etc.), then lock it back down.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Step 1: Temporarily enable public access&lt;/span&gt;
aws eks update-cluster-config &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; myapp-production-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-prod-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resources-vpc-config&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;endpointPublicAccess&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;,&lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;endpointPrivateAccess&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;,&lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;publicAccessCidrs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;

&lt;span class="c"&gt;# Step 2: Wait 3 minutes — AWS takes time to update the Elastic Network Interfaces&lt;/span&gt;
&lt;span class="nb"&gt;sleep &lt;/span&gt;180

&lt;span class="c"&gt;# Step 3: Verify access&lt;/span&gt;
kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; prod-use1 get nodes

&lt;span class="c"&gt;# Step 4: Bootstrap (install ArgoCD, apply ApplicationSets, etc.)&lt;/span&gt;
&lt;span class="c"&gt;# ... your bootstrap commands ...&lt;/span&gt;

&lt;span class="c"&gt;# Step 5: Lock back to private-only&lt;/span&gt;
aws eks update-cluster-config &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; myapp-production-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-prod-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resources-vpc-config&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;endpointPublicAccess&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;,&lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;endpointPrivateAccess&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Do not forget Step 5.&lt;/strong&gt; A production cluster with a public API endpoint is a security risk — the API server is internet-accessible, relying solely on authentication for protection.&lt;/p&gt;
&lt;/blockquote&gt;
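
&lt;p&gt;Changing the endpoint with the AWS CLI leaves Terraform's state out of step with the real cluster until Step 5 runs. One possible alternative is to drive the same toggle through Terragrunt, so the open and close steps are both ordinary applies. This is a sketch under assumptions: &lt;code&gt;EKS_BOOTSTRAP_PUBLIC&lt;/code&gt; is a made-up environment variable name, and only the &lt;code&gt;public_api&lt;/code&gt; input changes relative to the production config shown earlier.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# live/production/us-east-1/eks/terragrunt.hcl (hypothetical variation)

inputs = {
  # ... all other inputs stay as in the production config ...

  # EKS_BOOTSTRAP_PUBLIC is an assumed variable name, not part of the original setup.
  # Unset or "false": endpoint stays private. "true": public access for the bootstrap window.
  public_api = get_env("EKS_BOOTSTRAP_PUBLIC", "false") == "true"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With this in place, &lt;code&gt;EKS_BOOTSTRAP_PUBLIC=true terragrunt apply&lt;/code&gt; would open the endpoint, and a plain &lt;code&gt;terragrunt apply&lt;/code&gt; afterwards would lock it back down. The lock-down step is still manual, so the warning above applies equally.&lt;/p&gt;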




&lt;h2&gt;
  
  
  kubectl Context Setup
&lt;/h2&gt;

&lt;p&gt;After &lt;code&gt;terragrunt apply&lt;/code&gt; completes for each cluster, add it to your kubeconfig:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Update kubeconfig for all 6 clusters&lt;/span&gt;
aws eks update-kubeconfig &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; myapp-dev-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--alias&lt;/span&gt; dev-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-dev-use1

aws eks update-kubeconfig &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; myapp-dev-usw2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-west-2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--alias&lt;/span&gt; dev-usw2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-dev-usw2

aws eks update-kubeconfig &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; myapp-staging-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--alias&lt;/span&gt; staging-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-staging-use1

aws eks update-kubeconfig &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; myapp-staging-usw2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-west-2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--alias&lt;/span&gt; staging-usw2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-staging-usw2

aws eks update-kubeconfig &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; myapp-production-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--alias&lt;/span&gt; prod-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-prod-use1

aws eks update-kubeconfig &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; myapp-production-usw2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-west-2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--alias&lt;/span&gt; prod-usw2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-prod-usw2

&lt;span class="c"&gt;# Verify&lt;/span&gt;
kubectl config get-contexts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
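
&lt;p&gt;The six commands differ only in environment and region, so they can be generated from a loop. This is a minimal sketch, assuming the naming scheme stays exactly as shown above (the cluster name uses the full environment name, while the alias and profile shorten &lt;code&gt;production&lt;/code&gt; to &lt;code&gt;prod&lt;/code&gt;). The function names &lt;code&gt;kubeconfig_cmd&lt;/code&gt; and &lt;code&gt;print_all_kubeconfig_cmds&lt;/code&gt; are hypothetical helpers, not part of the pipeline.&lt;/p&gt;

```shell
# Print the update-kubeconfig command for one environment/region pair.
# Assumption: names follow the pattern of the six commands above.
kubeconfig_cmd() {
  local env="$1" region="$2" short region_alias
  case "$region" in
    us-east-1) region_alias="use1" ;;
    us-west-2) region_alias="usw2" ;;
    *) echo "unsupported region: $region"; return 1 ;;
  esac
  # The alias and profile use "prod"; the cluster name uses "production".
  short="$env"
  if [ "$env" = "production" ]; then
    short="prod"
  fi
  printf 'aws eks update-kubeconfig --name myapp-%s-%s --region %s --alias %s-%s --profile myapp-%s-%s\n' \
    "$env" "$region_alias" "$region" "$short" "$region_alias" "$short" "$region_alias"
}

# Print all six commands, one per line, without running them.
print_all_kubeconfig_cmds() {
  local env region
  for env in dev staging production; do
    for region in us-east-1 us-west-2; do
      kubeconfig_cmd "$env" "$region"
    done
  done
}
```

&lt;p&gt;Run &lt;code&gt;print_all_kubeconfig_cmds&lt;/code&gt; to review the commands first, then pipe the output to &lt;code&gt;bash&lt;/code&gt; once they look right. Printing before executing keeps a typo in a profile name from silently writing a broken context.&lt;/p&gt;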






&lt;h2&gt;
  
  
  EKS Add-ons
&lt;/h2&gt;

&lt;p&gt;EKS managed add-ons are maintained by AWS, which publishes patched versions of CoreDNS, kube-proxy, and vpc-cni. You pick up a security fix by bumping the add-on version rather than managing Helm releases yourself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kube-proxy     — handles iptables rules for Service routing
coredns        — in-cluster DNS resolution
vpc-cni        — AWS VPC networking for pods (each pod gets a real VPC IP)
aws-ebs-csi-driver — allows EKS to provision EBS volumes for PersistentVolumeClaims
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why IRSA for vpc-cni and ebs-csi?&lt;/strong&gt;&lt;br&gt;
These add-ons need to call AWS APIs (EC2 for ENI management, EC2 for EBS volume ops). Without IRSA they would use the node's EC2 instance profile — giving every pod on the node those permissions. With IRSA, only the specific add-on service account has the permissions.&lt;/p&gt;
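
&lt;p&gt;What scopes an IRSA role to one service account is the pair of conditions in the role's trust policy: the token audience and the service account subject. The sketch below only builds those two strings so you can compare them with what your Terraform produced. The helper names are hypothetical, and the issuer URL is whatever your own cluster reports; &lt;code&gt;aws-node&lt;/code&gt; and &lt;code&gt;ebs-csi-controller-sa&lt;/code&gt; are the service account names these add-ons normally use in &lt;code&gt;kube-system&lt;/code&gt;.&lt;/p&gt;

```shell
# The OIDC provider ID is the cluster's issuer URL without the https:// scheme.
oidc_provider_id() {
  printf '%s\n' "${1#https://}"
}

# The "sub" claim that identifies exactly one Kubernetes service account.
irsa_subject() {
  local namespace="$1" service_account="$2"
  printf 'system:serviceaccount:%s:%s\n' "$namespace" "$service_account"
}

# Print the two StringEquals conditions an IRSA trust policy should carry.
irsa_conditions() {
  local issuer_url="$1" namespace="$2" service_account="$3" provider
  provider="$(oidc_provider_id "$issuer_url")"
  printf '%s:aud = sts.amazonaws.com\n' "$provider"
  printf '%s:sub = %s\n' "$provider" "$(irsa_subject "$namespace" "$service_account")"
}
```

&lt;p&gt;For example, &lt;code&gt;irsa_conditions "$ISSUER_URL" kube-system aws-node&lt;/code&gt; prints the conditions for the VPC CNI role. If the &lt;code&gt;sub&lt;/code&gt; condition is missing from a trust policy, any service account in the cluster can assume the role, which defeats the point of IRSA.&lt;/p&gt;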


&lt;h2&gt;
  
  
  Fixing kubectl 401 Errors
&lt;/h2&gt;

&lt;p&gt;If &lt;code&gt;kubectl get nodes&lt;/code&gt; returns HTTP 401 Unauthorized, the IAM role you're using most likely has no access entry on the cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List current access entries&lt;/span&gt;
aws eks list-access-entries &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cluster-name&lt;/span&gt; myapp-production-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-prod-use1

&lt;span class="c"&gt;# If your OrganizationAccountAccessRole is missing, add it:&lt;/span&gt;
&lt;span class="nv"&gt;ROLE_ARN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::591120834781:role/OrganizationAccountAccessRole"&lt;/span&gt;

aws eks create-access-entry &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cluster-name&lt;/span&gt; myapp-production-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-prod-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--principal-arn&lt;/span&gt; &lt;span class="nv"&gt;$ROLE_ARN&lt;/span&gt;

aws eks associate-access-policy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cluster-name&lt;/span&gt; myapp-production-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-prod-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--principal-arn&lt;/span&gt; &lt;span class="nv"&gt;$ROLE_ARN&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--policy-arn&lt;/span&gt; arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--access-scope&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;A common root cause: &lt;code&gt;enable_cluster_creator_admin_permissions = true&lt;/code&gt; was not set in the EKS module. Without it, Terraform creates the cluster but the IAM role that ran Terraform doesn't get an access entry.&lt;/p&gt;
&lt;/blockquote&gt;
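
&lt;p&gt;One thing that trips people up when comparing identities: &lt;code&gt;aws sts get-caller-identity&lt;/code&gt; prints an STS assumed-role ARN, while access entries are keyed on the IAM role ARN. The sketch below converts one form to the other and checks it against a list of entries. Both function names are hypothetical helpers, and the conversion assumes the role has no IAM path.&lt;/p&gt;

```shell
# Convert an STS assumed-role ARN into the IAM role ARN used by access entries.
# Caveat: the assumed-role form drops any IAM path, so a role created under a
# path needs that path added back by hand.
to_role_arn() {
  local arn="$1" account rest role
  case "$arn" in
    arn:aws:sts::*:assumed-role/*)
      rest="${arn#arn:aws:sts::}"      # ACCOUNT:assumed-role/ROLE/SESSION
      account="${rest%%:*}"            # ACCOUNT
      rest="${rest#*:assumed-role/}"   # ROLE/SESSION
      role="${rest%%/*}"               # ROLE
      printf 'arn:aws:iam::%s:role/%s\n' "$account" "$role"
      ;;
    *)
      # Already an IAM ARN: pass it through unchanged.
      printf '%s\n' "$arn"
      ;;
  esac
}

# Succeed if the ARN appears as a whole line in a newline-separated list.
has_access_entry() {
  local entries="$1" arn="$2"
  printf '%s\n' "$entries" | grep -Fxq -- "$arn"
}
```

&lt;p&gt;Feed &lt;code&gt;has_access_entry&lt;/code&gt; the ARNs returned by &lt;code&gt;aws eks list-access-entries&lt;/code&gt; (one per line) and the converted ARN of your current identity. A non-zero exit status means the entry is missing and the commands above apply.&lt;/p&gt;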




&lt;h2&gt;
  
  
  AWS Load Balancer Controller
&lt;/h2&gt;

&lt;p&gt;The AWS Load Balancer Controller (LBC) runs in every cluster and watches for Ingress resources with &lt;code&gt;ingressClassName: alb&lt;/code&gt;. When it sees one, it provisions an Application Load Balancer in AWS automatically.&lt;/p&gt;

&lt;p&gt;Install via Helm (or ArgoCD) after the cluster is up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# IRSA for LBC&lt;/span&gt;
eksctl create iamserviceaccount &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cluster&lt;/span&gt; myapp-production-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; kube-system &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; aws-load-balancer-controller &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--attach-policy-arn&lt;/span&gt; arn:aws:iam::aws:policy/ElasticLoadBalancingFullAccess &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--override-existing-serviceaccounts&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--approve&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-prod-use1

&lt;span class="c"&gt;# Install controller&lt;/span&gt;
helm repo add eks https://aws.github.io/eks-charts
helm &lt;span class="nb"&gt;install &lt;/span&gt;aws-load-balancer-controller eks/aws-load-balancer-controller &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; &lt;span class="nv"&gt;clusterName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;myapp-production-use1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; serviceAccount.create&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; serviceAccount.name&lt;span class="o"&gt;=&lt;/span&gt;aws-load-balancer-controller &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; &lt;span class="nv"&gt;region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; &lt;span class="nv"&gt;vpcId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;VPC_ID&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this pipeline, the LBC is deployed via ArgoCD ApplicationSet — the Helm release is version-controlled in &lt;code&gt;myapp-gitops/infrastructure/aws-lbc/&lt;/code&gt;.&lt;/p&gt;
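
&lt;p&gt;Whichever way the controller is installed, a quick check that it is reconciling is to see whether an Ingress has received a load balancer hostname in its status. This is a rough sketch: the function names are hypothetical, the context, namespace, and Ingress name are whatever you pass in, and the suffix test is only a heuristic for AWS load balancer DNS names.&lt;/p&gt;

```shell
# Print the hostname the controller wrote into the Ingress status, if any.
# Arguments: kubectl context, namespace, Ingress name.
ingress_hostname() {
  kubectl --context "$1" -n "$2" get ingress "$3" \
    -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
}

# Succeed if a hostname looks like an AWS load balancer DNS name.
looks_like_elb_hostname() {
  case "$1" in
    *.elb.amazonaws.com) return 0 ;;
    *) return 1 ;;
  esac
}
```

&lt;p&gt;An empty result from &lt;code&gt;ingress_hostname&lt;/code&gt; a few minutes after creating the Ingress usually means the controller logs are the next place to look.&lt;/p&gt;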




&lt;h2&gt;
  
  
  StorageClass for EBS PVCs
&lt;/h2&gt;

&lt;p&gt;kube-prometheus-stack needs persistent storage for Prometheus and Grafana data. With the EBS CSI driver installed, create a StorageClass:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;storage.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;StorageClass&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gp2&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;storageclass.kubernetes.io/is-default-class&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
&lt;span class="na"&gt;provisioner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ebs.csi.aws.com&lt;/span&gt;
&lt;span class="na"&gt;volumeBindingMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;WaitForFirstConsumer&lt;/span&gt;   &lt;span class="c1"&gt;# Don't provision until pod is scheduled&lt;/span&gt;
&lt;span class="na"&gt;reclaimPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Retain&lt;/span&gt;                      &lt;span class="c1"&gt;# Don't delete EBS volume if PVC is deleted&lt;/span&gt;
&lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gp2&lt;/span&gt;
  &lt;span class="na"&gt;encrypted&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
  &lt;span class="na"&gt;kmsKeyId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;your-kms-key-arn&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Node Security Group Tags
&lt;/h2&gt;

&lt;p&gt;For Karpenter to manage node lifecycles, it needs to find the cluster's node security group. Tag it during EKS creation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;node_security_group_tags&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"karpenter.sh/discovery"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cluster_name&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similarly, private subnets need the discovery tag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;private_subnet_tags&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"karpenter.sh/discovery"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cluster_name&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
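
&lt;p&gt;After an apply, it is worth confirming that the tags actually landed on the resources. The sketch below builds the EC2 tag filter and wraps the two lookups. The function names are hypothetical, and the lookups assume the AWS CLI is already authenticated against the right account.&lt;/p&gt;

```shell
# Build the EC2 filter that selects resources tagged for a given cluster.
discovery_filter() {
  printf 'Name=tag:karpenter.sh/discovery,Values=%s\n' "$1"
}

# List security group IDs carrying the discovery tag.
# Arguments: cluster name, region.
list_discovery_security_groups() {
  aws ec2 describe-security-groups --region "$2" \
    --filters "$(discovery_filter "$1")" \
    --query 'SecurityGroups[].GroupId' --output text
}

# List subnet IDs carrying the discovery tag.
# Arguments: cluster name, region.
list_discovery_subnets() {
  aws ec2 describe-subnets --region "$2" \
    --filters "$(discovery_filter "$1")" \
    --query 'Subnets[].SubnetId' --output text
}
```

&lt;p&gt;For example, &lt;code&gt;list_discovery_subnets myapp-production-use1 us-east-1&lt;/code&gt; should print one subnet ID per private subnet. An empty result means nothing carries the tag for that cluster name.&lt;/p&gt;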






&lt;h2&gt;
  
  
  Verifying All Six Clusters
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;CTX &lt;span class="k"&gt;in &lt;/span&gt;dev-use1 dev-usw2 staging-use1 staging-usw2 prod-use1 prod-usw2&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"=== &lt;/span&gt;&lt;span class="nv"&gt;$CTX&lt;/span&gt;&lt;span class="s2"&gt; ==="&lt;/span&gt;
  kubectl &lt;span class="nt"&gt;--context&lt;/span&gt; &lt;span class="nv"&gt;$CTX&lt;/span&gt; get nodes &lt;span class="nt"&gt;-o&lt;/span&gt; wide
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected output (abridged: two of the six clusters, without the extra &lt;code&gt;-o wide&lt;/code&gt; columns):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;=== dev-use1 ===
NAME                           STATUS   ROLES    AGE   VERSION
&lt;/span&gt;&lt;span class="gp"&gt;ip-10-0-8-xx.ec2.internal      Ready    &amp;lt;none&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;5d    v1.29.15-eks-ac2d5a0
&lt;span class="gp"&gt;ip-10-0-16-xx.ec2.internal     Ready    &amp;lt;none&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;5d    v1.29.15-eks-ac2d5a0
&lt;span class="go"&gt;
=== prod-use1 ===
NAME                           STATUS   ROLES    AGE   VERSION
&lt;/span&gt;&lt;span class="gp"&gt;ip-10-20-8-xx.ec2.internal     Ready    &amp;lt;none&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;5d    v1.29.15-eks-ac2d5a0
&lt;span class="gp"&gt;ip-10-20-16-xx.ec2.internal    Ready    &amp;lt;none&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;5d    v1.29.15-eks-ac2d5a0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
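
&lt;p&gt;Reading six node tables by eye gets old quickly. A small sketch that turns the same check into a pass/fail result, assuming the six context aliases configured earlier. &lt;code&gt;count_not_ready&lt;/code&gt; and &lt;code&gt;check_all_contexts&lt;/code&gt; are hypothetical helper names.&lt;/p&gt;

```shell
# Read `kubectl get nodes --no-headers` output on stdin and print how many
# nodes have a STATUS column other than exactly "Ready". Blank lines are
# ignored. A cordoned node ("Ready,SchedulingDisabled") counts as not ready.
count_not_ready() {
  awk 'NF { if ($2 != "Ready") n++ } END { print n + 0 }'
}

# Check every cluster; return non-zero if any context fails or has bad nodes.
check_all_contexts() {
  local ctx nodes bad failed=0
  for ctx in dev-use1 dev-usw2 staging-use1 staging-usw2 prod-use1 prod-usw2; do
    if ! nodes="$(kubectl --context "$ctx" get nodes --no-headers)"; then
      echo "$ctx: kubectl failed"
      failed=1
      continue
    fi
    bad="$(printf '%s\n' "$nodes" | count_not_ready)"
    if [ "$bad" -ne 0 ]; then
      echo "$ctx: $bad node(s) not Ready"
      failed=1
    fi
  done
  return "$failed"
}
```

&lt;p&gt;Because &lt;code&gt;check_all_contexts&lt;/code&gt; returns a non-zero status on any problem, it can gate a later step in a script or CI job.&lt;/p&gt;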






&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;By the end of Part 4 you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Six EKS clusters (Kubernetes 1.29) across three environments and two regions&lt;/li&gt;
&lt;li&gt;✅ Private endpoints on staging and production (public on dev)&lt;/li&gt;
&lt;li&gt;✅ KMS encryption for Kubernetes secrets in etcd&lt;/li&gt;
&lt;li&gt;✅ IAM IRSA for VPC CNI and EBS CSI add-ons&lt;/li&gt;
&lt;li&gt;✅ AWS Load Balancer Controller installed&lt;/li&gt;
&lt;li&gt;✅ kubectl contexts configured for all six clusters&lt;/li&gt;
&lt;li&gt;✅ Karpenter discovery tags on production node security groups and subnets&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Screenshots
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;SCREENSHOT: AWS EKS console showing 2 clusters running in production with ACTIVE status&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqlriaib1dnv6lg8xvsaj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqlriaib1dnv6lg8xvsaj.png" alt="AWS EKS console showing 2 clusters running in production with ACTIVE status" width="800" height="443"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff5n9ggmwhiy27xdg2puo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff5n9ggmwhiy27xdg2puo.png" alt="AWS EKS console showing 2 clusters running in production with ACTIVE status" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SCREENSHOT: kubectl get nodes output for all 6 clusters&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4nsb4n1jnm1bz2cfrb55.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4nsb4n1jnm1bz2cfrb55.png" alt="kubectl get nodes output for all 6 clusters" width="800" height="733"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Next: Part 5 — GitOps with ArgoCD: Hub-Spoke Model&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Follow the series&lt;/strong&gt; — next part publishes next Wednesday.&lt;br&gt;
&lt;strong&gt;Live system:&lt;/strong&gt; &lt;a href="https://www.matthewoladipupo.dev/health" rel="noopener noreferrer"&gt;https://www.matthewoladipupo.dev/health&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Runbook:&lt;/strong&gt; &lt;a href="https://github.com/MatthewDipo/myapp-infra/blob/main/docs/runbook.md" rel="noopener noreferrer"&gt;Operations Guide&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Source code:&lt;/strong&gt; &lt;a href="https://github.com/MatthewDipo/myapp-infra" rel="noopener noreferrer"&gt;myapp-infra&lt;/a&gt; | &lt;a href="https://github.com/MatthewDipo/myapp-gitops" rel="noopener noreferrer"&gt;myapp-gitops&lt;/a&gt; | &lt;a href="https://github.com/MatthewDipo/myapp" rel="noopener noreferrer"&gt;myapp&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>aws</category>
      <category>devops</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Part 3: Infrastructure as Code</title>
      <dc:creator>Matthew</dc:creator>
      <pubDate>Wed, 25 Mar 2026 08:00:00 +0000</pubDate>
      <link>https://dev.to/matthewdipo/part-3-infrastructure-as-code-2o86</link>
      <guid>https://dev.to/matthewdipo/part-3-infrastructure-as-code-2o86</guid>
      <description>&lt;h2&gt;
  
  
  Part 3: Infrastructure as Code — Terraform Modules + Terragrunt
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Part of the series: Building a Production-Grade DevSecOps Pipeline on AWS&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Plain Terraform works fine for a single environment. But this pipeline has 6 clusters across 3 environments and 2 regions — 18+ Terragrunt child directories. Without a DRY strategy, you end up copy-pasting the same &lt;code&gt;provider&lt;/code&gt;, &lt;code&gt;backend&lt;/code&gt;, and &lt;code&gt;module&lt;/code&gt; blocks everywhere, and a single account ID change means updating 18 files.&lt;/p&gt;

&lt;p&gt;Terragrunt solves this with two mechanisms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;include&lt;/code&gt;&lt;/strong&gt; — child configs inherit the root config's provider generation and remote state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;dependency&lt;/code&gt;&lt;/strong&gt; — explicit ordering ensures VPC exists before EKS, KMS before EKS, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result: each child &lt;code&gt;terragrunt.hcl&lt;/code&gt; is typically 10–30 lines of pure inputs, with all boilerplate generated automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  Repository Layout
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;myapp-infra/
├── _modules/                    # Reusable Terraform modules (no Terragrunt here)
│   ├── vpc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── eks/
│   ├── kms/
│   ├── iam/
│   ├── ecr/
│   ├── waf/
│   ├── guardduty/
│   ├── eso-irsa/
│   ├── fluent-bit-irsa/
│   ├── karpenter/
│   └── velero/
│
└── live/                        # Terragrunt wrappers — one dir per resource per env/region
    ├── terragrunt.hcl           # ROOT config (provider + backend generation)
    ├── dev/
    │   ├── us-east-1/
    │   │   ├── vpc/
    │   │   │   └── terragrunt.hcl
    │   │   ├── kms/
    │   │   │   └── terragrunt.hcl
    │   │   ├── eks/
    │   │   │   └── terragrunt.hcl
    │   │   └── iam/
    │   │       └── terragrunt.hcl
    │   └── us-west-2/
    │       └── ... (mirror)
    ├── staging/
    │   └── ...
    └── production/
        └── ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key principle:&lt;/strong&gt; modules in &lt;code&gt;_modules/&lt;/code&gt; are pure Terraform — no Terragrunt, no state config, no provider config. They are just reusable building blocks. The &lt;code&gt;live/&lt;/code&gt; tree contains nothing but thin Terragrunt wrappers that call those modules with environment-specific values.&lt;/p&gt;




&lt;h2&gt;
  
  
  Dependency Ordering
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│ APPLY ORDER (Terragrunt resolves this from dependency graph)│
│                                                             │
│  1. kms          (no dependencies)                          │
│  2. vpc          (no dependencies)                          │
│  3. eks          (depends on: vpc, kms)                     │
│  4. iam          (depends on: eks — needs OIDC provider URL)│
│  5. eso-irsa     (depends on: eks, iam)                     │
│  6. fluent-bit-irsa (depends on: eks)                       │
│  7. karpenter    (depends on: eks, iam)                     │
│  8. velero       (depends on: eks)                          │
│  9. waf          (no dependencies)                          │
│  10. guardduty   (no dependencies)                          │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run everything in order automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;live/production/us-east-1
terragrunt run-all apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terragrunt reads all &lt;code&gt;dependency&lt;/code&gt; blocks, builds a DAG, and applies in the correct order.&lt;/p&gt;
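
&lt;p&gt;To get a feel for what that ordering means, the same dependency list can be fed to &lt;code&gt;tsort&lt;/code&gt;, the topological sort utility from coreutils. This is only an illustration of the idea, not how Terragrunt is implemented. The function names are hypothetical, and the pairs are transcribed from the apply-order box above.&lt;/p&gt;

```shell
# Each line is "prerequisite dependent", the pair format tsort expects.
# A module with no dependencies is paired with itself so it still appears.
dependency_pairs() {
  printf '%s\n' \
    'kms kms' \
    'vpc vpc' \
    'waf waf' \
    'guardduty guardduty' \
    'vpc eks' \
    'kms eks' \
    'eks iam' \
    'eks eso-irsa' \
    'iam eso-irsa' \
    'eks fluent-bit-irsa' \
    'eks karpenter' \
    'iam karpenter' \
    'eks velero'
}

# Print one valid apply order, one module per line.
apply_order() {
  dependency_pairs | tsort
}

# Print the 1-based position of a module in that order.
position_of() {
  apply_order | awk -v m="$1" '$0 == m { print NR }'
}
```

&lt;p&gt;Several orders are valid for the same graph, so the exact sequence may differ from the numbered box. What matters is that every module appears after everything it depends on, for example &lt;code&gt;vpc&lt;/code&gt; and &lt;code&gt;kms&lt;/code&gt; before &lt;code&gt;eks&lt;/code&gt;.&lt;/p&gt;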




&lt;h2&gt;
  
  
  Root Terragrunt Config
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# live/terragrunt.hcl&lt;/span&gt;

&lt;span class="nx"&gt;locals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;path_parts&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;path_relative_to_include&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  &lt;span class="nx"&gt;env&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path_parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# dev | staging | production&lt;/span&gt;
  &lt;span class="nx"&gt;region&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path_parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# us-east-1 | us-west-2&lt;/span&gt;

  &lt;span class="nx"&gt;account_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;dev&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"557702566877"&lt;/span&gt;
    &lt;span class="nx"&gt;staging&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"YOUR_STAGING_ACCOUNT_ID"&lt;/span&gt;
    &lt;span class="nx"&gt;production&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"591120834781"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;account_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;account_ids&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# Region short alias for naming (avoids long names hitting IAM limits)&lt;/span&gt;
  &lt;span class="nx"&gt;region_alias&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;region&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="s2"&gt;"use1"&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"usw2"&lt;/span&gt;

  &lt;span class="nx"&gt;cluster_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"myapp-${local.env}-${local.region_alias}"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Auto-generate provider.tf in every child directory&lt;/span&gt;
&lt;span class="nx"&gt;generate&lt;/span&gt; &lt;span class="s2"&gt;"provider"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;path&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"provider.tf"&lt;/span&gt;
  &lt;span class="nx"&gt;if_exists&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"overwrite_terragrunt"&lt;/span&gt;
  &lt;span class="nx"&gt;contents&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;-&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
    provider "aws" {
      region = "${local.region}"
      assume_role {
        role_arn = "arn:aws:iam::${local.account_id}:role/OrganizationAccountAccessRole"
      }
      default_tags {
        tags = {
          Environment = "${local.env}"
          Region      = "${local.region}"
          ManagedBy   = "Terraform"
          Project     = "myapp"
          Cluster     = "${local.cluster_name}"
        }
      }
    }
&lt;/span&gt;&lt;span class="no"&gt;  EOF
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Auto-generate backend.tf — per-module state file in S3&lt;/span&gt;
&lt;span class="nx"&gt;remote_state&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"s3"&lt;/span&gt;
  &lt;span class="nx"&gt;generate&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;path&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"backend.tf"&lt;/span&gt;
    &lt;span class="nx"&gt;if_exists&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"overwrite_terragrunt"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;config&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;bucket&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"myapp-terraform-state-${local.account_id}"&lt;/span&gt;
    &lt;span class="nx"&gt;key&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${path_relative_to_include()}/terraform.tfstate"&lt;/span&gt;
    &lt;span class="nx"&gt;region&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;   &lt;span class="c1"&gt;# State always in us-east-1 regardless of resource region&lt;/span&gt;
    &lt;span class="nx"&gt;encrypt&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="nx"&gt;dynamodb_table&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"myapp-terraform-locks"&lt;/span&gt;
    &lt;span class="nx"&gt;role_arn&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::${local.account_id}:role/OrganizationAccountAccessRole"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
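
&lt;p&gt;The &lt;code&gt;locals&lt;/code&gt; block derives everything from the child directory's path. To sanity-check what a given directory will resolve to before running Terragrunt, the same derivation can be mirrored in shell. This is a sketch that assumes paths always have the form &lt;code&gt;env/region/module&lt;/code&gt;; &lt;code&gt;derive_names&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```shell
# Mirror the root config's locals for a path relative to live/.
derive_names() {
  local rel_path="$1" env rest region region_alias
  env="${rel_path%%/*}"        # first path segment
  rest="${rel_path#*/}"
  region="${rest%%/*}"         # second path segment
  if [ "$region" = "us-east-1" ]; then
    region_alias="use1"
  else
    region_alias="usw2"        # same fallback as the HCL ternary
  fi
  printf 'env=%s\n' "$env"
  printf 'region=%s\n' "$region"
  printf 'cluster_name=myapp-%s-%s\n' "$env" "$region_alias"
  printf 'state_key=%s/terraform.tfstate\n' "$rel_path"
}
```

&lt;p&gt;For example, &lt;code&gt;derive_names production/us-east-1/eks&lt;/code&gt; reports the cluster name &lt;code&gt;myapp-production-use1&lt;/code&gt;. Note the fallback branch: like the ternary in the root config, any region other than &lt;code&gt;us-east-1&lt;/code&gt; gets the alias &lt;code&gt;usw2&lt;/code&gt;, so a third region would need an explicit mapping.&lt;/p&gt;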






&lt;h2&gt;
  
  
  VPC Module
&lt;/h2&gt;

&lt;p&gt;The VPC is the network foundation everything else sits in. Each environment gets its own VPC per region — 6 VPCs total.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CIDR allocation:
  dev     us-east-1:  10.0.0.0/16
  dev     us-west-2:  10.1.0.0/16
  staging us-east-1:  10.10.0.0/16
  staging us-west-2:  10.11.0.0/16
  prod    us-east-1:  10.20.0.0/16
  prod    us-west-2:  10.21.0.0/16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
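
&lt;p&gt;The module below carves each /16 into subnets with Terraform's &lt;code&gt;cidrsubnet(prefix, newbits, netnum)&lt;/code&gt;. The arithmetic is easy to get wrong, so here is a small sketch that reproduces it in bash for a /16 base only. &lt;code&gt;cidrsubnet16&lt;/code&gt; is a hypothetical helper, it does no range validation, and it relies on bash's &lt;code&gt;**&lt;/code&gt; operator.&lt;/p&gt;

```shell
# Print the subnet Terraform's cidrsubnet(base, newbits, netnum) returns
# when base is a.b.0.0/16. Only /16 bases are handled.
cidrsubnet16() {
  local base="$1" newbits="$2" netnum="$3" a b offset
  a="${base%%.*}"              # first octet
  b="${base#*.}"
  b="${b%%.*}"                 # second octet
  # Each subnet spans 2^(16 - newbits) of the 65536 addresses in the /16.
  offset=$(( netnum * 2 ** (16 - newbits) ))
  printf '%s.%s.%s.%s/%s\n' "$a" "$b" \
    "$(( offset / 256 ))" "$(( offset % 256 ))" "$(( 16 + newbits ))"
}
```

&lt;p&gt;For instance, &lt;code&gt;cidrsubnet16 10.0.0.0/16 8 1&lt;/code&gt; prints &lt;code&gt;10.0.1.0/24&lt;/code&gt; and &lt;code&gt;cidrsubnet16 10.0.0.0/16 5 1&lt;/code&gt; prints &lt;code&gt;10.0.8.0/21&lt;/code&gt;. The new prefix length is always the base length plus &lt;code&gt;newbits&lt;/code&gt;.&lt;/p&gt;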





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# _modules/vpc/main.tf&lt;/span&gt;

&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"vpc"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform-aws-modules/vpc/aws"&lt;/span&gt;
  &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"~&amp;gt; 5.0"&lt;/span&gt;

  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_name&lt;/span&gt;
  &lt;span class="nx"&gt;cidr&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_cidr&lt;/span&gt;

  &lt;span class="nx"&gt;azs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="s2"&gt;"${var.region}a"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"${var.region}b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"${var.region}c"&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# Public subnets — for NAT Gateways, Internet-facing ALBs&lt;/span&gt;
  &lt;span class="nx"&gt;public_subnets&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nx"&gt;cidrsubnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_cidr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;   &lt;span class="c1"&gt;# x.x.0.0/24&lt;/span&gt;
    &lt;span class="nx"&gt;cidrsubnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_cidr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;   &lt;span class="c1"&gt;# x.x.1.0/24&lt;/span&gt;
    &lt;span class="nx"&gt;cidrsubnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_cidr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;   &lt;span class="c1"&gt;# x.x.2.0/24&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# Private subnets — EKS nodes, RDS, ElastiCache&lt;/span&gt;
  &lt;span class="nx"&gt;private_subnets&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nx"&gt;cidrsubnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_cidr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;   &lt;span class="c1"&gt;# x.x.8.0/21  (2048 IPs)&lt;/span&gt;
    &lt;span class="nx"&gt;cidrsubnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_cidr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;   &lt;span class="c1"&gt;# x.x.16.0/21&lt;/span&gt;
    &lt;span class="nx"&gt;cidrsubnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_cidr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;   &lt;span class="c1"&gt;# x.x.24.0/21&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="nx"&gt;enable_nat_gateway&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;single_nat_gateway&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;single_nat_gateway&lt;/span&gt;   &lt;span class="c1"&gt;# true for dev (cost), false for prod (HA)&lt;/span&gt;
  &lt;span class="nx"&gt;enable_vpn_gateway&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="nx"&gt;enable_dns_hostnames&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;enable_dns_support&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="c1"&gt;# Required tags for AWS Load Balancer Controller to discover subnets&lt;/span&gt;
  &lt;span class="nx"&gt;public_subnet_tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"kubernetes.io/role/elb"&lt;/span&gt;                    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"1"&lt;/span&gt;
    &lt;span class="s2"&gt;"kubernetes.io/cluster/${var.cluster_name}"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"shared"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;private_subnet_tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"kubernetes.io/role/internal-elb"&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"1"&lt;/span&gt;
    &lt;span class="s2"&gt;"kubernetes.io/cluster/${var.cluster_name}"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"shared"&lt;/span&gt;
    &lt;span class="s2"&gt;"karpenter.sh/discovery"&lt;/span&gt;                    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cluster_name&lt;/span&gt;  &lt;span class="c1"&gt;# Karpenter node discovery&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;# VPC Flow Logs for network traffic auditing&lt;/span&gt;
  &lt;span class="nx"&gt;enable_flow_log&lt;/span&gt;                      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;create_flow_log_cloudwatch_log_group&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;create_flow_log_cloudwatch_iam_role&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;flow_log_max_aggregation_interval&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
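
&lt;p&gt;The subnet arithmetic is easy to get wrong, so here is a small sketch of how &lt;code&gt;cidrsubnet(prefix, newbits, netnum)&lt;/code&gt; behaves for a &lt;code&gt;/16&lt;/code&gt; VPC. The local names and the &lt;code&gt;10.20.0.0/16&lt;/code&gt; value are illustrative assumptions, not part of the module; the expressions can be checked in &lt;code&gt;terraform console&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Illustrative only: local names and the example CIDR are assumptions.
locals {
  example_vpc_cidr = "10.20.0.0/16"

  # newbits is added to the prefix length: /16 + 8 = /24 (256 addresses each)
  example_public = cidrsubnet(local.example_vpc_cidr, 8, 1) # 10.20.1.0/24

  # /16 + 3 = /19 (8192 addresses each); netnum 1 is the second /19 block
  example_private = cidrsubnet(local.example_vpc_cidr, 3, 1) # 10.20.32.0/19

  # For comparison, /16 + 5 = /21 (2048 addresses each)
  example_smaller = cidrsubnet(local.example_vpc_cidr, 5, 1) # 10.20.8.0/21
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Starting the private subnets at &lt;code&gt;netnum = 1&lt;/code&gt; skips the first block, which contains the three public &lt;code&gt;/24&lt;/code&gt; ranges, so the two sets do not overlap.&lt;/p&gt;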





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# _modules/vpc/outputs.tf&lt;/span&gt;
&lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="s2"&gt;"vpc_id"&lt;/span&gt;              &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_id&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="s2"&gt;"private_subnet_ids"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private_subnets&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="s2"&gt;"public_subnet_ids"&lt;/span&gt;  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public_subnets&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="s2"&gt;"vpc_cidr_block"&lt;/span&gt;     &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_cidr_block&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Terragrunt child config:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# live/production/us-east-1/vpc/terragrunt.hcl&lt;/span&gt;

&lt;span class="nx"&gt;include&lt;/span&gt; &lt;span class="s2"&gt;"root"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;find_in_parent_folders&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;terraform&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../../../../_modules/vpc"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;inputs&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_name&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"myapp-production-use1"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_cidr&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.20.0.0/16"&lt;/span&gt;
  &lt;span class="nx"&gt;region&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;
  &lt;span class="nx"&gt;cluster_name&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"myapp-production-use1"&lt;/span&gt;
  &lt;span class="nx"&gt;single_nat_gateway&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;   &lt;span class="c1"&gt;# HA: one NAT GW per AZ&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  KMS Module
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# _modules/kms/main.tf&lt;/span&gt;

&lt;span class="c1"&gt;# Handle the AWSServiceRoleForAutoScaling chicken-and-egg problem.&lt;/span&gt;
&lt;span class="c1"&gt;# In a fresh account this SLR doesn't exist yet, so we optionally create it.&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_service_linked_role"&lt;/span&gt; &lt;span class="s2"&gt;"autoscaling"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;create_autoscaling_slr&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="nx"&gt;aws_service_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"autoscaling.amazonaws.com"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Wait 10s for IAM to propagate before referencing it in KMS key policy&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"null_resource"&lt;/span&gt; &lt;span class="s2"&gt;"wait_for_slr"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;create_autoscaling_slr&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="nx"&gt;depends_on&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_iam_service_linked_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;autoscaling&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;provisioner&lt;/span&gt; &lt;span class="s2"&gt;"local-exec"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;command&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sleep 10"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_kms_key"&lt;/span&gt; &lt;span class="s2"&gt;"main"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;depends_on&lt;/span&gt;              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;null_resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;wait_for_slr&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.env}-${var.region}-main"&lt;/span&gt;
  &lt;span class="nx"&gt;deletion_window_in_days&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
  &lt;span class="nx"&gt;enable_key_rotation&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="nx"&gt;policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Sid&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"RootFullAccess"&lt;/span&gt;
        &lt;span class="nx"&gt;Effect&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
        &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AWS&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::${var.account_id}:root"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nx"&gt;Action&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"kms:*"&lt;/span&gt;
        &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"*"&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"AutoScalingSLR"&lt;/span&gt;
        &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
        &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;AWS&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::${var.account_id}:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nx"&gt;Action&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"kms:Encrypt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"kms:Decrypt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"kms:ReEncrypt*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="s2"&gt;"kms:GenerateDataKey*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"kms:DescribeKey"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"kms:CreateGrant"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"*"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_kms_alias"&lt;/span&gt; &lt;span class="s2"&gt;"main"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"alias/${var.env}-${var.region_alias}-main"&lt;/span&gt;
  &lt;span class="nx"&gt;target_key_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_kms_key&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;key_id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
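
&lt;p&gt;If shelling out to &lt;code&gt;sleep&lt;/code&gt; is undesirable (for example on runners without a POSIX shell), one possible alternative is the &lt;code&gt;time_sleep&lt;/code&gt; resource from the &lt;code&gt;hashicorp/time&lt;/code&gt; provider. This is a sketch of a swapped-in technique, not what the module above uses; it assumes the extra provider is acceptable and reuses the variable and resource names from the block above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Sketch only: replaces null_resource.wait_for_slr with a provider-native delay.
terraform {
  required_providers {
    time = {
      source = "hashicorp/time"
    }
  }
}

resource "time_sleep" "wait_for_slr" {
  count           = var.create_autoscaling_slr ? 1 : 0
  depends_on      = [aws_iam_service_linked_role.autoscaling]
  create_duration = "10s" # same 10-second wait for IAM propagation
}

# aws_kms_key.main would then declare:
#   depends_on = [time_sleep.wait_for_slr]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;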



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Critical lesson:&lt;/strong&gt; &lt;code&gt;AWSServiceRoleForAutoScaling&lt;/code&gt; is an account-scoped IAM entity, not region-scoped. If you're deploying to two regions in the same account, only the &lt;strong&gt;first region&lt;/strong&gt; should set &lt;code&gt;create_autoscaling_slr = true&lt;/code&gt;. The second region's KMS config uses &lt;code&gt;create_autoscaling_slr = false&lt;/code&gt; because the SLR already exists from the first apply.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# live/production/us-east-1/kms/terragrunt.hcl&lt;/span&gt;
&lt;span class="nx"&gt;include&lt;/span&gt; &lt;span class="s2"&gt;"root"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;find_in_parent_folders&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;terraform&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../../../../_modules/kms"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;inputs&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;env&lt;/span&gt;                    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production"&lt;/span&gt;
  &lt;span class="nx"&gt;region&lt;/span&gt;                 &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;
  &lt;span class="nx"&gt;region_alias&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"use1"&lt;/span&gt;
  &lt;span class="nx"&gt;account_id&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"591120834781"&lt;/span&gt;
  &lt;span class="nx"&gt;create_autoscaling_slr&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;  &lt;span class="c1"&gt;# Already created by us-west-2 first apply&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
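
&lt;p&gt;For contrast, the region that is applied first would set the flag to &lt;code&gt;true&lt;/code&gt;. The following is a sketch of what that counterpart might look like; the file path and the &lt;code&gt;usw2&lt;/code&gt; alias are assumptions that mirror the us-east-1 config.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# live/production/us-west-2/kms/terragrunt.hcl (assumed path, sketch only)
include "root" { path = find_in_parent_folders() }
terraform { source = "../../../../_modules/kms" }
inputs = {
  env                    = "production"
  region                 = "us-west-2"
  region_alias           = "usw2"         # assumed alias, mirroring "use1"
  account_id             = "591120834781"
  create_autoscaling_slr = true           # first apply in this account creates the SLR
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;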






&lt;h2&gt;
  
  
  EKS Module (overview — full detail in Part 4)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# _modules/eks/main.tf (abbreviated)&lt;/span&gt;

&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"eks"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform-aws-modules/eks/aws"&lt;/span&gt;
  &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"~&amp;gt; 20.0"&lt;/span&gt;

  &lt;span class="nx"&gt;cluster_name&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cluster_name&lt;/span&gt;
  &lt;span class="nx"&gt;cluster_version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"1.29"&lt;/span&gt;

  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;                   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_id&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_ids&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private_subnet_ids&lt;/span&gt;
  &lt;span class="nx"&gt;control_plane_subnet_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private_subnet_ids&lt;/span&gt;

  &lt;span class="c1"&gt;# Private endpoint — spokes only; dev gets public too&lt;/span&gt;
  &lt;span class="nx"&gt;cluster_endpoint_private_access&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;cluster_endpoint_public_access&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public_api&lt;/span&gt;

  &lt;span class="c1"&gt;# Must be explicit — without this the creator role can't kubectl&lt;/span&gt;
  &lt;span class="nx"&gt;enable_cluster_creator_admin_permissions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="nx"&gt;cluster_encryption_config&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;provider_key_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;kms_key_arn&lt;/span&gt;
    &lt;span class="nx"&gt;resources&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"secrets"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;eks_managed_node_groups&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;main&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;instance_types&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"t3.medium"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;min_size&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
      &lt;span class="nx"&gt;max_size&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
      &lt;span class="nx"&gt;desired_size&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

      &lt;span class="c1"&gt;# Workaround: name_prefix is limited to 38 chars.&lt;/span&gt;
      &lt;span class="c1"&gt;# Long cluster names (staging, production) overflow this limit.&lt;/span&gt;
      &lt;span class="c1"&gt;# Using explicit name bypasses the prefix (IAM name limit is 64 chars).&lt;/span&gt;
      &lt;span class="nx"&gt;iam_role_name&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.cluster_name}-node-group"&lt;/span&gt;
      &lt;span class="nx"&gt;iam_role_use_name_prefix&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  VPC Peering (for ArgoCD hub-spoke)
&lt;/h2&gt;

&lt;p&gt;ArgoCD on &lt;code&gt;myapp-production-use1&lt;/code&gt; needs to reach the API endpoints of the 5 spoke clusters. The three production and staging spokes expose private endpoints only, so VPC peering gives the hub private connectivity to them without traversing the internet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prod-use1 (10.20.0.0/16)  ←──── VPC Peering ────► prod-usw2   (10.21.0.0/16)
prod-use1 (10.20.0.0/16)  ←──── VPC Peering ────► staging-use1 (10.10.0.0/16)
prod-use1 (10.20.0.0/16)  ←──── VPC Peering ────► staging-usw2 (10.11.0.0/16)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Dev clusters use public endpoints — no VPC peering needed.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;
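
&lt;p&gt;A single hub-to-spoke link can be sketched roughly as follows. This is a hypothetical module, not the repository's actual code: the file path, the variable names other than &lt;code&gt;requester_vpc_id&lt;/code&gt; and &lt;code&gt;accepter_vpc_id&lt;/code&gt;, and the resource labels are assumptions. It relies on the &lt;code&gt;aws.staging&lt;/code&gt; provider alias defined in the generated provider file in this section.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# _modules/vpc-peering/main.tf (hypothetical sketch)

variable "requester_vpc_id" { type = string }
variable "accepter_vpc_id" { type = string }
variable "accepter_account_id" { type = string }             # assumed input
variable "accepter_cidr" { type = string }                   # assumed input
variable "requester_route_table_ids" { type = list(string) } # assumed input

# Request the peering from the hub account (default provider)
resource "aws_vpc_peering_connection" "this" {
  vpc_id        = var.requester_vpc_id
  peer_vpc_id   = var.accepter_vpc_id
  peer_owner_id = var.accepter_account_id
  auto_accept   = false # cross-account requests are accepted on the peer side
  # A cross-region link (e.g. to us-west-2) would also need peer_region.
}

# Accept the request from the spoke account
resource "aws_vpc_peering_connection_accepter" "this" {
  provider                  = aws.staging
  vpc_peering_connection_id = aws_vpc_peering_connection.this.id
  auto_accept               = true
}

# Send hub traffic bound for the spoke CIDR through the peering connection
resource "aws_route" "hub_to_spoke" {
  count                     = length(var.requester_route_table_ids)
  route_table_id            = var.requester_route_table_ids[count.index]
  destination_cidr_block    = var.accepter_cidr
  vpc_peering_connection_id = aws_vpc_peering_connection.this.id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Return routes in the spoke VPC's route tables, and security group rules allowing the hub CIDR on port 443, would be needed as well; they are omitted here for brevity.&lt;/p&gt;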

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# live/production/us-east-1/vpc-peering/terragrunt.hcl&lt;/span&gt;
&lt;span class="c1"&gt;# NOTE: vpc-peering configs CANNOT use include "root" with remote_state.&lt;/span&gt;
&lt;span class="c1"&gt;# They must define remote_state explicitly because the generate label&lt;/span&gt;
&lt;span class="c1"&gt;# conflicts with the parent. Define it inline instead.&lt;/span&gt;

&lt;span class="nx"&gt;locals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;env&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production"&lt;/span&gt;
  &lt;span class="nx"&gt;region&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;remote_state&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"s3"&lt;/span&gt;
  &lt;span class="nx"&gt;generate&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;path&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"backend.tf"&lt;/span&gt;
    &lt;span class="nx"&gt;if_exists&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"overwrite_terragrunt"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;config&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;bucket&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"myapp-terraform-state-591120834781"&lt;/span&gt;
    &lt;span class="nx"&gt;key&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production/us-east-1/vpc-peering/terraform.tfstate"&lt;/span&gt;
    &lt;span class="nx"&gt;region&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;
    &lt;span class="nx"&gt;encrypt&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="nx"&gt;dynamodb_table&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"myapp-terraform-locks"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;generate&lt;/span&gt; &lt;span class="s2"&gt;"provider"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;path&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"provider.tf"&lt;/span&gt;
  &lt;span class="nx"&gt;if_exists&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"overwrite_terragrunt"&lt;/span&gt;
  &lt;span class="nx"&gt;contents&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;-&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
    provider "aws" {
      region = "us-east-1"
      assume_role {
        role_arn = "arn:aws:iam::591120834781:role/OrganizationAccountAccessRole"
      }
    }
    # Peer VPC is in the staging account — needs its own provider alias
    provider "aws" {
      alias  = "staging"
      region = "us-east-1"
      assume_role {
        role_arn = "arn:aws:iam::STAGING_ACCOUNT_ID:role/OrganizationAccountAccessRole"
      }
    }
&lt;/span&gt;&lt;span class="no"&gt;  EOF
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;inputs&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;requester_vpc_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dependency&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;prod_use1_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_id&lt;/span&gt;
  &lt;span class="nx"&gt;accepter_vpc_id&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dependency&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;staging_use1_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_id&lt;/span&gt;
  &lt;span class="c1"&gt;# ... route table IDs, CIDR blocks, etc.&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
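
&lt;p&gt;The &lt;code&gt;inputs&lt;/code&gt; above reference &lt;code&gt;dependency.prod_use1_vpc&lt;/code&gt; and &lt;code&gt;dependency.staging_use1_vpc&lt;/code&gt;, which need matching &lt;code&gt;dependency&lt;/code&gt; blocks in the same file. A minimal sketch follows; the &lt;code&gt;config_path&lt;/code&gt; values assume the VPC configs live at &lt;code&gt;live/&amp;lt;env&amp;gt;/us-east-1/vpc&lt;/code&gt;, and the mock VPC IDs are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Sketch only: paths and mock values are assumptions.
dependency "prod_use1_vpc" {
  config_path = "../vpc" # live/production/us-east-1/vpc

  # Placeholder outputs so plan/validate can run before the VPC is applied
  mock_outputs = {
    vpc_id = "vpc-00000000000000000"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

dependency "staging_use1_vpc" {
  config_path = "../../../staging/us-east-1/vpc" # live/staging/us-east-1/vpc

  mock_outputs = {
    vpc_id = "vpc-11111111111111111"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Declaring the dependencies also lets &lt;code&gt;terragrunt run-all&lt;/code&gt; order the peering module after both VPCs.&lt;/p&gt;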






&lt;h2&gt;
  
  
  Running the Stack
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# First apply: production us-east-1 (this region creates the AutoScaling SLR)&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;live/production/us-west-2
terragrunt run-all apply &lt;span class="nt"&gt;--terragrunt-non-interactive&lt;/span&gt;

&lt;span class="c"&gt;# Second: production us-east-1 (SLR already exists, create_autoscaling_slr=false)&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;live/production/us-east-1
terragrunt run-all apply &lt;span class="nt"&gt;--terragrunt-non-interactive&lt;/span&gt;

&lt;span class="c"&gt;# Check what changed before applying&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;live/staging/us-east-1
terragrunt run-all plan

&lt;span class="c"&gt;# Destroy a specific module (e.g., for re-creating)&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;live/dev/us-east-1/eks
terragrunt destroy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  State Management Best Practices
&lt;/h2&gt;

&lt;p&gt;Each module has its own state file: &lt;code&gt;{env}/{region}/{module}/terraform.tfstate&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why not one big state file?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A corrupt or locked state file affects only one module, not the entire environment&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;terraform plan&lt;/code&gt; on EKS doesn't load/lock VPC state — faster, safer&lt;/li&gt;
&lt;li&gt;Different engineers can work on different modules concurrently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;State file key examples:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;production/us-east-1/vpc/terraform.tfstate
production/us-east-1/eks/terraform.tfstate
production/us-east-1/iam/terraform.tfstate
production/us-west-2/vpc/terraform.tfstate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
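
&lt;p&gt;Keys of this shape fall out naturally when the root config derives the key from each module's directory. A sketch of how that could look, assuming the root &lt;code&gt;terragrunt.hcl&lt;/code&gt; sits in &lt;code&gt;live/&lt;/code&gt; (the bucket and lock table names are the ones used elsewhere in this part):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# live/terragrunt.hcl (assumed location, sketch only)
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket = "myapp-terraform-state-591120834781"
    # For live/production/us-east-1/vpc this resolves to
    # "production/us-east-1/vpc/terraform.tfstate"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "myapp-terraform-locks"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;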






&lt;h2&gt;
  
  
  Common Pitfalls
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AutoScaling SLR doesn't exist&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;MalformedPolicyDocumentException&lt;/code&gt; on KMS create&lt;/td&gt;
&lt;td&gt;Set &lt;code&gt;create_autoscaling_slr = true&lt;/code&gt; in the first region and &lt;code&gt;false&lt;/code&gt; in every subsequent region&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IAM name_prefix &amp;gt; 38 chars&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ValidationError: name_prefix&lt;/code&gt; on node group create&lt;/td&gt;
&lt;td&gt;Use &lt;code&gt;iam_role_name&lt;/code&gt; + &lt;code&gt;iam_role_use_name_prefix = false&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VPC peering uses &lt;code&gt;include "root"&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;generate label already defined&lt;/code&gt; error&lt;/td&gt;
&lt;td&gt;Define &lt;code&gt;remote_state&lt;/code&gt; block explicitly in vpc-peering configs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS SG description has Unicode&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Invalid description&lt;/code&gt; on security group&lt;/td&gt;
&lt;td&gt;Use plain ASCII only in SG descriptions — no arrows (→) or greater-than (&amp;gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
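
&lt;p&gt;As an illustration of the last row, here is a hedged sketch of a security group whose descriptions stay within plain ASCII. The resource name, variables, and rule are hypothetical and only show the naming habit.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Hypothetical example: names and inputs are assumptions.
variable "vpc_id" { type = string }
variable "hub_vpc_cidr" { type = string } # e.g. "10.20.0.0/16"

resource "aws_security_group" "spoke_api" {
  name   = "spoke-api-from-hub"
  vpc_id = var.vpc_id

  # Would be rejected: "Hub → spoke API access"
  description = "Hub to spoke API access"

  ingress {
    description = "HTTPS from hub VPC"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [var.hub_vpc_cidr]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;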




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;By the end of Part 3 you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ DRY Terragrunt root config (provider + backend auto-generated from path)&lt;/li&gt;
&lt;li&gt;✅ VPC module with public/private subnets, NAT gateways, flow logs&lt;/li&gt;
&lt;li&gt;✅ KMS module handling the AutoScaling SLR chicken-and-egg problem&lt;/li&gt;
&lt;li&gt;✅ Dependency graph ensuring correct apply order&lt;/li&gt;
&lt;li&gt;✅ Per-module S3 state isolation&lt;/li&gt;
&lt;li&gt;✅ VPC peering between the production hub and the production and staging spoke VPCs&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Next: Part 4 — EKS Multi-Cluster: Six Clusters Across Two Regions&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Follow the series&lt;/strong&gt; — next part publishes next Wednesday.&lt;br&gt;
&lt;strong&gt;Live system:&lt;/strong&gt; &lt;a href="https://www.matthewoladipupo.dev/health" rel="noopener noreferrer"&gt;https://www.matthewoladipupo.dev/health&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Runbook:&lt;/strong&gt; &lt;a href="https://github.com/MatthewDipo/myapp-infra/blob/main/docs/runbook.md" rel="noopener noreferrer"&gt;Operations Guide&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Source code:&lt;/strong&gt; &lt;a href="https://github.com/MatthewDipo/myapp-infra" rel="noopener noreferrer"&gt;myapp-infra&lt;/a&gt; | &lt;a href="https://github.com/MatthewDipo/myapp-gitops" rel="noopener noreferrer"&gt;myapp-gitops&lt;/a&gt; | &lt;a href="https://github.com/MatthewDipo/myapp" rel="noopener noreferrer"&gt;myapp&lt;/a&gt;&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>devops</category>
      <category>aws</category>
      <category>infrastructureascode</category>
    </item>
    <item>
      <title>Part 2: AWS Foundation</title>
      <dc:creator>Matthew</dc:creator>
      <pubDate>Wed, 18 Mar 2026 09:00:00 +0000</pubDate>
      <link>https://dev.to/matthewdipo/part-2-aws-foundation-m5o</link>
      <guid>https://dev.to/matthewdipo/part-2-aws-foundation-m5o</guid>
      <description>&lt;h2&gt;
  
  
  Part 2: AWS Foundation — Organizations, SSO, and Account Setup
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Part of the series: Building a Production-Grade DevSecOps Pipeline on AWS&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The foundation of any serious AWS deployment is a well-structured multi-account setup. Running everything in one AWS account is the equivalent of putting all your files on a single server with no access controls — the blast radius of any mistake or breach is your entire infrastructure.&lt;/p&gt;

&lt;p&gt;In this part we set up AWS Organizations with four accounts, configure AWS IAM Identity Center (SSO) for human access, and establish the IAM trust relationships that allow GitHub Actions to deploy to our clusters without static credentials.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Multi-Account?
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────────┐
│  SINGLE ACCOUNT (anti-pattern)                                      │
│                                                                     │
│  Dev workloads ─────────────────────────────┐                       │
│  Staging workloads ─────────────────────────┤── Same IAM boundary   │
│  Production workloads ──────────────────────┘   Same VPC space      │
│                                                  Same billing       │
│  Risk: dev engineer accidentally deletes production RDS             │
│  Risk: security incident in dev reaches production secrets          │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│  MULTI-ACCOUNT (this guide)                                         │
│                                                                     │
│  Dev Account:         strong IAM isolation, cheap instance sizes    │
│  Staging Account:     production-like, but no real data             │
│  Production Account:  SCPs block destructive operations             │
│  Management Account:  no workloads, only ECR + SSO + billing        │
│                                                                     │
│  Benefit: IAM permissions are account-scoped                        │
│  Benefit: Service Control Policies (SCPs) protect production        │
│  Benefit: Separate billing per environment                          │
│  Benefit: VPC IP space per account (no CIDR conflicts)              │
└─────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 1: Create AWS Organizations
&lt;/h2&gt;

&lt;p&gt;Sign in to the AWS account you originally signed up with; it becomes the organization's management account.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AWS Console → AWS Organizations → Create Organization
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AWS Organizations gives you a single management pane for all accounts, consolidated billing, and, crucially, Service Control Policies (SCPs) that act as guardrails even for the root user of each member account. SCPs do not restrict the management account itself.&lt;/p&gt;

&lt;p&gt;After creating the organization, note your &lt;strong&gt;Organization ID&lt;/strong&gt; (format: &lt;code&gt;o-xxxxxxxxxx&lt;/code&gt;).&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Create Member Accounts
&lt;/h2&gt;

&lt;p&gt;Navigate to &lt;strong&gt;AWS Organizations → AWS Accounts → Add an AWS Account&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Create three accounts:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Account Name&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Example ID&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;myapp-dev&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Development environment&lt;/td&gt;
&lt;td&gt;&lt;code&gt;557702566877&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;myapp-staging&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Staging environment&lt;/td&gt;
&lt;td&gt;(your value)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;myapp-production&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Production environment&lt;/td&gt;
&lt;td&gt;&lt;code&gt;591120834781&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; Every AWS account needs its own unique email address. Use &lt;code&gt;+&lt;/code&gt; aliases to reuse your existing mailbox: if your email is &lt;code&gt;you@gmail.com&lt;/code&gt;, use &lt;code&gt;you+aws-dev@gmail.com&lt;/code&gt;, &lt;code&gt;you+aws-staging@gmail.com&lt;/code&gt;, and so on. Gmail (and many other providers) deliver these to the same inbox.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; Record each account ID immediately. You will reference them throughout this series.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SCREENSHOT: AWS Organizations showing all 4 accounts in their OUs&lt;/strong&gt;&lt;br&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fri4vrvqinuo5ejw3lgpx.png" alt="AWS Organizations showing all 4 accounts in their OUs" width="800" height="427"&gt;&lt;/p&gt;
&lt;/blockquote&gt;
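&lt;p&gt;The alias scheme from the note above can be generated rather than typed by hand. This is only a minimal sketch: &lt;code&gt;plus_alias&lt;/code&gt; is a hypothetical helper, and the &lt;code&gt;aws-production&lt;/code&gt; tag is an assumed name that follows the pattern of the two aliases shown in the note.&lt;/p&gt;

```python
def plus_alias(email, tag):
    """Insert +tag before the @ so mail still lands in the base inbox."""
    local, domain = email.rsplit("@", 1)
    return f"{local}+{tag}@{domain}"


# Assumed tags: the first two come from the note above, the third is illustrative.
ACCOUNT_TAGS = ["aws-dev", "aws-staging", "aws-production"]

if __name__ == "__main__":
    for tag in ACCOUNT_TAGS:
        print(plus_alias("you@gmail.com", tag))
```

&lt;p&gt;Check that your mail provider supports plus addressing before relying on it for account recovery emails.&lt;/p&gt;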

&lt;h2&gt;
  
  
  Step 3: Organize Accounts into OUs
&lt;/h2&gt;

&lt;p&gt;Organizational Units (OUs) let you apply different SCPs to groups of accounts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AWS Console → AWS Organizations → AWS Accounts → Root

Create OUs:
  Root
  ├── Management (leave the management account here)
  └── Workloads
      ├── Dev        (move myapp-dev here)
      ├── Staging    (move myapp-staging here)
      └── Production (move myapp-production here)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 4: Apply Service Control Policies
&lt;/h2&gt;

&lt;p&gt;SCPs are JSON IAM policies attached to OUs. They define the &lt;strong&gt;maximum permissions&lt;/strong&gt; any principal in that OU can ever have — even the account root user cannot exceed them.&lt;/p&gt;

&lt;p&gt;Apply this SCP to the &lt;strong&gt;Production OU&lt;/strong&gt; to prevent accidental deletion of critical resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DenyDangerousEKSOperations"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Deny"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"eks:DeleteCluster"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"eks:DeleteNodegroup"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StringNotEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"aws:PrincipalTag/AllowDestructive"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DenyRDSDelete"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Deny"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"rds:DeleteDBInstance"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"rds:DeleteDBCluster"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"RequireRegions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Deny"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"NotAction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"iam:*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"sts:*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"route53:*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"cloudfront:*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"waf:*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"acm:*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"support:*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"health:*"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StringNotEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"aws:RequestedRegion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="s2"&gt;"us-west-2"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
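&lt;p&gt;The region restriction statement ensures no resources are accidentally created outside your approved regions. The services under &lt;code&gt;NotAction&lt;/code&gt; are exempted because they are global or are served from a fixed endpoint region (IAM, STS, Route 53, CloudFront, Support, Health). ACM is a regional service; it is exempted here as well, and you can remove it from the list if you only issue certificates in the approved regions.&lt;/p&gt;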
&lt;/blockquote&gt;
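&lt;p&gt;To build intuition for how the &lt;code&gt;Deny&lt;/code&gt; statements combine &lt;code&gt;Action&lt;/code&gt;, &lt;code&gt;NotAction&lt;/code&gt;, and &lt;code&gt;StringNotEquals&lt;/code&gt;, the logic can be sketched in a few lines of Python. This is a deliberately simplified model, not AWS's real policy engine: it handles only the two statement shapes used above, omits &lt;code&gt;DenyRDSDelete&lt;/code&gt; for brevity, and the function names are hypothetical.&lt;/p&gt;

```python
import fnmatch

# Statements copied from the SCP above (DenyRDSDelete omitted for brevity).
SCP_STATEMENTS = [
    {
        "Sid": "DenyDangerousEKSOperations",
        "Action": ["eks:DeleteCluster", "eks:DeleteNodegroup"],
        "Condition": {
            "StringNotEquals": {"aws:PrincipalTag/AllowDestructive": ["true"]}
        },
    },
    {
        "Sid": "RequireRegions",
        "NotAction": ["iam:*", "sts:*", "route53:*", "cloudfront:*",
                      "waf:*", "acm:*", "support:*", "health:*"],
        "Condition": {
            "StringNotEquals": {"aws:RequestedRegion": ["us-east-1", "us-west-2"]}
        },
    },
]


def action_matches(action, patterns):
    """IAM action names are case-insensitive and may contain * wildcards."""
    return any(fnmatch.fnmatchcase(action.lower(), p.lower()) for p in patterns)


def statement_denies(statement, action, context):
    """Return True when this Deny statement applies to the request."""
    if "Action" in statement:
        in_scope = action_matches(action, statement["Action"])
    else:
        # NotAction: the statement covers everything except the listed actions.
        in_scope = not action_matches(action, statement["NotAction"])
    if not in_scope:
        return False
    for key, allowed in statement["Condition"]["StringNotEquals"].items():
        # A missing key satisfies a negated operator, so the Deny still applies.
        if context.get(key) in allowed:
            return False
    return True


def denied_by(action, context):
    """List the Sid of every statement that would deny the request."""
    return [s["Sid"] for s in SCP_STATEMENTS if statement_denies(s, action, context)]
```

&lt;p&gt;For example, &lt;code&gt;denied_by("ec2:RunInstances", {"aws:RequestedRegion": "eu-west-1"})&lt;/code&gt; returns &lt;code&gt;["RequireRegions"]&lt;/code&gt;, while the same call with &lt;code&gt;iam:CreateRole&lt;/code&gt; returns an empty list because IAM is exempted through &lt;code&gt;NotAction&lt;/code&gt;.&lt;/p&gt;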




&lt;h2&gt;
  
  
  Step 5: Configure AWS IAM Identity Center (SSO)
&lt;/h2&gt;

&lt;p&gt;AWS IAM Identity Center (formerly AWS SSO) lets your team sign in once and reach every account with short-lived credentials, which avoids creating and rotating long-lived IAM users in each account.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AWS Console → IAM Identity Center → Enable

Steps:
1. Choose identity source: "Identity Center directory" (built-in, no external IdP needed)
2. Create users for each team member
3. Create Permission Sets (these become IAM roles in each account)
4. Assign users to accounts with appropriate Permission Sets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Create two Permission Sets:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;AdministratorAccess (for platform engineers):&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Name: AdministratorAccess
Session duration: 8 hours
Managed policy: AdministratorAccess
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;ReadOnlyAccess (for developers / auditors):&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Name: ReadOnlyAccess
Session duration: 8 hours
Managed policy: ReadOnlyAccess
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Assign to accounts:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;myapp-dev:         you → AdministratorAccess
myapp-staging:     you → AdministratorAccess
myapp-production:  you → AdministratorAccess
Management:        you → AdministratorAccess
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Configure the AWS CLI for SSO:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run on your local machine&lt;/span&gt;
aws configure sso

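&lt;span class="c"&gt;# When prompted, enter:&lt;/span&gt;
&lt;span class="c"&gt;#   SSO session name: admin&lt;/span&gt;
&lt;span class="c"&gt;#   SSO start URL: https://your-id.awsapps.com/start&lt;/span&gt;
&lt;span class="c"&gt;#   SSO region: us-east-1&lt;/span&gt;
&lt;span class="c"&gt;#   SSO registration scopes: sso:account:access&lt;/span&gt;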

&lt;span class="c"&gt;# After login, name each profile:&lt;/span&gt;
&lt;span class="c"&gt;# Profile for dev/us-east-1: myapp-dev-use1&lt;/span&gt;
&lt;span class="c"&gt;# Profile for dev/us-west-2: myapp-dev-usw2&lt;/span&gt;
&lt;span class="c"&gt;# etc.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your &lt;code&gt;~/.aws/config&lt;/code&gt; will look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[sso-session admin]&lt;/span&gt;
&lt;span class="py"&gt;sso_start_url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;https://your-id.awsapps.com/start&lt;/span&gt;
&lt;span class="py"&gt;sso_region&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
&lt;span class="py"&gt;sso_registration_scopes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;sso:account:access&lt;/span&gt;

&lt;span class="nn"&gt;[profile myapp-prod-use1]&lt;/span&gt;
&lt;span class="py"&gt;sso_session&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;admin&lt;/span&gt;
&lt;span class="py"&gt;sso_account_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;591120834781&lt;/span&gt;
&lt;span class="py"&gt;sso_role_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;AdministratorAccess&lt;/span&gt;
&lt;span class="py"&gt;region&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
&lt;span class="py"&gt;output&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;json&lt;/span&gt;

&lt;span class="nn"&gt;[profile myapp-prod-usw2]&lt;/span&gt;
&lt;span class="py"&gt;sso_session&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;admin&lt;/span&gt;
&lt;span class="py"&gt;sso_account_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;591120834781&lt;/span&gt;
&lt;span class="py"&gt;sso_role_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;AdministratorAccess&lt;/span&gt;
&lt;span class="py"&gt;region&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;us-west-2&lt;/span&gt;
&lt;span class="py"&gt;output&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;json&lt;/span&gt;

&lt;span class="nn"&gt;[profile myapp-dev-use1]&lt;/span&gt;
&lt;span class="py"&gt;sso_session&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;admin&lt;/span&gt;
&lt;span class="py"&gt;sso_account_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;557702566877&lt;/span&gt;
&lt;span class="py"&gt;sso_role_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;AdministratorAccess&lt;/span&gt;
&lt;span class="py"&gt;region&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
&lt;span class="py"&gt;output&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
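&lt;p&gt;Before logging in, it can help to confirm which account and role each profile points at. The sketch below reads a config in the format shown above using only Python's standard library. &lt;code&gt;SAMPLE_CONFIG&lt;/code&gt; is a trimmed copy of that file and &lt;code&gt;list_sso_profiles&lt;/code&gt; is a hypothetical helper, not part of the AWS CLI.&lt;/p&gt;

```python
import configparser

# Trimmed copy of the ~/.aws/config shown above.
SAMPLE_CONFIG = """
[sso-session admin]
sso_start_url = https://your-id.awsapps.com/start
sso_region = us-east-1
sso_registration_scopes = sso:account:access

[profile myapp-prod-use1]
sso_session = admin
sso_account_id = 591120834781
sso_role_name = AdministratorAccess
region = us-east-1

[profile myapp-dev-use1]
sso_session = admin
sso_account_id = 557702566877
sso_role_name = AdministratorAccess
region = us-east-1
"""


def list_sso_profiles(config_text):
    """Map each SSO profile name to (account ID, role name, region)."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    profiles = {}
    for section in parser.sections():
        if not section.startswith("profile "):
            continue  # skip [sso-session ...] and [default]
        entry = parser[section]
        if "sso_account_id" not in entry:
            continue  # not an SSO-backed profile
        name = section[len("profile "):]
        profiles[name] = (
            entry["sso_account_id"],
            entry.get("sso_role_name", ""),
            entry.get("region", ""),
        )
    return profiles


if __name__ == "__main__":
    for name, details in list_sso_profiles(SAMPLE_CONFIG).items():
        print(name, *details)
```

&lt;p&gt;To inspect your real file, read &lt;code&gt;~/.aws/config&lt;/code&gt; into a string and pass it to the same function.&lt;/p&gt;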



&lt;p&gt;&lt;strong&gt;Authenticate:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws sso login &lt;span class="nt"&gt;--sso-session&lt;/span&gt; admin
&lt;span class="c"&gt;# Opens browser → log in → token saved locally for 8 hours&lt;/span&gt;

&lt;span class="c"&gt;# Test:&lt;/span&gt;
aws sts get-caller-identity &lt;span class="nt"&gt;--profile&lt;/span&gt; myapp-prod-use1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;SCREENSHOT: IAM Identity Center showing SSO portal with all accounts and permission sets&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02oyn6ra1zcetkf73vdp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02oyn6ra1zcetkf73vdp.png" alt="IAM Identity Center showing SSO portal with all accounts and permission sets" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Step 6: GitHub OIDC — No Static AWS Keys in CI
&lt;/h2&gt;

&lt;p&gt;This is one of the most important security decisions in the entire pipeline. Traditional CI/CD stores AWS access keys as GitHub Secrets. Those keys:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Never expire automatically&lt;/li&gt;
&lt;li&gt;Are as powerful as the IAM user they belong to&lt;/li&gt;
&lt;li&gt;Can be exfiltrated from logs if misconfigured&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OIDC (OpenID Connect) eliminates static keys. GitHub Actions issues a short-lived JWT for each workflow job. AWS verifies the token's signature and claims, then exchanges it for temporary STS credentials that expire on their own (after one hour by default).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────────┐
│  OIDC TOKEN FLOW                                                    │
│                                                                     │
│  GitHub Actions Job starts                                          │
│       │                                                             │
│       ▼                                                             │
│  GitHub generates JWT (signed by GitHub's OIDC provider)            │
│  Claims include:                                                    │
│    sub: repo:MatthewDipo/myapp:ref:refs/heads/main                  │
│    aud: sts.amazonaws.com                                           │
│       │                                                             │
│       ▼                                                             │
│  aws sts assume-role-with-web-identity                              │
│    --role-arn arn:aws:iam::ACCOUNT:role/ROLE                        │
│    --web-identity-token &amp;lt;JWT&amp;gt;                                       │
│       │                                                             │
│       ▼                                                             │
│  AWS validates JWT against GitHub's OIDC endpoint                   │
│  (https://token.actions.githubusercontent.com)                      │
│       │                                                             │
│       ▼                                                             │
│  Returns: AccessKeyId + SecretAccessKey + SessionToken              │
│  (valid for 1 hour by default, then automatically expire)           │
└─────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
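&lt;p&gt;The &lt;code&gt;sub&lt;/code&gt; and &lt;code&gt;aud&lt;/code&gt; claims in the flow above are what an IAM trust policy matches against. The following sketch decodes a token payload and applies the same kind of check offline. It does not verify the signature (AWS does that against GitHub's published keys), and &lt;code&gt;make_unsigned_token&lt;/code&gt; is a hypothetical helper that builds a fake token so the example runs without GitHub.&lt;/p&gt;

```python
import base64
import fnmatch
import json


def b64url_decode(segment):
    """Decode a base64url segment, restoring the padding that JWTs strip."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def decode_claims(token):
    """Return the JWT payload WITHOUT verifying the signature (inspection only)."""
    _header, payload, _signature = token.split(".")
    return json.loads(b64url_decode(payload))


def trust_policy_allows(claims, sub_pattern, audience="sts.amazonaws.com"):
    """Mimic a StringEquals check on aud and a StringLike check on sub."""
    return claims.get("aud") == audience and fnmatch.fnmatchcase(
        claims.get("sub", ""), sub_pattern
    )


def make_unsigned_token(claims):
    """Hypothetical helper: build a fake, unsigned token for offline testing."""
    def encode(obj):
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return ".".join([encode({"alg": "none"}), encode(claims), "sig"])


if __name__ == "__main__":
    token = make_unsigned_token({
        "sub": "repo:MatthewDipo/myapp:ref:refs/heads/main",
        "aud": "sts.amazonaws.com",
    })
    claims = decode_claims(token)
    print(claims["sub"])
    print(trust_policy_allows(claims, "repo:MatthewDipo/myapp:ref:refs/heads/main"))
```

&lt;p&gt;A token minted on any other branch carries a different &lt;code&gt;sub&lt;/code&gt; value, so the same check fails and the role cannot be assumed.&lt;/p&gt;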



&lt;p&gt;&lt;strong&gt;Terraform to create the OIDC provider (management account, applied from us-east-1 only, because IAM resources are global within an account):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# _modules/iam/main.tf&lt;/span&gt;

&lt;span class="c1"&gt;# Fetch GitHub's OIDC thumbprint automatically&lt;/span&gt;
&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"tls_certificate"&lt;/span&gt; &lt;span class="s2"&gt;"github"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"https://token.actions.githubusercontent.com"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_openid_connect_provider"&lt;/span&gt; &lt;span class="s2"&gt;"github"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;create_github_oidc&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

  &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"https://token.actions.githubusercontent.com"&lt;/span&gt;

  &lt;span class="nx"&gt;client_id_list&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"sts.amazonaws.com"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="nx"&gt;thumbprint_list&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tls_certificate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;github&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;certificates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;sha1_fingerprint&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# IAM role for GitHub Actions CI — one per cluster&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role"&lt;/span&gt; &lt;span class="s2"&gt;"github_ci"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.cluster_name}-github-ci"&lt;/span&gt;

  &lt;span class="nx"&gt;assume_role_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Federated&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;github_oidc_provider_arn&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sts:AssumeRoleWithWebIdentity"&lt;/span&gt;
      &lt;span class="nx"&gt;Condition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;StringEquals&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="s2"&gt;"token.actions.githubusercontent.com:aud"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sts.amazonaws.com"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nx"&gt;StringLike&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="c1"&gt;# Only main branch of your specific repo can assume this role&lt;/span&gt;
          &lt;span class="s2"&gt;"token.actions.githubusercontent.com:sub"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
            &lt;span class="s2"&gt;"repo:${var.github_user}/${var.github_repo}:ref:refs/heads/main"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role_policy"&lt;/span&gt; &lt;span class="s2"&gt;"github_ci"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"github-ci-policy"&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;github_ci&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ECRAuth"&lt;/span&gt;
        &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
        &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"ecr:GetAuthorizationToken"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"*"&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ECRPush"&lt;/span&gt;
        &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
        &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="s2"&gt;"ecr:BatchCheckLayerAvailability"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"ecr:GetDownloadUrlForLayer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"ecr:BatchGetImage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"ecr:InitiateLayerUpload"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"ecr:UploadLayerPart"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"ecr:CompleteLayerUpload"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"ecr:PutImage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"ecr:DescribeImages"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"ecr:ListImages"&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:ecr:*:${var.account_id}:repository/myapp"&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"KMSCosignSign"&lt;/span&gt;
        &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
        &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="s2"&gt;"kms:Sign"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"kms:GetPublicKey"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"kms:DescribeKey"&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cosign_kms_key_arn&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"S3AuditWrite"&lt;/span&gt;
        &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
        &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3:PutObject"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.audit_bucket_arn}/ci-push-audit/*"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key security insight:&lt;/strong&gt; The &lt;code&gt;Condition&lt;/code&gt; block in the trust policy is critical. It restricts role assumption to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Only your specific repository (&lt;code&gt;repo:MatthewDipo/myapp&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Only the &lt;code&gt;main&lt;/code&gt; branch&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A workflow running in a fork of your repository, or on any branch other than &lt;code&gt;main&lt;/code&gt;, presents a different OIDC subject claim and cannot assume this role.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 7: ECR Repository in the Management Account
&lt;/h2&gt;

&lt;p&gt;We store Docker images in the management account's ECR rather than in per-environment accounts. This gives us one place to manage image lifecycle policies and a single repository policy that grants cross-account pull permissions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# _modules/ecr/main.tf&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ecr_repository"&lt;/span&gt; &lt;span class="s2"&gt;"app"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;                 &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;image_tag_mutability&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"IMMUTABLE"&lt;/span&gt;   &lt;span class="c1"&gt;# Tags cannot be overwritten&lt;/span&gt;

  &lt;span class="nx"&gt;image_scanning_configuration&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;scan_on_push&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;   &lt;span class="c1"&gt;# ECR runs basic CVE scan on every push&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;encryption_configuration&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;encryption_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"KMS"&lt;/span&gt;
    &lt;span class="nx"&gt;kms_key&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;kms_key_arn&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ecr_lifecycle_policy"&lt;/span&gt; &lt;span class="s2"&gt;"app"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;repository&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_ecr_repository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;

  &lt;span class="nx"&gt;policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;rules&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;rulePriority&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="nx"&gt;description&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Keep last 30 tagged images"&lt;/span&gt;
        &lt;span class="nx"&gt;selection&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;tagStatus&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tagged"&lt;/span&gt;
          &lt;span class="nx"&gt;tagPrefixList&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"sha-"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
          &lt;span class="nx"&gt;countType&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"imageCountMoreThan"&lt;/span&gt;
          &lt;span class="nx"&gt;countNumber&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"expire"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;rulePriority&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
        &lt;span class="nx"&gt;description&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Delete untagged images after 1 day"&lt;/span&gt;
        &lt;span class="nx"&gt;selection&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;tagStatus&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"untagged"&lt;/span&gt;
          &lt;span class="nx"&gt;countType&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sinceImagePushed"&lt;/span&gt;
          &lt;span class="nx"&gt;countUnit&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"days"&lt;/span&gt;
          &lt;span class="nx"&gt;countNumber&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"expire"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Cross-account pull policy — allows dev/staging/prod accounts to pull images&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ecr_repository_policy"&lt;/span&gt; &lt;span class="s2"&gt;"cross_account"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;repository&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_ecr_repository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;

  &lt;span class="nx"&gt;policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"CrossAccountPull"&lt;/span&gt;
        &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
        &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;AWS&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="s2"&gt;"arn:aws:iam::${var.dev_account_id}:root"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;"arn:aws:iam::${var.staging_account_id}:root"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;"arn:aws:iam::${var.prod_account_id}:root"&lt;/span&gt;
          &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="s2"&gt;"ecr:GetDownloadUrlForLayer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"ecr:BatchGetImage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"ecr:BatchCheckLayerAvailability"&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;IMMUTABLE&lt;/code&gt; tags mean that once you push &lt;code&gt;sha-abc123&lt;/code&gt;, that tag keeps pointing to that exact image digest for as long as the image exists. ECR rejects any later push that tries to reuse the tag for a different image, so no one can silently swap the contents behind an existing tag. It is a subtle but important supply chain security control.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;SCREENSHOT: ECR repository showing images with sha- tags and scan results&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjuzifv2e29obgl1n0w8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjuzifv2e29obgl1n0w8.png" alt="ECR repository showing images with sha- tags and scan results" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
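
&lt;p&gt;The repository policy above only covers the management-account side. As a rough sketch of the other half, the role that pulls images in each environment account (for example an EKS node role) also needs identity-based permissions. The variable names &lt;code&gt;management_account_id&lt;/code&gt; and &lt;code&gt;node_role_name&lt;/code&gt; are assumptions for illustration and are not part of the modules shown here. If your node role already has the AWS-managed &lt;code&gt;AmazonEC2ContainerRegistryReadOnly&lt;/code&gt; policy attached, that likely covers this already.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Hypothetical consumer-side policy, applied in each environment account.
# var.management_account_id and var.node_role_name are assumed inputs.

resource "aws_iam_role_policy" "ecr_cross_account_pull" {
  name = "ecr-cross-account-pull"
  role = var.node_role_name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # Needed for registry login; this action cannot be scoped to a repository
        Sid      = "ECRAuth"
        Effect   = "Allow"
        Action   = ["ecr:GetAuthorizationToken"]
        Resource = "*"
      },
      {
        # Same three pull actions that the repository policy grants
        Sid    = "PullFromManagementECR"
        Effect = "Allow"
        Action = [
          "ecr:GetDownloadUrlForLayer",
          "ecr:BatchGetImage",
          "ecr:BatchCheckLayerAvailability"
        ]
        Resource = "arn:aws:ecr:*:${var.management_account_id}:repository/myapp"
      }
    ]
  })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
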




&lt;h2&gt;
  
  
  Step 8: KMS Keys for Encryption at Rest
&lt;/h2&gt;

&lt;p&gt;Each environment gets its own KMS key for encrypting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EKS Kubernetes secrets (etcd encryption)&lt;/li&gt;
&lt;li&gt;EBS volumes (Prometheus/Grafana/Velero PVCs)&lt;/li&gt;
&lt;li&gt;ECR images&lt;/li&gt;
&lt;li&gt;S3 buckets (Velero backups, CI audit logs)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# _modules/kms/main.tf&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_kms_key"&lt;/span&gt; &lt;span class="s2"&gt;"main"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.env}-${var.region}-main"&lt;/span&gt;
  &lt;span class="nx"&gt;deletion_window_in_days&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
  &lt;span class="nx"&gt;enable_key_rotation&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;   &lt;span class="c1"&gt;# Rotate annually, automatically&lt;/span&gt;

  &lt;span class="nx"&gt;policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Enable IAM User Permissions"&lt;/span&gt;
        &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
        &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AWS&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::${var.account_id}:root"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nx"&gt;Action&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"kms:*"&lt;/span&gt;
        &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"*"&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow EKS"&lt;/span&gt;
        &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
        &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;Service&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"eks.amazonaws.com"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nx"&gt;Action&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"kms:Encrypt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"kms:Decrypt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"kms:GenerateDataKey*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"kms:DescribeKey"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"*"&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Sid&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow AutoScaling"&lt;/span&gt;
        &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
        &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;AWS&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::${var.account_id}:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nx"&gt;Action&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"kms:Encrypt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"kms:Decrypt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"kms:ReEncrypt*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"kms:GenerateDataKey*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"kms:DescribeKey"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"kms:CreateGrant"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"*"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_kms_alias"&lt;/span&gt; &lt;span class="s2"&gt;"main"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"alias/${var.env}-${var.region}-main"&lt;/span&gt;
  &lt;span class="nx"&gt;target_key_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_kms_key&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;key_id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Lesson learned:&lt;/strong&gt; The &lt;code&gt;AWSServiceRoleForAutoScaling&lt;/code&gt; role must exist in the account before you can reference it in the KMS key policy. In a fresh account, create this service-linked role first; otherwise AWS rejects the key policy with &lt;code&gt;MalformedPolicyDocumentException&lt;/code&gt;. See Part 3 for the Terragrunt pattern that handles this.&lt;/p&gt;
&lt;/blockquote&gt;
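
&lt;p&gt;One way to enforce that ordering in plain Terraform is to create the service-linked role explicitly and make the key depend on it. This is only a sketch and is not the Part 3 pattern. The resource label &lt;code&gt;autoscaling&lt;/code&gt; is illustrative and the key body is trimmed. Note that this resource typically fails to create if the role already exists in the account, in which case it would need to be imported or left out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Sketch: create the Auto Scaling service-linked role before the key policy references it.
resource "aws_iam_service_linked_role" "autoscaling" {
  aws_service_name = "autoscaling.amazonaws.com"
}

resource "aws_kms_key" "main" {
  description             = "${var.env}-${var.region}-main"
  deletion_window_in_days = 30
  enable_key_rotation     = true

  # Key policy omitted here; use the policy from the module shown above.

  # Explicit ordering: the role must exist before AWS validates the key policy.
  depends_on = [aws_iam_service_linked_role.autoscaling]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
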




&lt;h2&gt;
  
  
  Step 9: Terragrunt Root Configuration
&lt;/h2&gt;

&lt;p&gt;Before writing any Terragrunt configs, establish the root &lt;code&gt;terragrunt.hcl&lt;/code&gt; that all child configs inherit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# live/terragrunt.hcl  (root)&lt;/span&gt;

&lt;span class="nx"&gt;locals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;# Parse path to extract env and region&lt;/span&gt;
  &lt;span class="c1"&gt;# e.g., live/production/us-east-1/eks → env=production, region=us-east-1&lt;/span&gt;
  &lt;span class="nx"&gt;path_parts&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;path_relative_to_include&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  &lt;span class="nx"&gt;env&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path_parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;region&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path_parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="nx"&gt;account_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;dev&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"557702566877"&lt;/span&gt;
    &lt;span class="nx"&gt;staging&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"STAGING_ACCOUNT_ID"&lt;/span&gt;
    &lt;span class="nx"&gt;production&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"591120834781"&lt;/span&gt;
    &lt;span class="nx"&gt;management&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"MGMT_ACCOUNT_ID"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;account_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;account_ids&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Generate provider.tf in every child directory&lt;/span&gt;
&lt;span class="nx"&gt;generate&lt;/span&gt; &lt;span class="s2"&gt;"provider"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;path&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"provider.tf"&lt;/span&gt;
  &lt;span class="nx"&gt;if_exists&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"overwrite_terragrunt"&lt;/span&gt;
  &lt;span class="nx"&gt;contents&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
provider "aws" {
  region = "${local.region}"
  assume_role {
    role_arn = "arn:aws:iam::${local.account_id}:role/OrganizationAccountAccessRole"
  }
  default_tags {
    tags = {
      Environment = "${local.env}"
      ManagedBy   = "Terraform"
      Project     = "myapp"
    }
  }
}
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Generate backend.tf — S3 state, DynamoDB lock table&lt;/span&gt;
&lt;span class="nx"&gt;remote_state&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"s3"&lt;/span&gt;
  &lt;span class="nx"&gt;generate&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;path&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"backend.tf"&lt;/span&gt;
    &lt;span class="nx"&gt;if_exists&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"overwrite_terragrunt"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;config&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;bucket&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"myapp-terraform-state-${local.account_id}"&lt;/span&gt;
    &lt;span class="nx"&gt;key&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${path_relative_to_include()}/terraform.tfstate"&lt;/span&gt;
    &lt;span class="nx"&gt;region&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;
    &lt;span class="nx"&gt;encrypt&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="nx"&gt;dynamodb_table&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"myapp-terraform-locks"&lt;/span&gt;
    &lt;span class="nx"&gt;role_arn&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::${local.account_id}:role/OrganizationAccountAccessRole"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;generate "provider"&lt;/code&gt; block means you never write a &lt;code&gt;provider.tf&lt;/code&gt; by hand. Every module automatically gets the correct AWS account and region based purely on its directory path. This is the key DRY benefit of Terragrunt.&lt;/p&gt;
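
&lt;p&gt;To make that concrete, here is a minimal sketch of what a child config could look like. The file path, the location of &lt;code&gt;_modules&lt;/code&gt; at the repository root, and the use of &lt;code&gt;expose = true&lt;/code&gt; to read the root locals are assumptions for illustration; the actual child configs may differ.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# live/production/us-east-1/kms/terragrunt.hcl  (hypothetical child)

# Inherit the generated provider.tf and backend.tf from the root config.
# expose = true makes the root locals readable as include.root.locals.*
include "root" {
  path   = find_in_parent_folders()
  expose = true
}

# Assumes the _modules directory sits at the repository root.
terraform {
  source = "${get_repo_root()}/_modules/kms"
}

# env, region, and account_id are derived from the directory path by the root locals.
inputs = {
  env        = include.root.locals.env
  region     = include.root.locals.region
  account_id = include.root.locals.account_id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With the root config shown above, running Terragrunt from this directory would resolve &lt;code&gt;env&lt;/code&gt; to &lt;code&gt;production&lt;/code&gt;, &lt;code&gt;region&lt;/code&gt; to &lt;code&gt;us-east-1&lt;/code&gt;, and the account ID to the &lt;code&gt;production&lt;/code&gt; entry in &lt;code&gt;account_ids&lt;/code&gt;, with no provider or backend code in the child directory.&lt;/p&gt;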




&lt;h2&gt;
  
  
  Understanding &lt;code&gt;OrganizationAccountAccessRole&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;When you create a member account via AWS Organizations, AWS automatically creates a role called &lt;code&gt;OrganizationAccountAccessRole&lt;/code&gt; in that account. This role trusts your management account, allowing management account principals to assume it and perform actions in the member account.&lt;/p&gt;

&lt;p&gt;This is how Terraform (running with management account credentials) deploys infrastructure into dev, staging, and production without needing separate credentials per account.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Management Account (your terminal / CI)
    │
    │ sts:AssumeRole
    ▼
arn:aws:iam::591120834781:role/OrganizationAccountAccessRole
    │
    │ (full AdministratorAccess in production account)
    ▼
Production Account resources
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
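
&lt;p&gt;Accounts that were invited into the organization, rather than created through it, do not get this role automatically. As a sketch of what an equivalent role looks like in Terraform, assuming a hypothetical &lt;code&gt;management_account_id&lt;/code&gt; variable and applied inside the member account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Sketch of a role equivalent to the one AWS Organizations creates.
# Only needed for invited accounts; var.management_account_id is an assumed input.
resource "aws_iam_role" "org_access" {
  name = "OrganizationAccountAccessRole"

  # Trust the management account; its own IAM policies decide who may call sts:AssumeRole.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect    = "Allow"
        Principal = { AWS = "arn:aws:iam::${var.management_account_id}:root" }
        Action    = "sts:AssumeRole"
      }
    ]
  })
}

# Matches the full administrator access shown in the diagram above.
resource "aws_iam_role_policy_attachment" "org_access_admin" {
  role       = aws_iam_role.org_access.name
  policy_arn = "arn:aws:iam::aws:policy/AdministratorAccess"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
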






&lt;h2&gt;
  
  
  Step 10: Bootstrap S3 State Buckets
&lt;/h2&gt;

&lt;p&gt;Before running any Terragrunt commands, each account needs its own S3 bucket and DynamoDB table for Terraform state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run this once; it loops over one AWS CLI profile per account (adjust the profile names to match yours)&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;PROFILE &lt;span class="k"&gt;in &lt;/span&gt;myapp-dev-use1 myapp-staging-use1 myapp-prod-use1&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;ACCOUNT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws sts get-caller-identity &lt;span class="nt"&gt;--profile&lt;/span&gt; &lt;span class="nv"&gt;$PROFILE&lt;/span&gt; &lt;span class="nt"&gt;--query&lt;/span&gt; Account &lt;span class="nt"&gt;--output&lt;/span&gt; text&lt;span class="si"&gt;)&lt;/span&gt;
  &lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;

  &lt;span class="c"&gt;# Create state bucket&lt;/span&gt;
  aws s3api create-bucket &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--bucket&lt;/span&gt; &lt;span class="s2"&gt;"myapp-terraform-state-&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ACCOUNT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--profile&lt;/span&gt; &lt;span class="nv"&gt;$PROFILE&lt;/span&gt;

  &lt;span class="c"&gt;# Enable versioning (lets you recover from bad applies)&lt;/span&gt;
  aws s3api put-bucket-versioning &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--bucket&lt;/span&gt; &lt;span class="s2"&gt;"myapp-terraform-state-&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ACCOUNT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--versioning-configuration&lt;/span&gt; &lt;span class="nv"&gt;Status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Enabled &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--profile&lt;/span&gt; &lt;span class="nv"&gt;$PROFILE&lt;/span&gt;

  &lt;span class="c"&gt;# Enable encryption&lt;/span&gt;
  aws s3api put-bucket-encryption &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--bucket&lt;/span&gt; &lt;span class="s2"&gt;"myapp-terraform-state-&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ACCOUNT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--server-side-encryption-configuration&lt;/span&gt; &lt;span class="s1"&gt;'{
      "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    }'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--profile&lt;/span&gt; &lt;span class="nv"&gt;$PROFILE&lt;/span&gt;

  &lt;span class="c"&gt;# Block public access&lt;/span&gt;
  aws s3api put-public-access-block &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--bucket&lt;/span&gt; &lt;span class="s2"&gt;"myapp-terraform-state-&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ACCOUNT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--public-access-block-configuration&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="s2"&gt;"BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--profile&lt;/span&gt; &lt;span class="nv"&gt;$PROFILE&lt;/span&gt;

  &lt;span class="c"&gt;# Create DynamoDB lock table&lt;/span&gt;
  aws dynamodb create-table &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--table-name&lt;/span&gt; myapp-terraform-locks &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--attribute-definitions&lt;/span&gt; &lt;span class="nv"&gt;AttributeName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;LockID,AttributeType&lt;span class="o"&gt;=&lt;/span&gt;S &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--key-schema&lt;/span&gt; &lt;span class="nv"&gt;AttributeName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;LockID,KeyType&lt;span class="o"&gt;=&lt;/span&gt;HASH &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--billing-mode&lt;/span&gt; PAY_PER_REQUEST &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--profile&lt;/span&gt; &lt;span class="nv"&gt;$PROFILE&lt;/span&gt;

  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"State backend ready for account &lt;/span&gt;&lt;span class="nv"&gt;$ACCOUNT_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;By the end of this part you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ AWS Organizations with four accounts (management, dev, staging, production)&lt;/li&gt;
&lt;li&gt;✅ Service Control Policies protecting production from accidental destruction&lt;/li&gt;
&lt;li&gt;✅ AWS SSO for human access (no IAM users with permanent credentials)&lt;/li&gt;
&lt;li&gt;✅ GitHub OIDC provider enabling keyless CI authentication&lt;/li&gt;
&lt;li&gt;✅ ECR repository with immutable tags, cross-account pull, and lifecycle policies&lt;/li&gt;
&lt;li&gt;✅ KMS keys for encryption at rest in every environment&lt;/li&gt;
&lt;li&gt;✅ Terragrunt root config that automatically derives account/region from directory path&lt;/li&gt;
&lt;li&gt;✅ S3 + DynamoDB Terraform state backend per account&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;If this was useful, follow me on dev.to.&lt;/strong&gt; Part 3 publishes next Wednesday and covers Infrastructure as Code: Terraform Modules + Terragrunt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Questions?&lt;/strong&gt; Drop them in the comments — I read and reply to every one.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next: Part 3 — Infrastructure as Code: Terraform Modules + Terragrunt&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Follow the series&lt;/strong&gt; — next part publishes next Wednesday.&lt;br&gt;
&lt;strong&gt;Live system:&lt;/strong&gt; &lt;a href="https://www.matthewoladipupo.dev/health" rel="noopener noreferrer"&gt;https://www.matthewoladipupo.dev/health&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Runbook:&lt;/strong&gt; &lt;a href="https://github.com/MatthewDipo/myapp-infra/blob/main/docs/runbook.md" rel="noopener noreferrer"&gt;Operations Guide&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Source code:&lt;/strong&gt; &lt;a href="https://github.com/MatthewDipo/myapp-infra" rel="noopener noreferrer"&gt;myapp-infra&lt;/a&gt; | &lt;a href="https://github.com/MatthewDipo/myapp-gitops" rel="noopener noreferrer"&gt;myapp-gitops&lt;/a&gt; | &lt;a href="https://github.com/MatthewDipo/myapp" rel="noopener noreferrer"&gt;myapp&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>security</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Part 1: Architecture Overview</title>
      <dc:creator>Matthew</dc:creator>
      <pubDate>Wed, 11 Mar 2026 09:24:20 +0000</pubDate>
      <link>https://dev.to/matthewdipo/part-1-architecture-overview-20p2</link>
      <guid>https://dev.to/matthewdipo/part-1-architecture-overview-20p2</guid>
      <description>&lt;h2&gt;
  
  
  Building a Production-Grade DevSecOps Pipeline on AWS: A Complete Guide
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Series Overview:&lt;/strong&gt; This 10-part series walks you through building a real-world, production-grade DevSecOps platform on AWS from scratch, using architecture patterns common at mature engineering organizations. By the end, you will have six EKS clusters, a GitOps delivery model, a hardened CI/CD pipeline, runtime security, full-stack observability, and automated disaster recovery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live system:&lt;/strong&gt; Everything in this series is running in production right now.&lt;br&gt;
Check it: &lt;code&gt;curl https://www.matthewoladipupo.dev/health&lt;/code&gt;&lt;br&gt;
→ &lt;code&gt;{"status":"healthy","region":"us-east-1"}&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is Part 1 of a 10-part series. You can follow the full series here on dev.to.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1: Architecture Overview &amp;amp; What We Are Building
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;Most DevSecOps tutorials show you one piece of the puzzle — a Kubernetes cluster here, a CI pipeline there. This series is different. We build the entire platform end to end: infrastructure-as-code, multi-environment clusters, GitOps, security policy enforcement, runtime threat detection, secrets management, observability, canary deployments, autoscaling, and backup — all wired together the way a production engineering team would actually build it.&lt;/p&gt;

&lt;p&gt;Every component in this guide is running live. The screenshots you will see throughout this series come from the actual deployment at &lt;code&gt;matthewoladipupo.dev&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs0wyx83zqs883ffi533o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs0wyx83zqs883ffi533o.png" alt="Production-Grade DevSecOps Pipeline: Full Architecture Overview showing 6 EKS clusters, ArgoCD hub-spoke GitOps, GitHub Actions CI/CD, and AWS security services" width="800" height="545"&gt;&lt;/a&gt;&lt;em&gt;The complete system: GitHub → CI/CD → ECR → ArgoCD hub → 6 EKS clusters across 3 environments and 2 AWS regions. Every component is covered in this 10-part series.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you will build:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Tools&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure as Code&lt;/td&gt;
&lt;td&gt;Terraform + Terragrunt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud Platform&lt;/td&gt;
&lt;td&gt;AWS (multi-account, multi-region)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Container Orchestration&lt;/td&gt;
&lt;td&gt;Amazon EKS (6 clusters)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitOps Delivery&lt;/td&gt;
&lt;td&gt;ArgoCD (hub-spoke)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI/CD Pipeline&lt;/td&gt;
&lt;td&gt;GitHub Actions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Container Security&lt;/td&gt;
&lt;td&gt;Trivy, Cosign, Distroless images&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Policy Enforcement&lt;/td&gt;
&lt;td&gt;Kyverno&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runtime Security&lt;/td&gt;
&lt;td&gt;Falco&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secrets Management&lt;/td&gt;
&lt;td&gt;AWS Secrets Manager + External Secrets Operator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitoring&lt;/td&gt;
&lt;td&gt;Prometheus + Grafana (kube-prometheus-stack)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logging&lt;/td&gt;
&lt;td&gt;Fluent Bit → AWS CloudWatch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Canary Deployments&lt;/td&gt;
&lt;td&gt;Argo Rollouts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Autoscaling&lt;/td&gt;
&lt;td&gt;Karpenter + HPA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backup &amp;amp; DR&lt;/td&gt;
&lt;td&gt;Velero + S3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web Application Firewall&lt;/td&gt;
&lt;td&gt;AWS WAF v2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Threat Detection&lt;/td&gt;
&lt;td&gt;AWS GuardDuty&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNS &amp;amp; TLS&lt;/td&gt;
&lt;td&gt;Route53 + ACM (wildcard cert)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load Balancing&lt;/td&gt;
&lt;td&gt;AWS Load Balancer Controller&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h3&gt;
  
  
  High-Level Architecture
&lt;/h3&gt;

&lt;p&gt;The platform follows a &lt;strong&gt;hub-spoke GitOps model&lt;/strong&gt; across three environments and two AWS regions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────────────────────┐
│                            AWS ORGANIZATION                                     │
│                                                                                 │
│  ┌──────────────────┐    ┌──────────────────┐    ┌──────────────────────────┐   │
│  │  Management Acct │    │  Dev Account     │    │  Staging Account         │   │
│  │  (ECR, CI Audit) │    │  557702566877    │    │                          │   │
│  │                  │    │  ┌─────────────┐ │    │  ┌────────┐ ┌────────┐   │   │
│  │  ┌───────────┐   │    │  │ EKS use1    │ │    │  │EKS use1│ │EKS usw2│   │   │
│  │  │ECR: myapp │   │    │  │ (public ep) │ │    │  │        │ │        │   │   │
│  │  └───────────┘   │    │  └─────────────┘ │    │  └────────┘ └────────┘   │   │
│  └──────────────────┘    │  ┌─────────────┐ │    └──────────────────────────┘   │
│                          │  │ EKS usw2    │ │                                   │
│                          │  │ (public ep) │ │    ┌──────────────────────────┐   │
│                          │  └─────────────┘ │    │  Production Account      │   │
│                          └──────────────────┘    │  591120834781            │   │
│                                                  │                          │   │
│                                                  │  ┌────────┐ ┌────────┐   │   │
│                                                  │  │EKS use1│ │EKS usw2│   │   │
│                                                  │  │  HUB   │ │ Spoke  │   │   │
│                                                  │  │(ArgoCD)│ │        │   │   │
│                                                  │  └────────┘ └────────┘   │   │
│                                                  └──────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Detailed Architecture Diagram
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                              DEVELOPER WORKFLOW
                              ──────────────────
                              git push → GitHub

                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                         GITHUB ACTIONS CI PIPELINE                              │
│                                                                                 │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────────┐   │
│  │  Lint +  │  │  Trivy   │  │  Docker  │  │  Cosign  │  │  Push to ECR     │   │
│  │  Test    │→ │  Scan    │→ │  Build   │→ │  Sign    │→ │  (multi-region)  │   │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘  └──────────────────┘   │
│                                                                    │            │
│                                              OIDC (no static keys) │            │
│                                              IAM Roles per cluster │            │
└────────────────────────────────────────────────────────────────────┼────────────┘
                                                                     │
                                                                     ▼
                                                             myapp-gitops repo
                                                           (image tag updated)
                                                                     │
                                                                     ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                    ARGOCD HUB  (myapp-production-use1)                          │
│                                                                                 │
│   ApplicationSets (list generators per cluster)                                 │
│   ┌─────────────────────────────────────────────────────────────────────────┐   │
│   │  environments/  infrastructure/  argocd/                                │   │
│   │  ├─ dev         ├─ monitoring    └─ project-*.yaml                      │   │
│   │  ├─ staging     ├─ logging                                              │   │
│   │  └─ production  ├─ eso                                                  │   │
│   │                 ├─ kyverno                                              │   │
│   │                 ├─ falco                                                │   │
│   │                 ├─ velero                                               │   │
│   │                 ├─ karpenter                                            │   │
│   │                 └─ argo-rollouts                                        │   │
│   └─────────────────────────────────────────────────────────────────────────┘   │
│                                                                                 │
│   Syncs to ──────────────────────────────────────────────────────────────────►  │
└──────┬──────────────────────────────────────────────────────────────────────────┘
       │  VPC Peering (private endpoints)
       ├────────────────────────────────► myapp-production-usw2
       ├────────────────────────────────► myapp-staging-use1
       ├────────────────────────────────► myapp-staging-usw2
       ├────────────────────────────────► myapp-dev-use1
       └────────────────────────────────► myapp-dev-usw2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Per-Cluster Component Stack
&lt;/h3&gt;

&lt;p&gt;Every cluster runs the same security and observability baseline. The diagram below shows what runs on each cluster after bootstrapping:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────────┐
│                    EKS CLUSTER (per-cluster stack)                   │
│                                                                      │
│  SYSTEM NAMESPACES                                                   │
│  ┌─────────────┐  ┌──────────────────┐  ┌────────────────────────┐   │
│  │  kube-system│  │  kyverno         │  │  falco                 │   │
│  │  (aws-lbc,  │  │  (policy engine) │  │  (runtime security)    │   │
│  │   coreDNS)  │  │                  │  │                        │   │
│  └─────────────┘  └──────────────────┘  └────────────────────────┘   │
│                                                                      │
│  ┌─────────────────────┐  ┌───────────────────────────────────────┐  │
│  │  external-secrets   │  │  monitoring                           │  │
│  │  (ESO operator)     │  │  Prometheus ── Grafana ── Alertmanager│  │
│  └─────────────────────┘  └───────────────────────────────────────┘  │
│                                                                      │
│  ┌───────────────────────┐  ┌──────────────────┐  ┌─────────────┐    │
│  │  logging              │  │  velero          │  │  karpenter  │    │
│  │  (Fluent Bit DS)      │  │  (backup)        │  │  (prod only)│    │
│  └───────────────────────┘  └──────────────────┘  └─────────────┘    │
│                                                                      │
│  ┌───────────────────────┐  ┌──────────────────────────────────────┐ │
│  │  argo-rollouts        │  │  myapp (prod) / myapp (dev/staging)  │ │
│  │  (canary controller)  │  │  Rollout ──► canary ──► stable       │ │
│  └───────────────────────┘  └──────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Network Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                        AWS REGION: us-east-1
┌───────────────────────────────────────────────────────────┐
│  VPC: production-use1  (10.20.0.0/16)                     │
│                                                           │
│  ┌──────────────────────────┐  ┌──────────────────────┐   │
│  │  Public Subnets          │  │  Private Subnets     │   │
│  │  10.20.0.0/24 (use1a)    │  │ 10.20.8.0/21 (use1a) │   │
│  │  10.20.1.0/24 (use1b)    │  │ 10.20.16.0/21 (use1b)│   │
│  │  10.20.2.0/24 (use1c)    │  │ 10.20.24.0/21 (use1c)│   │
│  │                          │  │                      │   │
│  │  NAT Gateways            │  │  EKS Node Groups     │   │
│  │  Internet Gateway        │  │  EKS API endpoint    │   │
│  │  ALB (internet-facing)   │  │  (private only)      │   │
│  └──────────────────────────┘  └──────────────────────┘   │
└───────────────────────────────────────────────────────────┘
          │                            │
          │ VPC Peering                │
          │ (private, encrypted)       │
          ▼                            ▼
┌────────────────────┐    ┌────────────────────────────────┐
│  VPC: prod-usw2    │    │  VPC: staging-use1             │
│  (10.21.0.0/16)    │    │  (10.10.0.0/16)                │
└────────────────────┘    └────────────────────────────────┘

Internet Traffic Flow:
User → Route53 (latency routing) → ALB → AWS WAF → ALB Target Group
     → Pod (via ALB target type: ip, directly to pod IP)
     → Response back to user
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
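
&lt;p&gt;A quick way to sanity-check a layout like this is to confirm that every subnet CIDR falls inside the VPC block. The following is a minimal shell sketch, not part of the repository: the function names &lt;code&gt;ip_to_int&lt;/code&gt;, &lt;code&gt;cidr_range&lt;/code&gt;, and &lt;code&gt;cidr_contains&lt;/code&gt; are hypothetical, and only the CIDR values are taken from the diagram above.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; CIDRs come from the diagram.

# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  local IFS=.
  set -- $1   # split the address on dots into four octets
  echo $(( $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 ))
}

# Print the first and last address (as integers) of a CIDR block.
cidr_range() {
  local base="${1%/*}" prefix="${1#*/}"
  local first size=1 bits
  first=$(ip_to_int "$base")
  bits=$prefix
  # Double the block size once per host bit (32 minus the prefix length).
  while [ "$bits" -lt 32 ]; do
    size=$(( size * 2 ))
    bits=$(( bits + 1 ))
  done
  echo "$first $(( first + size - 1 ))"
}

# Exit 0 when the subnet (second argument) lies inside the VPC (first argument).
cidr_contains() {
  set -- $(cidr_range "$1") $(cidr_range "$2")
  if [ "$3" -lt "$1" ]; then return 1; fi
  if [ "$4" -gt "$2" ]; then return 1; fi
  return 0
}

vpc="10.20.0.0/16"
for subnet in 10.20.0.0/24 10.20.1.0/24 10.20.2.0/24 \
              10.20.8.0/21 10.20.16.0/21 10.20.24.0/21; do
  if cidr_contains "$vpc" "$subnet"; then
    echo "$subnet is inside $vpc"
  else
    echo "$subnet is OUTSIDE $vpc"
  fi
done
```

&lt;p&gt;The check assumes each CIDR is written with an aligned base address, as they are in the diagram. A block from another VPC, such as &lt;code&gt;10.21.0.0/16&lt;/code&gt;, would be reported as outside.&lt;/p&gt;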






&lt;h3&gt;
  
  
  CI/CD Pipeline Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────────┐
│  GitHub: MatthewDipo/myapp                                          │
│                                                                     │
│  Developer: git push origin main                                    │
└──────────────────────────────┬──────────────────────────────────────┘
                               │ triggers
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│  GitHub Actions: .github/workflows/ci.yaml                          │
│                                                                     │
│  Job 1: lint-and-test                                               │
│    └─ npm ci &amp;amp;&amp;amp; npm test                                            │
│                                                                     │
│  Job 2: scan (needs: lint-and-test)                                 │
│    └─ trivy image --severity HIGH,CRITICAL                          │
│                                                                     │
│  Job 3: build-push-sign (needs: scan)                               │
│    ├─ OIDC → assume IAM role (no static AWS keys)                   │
│    ├─ docker build (distroless:nonroot base)                        │
│    ├─ docker push → ECR us-east-1 (management account)              │
│    ├─ docker push → ECR us-west-2 (management account)              │
│    ├─ cosign sign --key awskms:// (KMS signing key)                 │
│    └─ Write S3 audit log (image digest + timestamp)                 │
│                                                                     │
│  Job 4: update-gitops (needs: build-push-sign)                      │
│    └─ Patch myapp-gitops/apps/myapp/values-*.yaml (image.tag)       │
└──────────────────────────────┬──────────────────────────────────────┘
                               │ git push to myapp-gitops
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│  GitHub: MatthewDipo/myapp-gitops                                   │
│  ArgoCD detects diff → triggers sync per cluster                    │
│                                                                     │
│  Production: Argo Rollouts Canary                                   │
│    ├─ Step 1: setWeight 20% (canary gets 20% traffic)               │
│    ├─ Step 2: pause 5 minutes                                       │
│    ├─ Step 3: AnalysisRun (check error rate &amp;lt; 1%)                   │
│    └─ Step 4: setWeight 100% (promote to stable)                    │
│                                                                     │
│  Dev/Staging: Rolling Update (immediate)                            │
└─────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
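
&lt;p&gt;The production canary steps boil down to one decision: promote only if the canary's error rate stays below 1%. The shell sketch below illustrates that logic in isolation. It is not how Argo Rollouts is implemented or configured; the function names, the request counts, and the rollback message are all hypothetical.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Sketch only: models the promotion gate, not the real Argo Rollouts controller.

# Print "promote" or "abort" for a total-request count and an error count.
canary_gate() {
  local total="$1" errors="$2"
  # No traffic means no evidence, so treat it as a failed analysis.
  if [ "$total" -le 0 ]; then
    echo "abort"
    return
  fi
  # Integer form of "errors / total is below 1 / 100" (no floating point).
  if [ $(( errors * 100 )) -lt "$total" ]; then
    echo "promote"
  else
    echo "abort"
  fi
}

# Walk the four steps from the diagram for one simulated rollout.
run_canary() {
  local verdict
  echo "step 1: setWeight 20"
  echo "step 2: pause 5m"
  verdict=$(canary_gate "$1" "$2")
  echo "step 3: analysis $verdict"
  if [ "$verdict" = "promote" ]; then
    echo "step 4: setWeight 100"
  else
    echo "rollback: setWeight 0 (illustrative message)"
  fi
}

# 5 errors in 2000 requests is 0.25%, so this run promotes.
run_canary 2000 5
```

&lt;p&gt;Note that exactly 1% (for example, 20 errors in 2000 requests) aborts, because the gate requires the rate to be strictly below the threshold.&lt;/p&gt;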






&lt;h3&gt;
  
  
  Security Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────────────┐
│                       SECURITY LAYERS                                   │
│                                                                         │
│  Layer 1: SUPPLY CHAIN SECURITY                                         │
│  ┌────────────────────────────────────────────────────────────────────┐ │
│  │  Source: GitHub branch protection + required reviews               │ │
│  │  Build:  Trivy CVE scan (fails pipeline on HIGH/CRITICAL)          │ │
│  │  Image:  Distroless base (no shell, no package manager)            │ │
│  │  Sign:   Cosign + AWS KMS (cryptographic attestation)              │ │
│  │  Verify: Kyverno policy blocks unsigned images at admission        │ │
│  └────────────────────────────────────────────────────────────────────┘ │
│                                                                         │
│  Layer 2: INFRASTRUCTURE SECURITY                                       │
│  ┌────────────────────────────────────────────────────────────────────┐ │
│  │  Network:  Private EKS endpoints (staging/prod)                    │ │
│  │            VPC peering (no internet traversal between clusters)    │ │
│  │            NetworkPolicies (deny-all default, allow explicitly)    │ │
│  │  IAM:      IRSA (pod-level IAM, no node-level credentials)         │ │
│  │            OIDC for GitHub Actions (no static AWS keys in CI)      │ │
│  │  Secrets:  AWS Secrets Manager (never stored in Git)               │ │
│  │  KMS:      Envelope encryption for EKS secrets + ECR + S3          │ │
│  └────────────────────────────────────────────────────────────────────┘ │
│                                                                         │
│  Layer 3: WORKLOAD ADMISSION CONTROL (Kyverno)                          │
│  ┌────────────────────────────────────────────────────────────────────┐ │
│  │  ✗ Block privileged containers                                     │ │
│  │  ✗ Block hostPath volume mounts                                    │ │
│  │  ✗ Block containers running as root (uid=0)                        │ │
│  │  ✗ Block images without valid Cosign signature                     │ │
│  │  ✗ Block missing resource limits                                   │ │
│  │  ✓ Allow myapp from ECR with valid signature                       │ │
│  └────────────────────────────────────────────────────────────────────┘ │
│                                                                         │
│  Layer 4: RUNTIME THREAT DETECTION (Falco)                              │
│  ┌────────────────────────────────────────────────────────────────────┐ │
│  │  Alert on: shell spawned in container                              │ │
│  │  Alert on: sensitive file read (/etc/shadow, /etc/passwd)          │ │
│  │  Alert on: unexpected outbound network connection                  │ │
│  │  Alert on: privilege escalation attempts                           │ │
│  │  Output: CloudWatch Logs (via Fluent Bit)                          │ │
│  └────────────────────────────────────────────────────────────────────┘ │
│                                                                         │
│  Layer 5: PERIMETER SECURITY                                            │
│  ┌────────────────────────────────────────────────────────────────────┐ │
│  │  AWS WAF v2: Managed rules (OWASP Top 10, SQL injection, XSS)      │ │
│  │  AWS GuardDuty: Account-level threat intelligence                  │ │
│  │  ACM: TLS 1.2+ enforced, HTTP → HTTPS redirect                     │ │
│  └────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  GitOps Repository Structure
&lt;/h3&gt;

&lt;p&gt;Two GitHub repositories drive everything after the CI pipeline builds the image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MatthewDipo/myapp-gitops/
│
├── argocd/
│   ├── project-dev.yaml             # AppProject: dev clusters
│   ├── project-staging.yaml         # AppProject: staging clusters
│   └── project-production.yaml      # AppProject: production clusters
│
├── environments/
│   ├── dev/
│   │   └── applicationset.yaml      # Deploys myapp to dev-use1, dev-usw2
│   ├── staging/
│   │   └── applicationset.yaml      # Deploys myapp to staging-use1, staging-usw2
│   └── production/
│       └── applicationset.yaml      # Deploys myapp to prod-use1, prod-usw2
│
├── infrastructure/
│   ├── monitoring/
│   │   ├── applicationset.yaml      # kube-prometheus-stack (4 clusters)
│   │   ├── prometheus-values.yaml
│   │   └── alert-rules/
│   │       └── applicationset.yaml  # PrometheusRule CRDs
│   ├── logging/
│   │   └── applicationset.yaml      # Fluent Bit DaemonSet (6 clusters)
│   ├── eso/
│   │   └── applicationset.yaml      # External Secrets Operator (6 clusters)
│   ├── kyverno/
│   │   └── applicationset.yaml      # Kyverno + policies (6 clusters)
│   ├── falco/
│   │   └── applicationset.yaml      # Falco DaemonSet (6 clusters)
│   ├── velero/
│   │   └── applicationset.yaml      # Velero (6 clusters)
│   ├── karpenter/
│   │   ├── applicationset.yaml      # Karpenter controller (2 prod)
│   │   └── nodepools/
│   │       └── applicationset.yaml  # NodePool + EC2NodeClass CRDs
│   └── argo-rollouts/
│       └── applicationset.yaml      # Argo Rollouts controller (2 prod)
│
└── apps/
    └── myapp/                       # Helm chart for the application
        ├── Chart.yaml
        ├── values.yaml              # Default values
        ├── values-dev.yaml
        ├── values-staging.yaml
        ├── values-production.yaml
        └── templates/
            ├── deployment.yaml      # Deployment OR Rollout (conditional)
            ├── service.yaml
            ├── service-canary.yaml  # Canary service (prod only)
            ├── ingress.yaml
            ├── hpa.yaml
            ├── networkpolicy.yaml
            ├── serviceaccount.yaml
            ├── external-secret.yaml
            ├── servicemonitor.yaml  # Prometheus scraping
            └── analysis-template.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Infrastructure Repository Structure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MatthewDipo/myapp-infra/
│
├── _modules/                        # Reusable Terraform modules
│   ├── vpc/
│   ├── eks/
│   ├── kms/
│   ├── iam/
│   ├── ecr/
│   ├── waf/
│   ├── guardduty/
│   ├── eso-irsa/
│   ├── fluent-bit-irsa/
│   ├── karpenter/
│   └── velero/
│
└── live/                            # Terragrunt configurations (per env/region)
    ├── terragrunt.hcl               # Root config (provider, remote state)
    ├── dev/
    │   ├── us-east-1/
    │   │   ├── vpc/terragrunt.hcl
    │   │   ├── kms/terragrunt.hcl
    │   │   ├── eks/terragrunt.hcl
    │   │   ├── iam/terragrunt.hcl
    │   │   └── fluent-bit-irsa/terragrunt.hcl
    │   └── us-west-2/
    │       └── ... (mirror of use1)
    ├── staging/
    │   ├── us-east-1/
    │   │   ├── vpc/ kms/ eks/ iam/
    │   │   ├── waf/terragrunt.hcl
    │   │   ├── guardduty/terragrunt.hcl
    │   │   ├── eso-irsa/terragrunt.hcl
    │   │   └── fluent-bit-irsa/terragrunt.hcl
    │   └── us-west-2/
    │       └── ...
    └── production/
        ├── us-east-1/
        │   ├── vpc/ kms/ eks/ iam/
        │   ├── waf/
        │   ├── guardduty/
        │   ├── eso-irsa/
        │   ├── fluent-bit-irsa/
        │   ├── karpenter/
        │   └── velero/
        └── us-west-2/
            └── ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
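
&lt;p&gt;Because the &lt;code&gt;live/&lt;/code&gt; tree encodes environment and region in the directory path, those values can be derived rather than repeated in every config. Terragrunt has built-in helpers such as &lt;code&gt;path_relative_to_include()&lt;/code&gt; for this; how the root &lt;code&gt;terragrunt.hcl&lt;/code&gt; in this repository does it is not shown here. The shell sketch below only mimics the idea, and &lt;code&gt;parse_live_path&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Sketch only: mimics path-based derivation in plain shell, not Terragrunt HCL.

# Split live/ENV/REGION/MODULE into its three parts.
parse_live_path() {
  local rel="${1#live/}"   # drop the leading "live/"
  local IFS=/
  set -- $rel              # split into environment, region, module
  echo "env=$1 region=$2 module=$3"
}

parse_live_path "live/production/us-east-1/eks"
```

&lt;p&gt;For the path above, the function prints &lt;code&gt;env=production region=us-east-1 module=eks&lt;/code&gt;.&lt;/p&gt;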






&lt;h3&gt;
  
  
  AWS Account Structure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────┐
│         AWS Organizations Root               │
│                                              │
│  ┌───────────────────────────────────────┐   │
│  │  Management / Root Account            │   │
│  │  • AWS SSO (Identity Center)          │   │
│  │  • ECR repositories (shared)          │   │
│  │  • S3 CI audit bucket                 │   │
│  │  • GitHub OIDC provider               │   │
│  └───────────────────────────────────────┘   │
│           │            │            │        │
│           ▼            ▼            ▼        │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐    │
│  │   Dev    │  │ Staging  │  │Production│    │
│  │ Account  │  │ Account  │  │ Account  │    │
│  │          │  │          │  │          │    │
│  │ 2x EKS   │  │ 2x EKS   │  │ 2x EKS   │    │
│  │ VPCs     │  │ VPCs     │  │ VPCs     │    │
│  │ KMS keys │  │ KMS keys │  │ KMS keys │    │
│  │ Secrets  │  │ Secrets  │  │ Secrets  │    │
│  │ Manager  │  │ Manager  │  │ Manager  │    │
│  └──────────┘  └──────────┘  └──────────┘    │
└──────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Technology Decision Rationale
&lt;/h3&gt;

&lt;p&gt;Understanding WHY each tool was chosen matters as much as knowing HOW to configure it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terragrunt over plain Terraform&lt;/strong&gt;&lt;br&gt;
Terragrunt keeps configuration DRY (Don't Repeat Yourself). Without it, you would have near-identical &lt;code&gt;provider&lt;/code&gt;, &lt;code&gt;backend&lt;/code&gt;, and &lt;code&gt;module&lt;/code&gt; blocks repeated across 18+ directories. Terragrunt's &lt;code&gt;include&lt;/code&gt; and &lt;code&gt;dependency&lt;/code&gt; blocks eliminate most of that duplication while keeping each environment's overrides explicit and auditable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ArgoCD hub-spoke over one ArgoCD instance per cluster&lt;/strong&gt;&lt;br&gt;
Running ArgoCD on every cluster is operationally expensive. In the hub-spoke model, one ArgoCD installation manages all six clusters over VPC peering. This single pane of glass simplifies debugging: you see the state of every cluster in one place.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kyverno over OPA/Gatekeeper&lt;/strong&gt;&lt;br&gt;
Kyverno policies are written in YAML and operate on the same resource schema as Kubernetes objects. OPA/Gatekeeper requires learning Rego, a purpose-built policy language. For Kubernetes-native teams, Kyverno is faster to adopt and maintain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External Secrets Operator over Sealed Secrets&lt;/strong&gt;&lt;br&gt;
Sealed Secrets encrypts secrets and commits them to Git — meaning the encrypted value is your source of truth. ESO keeps secrets out of Git entirely: the secret lives in AWS Secrets Manager, and ESO fetches it at runtime with IRSA credentials. This is a fundamentally stronger posture because a compromised Git repo never exposes secret material.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Argo Rollouts over plain Kubernetes rolling updates&lt;/strong&gt;&lt;br&gt;
Rolling updates are binary: you either roll forward or roll back. Argo Rollouts adds weighted traffic splitting between stable and canary versions, analysis runs (automated, metric-based promotion gates), and pause steps for manual inspection. A canary deployment that automatically rolls back on a rising error rate is far safer than a rolling update you monitor manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distroless base images&lt;/strong&gt;&lt;br&gt;
The Google distroless &lt;code&gt;nonroot&lt;/code&gt; image contains only the application runtime and its direct dependencies — no shell (&lt;code&gt;sh&lt;/code&gt;, &lt;code&gt;bash&lt;/code&gt;), no package manager (&lt;code&gt;apt&lt;/code&gt;, &lt;code&gt;apk&lt;/code&gt;), no &lt;code&gt;curl&lt;/code&gt; or &lt;code&gt;wget&lt;/code&gt;. If an attacker achieves code execution inside the container, they have almost no tools available to escalate or exfiltrate. Combined with Falco alerting on shell spawning, you get both prevention and detection.&lt;/p&gt;


&lt;h3&gt;
  
  
  Cost Estimate
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;These are rough estimates for running the full stack 24/7 in AWS us-east-1 + us-west-2. Production workloads should be sized to actual usage.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Approx. Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;6x EKS cluster control planes&lt;/td&gt;
&lt;td&gt;~$438 ($0.10/hr × 730 hr × 6)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12x EC2 t3.medium nodes (2 per cluster)&lt;/td&gt;
&lt;td&gt;~$365 ($0.0416/hr × 730 hr × 12, on-demand)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6x EBS volumes (gp2, 50–100GB each)&lt;/td&gt;
&lt;td&gt;~$60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NAT Gateways (2 per VPC, 6 VPCs)&lt;/td&gt;
&lt;td&gt;~$395 ($0.045/hr × 730 hr × 12, before data processing)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ALBs (production + monitoring)&lt;/td&gt;
&lt;td&gt;~$30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Route53 hosted zone + queries&lt;/td&gt;
&lt;td&gt;~$5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ACM (free)&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECR storage&lt;/td&gt;
&lt;td&gt;~$5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch logs (Fluent Bit)&lt;/td&gt;
&lt;td&gt;~$15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3 (Velero backups, CI audit)&lt;/td&gt;
&lt;td&gt;~$10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS WAF&lt;/td&gt;
&lt;td&gt;~$15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GuardDuty&lt;/td&gt;
&lt;td&gt;~$10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secrets Manager&lt;/td&gt;
&lt;td&gt;~$2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$1,350/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
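&lt;p&gt;The three largest line items are billed by the hour, so they are easy to recompute yourself. The sketch below assumes 730 hours per month and on-demand rates of $0.10/hr per EKS control plane, $0.0416/hr per &lt;code&gt;t3.medium&lt;/code&gt;, and $0.045/hr per NAT Gateway. These rates are assumptions that should be checked against the current AWS pricing pages, and the function names are hypothetical.&lt;/p&gt;

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month

# Assumed on-demand hourly rates in USD; verify against current AWS pricing.
EKS_CONTROL_PLANE_RATE = 0.10
T3_MEDIUM_RATE = 0.0416
NAT_GATEWAY_RATE = 0.045


def monthly_cost(hourly_rate, count, hours=HOURS_PER_MONTH):
    """Monthly cost in USD of `count` resources billed at `hourly_rate`."""
    return round(hourly_rate * count * hours, 2)


def hourly_line_items(clusters=6, nodes_per_cluster=2, nat_per_vpc=2):
    """Estimate the hourly-billed items, assuming one VPC per cluster."""
    return {
        "eks_control_planes": monthly_cost(EKS_CONTROL_PLANE_RATE, clusters),
        "ec2_nodes": monthly_cost(T3_MEDIUM_RATE, clusters * nodes_per_cluster),
        "nat_gateways": monthly_cost(NAT_GATEWAY_RATE, clusters * nat_per_vpc),
    }


if __name__ == "__main__":
    for name, cost in hourly_line_items().items():
        print(name, cost)
```

&lt;p&gt;With the defaults, these items come to roughly $438, $364, and $394 per month, before NAT data-processing charges. Passing &lt;code&gt;clusters=3, nodes_per_cluster=1&lt;/code&gt; models a single-region demo setup, although it still prices nodes at the &lt;code&gt;t3.medium&lt;/code&gt; rate.&lt;/p&gt;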

&lt;blockquote&gt;
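&lt;p&gt;&lt;strong&gt;Cost reduction tip:&lt;/strong&gt; For a demo/learning setup, use 1 node per cluster, &lt;code&gt;t3.small&lt;/code&gt; instances, and skip the second region. This at least halves the EKS, EC2, and NAT Gateway line items above. Note that the three remaining EKS control planes alone cost about $219/month at $0.10/hr each, so a bill in the $200–300/month range is only reachable if you also run fewer clusters.&lt;/p&gt;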
&lt;/blockquote&gt;


&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Before starting Part 2, ensure you have the following:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools to install:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# AWS CLI v2&lt;/span&gt;
curl &lt;span class="s2"&gt;"https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s2"&gt;"awscliv2.zip"&lt;/span&gt;
unzip awscliv2.zip &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo&lt;/span&gt; ./aws/install

&lt;span class="c"&gt;# Terraform 1.6+&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;terraform   &lt;span class="c"&gt;# or download from terraform.io&lt;/span&gt;

&lt;span class="c"&gt;# Terragrunt 0.54+&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;terragrunt  &lt;span class="c"&gt;# or download from terragrunt.gruntwork.io&lt;/span&gt;

&lt;span class="c"&gt;# kubectl&lt;/span&gt;
curl &lt;span class="nt"&gt;-LO&lt;/span&gt; &lt;span class="s2"&gt;"https://dl.k8s.io/release/&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-sL&lt;/span&gt; https://dl.k8s.io/release/stable.txt&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;/bin/linux/amd64/kubectl"&lt;/span&gt;
&lt;span class="nb"&gt;sudo install&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; root &lt;span class="nt"&gt;-g&lt;/span&gt; root &lt;span class="nt"&gt;-m&lt;/span&gt; 0755 kubectl /usr/local/bin/kubectl

&lt;span class="c"&gt;# Helm 3&lt;/span&gt;
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

&lt;span class="c"&gt;# ArgoCD CLI&lt;/span&gt;
curl &lt;span class="nt"&gt;-sSL&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; argocd https://github.com/argoproj/argo-cd/releases/latest/download/argocd-linux-amd64
&lt;span class="nb"&gt;sudo install&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; 555 argocd /usr/local/bin/argocd

&lt;span class="c"&gt;# Cosign&lt;/span&gt;
curl &lt;span class="nt"&gt;-sSfL&lt;/span&gt; https://github.com/sigstore/cosign/releases/latest/download/cosign-linux-amd64 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-o&lt;/span&gt; cosign &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo install&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; 0755 cosign /usr/local/bin/cosign

&lt;span class="c"&gt;# Trivy&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;wget apt-transport-https gnupg lsb-release
wget &lt;span class="nt"&gt;-qO&lt;/span&gt; - https://aquasecurity.github.io/trivy-repo/deb/public.key | &lt;span class="nb"&gt;sudo &lt;/span&gt;apt-key add -
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"deb https://aquasecurity.github.io/trivy-repo/deb &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;lsb_release &lt;span class="nt"&gt;-sc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt; main"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /etc/apt/sources.list.d/trivy.list
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;trivy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;AWS accounts needed:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An AWS Organizations management account&lt;/li&gt;
&lt;li&gt;Three member accounts: dev, staging, production&lt;/li&gt;
&lt;li&gt;AWS SSO (IAM Identity Center) configured&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Two repositories: &lt;code&gt;myapp&lt;/code&gt; (application code) and &lt;code&gt;myapp-gitops&lt;/code&gt; (manifests)&lt;/li&gt;
&lt;li&gt;A Personal Access Token with read access to the GitOps repo, so ArgoCD can pull manifests from it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Domain name:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A registered domain you control (we use &lt;code&gt;matthewoladipupo.dev&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Nameservers pointed to Route53&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Series Roadmap
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Part&lt;/th&gt;
&lt;th&gt;Title&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Part 1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Architecture Overview (this article)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Part 2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AWS Foundation: Organizations, SSO, and Account Setup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Part 3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Infrastructure as Code: Terraform Modules + Terragrunt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Part 4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;EKS Multi-Cluster: Six Clusters Across Two Regions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Part 5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GitOps with ArgoCD: Hub-Spoke Model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Part 6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CI/CD Pipeline: GitHub Actions, Trivy, Cosign, ECR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Part 7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Secrets Management: AWS Secrets Manager + ESO + IRSA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Part 8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Security Stack: Kyverno, Falco, WAF, GuardDuty&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Part 9&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Observability: Prometheus, Grafana, Fluent Bit, CloudWatch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Part 10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Resilience: Karpenter, HPA, Argo Rollouts, Velero&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Appendix&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Live System Output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Runbook&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Operations Guide&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  Screenshots
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ArgoCD UI showing all 6 clusters registered and all ApplicationSets synced&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx4bvv9xqr6w1fcyb3ck.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx4bvv9xqr6w1fcyb3ck.png" alt="ArgoCD UI showing all 6 clusters registered and all ApplicationSets synced" width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Grafana dashboard: node CPU/memory overview across production clusters&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdz9aks78837grslqzru4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdz9aks78837grslqzru4.png" alt="Grafana dashboard — Node CPU/Memory overview across production clusters" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;GitHub Actions workflow showing all steps passing (lint → scan → build → sign → push → gitops update)&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ylgumnafddwkd6jmqyc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ylgumnafddwkd6jmqyc.png" alt="GitHub Actions workflow showing all steps passing (lint → scan → build → sign → push → gitops update)" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;AWS ECR showing signed image with cosign attestation tag&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4l8a7we3a8zkfthkivi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4l8a7we3a8zkfthkivi.png" alt="AWS ECR showing signed image with cosign attestation tag" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;strong&gt;If this was useful, follow me on dev.to&lt;/strong&gt; — I will publish Part 2 next Wednesday covering the AWS Organizations + IAM Identity Center setup.&lt;br&gt;
&lt;strong&gt;Questions?&lt;/strong&gt; Drop them in the comments — I read and reply to every one.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next up: Part 2 — AWS Foundation: Organizations, SSO, and Account Setup&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Source code:&lt;/strong&gt; &lt;a href="https://github.com/MatthewDipo/myapp-infra" rel="noopener noreferrer"&gt;myapp-infra&lt;/a&gt; | &lt;a href="https://github.com/MatthewDipo/myapp-gitops" rel="noopener noreferrer"&gt;myapp-gitops&lt;/a&gt; | &lt;a href="https://github.com/MatthewDipo/myapp" rel="noopener noreferrer"&gt;myapp&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Runbook:&lt;/strong&gt; &lt;a href="https://github.com/MatthewDipo/myapp-infra/blob/main/docs/runbook.md" rel="noopener noreferrer"&gt;Operations Guide&lt;/a&gt; — every operational procedure for this pipeline.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>aws</category>
      <category>terraform</category>
    </item>
  </channel>
</rss>
