<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: alok shankar</title>
    <description>The latest articles on DEV Community by alok shankar (@alok_shankar).</description>
    <link>https://dev.to/alok_shankar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3173851%2F29285227-7afb-4815-bbd9-2ea7c8b4b6ba.jpg</url>
      <title>DEV Community: alok shankar</title>
      <link>https://dev.to/alok_shankar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alok_shankar"/>
    <language>en</language>
    <item>
      <title>EKS Ingress Address Not Assigned (Application Outage)-Incident &amp; Resolution Guide</title>
      <dc:creator>alok shankar</dc:creator>
      <pubDate>Thu, 16 Apr 2026 06:06:29 +0000</pubDate>
      <link>https://dev.to/alok_shankar/eks-ingress-address-not-assigned-application-outage-incident-resolution-guide-18ed</link>
      <guid>https://dev.to/alok_shankar/eks-ingress-address-not-assigned-application-outage-incident-resolution-guide-18ed</guid>
      <description>&lt;p&gt;&lt;strong&gt;1. Introduction&lt;/strong&gt;&lt;br&gt;
In Kubernetes, applications are typically exposed internally using Services (ClusterIP, NodePort). However, for exposing applications externally in a scalable, secure, and cloud‑native manner, Kubernetes provides the concept of Ingress.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Ingress?&lt;/strong&gt;&lt;br&gt;
Ingress is a Kubernetes API object that manages external HTTP/HTTPS access to services within a cluster. It provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Layer‑7 routing (path‑based, host‑based)&lt;/li&gt;
&lt;li&gt;TLS termination&lt;/li&gt;
&lt;li&gt;Centralized traffic management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ingress works in conjunction with an Ingress Controller, which implements the actual traffic routing logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Ingress instead of NodePort / LoadBalancer?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fty5ief2jnclzjax27o62.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fty5ief2jnclzjax27o62.png" alt=" " width="594" height="187"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;In AWS EKS, the recommended production approach is Ingress with AWS Application Load Balancer (ALB) using the AWS Load Balancer Controller.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incident Overview&lt;/strong&gt;&lt;br&gt;
I encountered an issue in the EKS environment where an application became inaccessible from outside the cluster. &lt;br&gt;
Although the application pods were running and Kubernetes services were healthy, external users were unable to access the application URL.&lt;/p&gt;

&lt;p&gt;Upon investigation, it was observed that the Ingress resource was created successfully, but the ADDRESS field of the Ingress remained empty (null). &lt;br&gt;
As a result, no valid Load Balancer endpoint was available to route external traffic to the application.&lt;br&gt;
This issue closely resembled a production outage scenario, as it directly impacted external traffic routing despite the application itself being operational.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-xx-xxx-xxx-xx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl get ingress app-ingress &lt;span class="nt"&gt;-n&lt;/span&gt; ep-apps &lt;span class="nt"&gt;-o&lt;/span&gt; wide
&lt;span class="go"&gt;NAME          CLASS   HOSTS   ADDRESS   PORTS   AGE
app-ingress   alb     *                 80      10h

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Impact&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;External users could not access the application.&lt;/li&gt;
&lt;li&gt;No ALB DNS was available from the Ingress.&lt;/li&gt;
&lt;li&gt;Target Group showed 0 registered targets.&lt;/li&gt;
&lt;li&gt;Application health appeared normal internally, which made the issue non-obvious at first glance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Use Case&lt;/strong&gt;&lt;br&gt;
Business Scenario&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Application deployed in EKS.&lt;/li&gt;
&lt;li&gt;Needs to be exposed externally over HTTPS.&lt;/li&gt;
&lt;li&gt;Uses path-based routing.&lt;/li&gt;
&lt;li&gt;Requires container-level health checks.&lt;/li&gt;
&lt;/ol&gt;
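
&lt;p&gt;To make the scenario concrete, a minimal Ingress manifest for this kind of setup might look like the sketch below. This is an illustrative example only: the names, namespace, service port, and health check path are placeholders, not the exact values from this environment.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  namespace: ep-apps
  annotations:
    # ALB placement and target registration mode
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    # Listen on both HTTP and HTTPS
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
    # Container-level health check path (placeholder)
    alb.ingress.kubernetes.io/healthcheck-path: /api/health
spec:
  ingressClassName: alb
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app
            port:
              number: 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;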

&lt;p&gt;&lt;strong&gt;2. Architecture Overview&lt;/strong&gt;&lt;br&gt;
High-Level Flow&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xtgbbmvmhhff04mxklm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xtgbbmvmhhff04mxklm.png" alt=" " width="303" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Timeline of Events:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The application was deployed successfully in the EKS cluster.&lt;/li&gt;
&lt;li&gt;Pods were in Running state and passing readiness and liveness probes.&lt;/li&gt;
&lt;li&gt;Kubernetes Service (ClusterIP) showed valid endpoints.&lt;/li&gt;
&lt;li&gt;An ALB-backed Ingress was created to expose the application externally.&lt;/li&gt;
&lt;li&gt;Despite successful Ingress creation, the Ingress ADDRESS field remained empty.&lt;/li&gt;
&lt;li&gt;AWS Console showed an ALB and Target Group created, but the Target Group had zero registered targets.&lt;/li&gt;
&lt;li&gt;Because the Ingress did not publish an ADDRESS, application traffic could not reach the cluster.&lt;/li&gt;
&lt;li&gt;This resulted in an outage-like situation where the application was “up” internally but unreachable externally.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;4. Initial Observation&lt;/strong&gt;&lt;br&gt;
At a high level, everything appeared correct:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pods were healthy.
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-xx-xxx-xxx-xx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl get pods &lt;span class="nt"&gt;-A&lt;/span&gt;
&lt;span class="go"&gt;NAMESPACE           NAME                                                              READY   STATUS    RESTARTS   AGE
amazon-cloudwatch   amazon-cloudwatch-observability-controller-manager-586c44c2cclk   1/1     Running   0          7h6m
amazon-cloudwatch   cloudwatch-agent-xxx                                              1/1     Running   0          6h41m
amazon-cloudwatch   cloudwatch-agent-xxxx                                             1/1     Running   0          6h41m
amazon-cloudwatch   fluent-bit-xxxx                                                   1/1     Running   0          6h41m
amazon-cloudwatch   fluent-bit-xxxx                                                   1/1     Running   0          6h41m
external-dns        external-dns-75f7b59749-dfkgn                                     1/1     Running   0          24h
ep-apps             condition-service-96475888c-bdmdn                                 1/1     Running   0          22h
ep-apps             web-query-service-78b5d4dcb7-nms56                                1/1     Running   0          23h
ep-apps             web-query-service-78b5d4dcb7-xlfj9                                1/1     Running   0          23h
ep-apps             web-apps-59658b6868-fkwvp                                         1/1     Running   0          22h
kube-system         aws-node-4xrsc                                                    2/2     Running   0          24h
kube-system         aws-load-balancer-controller-78bddb649b-w56d5                     1/1     Running   0          24h
kube-system         aws-load-balancer-controller-78bddb649b-z5s5g                     1/1     Running   0          24h
kube-system         aws-node-ncp5f                                                    2/2     Running   0          24h
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;Service endpoints existed.&lt;/li&gt;
&lt;li&gt;Ingress configuration looked valid.&lt;/li&gt;
&lt;li&gt;ALB resources were present in AWS.&lt;/li&gt;
&lt;li&gt;However, traffic was not flowing due to the missing Ingress ADDRESS, indicating a failure in Ingress‑to‑ALB reconciliation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;5. Root Cause Analysis (What Went Wrong)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This issue was not a single problem, but a chain of configuration gaps.&lt;br&gt;
&lt;strong&gt;Root Causes Identified&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;5.1 Ingress Group Conflict&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;The TEST ingress was using the DEV ingress group name.&lt;/li&gt;
&lt;li&gt;This caused an ALB ownership conflict.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-xx-xxx-xxx-xxx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl describe ingress app-ingress &lt;span class="nt"&gt;-n&lt;/span&gt; ep-apps
&lt;span class="go"&gt;Name:             app-ingress
Labels:           app=xxx
                  app.kubernetes.io/name=app-ingress
                  app.kubernetes.io/part-of=ep
Namespace:        ep-apps
Address:
Ingress Class:    alb
&lt;/span&gt;&lt;span class="gp"&gt;Default backend:  &amp;lt;default&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="go"&gt;Rules:
  Host        Path  Backends
  ----        ----  --------
  *
              /   app:80 (xx.xx.xx.xxx:xxxx,xxx.xx.xx.xx.xxx:xxxx)
Annotations:  alb.ingress.kubernetes.io/group.name: app-dev
              alb.ingress.kubernetes.io/group.order: 100
              alb.ingress.kubernetes.io/healthcheck-interval-seconds: 30
              alb.ingress.kubernetes.io/healthcheck-path: /api/health

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;5.2 ACM Certificate Issue&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;The attached certificate was in PENDING_VALIDATION state.&lt;/li&gt;
&lt;li&gt;ALB HTTPS listener creation failed as a result.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6k7n1kq0cqn26xwjmdmg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6k7n1kq0cqn26xwjmdmg.png" alt=" " width="800" height="195"&gt;&lt;/a&gt;&lt;/p&gt;
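
&lt;p&gt;The certificate state can be confirmed from the CLI before attaching it to the Ingress; the ARN below is a placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Returns the certificate lifecycle state, e.g. ISSUED or PENDING_VALIDATION
aws acm describe-certificate \
  --certificate-arn arn:aws:acm:us-west-2:&amp;lt;account-id&amp;gt;:certificate/&amp;lt;cert-id&amp;gt; \
  --query 'Certificate.Status' \
  --output text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Anything other than ISSUED means the controller cannot create the HTTPS listener.&lt;/p&gt;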

&lt;blockquote&gt;
&lt;p&gt;5.3 Subnet Tagging Missing&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Public subnets lacked the required tags.&lt;/li&gt;
&lt;li&gt;The controller could not discover subnets for the ALB correctly.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;5.4 Broken ALB Controller Webhook&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;The aws-load-balancer-webhook service had no endpoints.&lt;/li&gt;
&lt;li&gt;This blocked creation of the TargetGroupBinding.&lt;/li&gt;
&lt;li&gt;Pod IP registration was therefore prevented.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0t8gfoyv1gtl1p72yxso.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0t8gfoyv1gtl1p72yxso.png" alt=" " width="800" height="89"&gt;&lt;/a&gt;&lt;/p&gt;
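
&lt;p&gt;A quick way to confirm this symptom is to check whether the webhook service (the same one named in the controller error logs) has any endpoints backing it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# An empty ENDPOINTS column means admission calls for
# TargetGroupBinding objects will time out
kubectl get endpoints aws-load-balancer-webhook-service -n kube-system
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;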

&lt;blockquote&gt;
&lt;p&gt;5.5 Ingress Finalizer Stuck&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;The failed reconciliation left a finalizer on the Ingress.&lt;/li&gt;
&lt;li&gt;The controller was unable to clean up its state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;6. Solution Applied&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Step-by-Step Resolution&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;6.1 Correct Ingress Group&lt;br&gt;
&lt;/p&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-xx-xxx-xxx-xx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl annotate ingress app-ingress &lt;span class="nt"&gt;-n&lt;/span&gt; ep-apps &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="go"&gt;  alb.ingress.kubernetes.io/group.name=app-test \
  --overwrite
ingress.networking.k8s.io/app-ingress annotated

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;6.2 Use ISSUED Valid ACM Certificate&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-xx-xxx-xxx-xx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl annotate ingress app-ingress &lt;span class="nt"&gt;-n&lt;/span&gt; ep-apps &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="go"&gt;  alb.ingress.kubernetes.io/certificate-arn=arn:aws:acm:us-west-2:xxxxxxxxxxx:certificate/xxxxxxxxxxxxxxxxxxxxxxx \
  --overwrite
ingress.networking.k8s.io/app-ingress annotated

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;6.3 Tag Public Subnets (Mandatory)&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="err"&gt;kubernetes.io/role/&lt;/span&gt;&lt;span class="py"&gt;elb&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;1&lt;/span&gt;
&lt;span class="err"&gt;kubernetes.io/cluster/&amp;lt;cluster-name&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="err"&gt;shared&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
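
&lt;p&gt;The tags can also be applied from the CLI; the subnet IDs and cluster name below are placeholders for your environment:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Tag the public subnets so the controller can discover them
aws ec2 create-tags \
  --resources subnet-aaaa1111 subnet-bbbb2222 \
  --tags Key=kubernetes.io/role/elb,Value=1 \
         Key=kubernetes.io/cluster/&amp;lt;cluster-name&amp;gt;,Value=shared
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;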



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl99wzkpye8p95t4ym53g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl99wzkpye8p95t4ym53g.png" alt=" " width="557" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrgdupglcg3qcpyril1v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrgdupglcg3qcpyril1v.png" alt=" " width="606" height="35"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;6.4 Allow ALB → Node Traffic (Critical for IP Mode)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Add an inbound rule on the worker node security group that allows traffic from the ALB security group on the container port.&lt;/p&gt;
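
&lt;p&gt;As a sketch, the rule can be added with the CLI; the security group IDs and container port are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Allow the ALB security group to reach pods on the container port
aws ec2 authorize-security-group-ingress \
  --group-id sg-&amp;lt;node-sg-id&amp;gt; \
  --protocol tcp \
  --port &amp;lt;container-port&amp;gt; \
  --source-group sg-&amp;lt;alb-sg-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;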

&lt;blockquote&gt;
&lt;p&gt;6.5 Remove Broken ALB Webhook&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;First, check the controller logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system deployment/aws-load-balancer-controller &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"ts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"2026-04-15T04:30:28Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"msg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Reconciler error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"controller"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"ingress"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"ep-test"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"namespace"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"ep-test"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"reconcileID"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"7ea1f646-368e-473f-b6b4-cc0a76cf4785"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Internal error occurred: failed calling webhook &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;mtargetgroupbinding.elbv2.k8s.aws&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: failed to call webhook: Post &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span 
class="s2"&gt;https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding?timeout=10s&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: context deadline exceeded"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"ts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"2026-04-15T04:30:32Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"msg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Reconciler error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"controller"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"ingress"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"search-query-service"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"namespace"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"ep-apps"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"namespace"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"ep-apps"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"search-query-service"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"reconcileID"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"970d799a-1982-4e78-9791-76daa6a54d4d"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Internal error occurred: failed calling webhook &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;mtargetgroupbinding.elbv2.k8s.aws&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: failed 
to call webhook: Post &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding?timeout=10s&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: context deadline exceeded"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-xxx-xxxx-xxxx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl get mutatingwebhookconfigurations
&lt;span class="go"&gt;NAME                                                             WEBHOOKS   AGE
amazon-cloudwatch-observability-mutating-webhook-configuration   5          18h
aws-load-balancer-webhook                                        6          11h
pod-identity-webhook                                             1          6d21h
vpc-resource-mutating-webhook                                    1          6d21h
&lt;/span&gt;&lt;span class="gp"&gt;[ec2-user@ip-xxx-xxxx-xxxx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl delete mutatingwebhookconfiguration aws-load-balancer-webhook
&lt;span class="go"&gt;mutatingwebhookconfiguration.admissionregistration.k8s.io "aws-load-balancer-webhook" deleted
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;6.6 Restart the Controller Deployment &amp;amp; Recreate the Ingress&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Restart the ALB controller&lt;br&gt;
✅ This forces the controller to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Re‑build the model&lt;/li&gt;
&lt;li&gt;Create TargetGroupBinding&lt;/li&gt;
&lt;li&gt;Register pod IPs&lt;/li&gt;
&lt;li&gt;Update ingress status
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-xx-xxx-xxx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl rollout restart deployment aws-load-balancer-controller &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system
&lt;span class="go"&gt;deployment.apps/aws-load-balancer-controller restarted
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;✅ &lt;strong&gt;7. Final Verification&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-xx-xxx-xx-xx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl get ingress app-ingress &lt;span class="nt"&gt;-n&lt;/span&gt; ep-apps &lt;span class="nt"&gt;-o&lt;/span&gt; wide
&lt;span class="go"&gt;NAME          CLASS   HOSTS   ADDRESS                                                        PORTS   AGE
app-ingress   alb     *       k8s-eptest-erfs423536-xxxxxxxxxx.us-west-2.elb.amazonaws.com   80      10h
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✅ &lt;strong&gt;8. Validation Commands&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get ingress &lt;span class="nt"&gt;-A&lt;/span&gt;
kubectl get endpoints &lt;span class="nt"&gt;-A&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system deployment/aws-load-balancer-controller
kubectl get targetgroupbinding &lt;span class="nt"&gt;-A&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;9. Final Outcome&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✅ ALB created successfully&lt;br&gt;
✅ Target Group registered Pod IPs&lt;br&gt;
✅ Health checks passed&lt;br&gt;
✅ Ingress ADDRESS populated&lt;/p&gt;

&lt;p&gt;✅ Application accessible externally over HTTPS&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10. Best Practices Checklist (Must Follow Every Time)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✅ Ingress Configuration Checklist&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Environment-specific ingress group (dev/test/prod)&lt;/li&gt;
&lt;li&gt; Valid target-type (ip or instance)&lt;/li&gt;
&lt;li&gt; Correct service name and port&lt;/li&gt;
&lt;li&gt; Health check path works from Pod&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✅ ACM Certificate Checklist&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Certificate status = ISSUED&lt;/li&gt;
&lt;li&gt; Cert region = same as ALB&lt;/li&gt;
&lt;li&gt; Domain matches DNS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✅ Subnet Checklist (CRITICAL)&lt;/p&gt;

&lt;p&gt;For internet-facing ALB&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public subnets&lt;/li&gt;
&lt;li&gt;Route to Internet Gateway&lt;/li&gt;
&lt;li&gt;Tags:
  kubernetes.io/role/elb=1
  kubernetes.io/cluster/&amp;lt;cluster-name&amp;gt;=shared&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✅ Security Group Checklist (IP Mode)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ALB SG allows inbound 80/443&lt;/li&gt;
&lt;li&gt; Node SG allows inbound from ALB SG on container port&lt;/li&gt;
&lt;li&gt; No restrictive NACLs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✅ Controller Health Checklist&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;aws-load-balancer-controller pods Running&lt;/li&gt;
&lt;li&gt; No webhook timeouts in controller logs&lt;/li&gt;
&lt;li&gt; TargetGroupBinding objects created&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;11. Key Learnings&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ALB IP mode requires explicit SG permissions&lt;/li&gt;
&lt;li&gt;Broken webhooks can silently block target registration&lt;/li&gt;
&lt;li&gt;Ingress ADDRESS updates only after full reconciliation&lt;/li&gt;
&lt;li&gt;Always validate subnet tags before troubleshooting ALB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;12. Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ingress with ALB provides a powerful, scalable, and production-ready way to expose applications in EKS.&lt;br&gt;
However, it relies on tight integration between Kubernetes and AWS infrastructure, and misalignment at any layer can lead to hard‑to‑debug issues.&lt;/p&gt;

&lt;p&gt;Following the checklists and best practices above will ensure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faster deployments&lt;/li&gt;
&lt;li&gt;Predictable behavior&lt;/li&gt;
&lt;li&gt;Reduced downtime&lt;/li&gt;
&lt;li&gt;Easier troubleshooting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy Learning &amp;amp; Reliable Kubernetes! 🚀&lt;/p&gt;

&lt;p&gt;Follow me on LinkedIn: &lt;a href="http://www.linkedin.com/in/alok-shankar-55b94826" rel="noopener noreferrer"&gt;www.linkedin.com/in/alok-shankar-55b94826&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>eks</category>
      <category>productivity</category>
    </item>
    <item>
      <title>🚨 Elasticsearch High CPU Issue Due to Memory Pressure – Real Production Incident &amp; Fix</title>
      <dc:creator>alok shankar</dc:creator>
      <pubDate>Sat, 04 Apr 2026 09:15:31 +0000</pubDate>
      <link>https://dev.to/alok_shankar/elasticsearch-high-cpu-issue-due-to-memory-pressure-real-production-incident-fix-3c8k</link>
      <guid>https://dev.to/alok_shankar/elasticsearch-high-cpu-issue-due-to-memory-pressure-real-production-incident-fix-3c8k</guid>
      <description>&lt;p&gt;&lt;strong&gt;🔍 Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Running Elasticsearch in production requires deep visibility into CPU, memory, shards, and cluster health.&lt;/p&gt;

&lt;p&gt;One of the most confusing scenarios DevOps engineers face is:&lt;/p&gt;

&lt;p&gt;⚠️ High CPU alerts, but CPU usage looks normal&lt;/p&gt;

&lt;p&gt;In this blog, I’ll walk you through a real production incident where:&lt;/p&gt;

&lt;p&gt;Elasticsearch triggered CPU alerts,&lt;br&gt;
but the actual root cause was memory pressure, shard imbalance, and node failure.&lt;/p&gt;

&lt;p&gt;We’ll cover:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Core Elasticsearch concepts&lt;/li&gt;
&lt;li&gt;Real logs and debugging steps&lt;/li&gt;
&lt;li&gt;Root cause analysis&lt;/li&gt;
&lt;li&gt;Production fix&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;📘 Important Elasticsearch Concepts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before diving into the issue, let’s understand some key building blocks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📦 How Elasticsearch Stores Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Elasticsearch stores data as documents, grouped into an index.&lt;/p&gt;

&lt;p&gt;However, when data grows large (billions/trillions of records), a single index cannot be stored efficiently on one node.&lt;/p&gt;

&lt;p&gt;🔹 What is an Index?&lt;/p&gt;

&lt;p&gt;An Index is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A collection of documents&lt;/li&gt;
&lt;li&gt;Logical partition of data&lt;/li&gt;
&lt;li&gt;Similar to a database&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;👉 Example:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;metricbeat-*&lt;/li&gt;
&lt;li&gt;.monitoring-*&lt;/li&gt;
&lt;li&gt;user-data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;🔹 What are Shards?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To scale horizontally, Elasticsearch splits an index into shards.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Each shard is a small unit of data&lt;/li&gt;
&lt;li&gt;Stored across multiple nodes&lt;/li&gt;
&lt;li&gt;Acts like a mini-index&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;⚙️ Why Shards Matter&lt;/strong&gt;&lt;br&gt;
✅ Scalability → Data distributed across nodes&lt;br&gt;
✅ Performance → Parallel query execution&lt;br&gt;
✅ Availability → Supports failover&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔁 Primary vs Replica Shards&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Primary Shard → Original data&lt;/li&gt;
&lt;li&gt;Replica Shard → Copy for fault tolerance&lt;/li&gt;
&lt;/ol&gt;
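
&lt;p&gt;You can see how primaries (p) and replicas (r) are laid out per index with the _cat API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Lists every shard with its index, prirep flag (p/r), state, and node
curl -X GET "localhost:9200/_cat/shards?v=true&amp;amp;pretty"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;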

&lt;p&gt;&lt;strong&gt;🚨 Cluster Health Status&lt;/strong&gt;&lt;br&gt;
🟢 Green → All shards assigned&lt;br&gt;
🟡 Yellow → Replica shards missing&lt;br&gt;
🔴 Red → Primary shards missing&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🧠 JVM &amp;amp; Memory Basics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Elasticsearch runs on JVM:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Heap memory is critical&lt;/li&gt;
&lt;li&gt;High usage → Garbage Collection (GC)&lt;/li&gt;
&lt;li&gt;GC → CPU spikes&lt;/li&gt;
&lt;/ol&gt;
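
&lt;p&gt;Heap pressure can be checked per node with the nodes stats API; a heap_used_percent that keeps climbing toward 75%+ usually signals GC pressure, and therefore CPU spikes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Show only heap usage percentage for each node
curl -X GET "localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.mem.heap_used_percent&amp;amp;pretty"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;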

&lt;p&gt;&lt;strong&gt;⚠️ Production Issue Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We received alerts for:&lt;/p&gt;

&lt;p&gt;🔴 High CPU usage&lt;br&gt;
⚠️ Cluster health degraded&lt;br&gt;
📉 Slow search performance&lt;/p&gt;

&lt;p&gt;📊 Investigation &amp;amp; Debugging&lt;/p&gt;

&lt;p&gt;🔍 Step 1: Cluster Health Check&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-x-x-x-x ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; GET &lt;span class="s2"&gt;"localhost:9200/_cluster/health?pretty"&lt;/span&gt;
&lt;span class="go"&gt;{
  "cluster_name" : "web-test",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 5,
  "number_of_data_nodes" : 5,
  "active_primary_shards" : 247,
  "active_shards" : 343,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 193,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 63.99253731343284
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-x-x-x-x ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; GET &lt;span class="s2"&gt;"localhost:9200/_cluster/health?filter_path=status,*_shards&amp;amp;pretty"&lt;/span&gt;
&lt;span class="go"&gt;{
  "status" : "yellow",
  "active_primary_shards" : 247,
  "active_shards" : 343,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 193,
  "delayed_unassigned_shards" : 0
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 Key Insight:&lt;/p&gt;

&lt;p&gt;193 unassigned shards → Major issue&lt;/p&gt;
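&lt;p&gt;The health output is internally consistent and worth sanity-checking. A short shell sketch, using the figures reported above, reproduces the active_shards_percent_as_number value:&lt;/p&gt;

```shell
# Sanity-check the figures from the _cluster/health output above:
# 343 active shards out of (343 + 193) total shards.
active=343
unassigned=193
total=$((active + unassigned))
echo "total shards: $total"

# awk handles the floating-point division the shell cannot do natively.
awk -v a="$active" -v t="$total" \
  'BEGIN { printf "active_shards_percent: %.2f\n", a / t * 100 }'
```

This prints a total of 536 shards and roughly 63.99 percent, matching the API's active_shards_percent_as_number.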

&lt;p&gt;🔍 Step 2: Node Resource Usage&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-x-x-x-x ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; GET &lt;span class="s2"&gt;"localhost:9200/_cat/nodes?v=true&amp;amp;s=cpu:desc&amp;amp;pretty"&lt;/span&gt;
&lt;span class="go"&gt;ip         heap.percent ram.percent cpu load_1m load_5m load_15m node.role   master name
1x.x.x.2x9           73          97   3    0.19    0.16     0.11 cdfhilmrstw -      node-5
1x.x.x.8x            77          90   2    0.03    0.06     0.03 cdfhilmrstw *      node-1
1x.x.x.x            60          84   1    0.22    0.65     0.72 cdfhilmrstw -      node-3
1x.x.x.x            46          90   1    0.03    0.06     0.01 cdfhilmrstw -      node-4
1x.x.x.x            65          91   0    0.01    0.03     0.00 cdfhilmrstw -      node-2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Observation:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;CPU: 0–3% (low)&lt;/li&gt;
&lt;li&gt;RAM: 84–97% (very high)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;👉 This is critical:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The CPU alert was misleading; the actual issue was memory pressure.&lt;/p&gt;

&lt;p&gt;🔍 Step 3: OS-Level Analysis&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;top
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-x-x-x-xx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;top
&lt;span class="go"&gt;top - 10:57:46 up 13 days, 22:42,  1 user,  load average: 0.77, 0.73, 0.60
Tasks: 114 total,   1 running,  64 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.3 us,  0.1 sy,  0.0 ni, 97.6 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  7863696 total,   744000 free,  5938932 used,  1180764 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  2202220 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3743 elastic+  20   0   48.0g   4.9g  36368 S   8.7 65.7   7078:50 java
    1 root      20   0  117520   5144   3408 S   0.0  0.1  22:27.92 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.25 kthreadd
    4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0H
    6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 mm_percpu_wq
    7 root      20   0       0      0      0 S   0.0  0.0   0:13.95 ksoftirqd/0
    8 root      20   0       0      0      0 I   0.0  0.0   2:29.56 rcu_sched
    9 root      20   0       0      0      0 I   0.0  0.0   0:00.00 rcu_bh
   10 root      rt   0       0      0      0 S   0.0  0.0   0:02.68 migration/0
   11 root      rt   0       0      0      0 S   0.0  0.0   0:01.54 watchdog/0
   12 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/0
   13 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/1
   14 root      rt   0       0      0      0 S   0.0  0.0   0:01.63 watchdog/1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Findings:&lt;/strong&gt;&lt;br&gt;
Java process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;~4.9 GB memory usage&lt;/li&gt;
&lt;li&gt;~65% system memory&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;👉 Elasticsearch was consuming most of the node's memory&lt;/p&gt;

&lt;p&gt;🔍 Step 4: JVM Memory Pressure&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; GET &lt;span class="s2"&gt;"_nodes/stats?filter_path=nodes.*.jvm.mem.pools.old"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Observation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;High old-gen memory usage&lt;/li&gt;
&lt;li&gt;Frequent GC cycles&lt;/li&gt;
&lt;/ol&gt;
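&lt;p&gt;The old-gen figures can be evaluated offline once the stats response is saved. A minimal sketch (the used_in_bytes/max_in_bytes fields match the nodes.*.jvm.mem.pools.old path queried above; the byte values here are illustrative, not from the incident cluster):&lt;/p&gt;

```shell
# Offline sketch: compute old-gen utilisation from a saved _nodes/stats
# response. Sample values are illustrative, not from the incident cluster.
cat > /tmp/old_gen.json <<'EOF'
{
  "used_in_bytes": 3865470566,
  "max_in_bytes": 4294967296
}
EOF

# Extract each field's number (each field is on its own line).
used=$(grep '"used_in_bytes"' /tmp/old_gen.json | tr -dc '0-9')
max=$(grep '"max_in_bytes"' /tmp/old_gen.json | tr -dc '0-9')
awk -v u="$used" -v m="$max" \
  'BEGIN { printf "old-gen used: %.0f%%\n", u / m * 100 }'
```

Old-gen sitting near 90% of its pool is a classic precursor to the frequent GC cycles, and GC-driven CPU spikes, observed here.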

&lt;p&gt;🔍 Step 5: Unassigned Shards Analysis&lt;/p&gt;

&lt;p&gt;Unassigned shards have a state of UNASSIGNED. The prirep value is p for primary shards and r for replicas.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; GET &lt;span class="s2"&gt;"localhost:9200/_cat/shards?v=true&amp;amp;h=index,shard,prirep,state,node,unassigned.reason&amp;amp;s=state&amp;amp;pretty"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-x-x-x-xx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; GET &lt;span class="s2"&gt;"localhost:9200/_cat/shards?v=true&amp;amp;h=index,shard,prirep,state,node,unassigned.reason&amp;amp;s=state&amp;amp;pretty"&lt;/span&gt;
&lt;span class="go"&gt;index                                                       shard  prirep     state          unassigned.reason
product_search_tab_data                                      0     r      UNASSIGNED        NODE_LEFT
metricbeat-7.10.2-2023.02.08-000024                           0     r      UNASSIGNED        NODE_LEFT
metricbeat-7.17.0-2022.12.04-000004                           0     r      UNASSIGNED        NODE_LEFT
.monitoring-es-7-mb-2023.04.16                                0     r      UNASSIGNED        REPLICA_ADDED
.monitoring-es-7-mb-2023.04.14                                0     r      UNASSIGNED        REPLICA_ADDED
apm-7.9.2-span-000002                                         0     r      UNASSIGNED        NODE_LEFT
metricbeat-7.10.2-2021.12.29-000012                           0     r      UNASSIGNED        NODE_LEFT
product_search_analytics                                     0     r      UNASSIGNED        NODE_LEFT
product_search_analytics                                     0     r      UNASSIGNED        NODE_LEFT
product_search_analytics                                     0     r      UNASSIGNED        NODE_LEFT
product_search_analytics                                     0     r      UNASSIGNED        NODE_LEFT
product_fap_model_item                                       0     r      UNASSIGNED        NODE_LEFT
metricbeat-7.10.2-2021.11.29-000011                           0     r      UNASSIGNED        NODE_LEFT
metricbeat-7.17.1-2022.12.07-000008                           0     r      UNASSIGNED        NODE_LEFT
.kibana-event-log-7.9.2-000024                                0     r      UNASSIGNED        NODE_LEFT
.kibana-event-log-7.17.1-000010                               0     r      UNASSIGNED        NODE_LEFT
.monitoring-kibana-7-2023.04.16                               0     r      UNASSIGNED        REPLICA_ADDED
.kibana-event-log-7.9.2-000026                                0     r      UNASSIGNED        INDEX_CREATED
product_fap_price                                            0     r      UNASSIGNED        NODE_LEFT
.ds-.logs-deprecation.elasticsearch-default-2022.12.12-000020 0     r      UNASSIGNED        NODE_LEFT
ilm-history-2-000025                                          0     r      UNASSIGNED        NODE_LEFT
metricbeat-7.17.1-2022.10.08-000006                           0     r      UNASSIGNED        NODE_LEFT
ilm-history-2-000023                                          0     r      UNASSIGNED        NODE_LEFT
product_product_hierarchy                                    0     r      UNASSIGNED        NODE_LEFT
product_product_hierarchy                                    0     r      UNASSIGNED        NODE_LEFT
product_product_hierarchy                                    0     r      UNASSIGNED        NODE_LEFT
product_product_hierarchy                                    0     r      UNASSIGNED        NODE_LEFT
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Finding:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;UNASSIGNED → NODE_LEFT&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;👉 Meaning:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A node left the cluster&lt;/li&gt;
&lt;li&gt;Replica shards not reassigned&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🔍 Step 6: UNASSIGNED Shard Analysis&lt;/p&gt;

&lt;p&gt;To understand why an unassigned shard is not being assigned and what action you must take to allow Elasticsearch to assign it, use the cluster allocation explanation API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; GET &lt;span class="s2"&gt;"localhost:9200/_cluster/allocation/explain?filter_path=index,node_allocation_decisions.node_name,node_allocation_decisions.deciders.*&amp;amp;pretty"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-x-x-x-xx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; GET &lt;span class="s2"&gt;"localhost:9200/_cluster/allocation/explain?filter_path=index,node_allocation_decisions.node_name,node_allocation_decisions.deciders.*&amp;amp;pretty"&lt;/span&gt;
&lt;span class="go"&gt;{
  "index" : "product_search_tab_data",
  "node_allocation_decisions" : [
    {
      "node_name" : "node-1",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node [[product_search_tab_data][0], node[EQ6QyUbhQZCZRqP78rMIIQ], [P], s[STARTED], a[id=7vBWLesZQAS4zYjt_ER2bw]]"
        },
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [10.42130719712077%]"
        }
      ]
    },
    {
      "node_name" : "node-5",
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [9.907598002066106%]"
        }
      ]
    },
    {
      "node_name" : "node-2",
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [11.010075893021023%]"
        }
      ]
    },
    {
      "node_name" : "node-3",
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [10.938318653211446%]"
        }
      ]
    },
    {
      "node_name" : "node-4",
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [12.273611767876893%]"
        }
      ]
    }
  ]
}
&lt;/span&gt;&lt;span class="gp"&gt;[ec2-user@ip-x-x-x-xx ~]$&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="go"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;🧠 Root Cause Analysis (RCA)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After correlating all logs, metrics, and cluster behavior, we identified multiple layered issues contributing to the problem.&lt;/p&gt;

&lt;p&gt;🔴 1. Large Number of Unassigned Shards&lt;br&gt;
193 shards were unassigned&lt;/p&gt;

&lt;p&gt;Majority had reason:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UNASSIGNED → NODE_LEFT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 Impact:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Continuous shard allocation attempts&lt;/li&gt;
&lt;li&gt;Increased cluster overhead&lt;/li&gt;
&lt;li&gt;Memory and thread pressure&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🔴 2. Node Failure (NODE_LEFT)&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;One or more nodes temporarily left the cluster&lt;/li&gt;
&lt;li&gt;Replica shards lost their assigned nodes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;👉 Result:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cluster moved to YELLOW state&lt;/li&gt;
&lt;li&gt;Triggered rebalancing operations&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🔴 3. Disk Watermark Threshold Breach (Critical Finding 🚨)&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;During shard allocation analysis, we found:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"search"&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"node_allocation_decisions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"node_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"node-3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"deciders"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"decider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"disk_threshold"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"decision"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NO"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"explanation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"node above low watermark (85%), free: ~7.6%"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"node_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"node-5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"deciders"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"decider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"disk_threshold"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"decision"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NO"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"explanation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"node above low watermark (85%), free: ~9.6%"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"node_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"node-4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"deciders"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"decider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"disk_threshold"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"decision"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NO"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"explanation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"node above low watermark (85%), free: ~10.7%"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 Key Insight:&lt;/p&gt;

&lt;p&gt;Elasticsearch refused to allocate shards on these nodes&lt;br&gt;
because disk usage had crossed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;cluster.routing.allocation.disk.watermark.low&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;85%&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 Actual situation:&lt;/p&gt;

&lt;p&gt;Nodes had only ~7%–10% free disk space&lt;br&gt;
Allocation decision = ❌ NO&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️ Why This Is Critical&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When disk watermark is breached:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Elasticsearch blocks shard allocation&lt;/li&gt;
&lt;li&gt;Unassigned shards remain stuck&lt;/li&gt;
&lt;li&gt;Cluster cannot rebalance&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;👉 This directly caused:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Persistent unassigned shards&lt;/li&gt;
&lt;li&gt;Memory pressure&lt;/li&gt;
&lt;li&gt;Internal retries → CPU spikes&lt;/li&gt;
&lt;/ol&gt;
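&lt;p&gt;For completeness: the watermark that blocked allocation is tunable through the cluster settings API. A hedged sketch of the request body (sent via PUT to _cluster/settings) is below. The percentages are illustrative, and raising watermarks is only a stop-gap to unblock allocation while disk is freed, never a substitute for adding capacity.&lt;/p&gt;

```json
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
  }
}
```

The first setting name is the same cluster.routing.allocation.disk.watermark.low key that appears in the allocation explain output above.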

&lt;p&gt;🔴 4. High JVM Memory Pressure&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Heap usage consistently high&lt;/li&gt;
&lt;li&gt;JVM old-gen heavily utilized&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;👉 Result:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Frequent Garbage Collection (GC)&lt;/li&gt;
&lt;li&gt;CPU spikes during GC cycles&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🔴 5. Thread Pool Pressure&lt;/p&gt;

&lt;p&gt;Even though CPU looked low:&lt;/p&gt;

&lt;p&gt;Threads were blocked due to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Allocation retries&lt;/li&gt;
&lt;li&gt;Memory pressure&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;👉 As per Elasticsearch behavior:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Thread pool exhaustion can trigger CPU-related alerts&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;🧩 Final Root Cause Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The issue was NOT just CPU-related.&lt;/p&gt;

&lt;p&gt;It was a combination of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Disk space exhaustion (Watermark breach)&lt;/li&gt;
&lt;li&gt;❌ Unassigned shards (allocation blocked)&lt;/li&gt;
&lt;li&gt;❌ Node failure (NODE_LEFT)&lt;/li&gt;
&lt;li&gt;❌ High JVM memory pressure&lt;/li&gt;
&lt;li&gt;❌ Continuous allocation retries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🛠️ Final Fix Implemented&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After complete analysis, we identified that:&lt;/p&gt;

&lt;p&gt;👉 Insufficient disk space was the primary blocker&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔧 Solution Steps&lt;/strong&gt;&lt;br&gt;
✅ 1. Increased Disk Capacity&lt;/p&gt;

&lt;p&gt;Added +50 GB storage to all Elasticsearch nodes.&lt;/p&gt;

&lt;p&gt;👉 Result:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Disk usage dropped below the watermark threshold&lt;/li&gt;
&lt;li&gt;Shard allocation resumed&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;monitoring-kibana-7-2023.04.17                               0     p      STARTED    node-5
catelog-7.9.2-span-000010                                    0     p      STARTED    node-1
catelog-7.9.2-span-000010                                    0     r      STARTED    node-3
product_fragments                                            0     p      STARTED    node-3
packetbeat-7.9.3-2023.04.14-000019                            0     p      STARTED    node-5
metricbeat-7.10.2-2022.04.14-000014                           0     p      STARTED    node-3
.ds-.logs-deprecation.elasticsearch-default-2022.09.19-000014 0     p      STARTED    node-1
.ds-ilm-history-5-2023.04.09-000028                           0     p      STARTED    node-5
catelog-7.9.2-profile-000010                                  0     p      STARTED    node-2
catelog-7.9.2-profile-000010                                  0     r      STARTED    node-3
packetbeat-7.9.3-2022.09.16-000012                            0     p      STARTED    node-2
metricbeat-7.13.3-2021.07.11-000001                           0     p      STARTED    node-2
logstash                                                      0     p      STARTED    node-3
.monitoring-es-7-mb-2023.04.12                                0     p      STARTED    node-4
.catelog-custom-link                                          0     p      STARTED    node-1
.catelog-custom-link                                          0     r      STARTED    node-3
catelog-7.9.2-metric-000015                                   0     p      STARTED    node-1
catelog-7.9.2-metric-000015                                   0     r      STARTED    node-3
catelog-7.9.2-profile-000017                                  0     r      STARTED    node-3
catelog-7.9.2-profile-000017                                  0     p      STARTED    node-5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;✅ 2. Rolling Restart&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Restarted nodes one by one (rolling restart)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;👉 Ensured:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;No downtime&lt;/li&gt;
&lt;li&gt;Safe cluster recovery&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;✅ 3. Automatic Shard Reallocation&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Elasticsearch started assigning shards automatically&lt;/li&gt;
&lt;li&gt;Cluster began stabilizing&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;🎯 Final Result&lt;/strong&gt;&lt;br&gt;
✅ Unassigned shards → 0&lt;br&gt;
✅ Cluster status → GREEN&lt;br&gt;
✅ Memory pressure reduced&lt;br&gt;
✅ CPU spikes eliminated&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-x-x-x-xx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; GET &lt;span class="s2"&gt;"localhost:9200/_cluster/health?pretty"&lt;/span&gt;
&lt;span class="go"&gt;{
  "cluster_name" : "web-test",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 5,
  "number_of_data_nodes" : 5,
  "active_primary_shards" : 247,
  "active_shards" : 536,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;💡 Key Learning (Very Important 🚀)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🔥 Disk space is directly linked to cluster stability in Elasticsearch&lt;/p&gt;

&lt;p&gt;Even if:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;CPU looks fine&lt;/li&gt;
&lt;li&gt;Memory seems manageable&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;👉 If disk crosses watermark:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Shards won’t allocate&lt;/li&gt;
&lt;li&gt;Cluster will degrade&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;✍️ Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This incident was a great reminder that Elasticsearch performance issues are rarely straightforward.&lt;/p&gt;

&lt;p&gt;What initially appeared as a high CPU problem turned out to be a cascading failure caused by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Disk watermark threshold breaches&lt;/li&gt;
&lt;li&gt;Unassigned shards&lt;/li&gt;
&lt;li&gt;Node failure (NODE_LEFT)&lt;/li&gt;
&lt;li&gt;JVM memory pressure&lt;/li&gt;
&lt;li&gt;Continuous shard allocation retries&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;👉 The most critical takeaway:&lt;/p&gt;

&lt;p&gt;🔥 Disk space is not just a storage concern in Elasticsearch — it directly impacts shard allocation, memory usage, and overall cluster stability.&lt;/p&gt;

&lt;p&gt;Even when CPU usage looks normal, underlying factors like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Heap pressure&lt;/li&gt;
&lt;li&gt;Disk utilization&lt;/li&gt;
&lt;li&gt;Cluster health&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;can silently degrade the system until it reaches a breaking point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🚀 Final Thoughts for DevOps Engineers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In production environments, always think beyond surface-level alerts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Don’t trust CPU metrics alone&lt;/li&gt;
&lt;li&gt;Correlate memory, disk, and cluster state&lt;/li&gt;
&lt;li&gt;Monitor unassigned shards and disk watermarks proactively&lt;/li&gt;
&lt;li&gt;Design clusters with proper shard sizing and capacity planning.&lt;/li&gt;
&lt;/ol&gt;
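&lt;p&gt;Monitoring unassigned shards proactively (point 3 above) can be sketched as a small check. In production the JSON would come from curl against _cluster/health; here a saved sample stands in so the logic is self-contained:&lt;/p&gt;

```shell
# Sketch of a proactive health check: alert when the cluster is not green
# or any shard is unassigned. A saved sample response stands in for:
#   curl -s "localhost:9200/_cluster/health"
cat > /tmp/health.json <<'EOF'
{
  "status": "yellow",
  "unassigned_shards": 193
}
EOF

status=$(grep '"status"' /tmp/health.json | sed 's/.*: *"\([a-z]*\)".*/\1/')
unassigned=$(grep '"unassigned_shards"' /tmp/health.json | tr -dc '0-9')

if [ "$status" != "green" ] || [ "$unassigned" -gt 0 ]; then
  echo "ALERT: status=$status unassigned_shards=$unassigned"
else
  echo "OK"
fi
```

With the sample above this prints an ALERT line; wiring the same check into cron or an alerting pipeline surfaces shard and watermark problems before they cascade.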

</description>
      <category>productivity</category>
      <category>devops</category>
      <category>aws</category>
      <category>elasticsearch</category>
    </item>
    <item>
      <title>🚀 Headlamp: A Modern Kubernetes UI You’ll Actually Enjoy Using</title>
      <dc:creator>alok shankar</dc:creator>
      <pubDate>Mon, 30 Mar 2026 13:54:25 +0000</pubDate>
      <link>https://dev.to/alok_shankar/headlamp-a-modern-kubernetes-ui-youll-actually-enjoy-using-3h3h</link>
      <guid>https://dev.to/alok_shankar/headlamp-a-modern-kubernetes-ui-youll-actually-enjoy-using-3h3h</guid>
      <description>&lt;p&gt;🔹 1. Introduction&lt;/p&gt;

&lt;p&gt;Managing Kubernetes clusters via CLI (kubectl) is powerful—but let’s be honest, it can get overwhelming, especially when dealing with complex workloads, debugging issues, or onboarding new team members.&lt;/p&gt;

&lt;p&gt;This is where Headlamp comes in.&lt;/p&gt;

&lt;p&gt;Headlamp is a user-friendly, extensible Kubernetes UI designed to simplify cluster management while still giving DevOps engineers deep visibility and control.&lt;/p&gt;

&lt;p&gt;👉 Think of it as:&lt;/p&gt;

&lt;p&gt;A developer-friendly Kubernetes dashboard&lt;br&gt;
A modern alternative to traditional tools&lt;br&gt;
A UI that supports plugins and extensibility&lt;/p&gt;

&lt;p&gt;🔹 2. Why Use Headlamp Over Kubernetes Dashboard?&lt;/p&gt;

&lt;p&gt;The official Kubernetes Dashboard, while simple and lightweight, has not kept pace with the needs of modern, production-grade environments. As of early 2026, it has been officially archived and is no longer maintained, and the Kubernetes community and documentation now recommend Headlamp as the preferred UI.&lt;/p&gt;

&lt;p&gt;❌ Challenges with Kubernetes Dashboard:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Complex authentication setup (token-based access)&lt;/li&gt;
&lt;li&gt;Limited debugging capabilities&lt;/li&gt;
&lt;li&gt;No plugin/extensibility support&lt;/li&gt;
&lt;li&gt;Poor UX for large-scale clusters&lt;/li&gt;
&lt;li&gt;Not actively evolving for modern DevOps needs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;✅ Why Headlamp Wins:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Simple setup and login&lt;/li&gt;
&lt;li&gt;Clean and intuitive UI&lt;/li&gt;
&lt;li&gt;Plugin-based architecture&lt;/li&gt;
&lt;li&gt;Better visibility into workloads&lt;/li&gt;
&lt;li&gt;Built-in terminal (exec into pods)&lt;/li&gt;
&lt;li&gt;Real-time logs and metrics&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;👉 In short: Headlamp is built for modern DevOps workflows.&lt;/p&gt;

&lt;p&gt;🔹 3. Headlamp vs Kubernetes Dashboard (Comparison)&lt;/p&gt;

&lt;p&gt;To clearly illustrate the differences, the following table contrasts Headlamp and the Kubernetes Dashboard across key dimensions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature/Capability&lt;/th&gt;
&lt;th&gt;Headlamp&lt;/th&gt;
&lt;th&gt;Kubernetes Dashboard&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Project Status&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Actively maintained, CNCF Sandbox project&lt;/td&gt;
&lt;td&gt;Officially archived, unmaintained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment Modes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Desktop app (Windows, Linux, Mac), in-cluster&lt;/td&gt;
&lt;td&gt;In-cluster web UI only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-Cluster Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes, via kubeconfig and context switching&lt;/td&gt;
&lt;td&gt;No, single cluster per instance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RBAC Awareness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full RBAC support, UI adapts to user permissions&lt;/td&gt;
&lt;td&gt;Basic RBAC, less granular&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CRD/Operator Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;First-class, auto-discovers and renders CRDs&lt;/td&gt;
&lt;td&gt;Limited, often breaks with CRDs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Extensibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Robust plugin system, easy customization&lt;/td&gt;
&lt;td&gt;Minimal, no plugin architecture&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resource Relationships&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Visualizes ownership and relationships&lt;/td&gt;
&lt;td&gt;Object-centric, limited relationships&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Uses kubeconfig, minimal cluster footprint&lt;/td&gt;
&lt;td&gt;Requires in-cluster service account&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;UI/UX&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Modern, clean, responsive&lt;/td&gt;
&lt;td&gt;Basic, dated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Logs &amp;amp; Exec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Integrated log viewing, pod exec, download logs&lt;/td&gt;
&lt;td&gt;Basic logs, limited exec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Community &amp;amp; Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Active, open-source, CNCF-backed&lt;/td&gt;
&lt;td&gt;Maintenance-focused; few new features&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Production Readiness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes, recommended for enterprise use&lt;/td&gt;
&lt;td&gt;Not recommended for production&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🔹 4. Key Benefits of Headlamp&lt;/p&gt;

&lt;p&gt;🚀 Developer Productivity&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visual representation of resources&lt;/li&gt;
&lt;li&gt;Faster troubleshooting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🔍 Deep Observability&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logs, events, and YAML in one place&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🔌 Extensibility&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add plugins for custom workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⚡ Faster Debugging&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exec into pods directly from the UI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🌍 Multi-cluster Management&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manage multiple clusters seamlessly&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Installation Steps for Windows and Linux
&lt;/h2&gt;

&lt;p&gt;Headlamp offers multiple installation methods, catering to both desktop and in-cluster deployments. Below are detailed, step-by-step instructions for installing Headlamp on Windows and Linux desktops, as well as in-cluster options for team-wide access.&lt;/p&gt;

&lt;h3&gt;
  
  
  Windows Desktop Installation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Option 1: Install via Winget (Recommended for Windows 10/11)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open PowerShell or Command Prompt as Administrator.&lt;/li&gt;
&lt;li&gt;Run the following command:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight batchfile"&gt;&lt;code&gt;   &lt;span class="kd"&gt;winget&lt;/span&gt; &lt;span class="kd"&gt;install&lt;/span&gt; &lt;span class="kd"&gt;headlamp&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Once installed, launch Headlamp from the Start Menu or by searching for "Headlamp".&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Option 2: Install via Chocolatey&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ensure Chocolatey is installed. If not, follow the instructions at &lt;a href="https://chocolatey.org/install" rel="noopener noreferrer"&gt;https://chocolatey.org/install&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Open PowerShell as Administrator.&lt;/li&gt;
&lt;li&gt;Run:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight batchfile"&gt;&lt;code&gt;   &lt;span class="kd"&gt;choco&lt;/span&gt; &lt;span class="kd"&gt;install&lt;/span&gt; &lt;span class="kd"&gt;headlamp&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Launch Headlamp from the Start Menu.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Option 3: Download the Installer from GitHub Releases&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Visit the &lt;a href="https://github.com/kubernetes-sigs/headlamp/releases" rel="noopener noreferrer"&gt;Headlamp GitHub Releases page&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Download the latest &lt;code&gt;.exe&lt;/code&gt; installer.&lt;/li&gt;
&lt;li&gt;Double-click the installer and follow the prompts.&lt;/li&gt;
&lt;li&gt;Launch Headlamp from the Start Menu.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Upgrading:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If installed via Winget or Chocolatey, use &lt;code&gt;winget upgrade headlamp&lt;/code&gt; or &lt;code&gt;choco upgrade headlamp&lt;/code&gt; to update.&lt;/li&gt;
&lt;li&gt;If installed via the GitHub installer, download and run the new version manually.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;First Launch:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
On first launch, Headlamp will prompt you to select a kubeconfig file or will automatically load it from &lt;code&gt;~/.kube/config&lt;/code&gt;. Select your desired cluster context to begin managing your Kubernetes environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  In-Cluster Installation (Helm and YAML)
&lt;/h3&gt;

&lt;p&gt;For team-wide, browser-based access, Headlamp can be deployed inside your Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 1: Install via Helm&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add the Headlamp Helm repository:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   helm repo add headlamp https://kubernetes-sigs.github.io/headlamp/
   helm repo update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Install Headlamp in the desired namespace (e.g., &lt;code&gt;headlamp&lt;/code&gt;):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   helm &lt;span class="nb"&gt;install &lt;/span&gt;headlamp headlamp/headlamp &lt;span class="nt"&gt;--namespace&lt;/span&gt; headlamp &lt;span class="nt"&gt;--create-namespace&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Forward the service port to your local machine:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   kubectl port-forward svc/headlamp 4466:80 &lt;span class="nt"&gt;-n&lt;/span&gt; headlamp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Access Headlamp at &lt;a href="http://localhost:4466" rel="noopener noreferrer"&gt;http://localhost:4466&lt;/a&gt; in your browser.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Option 2: Install via YAML Manifest&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Apply the official deployment YAML:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/kubernetes-sigs/headlamp/main/deployment/headlamp.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Forward the service port as above.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Authentication:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp uses Kubernetes RBAC for authentication. For secure access, create a ServiceAccount and ClusterRoleBinding as needed, and use the generated token to log in.&lt;/p&gt;
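&lt;p&gt;As a sketch, the ServiceAccount and binding can look like the manifest below. The name &lt;code&gt;headlamp-admin&lt;/code&gt; is a placeholder, and the built-in &lt;code&gt;view&lt;/code&gt; ClusterRole is used here to keep access read-only; widen the role only as your team actually needs:&lt;/p&gt;

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: headlamp-admin
  namespace: headlamp
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: headlamp-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view        # read-only built-in role; swap for a custom role if needed
subjects:
- kind: ServiceAccount
  name: headlamp-admin
  namespace: headlamp
```

&lt;p&gt;After applying the manifest, &lt;code&gt;kubectl -n headlamp create token headlamp-admin&lt;/code&gt; (Kubernetes 1.24+) prints a short-lived token you can paste into Headlamp’s login screen.&lt;/p&gt;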

&lt;h2&gt;
  
  
  How to Verify Pods and Check Workloads Using Headlamp
&lt;/h2&gt;

&lt;p&gt;One of Headlamp’s core strengths is its ability to provide clear, actionable insights into the state of your workloads. Here’s how to verify pods and check workloads step by step:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Navigating to Workloads
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Open Headlamp&lt;/strong&gt; and select your cluster context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrdk3px7s33vryocl05k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrdk3px7s33vryocl05k.png" alt=" " width="800" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the sidebar, click on &lt;strong&gt;“Workloads”&lt;/strong&gt;. This section aggregates all workload types: Deployments, StatefulSets, DaemonSets, Jobs, CronJobs, and Pods.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4y67zm56q0s25gh7syr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4y67zm56q0s25gh7syr.png" alt=" " width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Viewing Pods
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Click on &lt;strong&gt;“Pods”&lt;/strong&gt; under the Workloads section.&lt;/li&gt;
&lt;li&gt;You’ll see a table listing all pods, with columns for Name, Namespace, Status, Age, Node, and more.&lt;/li&gt;
&lt;li&gt;Status indicators (color-coded) provide at-a-glance health information: green for Running, yellow for Pending, red for Failed, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Inspecting Pod Details
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Click on a pod name to open its detail view.&lt;/li&gt;
&lt;li&gt;The detail page shows:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pod status&lt;/strong&gt; (phase, conditions, restarts)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Container statuses&lt;/strong&gt; (ready, waiting, terminated, reason)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Events&lt;/strong&gt; (recent events affecting the pod)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource usage&lt;/strong&gt; (CPU, memory, if metrics are available)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;YAML view&lt;/strong&gt; for advanced inspection or editing (if permitted)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjtyrta1mor4vnuvk92u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjtyrta1mor4vnuvk92u.png" alt=" " width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Workload Overview Dashboard
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;Workload Overview&lt;/strong&gt; provides charts and summaries of all workload types, including ready vs. total replicas, health status, and recent changes.&lt;/li&gt;
&lt;li&gt;Use filters and search to narrow down by namespace, label, or status.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Practical Example
&lt;/h3&gt;

&lt;p&gt;Suppose you want to verify that all pods in the &lt;code&gt;production&lt;/code&gt; namespace are healthy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select the &lt;code&gt;production&lt;/code&gt; namespace from the namespace dropdown.&lt;/li&gt;
&lt;li&gt;Go to Workloads → Pods.&lt;/li&gt;
&lt;li&gt;Check the Status column for any pods not in the Running state.&lt;/li&gt;
&lt;li&gt;Click on any problematic pod to view its events and logs for troubleshooting.&lt;/li&gt;
&lt;/ol&gt;
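&lt;p&gt;The same check can be mirrored from the CLI. The snippet below is a minimal sketch that assumes &lt;code&gt;kubectl&lt;/code&gt; access to a &lt;code&gt;production&lt;/code&gt; namespace; it prints a simple health verdict per pod:&lt;/p&gt;

```shell
# Map a pod STATUS column value to a simple health verdict.
classify_pod() {
  case "$1" in
    Running|Succeeded|Completed) echo "healthy" ;;
    *)                           echo "needs attention" ;;
  esac
}

# One line per pod: name, then verdict (kubectl errors are silenced here).
kubectl get pods -n production --no-headers 2>/dev/null |
while read -r name ready status rest; do
  printf '%s\t%s\n' "$name" "$(classify_pod "$status")"
done
```

&lt;p&gt;Anything flagged as “needs attention” is a candidate for the event and log inspection described in step 4.&lt;/p&gt;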

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp’s pod and workload management interface leverages standardized patterns for filtering, sorting, and metrics visualization, making it easy to spot issues and drill down for details. The consistent UI across resource types ensures a smooth learning curve and efficient operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Analyze and Download Logs in Headlamp
&lt;/h2&gt;

&lt;p&gt;Accessing and analyzing logs is critical for troubleshooting and monitoring Kubernetes workloads. Headlamp streamlines this process with integrated log viewing and download capabilities.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Accessing Pod Logs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Workloads → Pods&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click on the desired pod to open its detail view.&lt;/li&gt;
&lt;li&gt;In the pod detail page, locate the &lt;strong&gt;“Logs”&lt;/strong&gt; tab or section.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Viewing Logs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Select the container (if the pod has multiple containers) from the dropdown.&lt;/li&gt;
&lt;li&gt;Logs are streamed in real-time, with options to pause, scroll, or search within the log output.&lt;/li&gt;
&lt;li&gt;You can filter logs by time range or keywords for targeted analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3u1qvf9qph4f0o06alja.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3u1qvf9qph4f0o06alja.png" alt=" " width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fev9ajqtvopq8vg57y6ix.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fev9ajqtvopq8vg57y6ix.png" alt=" " width="800" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Downloading Logs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Click the &lt;strong&gt;“Download”&lt;/strong&gt; button (usually represented by a download icon) to save the current log output as a file.&lt;/li&gt;
&lt;li&gt;Choose the desired format (plain text or JSON, depending on implementation).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82aikn4ct9o2o84yqbvj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82aikn4ct9o2o84yqbvj.png" alt=" " width="800" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Practical Example
&lt;/h3&gt;

&lt;p&gt;If a deployment is failing, you can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Identify the failing pod in the Workloads → Pods view.&lt;/li&gt;
&lt;li&gt;Open its detail page and switch to the Logs tab.&lt;/li&gt;
&lt;li&gt;Review recent log entries for errors or stack traces.&lt;/li&gt;
&lt;li&gt;Download the logs for offline analysis or sharing with your team.&lt;/li&gt;
&lt;/ol&gt;
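&lt;p&gt;For comparison, the equivalent CLI steps can be sketched as follows. The label selector &lt;code&gt;app=web&lt;/code&gt; is a hypothetical example; substitute your failing deployment’s labels:&lt;/p&gt;

```shell
# Derive a log filename from kubectl's "pod/NAME" output form.
logfile_name() { echo "${1#pod/}.log"; }   # e.g. pod/web-0 becomes web-0.log

# Save the last hour of logs from every pod matching the selector.
for pod in $(kubectl get pods -n production -l app=web -o name 2>/dev/null); do
  kubectl logs -n production "$pod" --since=1h --timestamps > "$(logfile_name "$pod")"
done
```

&lt;p&gt;The resulting files can then be shared or diffed offline, just like logs downloaded from the Headlamp UI.&lt;/p&gt;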

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp’s log viewer eliminates the need for running &lt;code&gt;kubectl logs&lt;/code&gt; commands or SSHing into nodes. The ability to stream, filter, and download logs directly from the UI accelerates troubleshooting and supports collaborative debugging.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Execute Commands in Running Pods (Runtime Pod Exec)
&lt;/h2&gt;

&lt;p&gt;Executing commands inside running containers is often necessary for debugging, inspecting file systems, or running diagnostics. Headlamp provides a secure, RBAC-aware interface for runtime pod exec.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Accessing Pod Exec
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Workloads → Pods&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click on the target pod to open its detail view.&lt;/li&gt;
&lt;li&gt;Look for the &lt;strong&gt;“Exec”&lt;/strong&gt; or &lt;strong&gt;“Terminal”&lt;/strong&gt; button (often represented by a terminal icon).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Opening a Terminal
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Click the Exec/Terminal button.&lt;/li&gt;
&lt;li&gt;A web-based terminal session opens, connected to the selected container.&lt;/li&gt;
&lt;li&gt;You can run shell commands (e.g., &lt;code&gt;ls&lt;/code&gt;, &lt;code&gt;cat&lt;/code&gt;, &lt;code&gt;env&lt;/code&gt;, &lt;code&gt;top&lt;/code&gt;) as if you were inside the container.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv5933u2vcet0p9rolg8y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv5933u2vcet0p9rolg8y.png" alt=" " width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhhs6rozj9lwatx359sj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhhs6rozj9lwatx359sj.png" alt=" " width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Security and Permissions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The Exec feature is only available if your RBAC permissions allow it.&lt;/li&gt;
&lt;li&gt;If you lack the necessary permissions, the button will be hidden or disabled.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Practical Example
&lt;/h3&gt;

&lt;p&gt;To debug a misbehaving application:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open the pod’s terminal via Exec.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;ps aux&lt;/code&gt; to inspect running processes.&lt;/li&gt;
&lt;li&gt;Check configuration files or logs within the container.&lt;/li&gt;
&lt;li&gt;Exit the session when done.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp’s runtime exec feature brings the power of &lt;code&gt;kubectl exec&lt;/code&gt; to the browser, with RBAC enforcement and auditability. This reduces the need for direct node access and supports secure, efficient debugging workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Debugging with Headlamp
&lt;/h2&gt;

&lt;p&gt;Effective debugging in Kubernetes requires visibility into events, resource relationships, and traces. Headlamp offers a suite of tools to support comprehensive debugging:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Events and Relationships
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Events:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Each resource detail page includes a list of recent Kubernetes events affecting that resource (e.g., pod scheduled, container crash, image pull errors). Events are color-coded by severity and timestamped for easy correlation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resource Relationships:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp visualizes ownership and dependency chains, such as which ReplicaSet owns a Pod, or which Service routes to which Pods. This helps trace issues across controllers and workloads.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Traces and Metrics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Traceloop Plugin:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
For advanced debugging, the Inspektor Gadget Traceloop plugin can be integrated, providing syscall traces for pods. This acts as a “flight data recorder,” capturing system calls before and after crashes for post-mortem analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Metrics Integration:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp can display resource usage metrics (CPU, memory) for nodes and pods, aiding in performance troubleshooting.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Practical Debugging Workflow
&lt;/h3&gt;

&lt;p&gt;Suppose a deployment is experiencing intermittent pod restarts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open the deployment’s detail page to view related ReplicaSets and Pods.&lt;/li&gt;
&lt;li&gt;Inspect events for crash loops or scheduling errors.&lt;/li&gt;
&lt;li&gt;View pod logs for error messages.&lt;/li&gt;
&lt;li&gt;Use the Exec feature to inspect the container’s environment.&lt;/li&gt;
&lt;li&gt;If available, use the Traceloop plugin to analyze syscalls leading up to the crash.&lt;/li&gt;
&lt;/ol&gt;
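&lt;p&gt;Step 2 of this workflow can also be approximated from the CLI. The helper below flags pods whose restart count exceeds a threshold; the &lt;code&gt;production&lt;/code&gt; namespace and threshold of 3 are illustrative:&lt;/p&gt;

```shell
# Print names from "name restart-count" lines where the count exceeds MAX.
flag_restarts() {   # usage: ... | flag_restarts MAX
  awk -v max="$1" '$2+0 > max {print $1}'
}

# Feed kubectl's name/restart-count columns through the filter.
kubectl get pods -n production --no-headers \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount \
  2>/dev/null | flag_restarts 3
```

&lt;p&gt;Pods printed here are the ones whose events and logs deserve a closer look in steps 2 and 3.&lt;/p&gt;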

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
By consolidating events, logs, metrics, and resource relationships in one UI, Headlamp enables rapid root cause analysis and reduces mean time to resolution (MTTR) for production incidents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Precautions While Using Headlamp
&lt;/h2&gt;

&lt;p&gt;While Headlamp is designed with security and usability in mind, there are important precautions and best practices to follow:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. RBAC and Access Control
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Principle of Least Privilege:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Always assign the minimum necessary permissions to users and service accounts accessing Headlamp. Use Kubernetes RBAC to restrict actions by role, namespace, or resource type.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ServiceAccount Tokens:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
For in-cluster deployments, generate dedicated ServiceAccounts and ClusterRoleBindings for Headlamp access. Avoid using cluster-admin tokens for regular users.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;UI Controls Reflect Permissions:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp hides or disables UI controls (edit, delete, exec) if the user lacks the corresponding RBAC permissions, reducing the risk of unauthorized actions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
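&lt;p&gt;A least-privilege setup might start from a read-only ClusterRole like the sketch below. The name and resource list are illustrative; trim them to what your team actually needs:&lt;/p&gt;

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: headlamp-readonly
rules:
- apiGroups: ["", "apps", "batch"]
  resources:
    - pods
    - pods/log
    - services
    - events
    - deployments
    - replicasets
    - statefulsets
    - daemonsets
    - jobs
    - cronjobs
  verbs: ["get", "list", "watch"]
```

&lt;p&gt;Bind this role to a dedicated ServiceAccount rather than handing out &lt;code&gt;cluster-admin&lt;/code&gt; tokens.&lt;/p&gt;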

&lt;h3&gt;
  
  
  2. Network Exposure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Restrict External Access:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Do not expose Headlamp (or any Kubernetes dashboard) directly to the public internet. Use VPNs, IP allowlists, or network policies to restrict access to trusted networks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TLS and Ingress:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
When exposing Headlamp via Ingress, always enable TLS/HTTPS and use trusted certificates. Consider integrating with identity providers for Single Sign-On (SSO).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Audit Logging
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kubernetes API Auditing:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
All actions performed via Headlamp are executed through the Kubernetes API and are subject to API server audit logging. Ensure audit logs are enabled and retained according to compliance requirements.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No Native UI Audit Log:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp itself does not maintain a separate audit log of UI actions. Rely on Kubernetes API audit logs for forensic analysis.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Plugin Security
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Review Plugins Carefully:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Only install plugins from trusted sources. Review plugin code and permissions, as plugins can extend or modify UI behavior.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sandboxing:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Desktop deployments are isolated to the local machine, reducing risk. In-cluster deployments should be monitored for plugin activity.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Upgrades and Maintenance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep Headlamp Updated:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Regularly update Headlamp to the latest version to receive security patches and new features. Use package managers (Winget, Chocolatey, Flatpak) or Helm for managed upgrades.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor for Vulnerabilities:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Subscribe to Headlamp’s GitHub repository or community channels for security advisories and updates.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
By adhering to these precautions, organizations can safely leverage Headlamp’s capabilities while minimizing security risks and maintaining compliance with best practices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Plugins and Extensibility
&lt;/h2&gt;

&lt;p&gt;One of Headlamp’s defining features is its extensible plugin system, which empowers users and organizations to tailor the UI to their unique workflows and requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. What Can Plugins Do?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Custom Dashboards:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Build specialized pages with visualizations, metrics, or business logic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resource Extensions:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Add custom sections, actions, or views to existing Kubernetes resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;External Integrations:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Connect Headlamp to monitoring tools (Prometheus, Grafana), CI/CD systems, or cost management platforms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Branding and Theming:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Apply custom themes, logos, or UI components to match organizational branding.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automation and Workflows:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Implement organization-specific automation, such as bulk actions or approval workflows.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Official and Community Plugins
&lt;/h3&gt;

&lt;p&gt;Headlamp maintains a repository of official plugins, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prometheus:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Adds Prometheus-powered charts to workload detail views.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;cert-manager:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
UI for managing cert-manager resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flux, Karpenter, KEDA, Knative, Minikube, Opencost:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Integrations for popular Kubernetes operators and tools.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI Assistant:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Integrates AI capabilities directly into Headlamp.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plugin Catalog:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Enables one-click installation of plugins from within the desktop app.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Community plugins are available for tools like Trivy (vulnerability scanning), Kyverno (policy management), Kubescape (security scanning), and more.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Developing Plugins
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Framework:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Plugins are developed using TypeScript/React and the Headlamp plugin API.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Development Workflow:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Use the &lt;code&gt;@kinvolk/headlamp-plugin&lt;/code&gt; CLI for scaffolding, building, and packaging plugins.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Distribution:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Plugins can be distributed via Artifact Hub or internal repositories.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Documentation:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Comprehensive guides and examples are available in the Headlamp documentation and plugins repository.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Practical Example: Adding a Prometheus Chart
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Install the Prometheus plugin from the Plugin Catalog.&lt;/li&gt;
&lt;li&gt;Configure Prometheus in your cluster.&lt;/li&gt;
&lt;li&gt;View real-time metrics charts in workload detail pages.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The plugin system transforms Headlamp from a static dashboard into a customizable platform, enabling organizations to innovate and adapt as their Kubernetes environments evolve.&lt;/p&gt;




&lt;h2&gt;
  
  
  RBAC, CRD Handling, and Multi-Cluster Support
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. RBAC (Role-Based Access Control)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fine-Grained Permissions:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp enforces Kubernetes RBAC policies, ensuring users only see and perform actions they are authorized for.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;UI Adaptation:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The UI dynamically adapts to the user’s permissions, hiding or disabling controls as appropriate.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Namespace and Cluster Scope:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Permissions can be restricted at the namespace or cluster level, supporting multi-tenant and secure environments.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. CRD (Custom Resource Definition) Handling
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Auto-Discovery:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp automatically detects and renders CRDs, displaying their custom schemas and status fields.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Operator Support:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Works seamlessly with operator-driven clusters, supporting tools like Argo CD, Prometheus Operator, and custom controllers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plugin Extensions:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Plugins can extend the UI for specialized CRDs, providing tailored dashboards or management interfaces.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Multi-Cluster Support
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unified Management:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Manage multiple clusters from a single Headlamp instance, switching contexts via the cluster switcher.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kubeconfig Integration:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The desktop app uses your kubeconfig file, supporting any number of clusters and contexts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;In-Cluster Federation:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
In-cluster deployments can be configured to access multiple clusters if API access is permitted.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
These capabilities make Headlamp suitable for complex, enterprise-grade environments, supporting secure, scalable, and operator-driven Kubernetes operations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Troubleshooting Common Issues and Verification After Install
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Installation Verification
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Desktop App:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
On launch, ensure Headlamp loads your kubeconfig and displays available clusters. If clusters are missing, check the kubeconfig path and permissions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;In-Cluster Deployment:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
After deploying Headlamp, verify the deployment and service status:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  kubectl get deploy &lt;span class="nt"&gt;-n&lt;/span&gt; headlamp
  kubectl get svc &lt;span class="nt"&gt;-n&lt;/span&gt; headlamp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ensure pods are running and the service is accessible via port-forward, NodePort, or Ingress.&lt;/p&gt;
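&lt;p&gt;For a quick local check, port-forwarding works well; the service name and port below assume a default install and may differ in your cluster:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward -n headlamp svc/headlamp 8080:80
# Then open http://localhost:8080 in a browser
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;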

&lt;h3&gt;
  
  
  2. Authentication Issues
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Access Denied:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
If you see “Access Denied” errors, verify your ServiceAccount token and RBAC permissions. Ensure the token is valid and has the necessary roles.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Missing Resources:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
If resources are missing from the UI, check RBAC policies and namespace filters.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
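&lt;p&gt;A quick way to confirm whether RBAC is the culprit is &lt;code&gt;kubectl auth can-i&lt;/code&gt;, impersonating the ServiceAccount Headlamp uses (the namespace and ServiceAccount name below are examples):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Check what the Headlamp ServiceAccount is allowed to do
kubectl auth can-i list pods --as=system:serviceaccount:headlamp:headlamp
kubectl auth can-i get deployments -n default --as=system:serviceaccount:headlamp:headlamp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;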

&lt;h3&gt;
  
  
  3. Log and Exec Failures
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Logs Not Loading:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Ensure the Headlamp service has network access to the Kubernetes API and that your RBAC permissions include &lt;code&gt;get&lt;/code&gt; and &lt;code&gt;list&lt;/code&gt; on pods/logs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Exec Not Working:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Confirm that your role includes the &lt;code&gt;pods/exec&lt;/code&gt; verb. Some environments restrict exec for security reasons.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
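&lt;p&gt;A minimal sketch of the RBAC rules needed for logs and exec is shown below; the Role name and namespace are illustrative, and note that &lt;code&gt;pods/exec&lt;/code&gt; is authorized via the &lt;code&gt;create&lt;/code&gt; verb:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: headlamp-logs-exec   # example name
  namespace: default
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;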

&lt;h3&gt;
  
  
  4. Plugin Issues
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plugin Not Loading:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Check plugin compatibility with your Headlamp version. Review plugin logs for errors.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;UI Glitches:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Clear browser cache or restart the desktop app. Ensure all dependencies are up to date.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Port Forwarding Problems
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stuck in Pending:&lt;/strong&gt;
If port forwarding is stuck, verify that your RBAC permissions include &lt;code&gt;pods/portforward&lt;/code&gt; and &lt;code&gt;services/portforward&lt;/code&gt;. Check for port conflicts or network restrictions.&lt;/li&gt;
&lt;/ul&gt;
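&lt;p&gt;The corresponding rules fragment (to be merged into your Role or ClusterRole) looks roughly like this; port forwarding is authorized via the &lt;code&gt;create&lt;/code&gt; verb on the subresources:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;rules:
- apiGroups: [""]
  resources: ["pods/portforward", "services/portforward"]
  verbs: ["create"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;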

&lt;h3&gt;
  
  
  6. Metrics Not Displayed
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus Integration:&lt;/strong&gt;
Ensure Prometheus is installed and accessible in your cluster. Configure the Prometheus plugin as needed.&lt;/li&gt;
&lt;/ul&gt;
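&lt;p&gt;If Prometheus is not yet present, one common way to install it is via the community Helm chart (release name and namespace below are examples):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;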

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Most issues stem from RBAC misconfigurations, network restrictions, or plugin compatibility. Careful review of logs, permissions, and documentation resolves the majority of problems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Integration with Monitoring and Observability Tools
&lt;/h2&gt;

&lt;p&gt;Headlamp can be integrated with popular monitoring and observability stacks to provide comprehensive visibility into cluster health and performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Prometheus
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plugin Integration:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The Prometheus plugin adds charts and metrics to workload detail views, displaying CPU, memory, and custom metrics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Configuration:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Deploy Prometheus in your cluster and configure the plugin to point to the Prometheus endpoint.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Grafana
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;External Dashboards:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
While Headlamp does not natively embed Grafana dashboards, it can link to external Grafana instances for advanced visualization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Observability Stack:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Combine Headlamp with Prometheus and Grafana for a full-featured observability solution, leveraging exporters, scrapers, and dashboards.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Third-Party Integrations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plugins:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Community plugins are available for tools like Opencost (cost monitoring), Trivy (vulnerability scanning), and Kubescape (security compliance).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Custom Plugins:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Develop custom plugins to integrate with proprietary monitoring or alerting systems.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp’s extensibility ensures it can fit into any observability stack, providing both built-in and customizable monitoring capabilities.&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitations, Caveats, and Best Practices for Production Use
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Not a Full Replacement for kubectl:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
While Headlamp covers most day-to-day operations, advanced automation, scripting, and bulk operations are still best handled via the CLI.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No Built-In Audit Log:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp relies on Kubernetes API audit logs for action tracking. There is no separate UI audit log.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Metrics Visualization:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp provides basic metrics visualization but does not match the depth of dedicated tools like Grafana.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plugin Development Overhead:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Specialized CRDs or workflows may require custom plugin development, which involves TypeScript/React expertise.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;RBAC Complexity:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Fine-tuning RBAC for large teams can be complex; misconfigurations may lead to incomplete UI or access issues.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Best Practices
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use RBAC to Enforce Least Privilege:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Regularly review and update roles and bindings to minimize risk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Restrict Network Exposure:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Never expose Headlamp directly to the internet without strong authentication and network controls.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor and Audit Usage:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Enable and retain Kubernetes API audit logs for compliance and incident response.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep Headlamp and Plugins Updated:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Regularly update to the latest versions to benefit from security patches and new features.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test Plugins in Staging:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Validate new plugins in a non-production environment before rolling out to production clusters.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
By understanding these limitations and following best practices, organizations can maximize Headlamp’s benefits while mitigating risks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Community, Contribution, and Official Resources
&lt;/h2&gt;

&lt;p&gt;Headlamp is a vibrant, community-driven project with active development and support channels.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Community Involvement
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Open Source:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
100% open source under Apache 2.0 License.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Contribution:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Contributions are welcome via GitHub pull requests, plugin development, documentation, and issue reporting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Community Channels:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;#headlamp channel in Kubernetes Slack&lt;/li&gt;
&lt;li&gt;Monthly community meetings&lt;/li&gt;
&lt;li&gt;GitHub Discussions and Issues&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Official Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Documentation:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Comprehensive user and developer documentation at &lt;a href="https://headlamp.dev/docs/latest/" rel="noopener noreferrer"&gt;headlamp.dev/docs&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GitHub Repository:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Source code, releases, and contribution guidelines at &lt;a href="https://github.com/kubernetes-sigs/headlamp" rel="noopener noreferrer"&gt;github.com/kubernetes-sigs/headlamp&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plugins Repository:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Official and community plugins at &lt;a href="https://github.com/headlamp-k8s/plugins" rel="noopener noreferrer"&gt;github.com/headlamp-k8s/plugins&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Artifact Hub:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Discover and install plugins via &lt;a href="https://artifacthub.io/packages/search?kind=0&amp;amp;org=headlamp-k8s" rel="noopener noreferrer"&gt;Artifact Hub&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Active community engagement ensures Headlamp remains relevant, secure, and feature-rich, with rapid response to issues and evolving requirements.&lt;/p&gt;




&lt;h2&gt;
  
  
  When to Use Headlamp vs kubectl and CLI Automation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Use Headlamp When:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You need a visual overview of cluster health and resource relationships.&lt;/li&gt;
&lt;li&gt;Managing or troubleshooting workloads, pods, and services interactively.&lt;/li&gt;
&lt;li&gt;Onboarding new team members or collaborating with less technical users.&lt;/li&gt;
&lt;li&gt;Viewing and managing CRDs and operator-driven resources.&lt;/li&gt;
&lt;li&gt;Integrating with monitoring, cost, or security tools via plugins.&lt;/li&gt;
&lt;li&gt;Managing multiple clusters from a single interface.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Use kubectl and CLI Automation When:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Performing advanced scripting, automation, or CI/CD integration.&lt;/li&gt;
&lt;li&gt;Executing bulk operations or custom workflows.&lt;/li&gt;
&lt;li&gt;Managing infrastructure as code (GitOps).&lt;/li&gt;
&lt;li&gt;Debugging at the API or YAML level.&lt;/li&gt;
&lt;li&gt;Handling edge cases or experimental features not yet supported in the UI.&lt;/li&gt;
&lt;/ul&gt;
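&lt;p&gt;For instance, a bulk clean-up like the following is a one-liner in the CLI but has no single-click UI equivalent (the namespace is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Delete all Failed pods in a namespace in one shot
kubectl delete pods -n staging --field-selector=status.phase=Failed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;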

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp and kubectl are complementary tools. Headlamp excels at visualization, day-to-day management, and collaboration, while kubectl remains indispensable for automation, scripting, and advanced operations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Headlamp represents the next generation of Kubernetes UIs: user-friendly, extensible, and aligned with the realities of modern, production-grade clusters. By combining a clean, intuitive interface with powerful features like multi-cluster management, robust RBAC support, CRD visibility, and a thriving plugin ecosystem, Headlamp empowers teams to manage Kubernetes with confidence and efficiency.&lt;/p&gt;

&lt;p&gt;Whether you are a DevOps engineer troubleshooting a production incident, a developer deploying your first application, or a platform team managing dozens of clusters, Headlamp provides the insight, control, and flexibility you need. Its open-source foundation, active community, and commitment to extensibility ensure that it will continue to evolve alongside Kubernetes itself.&lt;/p&gt;

&lt;p&gt;👉 If you’re a DevOps Engineer, SRE, or Cloud Architect, Headlamp can significantly improve your workflow.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Resolving Kubernetes Production Pod Failure Due to EFS Mount &amp; Memory Exhaustion</title>
      <dc:creator>alok shankar</dc:creator>
      <pubDate>Wed, 17 Dec 2025 17:37:18 +0000</pubDate>
      <link>https://dev.to/alok_shankar/resolving-kubernetes-production-pod-failure-due-to-efs-mount-memory-exhaustion-4ma</link>
      <guid>https://dev.to/alok_shankar/resolving-kubernetes-production-pod-failure-due-to-efs-mount-memory-exhaustion-4ma</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In production Kubernetes environments, storage-related issues combined with resource exhaustion can lead to cascading pod failures. In this incident, a critical JMSQ backend pod reached 100% memory utilization, and subsequent pod recreation attempts failed. An associated efs-start pod was also impacted, resulting in Persistent Volume mount failures. This blog walks through the real-world troubleshooting approach, the commands used, the root cause analysis, and the final fix applied in an Amazon EKS cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem Statement&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Symptoms Observed&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Application pod stuck in Pending / ContainerCreating state&lt;/li&gt;
&lt;li&gt;Continuous FailedMount errors&lt;/li&gt;
&lt;li&gt;Dependent pods unable to start&lt;/li&gt;
&lt;li&gt;Heap dump PVC backed by AWS EFS failing to mount&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Error Message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Warning FailedMount kubelet Unable to attach or mount volumes:
unmounted volumes=[heapdump-volume], unattached volumes=[heapdump-volume]:
timed out waiting for the condition
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This clearly indicated a Persistent Volume mount issue, most likely related to EFS connectivity from worker nodes.&lt;/p&gt;
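&lt;p&gt;Filtering cluster events is a fast way to gauge how widespread the mount failures are (the namespace is the one from this incident):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get events -n prdq --sort-by=.lastTimestamp | grep FailedMount
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;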

&lt;p&gt;&lt;strong&gt;2️⃣ Failure Flow (What Went Wrong)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;JMSQ Pod Memory Spike
│
▼
Pod Restart Triggered
│
▼
Scheduler selects OLD Node
│
▼
Node has STALE EFS mount
│
▼
EFS Volume Mount Timeout
│
▼
Pod stuck in ContainerCreating
│
▼
FailedMount errors flood Events
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-Step Troubleshooting &amp;amp; Fix&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Identify the Impacted Pod&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Command used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe pod jmsq-deployment-34253e -n prdq

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Events:
  Type     Reason       Age                  From     Message
  ----     ------       ----                 ----     -------
  Warning  FailedMount  110s (x30 over 67m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[heapdump-volume], unattached volumes=[heapdump-volume]: timed out waiting for the condition

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this step?&lt;/strong&gt;&lt;br&gt;
kubectl describe provides detailed pod-level events, container states, volume mounts, and scheduling information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Findings:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pod stuck in ContainerCreating&lt;/li&gt;
&lt;li&gt;Volume heapdump-volume failed to mount&lt;/li&gt;
&lt;li&gt;Storage backend: PersistentVolumeClaim (PVC)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Verify efs-start Pod &amp;amp; Storage Health&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl exec -it efs-start-23sed-34e -n prdefs -- /bin/sh

#Inside the pod:
df -h

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Filesystem                Size      Used Available Use% Mounted on
overlay                 128.0G      5.2G    122.8G   4% /
tmpfs                    64.0M         0     64.0M   0% /dev
tmpfs                     7.6G         0      7.6G   0% /sys/fs/cgroup
fs-xxxxx.efs.us-west-2.amazonaws.com:/
                          8.0E      5.3G      8.0E   0% /persistentvolumes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this step?&lt;/strong&gt;&lt;br&gt;
To ensure that EFS itself was reachable and mounted correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observation:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;EFS filesystem was mounted&lt;/li&gt;
&lt;li&gt;No disk space exhaustion&lt;/li&gt;
&lt;li&gt;Indicates node-level EFS connectivity issue, not EFS service failure&lt;/li&gt;
&lt;/ol&gt;
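&lt;p&gt;A stale NFS mount typically hangs rather than fails outright, so on a suspect worker node (via SSH or SSM) a time-bounded check is useful: a healthy mount returns instantly, a stale one hits the timeout. The mount path below is the one from this setup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# List NFS4 mounts on the node
mount -t nfs4

# df against the EFS mount point with a 5-second timeout
timeout 5 df -h /persistentvolumes || echo "EFS mount appears stale"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;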

&lt;p&gt;&lt;strong&gt;Step 3: Inspect Persistent Volumes (PV)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                              STORAGECLASS    
pvc-xxxxxxxx-xxxxx-xxxxxxxxx-xxxxxxxxxxx   101Gi      RWX            Delete           Bound    prd/jmsq-volume                      aws-efs      
pvc-xxxxxxxx-xxxxx-xxxxxxxxx-xxxxxxxxxxx   100Gi      RWO            Delete           Bound    prd/solr-index-volume                slow-local   
pvc-xxxxxxxx-xxxxx-xxxxxxxxx-xxxxxxxxxxx   100Gi      RWO            Delete           Bound    prd/heapdump-volume                  aws-efs 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Purpose:&lt;/strong&gt;&lt;br&gt;
Validate whether the PV associated with the heapdump PVC is in a healthy Bound state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;heapdump-volume PV was Bound&lt;/li&gt;
&lt;li&gt;Backed by aws-efs storage class&lt;/li&gt;
&lt;li&gt;This ruled out PVC misconfiguration.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Verify Storage Classes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get storageclass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE
aws-efs         example.com/aws-efs     Delete          Immediate        
fast-local      kubernetes.io/aws-ebs   Delete          Immediate                
slow-local      kubernetes.io/aws-ebs   Delete          Immediate        
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;br&gt;
Confirms dynamic provisioning behavior and backend type.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Storage Class:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;aws-efs → RWX, dynamic provisioning, Immediate binding&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Check Persistent Volume Claims (PVC)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pvc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME           STATUS   VOLUME           CAPACITY   ACCESS MODES   STORAGECLASS   
Cloudnexus     Bound    pvc-XXXXXXXXXX   256Gi      RWO            slow-local     
Cloudjenkin    Bound    pvc-xxxxxxxxxx   64Gi       RWO            slow-local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 6: Inspect Worker Nodes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Observation:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;All nodes appeared Ready&lt;/li&gt;
&lt;li&gt;However, older nodes likely had stale or broken EFS mount connections&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 7: Cordon Existing Nodes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for X in $(kubectl get nodes | grep "^ip" | awk '{print $1}'); do \
kubectl cordon $X; \
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ec2-user@ip-xxxxxxxx ~]$ for X in $(kubectl get nodes | grep "^ip" | awk '{print $1}'); do kubectl cordon $X;done
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why cordon nodes?&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Prevents new workloads from scheduling on problematic nodes&lt;/li&gt;
&lt;li&gt;Allows isolation of faulty infrastructure&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Verify post cordon nodes:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ec2-user@ip-xxxxxxxx ~]$ kubectl get node
NAME                                          STATUS                     ROLES    AGE    VERSION
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 8: Scale Auto Scaling Group to Add New Nodes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;New EKS worker node was automatically provisioned via ASG&lt;/li&gt;
&lt;li&gt;New node established fresh EFS mount connections&lt;/li&gt;
&lt;/ol&gt;
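&lt;p&gt;Scaling the node group can be done from the console or with the AWS CLI; the ASG name and capacity below are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws autoscaling set-desired-capacity \
  --auto-scaling-group-name eks-prod-workers \
  --desired-capacity 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;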

&lt;p&gt;Verification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ec2-user@ip-xxxxxxxx ~]$ kubectl get node
NAME                                          STATUS                     ROLES    AGE    VERSION
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready                      &amp;lt;none&amp;gt;   50s   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready                      &amp;lt;none&amp;gt;   50s   v1.22.14-eks-fb459a0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;New node joined cluster in Ready state&lt;/li&gt;
&lt;li&gt;Healthy EFS connectivity&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 9: Restart Impacted Deployments&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl rollout restart deployment &amp;lt;deployment-name&amp;gt; -n &amp;lt;namespace&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pods scheduled only on new healthy nodes&lt;/li&gt;
&lt;li&gt;EFS volumes mounted successfully&lt;/li&gt;
&lt;li&gt;Application recovered without data loss&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Root Cause Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Long-running worker nodes developed stale EFS mount connections&lt;/li&gt;
&lt;li&gt;JMSQ pod memory exhaustion triggered repeated restarts&lt;/li&gt;
&lt;li&gt;Kubernetes attempted to reuse unhealthy nodes&lt;/li&gt;
&lt;li&gt;EFS mount timeout prevented PVC attachment&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Final Fix Applied&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✔ Cordoned impacted nodes&lt;br&gt;
✔ Added fresh worker nodes via Auto Scaling Group&lt;br&gt;
✔ Restarted deployments to force rescheduling&lt;br&gt;
✔ Restored healthy EFS mounts and pod stability&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3️⃣ Recovery Flow (Fix Applied)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Detect FailedMount Errors
│
▼
Cordon Old Worker Nodes
│
▼
Auto Scaling Group adds NEW Node
│
▼
New Node establishes fresh EFS mount
│
▼
Rollout Restart Deployment
│
▼
Pods scheduled on healthy node
│
▼
Application fully recovered
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Learnings &amp;amp; Best Practices&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Always check Events section in kubectl describe pod&lt;/li&gt;
&lt;li&gt;PVC Bound ≠ storage is usable (node-level issues matter)&lt;/li&gt;
&lt;li&gt;Periodically rotate EKS worker nodes&lt;/li&gt;
&lt;li&gt;Monitor memory usage to avoid JVM heap exhaustion&lt;/li&gt;
&lt;li&gt;Use rollout restart instead of deleting pods manually&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This case study demonstrates how storage + node health issues can silently break Kubernetes workloads even when cluster objects appear healthy. A structured troubleshooting approach — starting from pod events, moving to storage, and finally node isolation — helped resolve the production outage efficiently with minimal risk.&lt;/p&gt;

&lt;p&gt;If you are running stateful workloads on EKS with EFS, proactive node lifecycle management and monitoring are critical to avoid similar failures.&lt;/p&gt;

&lt;p&gt;Happy Learning &amp;amp; Reliable Kubernetes! 🚀&lt;/p&gt;

&lt;p&gt;Follow me on LinkedIn: &lt;a href="http://www.linkedin.com/in/alok-shankar-55b94826" rel="noopener noreferrer"&gt;www.linkedin.com/in/alok-shankar-55b94826&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>aws</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Investigating &amp; Resolving High-CPU Alerts in Kubernetes Pods</title>
      <dc:creator>alok shankar</dc:creator>
      <pubDate>Mon, 15 Dec 2025 15:20:58 +0000</pubDate>
      <link>https://dev.to/alok_shankar/investigating-resolving-high-cpu-alerts-in-kubernetes-pods-m0a</link>
      <guid>https://dev.to/alok_shankar/investigating-resolving-high-cpu-alerts-in-kubernetes-pods-m0a</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
Recently, I faced a production issue where observability tools flagged sustained CPU utilization &amp;gt;95% on a particular pod in Kubernetes. Investigation revealed the Java process was hitting the pod’s 3-core CPU limit, even though the node had spare capacity, pointing to application-level saturation. &lt;/p&gt;

&lt;p&gt;Using kubectl and in-container diagnostics, I confirmed the JVM as the source. &lt;/p&gt;

&lt;p&gt;In this post, I’ll walk through the step-by-step process: how I diagnosed it, the safe remediation &lt;strong&gt;(increasing pod CPU limits and optionally scaling replicas)&lt;/strong&gt;, and the follow-up JVM and query checks to prevent recurrence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Goals&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Confirm the alert (pod and node metrics).&lt;/li&gt;
&lt;li&gt;Determine whether the node or the pod (application) caused the high CPU.&lt;/li&gt;
&lt;li&gt;Identify what inside the pod is CPU-hot (process / JVM threads / GC / queries).&lt;/li&gt;
&lt;li&gt;Apply safe remediation and verify.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — Confirm pod metrics (kubectl top)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl top pod webapp-deployment-rfc4f -n stgapp

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME                                          CPU(cores)   MEMORY(bytes)
webapp-deployment-rfc4f                         2863m        2662Mi

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Values:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CPU = 2863m  (≈ 2.86 cores)  → round ≈ 2.9 cores&lt;br&gt;
Memory = 2662Mi (≈ 2.6 GB)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2 — Check the pod's resource limits (deployment spec)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get deployment webapp-deployment-rfc4f -n stgapp -o yaml | grep -A5 resources

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resources:
  limits:
    cpu: "3"
    memory: 3Gi
  requests:
    cpu: "3"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interpretation:&lt;/strong&gt; The pod is consuming ~2.86 cores — very close to its configured CPU limit.&lt;/p&gt;

&lt;p&gt;CPU request = 3 cores&lt;br&gt;
CPU limit = 3 cores&lt;br&gt;
The pod is configured to have 3 cores; pod usage ~2.86 cores explains the alert (&amp;gt;95% of 3 cores).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3 — Confirm node health (is it node or pod that’s saturated?)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl top node

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME                                       CPU(cores)  CPU%   MEMORY(bytes)   MEMORY%
ip-xxxxxxxxxx.us-west-2.compute.internal   122m         1%     5575Mi          37%
ip-xxxxxxxxxx.us-west-2.compute.internal   181m         2%     9653Mi          65%
ip-xxxxxxxxxx.us-west-2.compute.internal   86m          1%     7030Mi          47%
ip-xxxxxxxxxx.us-west-2.compute.internal   3045m        39%    7057Mi          47%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ip-xxxxxxxxxx.us-west-2.compute.internal    3045m        39%    7057Mi                    47%
... other nodes show low CPU %

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4 — Check process level inside the pod&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We exec'd into the pod and listed processes.&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl exec -it webapp-deployment-rfc4f -n stgapp-- ps aux --sort=-%cpu | head -20

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
app_run+      46  225 17.1 8475108 2745168 pts/0 Sl+  04:42 129:33 /usr/lib/jvm/
app_run+       1  0.0  0.0   2664   960 pts/0    Ss   04:42   0:00 /usr/bin/tini
... (other minor processes)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interpretation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Java process (PID 46) is the heavy consumer — observed at ~225% CPU.&lt;/p&gt;

&lt;p&gt;This continuous high CPU usage from Java explains the pod-level metric.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We also obtained thread count:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Command :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl exec -it webapp-deployment-rfc4f -n stgapp -- bash -c "ps -eLf | grep java | wc -l"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;180

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interpretation:&lt;/strong&gt; ~180 Java threads, a large thread count for a Java service.&lt;/p&gt;
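&lt;p&gt;To find out which threads are burning the CPU, per-thread usage from &lt;code&gt;top -H&lt;/code&gt; can be matched against a &lt;code&gt;jstack&lt;/code&gt; dump. A sketch: it assumes &lt;code&gt;jstack&lt;/code&gt; ships in the image, and the thread id 12345 is only an example.&lt;/p&gt;

```shell
# Per-thread CPU usage for the Java process (PID 46 from ps above).
kubectl exec webapp-deployment-rfc4f -n stgapp -- top -b -H -n 1 -p 46 | head -20

# jstack prints native thread ids in hex (nid=0x...), so convert the hot
# thread id reported by top to hex before searching the dump.
printf '%x\n' 12345
# prints 3039

kubectl exec webapp-deployment-rfc4f -n stgapp -- jstack 46 | grep -A 10 'nid=0x3039'
```

The stack frames around the matching `nid` show what that hot thread was actually doing (GC, a busy loop, a slow query, etc.).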

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The pod was configured with a CPU limit of 3 cores.&lt;/li&gt;
&lt;li&gt;The Java search process consistently consumed most of that (observed 225%–293% at different times).&lt;/li&gt;
&lt;li&gt;Node had spare capacity → this was application-level saturation (the application needed more CPU than allotted).&lt;/li&gt;
&lt;li&gt;No evidence of node-level resource pressure or cgroup throttling preventing the pod from running; the pod simply used its quota.&lt;/li&gt;
&lt;/ol&gt;
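&lt;p&gt;Point 4 can be verified directly: the kernel exposes throttling counters in the container's cgroup. A sketch; the path below is for cgroup v2, while cgroup v1 keeps the file under &lt;code&gt;/sys/fs/cgroup/cpu/cpu.stat&lt;/code&gt;.&lt;/p&gt;

```shell
# Dump the CPU throttling counters from inside the pod (cgroup v2 path).
kubectl exec webapp-deployment-rfc4f -n stgapp -- cat /sys/fs/cgroup/cpu.stat

# Extract just the throttled-period count; a value that keeps growing
# between samples means the kernel is enforcing the CPU limit.
kubectl exec webapp-deployment-rfc4f -n stgapp -- cat /sys/fs/cgroup/cpu.stat \
  | awk '$1 == "nr_throttled" { print $2 }'
```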

&lt;p&gt;&lt;strong&gt;Solution implemented:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Below are two safe approaches:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A — Increase pod CPU limit (vertical fix)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the application legitimately needs more CPU (sustained), increase limits.cpu. Example: raise limit from 3 → 4 or 6 cores.&lt;/p&gt;

&lt;p&gt;Command to update the deployment (Deployment-managed pods are replaced via a rolling update, so the change is non-disruptive):&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl set resources deployment/webapp-deployment-rfc4f -n stgapp \
  --limits=cpu=4 --requests=cpu=3

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or edit YAML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl edit deployment/webapp-deployment-rfc4f -n stgapp 
# update resources: limits.cpu to "4"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get deployment webapp-deployment-rfc4f -n stgapp -o yaml | grep -A5 resources
# and
kubectl top pod webapp-deployment-rfc4f -n stgapp 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected result:&lt;/p&gt;

&lt;p&gt;The pod has a larger CPU quota. If the load stays the same, CPU% (relative to the limit) will drop and the alert will recover.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option B — Scale replicas (horizontal fix)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If requests can be load-balanced across replicas, scale out to reduce per-pod load:&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl scale deployment webapp-deployment-rfc4f -n stgapp --replicas=2

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify:&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods -n stgapp  -l app=webapp-deployment-rfc4f -o wide
kubectl top pods -n stgapp 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected result:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Per-pod CPU load drops (if the incoming workload is split across replicas).&lt;/p&gt;

&lt;p&gt;In our case we applied both fixes: raised the CPU limit to 4 cores and scaled to 2 replicas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Increase CPU limit:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl set resources deployment/webapp-deployment-rfc4f -n stgapp \
  --limits=cpu=4 --requests=cpu=3

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scale replicas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl scale deployment webapp-deployment-rfc4f -n stgapp --replicas=2

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get deployment webapp-deployment-rfc4f -n stgapp -o yaml | sed -n '/resources:/,+6p'
kubectl get pods -n stgapp -l app=webapp-deployment-rfc4f -o wide
kubectl top pod -n stgapp 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Post-fix verification:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Deployment resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get deployment webapp-deployment-rfc4f -n stgapp -o yaml | grep -A5 resources

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resources:
  limits:
    cpu: "4"
    memory: 3Gi
  requests:
    cpu: "3"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Alert triggered because pod CPU usage ≈ 2.86 cores against a 3-core limit (&amp;gt;95% sustained usage).&lt;/li&gt;
&lt;li&gt;Investigation steps: kubectl top pod, kubectl top node, ps inside pod, check deployment resources.&lt;/li&gt;
&lt;li&gt;Root cause: Java search process saturating the pod CPU (application-level).&lt;/li&gt;
&lt;li&gt;Remediation: Increase CPU limit (vertical), or scale replicas (horizontal), and investigate hot threads/slow queries for permanent fixes.&lt;/li&gt;
&lt;/ol&gt;
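&lt;p&gt;To keep per-pod load bounded without manual intervention, the horizontal fix can be automated with a HorizontalPodAutoscaler. A minimal sketch; the replica bounds and the 70% target are illustrative, not values from this incident:&lt;/p&gt;

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa
  namespace: stgapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp-deployment-rfc4f
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Note that the utilization target is measured against the pod's CPU request, not its limit.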

&lt;p&gt;Follow me on LinkedIn: &lt;a href="http://www.linkedin.com/in/alok-shankar-55b94826" rel="noopener noreferrer"&gt;www.linkedin.com/in/alok-shankar-55b94826&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>java</category>
      <category>kubernetes</category>
      <category>aws</category>
    </item>
    <item>
      <title>Cloud FinOps in Action: How I Saved Thousands by Optimizing AWS Architectures</title>
      <dc:creator>alok shankar</dc:creator>
      <pubDate>Sun, 23 Nov 2025 16:36:41 +0000</pubDate>
      <link>https://dev.to/alok_shankar/cloud-finops-in-action-how-i-saved-thousands-by-optimizing-aws-architectures-36bk</link>
      <guid>https://dev.to/alok_shankar/cloud-finops-in-action-how-i-saved-thousands-by-optimizing-aws-architectures-36bk</guid>
      <description>&lt;p&gt;Managing cloud spending is one of the biggest challenges for modern enterprises. As applications scale, costs silently grow through unused resources, over-provisioned workloads, and inefficient storage patterns. AWS provides numerous tools and best practices to control and optimize spend—yet most organizations use only a small fraction of them.&lt;/p&gt;

&lt;p&gt;In this blog, I’m sharing the most effective AWS cost optimization techniques that I have personally implemented across real-world environments. These strategies are simple, practical, and deliver immediate results without compromising performance.&lt;/p&gt;

&lt;p&gt;🚀 &lt;strong&gt;1. Migrate to Graviton Instances&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS Graviton2 and Graviton3 processors offer 20–40% better price-performance compared to traditional x86 instances. They are energy-efficient and ideal for application servers, microservices, and container workloads. Migrating to Graviton is one of the easiest ways to cut EC2 compute costs significantly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mchyf0sjfi4q27rk5r5.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mchyf0sjfi4q27rk5r5.JPG" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;💰 &lt;strong&gt;2. Purchase Reserved Instances for Long-Running Workloads&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you have workloads running 24/7 (e.g., production servers, databases), Reserved Instances (RI) can cut costs by up to 72%. By committing to a 1-year or 3-year term, you get predictable and deeply discounted pricing compared to On-Demand.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmuyjm6d7wsgweewdl98x.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmuyjm6d7wsgweewdl98x.JPG" alt=" " width="690" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📦 &lt;strong&gt;3. Apply S3 Lifecycle Policies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without lifecycle policies, data sits forever in expensive S3 Standard storage. Using lifecycle rules, cold or unused data can automatically shift to cheaper tiers like Glacier, Glacier Deep Archive, or S3 Infrequent Access. This reduces storage costs dramatically for logs, backups, and infrequently accessed datasets.&lt;/p&gt;
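&lt;p&gt;A lifecycle configuration of this kind looks roughly like the following (the bucket prefix and day thresholds are illustrative; apply it with &lt;code&gt;aws s3api put-bucket-lifecycle-configuration&lt;/code&gt;):&lt;/p&gt;

```json
{
  "Rules": [
    {
      "ID": "archive-old-logs",
      "Status": "Enabled",
      "Filter": { "Prefix": "logs/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
```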

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flu8eiic196t7lches0p5.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flu8eiic196t7lches0p5.JPG" alt=" " width="800" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🐳 &lt;strong&gt;4. Apply ECR Lifecycle Policies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Amazon ECR often stores hundreds of old container images that are no longer required. Implementing ECR lifecycle rules helps delete unused tags and old image versions, keeping repositories clean and reducing unnecessary storage costs.&lt;/p&gt;
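&lt;p&gt;An ECR lifecycle policy that expires untagged images might look like this (the 14-day window is illustrative; attach it with &lt;code&gt;aws ecr put-lifecycle-policy&lt;/code&gt;):&lt;/p&gt;

```json
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire untagged images older than 14 days",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 14
      },
      "action": { "type": "expire" }
    }
  ]
}
```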

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32nmlmfuiwcwuuj8qfom.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32nmlmfuiwcwuuj8qfom.JPG" alt=" " width="800" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📊 &lt;strong&gt;5. Set Retention Policies for CloudWatch Logs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CloudWatch Logs grow quickly—and storing logs forever gets expensive. Setting a retention period (7, 30, or 90 days) ensures logs are automatically deleted based on relevance. This is essential for cost control in environments with high log volume.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffg0q9kc748ff3a3jeno4.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffg0q9kc748ff3a3jeno4.JPG" alt=" " width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;💾 &lt;strong&gt;6. Remove Unused AMIs and Snapshots&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unused AMIs and outdated snapshots accumulate over time, consuming EBS storage. Regular audits and deletion of stale snapshots help lower costs and maintain a clutter-free environment. I used a custom script to delete unused AMIs and their associated snapshots.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dak8agwq14ahdbwseb6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dak8agwq14ahdbwseb6.png" alt=" " width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🌐 &lt;strong&gt;7. Release Unused Elastic IPs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS charges for Elastic IPs that are allocated but not attached to a running instance. Releasing unused Elastic IPs prevents silent billing and keeps your network resources optimized.&lt;/p&gt;

&lt;p&gt;🔍 &lt;strong&gt;8. Rightsize EC2 Instances&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Over-provisioned EC2, RDS, or Auto Scaling Groups lead to unnecessary spending. Use AWS Compute Optimizer or CloudWatch metrics to identify resources that can be downsized. Rightsizing is often the quickest win with immediate cost reduction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7a85y047fd8pd1zwvrxw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7a85y047fd8pd1zwvrxw.png" alt=" " width="800" height="194"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🎯 &lt;strong&gt;9. Use Spot Instances for Non-Critical Workloads&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For flexible and fault-tolerant workloads, Spot Instances provide up to 90% cost savings. They are ideal for CI/CD pipelines, batch jobs, analytics workloads, and large-scale distributed tasks.&lt;/p&gt;

&lt;p&gt;📂 &lt;strong&gt;10. Enable S3 Intelligent-Tiering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;S3 Intelligent-Tiering automatically moves data between access tiers based on usage. This provides cost savings without needing manual lifecycle rules—perfect for unpredictable access patterns.&lt;/p&gt;

&lt;p&gt;💤 &lt;strong&gt;11. Shut Down Non-Prod Resources During Off-Hours&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DEV/QA environments typically run only during business hours. Automate shutdown using AWS Instance Scheduler or Lambda scripts. This alone can save 30–50% of EC2 costs for non-production environments.&lt;/p&gt;
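&lt;p&gt;A cron-driven sketch of such a shutdown script; the 08:00–20:00 UTC window and the &lt;code&gt;Environment=dev&lt;/code&gt; tag are illustrative assumptions:&lt;/p&gt;

```shell
# Stop running dev-tagged instances outside a UTC business-hours window.
hour=$(date -u +%H)
if [ "$hour" -ge 20 ] || [ "$hour" -lt 8 ]; then
  aws ec2 describe-instances \
    --filters "Name=tag:Environment,Values=dev" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].InstanceId' --output text \
    | xargs -r aws ec2 stop-instances --instance-ids
fi
```

A matching start script scheduled for the morning completes the loop; AWS Instance Scheduler packages the same idea as a managed solution.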

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqh6eb6vyzfoy7qk86kvb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqh6eb6vyzfoy7qk86kvb.png" alt=" " width="800" height="45"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🧾 &lt;strong&gt;12. Use AWS Savings Plans&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Savings Plans offer flexible, commitment-based pricing across EC2, Fargate, Lambda, and SageMaker, delivering up to 66% savings. Unlike RIs, Savings Plans automatically apply across instance families, regions, and OS types.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiom0emvcp1wiq9obbunz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiom0emvcp1wiq9obbunz.png" alt=" " width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⚖️ &lt;strong&gt;13. Optimize Load Balancers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Delete unused ALBs/NLBs, idle target groups, and low-traffic load balancers. ALBs can also be more cost-effective than NLBs for HTTP workloads.&lt;/p&gt;

&lt;p&gt;🗃️ &lt;strong&gt;14. Use Aurora Serverless or DynamoDB On-Demand&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not all workloads need permanent, provisioned databases. Serverless and on-demand modes allow you to pay only when data is actually accessed, making them perfect for variable or unpredictable loads.&lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;15. Reduce NAT Gateway Costs with VPC Endpoints&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;NAT Gateways charge per GB of data processed. Use VPC endpoints for S3 and DynamoDB to bypass NAT and significantly reduce data transfer charges—especially in data-intensive architectures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vh3a48olczhi71mw448.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vh3a48olczhi71mw448.png" alt=" " width="547" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📀 &lt;strong&gt;16. Optimize EBS Volumes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Convert GP2 volumes to GP3 to reduce cost and improve performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvv82brbvty5w1jjazb7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvv82brbvty5w1jjazb7.png" alt=" " width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Delete unattached EBS volumes&lt;/li&gt;
&lt;li&gt;Use EBS Snapshot Lifecycle Manager to automate cleanup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These small changes collectively make a big impact on long-term cost savings.&lt;/p&gt;

&lt;p&gt;📝 &lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS offers a huge toolbox for cost optimization—but without active monitoring and periodic cleanup, cloud costs quickly spiral out of control. By implementing the techniques above—Graviton migration, lifecycle policies, RI/Savings Plans, rightsizing, and storage optimization—you can achieve substantial savings while keeping your cloud environment efficient and future-ready.&lt;/p&gt;

&lt;p&gt;Cost optimization is not a one-time task; it’s a continuous &lt;strong&gt;FinOps practice&lt;/strong&gt;. Start with small improvements and build a culture where teams regularly review and optimize their cloud usage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost comparison after applying all the FinOps practices above:&lt;/strong&gt;  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwsryza3wupfye4ucmts.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwsryza3wupfye4ucmts.png" alt=" " width="800" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow me on LinkedIn: &lt;a href="http://www.linkedin.com/in/alok-shankar-55b94826" rel="noopener noreferrer"&gt;www.linkedin.com/in/alok-shankar-55b94826&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>architecture</category>
      <category>devops</category>
    </item>
    <item>
      <title>🚀 Hosting a React App on AWS Amplify with a Custom Domain</title>
      <dc:creator>alok shankar</dc:creator>
      <pubDate>Fri, 20 Jun 2025 14:44:52 +0000</pubDate>
      <link>https://dev.to/alok_shankar/hosting-a-react-app-on-aws-amplify-with-a-custom-domain-38no</link>
      <guid>https://dev.to/alok_shankar/hosting-a-react-app-on-aws-amplify-with-a-custom-domain-38no</guid>
      <description>&lt;h2&gt;
  
  
  📌 1. Introduction
&lt;/h2&gt;

&lt;p&gt;In today’s fast-paced development world, hosting frontend applications with speed, scalability, and security is essential. If you’ve built a React app and want to deploy it quickly with CI/CD and HTTPS, AWS Amplify is a perfect solution. And the best part? You can access your app using your own custom domain—even if it's hosted on a third-party DNS provider like GoDaddy.&lt;/p&gt;

&lt;p&gt;In this guide, I’ll walk through hosting a React app on AWS Amplify and linking it to a custom subdomain like &lt;a href="https://subdomain.yourdomain.com" rel="noopener noreferrer"&gt;https://subdomain.yourdomain.com&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  🌐 2. Why Use AWS Amplify to Host Frontend Apps?
&lt;/h2&gt;

&lt;p&gt;AWS Amplify is a full-stack hosting and deployment platform from AWS designed for modern web and mobile applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  ✅ Benefits of AWS Amplify for Hosting Frontend Apps:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Zero server management&lt;/li&gt;
&lt;li&gt;CI/CD integration with GitHub, GitLab, Bitbucket&lt;/li&gt;
&lt;li&gt;Global CDN for faster content delivery&lt;/li&gt;
&lt;li&gt;Free SSL certificate with automatic HTTPS&lt;/li&gt;
&lt;li&gt;Custom domain support&lt;/li&gt;
&lt;li&gt;Preview environments for every Git branch&lt;/li&gt;
&lt;li&gt;Authentication, APIs, and storage if you expand to full-stack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Amplify makes deployment as simple as connecting your GitHub repo and clicking "Deploy".&lt;/p&gt;

&lt;h2&gt;
  
  
  📦 3. Clone a Sample React App from GitHub
&lt;/h2&gt;

&lt;p&gt;Let’s get started with a simple Tailwind CSS + React frontend.&lt;/p&gt;

&lt;p&gt;📁 Step-by-step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Clone the sample repo
git clone https://github.com/alokshanhbti/amplify-react-poc.git

cd amplify-react-poc

# Install dependencies
npm install

# Run locally
npm start

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftga4gth85ql5kfz73cnw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftga4gth85ql5kfz73cnw.png" alt=" " width="773" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zehqioksjgeneoedpx8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zehqioksjgeneoedpx8.png" alt=" " width="800" height="367"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
## If you want to create your own React app:

npx create-react-app amplify-react-poc
cd amplify-react-poc

# Install TailwindCSS
npm install -D tailwindcss postcss autoprefixer
npx tailwindcss init -p

# Run the App Locally
npm start
# Your app opens in your browser at http://localhost:3000

# Then push it to GitHub:
git init
git add .
git commit -m "Initial commit"
git branch -M main
git remote add origin https://github.com/YOUR_USERNAME/amplify-react-poc.git
git push -u origin main

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🔗 Summary GitHub Repo Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;amplify-react-poc/
├── public/
├── src/
│   ├── App.js
│   ├── index.css
│   └── index.js
├── tailwind.config.js
├── postcss.config.js
├── package.json
└── README.md

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🚀 4. Deployment on AWS Amplify
&lt;/h2&gt;

&lt;h2&gt;
  
  
  🌍 Steps to Deploy:
&lt;/h2&gt;

&lt;p&gt;Go to AWS Amplify Console&lt;br&gt;
Click "Create New App"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9w1n6b0dahls5adporr.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9w1n6b0dahls5adporr.PNG" alt=" " width="800" height="75"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Choose GitHub → Connect your repo (amplify-react-poc)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61wh0n3suhmz3ah1tcuc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61wh0n3suhmz3ah1tcuc.png" alt=" " width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select the main branch and click Next&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ylxk4jge9qqtkwafc7q.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ylxk4jge9qqtkwafc7q.JPG" alt=" " width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vg2llqhzmsunguruzu9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vg2llqhzmsunguruzu9.png" alt=" " width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Amplify auto-detects React → leave build settings as-is&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy72q9hzk2ubbnh0nskj3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy72q9hzk2ubbnh0nskj3.png" alt=" " width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncabtsp3bfxgl61us9m5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncabtsp3bfxgl61us9m5.png" alt=" " width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click "Save and Deploy"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3braopf4rfk7ytch2m4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3braopf4rfk7ytch2m4.png" alt=" " width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: Check the build log if the deployment fails.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fd6tkwuuewawju6oxpi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fd6tkwuuewawju6oxpi.png" alt=" " width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There was an error during the build process, so I verified the build settings and checked the amplify.yml file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqlx680x6yqsjuycudmx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqlx680x6yqsjuycudmx.png" alt=" " width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Added the commands under preBuild section&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
        - nvm install 20.19.0
        - nvm use 20.19.0
        - node -v  
        - npm install

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
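&lt;p&gt;For reference, the preBuild commands above sit inside the frontend build spec. A minimal &lt;code&gt;amplify.yml&lt;/code&gt; for a React app might look like the sketch below (the Node version and the &lt;code&gt;build&lt;/code&gt; output directory are assumptions; match them to your project):&lt;/p&gt;

```yaml
version: 1
frontend:
  phases:
    preBuild:
      commands:
        # Pin the Node version the app was built against (example version)
        - nvm install 20.19.0
        - nvm use 20.19.0
        - node -v
        - npm install
    build:
      commands:
        - npm run build
  artifacts:
    # Create React App outputs to build/; Vite projects use dist/ instead
    baseDirectory: build
    files:
      - '**/*'
  cache:
    paths:
      - node_modules/**/*
```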



&lt;p&gt;The amplify.yml file after the update:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3tta8284z8pdle08eok.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3tta8284z8pdle08eok.png" alt=" " width="800" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F90hfxe6kfmh4znujrgdf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F90hfxe6kfmh4znujrgdf.png" alt=" " width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Deployment Completed&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3a72fqepe9r5fr4f79k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3a72fqepe9r5fr4f79k.png" alt=" " width="800" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The app is now accessible using the default Amplify URL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3yy8vuzofvlyr59k71iw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3yy8vuzofvlyr59k71iw.png" alt=" " width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🌐 5. Add a Custom Domain (e.g., from GoDaddy)
&lt;/h2&gt;

&lt;p&gt;Now let’s connect your Amplify app to a custom domain like &lt;a href="https://subdomain.yourdomain.com" rel="noopener noreferrer"&gt;https://subdomain.yourdomain.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To add a custom domain managed by a third-party DNS provider:&lt;br&gt;
Sign in to the AWS Management Console and open the Amplify console.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose the app that you want to add a custom domain to.&lt;/li&gt;
&lt;li&gt;In the navigation pane, choose Hosting, Custom domains.&lt;/li&gt;
&lt;li&gt;On the Custom domains page, choose Add domain.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7io43rybzjrtvr7gvgs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7io43rybzjrtvr7gvgs.png" alt=" " width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Enter the name of your root domain. For example, if the name of your domain is &lt;a href="https://example.com" rel="noopener noreferrer"&gt;https://example.com&lt;/a&gt;, enter example.com.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzakynr5hb3sjqrcr7df2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzakynr5hb3sjqrcr7df2.png" alt=" " width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Amplify detects that you are not using a Route 53 domain and gives you the option to create a hosted zone in Route 53.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6uttoanm75r9zyrymom1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6uttoanm75r9zyrymom1.png" alt=" " width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the hosted zone is created, add its nameservers to your domain provider.&lt;/p&gt;
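&lt;p&gt;The four nameservers can also be pulled from the CLI. Below is a small offline sketch of that extraction; the JSON and nameserver values are made-up examples standing in for real &lt;code&gt;aws route53 get-hosted-zone&lt;/code&gt; output:&lt;/p&gt;

```shell
# Stand-in for `aws route53 get-hosted-zone` output (values are examples)
printf '%s' '{"DelegationSet":{"NameServers":["ns-111.awsdns-11.com","ns-222.awsdns-22.net","ns-333.awsdns-33.org","ns-444.awsdns-44.co.uk"]}}' > /tmp/zone.json

# Extract just the nameserver hostnames to copy into your registrar (GoDaddy)
grep -o 'ns-[a-z0-9.-]*' /tmp/zone.json
```

&lt;p&gt;Against a real zone you would pipe the live CLI output in place of the sample file.&lt;/p&gt;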

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzcqcetdygdtx709hwwe1.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzcqcetdygdtx709hwwe1.PNG" alt=" " width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xchreoewwei1mi6fq7v.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xchreoewwei1mi6fq7v.PNG" alt=" " width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now add your subdomain and wait for the DNS records to be created for SSL certificate issuance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7sqpjuyh3xt608tx37z6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7sqpjuyh3xt608tx37z6.png" alt=" " width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because the domain’s nameservers now point to Route 53, no manual step is needed here; Amplify adds these records to Route 53 automatically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5c34hrr8chn5tez5rgi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5c34hrr8chn5tez5rgi.png" alt=" " width="800" height="316"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7qr1f020255qbx8s3hq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7qr1f020255qbx8s3hq.png" alt=" " width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Domain activation is complete.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvoqsra02xes43wkj68a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvoqsra02xes43wkj68a.png" alt=" " width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  🔓 6. Access the App Using Custom Domain
&lt;/h2&gt;

&lt;p&gt;After DNS propagation and SSL verification (usually &amp;lt; 1 hour):&lt;/p&gt;

&lt;p&gt;✅ Your app will be available at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://subdomain.yourdomain.com

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fau4efiw6syhj3z7vp0lq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fau4efiw6syhj3z7vp0lq.png" alt=" " width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It will be fully secured with HTTPS using a free AWS-issued SSL certificate.&lt;/p&gt;

&lt;h2&gt;
  
  
  🧰 7. Troubleshooting Steps
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🔧 Problem                        ✅ Solution
❌ Domain verification failed      Ensure correct CNAME is added in GoDaddy
⏳ Stuck in "Pending"              Use https://dnschecker.org to confirm DNS
🔒 No SSL/HTTPS                     Wait for Amplify to finish provisioning or re-add domain
🛑 404 after deployment             Confirm the subdomain is correctly mapped to your app branch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ✅ 8. Conclusion
&lt;/h2&gt;

&lt;p&gt;Hosting a React app on AWS Amplify is fast, secure, and scalable. By combining it with your own custom domain—whether it's registered on GoDaddy or any third-party DNS provider—you get a professional-grade deployment pipeline in minutes.&lt;/p&gt;

&lt;p&gt;No DevOps, no manual servers, and no complex SSL setups. &lt;/p&gt;

&lt;p&gt;Just code → push → deploy&lt;/p&gt;

&lt;p&gt;Follow me on LinkedIn: &lt;a href="http://www.linkedin.com/in/alok-shankar-55b94826" rel="noopener noreferrer"&gt;www.linkedin.com/in/alok-shankar-55b94826&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>react</category>
      <category>aws</category>
      <category>frontend</category>
    </item>
    <item>
      <title>Automate AWS CloudWatch Log Retention with Bash: Save Costs &amp; Stay Compliant</title>
      <dc:creator>alok shankar</dc:creator>
      <pubDate>Sun, 18 May 2025 14:31:16 +0000</pubDate>
      <link>https://dev.to/alok_shankar/automate-aws-cloudwatch-log-retention-with-bash-save-costs-stay-compliant-2oj6</link>
      <guid>https://dev.to/alok_shankar/automate-aws-cloudwatch-log-retention-with-bash-save-costs-stay-compliant-2oj6</guid>
      <description>&lt;p&gt;🔹 &lt;strong&gt;Introduction :&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Managing CloudWatch log groups is a critical part of maintaining operational efficiency and cost control in AWS. However, it's easy to overlook retention settings — especially when log groups are created automatically by various AWS services. Without a defined retention period, logs accumulate indefinitely, leading to increased storage costs and unnecessary clutter.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In this blog, I’ll walk through a streamlined approach to automatically detect CloudWatch log groups without a retention policy, update them to a 30-day retention period, and generate an HTML report delivered straight to your inbox.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;The solution is powered by a simple Bash script that leverages the AWS CLI and standard Linux utilities — making it easy to integrate into any DevOps workflow.&lt;/p&gt;

&lt;p&gt;Whether you're a cloud engineer trying to stay compliant or just looking to reduce AWS costs, this automated approach will save time, improve visibility, and ensure consistent log management across your environment.&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Challenges Faced in Manual Process:&lt;/strong&gt;&lt;br&gt;
Manually managing log retention policies in AWS is like trying to clean every file cabinet in a skyscraper—painful, slow, and error-prone. Some of the common problems:&lt;/p&gt;

&lt;p&gt;❌ You can't visually identify which logs lack retention&lt;br&gt;
❌ You have to click through each log group in the AWS Console&lt;br&gt;
❌ There’s no built-in notification when retention is missing&lt;br&gt;
❌ Risk of accumulating terabytes of unused logs&lt;/p&gt;

&lt;p&gt;So I thought — “Why not automate the boring stuff?”&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Benefits of Automating CloudWatch Retention Updates&lt;/strong&gt;&lt;br&gt;
Automating retention policies brings a whole bouquet of benefits:&lt;/p&gt;

&lt;p&gt;🌟 Cost Control – Say goodbye to ever-growing log storage bills&lt;br&gt;
🔍 Audit Friendly – Track what's changed, when, and how&lt;br&gt;
📧 Proactive Alerting – Get email summaries with detailed tables&lt;br&gt;
🧹 Cleaner Environment – Consistent retention policies = better hygiene&lt;br&gt;
⏱️ Time Saved – No more manual clicking or forgetfulness&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;br&gt;
Before diving in, make sure you have the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;An AWS account with access to CloudWatch&lt;/li&gt;
&lt;li&gt;IAM permissions to read and update log groups&lt;/li&gt;
&lt;li&gt;AWS CLI configured on your machine&lt;/li&gt;
&lt;li&gt;Bash shell environment (Linux or macOS)&lt;/li&gt;
&lt;li&gt;Tools like jq, sendmail, and mailutils installed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🔹 &lt;strong&gt;Step 1: Install AWS CLI&lt;/strong&gt;&lt;br&gt;
If you haven’t installed the AWS CLI yet, follow the steps below:&lt;br&gt;
&lt;code&gt;curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"&lt;br&gt;
unzip awscliv2.zip&lt;br&gt;
sudo ./aws/install&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Then configure your credentials:&lt;br&gt;
&lt;code&gt;aws configure&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
🔹 &lt;strong&gt;Step 2: Install Dependencies&lt;/strong&gt;&lt;br&gt;
You’ll also need jq for JSON parsing and sendmail/mailutils for email delivery:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo apt install jq mailutils -y&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
🔹 &lt;strong&gt;Step 3: Create IAM Policy as per below , attached to IAM role and assign that role to EC2 instance.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;You’ll need the following IAM permissions to make it work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DescribeLogGroups",
      "Effect": "Allow",
      "Action": [
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams"
      ],
      "Resource": "*"
    },
    {
      "Sid": "PutRetentionPolicy",
      "Effect": "Allow",
      "Action": "logs:PutRetentionPolicy",
      "Resource": "*"
    },
    {
      "Sid": "CloudWatchMetricsAccess",
      "Effect": "Allow",
      "Action": "cloudwatch:GetMetricStatistics",
      "Resource": "*"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Permissions include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;logs:DescribeLogGroups&lt;/li&gt;
&lt;li&gt;logs:DescribeLogStreams&lt;/li&gt;
&lt;li&gt;logs:PutRetentionPolicy&lt;/li&gt;
&lt;li&gt;cloudwatch:GetMetricStatistics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jr3eu8nzbsselb3ouye.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jr3eu8nzbsselb3ouye.JPG" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Step 4: Clone the GitHub Repository&lt;/strong&gt;&lt;br&gt;
Instead of writing the script manually, you can simply clone the prebuilt GitHub repository that includes the script, required IAM policy, and a README.&lt;br&gt;
&lt;code&gt;git clone https://github.com/alokshanhbti/cloudwatch-retention-update.git&lt;br&gt;
cd cloudwatch-retention-update&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
Inside the folder, you’ll find:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;cloudwatch-retention-update.sh – The automation script&lt;/li&gt;
&lt;li&gt;iam-policy.json – IAM policy required for permissions&lt;/li&gt;
&lt;li&gt;README.md – Full documentation and usage instructions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🔹 &lt;strong&gt;Step 5: Make the Script Executable&lt;/strong&gt;&lt;br&gt;
After cloning the repository, make the script executable with:&lt;br&gt;
&lt;code&gt;chmod +x cloudwatch-retention-update.sh&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
🔹 &lt;strong&gt;Step 6: Run the Script&lt;/strong&gt;&lt;br&gt;
Simply execute:&lt;br&gt;
&lt;code&gt;./cloudwatch-retention-update.sh&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
The script will log activity to a file, apply changes, and email the report to the address you specify.&lt;/p&gt;
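&lt;p&gt;To keep the audit running unattended, a cron entry can schedule it. The path, schedule, and log file below are examples to adapt:&lt;/p&gt;

```
# m h dom mon dow command : run the retention audit daily at 02:00
0 2 * * * /opt/cloudwatch-retention-update/cloudwatch-retention-update.sh >> /var/log/cw-retention.log
```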

&lt;p&gt;🔹 &lt;strong&gt;Step 7: Script Flow&lt;/strong&gt;&lt;br&gt;
Here’s how the script works behind the scenes:&lt;/p&gt;

&lt;p&gt;🔍 Scan CloudWatch for log groups with no retention&lt;br&gt;
🧠 Fetch metadata: log group name, retention, last event, service name, and storage&lt;br&gt;
✍️ Update retention to 30 days using put-retention-policy&lt;br&gt;
📨 Generate HTML email with two colorful tables:&lt;br&gt;
Before update&lt;br&gt;
After update&lt;br&gt;
📬 Send email via sendmail with all details&lt;/p&gt;
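&lt;p&gt;The detection step (finding groups with no retention) can be sketched offline. The JSON lines below imitate trimmed &lt;code&gt;aws logs describe-log-groups&lt;/code&gt; output; the group names are made-up examples, and the real script uses jq rather than this grep/sed approximation:&lt;/p&gt;

```shell
# Stand-in sample: one JSON object per log group (names are examples)
printf '%s\n' \
  '{"logGroupName":"/aws/lambda/fn-a","retentionInDays":14}' \
  '{"logGroupName":"/aws/lambda/fn-b"}' \
  '{"logGroupName":"/ecs/app-c"}' > /tmp/loggroups.txt

# Keep only groups that lack a retentionInDays key, then extract the name;
# these are the groups the script would pass to put-retention-policy
grep -v 'retentionInDays' /tmp/loggroups.txt \
  | sed 's/.*"logGroupName":"\([^"]*\)".*/\1/'
```

&lt;p&gt;Here this prints &lt;code&gt;/aws/lambda/fn-b&lt;/code&gt; and &lt;code&gt;/ecs/app-c&lt;/code&gt;, the two groups with no retention set.&lt;/p&gt;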

&lt;p&gt;🔹 &lt;strong&gt;Step 8: Screen shots of email and logs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Email part Before update :&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5uscli9pl62y4zt4zcl.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5uscli9pl62y4zt4zcl.JPG" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Email part After update :&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fslwoc2rla49eabcf7pkp.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fslwoc2rla49eabcf7pkp.JPG" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Logs :&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp52gc3gnovdkbm5ssjq2.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp52gc3gnovdkbm5ssjq2.JPG" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Automating CloudWatch log retention is a simple yet highly effective way to maintain a clean, cost-efficient, and compliant cloud environment. With this Bash script, you can easily identify log groups without retention settings, apply a consistent 30-day policy, and receive a well-formatted email report — all with minimal effort and zero manual intervention.&lt;/p&gt;

&lt;p&gt;This solution not only improves visibility and governance but also frees up your time to focus on higher-value tasks.&lt;/p&gt;

&lt;p&gt;Thank you for reading!&lt;br&gt;
If this script helps improve your cloud hygiene, feel free to share it with your team or contribute to the project.&lt;/p&gt;

&lt;p&gt;📂 Access the GitHub Repository Here:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/alokshanhbti" rel="noopener noreferrer"&gt;
        alokshanhbti
      &lt;/a&gt; / &lt;a href="https://github.com/alokshanhbti/cloudwatch-retention-update" rel="noopener noreferrer"&gt;
        cloudwatch-retention-update
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
cloudwatch-retention-update audits AWS CloudWatch log groups with no retention period, updates them to a 30-day retention, and sends an email report
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;📊 CloudWatch Log Retention Manager&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;code&gt;cloudwatch-retention-update.sh&lt;/code&gt; is a Bash script that audits AWS CloudWatch log groups with &lt;strong&gt;no retention period set&lt;/strong&gt;, updates them to a &lt;strong&gt;30-day retention&lt;/strong&gt;, and sends a &lt;strong&gt;HTML email report&lt;/strong&gt; containing &lt;strong&gt;color-coded tables&lt;/strong&gt;.&lt;/p&gt;




&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;🔧 Features&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;✅ Identifies log groups &lt;strong&gt;without retention&lt;/strong&gt;&lt;br&gt;
✅ Fetches &lt;strong&gt;last log date&lt;/strong&gt;, &lt;strong&gt;associated AWS service&lt;/strong&gt;, and &lt;strong&gt;storage usage (in GB)&lt;/strong&gt;&lt;br&gt;
✅ Applies a &lt;strong&gt;30-day retention policy&lt;/strong&gt;&lt;br&gt;
✅ Sends an &lt;strong&gt;HTML email&lt;/strong&gt; via &lt;code&gt;sendmail&lt;/code&gt; with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;📋 &lt;strong&gt;Before Update Table&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;After Update Table&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;📁 Script Overview&lt;/h2&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;📂 &lt;strong&gt;Log Group Scan&lt;/strong&gt; — Uses &lt;code&gt;aws logs describe-log-groups&lt;/code&gt; and &lt;code&gt;jq&lt;/code&gt; to filter targets&lt;/li&gt;
&lt;li&gt;⏳ &lt;strong&gt;Retention Status&lt;/strong&gt; — Detects &lt;code&gt;null&lt;/code&gt; retention policies&lt;/li&gt;
&lt;li&gt;📅 &lt;strong&gt;Last Log Timestamp&lt;/strong&gt; — Uses &lt;code&gt;describe-log-streams&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;💾 &lt;strong&gt;Storage Usage (GB)&lt;/strong&gt; — Uses &lt;code&gt;cloudwatch:GetMetricStatistics&lt;/code&gt; for &lt;code&gt;StoredBytes&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;📧 &lt;strong&gt;HTML Email Report&lt;/strong&gt; — Sends two HTML tables (before &amp;amp; after) with colors&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;🚀 Usage&lt;/h2&gt;

&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Step 1: Make it executable&lt;/h3&gt;

&lt;/div&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;chmod +x cloudwatch-retention-update.sh
&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Step&lt;/h3&gt;…&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/alokshanhbti/cloudwatch-retention-update" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;Happy automating! 🚀&lt;/p&gt;

&lt;p&gt;Follow me on LinkedIn: &lt;a href="http://www.linkedin.com/in/alok-shankar-55b94826" rel="noopener noreferrer"&gt;www.linkedin.com/in/alok-shankar-55b94826&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>linux</category>
      <category>cloud</category>
    </item>
  </channel>
</rss>
