<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: alok shankar</title>
    <description>The latest articles on DEV Community by alok shankar (@alok_shankar).</description>
    <link>https://dev.to/alok_shankar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3173851%2F29285227-7afb-4815-bbd9-2ea7c8b4b6ba.jpg</url>
      <title>DEV Community: alok shankar</title>
      <link>https://dev.to/alok_shankar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alok_shankar"/>
    <language>en</language>
    <item>
      <title>EKS Ingress Address Not Assigned (Application Outage)-Incident &amp; Resolution Guide</title>
      <dc:creator>alok shankar</dc:creator>
      <pubDate>Thu, 16 Apr 2026 06:06:29 +0000</pubDate>
      <link>https://dev.to/alok_shankar/eks-ingress-address-not-assigned-application-outage-incident-resolution-guide-18ed</link>
      <guid>https://dev.to/alok_shankar/eks-ingress-address-not-assigned-application-outage-incident-resolution-guide-18ed</guid>
      <description>&lt;p&gt;&lt;strong&gt;1. Introduction&lt;/strong&gt;&lt;br&gt;
In Kubernetes, applications are typically exposed internally using Services (ClusterIP, NodePort). However, for exposing applications externally in a scalable, secure, and cloud‑native manner, Kubernetes provides the concept of Ingress.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Ingress?&lt;/strong&gt;&lt;br&gt;
Ingress is a Kubernetes API object that manages external HTTP/HTTPS access to services within a cluster. It provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Layer‑7 routing (path‑based, host‑based)&lt;/li&gt;
&lt;li&gt;TLS termination&lt;/li&gt;
&lt;li&gt;Centralized traffic management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ingress works in conjunction with an Ingress Controller, which implements the actual traffic routing logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Ingress instead of NodePort / LoadBalancer?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fty5ief2jnclzjax27o62.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fty5ief2jnclzjax27o62.png" alt=" " width="594" height="187"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;In AWS EKS, the recommended production approach is Ingress with AWS Application Load Balancer (ALB) using the AWS Load Balancer Controller.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incident Overview&lt;/strong&gt;&lt;br&gt;
I encountered an issue in the EKS environment where an application became inaccessible from outside the cluster. &lt;br&gt;
Although the application pods were running and Kubernetes services were healthy, external users were unable to access the application URL.&lt;/p&gt;

&lt;p&gt;Upon investigation, it was observed that the Ingress resource was created successfully, but the ADDRESS field of the Ingress remained empty (null). &lt;br&gt;
As a result, no valid Load Balancer endpoint was available to route external traffic to the application.&lt;br&gt;
This issue closely resembled a production outage scenario, as it directly impacted external traffic routing despite the application itself being operational.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-xx-xxx-xxx-xx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl get ingress app-ingress &lt;span class="nt"&gt;-n&lt;/span&gt; ep-apps &lt;span class="nt"&gt;-o&lt;/span&gt; wide
&lt;span class="go"&gt;NAME          CLASS   HOSTS   ADDRESS   PORTS   AGE
app-ingress   alb     *                 80      10h

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Impact&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;External users could not access the application.&lt;/li&gt;
&lt;li&gt;No ALB DNS was available from the Ingress.&lt;/li&gt;
&lt;li&gt;Target Group showed 0 registered targets.&lt;/li&gt;
&lt;li&gt;Application health appeared normal internally, which made the issue non-obvious at first glance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Use Case&lt;/strong&gt;&lt;br&gt;
Business Scenario&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Application deployed in EKS.&lt;/li&gt;
&lt;li&gt;Needs to be exposed externally over HTTPS.&lt;/li&gt;
&lt;li&gt;Uses path-based routing.&lt;/li&gt;
&lt;li&gt;Requires container-level health checks.&lt;/li&gt;
&lt;/ol&gt;
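
&lt;p&gt;To make the scenario concrete, a minimal Ingress manifest for this kind of setup might look like the sketch below. This is an illustrative example only: the names, namespace, service port, and health check path are placeholders, not the exact values from this environment.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  namespace: ep-apps
  annotations:
    # ALB placement and target registration mode
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    # Listen on both HTTP and HTTPS
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
    # Container-level health check path (placeholder)
    alb.ingress.kubernetes.io/healthcheck-path: /api/health
spec:
  ingressClassName: alb
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app
            port:
              number: 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;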

&lt;p&gt;&lt;strong&gt;2. Architecture Overview&lt;/strong&gt;&lt;br&gt;
High-Level Flow&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xtgbbmvmhhff04mxklm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xtgbbmvmhhff04mxklm.png" alt=" " width="303" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Timeline of Events:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The application was deployed successfully in the EKS cluster.&lt;/li&gt;
&lt;li&gt;Pods were in Running state and passing readiness and liveness probes.&lt;/li&gt;
&lt;li&gt;Kubernetes Service (ClusterIP) showed valid endpoints.&lt;/li&gt;
&lt;li&gt;An ALB-backed Ingress was created to expose the application externally.&lt;/li&gt;
&lt;li&gt;Despite successful Ingress creation, the Ingress ADDRESS field remained empty.&lt;/li&gt;
&lt;li&gt;AWS Console showed an ALB and Target Group created, but the Target Group had zero registered targets.&lt;/li&gt;
&lt;li&gt;Because the Ingress did not publish an ADDRESS, application traffic could not reach the cluster.&lt;/li&gt;
&lt;li&gt;This resulted in an outage-like situation where the application was “up” internally but unreachable externally.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;4. Initial Observation&lt;/strong&gt;&lt;br&gt;
At a high level, everything appeared correct:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pods were healthy.
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-xx-xxx-xxx-xx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl get pods &lt;span class="nt"&gt;-A&lt;/span&gt;
&lt;span class="go"&gt;NAMESPACE           NAME                                                              READY   STATUS    RESTARTS   AGE
amazon-cloudwatch   amazon-cloudwatch-observability-controller-manager-586c44c2cclk   1/1     Running   0          7h6m
amazon-cloudwatch   cloudwatch-agent-xxx                                              1/1     Running   0          6h41m
amazon-cloudwatch   cloudwatch-agent-xxxx                                             1/1     Running   0          6h41m
amazon-cloudwatch   fluent-bit-xxxx                                                   1/1     Running   0          6h41m
amazon-cloudwatch   fluent-bit-xxxx                                                   1/1     Running   0          6h41m
external-dns        external-dns-75f7b59749-dfkgn                                     1/1     Running   0          24h
ep-apps             condition-service-96475888c-bdmdn                                 1/1     Running   0          22h
ep-apps             web-query-service-78b5d4dcb7-nms56                                1/1     Running   0          23h
ep-apps             web-query-service-78b5d4dcb7-xlfj9                                1/1     Running   0          23h
ep-apps             web-apps-59658b6868-fkwvp                                         1/1     Running   0          22h
kube-system         aws-node-4xrsc                                                    2/2     Running   0          24h
kube-system         aws-load-balancer-controller-78bddb649b-w56d5                     1/1     Running   0          24h
kube-system         aws-load-balancer-controller-78bddb649b-z5s5g                     1/1     Running   0          24h
kube-system         aws-node-ncp5f                                                    2/2     Running   0          24h
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;Service endpoints existed.&lt;/li&gt;
&lt;li&gt;Ingress configuration looked valid.&lt;/li&gt;
&lt;li&gt;ALB resources were present in AWS.&lt;/li&gt;
&lt;li&gt;However, traffic was not flowing due to the missing Ingress ADDRESS, indicating a failure in Ingress‑to‑ALB reconciliation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;5. Root Cause Analysis (What Went Wrong)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This issue was not a single problem, but a chain of configuration gaps.&lt;br&gt;
&lt;strong&gt;Root Causes Identified&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;5.1 Ingress Group Conflict&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;The TEST ingress was using the DEV ingress group name.&lt;/li&gt;
&lt;li&gt;This caused an ALB ownership conflict.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-xx-xxx-xxx-xxx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl describe ingress app-ingress &lt;span class="nt"&gt;-n&lt;/span&gt; ep-apps
&lt;span class="go"&gt;Name:             app-ingress
Labels:           app=xxx
                  app.kubernetes.io/name=app-ingress
                  app.kubernetes.io/part-of=ep
Namespace:        ep-apps
Address:
Ingress Class:    alb
&lt;/span&gt;&lt;span class="gp"&gt;Default backend:  &amp;lt;default&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="go"&gt;Rules:
  Host        Path  Backends
  ----        ----  --------
  *
              /   app:80 (xx.xx.xx.xxx:xxxx,xxx.xx.xx.xx.xxx:xxxx)
Annotations:  alb.ingress.kubernetes.io/group.name: app-dev
              alb.ingress.kubernetes.io/group.order: 100
              alb.ingress.kubernetes.io/healthcheck-interval-seconds: 30
              alb.ingress.kubernetes.io/healthcheck-path: /api/health

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;5.2 ACM Certificate Issue&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;The attached certificate was in PENDING_VALIDATION state.&lt;/li&gt;
&lt;li&gt;ALB HTTPS listener creation failed as a result.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6k7n1kq0cqn26xwjmdmg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6k7n1kq0cqn26xwjmdmg.png" alt=" " width="800" height="195"&gt;&lt;/a&gt;&lt;/p&gt;
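
&lt;p&gt;The certificate state can be confirmed from the CLI before attaching it to the Ingress; the ARN below is a placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Returns the certificate lifecycle state, e.g. ISSUED or PENDING_VALIDATION
aws acm describe-certificate \
  --certificate-arn arn:aws:acm:us-west-2:&amp;lt;account-id&amp;gt;:certificate/&amp;lt;cert-id&amp;gt; \
  --query 'Certificate.Status' \
  --output text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Anything other than ISSUED means the controller cannot create the HTTPS listener.&lt;/p&gt;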

&lt;blockquote&gt;
&lt;p&gt;5.3 Subnet Tagging Missing&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Public subnets lacked the required tags.&lt;/li&gt;
&lt;li&gt;The controller could not discover subnets for the ALB correctly.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;5.4 Broken ALB Controller Webhook&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;The aws-load-balancer-webhook service had no endpoints.&lt;/li&gt;
&lt;li&gt;This blocked creation of the TargetGroupBinding.&lt;/li&gt;
&lt;li&gt;Pod IP registration was therefore prevented.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0t8gfoyv1gtl1p72yxso.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0t8gfoyv1gtl1p72yxso.png" alt=" " width="800" height="89"&gt;&lt;/a&gt;&lt;/p&gt;
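
&lt;p&gt;A quick way to confirm this symptom is to check whether the webhook service (the same one named in the controller error logs) has any endpoints backing it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# An empty ENDPOINTS column means admission calls for
# TargetGroupBinding objects will time out
kubectl get endpoints aws-load-balancer-webhook-service -n kube-system
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;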

&lt;blockquote&gt;
&lt;p&gt;5.5 Ingress Finalizer Stuck&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;The failed reconciliation left a finalizer on the Ingress.&lt;/li&gt;
&lt;li&gt;The controller was unable to clean up its state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;6. Solution Applied&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Step-by-Step Resolution&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;6.1 Correct Ingress Group&lt;br&gt;
&lt;/p&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-xx-xxx-xxx-xx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl annotate ingress app-ingress &lt;span class="nt"&gt;-n&lt;/span&gt; ep-apps &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="go"&gt;  alb.ingress.kubernetes.io/group.name=app-test \
  --overwrite
ingress.networking.k8s.io/app-ingress annotated

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;6.2 Use ISSUED Valid ACM Certificate&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-xx-xxx-xxx-xx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl annotate ingress app-ingress &lt;span class="nt"&gt;-n&lt;/span&gt; ep-apps &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="go"&gt;  alb.ingress.kubernetes.io/certificate-arn=arn:aws:acm:us-west-2:xxxxxxxxxxx:certificate/xxxxxxxxxxxxxxxxxxxxxxx \
  --overwrite
ingress.networking.k8s.io/app-ingress annotated

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;6.3 Tag Public Subnets (Mandatory)&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="err"&gt;kubernetes.io/role/&lt;/span&gt;&lt;span class="py"&gt;elb&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;1&lt;/span&gt;
&lt;span class="err"&gt;kubernetes.io/cluster/&amp;lt;cluster-name&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="err"&gt;shared&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
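
&lt;p&gt;The tags can also be applied from the CLI; the subnet IDs and cluster name below are placeholders for your environment:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Tag the public subnets so the controller can discover them
aws ec2 create-tags \
  --resources subnet-aaaa1111 subnet-bbbb2222 \
  --tags Key=kubernetes.io/role/elb,Value=1 \
         Key=kubernetes.io/cluster/&amp;lt;cluster-name&amp;gt;,Value=shared
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;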



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl99wzkpye8p95t4ym53g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl99wzkpye8p95t4ym53g.png" alt=" " width="557" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrgdupglcg3qcpyril1v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrgdupglcg3qcpyril1v.png" alt=" " width="606" height="35"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;6.4 Allow ALB → Node Traffic (Critical for IP Mode)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Add an inbound rule on the worker node security group that allows traffic from the ALB security group on the container port.&lt;/p&gt;
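
&lt;p&gt;As a sketch, the rule can be added with the CLI; the security group IDs and container port are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Allow the ALB security group to reach pods on the container port
aws ec2 authorize-security-group-ingress \
  --group-id sg-&amp;lt;node-sg-id&amp;gt; \
  --protocol tcp \
  --port &amp;lt;container-port&amp;gt; \
  --source-group sg-&amp;lt;alb-sg-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;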

&lt;blockquote&gt;
&lt;p&gt;6.5 Remove Broken ALB Webhook&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;First, check the controller logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system deployment/aws-load-balancer-controller &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"ts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"2026-04-15T04:30:28Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"msg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Reconciler error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"controller"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"ingress"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"ep-test"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"namespace"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"ep-test"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"reconcileID"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"7ea1f646-368e-473f-b6b4-cc0a76cf4785"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Internal error occurred: failed calling webhook &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;mtargetgroupbinding.elbv2.k8s.aws&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: failed to call webhook: Post &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span 
class="s2"&gt;https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding?timeout=10s&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: context deadline exceeded"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"ts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"2026-04-15T04:30:32Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"msg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Reconciler error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"controller"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"ingress"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"search-query-service"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"namespace"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"ep-apps"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"namespace"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"ep-apps"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"search-query-service"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"reconcileID"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"970d799a-1982-4e78-9791-76daa6a54d4d"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Internal error occurred: failed calling webhook &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;mtargetgroupbinding.elbv2.k8s.aws&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: failed 
to call webhook: Post &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding?timeout=10s&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: context deadline exceeded"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-xxx-xxxx-xxxx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl get mutatingwebhookconfigurations
&lt;span class="go"&gt;NAME                                                             WEBHOOKS   AGE
amazon-cloudwatch-observability-mutating-webhook-configuration   5          18h
aws-load-balancer-webhook                                        6          11h
pod-identity-webhook                                             1          6d21h
vpc-resource-mutating-webhook                                    1          6d21h
&lt;/span&gt;&lt;span class="gp"&gt;[ec2-user@ip-xxx-xxxx-xxxx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl delete mutatingwebhookconfiguration aws-load-balancer-webhook
&lt;span class="go"&gt;mutatingwebhookconfiguration.admissionregistration.k8s.io "aws-load-balancer-webhook" deleted
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;6.6 Restart the Controller Deployment &amp;amp; Recreate the Ingress&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Restart the ALB controller&lt;br&gt;
✅ This forces the controller to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Re‑build the model&lt;/li&gt;
&lt;li&gt;Create TargetGroupBinding&lt;/li&gt;
&lt;li&gt;Register pod IPs&lt;/li&gt;
&lt;li&gt;Update ingress status
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-xx-xxx-xxx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl rollout restart deployment aws-load-balancer-controller &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system
&lt;span class="go"&gt;deployment.apps/aws-load-balancer-controller restarted
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;✅ &lt;strong&gt;7. Final Verification&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-xx-xxx-xx-xx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl get ingress app-ingress &lt;span class="nt"&gt;-n&lt;/span&gt; ep-apps &lt;span class="nt"&gt;-o&lt;/span&gt; wide
&lt;span class="go"&gt;NAME          CLASS   HOSTS   ADDRESS                                                        PORTS   AGE
app-ingress   alb     *       k8s-eptest-erfs423536-xxxxxxxxxx.us-west-2.elb.amazonaws.com   80      10h
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✅ &lt;strong&gt;8. Validation Commands&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get ingress &lt;span class="nt"&gt;-A&lt;/span&gt;
kubectl get endpoints &lt;span class="nt"&gt;-A&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system deployment/aws-load-balancer-controller
kubectl get targetgroupbinding &lt;span class="nt"&gt;-A&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;9. Final Outcome&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✅ ALB created successfully&lt;br&gt;
✅ Target Group registered Pod IPs&lt;br&gt;
✅ Health checks passed&lt;br&gt;
✅ Ingress ADDRESS populated&lt;/p&gt;

&lt;p&gt;✅ Application accessible externally over HTTPS&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10. Best Practices Checklist (Must Follow Every Time)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✅ Ingress Configuration Checklist&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Environment-specific ingress group (dev/test/prod)&lt;/li&gt;
&lt;li&gt; Valid target-type (ip or instance)&lt;/li&gt;
&lt;li&gt; Correct service name and port&lt;/li&gt;
&lt;li&gt; Health check path works from Pod&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✅ ACM Certificate Checklist&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Certificate status = ISSUED&lt;/li&gt;
&lt;li&gt; Cert region = same as ALB&lt;/li&gt;
&lt;li&gt; Domain matches DNS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✅ Subnet Checklist (CRITICAL)&lt;/p&gt;

&lt;p&gt;For internet-facing ALB&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public subnets&lt;/li&gt;
&lt;li&gt;Route to Internet Gateway&lt;/li&gt;
&lt;li&gt;Tags:
  kubernetes.io/role/elb=1
  kubernetes.io/cluster/&amp;lt;cluster-name&amp;gt;=shared&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✅ Security Group Checklist (IP Mode)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ALB SG allows inbound 80/443&lt;/li&gt;
&lt;li&gt; Node SG allows inbound from ALB SG on container port&lt;/li&gt;
&lt;li&gt; No restrictive NACLs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✅ Controller Health Checklist&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;aws-load-balancer-controller pods Running&lt;/li&gt;
&lt;li&gt; No webhook timeouts in controller logs&lt;/li&gt;
&lt;li&gt; TargetGroupBinding objects created&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;11. Key Learnings&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ALB IP mode requires explicit SG permissions&lt;/li&gt;
&lt;li&gt;Broken webhooks can silently block target registration&lt;/li&gt;
&lt;li&gt;Ingress ADDRESS updates only after full reconciliation&lt;/li&gt;
&lt;li&gt;Always validate subnet tags before troubleshooting ALB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;12. Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ingress with ALB provides a powerful, scalable, and production-ready way to expose applications in EKS.&lt;br&gt;
However, it relies on tight integration between Kubernetes and AWS infrastructure, and misalignment at any layer can lead to hard‑to‑debug issues.&lt;/p&gt;

&lt;p&gt;Following the checklists and best practices above will ensure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faster deployments&lt;/li&gt;
&lt;li&gt;Predictable behavior&lt;/li&gt;
&lt;li&gt;Reduced downtime&lt;/li&gt;
&lt;li&gt;Easier troubleshooting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy Learning &amp;amp; Reliable Kubernetes! 🚀&lt;/p&gt;

&lt;p&gt;Follow me on LinkedIn: &lt;a href="http://www.linkedin.com/in/alok-shankar-55b94826" rel="noopener noreferrer"&gt;www.linkedin.com/in/alok-shankar-55b94826&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>eks</category>
      <category>productivity</category>
    </item>
    <item>
      <title>🚨 Elasticsearch High CPU Issue Due to Memory Pressure – Real Production Incident &amp; Fix</title>
      <dc:creator>alok shankar</dc:creator>
      <pubDate>Sat, 04 Apr 2026 09:15:31 +0000</pubDate>
      <link>https://dev.to/alok_shankar/elasticsearch-high-cpu-issue-due-to-memory-pressure-real-production-incident-fix-3c8k</link>
      <guid>https://dev.to/alok_shankar/elasticsearch-high-cpu-issue-due-to-memory-pressure-real-production-incident-fix-3c8k</guid>
      <description>&lt;p&gt;&lt;strong&gt;🔍 Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Running Elasticsearch in production requires deep visibility into CPU, memory, shards, and cluster health.&lt;/p&gt;

&lt;p&gt;One of the most confusing scenarios DevOps engineers face is:&lt;/p&gt;

&lt;p&gt;⚠️ High CPU alerts, but CPU usage looks normal&lt;/p&gt;

&lt;p&gt;In this blog, I’ll walk you through a real production incident where:&lt;/p&gt;

&lt;p&gt;Elasticsearch triggered CPU alerts,&lt;br&gt;
but the actual root cause was memory pressure, shard imbalance, and node failure.&lt;/p&gt;

&lt;p&gt;We’ll cover:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Core Elasticsearch concepts&lt;/li&gt;
&lt;li&gt;Real logs and debugging steps&lt;/li&gt;
&lt;li&gt;Root cause analysis&lt;/li&gt;
&lt;li&gt;Production fix&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;📘 Important Elasticsearch Concepts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before diving into the issue, let’s understand some key building blocks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📦 How Elasticsearch Stores Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Elasticsearch stores data as documents, grouped into an index.&lt;/p&gt;

&lt;p&gt;However, when data grows large (billions/trillions of records), a single index cannot be stored efficiently on one node.&lt;/p&gt;

&lt;p&gt;🔹 What is an Index?&lt;/p&gt;

&lt;p&gt;An Index is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A collection of documents&lt;/li&gt;
&lt;li&gt;Logical partition of data&lt;/li&gt;
&lt;li&gt;Similar to a database&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;👉 Example:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;metricbeat-*&lt;/li&gt;
&lt;li&gt;.monitoring-*&lt;/li&gt;
&lt;li&gt;user-data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;🔹 What are Shards?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To scale horizontally, Elasticsearch splits an index into shards.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Each shard is a small unit of data&lt;/li&gt;
&lt;li&gt;Stored across multiple nodes&lt;/li&gt;
&lt;li&gt;Acts like a mini-index&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;⚙️ Why Shards Matter&lt;/strong&gt;&lt;br&gt;
✅ Scalability → Data distributed across nodes&lt;br&gt;
✅ Performance → Parallel query execution&lt;br&gt;
✅ Availability → Supports failover&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔁 Primary vs Replica Shards&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Primary Shard → Original data&lt;/li&gt;
&lt;li&gt;Replica Shard → Copy for fault tolerance&lt;/li&gt;
&lt;/ol&gt;
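
&lt;p&gt;You can see how primaries (p) and replicas (r) are laid out per index with the _cat API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Lists every shard with its index, prirep flag (p/r), state, and node
curl -X GET "localhost:9200/_cat/shards?v=true&amp;amp;pretty"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;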

&lt;p&gt;&lt;strong&gt;🚨 Cluster Health Status&lt;/strong&gt;&lt;br&gt;
🟢 Green → All shards assigned&lt;br&gt;
🟡 Yellow → Replica shards missing&lt;br&gt;
🔴 Red → Primary shards missing&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🧠 JVM &amp;amp; Memory Basics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Elasticsearch runs on JVM:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Heap memory is critical&lt;/li&gt;
&lt;li&gt;High usage → Garbage Collection (GC)&lt;/li&gt;
&lt;li&gt;GC → CPU spikes&lt;/li&gt;
&lt;/ol&gt;
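
&lt;p&gt;Heap pressure can be checked per node with the nodes stats API; a heap_used_percent that keeps climbing toward 75%+ usually signals GC pressure, and therefore CPU spikes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Show only heap usage percentage for each node
curl -X GET "localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.mem.heap_used_percent&amp;amp;pretty"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;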

&lt;p&gt;&lt;strong&gt;⚠️ Production Issue Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We received alerts for:&lt;/p&gt;

&lt;p&gt;🔴 High CPU usage&lt;br&gt;
⚠️ Cluster health degraded&lt;br&gt;
📉 Slow search performance&lt;/p&gt;

&lt;p&gt;📊 Investigation &amp;amp; Debugging&lt;/p&gt;

&lt;p&gt;🔍 Step 1: Cluster Health Check&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-x-x-x-x ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; GET &lt;span class="s2"&gt;"localhost:9200/_cluster/health?pretty"&lt;/span&gt;
&lt;span class="go"&gt;{
  "cluster_name" : "web-test",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 5,
  "number_of_data_nodes" : 5,
  "active_primary_shards" : 247,
  "active_shards" : 343,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 193,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 63.99253731343284
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-x-x-x-x ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; GET &lt;span class="s2"&gt;"localhost:9200/_cluster/health?filter_path=status,*_shards&amp;amp;pretty"&lt;/span&gt;
&lt;span class="go"&gt;{
  "status" : "yellow",
  "active_primary_shards" : 247,
  "active_shards" : 343,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 193,
  "delayed_unassigned_shards" : 0
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 Key Insight:&lt;/p&gt;

&lt;p&gt;193 unassigned shards → Major issue&lt;/p&gt;
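&lt;p&gt;The health output is internally consistent and worth sanity-checking. A short shell sketch, using the figures reported above, reproduces the active_shards_percent_as_number value:&lt;/p&gt;

```shell
# Sanity-check the figures from the _cluster/health output above:
# 343 active shards out of (343 + 193) total shards.
active=343
unassigned=193
total=$((active + unassigned))
echo "total shards: $total"

# awk handles the floating-point division the shell cannot do natively.
awk -v a="$active" -v t="$total" \
  'BEGIN { printf "active_shards_percent: %.2f\n", a / t * 100 }'
```

This prints a total of 536 shards and roughly 63.99 percent, matching the API's active_shards_percent_as_number.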

&lt;p&gt;🔍 Step 2: Node Resource Usage&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-x-x-x-x ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; GET &lt;span class="s2"&gt;"localhost:9200/_cat/nodes?v=true&amp;amp;s=cpu:desc&amp;amp;pretty"&lt;/span&gt;
&lt;span class="go"&gt;ip         heap.percent ram.percent cpu load_1m load_5m load_15m node.role   master name
1x.x.x.2x9           73          97   3    0.19    0.16     0.11 cdfhilmrstw -      node-5
1x.x.x.8x            77          90   2    0.03    0.06     0.03 cdfhilmrstw *      node-1
1x.x.x.x            60          84   1    0.22    0.65     0.72 cdfhilmrstw -      node-3
1x.x.x.x            46          90   1    0.03    0.06     0.01 cdfhilmrstw -      node-4
1x.x.x.x            65          91   0    0.01    0.03     0.00 cdfhilmrstw -      node-2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Observation:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;CPU: 0–3% (low)&lt;/li&gt;
&lt;li&gt;RAM: 84–97% (very high)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;👉 This is critical:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The CPU alert was misleading; the actual issue was memory pressure.&lt;/p&gt;

&lt;p&gt;🔍 Step 3: OS-Level Analysis&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;top
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-x-x-x-xx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;top
&lt;span class="go"&gt;top - 10:57:46 up 13 days, 22:42,  1 user,  load average: 0.77, 0.73, 0.60
Tasks: 114 total,   1 running,  64 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.3 us,  0.1 sy,  0.0 ni, 97.6 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  7863696 total,   744000 free,  5938932 used,  1180764 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  2202220 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3743 elastic+  20   0   48.0g   4.9g  36368 S   8.7 65.7   7078:50 java
    1 root      20   0  117520   5144   3408 S   0.0  0.1  22:27.92 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.25 kthreadd
    4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0H
    6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 mm_percpu_wq
    7 root      20   0       0      0      0 S   0.0  0.0   0:13.95 ksoftirqd/0
    8 root      20   0       0      0      0 I   0.0  0.0   2:29.56 rcu_sched
    9 root      20   0       0      0      0 I   0.0  0.0   0:00.00 rcu_bh
   10 root      rt   0       0      0      0 S   0.0  0.0   0:02.68 migration/0
   11 root      rt   0       0      0      0 S   0.0  0.0   0:01.54 watchdog/0
   12 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/0
   13 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/1
   14 root      rt   0       0      0      0 S   0.0  0.0   0:01.63 watchdog/1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Findings:&lt;/strong&gt;&lt;br&gt;
Java process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;~4.9 GB memory usage&lt;/li&gt;
&lt;li&gt;~65% system memory&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;👉 Elasticsearch was consuming most of the node's memory&lt;/p&gt;

&lt;p&gt;🔍 Step 4: JVM Memory Pressure&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; GET &lt;span class="s2"&gt;"_nodes/stats?filter_path=nodes.*.jvm.mem.pools.old"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Observation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;High old-gen memory usage&lt;/li&gt;
&lt;li&gt;Frequent GC cycles&lt;/li&gt;
&lt;/ol&gt;
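&lt;p&gt;The old-gen figures can be evaluated offline once the stats response is saved. A minimal sketch (the used_in_bytes/max_in_bytes fields match the nodes.*.jvm.mem.pools.old path queried above; the byte values here are illustrative, not from the incident cluster):&lt;/p&gt;

```shell
# Offline sketch: compute old-gen utilisation from a saved _nodes/stats
# response. Sample values are illustrative, not from the incident cluster.
cat > /tmp/old_gen.json <<'EOF'
{
  "used_in_bytes": 3865470566,
  "max_in_bytes": 4294967296
}
EOF

# Extract each field's number (each field is on its own line).
used=$(grep '"used_in_bytes"' /tmp/old_gen.json | tr -dc '0-9')
max=$(grep '"max_in_bytes"' /tmp/old_gen.json | tr -dc '0-9')
awk -v u="$used" -v m="$max" \
  'BEGIN { printf "old-gen used: %.0f%%\n", u / m * 100 }'
```

Old-gen sitting near 90% of its pool is a classic precursor to the frequent GC cycles, and GC-driven CPU spikes, observed here.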

&lt;p&gt;🔍 Step 5: Unassigned Shards Analysis&lt;/p&gt;

&lt;p&gt;Unassigned shards have a state of UNASSIGNED. The prirep value is p for primary shards and r for replicas.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; GET &lt;span class="s2"&gt;"localhost:9200/_cat/shards?v=true&amp;amp;h=index,shard,prirep,state,node,unassigned.reason&amp;amp;s=state&amp;amp;pretty"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-x-x-x-xx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; GET &lt;span class="s2"&gt;"localhost:9200/_cat/shards?v=true&amp;amp;h=index,shard,prirep,state,node,unassigned.reason&amp;amp;s=state&amp;amp;pretty"&lt;/span&gt;
&lt;span class="go"&gt;index                                                       shard  prirep     state          unassigned.reason
product_search_tab_data                                      0     r      UNASSIGNED        NODE_LEFT
metricbeat-7.10.2-2023.02.08-000024                           0     r      UNASSIGNED        NODE_LEFT
metricbeat-7.17.0-2022.12.04-000004                           0     r      UNASSIGNED        NODE_LEFT
.monitoring-es-7-mb-2023.04.16                                0     r      UNASSIGNED        REPLICA_ADDED
.monitoring-es-7-mb-2023.04.14                                0     r      UNASSIGNED        REPLICA_ADDED
apm-7.9.2-span-000002                                         0     r      UNASSIGNED        NODE_LEFT
metricbeat-7.10.2-2021.12.29-000012                           0     r      UNASSIGNED        NODE_LEFT
product_search_analytics                                     0     r      UNASSIGNED        NODE_LEFT
product_search_analytics                                     0     r      UNASSIGNED        NODE_LEFT
product_search_analytics                                     0     r      UNASSIGNED        NODE_LEFT
product_search_analytics                                     0     r      UNASSIGNED        NODE_LEFT
product_fap_model_item                                       0     r      UNASSIGNED        NODE_LEFT
metricbeat-7.10.2-2021.11.29-000011                           0     r      UNASSIGNED        NODE_LEFT
metricbeat-7.17.1-2022.12.07-000008                           0     r      UNASSIGNED        NODE_LEFT
.kibana-event-log-7.9.2-000024                                0     r      UNASSIGNED        NODE_LEFT
.kibana-event-log-7.17.1-000010                               0     r      UNASSIGNED        NODE_LEFT
.monitoring-kibana-7-2023.04.16                               0     r      UNASSIGNED        REPLICA_ADDED
.kibana-event-log-7.9.2-000026                                0     r      UNASSIGNED        INDEX_CREATED
product_fap_price                                            0     r      UNASSIGNED        NODE_LEFT
.ds-.logs-deprecation.elasticsearch-default-2022.12.12-000020 0     r      UNASSIGNED        NODE_LEFT
ilm-history-2-000025                                          0     r      UNASSIGNED        NODE_LEFT
metricbeat-7.17.1-2022.10.08-000006                           0     r      UNASSIGNED        NODE_LEFT
ilm-history-2-000023                                          0     r      UNASSIGNED        NODE_LEFT
product_product_hierarchy                                    0     r      UNASSIGNED        NODE_LEFT
product_product_hierarchy                                    0     r      UNASSIGNED        NODE_LEFT
product_product_hierarchy                                    0     r      UNASSIGNED        NODE_LEFT
product_product_hierarchy                                    0     r      UNASSIGNED        NODE_LEFT
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Finding:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;UNASSIGNED → NODE_LEFT&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;👉 Meaning:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A node left the cluster&lt;/li&gt;
&lt;li&gt;Replica shards not reassigned&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🔍 Step 6: UNASSIGNED Shard Analysis&lt;/p&gt;

&lt;p&gt;To understand why an unassigned shard is not being assigned and what action you must take to allow Elasticsearch to assign it, use the cluster allocation explanation API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; GET &lt;span class="s2"&gt;"localhost:9200/_cluster/allocation/explain?filter_path=index,node_allocation_decisions.node_name,node_allocation_decisions.deciders.*&amp;amp;pretty"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-x-x-x-xx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; GET &lt;span class="s2"&gt;"localhost:9200/_cluster/allocation/explain?filter_path=index,node_allocation_decisions.node_name,node_allocation_decisions.deciders.*&amp;amp;pretty"&lt;/span&gt;
&lt;span class="go"&gt;{
  "index" : "product_search_tab_data",
  "node_allocation_decisions" : [
    {
      "node_name" : "node-1",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node [[product_search_tab_data][0], node[EQ6QyUbhQZCZRqP78rMIIQ], [P], s[STARTED], a[id=7vBWLesZQAS4zYjt_ER2bw]]"
        },
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [10.42130719712077%]"
        }
      ]
    },
    {
      "node_name" : "node-5",
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [9.907598002066106%]"
        }
      ]
    },
    {
      "node_name" : "node-2",
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [11.010075893021023%]"
        }
      ]
    },
    {
      "node_name" : "node-3",
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [10.938318653211446%]"
        }
      ]
    },
    {
      "node_name" : "node-4",
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [12.273611767876893%]"
        }
      ]
    }
  ]
}
&lt;/span&gt;&lt;span class="gp"&gt;[ec2-user@ip-x-x-x-xx ~]$&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="go"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;🧠 Root Cause Analysis (RCA)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After correlating all logs, metrics, and cluster behavior, we identified multiple layered issues contributing to the problem.&lt;/p&gt;

&lt;p&gt;🔴 1. Large Number of Unassigned Shards&lt;br&gt;
193 shards were unassigned&lt;/p&gt;

&lt;p&gt;Majority had reason:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UNASSIGNED → NODE_LEFT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 Impact:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Continuous shard allocation attempts&lt;/li&gt;
&lt;li&gt;Increased cluster overhead&lt;/li&gt;
&lt;li&gt;Memory and thread pressure&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🔴 2. Node Failure (NODE_LEFT)&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;One or more nodes temporarily left the cluster&lt;/li&gt;
&lt;li&gt;Replica shards lost their assigned nodes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;👉 Result:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cluster moved to YELLOW state&lt;/li&gt;
&lt;li&gt;Triggered rebalancing operations&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🔴 3. Disk Watermark Threshold Breach (Critical Finding 🚨)&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;During shard allocation analysis, we found:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"search"&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"node_allocation_decisions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"node_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"node-3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"deciders"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"decider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"disk_threshold"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"decision"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NO"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"explanation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"node above low watermark (85%), free: ~7.6%"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"node_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"node-5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"deciders"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"decider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"disk_threshold"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"decision"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NO"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"explanation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"node above low watermark (85%), free: ~9.6%"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"node_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"node-4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"deciders"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"decider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"disk_threshold"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"decision"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NO"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"explanation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"node above low watermark (85%), free: ~10.7%"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 Key Insight:&lt;/p&gt;

&lt;p&gt;Elasticsearch refused to allocate shards on these nodes&lt;br&gt;
because disk usage had crossed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;cluster.routing.allocation.disk.watermark.low&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;85%&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 Actual situation:&lt;/p&gt;

&lt;p&gt;Nodes had only ~7%–10% free disk space&lt;br&gt;
Allocation decision = ❌ NO&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️ Why This Is Critical&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When disk watermark is breached:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Elasticsearch blocks shard allocation&lt;/li&gt;
&lt;li&gt;Unassigned shards remain stuck&lt;/li&gt;
&lt;li&gt;Cluster cannot rebalance&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;👉 This directly caused:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Persistent unassigned shards&lt;/li&gt;
&lt;li&gt;Memory pressure&lt;/li&gt;
&lt;li&gt;Internal retries → CPU spikes&lt;/li&gt;
&lt;/ol&gt;
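&lt;p&gt;For completeness: the watermark that blocked allocation is tunable through the cluster settings API. A hedged sketch of the request body (sent via PUT to _cluster/settings) is below. The percentages are illustrative, and raising watermarks is only a stop-gap to unblock allocation while disk is freed, never a substitute for adding capacity.&lt;/p&gt;

```json
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
  }
}
```

The first setting name is the same cluster.routing.allocation.disk.watermark.low key that appears in the allocation explain output above.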

&lt;p&gt;🔴 4. High JVM Memory Pressure&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Heap usage consistently high&lt;/li&gt;
&lt;li&gt;JVM old-gen heavily utilized&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;👉 Result:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Frequent Garbage Collection (GC)&lt;/li&gt;
&lt;li&gt;CPU spikes during GC cycles&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🔴 5. Thread Pool Pressure&lt;/p&gt;

&lt;p&gt;Even though CPU looked low:&lt;/p&gt;

&lt;p&gt;Threads were blocked due to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Allocation retries&lt;/li&gt;
&lt;li&gt;Memory pressure&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;👉 As per Elasticsearch behavior:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Thread pool exhaustion can trigger CPU-related alerts&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;🧩 Final Root Cause Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The issue was NOT just CPU-related.&lt;/p&gt;

&lt;p&gt;It was a combination of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Disk space exhaustion (Watermark breach)&lt;/li&gt;
&lt;li&gt;❌ Unassigned shards (allocation blocked)&lt;/li&gt;
&lt;li&gt;❌ Node failure (NODE_LEFT)&lt;/li&gt;
&lt;li&gt;❌ High JVM memory pressure&lt;/li&gt;
&lt;li&gt;❌ Continuous allocation retries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🛠️ Final Fix Implemented&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After complete analysis, we identified that:&lt;/p&gt;

&lt;p&gt;👉 Insufficient disk space was the primary blocker&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔧 Solution Steps&lt;/strong&gt;&lt;br&gt;
✅ 1. Increased Disk Capacity&lt;/p&gt;

&lt;p&gt;Added +50 GB storage to all Elasticsearch nodes.&lt;/p&gt;

&lt;p&gt;👉 Result:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Disk usage dropped below the watermark threshold&lt;/li&gt;
&lt;li&gt;Shard allocation resumed&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;monitoring-kibana-7-2023.04.17                               0     p      STARTED    node-5
catelog-7.9.2-span-000010                                    0     p      STARTED    node-1
catelog-7.9.2-span-000010                                    0     r      STARTED    node-3
product_fragments                                            0     p      STARTED    node-3
packetbeat-7.9.3-2023.04.14-000019                            0     p      STARTED    node-5
metricbeat-7.10.2-2022.04.14-000014                           0     p      STARTED    node-3
.ds-.logs-deprecation.elasticsearch-default-2022.09.19-000014 0     p      STARTED    node-1
.ds-ilm-history-5-2023.04.09-000028                           0     p      STARTED    node-5
catelog-7.9.2-profile-000010                                  0     p      STARTED    node-2
catelog-7.9.2-profile-000010                                  0     r      STARTED    node-3
packetbeat-7.9.3-2022.09.16-000012                            0     p      STARTED    node-2
metricbeat-7.13.3-2021.07.11-000001                           0     p      STARTED    node-2
logstash                                                      0     p      STARTED    node-3
.monitoring-es-7-mb-2023.04.12                                0     p      STARTED    node-4
.catelog-custom-link                                          0     p      STARTED    node-1
.catelog-custom-link                                          0     r      STARTED    node-3
catelog-7.9.2-metric-000015                                   0     p      STARTED    node-1
catelog-7.9.2-metric-000015                                   0     r      STARTED    node-3
catelog-7.9.2-profile-000017                                  0     r      STARTED    node-3
catelog-7.9.2-profile-000017                                  0     p      STARTED    node-5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;✅ 2. Rolling Restart&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Restarted nodes one by one (rolling restart)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;👉 Ensured:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;No downtime&lt;/li&gt;
&lt;li&gt;Safe cluster recovery&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;✅ 3. Automatic Shard Reallocation&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Elasticsearch started assigning shards automatically&lt;/li&gt;
&lt;li&gt;Cluster began stabilizing&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;🎯 Final Result&lt;/strong&gt;&lt;br&gt;
✅ Unassigned shards → 0&lt;br&gt;
✅ Cluster status → GREEN&lt;br&gt;
✅ Memory pressure reduced&lt;br&gt;
✅ CPU spikes eliminated&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;[ec2-user@ip-x-x-x-xx ~]$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; GET &lt;span class="s2"&gt;"localhost:9200/_cluster/health?pretty"&lt;/span&gt;
&lt;span class="go"&gt;{
  "cluster_name" : "web-test",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 5,
  "number_of_data_nodes" : 5,
  "active_primary_shards" : 247,
  "active_shards" : 536,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;💡 Key Learning (Very Important 🚀)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🔥 Disk space is directly linked to cluster stability in Elasticsearch&lt;/p&gt;

&lt;p&gt;Even if:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;CPU looks fine&lt;/li&gt;
&lt;li&gt;Memory seems manageable&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;👉 If disk crosses watermark:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Shards won’t allocate&lt;/li&gt;
&lt;li&gt;Cluster will degrade&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;✍️ Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This incident was a great reminder that Elasticsearch performance issues are rarely straightforward.&lt;/p&gt;

&lt;p&gt;What initially appeared as a high CPU problem turned out to be a cascading failure caused by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Disk watermark threshold breaches&lt;/li&gt;
&lt;li&gt;Unassigned shards&lt;/li&gt;
&lt;li&gt;Node failure (NODE_LEFT)&lt;/li&gt;
&lt;li&gt;JVM memory pressure&lt;/li&gt;
&lt;li&gt;Continuous shard allocation retries&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;👉 The most critical takeaway:&lt;/p&gt;

&lt;p&gt;🔥 Disk space is not just a storage concern in Elasticsearch — it directly impacts shard allocation, memory usage, and overall cluster stability.&lt;/p&gt;

&lt;p&gt;Even when CPU usage looks normal, underlying factors like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Heap pressure&lt;/li&gt;
&lt;li&gt;Disk utilization&lt;/li&gt;
&lt;li&gt;Cluster health&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;can silently degrade the system until it reaches a breaking point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🚀 Final Thoughts for DevOps Engineers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In production environments, always think beyond surface-level alerts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Don’t trust CPU metrics alone&lt;/li&gt;
&lt;li&gt;Correlate memory, disk, and cluster state&lt;/li&gt;
&lt;li&gt;Monitor unassigned shards and disk watermarks proactively&lt;/li&gt;
&lt;li&gt;Design clusters with proper shard sizing and capacity planning.&lt;/li&gt;
&lt;/ol&gt;
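&lt;p&gt;Monitoring unassigned shards proactively (point 3 above) can be sketched as a small check. In production the JSON would come from curl against _cluster/health; here a saved sample stands in so the logic is self-contained:&lt;/p&gt;

```shell
# Sketch of a proactive health check: alert when the cluster is not green
# or any shard is unassigned. A saved sample response stands in for:
#   curl -s "localhost:9200/_cluster/health"
cat > /tmp/health.json <<'EOF'
{
  "status": "yellow",
  "unassigned_shards": 193
}
EOF

status=$(grep '"status"' /tmp/health.json | sed 's/.*: *"\([a-z]*\)".*/\1/')
unassigned=$(grep '"unassigned_shards"' /tmp/health.json | tr -dc '0-9')

if [ "$status" != "green" ] || [ "$unassigned" -gt 0 ]; then
  echo "ALERT: status=$status unassigned_shards=$unassigned"
else
  echo "OK"
fi
```

With the sample above this prints an ALERT line; wiring the same check into cron or an alerting pipeline surfaces shard and watermark problems before they cascade.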

</description>
      <category>productivity</category>
      <category>devops</category>
      <category>aws</category>
      <category>elasticsearch</category>
    </item>
    <item>
      <title>🚀 Headlamp: A Modern Kubernetes UI You’ll Actually Enjoy Using</title>
      <dc:creator>alok shankar</dc:creator>
      <pubDate>Mon, 30 Mar 2026 13:54:25 +0000</pubDate>
      <link>https://dev.to/alok_shankar/headlamp-a-modern-kubernetes-ui-youll-actually-enjoy-using-3h3h</link>
      <guid>https://dev.to/alok_shankar/headlamp-a-modern-kubernetes-ui-youll-actually-enjoy-using-3h3h</guid>
      <description>&lt;p&gt;🔹 1. Introduction&lt;/p&gt;

&lt;p&gt;Managing Kubernetes clusters via CLI (kubectl) is powerful—but let’s be honest, it can get overwhelming, especially when dealing with complex workloads, debugging issues, or onboarding new team members.&lt;/p&gt;

&lt;p&gt;This is where Headlamp comes in.&lt;/p&gt;

&lt;p&gt;Headlamp is a user-friendly, extensible Kubernetes UI designed to simplify cluster management while still giving DevOps engineers deep visibility and control.&lt;/p&gt;

&lt;p&gt;👉 Think of it as:&lt;/p&gt;

&lt;p&gt;A developer-friendly Kubernetes dashboard&lt;br&gt;
A modern alternative to traditional tools&lt;br&gt;
A UI that supports plugins and extensibility&lt;/p&gt;

&lt;p&gt;🔹 2. Why Use Headlamp Over Kubernetes Dashboard?&lt;/p&gt;

&lt;p&gt;The official Kubernetes Dashboard, while simple and lightweight, has not kept pace with the needs of modern, production-grade environments. As of early 2026, it has been officially archived and is no longer maintained, and the Kubernetes community and documentation now recommend Headlamp as the preferred UI.&lt;/p&gt;

&lt;p&gt;❌ Challenges with Kubernetes Dashboard:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Complex authentication setup (token-based access)&lt;/li&gt;
&lt;li&gt;Limited debugging capabilities&lt;/li&gt;
&lt;li&gt;No plugin/extensibility support&lt;/li&gt;
&lt;li&gt;Poor UX for large-scale clusters&lt;/li&gt;
&lt;li&gt;Not actively evolving for modern DevOps needs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;✅ Why Headlamp Wins:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Simple setup and login&lt;/li&gt;
&lt;li&gt;Clean and intuitive UI&lt;/li&gt;
&lt;li&gt;Plugin-based architecture&lt;/li&gt;
&lt;li&gt;Better visibility into workloads&lt;/li&gt;
&lt;li&gt;Built-in terminal (exec into pods)&lt;/li&gt;
&lt;li&gt;Real-time logs and metrics&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;👉 In short: Headlamp is built for modern DevOps workflows.&lt;/p&gt;

&lt;p&gt;🔹 3. Headlamp vs Kubernetes Dashboard (Comparison)&lt;/p&gt;

&lt;p&gt;To clearly illustrate the differences, the following table contrasts Headlamp and the Kubernetes Dashboard across key dimensions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature/Capability&lt;/th&gt;
&lt;th&gt;Headlamp&lt;/th&gt;
&lt;th&gt;Kubernetes Dashboard&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Project Status&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Actively maintained, CNCF Sandbox project&lt;/td&gt;
&lt;td&gt;Officially archived, unmaintained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment Modes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Desktop app (Windows, Linux, Mac), in-cluster&lt;/td&gt;
&lt;td&gt;In-cluster web UI only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-Cluster Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes, via kubeconfig and context switching&lt;/td&gt;
&lt;td&gt;No, single cluster per instance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RBAC Awareness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full RBAC support, UI adapts to user permissions&lt;/td&gt;
&lt;td&gt;Basic RBAC, less granular&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CRD/Operator Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;First-class, auto-discovers and renders CRDs&lt;/td&gt;
&lt;td&gt;Limited, often breaks with CRDs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Extensibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Robust plugin system, easy customization&lt;/td&gt;
&lt;td&gt;Minimal, no plugin architecture&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resource Relationships&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Visualizes ownership and relationships&lt;/td&gt;
&lt;td&gt;Object-centric, limited relationships&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Uses kubeconfig, minimal cluster footprint&lt;/td&gt;
&lt;td&gt;Requires in-cluster service account&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;UI/UX&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Modern, clean, responsive&lt;/td&gt;
&lt;td&gt;Basic, dated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Logs &amp;amp; Exec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Integrated log viewing, pod exec, download logs&lt;/td&gt;
&lt;td&gt;Basic logs, limited exec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Community &amp;amp; Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Active, open-source, CNCF-backed&lt;/td&gt;
&lt;td&gt;Maintenance-focused; few new features&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Production Readiness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes, recommended for enterprise use&lt;/td&gt;
&lt;td&gt;Not recommended for production&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🔹 4. Key Benefits of Headlamp&lt;/p&gt;

&lt;p&gt;🚀 Developer Productivity&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visual representation of resources&lt;/li&gt;
&lt;li&gt;Faster troubleshooting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🔍 Deep Observability&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logs, events, and YAML in one place&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🔌 Extensibility&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add plugins for custom workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⚡ Faster Debugging&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exec into pods directly from the UI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🌍 Multi-cluster Management&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manage multiple clusters seamlessly&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Installation Steps for Windows and Linux
&lt;/h2&gt;

&lt;p&gt;Headlamp offers multiple installation methods, catering to both desktop and in-cluster deployments. Below are detailed, step-by-step instructions for installing Headlamp on Windows and Linux desktops, as well as in-cluster options for team-wide access.&lt;/p&gt;

&lt;h3&gt;
  
  
  Windows Desktop Installation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Option 1: Install via Winget (Recommended for Windows 10/11)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open PowerShell or Command Prompt as Administrator.&lt;/li&gt;
&lt;li&gt;Run the following command:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight batchfile"&gt;&lt;code&gt;   &lt;span class="kd"&gt;winget&lt;/span&gt; &lt;span class="kd"&gt;install&lt;/span&gt; &lt;span class="kd"&gt;headlamp&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Once installed, launch Headlamp from the Start Menu or by searching for "Headlamp".&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Option 2: Install via Chocolatey&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ensure Chocolatey is installed. If not, follow the instructions at &lt;a href="https://chocolatey.org/install" rel="noopener noreferrer"&gt;https://chocolatey.org/install&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Open PowerShell as Administrator.&lt;/li&gt;
&lt;li&gt;Run:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight batchfile"&gt;&lt;code&gt;   &lt;span class="kd"&gt;choco&lt;/span&gt; &lt;span class="kd"&gt;install&lt;/span&gt; &lt;span class="kd"&gt;headlamp&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Launch Headlamp from the Start Menu.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Option 3: Download the Installer from GitHub Releases&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Visit the &lt;a href="https://github.com/kubernetes-sigs/headlamp/releases" rel="noopener noreferrer"&gt;Headlamp GitHub Releases page&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Download the latest &lt;code&gt;.exe&lt;/code&gt; installer.&lt;/li&gt;
&lt;li&gt;Double-click the installer and follow the prompts.&lt;/li&gt;
&lt;li&gt;Launch Headlamp from the Start Menu.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Upgrading:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If installed via Winget or Chocolatey, use &lt;code&gt;winget upgrade headlamp&lt;/code&gt; or &lt;code&gt;choco upgrade headlamp&lt;/code&gt; to update.&lt;/li&gt;
&lt;li&gt;If installed via the GitHub installer, download and run the new version manually.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;First Launch:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
On first launch, Headlamp will prompt you to select a kubeconfig file or will automatically load it from &lt;code&gt;~/.kube/config&lt;/code&gt;. Select your desired cluster context to begin managing your Kubernetes environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  In-Cluster Installation (Helm and YAML)
&lt;/h3&gt;

&lt;p&gt;For team-wide, browser-based access, Headlamp can be deployed inside your Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 1: Install via Helm&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add the Headlamp Helm repository:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   helm repo add headlamp https://kubernetes-sigs.github.io/headlamp/
   helm repo update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Install Headlamp in the desired namespace (e.g., &lt;code&gt;headlamp&lt;/code&gt;):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   helm &lt;span class="nb"&gt;install &lt;/span&gt;headlamp headlamp/headlamp &lt;span class="nt"&gt;--namespace&lt;/span&gt; headlamp &lt;span class="nt"&gt;--create-namespace&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Forward the service port to your local machine:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   kubectl port-forward svc/headlamp 4466:80 &lt;span class="nt"&gt;-n&lt;/span&gt; headlamp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Access Headlamp at &lt;a href="http://localhost:4466" rel="noopener noreferrer"&gt;http://localhost:4466&lt;/a&gt; in your browser.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Option 2: Install via YAML Manifest&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Apply the official deployment YAML:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/kubernetes-sigs/headlamp/main/deployment/headlamp.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Forward the service port as above.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Authentication:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp uses Kubernetes RBAC for authentication. For secure access, create a ServiceAccount and ClusterRoleBinding as needed, and use the generated token to log in.&lt;/p&gt;
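&lt;p&gt;As a sketch, the ServiceAccount and binding can look like the manifest below. The name &lt;code&gt;headlamp-admin&lt;/code&gt; is a placeholder, and the built-in &lt;code&gt;view&lt;/code&gt; ClusterRole is used here to keep access read-only; widen the role only as your team actually needs:&lt;/p&gt;

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: headlamp-admin
  namespace: headlamp
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: headlamp-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view        # read-only built-in role; swap for a custom role if needed
subjects:
- kind: ServiceAccount
  name: headlamp-admin
  namespace: headlamp
```

&lt;p&gt;After applying the manifest, &lt;code&gt;kubectl -n headlamp create token headlamp-admin&lt;/code&gt; (Kubernetes 1.24+) prints a short-lived token you can paste into Headlamp’s login screen.&lt;/p&gt;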

&lt;h2&gt;
  
  
  How to Verify Pods and Check Workloads Using Headlamp
&lt;/h2&gt;

&lt;p&gt;One of Headlamp’s core strengths is its ability to provide clear, actionable insights into the state of your workloads. Here’s how to verify pods and check workloads step by step:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Navigating to Workloads
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Open Headlamp&lt;/strong&gt; and select your cluster context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrdk3px7s33vryocl05k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrdk3px7s33vryocl05k.png" alt=" " width="800" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the sidebar, click on &lt;strong&gt;“Workloads”&lt;/strong&gt;. This section aggregates all workload types: Deployments, StatefulSets, DaemonSets, Jobs, CronJobs, and Pods.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4y67zm56q0s25gh7syr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4y67zm56q0s25gh7syr.png" alt=" " width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Viewing Pods
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Click on &lt;strong&gt;“Pods”&lt;/strong&gt; under the Workloads section.&lt;/li&gt;
&lt;li&gt;You’ll see a table listing all pods, with columns for Name, Namespace, Status, Age, Node, and more.&lt;/li&gt;
&lt;li&gt;Status indicators (color-coded) provide at-a-glance health information: green for Running, yellow for Pending, red for Failed, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Inspecting Pod Details
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Click on a pod name to open its detail view.&lt;/li&gt;
&lt;li&gt;The detail page shows:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pod status&lt;/strong&gt; (phase, conditions, restarts)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Container statuses&lt;/strong&gt; (ready, waiting, terminated, reason)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Events&lt;/strong&gt; (recent events affecting the pod)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource usage&lt;/strong&gt; (CPU, memory, if metrics are available)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;YAML view&lt;/strong&gt; for advanced inspection or editing (if permitted)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjtyrta1mor4vnuvk92u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjtyrta1mor4vnuvk92u.png" alt=" " width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Workload Overview Dashboard
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;Workload Overview&lt;/strong&gt; provides charts and summaries of all workload types, including ready vs. total replicas, health status, and recent changes.&lt;/li&gt;
&lt;li&gt;Use filters and search to narrow down by namespace, label, or status.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Practical Example
&lt;/h3&gt;

&lt;p&gt;Suppose you want to verify that all pods in the &lt;code&gt;production&lt;/code&gt; namespace are healthy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select the &lt;code&gt;production&lt;/code&gt; namespace from the namespace dropdown.&lt;/li&gt;
&lt;li&gt;Go to Workloads → Pods.&lt;/li&gt;
&lt;li&gt;Check the Status column for any pods not in the Running state.&lt;/li&gt;
&lt;li&gt;Click on any problematic pod to view its events and logs for troubleshooting.&lt;/li&gt;
&lt;/ol&gt;
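&lt;p&gt;The same check can be mirrored from the CLI. The snippet below is a minimal sketch that assumes &lt;code&gt;kubectl&lt;/code&gt; access to a &lt;code&gt;production&lt;/code&gt; namespace; it prints a simple health verdict per pod:&lt;/p&gt;

```shell
# Map a pod STATUS column value to a simple health verdict.
classify_pod() {
  case "$1" in
    Running|Succeeded|Completed) echo "healthy" ;;
    *)                           echo "needs attention" ;;
  esac
}

# One line per pod: name, then verdict (kubectl errors are silenced here).
kubectl get pods -n production --no-headers 2>/dev/null |
while read -r name ready status rest; do
  printf '%s\t%s\n' "$name" "$(classify_pod "$status")"
done
```

&lt;p&gt;Anything flagged as “needs attention” is a candidate for the event and log inspection described in step 4.&lt;/p&gt;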

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp’s pod and workload management interface leverages standardized patterns for filtering, sorting, and metrics visualization, making it easy to spot issues and drill down for details. The consistent UI across resource types ensures a smooth learning curve and efficient operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Analyze and Download Logs in Headlamp
&lt;/h2&gt;

&lt;p&gt;Accessing and analyzing logs is critical for troubleshooting and monitoring Kubernetes workloads. Headlamp streamlines this process with integrated log viewing and download capabilities.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Accessing Pod Logs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Workloads → Pods&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click on the desired pod to open its detail view.&lt;/li&gt;
&lt;li&gt;In the pod detail page, locate the &lt;strong&gt;“Logs”&lt;/strong&gt; tab or section.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Viewing Logs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Select the container (if the pod has multiple containers) from the dropdown.&lt;/li&gt;
&lt;li&gt;Logs are streamed in real-time, with options to pause, scroll, or search within the log output.&lt;/li&gt;
&lt;li&gt;You can filter logs by time range or keywords for targeted analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3u1qvf9qph4f0o06alja.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3u1qvf9qph4f0o06alja.png" alt=" " width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fev9ajqtvopq8vg57y6ix.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fev9ajqtvopq8vg57y6ix.png" alt=" " width="800" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Downloading Logs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Click the &lt;strong&gt;“Download”&lt;/strong&gt; button (usually represented by a download icon) to save the current log output as a file.&lt;/li&gt;
&lt;li&gt;Choose the desired format (plain text or JSON, depending on implementation).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82aikn4ct9o2o84yqbvj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82aikn4ct9o2o84yqbvj.png" alt=" " width="800" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Practical Example
&lt;/h3&gt;

&lt;p&gt;If a deployment is failing, you can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Identify the failing pod in the Workloads → Pods view.&lt;/li&gt;
&lt;li&gt;Open its detail page and switch to the Logs tab.&lt;/li&gt;
&lt;li&gt;Review recent log entries for errors or stack traces.&lt;/li&gt;
&lt;li&gt;Download the logs for offline analysis or sharing with your team.&lt;/li&gt;
&lt;/ol&gt;
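&lt;p&gt;For comparison, the equivalent CLI steps can be sketched as follows. The label selector &lt;code&gt;app=web&lt;/code&gt; is a hypothetical example; substitute your failing deployment’s labels:&lt;/p&gt;

```shell
# Derive a log filename from kubectl's "pod/NAME" output form.
logfile_name() { echo "${1#pod/}.log"; }   # e.g. pod/web-0 becomes web-0.log

# Save the last hour of logs from every pod matching the selector.
for pod in $(kubectl get pods -n production -l app=web -o name 2>/dev/null); do
  kubectl logs -n production "$pod" --since=1h --timestamps > "$(logfile_name "$pod")"
done
```

&lt;p&gt;The resulting files can then be shared or diffed offline, just like logs downloaded from the Headlamp UI.&lt;/p&gt;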

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp’s log viewer eliminates the need for running &lt;code&gt;kubectl logs&lt;/code&gt; commands or SSHing into nodes. The ability to stream, filter, and download logs directly from the UI accelerates troubleshooting and supports collaborative debugging.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Execute Commands in Running Pods (Runtime Pod Exec)
&lt;/h2&gt;

&lt;p&gt;Executing commands inside running containers is often necessary for debugging, inspecting file systems, or running diagnostics. Headlamp provides a secure, RBAC-aware interface for runtime pod exec.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Accessing Pod Exec
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Workloads → Pods&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click on the target pod to open its detail view.&lt;/li&gt;
&lt;li&gt;Look for the &lt;strong&gt;“Exec”&lt;/strong&gt; or &lt;strong&gt;“Terminal”&lt;/strong&gt; button (often represented by a terminal icon).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Opening a Terminal
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Click the Exec/Terminal button.&lt;/li&gt;
&lt;li&gt;A web-based terminal session opens, connected to the selected container.&lt;/li&gt;
&lt;li&gt;You can run shell commands (e.g., &lt;code&gt;ls&lt;/code&gt;, &lt;code&gt;cat&lt;/code&gt;, &lt;code&gt;env&lt;/code&gt;, &lt;code&gt;top&lt;/code&gt;) as if you were inside the container.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv5933u2vcet0p9rolg8y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv5933u2vcet0p9rolg8y.png" alt=" " width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhhs6rozj9lwatx359sj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhhs6rozj9lwatx359sj.png" alt=" " width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Security and Permissions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The Exec feature is only available if your RBAC permissions allow it.&lt;/li&gt;
&lt;li&gt;If you lack the necessary permissions, the button will be hidden or disabled.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Practical Example
&lt;/h3&gt;

&lt;p&gt;To debug a misbehaving application:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open the pod’s terminal via Exec.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;ps aux&lt;/code&gt; to inspect running processes.&lt;/li&gt;
&lt;li&gt;Check configuration files or logs within the container.&lt;/li&gt;
&lt;li&gt;Exit the session when done.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp’s runtime exec feature brings the power of &lt;code&gt;kubectl exec&lt;/code&gt; to the browser, with RBAC enforcement and auditability. This reduces the need for direct node access and supports secure, efficient debugging workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Debugging with Headlamp
&lt;/h2&gt;

&lt;p&gt;Effective debugging in Kubernetes requires visibility into events, resource relationships, and traces. Headlamp offers a suite of tools to support comprehensive debugging:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Events and Relationships
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Events:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Each resource detail page includes a list of recent Kubernetes events affecting that resource (e.g., pod scheduled, container crash, image pull errors). Events are color-coded by severity and timestamped for easy correlation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resource Relationships:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp visualizes ownership and dependency chains, such as which ReplicaSet owns a Pod, or which Service routes to which Pods. This helps trace issues across controllers and workloads.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Traces and Metrics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Traceloop Plugin:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
For advanced debugging, the Inspektor Gadget Traceloop plugin can be integrated, providing syscall traces for pods. This acts as a “flight data recorder,” capturing system calls before and after crashes for post-mortem analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Metrics Integration:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp can display resource usage metrics (CPU, memory) for nodes and pods, aiding in performance troubleshooting.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Practical Debugging Workflow
&lt;/h3&gt;

&lt;p&gt;Suppose a deployment is experiencing intermittent pod restarts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open the deployment’s detail page to view related ReplicaSets and Pods.&lt;/li&gt;
&lt;li&gt;Inspect events for crash loops or scheduling errors.&lt;/li&gt;
&lt;li&gt;View pod logs for error messages.&lt;/li&gt;
&lt;li&gt;Use the Exec feature to inspect the container’s environment.&lt;/li&gt;
&lt;li&gt;If available, use the Traceloop plugin to analyze syscalls leading up to the crash.&lt;/li&gt;
&lt;/ol&gt;
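&lt;p&gt;Step 2 of this workflow can also be approximated from the CLI. The helper below flags pods whose restart count exceeds a threshold; the &lt;code&gt;production&lt;/code&gt; namespace and threshold of 3 are illustrative:&lt;/p&gt;

```shell
# Print names from "name restart-count" lines where the count exceeds MAX.
flag_restarts() {   # usage: ... | flag_restarts MAX
  awk -v max="$1" '$2+0 > max {print $1}'
}

# Feed kubectl's name/restart-count columns through the filter.
kubectl get pods -n production --no-headers \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount \
  2>/dev/null | flag_restarts 3
```

&lt;p&gt;Pods printed here are the ones whose events and logs deserve a closer look in steps 2 and 3.&lt;/p&gt;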

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
By consolidating events, logs, metrics, and resource relationships in one UI, Headlamp enables rapid root cause analysis and reduces mean time to resolution (MTTR) for production incidents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Precautions While Using Headlamp
&lt;/h2&gt;

&lt;p&gt;While Headlamp is designed with security and usability in mind, there are important precautions and best practices to follow:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. RBAC and Access Control
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Principle of Least Privilege:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Always assign the minimum necessary permissions to users and service accounts accessing Headlamp. Use Kubernetes RBAC to restrict actions by role, namespace, or resource type.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ServiceAccount Tokens:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
For in-cluster deployments, generate dedicated ServiceAccounts and ClusterRoleBindings for Headlamp access. Avoid using cluster-admin tokens for regular users.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;UI Controls Reflect Permissions:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp hides or disables UI controls (edit, delete, exec) if the user lacks the corresponding RBAC permissions, reducing the risk of unauthorized actions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
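&lt;p&gt;A least-privilege setup might start from a read-only ClusterRole like the sketch below. The name and resource list are illustrative; trim them to what your team actually needs:&lt;/p&gt;

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: headlamp-readonly
rules:
- apiGroups: ["", "apps", "batch"]
  resources:
    - pods
    - pods/log
    - services
    - events
    - deployments
    - replicasets
    - statefulsets
    - daemonsets
    - jobs
    - cronjobs
  verbs: ["get", "list", "watch"]
```

&lt;p&gt;Bind this role to a dedicated ServiceAccount rather than handing out &lt;code&gt;cluster-admin&lt;/code&gt; tokens.&lt;/p&gt;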

&lt;h3&gt;
  
  
  2. Network Exposure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Restrict External Access:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Do not expose Headlamp (or any Kubernetes dashboard) directly to the public internet. Use VPNs, IP allowlists, or network policies to restrict access to trusted networks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TLS and Ingress:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
When exposing Headlamp via Ingress, always enable TLS/HTTPS and use trusted certificates. Consider integrating with identity providers for Single Sign-On (SSO).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Audit Logging
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kubernetes API Auditing:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
All actions performed via Headlamp are executed through the Kubernetes API and are subject to API server audit logging. Ensure audit logs are enabled and retained according to compliance requirements.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No Native UI Audit Log:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp itself does not maintain a separate audit log of UI actions. Rely on Kubernetes API audit logs for forensic analysis.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Plugin Security
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Review Plugins Carefully:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Only install plugins from trusted sources. Review plugin code and permissions, as plugins can extend or modify UI behavior.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sandboxing:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Desktop deployments are isolated to the local machine, reducing risk. In-cluster deployments should be monitored for plugin activity.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Upgrades and Maintenance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep Headlamp Updated:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Regularly update Headlamp to the latest version to receive security patches and new features. Use package managers (Winget, Chocolatey, Flatpak) or Helm for managed upgrades.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor for Vulnerabilities:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Subscribe to Headlamp’s GitHub repository or community channels for security advisories and updates.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
By adhering to these precautions, organizations can safely leverage Headlamp’s capabilities while minimizing security risks and maintaining compliance with best practices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Plugins and Extensibility
&lt;/h2&gt;

&lt;p&gt;One of Headlamp’s defining features is its extensible plugin system, which empowers users and organizations to tailor the UI to their unique workflows and requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. What Can Plugins Do?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Custom Dashboards:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Build specialized pages with visualizations, metrics, or business logic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resource Extensions:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Add custom sections, actions, or views to existing Kubernetes resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;External Integrations:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Connect Headlamp to monitoring tools (Prometheus, Grafana), CI/CD systems, or cost management platforms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Branding and Theming:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Apply custom themes, logos, or UI components to match organizational branding.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automation and Workflows:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Implement organization-specific automation, such as bulk actions or approval workflows.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Official and Community Plugins
&lt;/h3&gt;

&lt;p&gt;Headlamp maintains a repository of official plugins, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prometheus:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Adds Prometheus-powered charts to workload detail views.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;cert-manager:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
UI for managing cert-manager resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flux, Karpenter, KEDA, Knative, Minikube, Opencost:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Integrations for popular Kubernetes operators and tools.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI Assistant:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Integrates AI capabilities directly into Headlamp.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plugin Catalog:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Enables one-click installation of plugins from within the desktop app.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Community plugins are available for tools like Trivy (vulnerability scanning), Kyverno (policy management), Kubescape (security scanning), and more.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Developing Plugins
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Framework:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Plugins are developed using TypeScript/React and the Headlamp plugin API.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Development Workflow:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Use the &lt;code&gt;@kinvolk/headlamp-plugin&lt;/code&gt; CLI for scaffolding, building, and packaging plugins.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Distribution:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Plugins can be distributed via Artifact Hub or internal repositories.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Documentation:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Comprehensive guides and examples are available in the Headlamp documentation and plugins repository.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Practical Example: Adding a Prometheus Chart
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Install the Prometheus plugin from the Plugin Catalog.&lt;/li&gt;
&lt;li&gt;Configure Prometheus in your cluster.&lt;/li&gt;
&lt;li&gt;View real-time metrics charts in workload detail pages.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The plugin system transforms Headlamp from a static dashboard into a customizable platform, enabling organizations to innovate and adapt as their Kubernetes environments evolve.&lt;/p&gt;




&lt;h2&gt;
  
  
  RBAC, CRD Handling, and Multi-Cluster Support
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. RBAC (Role-Based Access Control)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fine-Grained Permissions:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp enforces Kubernetes RBAC policies, ensuring users only see and perform actions they are authorized for.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;UI Adaptation:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The UI dynamically adapts to the user’s permissions, hiding or disabling controls as appropriate.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Namespace and Cluster Scope:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Permissions can be restricted at the namespace or cluster level, supporting multi-tenant and secure environments.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. CRD (Custom Resource Definition) Handling
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Auto-Discovery:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp automatically detects and renders CRDs, displaying their custom schemas and status fields.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Operator Support:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Works seamlessly with operator-driven clusters, supporting tools like Argo CD, Prometheus Operator, and custom controllers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plugin Extensions:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Plugins can extend the UI for specialized CRDs, providing tailored dashboards or management interfaces.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Multi-Cluster Support
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unified Management:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Manage multiple clusters from a single Headlamp instance, switching contexts via the cluster switcher.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kubeconfig Integration:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The desktop app uses your kubeconfig file, supporting any number of clusters and contexts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;In-Cluster Federation:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
In-cluster deployments can be configured to access multiple clusters if API access is permitted.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
These capabilities make Headlamp suitable for complex, enterprise-grade environments, supporting secure, scalable, and operator-driven Kubernetes operations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Troubleshooting Common Issues and Verification After Install
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Installation Verification
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Desktop App:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
On launch, ensure Headlamp loads your kubeconfig and displays available clusters. If clusters are missing, check the kubeconfig path and permissions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;In-Cluster Deployment:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
After deploying Headlamp, verify the deployment and service status:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  kubectl get deploy &lt;span class="nt"&gt;-n&lt;/span&gt; headlamp
  kubectl get svc &lt;span class="nt"&gt;-n&lt;/span&gt; headlamp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ensure pods are running and the service is accessible via port-forward, NodePort, or Ingress.&lt;/p&gt;
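&lt;p&gt;For a quick local check, port-forwarding works well; the service name and port below assume a default install and may differ in your cluster:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward -n headlamp svc/headlamp 8080:80
# Then open http://localhost:8080 in a browser
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;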

&lt;h3&gt;
  
  
  2. Authentication Issues
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Access Denied:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
If you see “Access Denied” errors, verify your ServiceAccount token and RBAC permissions. Ensure the token is valid and has the necessary roles.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Missing Resources:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
If resources are missing from the UI, check RBAC policies and namespace filters.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
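&lt;p&gt;A quick way to confirm whether RBAC is the culprit is &lt;code&gt;kubectl auth can-i&lt;/code&gt;, impersonating the ServiceAccount Headlamp uses (the namespace and ServiceAccount name below are examples):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Check what the Headlamp ServiceAccount is allowed to do
kubectl auth can-i list pods --as=system:serviceaccount:headlamp:headlamp
kubectl auth can-i get deployments -n default --as=system:serviceaccount:headlamp:headlamp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;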

&lt;h3&gt;
  
  
  3. Log and Exec Failures
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Logs Not Loading:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Ensure the Headlamp service has network access to the Kubernetes API and that your RBAC permissions include &lt;code&gt;get&lt;/code&gt; and &lt;code&gt;list&lt;/code&gt; on pods/logs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Exec Not Working:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Confirm that your role includes the &lt;code&gt;pods/exec&lt;/code&gt; verb. Some environments restrict exec for security reasons.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
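&lt;p&gt;A minimal sketch of the RBAC rules needed for logs and exec is shown below; the Role name and namespace are illustrative, and note that &lt;code&gt;pods/exec&lt;/code&gt; is authorized via the &lt;code&gt;create&lt;/code&gt; verb:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: headlamp-logs-exec   # example name
  namespace: default
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;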

&lt;h3&gt;
  
  
  4. Plugin Issues
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plugin Not Loading:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Check plugin compatibility with your Headlamp version. Review plugin logs for errors.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;UI Glitches:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Clear browser cache or restart the desktop app. Ensure all dependencies are up to date.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Port Forwarding Problems
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stuck in Pending:&lt;/strong&gt;
If port forwarding is stuck, verify that your RBAC permissions include &lt;code&gt;pods/portforward&lt;/code&gt; and &lt;code&gt;services/portforward&lt;/code&gt;. Check for port conflicts or network restrictions.&lt;/li&gt;
&lt;/ul&gt;
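&lt;p&gt;The corresponding rules fragment (to be merged into your Role or ClusterRole) looks roughly like this; port forwarding is authorized via the &lt;code&gt;create&lt;/code&gt; verb on the subresources:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;rules:
- apiGroups: [""]
  resources: ["pods/portforward", "services/portforward"]
  verbs: ["create"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;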

&lt;h3&gt;
  
  
  6. Metrics Not Displayed
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus Integration:&lt;/strong&gt;
Ensure Prometheus is installed and accessible in your cluster. Configure the Prometheus plugin as needed.&lt;/li&gt;
&lt;/ul&gt;
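&lt;p&gt;If Prometheus is not yet present, one common way to install it is via the community Helm chart (release name and namespace below are examples):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;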

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Most issues stem from RBAC misconfigurations, network restrictions, or plugin compatibility. Careful review of logs, permissions, and documentation resolves the majority of problems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Integration with Monitoring and Observability Tools
&lt;/h2&gt;

&lt;p&gt;Headlamp can be integrated with popular monitoring and observability stacks to provide comprehensive visibility into cluster health and performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Prometheus
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plugin Integration:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The Prometheus plugin adds charts and metrics to workload detail views, displaying CPU, memory, and custom metrics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Configuration:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Deploy Prometheus in your cluster and configure the plugin to point to the Prometheus endpoint.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Grafana
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;External Dashboards:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
While Headlamp does not natively embed Grafana dashboards, it can link to external Grafana instances for advanced visualization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Observability Stack:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Combine Headlamp with Prometheus and Grafana for a full-featured observability solution, leveraging exporters, scrapers, and dashboards.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Third-Party Integrations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plugins:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Community plugins are available for tools like Opencost (cost monitoring), Trivy (vulnerability scanning), and Kubescape (security compliance).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Custom Plugins:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Develop custom plugins to integrate with proprietary monitoring or alerting systems.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp’s extensibility ensures it can fit into any observability stack, providing both built-in and customizable monitoring capabilities.&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitations, Caveats, and Best Practices for Production Use
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Not a Full Replacement for kubectl:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
While Headlamp covers most day-to-day operations, advanced automation, scripting, and bulk operations are still best handled via the CLI.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No Built-In Audit Log:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp relies on Kubernetes API audit logs for action tracking. There is no separate UI audit log.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Metrics Visualization:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp provides basic metrics visualization but does not match the depth of dedicated tools like Grafana.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plugin Development Overhead:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Specialized CRDs or workflows may require custom plugin development, which involves TypeScript/React expertise.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;RBAC Complexity:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Fine-tuning RBAC for large teams can be complex; misconfigurations may lead to incomplete UI or access issues.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Best Practices
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use RBAC to Enforce Least Privilege:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Regularly review and update roles and bindings to minimize risk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Restrict Network Exposure:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Never expose Headlamp directly to the internet without strong authentication and network controls.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor and Audit Usage:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Enable and retain Kubernetes API audit logs for compliance and incident response.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep Headlamp and Plugins Updated:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Regularly update to the latest versions to benefit from security patches and new features.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test Plugins in Staging:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Validate new plugins in a non-production environment before rolling out to production clusters.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
By understanding these limitations and following best practices, organizations can maximize Headlamp’s benefits while mitigating risks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Community, Contribution, and Official Resources
&lt;/h2&gt;

&lt;p&gt;Headlamp is a vibrant, community-driven project with active development and support channels.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Community Involvement
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Open Source:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
100% open source under Apache 2.0 License.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Contribution:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Contributions are welcome via GitHub pull requests, plugin development, documentation, and issue reporting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Community Channels:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;#headlamp channel in Kubernetes Slack&lt;/li&gt;
&lt;li&gt;Monthly community meetings&lt;/li&gt;
&lt;li&gt;GitHub Discussions and Issues&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Official Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Documentation:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Comprehensive user and developer documentation at &lt;a href="https://headlamp.dev/docs/latest/" rel="noopener noreferrer"&gt;headlamp.dev/docs&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GitHub Repository:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Source code, releases, and contribution guidelines at &lt;a href="https://github.com/kubernetes-sigs/headlamp" rel="noopener noreferrer"&gt;github.com/kubernetes-sigs/headlamp&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plugins Repository:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Official and community plugins at &lt;a href="https://github.com/headlamp-k8s/plugins" rel="noopener noreferrer"&gt;github.com/headlamp-k8s/plugins&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Artifact Hub:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Discover and install plugins via &lt;a href="https://artifacthub.io/packages/search?kind=0&amp;amp;org=headlamp-k8s" rel="noopener noreferrer"&gt;Artifact Hub&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Active community engagement ensures Headlamp remains relevant, secure, and feature-rich, with rapid response to issues and evolving requirements.&lt;/p&gt;




&lt;h2&gt;
  
  
  When to Use Headlamp vs kubectl and CLI Automation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Use Headlamp When:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You need a visual overview of cluster health and resource relationships.&lt;/li&gt;
&lt;li&gt;Managing or troubleshooting workloads, pods, and services interactively.&lt;/li&gt;
&lt;li&gt;Onboarding new team members or collaborating with less technical users.&lt;/li&gt;
&lt;li&gt;Viewing and managing CRDs and operator-driven resources.&lt;/li&gt;
&lt;li&gt;Integrating with monitoring, cost, or security tools via plugins.&lt;/li&gt;
&lt;li&gt;Managing multiple clusters from a single interface.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Use kubectl and CLI Automation When:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Performing advanced scripting, automation, or CI/CD integration.&lt;/li&gt;
&lt;li&gt;Executing bulk operations or custom workflows.&lt;/li&gt;
&lt;li&gt;Managing infrastructure as code (GitOps).&lt;/li&gt;
&lt;li&gt;Debugging at the API or YAML level.&lt;/li&gt;
&lt;li&gt;Handling edge cases or experimental features not yet supported in the UI.&lt;/li&gt;
&lt;/ul&gt;
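&lt;p&gt;For instance, a bulk clean-up like the following is a one-liner in the CLI but has no single-click UI equivalent (the namespace is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Delete all Failed pods in a namespace in one shot
kubectl delete pods -n staging --field-selector=status.phase=Failed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;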

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Headlamp and kubectl are complementary tools. Headlamp excels at visualization, day-to-day management, and collaboration, while kubectl remains indispensable for automation, scripting, and advanced operations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Headlamp represents the next generation of Kubernetes UIs: user-friendly, extensible, and aligned with the realities of modern, production-grade clusters. By combining a clean, intuitive interface with powerful features like multi-cluster management, robust RBAC support, CRD visibility, and a thriving plugin ecosystem, Headlamp empowers teams to manage Kubernetes with confidence and efficiency.&lt;/p&gt;

&lt;p&gt;Whether you are a DevOps engineer troubleshooting a production incident, a developer deploying your first application, or a platform team managing dozens of clusters, Headlamp provides the insight, control, and flexibility you need. Its open-source foundation, active community, and commitment to extensibility ensure that it will continue to evolve alongside Kubernetes itself.&lt;/p&gt;

&lt;p&gt;👉 If you’re a DevOps Engineer, SRE, or Cloud Architect, Headlamp can significantly improve your workflow.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Resolving Kubernetes Production Pod Failure Due to EFS Mount &amp; Memory Exhaustion</title>
      <dc:creator>alok shankar</dc:creator>
      <pubDate>Wed, 17 Dec 2025 17:37:18 +0000</pubDate>
      <link>https://dev.to/alok_shankar/resolving-kubernetes-production-pod-failure-due-to-efs-mount-memory-exhaustion-4ma</link>
      <guid>https://dev.to/alok_shankar/resolving-kubernetes-production-pod-failure-due-to-efs-mount-memory-exhaustion-4ma</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In production Kubernetes environments, storage-related issues combined with resource exhaustion can lead to cascading pod failures. In this incident, a critical JMSQ backend pod reached 100% memory utilization, and subsequent pod recreation attempts failed. An associated efs-start pod was also impacted, resulting in Persistent Volume mount failures. This blog walks through the real-world troubleshooting approach, the commands used, the root cause analysis, and the final fix applied in an Amazon EKS cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem Statement&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Symptoms Observed&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Application pod stuck in Pending / ContainerCreating state&lt;/li&gt;
&lt;li&gt;Continuous FailedMount errors&lt;/li&gt;
&lt;li&gt;Dependent pods unable to start&lt;/li&gt;
&lt;li&gt;Heap dump PVC backed by AWS EFS failing to mount&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Error Message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Warning FailedMount kubelet Unable to attach or mount volumes:
unmounted volumes=[heapdump-volume], unattached volumes=[heapdump-volume]:
timed out waiting for the condition
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This clearly indicated a Persistent Volume mount issue, most likely related to EFS connectivity from worker nodes.&lt;/p&gt;
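&lt;p&gt;Filtering cluster events is a fast way to gauge how widespread the mount failures are (the namespace is the one from this incident):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get events -n prdq --sort-by=.lastTimestamp | grep FailedMount
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;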

&lt;p&gt;&lt;strong&gt;2️⃣ Failure Flow (What Went Wrong)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;JMSQ Pod Memory Spike
│
▼
Pod Restart Triggered
│
▼
Scheduler selects OLD Node
│
▼
Node has STALE EFS mount
│
▼
EFS Volume Mount Timeout
│
▼
Pod stuck in ContainerCreating
│
▼
FailedMount errors flood Events
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-Step Troubleshooting &amp;amp; Fix&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Identify the Impacted Pod&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Command used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe pod jmsq-deployment-34253e -n prdq

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Events:
  Type     Reason       Age                  From     Message
  ----     ------       ----                 ----     -------
  Warning  FailedMount  110s (x30 over 67m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[heapdump-volume], unattached volumes=[heapdump-volume]: timed out waiting for the condition

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this step?&lt;/strong&gt;&lt;br&gt;
kubectl describe provides detailed pod-level events, container states, volume mounts, and scheduling information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Findings:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pod stuck in ContainerCreating&lt;/li&gt;
&lt;li&gt;Volume heapdump-volume failed to mount&lt;/li&gt;
&lt;li&gt;Storage backend: PersistentVolumeClaim (PVC)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Verify efs-start Pod &amp;amp; Storage Health&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl exec -it efs-start-23sed-34e -n prdefs -- /bin/sh

#Inside the pod:
df -h

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Filesystem                Size      Used Available Use% Mounted on
overlay                 128.0G      5.2G    122.8G   4% /
tmpfs                    64.0M         0     64.0M   0% /dev
tmpfs                     7.6G         0      7.6G   0% /sys/fs/cgroup
fs-xxxxx.efs.us-west-2.amazonaws.com:/
                          8.0E      5.3G      8.0E   0% /persistentvolumes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this step?&lt;/strong&gt;&lt;br&gt;
To ensure that EFS itself was reachable and mounted correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observation:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;EFS filesystem was mounted&lt;/li&gt;
&lt;li&gt;No disk space exhaustion&lt;/li&gt;
&lt;li&gt;Indicates node-level EFS connectivity issue, not EFS service failure&lt;/li&gt;
&lt;/ol&gt;
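&lt;p&gt;A stale NFS mount typically hangs rather than fails outright, so on a suspect worker node (via SSH or SSM) a time-bounded check is useful: a healthy mount returns instantly, a stale one hits the timeout. The mount path below is the one from this setup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# List NFS4 mounts on the node
mount -t nfs4

# df against the EFS mount point with a 5-second timeout
timeout 5 df -h /persistentvolumes || echo "EFS mount appears stale"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;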

&lt;p&gt;&lt;strong&gt;Step 3: Inspect Persistent Volumes (PV)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                              STORAGECLASS    
pvc-xxxxxxxx-xxxxx-xxxxxxxxx-xxxxxxxxxxx   101Gi      RWX            Delete           Bound    prd/jmsq-volume                      aws-efs      
pvc-xxxxxxxx-xxxxx-xxxxxxxxx-xxxxxxxxxxx   100Gi      RWO            Delete           Bound    prd/solr-index-volume                slow-local   
pvc-xxxxxxxx-xxxxx-xxxxxxxxx-xxxxxxxxxxx   100Gi      RWO            Delete           Bound    prd/heapdump-volume                  aws-efs 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Purpose:&lt;/strong&gt;&lt;br&gt;
Validate whether the PV associated with the heapdump PVC is in a healthy Bound state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;heapdump-volume PV was Bound&lt;/li&gt;
&lt;li&gt;Backed by aws-efs storage class&lt;/li&gt;
&lt;li&gt;This ruled out PVC misconfiguration.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Verify Storage Classes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get storageclass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE
aws-efs         example.com/aws-efs     Delete          Immediate        
fast-local      kubernetes.io/aws-ebs   Delete          Immediate                
slow-local      kubernetes.io/aws-ebs   Delete          Immediate        
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;br&gt;
Confirms dynamic provisioning behavior and backend type.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Storage Class:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;aws-efs → RWX, dynamic provisioning, Immediate binding&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Check Persistent Volume Claims (PVC)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pvc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME           STATUS   VOLUME           CAPACITY   ACCESS MODES   STORAGECLASS   
Cloudnexus     Bound    pvc-XXXXXXXXXX   256Gi      RWO            slow-local     
Cloudjenkin    Bound    pvc-xxxxxxxxxx   64Gi       RWO            slow-local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 6: Inspect Worker Nodes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Observation:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;All nodes appeared Ready&lt;/li&gt;
&lt;li&gt;However, older nodes likely had stale or broken EFS mount connections&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 7: Cordon Existing Nodes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for X in $(kubectl get nodes | grep "^ip" | awk '{print $1}'); do \
kubectl cordon $X; \
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ec2-user@ip-xxxxxxxx ~]$ for X in $(kubectl get nodes | grep "^ip" | awk '{print $1}'); do kubectl cordon $X;done
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why cordon nodes?&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Prevents new workloads from scheduling on problematic nodes&lt;/li&gt;
&lt;li&gt;Allows isolation of faulty infrastructure&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Verify post cordon nodes:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ec2-user@ip-xxxxxxxx ~]$ kubectl get node
NAME                                          STATUS                     ROLES    AGE    VERSION
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 8: Scale Auto Scaling Group to Add New Nodes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;New EKS worker node was automatically provisioned via ASG&lt;/li&gt;
&lt;li&gt;New node established fresh EFS mount connections&lt;/li&gt;
&lt;/ol&gt;
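&lt;p&gt;Scaling the node group can be done from the console or with the AWS CLI; the ASG name and capacity below are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws autoscaling set-desired-capacity \
  --auto-scaling-group-name eks-prod-workers \
  --desired-capacity 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;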

&lt;p&gt;Verification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ec2-user@ip-xxxxxxxx ~]$ kubectl get node
NAME                                          STATUS                     ROLES    AGE    VERSION
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready                      &amp;lt;none&amp;gt;   50s   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready                      &amp;lt;none&amp;gt;   50s   v1.22.14-eks-fb459a0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;New node joined cluster in Ready state&lt;/li&gt;
&lt;li&gt;Healthy EFS connectivity&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 9: Restart Impacted Deployments&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl rollout restart deployment &amp;lt;deployment-name&amp;gt; -n &amp;lt;namespace&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pods scheduled only on new healthy nodes&lt;/li&gt;
&lt;li&gt;EFS volumes mounted successfully&lt;/li&gt;
&lt;li&gt;Application recovered without data loss&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Root Cause Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Long-running worker nodes developed stale EFS mount connections&lt;/li&gt;
&lt;li&gt;JMSQ pod memory exhaustion triggered repeated restarts&lt;/li&gt;
&lt;li&gt;Kubernetes attempted to reuse unhealthy nodes&lt;/li&gt;
&lt;li&gt;EFS mount timeout prevented PVC attachment&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Final Fix Applied&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✔ Cordoned impacted nodes&lt;br&gt;
✔ Added fresh worker nodes via Auto Scaling Group&lt;br&gt;
✔ Restarted deployments to force rescheduling&lt;br&gt;
✔ Restored healthy EFS mounts and pod stability&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3️⃣ Recovery Flow (Fix Applied)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Detect FailedMount Errors
│
▼
Cordon Old Worker Nodes
│
▼
Auto Scaling Group adds NEW Node
│
▼
New Node establishes fresh EFS mount
│
▼
Rollout Restart Deployment
│
▼
Pods scheduled on healthy node
│
▼
Application fully recovered
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Learnings &amp;amp; Best Practices&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Always check Events section in kubectl describe pod&lt;/li&gt;
&lt;li&gt;PVC Bound ≠ storage is usable (node-level issues matter)&lt;/li&gt;
&lt;li&gt;Periodically rotate EKS worker nodes&lt;/li&gt;
&lt;li&gt;Monitor memory usage to avoid JVM heap exhaustion&lt;/li&gt;
&lt;li&gt;Use rollout restart instead of deleting pods manually&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This case study demonstrates how storage + node health issues can silently break Kubernetes workloads even when cluster objects appear healthy. A structured troubleshooting approach — starting from pod events, moving to storage, and finally node isolation — helped resolve the production outage efficiently with minimal risk.&lt;/p&gt;

&lt;p&gt;If you are running stateful workloads on EKS with EFS, proactive node lifecycle management and monitoring are critical to avoid similar failures.&lt;/p&gt;

&lt;p&gt;Happy Learning &amp;amp; Reliable Kubernetes! 🚀&lt;/p&gt;

&lt;p&gt;Follow me on LinkedIn: &lt;a href="http://www.linkedin.com/in/alok-shankar-55b94826" rel="noopener noreferrer"&gt;www.linkedin.com/in/alok-shankar-55b94826&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>aws</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Investigating &amp; Resolving High-CPU Alerts in Kubernetes Pods</title>
      <dc:creator>alok shankar</dc:creator>
      <pubDate>Mon, 15 Dec 2025 15:20:58 +0000</pubDate>
      <link>https://dev.to/alok_shankar/investigating-resolving-high-cpu-alerts-in-kubernetes-pods-m0a</link>
      <guid>https://dev.to/alok_shankar/investigating-resolving-high-cpu-alerts-in-kubernetes-pods-m0a</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
Recently, I faced a production issue where observability tools flagged sustained CPU utilization &amp;gt;95% on a particular pod in Kubernetes. Investigation revealed the Java process was hitting the pod’s 3-core CPU limit, even though the node had spare capacity, pointing to application-level saturation. &lt;/p&gt;

&lt;p&gt;Using kubectl and in-container diagnostics, I confirmed the JVM as the source. &lt;/p&gt;

&lt;p&gt;In this post, I’ll walk through the step-by-step process: how I diagnosed it, the safe remediation &lt;strong&gt;(increasing pod CPU limits and optionally scaling replicas)&lt;/strong&gt;, and the follow-up JVM and query checks to prevent recurrence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Goals&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Confirm the alert (pod and node metrics).&lt;/li&gt;
&lt;li&gt;Determine whether the node or the pod (application) caused the high CPU.&lt;/li&gt;
&lt;li&gt;Identify what inside the pod is CPU-hot (process / JVM threads / GC / queries).&lt;/li&gt;
&lt;li&gt;Apply safe remediation and verify.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — Confirm pod metrics (kubectl top)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl top pod webapp-deployment-rfc4f -n stgapp

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME                                          CPU(cores)   MEMORY(bytes)
webapp-deployment-rfc4f                         2863m        2662Mi

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Values:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CPU = 2863m  (≈ 2.86 cores)  → round ≈ 2.9 cores&lt;br&gt;
Memory = 2662Mi (≈ 2.6 GB)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2 — Check the pod's resource limits (deployment spec)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get deployment webapp-deployment-rfc4f -n stgapp -o yaml | grep -A5 resources

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resources:
  limits:
    cpu: "3"
    memory: 3Gi
  requests:
    cpu: "3"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interpretation:&lt;/strong&gt; The pod is consuming ~2.86 cores — very close to its configured CPU limit.&lt;/p&gt;

&lt;p&gt;CPU request = 3 cores&lt;br&gt;
CPU limit = 3 cores&lt;br&gt;
The pod is configured to have 3 cores; pod usage ~2.86 cores explains the alert (&amp;gt;95% of 3 cores).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3 — Confirm node health (is it node or pod that’s saturated?)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl top node

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME                                       CPU(cores)  CPU%   MEMORY(bytes)   MEMORY%
ip-xxxxxxxxxx.us-west-2.compute.internal   122m         1%     5575Mi          37%
ip-xxxxxxxxxx.us-west-2.compute.internal   181m         2%     9653Mi          65%
ip-xxxxxxxxxx.us-west-2.compute.internal   86m          1%     7030Mi          47%
ip-xxxxxxxxxx.us-west-2.compute.internal   3045m        39%    7057Mi          47%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ip-xxxxxxxxxx.us-west-2.compute.internal    3045m        39%    7057Mi                    47%
... other nodes show low CPU %

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4 — Check process level inside the pod&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We exec'd into the pod and listed processes.&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl exec -it webapp-deployment-rfc4f -n stgapp-- ps aux --sort=-%cpu | head -20

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
app_run+      46  225 17.1 8475108 2745168 pts/0 Sl+  04:42 129:33 /usr/lib/jvm/
app_run+       1  0.0  0.0   2664   960 pts/0    Ss   04:42   0:00 /usr/bin/tini
... (other minor processes)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interpretation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Java process (PID 46) is the heavy consumer — observed at ~225% CPU.&lt;/p&gt;

&lt;p&gt;This continuous high CPU usage from Java explains the pod-level metric.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We also obtained thread count:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Command :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl exec -it webapp-deployment-rfc4f -n stgapp -- bash -c "ps -eLf | grep java | wc -l"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;180

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interpretation:&lt;/strong&gt; ~180 Java threads, a large thread count for a Java service.&lt;/p&gt;
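&lt;p&gt;To find out which threads are burning the CPU, per-thread usage from &lt;code&gt;top -H&lt;/code&gt; can be matched against a &lt;code&gt;jstack&lt;/code&gt; dump. A sketch: it assumes &lt;code&gt;jstack&lt;/code&gt; ships in the image, and the thread id 12345 is only an example.&lt;/p&gt;

```shell
# Per-thread CPU usage for the Java process (PID 46 from ps above).
kubectl exec webapp-deployment-rfc4f -n stgapp -- top -b -H -n 1 -p 46 | head -20

# jstack prints native thread ids in hex (nid=0x...), so convert the hot
# thread id reported by top to hex before searching the dump.
printf '%x\n' 12345
# prints 3039

kubectl exec webapp-deployment-rfc4f -n stgapp -- jstack 46 | grep -A 10 'nid=0x3039'
```

The stack frames around the matching `nid` show what that hot thread was actually doing (GC, a busy loop, a slow query, etc.).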

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The pod was configured with a CPU limit of 3 cores.&lt;/li&gt;
&lt;li&gt;The Java search process consistently consumed most of that (observed 225%–293% at different times).&lt;/li&gt;
&lt;li&gt;Node had spare capacity → this was application-level saturation (the application needed more CPU than allotted).&lt;/li&gt;
&lt;li&gt;No evidence of node-level resource pressure or cgroup throttling preventing the pod from running; the pod simply used its quota.&lt;/li&gt;
&lt;/ol&gt;
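&lt;p&gt;Point 4 can be verified directly: the kernel exposes throttling counters in the container's cgroup. A sketch; the path below is for cgroup v2, while cgroup v1 keeps the file under &lt;code&gt;/sys/fs/cgroup/cpu/cpu.stat&lt;/code&gt;.&lt;/p&gt;

```shell
# Dump the CPU throttling counters from inside the pod (cgroup v2 path).
kubectl exec webapp-deployment-rfc4f -n stgapp -- cat /sys/fs/cgroup/cpu.stat

# Extract just the throttled-period count; a value that keeps growing
# between samples means the kernel is enforcing the CPU limit.
kubectl exec webapp-deployment-rfc4f -n stgapp -- cat /sys/fs/cgroup/cpu.stat \
  | awk '$1 == "nr_throttled" { print $2 }'
```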

&lt;p&gt;&lt;strong&gt;Solution implemented:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Below are two safe approaches:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A — Increase pod CPU limit (vertical fix)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the application legitimately needs more CPU (sustained), increase limits.cpu. Example: raise limit from 3 → 4 or 6 cores.&lt;/p&gt;

&lt;p&gt;Command to update the deployment (Deployment-managed pods are replaced via a rolling update, so the change is non-disruptive):&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl set resources deployment/webapp-deployment-rfc4f -n stgapp \
  --limits=cpu=4 --requests=cpu=3

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or edit YAML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl edit deployment/webapp-deployment-rfc4f -n stgapp 
# update resources: limits.cpu to "4"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get deployment webapp-deployment-rfc4f -n stgapp -o yaml | grep -A5 resources
# and
kubectl top pod webapp-deployment-rfc4f -n stgapp 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected result:&lt;/p&gt;

&lt;p&gt;The pod has a larger CPU quota. If the load stays the same, CPU% (relative to the limit) will drop and the alert will recover.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option B — Scale replicas (horizontal fix)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If requests can be load-balanced across replicas, scale out to reduce per-pod load:&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl scale deployment webapp-deployment-rfc4f -n stgapp --replicas=2

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify:&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods -n stgapp  -l app=webapp-deployment-rfc4f -o wide
kubectl top pods -n stgapp 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected result:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Per-pod CPU load drops (if the incoming workload is split across replicas).&lt;/p&gt;

&lt;p&gt;In our case we applied both fixes: raised the CPU limit to 4 cores and scaled to 2 replicas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Increase CPU limit:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl set resources deployment/webapp-deployment-rfc4f -n stgapp \
  --limits=cpu=4 --requests=cpu=3

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scale replicas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl scale deployment webapp-deployment-rfc4f -n stgapp --replicas=2

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get deployment webapp-deployment-rfc4f -n stgapp -o yaml | sed -n '/resources:/,+6p'
kubectl get pods -n stgapp -l app=webapp-deployment-rfc4f -o wide
kubectl top pod -n stgapp 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Post-fix verification:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Deployment resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get deployment webapp-deployment-rfc4f -n stgapp -o yaml | grep -A5 resources

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resources:
  limits:
    cpu: "4"
    memory: 3Gi
  requests:
    cpu: "3"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Alert triggered because pod CPU usage ≈ 2.86 cores against a 3-core limit (&amp;gt;95% sustained usage).&lt;/li&gt;
&lt;li&gt;Investigation steps: kubectl top pod, kubectl top node, ps inside pod, check deployment resources.&lt;/li&gt;
&lt;li&gt;Root cause: Java search process saturating the pod CPU (application-level).&lt;/li&gt;
&lt;li&gt;Remediation: Increase CPU limit (vertical), or scale replicas (horizontal), and investigate hot threads/slow queries for permanent fixes.&lt;/li&gt;
&lt;/ol&gt;
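&lt;p&gt;To keep per-pod load bounded without manual intervention, the horizontal fix can be automated with a HorizontalPodAutoscaler. A minimal sketch; the replica bounds and the 70% target are illustrative, not values from this incident:&lt;/p&gt;

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa
  namespace: stgapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp-deployment-rfc4f
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Note that the utilization target is measured against the pod's CPU request, not its limit.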

&lt;p&gt;Follow me on LinkedIn: &lt;a href="http://www.linkedin.com/in/alok-shankar-55b94826" rel="noopener noreferrer"&gt;www.linkedin.com/in/alok-shankar-55b94826&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>java</category>
      <category>kubernetes</category>
      <category>aws</category>
    </item>
    <item>
      <title>Cloud FinOps in Action: How I Saved Thousands by Optimizing AWS Architectures</title>
      <dc:creator>alok shankar</dc:creator>
      <pubDate>Sun, 23 Nov 2025 16:36:41 +0000</pubDate>
      <link>https://dev.to/alok_shankar/cloud-finops-in-action-how-i-saved-thousands-by-optimizing-aws-architectures-36bk</link>
      <guid>https://dev.to/alok_shankar/cloud-finops-in-action-how-i-saved-thousands-by-optimizing-aws-architectures-36bk</guid>
      <description>&lt;p&gt;Managing cloud spending is one of the biggest challenges for modern enterprises. As applications scale, costs silently grow through unused resources, over-provisioned workloads, and inefficient storage patterns. AWS provides numerous tools and best practices to control and optimize spend—yet most organizations use only a small fraction of them.&lt;/p&gt;

&lt;p&gt;In this blog, I’m sharing the most effective AWS cost optimization techniques that I have personally implemented across real-world environments. These strategies are simple, practical, and deliver immediate results without compromising performance.&lt;/p&gt;

&lt;p&gt;🚀 &lt;strong&gt;1. Migrate to Graviton Instances&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS Graviton2 and Graviton3 processors offer 20–40% better price-performance compared to traditional x86 instances. They are energy-efficient and ideal for application servers, microservices, and container workloads. Migrating to Graviton is one of the easiest ways to cut EC2 compute costs significantly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mchyf0sjfi4q27rk5r5.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mchyf0sjfi4q27rk5r5.JPG" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;💰 &lt;strong&gt;2. Purchase Reserved Instances for Long-Running Workloads&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you have workloads running 24/7 (e.g., production servers, databases), Reserved Instances (RI) can cut costs by up to 72%. By committing to a 1-year or 3-year term, you get predictable and deeply discounted pricing compared to On-Demand.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmuyjm6d7wsgweewdl98x.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmuyjm6d7wsgweewdl98x.JPG" alt=" " width="690" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📦 &lt;strong&gt;3. Apply S3 Lifecycle Policies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without lifecycle policies, data sits forever in expensive S3 Standard storage. Using lifecycle rules, cold or unused data can automatically shift to cheaper tiers like Glacier, Glacier Deep Archive, or S3 Infrequent Access. This reduces storage costs dramatically for logs, backups, and infrequently accessed datasets.&lt;/p&gt;
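&lt;p&gt;A lifecycle configuration of this kind looks roughly like the following (the bucket prefix and day thresholds are illustrative; apply it with &lt;code&gt;aws s3api put-bucket-lifecycle-configuration&lt;/code&gt;):&lt;/p&gt;

```json
{
  "Rules": [
    {
      "ID": "archive-old-logs",
      "Status": "Enabled",
      "Filter": { "Prefix": "logs/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
```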

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flu8eiic196t7lches0p5.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flu8eiic196t7lches0p5.JPG" alt=" " width="800" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🐳 &lt;strong&gt;4. Apply ECR Lifecycle Policies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Amazon ECR often stores hundreds of old container images that are no longer required. Implementing ECR lifecycle rules helps delete unused tags and old image versions, keeping repositories clean and reducing unnecessary storage costs.&lt;/p&gt;
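&lt;p&gt;An ECR lifecycle policy that expires untagged images might look like this (the 14-day window is illustrative; attach it with &lt;code&gt;aws ecr put-lifecycle-policy&lt;/code&gt;):&lt;/p&gt;

```json
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire untagged images older than 14 days",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 14
      },
      "action": { "type": "expire" }
    }
  ]
}
```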

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32nmlmfuiwcwuuj8qfom.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32nmlmfuiwcwuuj8qfom.JPG" alt=" " width="800" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📊 &lt;strong&gt;5. Set Retention Policies for CloudWatch Logs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CloudWatch Logs grow quickly—and storing logs forever gets expensive. Setting a retention period (7, 30, or 90 days) ensures logs are automatically deleted based on relevance. This is essential for cost control in environments with high log volume.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffg0q9kc748ff3a3jeno4.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffg0q9kc748ff3a3jeno4.JPG" alt=" " width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;💾 &lt;strong&gt;6. Remove Unused AMIs and Snapshots&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unused AMIs and outdated snapshots accumulate over time, consuming EBS storage. Regular audits and deletion of stale snapshots help lower costs and maintain a clutter-free environment. I used a custom script to delete unused AMIs and their associated snapshots.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dak8agwq14ahdbwseb6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dak8agwq14ahdbwseb6.png" alt=" " width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🌐 &lt;strong&gt;7. Release Unused Elastic IPs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS charges for Elastic IPs that are allocated but not attached to a running instance. Releasing unused Elastic IPs prevents silent billing and keeps your network resources optimized.&lt;/p&gt;

&lt;p&gt;🔍 &lt;strong&gt;8. Rightsize EC2 Instances&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Over-provisioned EC2, RDS, or Auto Scaling Groups lead to unnecessary spending. Use AWS Compute Optimizer or CloudWatch metrics to identify resources that can be downsized. Rightsizing is often the quickest win with immediate cost reduction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7a85y047fd8pd1zwvrxw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7a85y047fd8pd1zwvrxw.png" alt=" " width="800" height="194"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🎯 &lt;strong&gt;9. Use Spot Instances for Non-Critical Workloads&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For flexible and fault-tolerant workloads, Spot Instances provide up to 90% cost savings. They are ideal for CI/CD pipelines, batch jobs, analytics workloads, and large-scale distributed tasks.&lt;/p&gt;

&lt;p&gt;📂 &lt;strong&gt;10. Enable S3 Intelligent-Tiering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;S3 Intelligent-Tiering automatically moves data between access tiers based on usage. This provides cost savings without needing manual lifecycle rules—perfect for unpredictable access patterns.&lt;/p&gt;

&lt;p&gt;💤 &lt;strong&gt;11. Shut Down Non-Prod Resources During Off-Hours&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DEV/QA environments typically run only during business hours. Automate shutdown using AWS Instance Scheduler or Lambda scripts. This alone can save 30–50% of EC2 costs for non-production environments.&lt;/p&gt;
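&lt;p&gt;A cron-driven sketch of such a shutdown script; the 08:00–20:00 UTC window and the &lt;code&gt;Environment=dev&lt;/code&gt; tag are illustrative assumptions:&lt;/p&gt;

```shell
# Stop running dev-tagged instances outside a UTC business-hours window.
hour=$(date -u +%H)
if [ "$hour" -ge 20 ] || [ "$hour" -lt 8 ]; then
  aws ec2 describe-instances \
    --filters "Name=tag:Environment,Values=dev" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].InstanceId' --output text \
    | xargs -r aws ec2 stop-instances --instance-ids
fi
```

A matching start script scheduled for the morning completes the loop; AWS Instance Scheduler packages the same idea as a managed solution.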

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqh6eb6vyzfoy7qk86kvb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqh6eb6vyzfoy7qk86kvb.png" alt=" " width="800" height="45"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🧾 &lt;strong&gt;12. Use AWS Savings Plans&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Savings Plans offer flexible, commitment-based pricing across EC2, Fargate, Lambda, and SageMaker, delivering up to 66% savings. Unlike RIs, Savings Plans automatically apply across instance families, regions, and OS types.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiom0emvcp1wiq9obbunz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiom0emvcp1wiq9obbunz.png" alt=" " width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⚖️ &lt;strong&gt;13. Optimize Load Balancers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Delete unused ALBs/NLBs, idle target groups, and low-traffic load balancers. ALBs can also be more cost-effective than NLBs for HTTP workloads.&lt;/p&gt;

&lt;p&gt;🗃️ &lt;strong&gt;14. Use Aurora Serverless or DynamoDB On-Demand&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not all workloads need permanent, provisioned databases. Serverless and on-demand modes allow you to pay only when data is actually accessed, making them perfect for variable or unpredictable loads.&lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;15. Reduce NAT Gateway Costs with VPC Endpoints&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;NAT Gateways charge per GB of data processed. Use VPC endpoints for S3 and DynamoDB to bypass NAT and significantly reduce data transfer charges—especially in data-intensive architectures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vh3a48olczhi71mw448.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vh3a48olczhi71mw448.png" alt=" " width="547" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📀 &lt;strong&gt;16. Optimize EBS Volumes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Convert GP2 volumes to GP3 to reduce cost and improve performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvv82brbvty5w1jjazb7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvv82brbvty5w1jjazb7.png" alt=" " width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Delete unattached EBS volumes&lt;/li&gt;
&lt;li&gt;Use EBS Snapshot Lifecycle Manager to automate cleanup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These small changes collectively make a big impact on long-term cost savings.&lt;/p&gt;

&lt;p&gt;📝 &lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS offers a huge toolbox for cost optimization—but without active monitoring and periodic cleanup, cloud costs quickly spiral out of control. By implementing the techniques above—Graviton migration, lifecycle policies, RI/Savings Plans, rightsizing, and storage optimization—you can achieve substantial savings while keeping your cloud environment efficient and future-ready.&lt;/p&gt;

&lt;p&gt;Cost optimization is not a one-time task; it’s a continuous &lt;strong&gt;FinOps practice&lt;/strong&gt;. Start with small improvements and build a culture where teams regularly review and optimize their cloud usage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost comparison after applying all the FinOps practices above:&lt;/strong&gt;  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwsryza3wupfye4ucmts.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwsryza3wupfye4ucmts.png" alt=" " width="800" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow me on LinkedIn: &lt;a href="http://www.linkedin.com/in/alok-shankar-55b94826" rel="noopener noreferrer"&gt;www.linkedin.com/in/alok-shankar-55b94826&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>architecture</category>
      <category>devops</category>
    </item>
    <item>
      <title>🚀 Hosting a React App on AWS Amplify with a Custom Domain</title>
      <dc:creator>alok shankar</dc:creator>
      <pubDate>Fri, 20 Jun 2025 14:44:52 +0000</pubDate>
      <link>https://dev.to/alok_shankar/hosting-a-react-app-on-aws-amplify-with-a-custom-domain-38no</link>
      <guid>https://dev.to/alok_shankar/hosting-a-react-app-on-aws-amplify-with-a-custom-domain-38no</guid>
      <description>&lt;h2&gt;
  
  
  📌 1. Introduction
&lt;/h2&gt;

&lt;p&gt;In today’s fast-paced development world, hosting frontend applications with speed, scalability, and security is essential. If you’ve built a React app and want to deploy it quickly with CI/CD and HTTPS, AWS Amplify is a perfect solution. And the best part? You can access your app using your own custom domain—even if it's hosted on a third-party DNS provider like GoDaddy.&lt;/p&gt;

&lt;p&gt;In this guide, I’ll walk through hosting a React app on AWS Amplify and linking it to a custom subdomain like &lt;a href="https://subdomain.yourdomain.com" rel="noopener noreferrer"&gt;https://subdomain.yourdomain.com&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  🌐 2. Why Use AWS Amplify to Host Frontend Apps?
&lt;/h2&gt;

&lt;p&gt;AWS Amplify is a full-stack hosting and deployment platform from AWS designed for modern web and mobile applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  ✅ Benefits of AWS Amplify for Hosting Frontend Apps:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Zero server management&lt;/li&gt;
&lt;li&gt;CI/CD integration with GitHub, GitLab, Bitbucket&lt;/li&gt;
&lt;li&gt;Global CDN for faster content delivery&lt;/li&gt;
&lt;li&gt;Free SSL certificate with automatic HTTPS&lt;/li&gt;
&lt;li&gt;Custom domain support&lt;/li&gt;
&lt;li&gt;Preview environments for every Git branch&lt;/li&gt;
&lt;li&gt;Authentication, APIs, and storage if you expand to full-stack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Amplify makes deployment as simple as connecting your GitHub repo and clicking "Deploy".&lt;/p&gt;

&lt;h2&gt;
  
  
  📦 3. Clone a Sample React App from GitHub
&lt;/h2&gt;

&lt;p&gt;Let’s get started with a simple Tailwind CSS + React frontend.&lt;/p&gt;

&lt;p&gt;📁 Step-by-step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Clone the sample repo
git clone https://github.com/alokshanhbti/amplify-react-poc.git

cd amplify-react-poc

# Install dependencies
npm install

# Run locally
npm start

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftga4gth85ql5kfz73cnw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftga4gth85ql5kfz73cnw.png" alt=" " width="773" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zehqioksjgeneoedpx8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zehqioksjgeneoedpx8.png" alt=" " width="800" height="367"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
## If you want to create your own React app:

npx create-react-app amplify-react-poc
cd amplify-react-poc

# Install TailwindCSS
npm install -D tailwindcss postcss autoprefixer
npx tailwindcss init -p

# Run the App Locally
npm start
# Your app opens in your browser at http://localhost:3000

# Then push it to GitHub:
git init
git add .
git commit -m "Initial commit"
git branch -M main
git remote add origin https://github.com/YOUR_USERNAME/amplify-react-poc.git
git push -u origin main

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🔗 Summary GitHub Repo Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;amplify-react-poc/
├── public/
├── src/
│   ├── App.js
│   ├── index.css
│   └── index.js
├── tailwind.config.js
├── postcss.config.js
├── package.json
└── README.md

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🚀 4. Deployment on AWS Amplify
&lt;/h2&gt;

&lt;h2&gt;
  
  
  🌍 Steps to Deploy:
&lt;/h2&gt;

&lt;p&gt;Go to AWS Amplify Console&lt;br&gt;
Click "Create New App"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9w1n6b0dahls5adporr.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9w1n6b0dahls5adporr.PNG" alt=" " width="800" height="75"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Choose GitHub → Connect your repo (amplify-react-poc)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61wh0n3suhmz3ah1tcuc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61wh0n3suhmz3ah1tcuc.png" alt=" " width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select the main branch and click Next&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ylxk4jge9qqtkwafc7q.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ylxk4jge9qqtkwafc7q.JPG" alt=" " width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vg2llqhzmsunguruzu9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vg2llqhzmsunguruzu9.png" alt=" " width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Amplify auto-detects React → leave build settings as-is&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy72q9hzk2ubbnh0nskj3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy72q9hzk2ubbnh0nskj3.png" alt=" " width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncabtsp3bfxgl61us9m5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncabtsp3bfxgl61us9m5.png" alt=" " width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click "Save and Deploy"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3braopf4rfk7ytch2m4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3braopf4rfk7ytch2m4.png" alt=" " width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: Check the build log if the deployment fails.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fd6tkwuuewawju6oxpi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fd6tkwuuewawju6oxpi.png" alt=" " width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There was an error during the build process, so I verified the build settings and checked the amplify.yml file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqlx680x6yqsjuycudmx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqlx680x6yqsjuycudmx.png" alt=" " width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Added the commands under preBuild section&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
        - nvm install 20.19.0
        - nvm use 20.19.0
        - node -v  
        - npm install

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
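&lt;p&gt;For reference, the preBuild commands above sit inside the frontend build spec. A minimal &lt;code&gt;amplify.yml&lt;/code&gt; for a React app might look like the sketch below (the Node version and the &lt;code&gt;build&lt;/code&gt; output directory are assumptions; match them to your project):&lt;/p&gt;

```yaml
version: 1
frontend:
  phases:
    preBuild:
      commands:
        # Pin the Node version the app was built against (example version)
        - nvm install 20.19.0
        - nvm use 20.19.0
        - node -v
        - npm install
    build:
      commands:
        - npm run build
  artifacts:
    # Create React App outputs to build/; Vite projects use dist/ instead
    baseDirectory: build
    files:
      - '**/*'
  cache:
    paths:
      - node_modules/**/*
```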



&lt;p&gt;The amplify.yml file after the update:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3tta8284z8pdle08eok.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3tta8284z8pdle08eok.png" alt=" " width="800" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F90hfxe6kfmh4znujrgdf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F90hfxe6kfmh4znujrgdf.png" alt=" " width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Deployment Completed&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3a72fqepe9r5fr4f79k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3a72fqepe9r5fr4f79k.png" alt=" " width="800" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The app is now accessible using the default Amplify URL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3yy8vuzofvlyr59k71iw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3yy8vuzofvlyr59k71iw.png" alt=" " width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🌐 5. Add a Custom Domain (e.g., from GoDaddy)
&lt;/h2&gt;

&lt;p&gt;Now let’s connect your Amplify app to a custom domain like &lt;a href="https://subdomain.yourdomain.com" rel="noopener noreferrer"&gt;https://subdomain.yourdomain.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To add a custom domain managed by a third-party DNS provider:&lt;br&gt;
Sign in to the AWS Management Console and open the Amplify console.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose the app that you want to add a custom domain to.&lt;/li&gt;
&lt;li&gt;In the navigation pane, choose Hosting, Custom domains.&lt;/li&gt;
&lt;li&gt;On the Custom domains page, choose Add domain.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7io43rybzjrtvr7gvgs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7io43rybzjrtvr7gvgs.png" alt=" " width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Enter the name of your root domain. For example, if the name of your domain is &lt;a href="https://example.com" rel="noopener noreferrer"&gt;https://example.com&lt;/a&gt;, enter example.com.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzakynr5hb3sjqrcr7df2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzakynr5hb3sjqrcr7df2.png" alt=" " width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Amplify detects that you are not using a Route 53 domain and gives you the option to create a hosted zone in Route 53.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6uttoanm75r9zyrymom1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6uttoanm75r9zyrymom1.png" alt=" " width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the hosted zone is created, add its nameservers to your domain provider.&lt;/p&gt;
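&lt;p&gt;The four nameservers can also be pulled from the CLI. Below is a small offline sketch of that extraction; the JSON and nameserver values are made-up examples standing in for real &lt;code&gt;aws route53 get-hosted-zone&lt;/code&gt; output:&lt;/p&gt;

```shell
# Stand-in for `aws route53 get-hosted-zone` output (values are examples)
printf '%s' '{"DelegationSet":{"NameServers":["ns-111.awsdns-11.com","ns-222.awsdns-22.net","ns-333.awsdns-33.org","ns-444.awsdns-44.co.uk"]}}' > /tmp/zone.json

# Extract just the nameserver hostnames to copy into your registrar (GoDaddy)
grep -o 'ns-[a-z0-9.-]*' /tmp/zone.json
```

&lt;p&gt;Against a real zone you would pipe the live CLI output in place of the sample file.&lt;/p&gt;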

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzcqcetdygdtx709hwwe1.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzcqcetdygdtx709hwwe1.PNG" alt=" " width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xchreoewwei1mi6fq7v.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xchreoewwei1mi6fq7v.PNG" alt=" " width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now add your subdomain and wait for the DNS records to be created for SSL certificate issuance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7sqpjuyh3xt608tx37z6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7sqpjuyh3xt608tx37z6.png" alt=" " width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because the domain’s nameservers now point to Route 53, no manual step is needed here; Amplify adds these records to Route 53 automatically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5c34hrr8chn5tez5rgi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5c34hrr8chn5tez5rgi.png" alt=" " width="800" height="316"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7qr1f020255qbx8s3hq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7qr1f020255qbx8s3hq.png" alt=" " width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Domain activation is complete.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvoqsra02xes43wkj68a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvoqsra02xes43wkj68a.png" alt=" " width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  🔓 6. Access the App Using Custom Domain
&lt;/h2&gt;

&lt;p&gt;After DNS propagation and SSL verification (usually &amp;lt; 1 hour):&lt;/p&gt;

&lt;p&gt;✅ Your app will be available at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://subdomain.yourdomain.com

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fau4efiw6syhj3z7vp0lq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fau4efiw6syhj3z7vp0lq.png" alt=" " width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It will be fully secured with HTTPS using a free AWS-issued SSL certificate.&lt;/p&gt;

&lt;h2&gt;
  
  
  🧰 7. Troubleshooting Steps
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🔧 Problem                        ✅ Solution
❌ Domain verification failed      Ensure correct CNAME is added in GoDaddy
⏳ Stuck in "Pending"              Use https://dnschecker.org to confirm DNS
🔒 No SSL/HTTPS                     Wait for Amplify to finish provisioning or re-add domain
🛑 404 after deployment             Confirm the subdomain is correctly mapped to your app branch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ✅ 8. Conclusion
&lt;/h2&gt;

&lt;p&gt;Hosting a React app on AWS Amplify is fast, secure, and scalable. By combining it with your own custom domain—whether it's registered on GoDaddy or any third-party DNS provider—you get a professional-grade deployment pipeline in minutes.&lt;/p&gt;

&lt;p&gt;No DevOps, no manual servers, and no complex SSL setups. &lt;/p&gt;

&lt;p&gt;Just code → push → deploy&lt;/p&gt;

&lt;p&gt;Follow me on LinkedIn: &lt;a href="http://www.linkedin.com/in/alok-shankar-55b94826" rel="noopener noreferrer"&gt;www.linkedin.com/in/alok-shankar-55b94826&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>react</category>
      <category>aws</category>
      <category>frontend</category>
    </item>
    <item>
      <title>Automate AWS CloudWatch Log Retention with Bash: Save Costs &amp; Stay Compliant</title>
      <dc:creator>alok shankar</dc:creator>
      <pubDate>Sun, 18 May 2025 14:31:16 +0000</pubDate>
      <link>https://dev.to/alok_shankar/automate-aws-cloudwatch-log-retention-with-bash-save-costs-stay-compliant-2oj6</link>
      <guid>https://dev.to/alok_shankar/automate-aws-cloudwatch-log-retention-with-bash-save-costs-stay-compliant-2oj6</guid>
      <description>&lt;p&gt;🔹 &lt;strong&gt;Introduction :&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Managing CloudWatch log groups is a critical part of maintaining operational efficiency and cost control in AWS. However, it's easy to overlook retention settings — especially when log groups are created automatically by various AWS services. Without a defined retention period, logs accumulate indefinitely, leading to increased storage costs and unnecessary clutter.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In this blog, I’ll walk through a streamlined approach to automatically detect CloudWatch log groups without a retention policy, update them to a 30-day retention period, and generate an HTML report delivered straight to your inbox.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;The solution is powered by a simple Bash script that leverages the AWS CLI and standard Linux utilities — making it easy to integrate into any DevOps workflow.&lt;/p&gt;

&lt;p&gt;Whether you're a cloud engineer trying to stay compliant or just looking to reduce AWS costs, this automated approach will save time, improve visibility, and ensure consistent log management across your environment.&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Challenges Faced in Manual Process:&lt;/strong&gt;&lt;br&gt;
Manually managing log retention policies in AWS is like trying to clean every file cabinet in a skyscraper—painful, slow, and error-prone. Some of the common problems:&lt;/p&gt;

&lt;p&gt;❌ You can't visually identify which logs lack retention&lt;br&gt;
❌ You have to click through each log group in the AWS Console&lt;br&gt;
❌ There’s no built-in notification when retention is missing&lt;br&gt;
❌ Risk of accumulating terabytes of unused logs&lt;/p&gt;

&lt;p&gt;So I thought — “Why not automate the boring stuff?”&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Benefits of Automating CloudWatch Retention Updates&lt;/strong&gt;&lt;br&gt;
Automating retention policies brings a whole bouquet of benefits:&lt;/p&gt;

&lt;p&gt;🌟 Cost Control – Say goodbye to ever-growing log storage bills&lt;br&gt;
🔍 Audit Friendly – Track what's changed, when, and how&lt;br&gt;
📧 Proactive Alerting – Get email summaries with detailed tables&lt;br&gt;
🧹 Cleaner Environment – Consistent retention policies = better hygiene&lt;br&gt;
⏱️ Time Saved – No more manual clicking or forgetfulness&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;br&gt;
Before diving in, make sure you have the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;An AWS account with access to CloudWatch&lt;/li&gt;
&lt;li&gt;IAM permissions to read and update log groups&lt;/li&gt;
&lt;li&gt;AWS CLI configured on your machine&lt;/li&gt;
&lt;li&gt;Bash shell environment (Linux or macOS)&lt;/li&gt;
&lt;li&gt;Tools like jq, sendmail, and mailutils installed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🔹 &lt;strong&gt;Step 1: Install AWS CLI&lt;/strong&gt;&lt;br&gt;
If you haven’t installed the AWS CLI yet, follow the steps below:&lt;br&gt;
&lt;code&gt;curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"&lt;br&gt;
unzip awscliv2.zip&lt;br&gt;
sudo ./aws/install&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Then configure your credentials:&lt;br&gt;
&lt;code&gt;aws configure&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
🔹 &lt;strong&gt;Step 2: Install Dependencies&lt;/strong&gt;&lt;br&gt;
You’ll also need jq for JSON parsing and sendmail/mailutils for email delivery:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo apt install jq mailutils -y&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
🔹 &lt;strong&gt;Step 3: Create IAM Policy as per below , attached to IAM role and assign that role to EC2 instance.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;You’ll need the following IAM permissions to make it work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DescribeLogGroups",
      "Effect": "Allow",
      "Action": [
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams"
      ],
      "Resource": "*"
    },
    {
      "Sid": "PutRetentionPolicy",
      "Effect": "Allow",
      "Action": "logs:PutRetentionPolicy",
      "Resource": "*"
    },
    {
      "Sid": "CloudWatchMetricsAccess",
      "Effect": "Allow",
      "Action": "cloudwatch:GetMetricStatistics",
      "Resource": "*"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Permissions include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;logs:DescribeLogGroups&lt;/li&gt;
&lt;li&gt;logs:DescribeLogStreams&lt;/li&gt;
&lt;li&gt;logs:PutRetentionPolicy&lt;/li&gt;
&lt;li&gt;cloudwatch:GetMetricStatistics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jr3eu8nzbsselb3ouye.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jr3eu8nzbsselb3ouye.JPG" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Step 4: Clone the GitHub Repository&lt;/strong&gt;&lt;br&gt;
Instead of writing the script manually, you can simply clone the prebuilt GitHub repository that includes the script, required IAM policy, and a README.&lt;br&gt;
&lt;code&gt;git clone https://github.com/alokshanhbti/cloudwatch-retention-update.git&lt;br&gt;
cd cloudwatch-retention-update&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
Inside the folder, you’ll find:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;cloudwatch-retention-update.sh – The automation script&lt;/li&gt;
&lt;li&gt;iam-policy.json – IAM policy required for permissions&lt;/li&gt;
&lt;li&gt;README.md – Full documentation and usage instructions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🔹 &lt;strong&gt;Step 5: Make the Script Executable&lt;/strong&gt;&lt;br&gt;
After cloning the repository, make the script executable with:&lt;br&gt;
&lt;code&gt;chmod +x cloudwatch-retention-update.sh&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
🔹 &lt;strong&gt;Step 6: Run the Script&lt;/strong&gt;&lt;br&gt;
Simply execute:&lt;br&gt;
&lt;code&gt;./cloudwatch-retention-update.sh&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
The script will log activity to a file, apply changes, and email the report to the address you specify.&lt;/p&gt;
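&lt;p&gt;To keep the audit running unattended, a cron entry can schedule it. The path, schedule, and log file below are examples to adapt:&lt;/p&gt;

```
# m h dom mon dow command : run the retention audit daily at 02:00
0 2 * * * /opt/cloudwatch-retention-update/cloudwatch-retention-update.sh >> /var/log/cw-retention.log
```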

&lt;p&gt;🔹 &lt;strong&gt;Step 7: Script Flow&lt;/strong&gt;&lt;br&gt;
Here’s how the script works behind the scenes:&lt;/p&gt;

&lt;p&gt;🔍 Scan CloudWatch for log groups with no retention&lt;br&gt;
🧠 Fetch metadata: log group name, retention, last event, service name, and storage&lt;br&gt;
✍️ Update retention to 30 days using put-retention-policy&lt;br&gt;
📨 Generate HTML email with two colorful tables:&lt;br&gt;
Before update&lt;br&gt;
After update&lt;br&gt;
📬 Send email via sendmail with all details&lt;/p&gt;
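&lt;p&gt;The detection step (finding groups with no retention) can be sketched offline. The JSON lines below imitate trimmed &lt;code&gt;aws logs describe-log-groups&lt;/code&gt; output; the group names are made-up examples, and the real script uses jq rather than this grep/sed approximation:&lt;/p&gt;

```shell
# Stand-in sample: one JSON object per log group (names are examples)
printf '%s\n' \
  '{"logGroupName":"/aws/lambda/fn-a","retentionInDays":14}' \
  '{"logGroupName":"/aws/lambda/fn-b"}' \
  '{"logGroupName":"/ecs/app-c"}' > /tmp/loggroups.txt

# Keep only groups that lack a retentionInDays key, then extract the name;
# these are the groups the script would pass to put-retention-policy
grep -v 'retentionInDays' /tmp/loggroups.txt \
  | sed 's/.*"logGroupName":"\([^"]*\)".*/\1/'
```

&lt;p&gt;Here this prints &lt;code&gt;/aws/lambda/fn-b&lt;/code&gt; and &lt;code&gt;/ecs/app-c&lt;/code&gt;, the two groups with no retention set.&lt;/p&gt;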

&lt;p&gt;🔹 &lt;strong&gt;Step 8: Screen shots of email and logs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Email part Before update :&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5uscli9pl62y4zt4zcl.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5uscli9pl62y4zt4zcl.JPG" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Email part After update :&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fslwoc2rla49eabcf7pkp.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fslwoc2rla49eabcf7pkp.JPG" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Logs :&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp52gc3gnovdkbm5ssjq2.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp52gc3gnovdkbm5ssjq2.JPG" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Automating CloudWatch log retention is a simple yet highly effective way to maintain a clean, cost-efficient, and compliant cloud environment. With this Bash script, you can easily identify log groups without retention settings, apply a consistent 30-day policy, and receive a well-formatted email report — all with minimal effort and zero manual intervention.&lt;/p&gt;

&lt;p&gt;This solution not only improves visibility and governance but also frees up your time to focus on higher-value tasks.&lt;/p&gt;

&lt;p&gt;Thank you for reading!&lt;br&gt;
If this script helps improve your cloud hygiene, feel free to share it with your team or contribute to the project.&lt;/p&gt;

&lt;p&gt;📂 Access the GitHub Repository Here:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/alokshanhbti" rel="noopener noreferrer"&gt;
        alokshanhbti
      &lt;/a&gt; / &lt;a href="https://github.com/alokshanhbti/cloudwatch-retention-update" rel="noopener noreferrer"&gt;
        cloudwatch-retention-update
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
cloudwatch-retention-update audits AWS CloudWatch log groups with no retention period, updates them to a 30-day retention, and sends an email report
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;📊 CloudWatch Log Retention Manager&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;code&gt;cloudwatch-retention-update.sh&lt;/code&gt; is a Bash script that audits AWS CloudWatch log groups with &lt;strong&gt;no retention period set&lt;/strong&gt;, updates them to a &lt;strong&gt;30-day retention&lt;/strong&gt;, and sends a &lt;strong&gt;HTML email report&lt;/strong&gt; containing &lt;strong&gt;color-coded tables&lt;/strong&gt;.&lt;/p&gt;




&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;🔧 Features&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;✅ Identifies log groups &lt;strong&gt;without retention&lt;/strong&gt;&lt;br&gt;
✅ Fetches &lt;strong&gt;last log date&lt;/strong&gt;, &lt;strong&gt;associated AWS service&lt;/strong&gt;, and &lt;strong&gt;storage usage (in GB)&lt;/strong&gt;&lt;br&gt;
✅ Applies a &lt;strong&gt;30-day retention policy&lt;/strong&gt;&lt;br&gt;
✅ Sends an &lt;strong&gt;HTML email&lt;/strong&gt; via &lt;code&gt;sendmail&lt;/code&gt; with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;📋 &lt;strong&gt;Before Update Table&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;After Update Table&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;📁 Script Overview&lt;/h2&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;📂 &lt;strong&gt;Log Group Scan&lt;/strong&gt; — Uses &lt;code&gt;aws logs describe-log-groups&lt;/code&gt; and &lt;code&gt;jq&lt;/code&gt; to filter targets&lt;/li&gt;
&lt;li&gt;⏳ &lt;strong&gt;Retention Status&lt;/strong&gt; — Detects &lt;code&gt;null&lt;/code&gt; retention policies&lt;/li&gt;
&lt;li&gt;📅 &lt;strong&gt;Last Log Timestamp&lt;/strong&gt; — Uses &lt;code&gt;describe-log-streams&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;💾 &lt;strong&gt;Storage Usage (GB)&lt;/strong&gt; — Uses &lt;code&gt;cloudwatch:GetMetricStatistics&lt;/code&gt; for &lt;code&gt;StoredBytes&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;📧 &lt;strong&gt;HTML Email Report&lt;/strong&gt; — Sends two HTML tables (before &amp;amp; after) with colors&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;🚀 Usage&lt;/h2&gt;

&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Step 1: Make it executable&lt;/h3&gt;

&lt;/div&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;chmod +x cloudwatch-retention-update.sh
&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Step&lt;/h3&gt;…&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/alokshanhbti/cloudwatch-retention-update" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;Happy automating! 🚀&lt;/p&gt;

&lt;p&gt;Follow me on LinkedIn: &lt;a href="http://www.linkedin.com/in/alok-shankar-55b94826" rel="noopener noreferrer"&gt;www.linkedin.com/in/alok-shankar-55b94826&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>linux</category>
      <category>cloud</category>
    </item>
  </channel>
</rss>
