DEV Community

Cover image for I Built a Bot That Updates My EKS Nodes While I Sleep — Here's How

I Built a Bot That Updates My EKS Nodes While I Sleep — Here's How

TL;DR: Manual EKS AMI updates are slow, risky, and easy to forget. I wired together EventBridge, Lambda, Amazon Bedrock (Claude 3.5 Haiku), GitHub PRs, ArgoCD, and Karpenter into a pipeline that detects new AMIs, runs AI risk analysis, opens a PR for human review, and rolls out nodes automatically — zero downtime, full audit trail.


The problem every EKS team hits eventually

You're running production Kubernetes on AWS. You know you're supposed to keep worker nodes patched. But between sprints, incidents, and everything else — checking for new EKS-optimized AMIs falls through the cracks.

When you finally do an update, there's a whole ritual: find the new AMI ID, read through the release notes, assess any CVEs, draft a PR, wait for approvals, then carefully roll out nodes without taking down your workloads.

It's not rocket science — it's just slow, manual, and one of those tasks that always feels lower priority than the thing currently on fire.

What if the whole thing ran itself?


The solution in one sentence

Twice a day, a Lambda checks for new EKS AMIs. If one exists, Bedrock analyzes the risk and opens a GitHub PR. A human reviews it. Merging the PR triggers ArgoCD + Karpenter to roll out the new nodes with zero downtime.

The magic is that the only thing a human needs to do is read the AI's analysis and merge (or close) the PR. Everything else — detection, analysis, branch creation, notification, node rollout — is automated.


Architecture: Three clean phases

Three-phase EKS AMI automation pipeline: Detect via EventBridge and Lambda, AI Analyze via Amazon Bedrock and GitHub PR, Deploy via ArgoCD and Karpenter

Phase 1 — Detection

An EventBridge scheduled rule fires at 9 AM and 9 PM UTC every day. It triggers a Lambda that:

  1. Queries AWS SSM Parameter Store for the latest EKS-optimized AMI ID (/aws/service/eks/optimized-ami/1.34/amazon-linux-2023/recommended/image_id)
  2. Compares it against what's currently committed in your GitHub repository (your source of truth)
  3. If they differ — new AMI exists → triggers the Step Functions workflow

No new AMI? The Lambda exits quietly. Nothing else happens.

Phase 2 — AI Analysis + Pull Request

This is where it gets interesting. AWS Step Functions orchestrates three Lambda functions in sequence:

Lambda 1 — bedrock-analyzer

Fetches the real AMI release notes from GitHub (awslabs/amazon-eks-ami) and sends them to Amazon Bedrock running Claude 3.5 Haiku with this prompt:

Analyze this Amazon EKS AMI update using the actual release notes.
New AMI ID: {ami_id}
Previous AMI ID: {previous_ami}

ACTUAL EKS AMI RELEASE NOTES:
{release_notes}

Respond in JSON with:
- risk_score: 1–10
- recommendation: APPROVE or REJECT
- summary: one-line summary of actual changes
- pr_description: full markdown PR body with CVEs, package versions,
  risk assessment, and review guidance
Enter fullscreen mode Exit fullscreen mode

The output is a structured JSON object with a risk score and a ready-to-paste PR description.

Lambda 2 — gitops-updater

Uses GitHub App credentials (stored in AWS Secrets Manager) to:

  • Create a new branch
  • Update the Karpenter EC2NodeClass YAML with the new AMI ID
  • Open a Pull Request with the full Bedrock analysis embedded in the description

Lambda 3 — send-notification

Fires an SNS email to the team: "New AMI detected, PR #N is open for your review." Includes the PR link and the one-line AI summary.

The human's job: Read the AI analysis. Check the YAML diff (it's literally one line — the AMI ID). Merge to approve, close to reject.

Phase 3 — GitOps Deployment

After the PR is merged:

  • ArgoCD detects the commit on main, auto-syncs the updated EC2NodeClass manifest to the EKS cluster
  • Karpenter sees the new AMI ID in the EC2NodeClass, provisions new EC2 nodes with the updated AMI, then gracefully drains the old nodes
  • Workloads migrate to new nodes. Zero downtime.

The whole rollout happens without anyone touching kubectl.


What the PR actually looks like

This is what your team sees in GitHub:

## EKS AMI Update — ami-04b406d4e6eaca578

**AI Risk Score: 2/10 — APPROVE**

### What changed
- Go updated to 1.25.9
- Kernel updated to 6.12.79-101.147.amzn2023
- No new CVEs introduced

### CVE Assessment
No critical or high-severity CVEs in this update. Two previously
known CVEs (CVE-2024-XXXX, CVE-2024-YYYY) are patched.

### Review guidance
This is a routine kernel + runtime update. Low risk. Recommend
merging during business hours with normal monitoring in place.

---
*Merge this PR to trigger ArgoCD + Karpenter rollout.*
*Close this PR to skip this AMI version.*
Enter fullscreen mode Exit fullscreen mode

Your reviewer doesn't need to dig through release notes. The AI already did it.


CloudFormation: everything in one stack

The whole solution deploys from a single CloudFormation template. Here's what it provisions:

Resource Purpose
AWS Secrets Manager GitHub App credentials
Amazon SNS + subscription Email alerts
5 IAM roles Per-function least-privilege
4 Lambda functions Detector, analyzer, PR creator, notifier
Amazon Bedrock Guardrail Content filtering on AI output
Step Functions state machine Orchestrates analyze → PR → notify
EventBridge rule Twice-daily schedule

Deploy it:

aws cloudformation create-stack \
  --stack-name eks-ami-update \
  --template-body file://cloudformation-template.yaml \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameters \
    ParameterKey=NotificationEmail,ParameterValue=your@email.com \
    ParameterKey=GitHubAppId,ParameterValue=<app-id> \
    ParameterKey=GitHubAppInstallationId,ParameterValue=<install-id> \
    ParameterKey=GitHubAppPrivateKey,ParameterValue=$(base64 -i app.pem | tr -d '\n') \
    ParameterKey=GitHubRepoOwner,ParameterValue=<your-org> \
    ParameterKey=GitHubRepoName,ParameterValue=<your-repo> \
    ParameterKey=GitHubFilePath,ParameterValue=karpenter-configs/clusters/your-cluster/nodeclass.yaml \
    ParameterKey=GitHubBranch,ParameterValue=main \
    ParameterKey=EKSVersion,ParameterValue=1.34
Enter fullscreen mode Exit fullscreen mode

Takes about 2–3 minutes. Confirm the SNS subscription email when it arrives.


Prerequisites checklist

Before deploying, you need:

  • [ ] An existing EKS cluster (v1.34+)
  • [ ] Karpenter installed and configured
  • [ ] ArgoCD installed with auto-sync enabled
  • [ ] A GitHub repository for Karpenter configs
  • [ ] A GitHub App installed on that repo (you need App ID, Installation ID, and Private Key)
  • [ ] Amazon Bedrock enabled in your region (enable Claude 3.5 Haiku access in the Bedrock console)
  • [ ] AWS CLI + kubectl configured

Important: Fork the aws-samples repository to your own account — you need write access to configure the GitHub App. Deploy your EC2NodeClass config to the repo before running the stack.


Testing it without waiting for an AMI release

Don't want to wait up to 12 hours for the schedule to fire? Trigger it manually:

aws lambda invoke \
  --function-name eks-ami-detector \
  --payload '{}' \
  --cli-binary-format raw-in-base64-out \
  /tmp/response.json && cat /tmp/response.json
Enter fullscreen mode Exit fullscreen mode

Check your inbox. You should get an SNS email with the risk analysis and PR link within a couple of minutes.

After merging, verify the ArgoCD sync:

# Update your kubeconfig
aws eks update-kubeconfig --region <region> --name <cluster-name>

# Check ArgoCD sync policy
kubectl get application karpenter-nodeclass -n argocd \
  -o jsonpath='{.spec.syncPolicy}'

# Verify the AMI ID was applied
kubectl get ec2nodeclass default -o yaml | grep ami-
Enter fullscreen mode Exit fullscreen mode

Common issues and how to fix them

SNS subscription not confirmed — Check your spam folder. The confirmation email comes from AWS and sometimes gets filtered.

GitHub App auth failure — Double-check the App is installed on the correct repository with read/write permissions. Regenerate the private key in GitHub if needed and re-run the CloudFormation update.

Bedrock access denied — Go to the Amazon Bedrock console → Model access → enable Claude 3.5 Haiku in your region. This is a manual step that's easy to miss.

ArgoCD not syncing — Verify the Application resource has spec.syncPolicy.automated set. Check that the repo URL and path match exactly.

Step Functions failures — Check CloudWatch Logs for the failing Lambda. 99% of the time it's an IAM permission issue or a missing secret.


Why this architecture is worth copying

A few design decisions I want to highlight:

GitHub PRs as the approval interface — Engineers already live in GitHub. Using a PR as the human gate means no new tool to learn, built-in commenting, and a permanent audit trail in Git history. The PR description IS the change record.

AI analysis on real release notes — The Bedrock prompt fetches actual release notes from the awslabs/amazon-eks-ami repo. It's not making things up — it's summarizing real content. The risk score is grounded in actual CVE and package data.

Karpenter over managed node groups — Karpenter watches the EC2NodeClass for changes and handles the node lifecycle automatically. You don't need to write any drain/cordon scripts.

Least-privilege IAM — Each Lambda has its own role with only the permissions it needs. The CF template provisions five separate roles. This matters in production.

Guardrails on Bedrock — The solution includes a Bedrock Guardrail for content filtering on the AI output. Belt and suspenders.


Cleaning up

aws cloudformation delete-stack --stack-name eks-ami-update
Enter fullscreen mode Exit fullscreen mode

What I'd add next

A few things that would make this even better:

  • Slack notification instead of (or in addition to) SNS email — PR link directly in your #platform channel
  • Dry-run mode — run the full pipeline but don't actually open a PR, just log the analysis
  • Multi-cluster support — one stack managing AMI updates across dev/staging/prod with different approval thresholds per environment
  • Custom risk criteria — tune the Bedrock prompt to your org's specific compliance requirements (PCI-DSS, SOC 2, etc.)
  • Automatic REJECT on critical CVEs — skip the PR entirely and alert the team if the risk score is 8+

Get the code

Fork the repo, follow the README, and deploy:

👉 GitHub: suryansh639/sample-eks-ami-gitops-pipeline

The CloudFormation template, Lambda code, and example Karpenter configs are all there.


Wrapping up

The goal wasn't to remove humans from the loop — it was to remove the boring part of the loop. The AI reads the release notes. The AI writes the PR description. The human decides. The automation executes.

That's the right split. And it means your nodes actually get updated on time, every time, with a full audit trail and no 2 AM surprises.

If you try this out, drop a comment — I'd love to hear what customizations you make.


Top comments (0)