<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Robert Zsoter</title>
    <description>The latest articles on DEV Community by Robert Zsoter (@robert_r_7c237256b7614328).</description>
    <link>https://dev.to/robert_r_7c237256b7614328</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2090137%2F205fff6c-da14-416b-ba8e-400b5e8c11e9.png</url>
      <title>DEV Community: Robert Zsoter</title>
      <link>https://dev.to/robert_r_7c237256b7614328</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/robert_r_7c237256b7614328"/>
    <language>en</language>
    <item>
      <title>Browser-Based kubectl Access: Managing Kubernetes Without Bastion Hosts</title>
      <dc:creator>Robert Zsoter</dc:creator>
      <pubDate>Thu, 08 Jan 2026 16:17:00 +0000</pubDate>
      <link>https://dev.to/robert_r_7c237256b7614328/browser-based-kubectl-access-managing-kubernetes-without-bastion-hosts-1b4a</link>
      <guid>https://dev.to/robert_r_7c237256b7614328/browser-based-kubectl-access-managing-kubernetes-without-bastion-hosts-1b4a</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;This article presents a &lt;strong&gt;browser-based kubectl access pattern&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Designed for &lt;strong&gt;temporary, auditable cluster interaction&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No bastion host, no SSH, no heavy management tools&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;All actions go through the &lt;strong&gt;Kubernetes API and RBAC&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Not intended for daily production operations&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Accessing Kubernetes clusters securely is a recurring challenge, especially in environments where &lt;strong&gt;SSH access, bastion hosts, or heavy management tools are discouraged&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this article, I’ll walk through a &lt;strong&gt;browser-based kubectl access pattern&lt;/strong&gt; that enables &lt;strong&gt;temporary, auditable interaction&lt;/strong&gt; with a Kubernetes cluster, &lt;strong&gt;without&lt;/strong&gt; relying on &lt;strong&gt;jump hosts or&lt;/strong&gt; always-on &lt;strong&gt;management platforms&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This approach is intentionally &lt;strong&gt;not designed for daily production operations&lt;/strong&gt;. Its value lies in &lt;strong&gt;controlled access&lt;/strong&gt;, not convenience.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Kubernetes Access Is Hard to Get Right
&lt;/h2&gt;

&lt;p&gt;Most &lt;strong&gt;teams rely on&lt;/strong&gt; one or more of these approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bastion hosts&lt;/strong&gt; with SSH access&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;kubectl&lt;/strong&gt; configured on &lt;strong&gt;local laptops&lt;/strong&gt;/machines&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Full-featured &lt;strong&gt;Kubernetes management tools&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud-provider shell&lt;/strong&gt; environments&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of &lt;strong&gt;them work&lt;/strong&gt;, &lt;strong&gt;but&lt;/strong&gt; they &lt;strong&gt;come with&lt;/strong&gt; trade-offs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;credential sprawl &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;infrastructure overhead/increased attack surface&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;long-lived access paths/credentials&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;limited auditability/unclear audit boundaries&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In regulated or security-sensitive environments, these trade-offs become unacceptable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Yet teams still need&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;break-glass access&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;short-lived troubleshooting&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;training and workshop environments&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;controlled/restricted support access&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the gap the browser-based approach addresses.&lt;/p&gt;




&lt;h2&gt;
  
  
  What “Browser-Based kubectl” Actually Means
&lt;/h2&gt;

&lt;p&gt;This pattern does &lt;strong&gt;not&lt;/strong&gt; introduce a new Kubernetes UI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instead&lt;/strong&gt;, it exposes a &lt;strong&gt;restricted web terminal&lt;/strong&gt; that runs &lt;em&gt;kubectl&lt;/em&gt; inside the cluster, using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;a dedicated &lt;strong&gt;ServiceAccount&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;strict &lt;strong&gt;RBAC&lt;/strong&gt; permissions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;native&lt;/strong&gt; Kubernetes &lt;strong&gt;audit logging&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
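
&lt;p&gt;As a rough sketch, the ServiceAccount and least-privilege RBAC might look like the following (the names, namespace, resources, and verbs are illustrative assumptions, not part of any standard setup):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative only -- adjust names, namespace, resources, and verbs
kubectl apply -f - &lt;&lt;'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  name: web-terminal
  namespace: web-terminal
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: web-terminal-readonly
  namespace: web-terminal
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log", "services"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: web-terminal-readonly
  namespace: web-terminal
subjects:
- kind: ServiceAccount
  name: web-terminal
  namespace: web-terminal
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: web-terminal-readonly
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The terminal Pod then references this ServiceAccount via &lt;em&gt;spec.serviceAccountName&lt;/em&gt;, so every kubectl call it makes is bounded by the Role above.&lt;/p&gt;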

&lt;p&gt;All &lt;strong&gt;access&lt;/strong&gt; happens through &lt;strong&gt;HTTP(S)&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;There is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;no SSH&lt;/strong&gt; access,&lt;/li&gt;
&lt;li&gt;no node-level access or login,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;no user kubeconfig&lt;/strong&gt; distribution.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  High-Level Architecture
&lt;/h2&gt;

&lt;p&gt;Conceptually, the flow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Browser
     |
     v
HTTP(S) (restricted)
     |
     v
Ingress / Load Balancer
     |
     v
Service
     |
     v
Web Terminal Pod
     |
     v
kubectl
     |
     v
Kubernetes API Server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important details&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;terminal runs as a Pod&lt;/strong&gt; inside the cluster&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Authorization&lt;/strong&gt; is enforced &lt;strong&gt;by&lt;/strong&gt; Kubernetes &lt;strong&gt;RBAC&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;All kubectl actions result in Kubernetes API requests, which can be captured by Kubernetes audit logs depending on the audit policy &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Access can be disabled instantly&lt;/strong&gt; by removing the Pod or Service&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;No persistent external access paths remain&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The same flow, represented &lt;strong&gt;as an ASCII diagram&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozebhrmf3o19a83hivg3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozebhrmf3o19a83hivg3.png" alt="Web Terminal for kubectl - ascii diagram" width="353" height="710"&gt;&lt;/a&gt;&lt;/p&gt;
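
&lt;p&gt;Because everything lives inside the cluster, revoking access is one deletion away. A sketch, assuming all components were deployed into a dedicated namespace (the &lt;em&gt;web-terminal&lt;/em&gt; name is an assumption):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Tears down the Pod, Service, Ingress, ServiceAccount, and RBAC bindings at once
kubectl delete namespace web-terminal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;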




&lt;h2&gt;
  
  
  A Few Screenshots in Operation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;basic commands:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67dixpvb99mrgg06xx1t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67dixpvb99mrgg06xx1t.png" alt="Web Terminal for kubectl commands" width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;jumping into a pod with the &lt;em&gt;kubectl exec&lt;/em&gt; command, within the defined namespace:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9n90cb0royc881pxft3e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9n90cb0royc881pxft3e.png" alt="Web Terminal for kubectl exec command" width="800" height="665"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Security Model: Why This Is Auditable by Design
&lt;/h2&gt;

&lt;p&gt;This pattern relies on &lt;strong&gt;layered security&lt;/strong&gt;, not a single control.&lt;/p&gt;

&lt;h3&gt;
  
  
  Network layer
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;TLS termination&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;IP allowlists&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;No direct node exposure&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
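
&lt;p&gt;With the NGINX Ingress Controller, for example, the IP allowlist can be declared as an annotation (a sketch; the hostname, CIDR range, and service name are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f - &lt;&lt;'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-terminal
  namespace: web-terminal
  annotations:
    # Only these source ranges may reach the terminal
    nginx.ingress.kubernetes.io/whitelist-source-range: "203.0.113.0/24"
spec:
  ingressClassName: nginx
  tls:
  - hosts: ["terminal.example.com"]
    secretName: terminal-tls
  rules:
  - host: terminal.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-terminal
            port:
              number: 80
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;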

&lt;h3&gt;
  
  
  Kubernetes authorization
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Dedicated ServiceAccount&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Least-privilege RBAC&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Optional read-only mode&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Auditability
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Every action flows through the Kubernetes API&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Native audit logs capture requests&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Auditability: What Is (and Is Not) Logged
&lt;/h2&gt;

&lt;p&gt;This is an important clarification.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;kubectl commands themselves are not logged&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kubernetes API requests are&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a user executes a command:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;kubectl get pods&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;kubectl describe deployment ...&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;kubectl apply -f …&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The resulting API calls &lt;strong&gt;can be recorded in Kubernetes audit logs&lt;/strong&gt;, depending on the configured audit policy.&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Resource access and mutations are traceable&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;RBAC enforcement is preserved&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;No hidden or opaque access paths exist (unlike SSH sessions)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
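
&lt;p&gt;What actually lands in the audit log is decided by the API server's audit policy (configured with the &lt;em&gt;--audit-policy-file&lt;/em&gt; flag). A minimal sketch that records request metadata for the terminal's identity (the ServiceAccount name is an assumption):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat &lt;&lt;'EOF' &gt; audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Record metadata for every request made by the web terminal's ServiceAccount
- level: Metadata
  users: ["system:serviceaccount:web-terminal:web-terminal"]
# Keep all other traffic out of this sketch
- level: None
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;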

&lt;p&gt;This pattern relies on &lt;strong&gt;Kubernetes’ native security model&lt;/strong&gt;, not on custom logging logic.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Important&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
This approach does not bypass Kubernetes security controls; it depends on them.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  How This Compares to Other Access Patterns
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vixf2ybrznbsb6dkg3u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vixf2ybrznbsb6dkg3u.png" alt="How web based kubectl Compares to Other Access Patterns" width="800" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This pattern is not a replacement; it fills a &lt;strong&gt;specific operational niche&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  When You Should (and Should Not) Use This
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Recommended
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Break-glass scenarios&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Training and workshops&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Restricted production environments&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Short-lived support access&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Not recommended
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Daily production operations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CI/CD automation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Persistent admin workflows&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The limitations are intentional. They help prevent accidental misuse.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;After experimenting with this pattern in real environments, &lt;strong&gt;a few things became clear&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Kubernetes RBAC remains the single most important control&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Auditability improves when access paths are explicit&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Removing SSH simplifies security reviews&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Temporary access patterns reduce long-term risk&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Convenience is easy to add. Removing access later is much harder.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Secure Kubernetes access is less about tools and more about &lt;strong&gt;boundaries&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Browser-based kubectl access provides a &lt;strong&gt;minimal, auditable, and intentionally constrained&lt;/strong&gt; way to interact with a cluster &lt;strong&gt;when traditional approaches are unavailable or undesirable&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Used correctly, it solves a real problem, without becoming a new one.&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Reference Implementation
&lt;/h3&gt;

&lt;p&gt;The repository demonstrating this pattern is available here: &lt;a href="https://github.com/zsoterr/k8s-web-terminal-kubectl" rel="noopener noreferrer"&gt;https://github.com/zsoterr/k8s-web-terminal-kubectl&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  And What’s Next?
&lt;/h2&gt;

&lt;p&gt;I am planning a number of modifications and additions. You can find more information about these in README.md in the GitHub repository.&lt;/p&gt;




&lt;h2&gt;
  
  
  Note
&lt;/h2&gt;

&lt;p&gt;This DEV.to post is a concise version of a longer, experience-based guide.&lt;/p&gt;

&lt;p&gt;If you’re &lt;strong&gt;interested in deeper technical details&lt;/strong&gt;, you can find it among &lt;a href="https://medium.com/@zs77.robert/" rel="noopener noreferrer"&gt;my Medium stories&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  About the Author
&lt;/h2&gt;

&lt;p&gt;I’m &lt;strong&gt;Róbert Zsótér&lt;/strong&gt;, Kubernetes &amp;amp; AWS architect. If you’re into &lt;strong&gt;Kubernetes, EKS, Terraform&lt;/strong&gt;, &lt;strong&gt;AI&lt;/strong&gt; and &lt;strong&gt;cloud-native security&lt;/strong&gt;, follow my latest posts here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;LinkedIn: &lt;a href="https://www.linkedin.com/in/r%C3%B3bert-zs%C3%B3t%C3%A9r-34541464/" rel="noopener noreferrer"&gt;Róbert Zsótér&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Substack: &lt;a href="https://cloudskillshu.substack.com/" rel="noopener noreferrer"&gt;CSHU&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Let’s build secure, scalable clusters together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Originally published on Medium &lt;a href="https://medium.com/@zs77.robert/browser-based-kubectl-access-managing-kubernetes-without-bastion-hosts-or-heavy-tools-1b6c939ce8ee" rel="noopener noreferrer"&gt;Browser-Based kubectl Access: Managing Kubernetes Without Bastion Hosts or Heavy Tools&lt;/a&gt;&lt;/p&gt;




</description>
      <category>architecture</category>
      <category>devops</category>
      <category>kubernetes</category>
      <category>security</category>
    </item>
    <item>
      <title>Using Amazon Q for AI-Assisted Debugging in Amazon EKS</title>
      <dc:creator>Robert Zsoter</dc:creator>
      <pubDate>Tue, 23 Dec 2025 15:16:00 +0000</pubDate>
      <link>https://dev.to/robert_r_7c237256b7614328/using-amazon-q-for-ai-assisted-debugging-in-amazon-eks-fm8</link>
      <guid>https://dev.to/robert_r_7c237256b7614328/using-amazon-q-for-ai-assisted-debugging-in-amazon-eks-fm8</guid>
      <description>&lt;h2&gt;
  
  
  Using Amazon Q for AI-Assisted Debugging in Amazon EKS
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Practical insights for Kubernetes engineers&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;The first step: &lt;em&gt;Using Amazon Q capabilities in an EKS environment&lt;/em&gt; - &lt;strong&gt;Part 1&lt;/strong&gt;:&lt;br&gt;
Fixing AWS IAM permission issues and configuring the EKS environment.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Q enables AI-assisted debugging directly in the AWS Console for EKS&lt;/li&gt;
&lt;li&gt;It accelerates root-cause analysis but does not replace kubectl or observability tools&lt;/li&gt;
&lt;li&gt;Correct IAM and EKS access configuration is critical: most Amazon Q “issues” are access-related&lt;/li&gt;
&lt;li&gt;Best used as a diagnostic accelerator, not an automated fix engine&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Debugging Amazon EKS&lt;/strong&gt; environments is rarely straightforward. Even experienced Kubernetes engineers often need to &lt;strong&gt;correlate information across multiple layers&lt;/strong&gt;: pod logs, node health, IAM permissions, control plane behavior, networking, AWS-managed integrations, etc.&lt;/p&gt;

&lt;p&gt;AWS introduced &lt;strong&gt;Amazon Q&lt;/strong&gt; some time ago: an &lt;strong&gt;AI assistant&lt;/strong&gt; embedded into the AWS &lt;strong&gt;Console&lt;/strong&gt; that brings a new operational model to &lt;strong&gt;EKS troubleshooting&lt;/strong&gt;: &lt;strong&gt;context-aware, AI-assisted reasoning&lt;/strong&gt; directly where engineers already work.&lt;/p&gt;

&lt;p&gt;This article &lt;strong&gt;explains&lt;/strong&gt; what Amazon Q adds to &lt;strong&gt;EKS debugging&lt;/strong&gt;, where it fits into real-world workflows, and why &lt;strong&gt;access configuration&lt;/strong&gt; - not AI - is the real &lt;strong&gt;key&lt;/strong&gt; to success.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why EKS Debugging Is Still Challenging
&lt;/h2&gt;

&lt;p&gt;Although EKS abstracts much of the Kubernetes control plane, &lt;strong&gt;operational debugging&lt;/strong&gt; remains &lt;strong&gt;complex&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pod failures often involve IAM, networking, or node capacity&lt;/li&gt;
&lt;li&gt;Cluster events and logs are spread across services&lt;/li&gt;
&lt;li&gt;Kubernetes RBAC and AWS IAM must both align&lt;/li&gt;
&lt;li&gt;Engineers switch constantly between tools and consoles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Traditional&lt;/strong&gt; workflows rely heavily on &lt;em&gt;kubectl&lt;/em&gt;, CloudWatch Logs, metrics dashboards, and deep platform knowledge. This is &lt;strong&gt;effective, but slow&lt;/strong&gt; and cognitively expensive.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Amazon Q Brings to the EKS Console
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Amazon Q Developer&lt;/strong&gt; is an AI-powered assistant &lt;strong&gt;integrated&lt;/strong&gt; into the AWS &lt;strong&gt;Console&lt;/strong&gt; UI. When used with EKS, &lt;strong&gt;it can&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inspect cluster state and related AWS resources&lt;/li&gt;
&lt;li&gt;Explain error conditions in natural language&lt;/li&gt;
&lt;li&gt;Correlate Kubernetes symptoms with AWS infrastructure&lt;/li&gt;
&lt;li&gt;Suggest likely causes and remediation paths&lt;/li&gt;
&lt;li&gt;Generate Kubernetes YAML examples&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike external AI tools, Amazon Q &lt;strong&gt;operates within AWS context&lt;/strong&gt;, meaning its answers are tied to what it can actually see in your account and cluster. &lt;strong&gt;No CLI&lt;/strong&gt; installation is required; the interaction happens directly &lt;strong&gt;inside the console&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Example Queries You Can Ask
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Why is this pod in CrashLoopBackOff?"
"Explain why my node group isn't scaling up."
"Generate a Deployment YAML for NGINX with a LoadBalancer Service."
"Check if my cluster is using deprecated APIs before upgrading to 1.33."
"How do I restrict traffic between namespaces with a NetworkPolicy?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Console-Aware vs Cluster-Aware Amazon Q
&lt;/h2&gt;

&lt;p&gt;It’s &lt;strong&gt;important to understand&lt;/strong&gt; that not all Amazon Q experiences are identical.&lt;/p&gt;

&lt;p&gt;Today, engineers may encounter the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Global Console Amazon Q&lt;/strong&gt;: a general-purpose AWS assistant (broadly available)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Context-aware EKS-native Q&lt;/strong&gt;: embedded directly in EKS resource views (pods, nodes, add-ons)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;global assistant&lt;/strong&gt; works across services and regions. The &lt;strong&gt;EKS-native version&lt;/strong&gt; appears contextually on cluster pages and can inspect workloads more deeply. Both rely on the same fundamental principle: &lt;strong&gt;visibility is controlled by access permissions&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Access Configuration Matters More Than the AI
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;common misconception&lt;/strong&gt; is that Amazon Q “&lt;em&gt;doesn’t work&lt;/em&gt;” when it returns partial or vague answers. In reality, Amazon Q &lt;strong&gt;can only analyze what the console identity is allowed to access&lt;/strong&gt;. For &lt;strong&gt;EKS&lt;/strong&gt;, this involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;IAM&lt;/strong&gt; permissions (e.g., &lt;em&gt;eks:AccessKubernetesApi&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;EKS &lt;strong&gt;access mode&lt;/strong&gt; (Access Entries preferred; the legacy &lt;em&gt;aws-auth&lt;/em&gt; ConfigMap is deprecated)&lt;/li&gt;
&lt;li&gt;Kubernetes &lt;strong&gt;RBAC&lt;/strong&gt; mappings via Access Policies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern &lt;strong&gt;EKS&lt;/strong&gt; clusters (especially 1.30+) rely on &lt;em&gt;EKS Cluster Access Management&lt;/em&gt;, where access is controlled through &lt;strong&gt;Access Entries&lt;/strong&gt; and &lt;strong&gt;Access Policies&lt;/strong&gt;, &lt;strong&gt;not&lt;/strong&gt; the &lt;em&gt;aws-auth&lt;/em&gt; ConfigMap.&lt;/p&gt;

&lt;p&gt;If the &lt;strong&gt;console role lacks&lt;/strong&gt; proper EKS &lt;strong&gt;access&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Q &lt;strong&gt;cannot list&lt;/strong&gt; pods or nodes&lt;/li&gt;
&lt;li&gt;Cluster-level insights remain unavailable&lt;/li&gt;
&lt;li&gt;Errors appear as “&lt;em&gt;authorization&lt;/em&gt;” or “&lt;em&gt;insufficient access&lt;/em&gt;” messages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is expected behavior, &lt;strong&gt;not a bug&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Common IAM Pitfall: When Amazon Q “Sees Nothing”
&lt;/h2&gt;

&lt;p&gt;The user needs &lt;strong&gt;appropriate IAM permissions&lt;/strong&gt; to interact with the cluster from the &lt;strong&gt;EKS console&lt;/strong&gt;, which is typically achieved through &lt;strong&gt;EKS Access Entries&lt;/strong&gt; using the &lt;strong&gt;AdminView&lt;/strong&gt; or &lt;strong&gt;ClusterAdmin&lt;/strong&gt; policies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;By default, you may not have the correct IAM and EKS settings&lt;/strong&gt; to use it, and when you issue a command, you may get the following error.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For example&lt;/strong&gt;, when you enter a command in the chat interface, you might encounter an error message like the following: “&lt;em&gt;I encountered an authorization error when trying to access the cluster…&lt;/em&gt;”&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs73rb90t0pdy1gj7u0xk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs73rb90t0pdy1gj7u0xk.png" alt="Amazon Q with missing permissions on Amazon EKS dashboards" width="292" height="653"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reason&lt;/strong&gt;:&lt;br&gt;
The Amazon &lt;strong&gt;Q&lt;/strong&gt; panel is visible and operational, but it &lt;strong&gt;cannot access&lt;/strong&gt; Kubernetes objects within the cluster, because Amazon Q &lt;strong&gt;lacks the required permissions to read resources&lt;/strong&gt; in your &lt;strong&gt;EKS&lt;/strong&gt; cluster.&lt;/p&gt;







&lt;h2&gt;
  
  
  The Full, Detailed Guide
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Important&lt;/strong&gt;&lt;br&gt;
You &lt;strong&gt;can find&lt;/strong&gt; the &lt;strong&gt;detailed guide&lt;/strong&gt; (which &lt;strong&gt;steps&lt;/strong&gt; are &lt;strong&gt;necessary&lt;/strong&gt; for &lt;strong&gt;IAM and EKS&lt;/strong&gt;, and what to do if you &lt;strong&gt;don’t have&lt;/strong&gt; an &lt;em&gt;aws-auth&lt;/em&gt; ConfigMap), together with the solution to the error, via the link to my &lt;strong&gt;Medium&lt;/strong&gt; article &lt;strong&gt;at the end of this post&lt;/strong&gt;.&lt;/p&gt;




&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  A Brief Explanation
&lt;/h2&gt;

&lt;p&gt;One of the most common issues I’ve encountered is the &lt;strong&gt;assumption&lt;/strong&gt; that Amazon Q &lt;strong&gt;automatically has Kubernetes-level visibility&lt;/strong&gt; once you open the EKS console. In practice, this is often &lt;strong&gt;not true&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Typical&lt;/strong&gt; symptoms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Q responds with partial or generic answers&lt;/li&gt;
&lt;li&gt;Pod- or node-level questions fail silently&lt;/li&gt;
&lt;li&gt;Messages like “&lt;em&gt;insufficient access&lt;/em&gt;” or “&lt;em&gt;unable to retrieve cluster data&lt;/em&gt;”&lt;/li&gt;
&lt;li&gt;The cluster appears healthy in the console, but Q cannot explain issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This usually &lt;strong&gt;indicates&lt;/strong&gt; an &lt;strong&gt;IAM access gap&lt;/strong&gt;, not a problem with Amazon Q itself.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;most common&lt;/strong&gt; underlying &lt;strong&gt;causes&lt;/strong&gt;: the IAM role used in the AWS Console&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;doesn't have&lt;/strong&gt; the right EKS &lt;strong&gt;permissions&lt;/strong&gt; (e.g., &lt;em&gt;eks:DescribeCluster&lt;/em&gt;), or&lt;/li&gt;
&lt;li&gt;is &lt;strong&gt;not authorized to access the Kubernetes API&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern EKS clusters rely on &lt;em&gt;EKS Cluster Access Management&lt;/em&gt;, where Kubernetes access is controlled via:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Access Entries&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Access Policies&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;IAM&lt;/strong&gt; → Kubernetes RBAC mapping&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Legacy&lt;/strong&gt; &lt;em&gt;aws-auth&lt;/em&gt;-based assumptions no longer apply.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix (high level)&lt;/strong&gt;: ensure that the console role&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;has&lt;/strong&gt; &lt;em&gt;eks:AccessKubernetesApi&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;is &lt;strong&gt;mapped&lt;/strong&gt; via an &lt;strong&gt;Access Entry&lt;/strong&gt; to the appropriate Kubernetes permissions&lt;/li&gt;
&lt;li&gt;uses a &lt;em&gt;read-level&lt;/em&gt; or &lt;em&gt;admin-level&lt;/em&gt; Access &lt;strong&gt;Policy&lt;/strong&gt;, depending on the use case&lt;/li&gt;
&lt;/ul&gt;
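
&lt;p&gt;At the CLI level, that mapping can be sketched with EKS Access Entries (the cluster name and role ARN are placeholders; swap the AWS-managed policy for a narrower one if read-only visibility is enough):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Register the console role with the cluster
aws eks create-access-entry \
  --cluster-name my-cluster \
  --principal-arn arn:aws:iam::111122223333:role/ConsoleRole

# Grant cluster-wide read-only visibility via an AWS-managed Access Policy
aws eks associate-access-policy \
  --cluster-name my-cluster \
  --principal-arn arn:aws:iam::111122223333:role/ConsoleRole \
  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSAdminViewPolicy \
  --access-scope type=cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;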

&lt;p&gt;&lt;strong&gt;Once this is correctly configured&lt;/strong&gt;, Amazon Q immediately gains the visibility required to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;list&lt;/strong&gt; pods and nodes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;inspect&lt;/strong&gt; workload state&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;provide&lt;/strong&gt; accurate, context-aware &lt;strong&gt;explanations&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This behavior is expected and intentional: &lt;strong&gt;Amazon Q never bypasses IAM or RBAC&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqr5qpwhespfdb3ahbuk6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqr5qpwhespfdb3ahbuk6.png" alt="Amazon Q with the right permissions on EKS dashboards" width="369" height="722"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Amazon Q Fits in Real Operations
&lt;/h2&gt;

&lt;p&gt;Amazon Q does &lt;strong&gt;not&lt;/strong&gt; replace:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;kubectl&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;GitOps pipelines (Argo CD/Flux)&lt;/li&gt;
&lt;li&gt;Full observability platforms&lt;/li&gt;
&lt;li&gt;Incident response processes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead, it acts as a &lt;strong&gt;diagnostic accelerator&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Faster&lt;/strong&gt; understanding of failures&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reduced&lt;/strong&gt; time-to-hypothesis&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved onboarding&lt;/strong&gt; for new engineers&lt;/li&gt;
&lt;li&gt;Consistent &lt;strong&gt;explanations&lt;/strong&gt; across teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For &lt;strong&gt;platform&lt;/strong&gt; and &lt;strong&gt;SRE teams&lt;/strong&gt;, it becomes a first-stop &lt;strong&gt;reasoning tool&lt;/strong&gt;, not the final authority.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Amazon Q Is Most Useful
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Recommended scenarios&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-cluster EKS environments&lt;/li&gt;
&lt;li&gt;Teams onboarding engineers new to Kubernetes&lt;/li&gt;
&lt;li&gt;Incident triage and exploratory debugging&lt;/li&gt;
&lt;li&gt;Environments with well-defined IAM and RBAC&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Less effective scenarios&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highly restricted clusters with minimal visibility&lt;/li&gt;
&lt;li&gt;Environments expecting “automatic fixes”&lt;/li&gt;
&lt;li&gt;Poorly structured access models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Amazon Q &lt;strong&gt;improves&lt;/strong&gt; on a good &lt;strong&gt;operating model&lt;/strong&gt;; it &lt;strong&gt;does not replace&lt;/strong&gt; missing fundamentals.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Amazon Q&lt;/strong&gt; provides &lt;strong&gt;AI-assisted reasoning&lt;/strong&gt; for &lt;strong&gt;EKS&lt;/strong&gt; operations&lt;/li&gt;
&lt;li&gt;Its value &lt;strong&gt;depends&lt;/strong&gt; entirely on &lt;strong&gt;correct access configuration&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;It &lt;strong&gt;accelerates understanding&lt;/strong&gt; but does &lt;strong&gt;not replace engineering judgment&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Teams should treat it as an &lt;strong&gt;assistant&lt;/strong&gt; whose output is verified, not an autonomous operator&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Used properly, Amazon Q can significantly &lt;strong&gt;reduce&lt;/strong&gt; the &lt;strong&gt;time&lt;/strong&gt; and effort required to &lt;strong&gt;debug&lt;/strong&gt; complex &lt;strong&gt;Kubernetes issues&lt;/strong&gt; in AWS environments.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AI-assisted operations&lt;/strong&gt; are &lt;strong&gt;becoming&lt;/strong&gt; a foundational capability in modern cloud platforms. &lt;strong&gt;Amazon Q&lt;/strong&gt; represents AWS’s first serious step toward native, &lt;strong&gt;context-aware AI debugging&lt;/strong&gt; for Kubernetes.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;teams that benefit most&lt;/strong&gt; will be those who &lt;strong&gt;combine&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clean EKS access design&lt;/li&gt;
&lt;li&gt;Strong IAM and RBAC practices&lt;/li&gt;
&lt;li&gt;Realistic expectations of AI assistance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That &lt;strong&gt;combination&lt;/strong&gt; - not AI alone - is what unlocks &lt;strong&gt;operational efficiency&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Note
&lt;/h2&gt;

&lt;p&gt;This DEV.to post is a concise version of a longer, experience-based guide.&lt;/p&gt;

&lt;p&gt;If you’re &lt;strong&gt;interested in deeper technical details&lt;/strong&gt;, &lt;strong&gt;IAM configuration nuances&lt;/strong&gt;, and real-world EKS lessons learned, you can find it among &lt;a href="https://medium.com/@zs77.robert/" rel="noopener noreferrer"&gt;my Medium stories&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This article&lt;/strong&gt; is the &lt;strong&gt;first part of a series&lt;/strong&gt; where we explore &lt;strong&gt;AI-oriented debugging and operational&lt;/strong&gt; workflows in &lt;strong&gt;Kubernetes&lt;/strong&gt; and Amazon &lt;strong&gt;EKS&lt;/strong&gt; environments.&lt;/p&gt;




&lt;h2&gt;
  
  
  About the Author
&lt;/h2&gt;

&lt;p&gt;I’m &lt;strong&gt;Róbert Zsótér&lt;/strong&gt;, Kubernetes &amp;amp; AWS architect.&lt;br&gt;&lt;br&gt;
If you’re into &lt;strong&gt;Kubernetes, EKS, Terraform&lt;/strong&gt;, and &lt;strong&gt;cloud-native security&lt;/strong&gt;, follow my latest posts here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  LinkedIn: &lt;a href="https://www.linkedin.com/in/r%C3%B3bert-zs%C3%B3t%C3%A9r-34541464/" rel="noopener noreferrer"&gt;Róbert Zsótér&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Substack: &lt;a href="https://cloudskillshu.substack.com/" rel="noopener noreferrer"&gt;CSHU&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s build secure, scalable clusters together.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Originally published on Medium &lt;a href="https://medium.com/@zs77.robert/enhancing-amazon-eks-operations-with-ai-capabilities-of-amazon-q-part-1-7d01f25f01df" rel="noopener noreferrer"&gt;Enhancing Amazon EKS Operations with AI capabilities of Amazon Q -Part 1&lt;/a&gt;&lt;/p&gt;




</description>
      <category>aws</category>
      <category>eks</category>
      <category>ai</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>kubectl-ai WebUI: A Visual Way to Use AI for Kubernetes Troubleshooting</title>
      <dc:creator>Robert Zsoter</dc:creator>
      <pubDate>Tue, 25 Nov 2025 15:00:00 +0000</pubDate>
      <link>https://dev.to/robert_r_7c237256b7614328/kubectl-ai-webui-a-visual-way-to-use-ai-for-kubernetes-troubleshooting-34g6</link>
      <guid>https://dev.to/robert_r_7c237256b7614328/kubectl-ai-webui-a-visual-way-to-use-ai-for-kubernetes-troubleshooting-34g6</guid>
      <description>&lt;h2&gt;
  
  
  kubectl-ai WebUI: A Visual Interface for AI-Powered Kubernetes Troubleshooting
&lt;/h2&gt;

&lt;p&gt;If you've been experimenting with &lt;strong&gt;kubectl-ai&lt;/strong&gt; for AI-assisted troubleshooting on Kubernetes, you probably know one thing already:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It’s powerful, but strictly CLI-based.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This creates a real barrier for developers, students, or platform engineers &lt;strong&gt;who are less comfortable&lt;/strong&gt; with &lt;strong&gt;command-line&lt;/strong&gt; workflows but still want to benefit from AI-driven explanations, log analysis, issue hunting and YAML generation.&lt;/p&gt;

&lt;p&gt;To solve this, &lt;strong&gt;I built&lt;/strong&gt; something new:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A WebUI for kubectl-ai&lt;/strong&gt;, a &lt;strong&gt;browser interface&lt;/strong&gt; that makes AI-assisted Kubernetes troubleshooting accessible to everyone.&lt;/p&gt;

&lt;p&gt;This article explains &lt;strong&gt;what it does&lt;/strong&gt;, &lt;strong&gt;why it helps&lt;/strong&gt;, and &lt;strong&gt;how you can try it out.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0f6d60kaua7ofudluq7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0f6d60kaua7ofudluq7.png" alt="WebUI for kubectl-ai" width="800" height="488"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What is kubectl-ai
&lt;/h2&gt;

&lt;p&gt;In short, kubectl-ai is an &lt;strong&gt;AI-powered plugin&lt;/strong&gt; for kubectl that transforms &lt;strong&gt;natural-language questions&lt;/strong&gt; into the appropriate Kubernetes commands.&lt;/p&gt;

&lt;p&gt;It functions as a CLI extension that brings together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your active Kubernetes cluster context,&lt;/li&gt;
&lt;li&gt;AI models (such as OpenAI, Anthropic, and others),&lt;/li&gt;
&lt;li&gt;Instant command generation and analysis directly from the terminal.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are &lt;strong&gt;interested in&lt;/strong&gt; how this works in practice, let me recommend my articles on this subject that are already available on &lt;strong&gt;Medium&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://medium.com/@zs77.robert/unveiling-kubectl-ai-an-extended-exploration-of-ai-powered-kubernetes-management-3e4facd5b505" rel="noopener noreferrer"&gt;Unveiling kubectl-ai: An Extended Exploration of AI-Powered Kubernetes Management&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/@zs77.robert/issue-hunting-in-a-kubernetes-cluster-with-kubectl-ai-a-practical-step-by-step-guide-4b34d5a1b7db" rel="noopener noreferrer"&gt;Issue Hunting in a Kubernetes Cluster with kubectl-ai: A Practical step-by-step Guide&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It &lt;strong&gt;helps&lt;/strong&gt; you diagnose issues and generate solutions.&lt;/p&gt;

&lt;p&gt;Typical kubectl-ai prompts:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;"Why is my pod stuck in CrashLoopBackOff?"  &lt;br&gt;
"Explain the last 100 lines of logs for service X."  &lt;br&gt;
"Generate a correct Ingress for this Deployment."  &lt;br&gt;
"Why does my Deployment not scale to 5 replicas?"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;It’s extremely useful, especially for debugging.&lt;/p&gt;

&lt;p&gt;But…&lt;/p&gt;




&lt;h2&gt;
  
  
  The CLI-Only Limitation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;For many Kubernetes newcomers&lt;/strong&gt;, the command line is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;intimidating&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;error-prone&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;difficult to navigate&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;visually limited&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;slow to learn&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I have noticed in my environment that &lt;strong&gt;40–60%&lt;/strong&gt; of my colleagues working with Kubernetes prefer the &lt;strong&gt;visual interface&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That’s why&lt;/strong&gt; I created a browser-based experience.&lt;/p&gt;




&lt;h2&gt;
  
  
  Introducing: &lt;strong&gt;kubectl-ai WebUI&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;browser UI&lt;/strong&gt; that exposes the same kubectl-ai logic, &lt;strong&gt;without&lt;/strong&gt; requiring the &lt;strong&gt;terminal&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  No deeper CLI knowledge needed
&lt;/h3&gt;

&lt;p&gt;Just open your browser → type your question → get the answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Same capabilities as kubectl-ai CLI
&lt;/h3&gt;

&lt;p&gt;Since kubectl-ai works in the background, it provides the same functionality but on a web interface:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Log analysis&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Error explanations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;YAML generation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kubernetes troubleshooting&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Best-practice fixes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;etc.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Works on top of your existing Kubernetes context
&lt;/h3&gt;

&lt;p&gt;The WebUI simply &lt;strong&gt;forwards your prompt&lt;/strong&gt; to the CLI and &lt;strong&gt;displays&lt;/strong&gt; the &lt;strong&gt;result&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Designed for:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;beginners&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;platform teams&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DevOps onboarding&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;training rooms&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;troubleshooting sessions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;classroom labs&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa02b4ossitdm0x9oh74r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa02b4ossitdm0x9oh74r.png" alt="Architecture overview of webui for kubectl-ai" width="710" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What happens under the hood:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The WebUI sends your prompt to the backend&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The backend triggers kubectl-ai&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;kubectl-ai queries Kubernetes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AI model generates reasoning&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The response is displayed visually&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
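&lt;p&gt;As a rough sketch of that forwarding step (this is &lt;strong&gt;not&lt;/strong&gt; the project’s actual code; the function name and invocation are assumptions, so check the repository for the real implementation), the backend essentially shells out to the CLI:&lt;/p&gt;

```shell
# Hypothetical sketch: the WebUI backend receives the user's prompt
# and forwards it to the kubectl-ai CLI, returning the CLI's output.
forward_prompt() {
  prompt="$1"
  binary="${2:-kubectl-ai}"   # CLI to invoke; overridable for testing
  "$binary" "$prompt"         # one-shot invocation with the prompt
}

# Example call (substituting echo for kubectl-ai, which needs a
# cluster context and an AI provider API key to actually run):
forward_prompt "why is my pod stuck in CrashLoopBackOff?" echo
```

&lt;p&gt;In the WebUI, the captured output is then rendered as formatted HTML instead of raw terminal text.&lt;/p&gt;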

&lt;p&gt;&lt;strong&gt;No terminal interaction&lt;/strong&gt; needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Installation (Quick Start)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Every step&lt;/strong&gt; is documented in detail in the &lt;strong&gt;GitHub&lt;/strong&gt; repository, so please read it before you start. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt; repo: &lt;a href="https://github.com/zsoterr/kubernetes-ai-examples/tree/main/project-k8s-kubectl-ai-web-ui" rel="noopener noreferrer"&gt;k8s-kubectl-ai-web-ui&lt;/a&gt;&lt;br&gt;
If you find it useful, feel free to ⭐ it or share your ideas there.&lt;/p&gt;




&lt;h2&gt;
  
  
  Using the WebUI (Examples)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Troubleshoot a CrashLoopBackOff
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;"Investigate why payment-service pod is restarting."&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Explain log output
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;"Help me understand the last 20 log lines for checkout-api."&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Generate YAML
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;"Create a working Ingress for my deployment: checkout-api, port 8080."&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Diagnose deployment issues
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;"Why isn't my HPA scaling above 3 replicas?"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Everything happens visually, without needing terminal commands.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Screenshots&lt;/strong&gt;:&lt;br&gt;
With &lt;em&gt;ReadOnly&lt;/em&gt; mode:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6d08qtdnshgu1znzwf1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6d08qtdnshgu1znzwf1.png" alt="WebUI - " width="699" height="801"&gt;&lt;/a&gt;&lt;br&gt;
and&lt;br&gt;
With &lt;em&gt;"Enabled cluster changes"&lt;/em&gt;:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokss3dzoc5abeykfejjf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokss3dzoc5abeykfejjf.png" alt="WebUI - " width="800" height="602"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  When the WebUI Is Useful
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Ideal for:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;engineers who are new to Kubernetes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;anyone not yet confident or comfortable with the (Linux) shell&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;team onboarding&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;demos, training, and Kubernetes courses&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;situations where some teammates prefer a UI over the CLI&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Not ideal for:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;production environments with very strict RBAC&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;shared clusters with single API key authentication&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;environments where shell access is mandatory for auditing&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Features You Can Customize
&lt;/h2&gt;

&lt;p&gt;Inside the project, you can modify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;AI provider configuration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;namespace filters&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;safety rules for commands&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;input sanitization&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;UI wording and layout&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project is &lt;strong&gt;intended to be extended&lt;/strong&gt; (check the README file in the project folder on GitHub).&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Project Exists
&lt;/h2&gt;

&lt;p&gt;As a Kubernetes enthusiast and AWS/K8s architect, I’ve seen again and again:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;people want to use AI for K8s&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;kubectl-ai is great&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;but the CLI stops many people from even trying&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This WebUI removes that friction and makes AI-powered troubleshooting available to the entire team, not just the CLI-native power users.&lt;/p&gt;




&lt;h2&gt;
  
  
  Repository
&lt;/h2&gt;

&lt;p&gt;Full project: &lt;a href="https://github.com/zsoterr/kubernetes-ai-examples/tree/main/project-k8s-kubectl-ai-web-ui" rel="noopener noreferrer"&gt;k8s-kubectl-ai-web-ui&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you find it useful, feel free to ⭐ it or share your ideas there.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About the Author&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
I’m &lt;strong&gt;Róbert Zsótér&lt;/strong&gt;, Kubernetes &amp;amp; AWS architect.&lt;br&gt;&lt;br&gt;
If you’re into &lt;strong&gt;Kubernetes, EKS, Terraform&lt;/strong&gt;, and &lt;strong&gt;cloud-native security&lt;/strong&gt;, follow my latest posts here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  LinkedIn: &lt;a href="https://www.linkedin.com/in/r%C3%B3bert-zs%C3%B3t%C3%A9r-34541464/" rel="noopener noreferrer"&gt;Róbert Zsótér&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Substack: &lt;a href="https://cloudskillshu.substack.com/" rel="noopener noreferrer"&gt;CSHU&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s build secure, scalable clusters, together.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Originally published on Medium &lt;a href="https://medium.com/p/cc5dcf78d41f" rel="noopener noreferrer"&gt;Kubectl-ai WebUI: Making AI Kubernetes Debugging Browser-Friendly for All&lt;/a&gt;&lt;/p&gt;




</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>kubectl</category>
      <category>devops</category>
    </item>
    <item>
      <title>Extending Your Kubernetes CLI: kubectl Plugins and Fixes</title>
      <dc:creator>Robert Zsoter</dc:creator>
      <pubDate>Tue, 18 Nov 2025 15:06:00 +0000</pubDate>
      <link>https://dev.to/robert_r_7c237256b7614328/extending-your-kubernetes-cli-kubectl-plugins-and-fixes-4h9p</link>
      <guid>https://dev.to/robert_r_7c237256b7614328/extending-your-kubernetes-cli-kubectl-plugins-and-fixes-4h9p</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;The Power of kubectl Plugins&lt;/strong&gt; and Fixing Common Errors.
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Learn kubectl plugin&lt;/strong&gt; use cases, how they work, alternatives, and how to troubleshoot errors like ‘&lt;em&gt;unknown command&lt;/em&gt;’.&lt;/p&gt;




&lt;p&gt;As a Kubernetes user, have you ever &lt;strong&gt;wished your kubectl could do more&lt;/strong&gt;? &lt;strong&gt;Or hit an error&lt;/strong&gt; like “&lt;em&gt;error: unknown command ‘oidc-login’ for ‘kubectl’&lt;/em&gt;”? &lt;br&gt;&lt;br&gt;
In this article, we’ll break down why &lt;strong&gt;plugins&lt;/strong&gt; are game-changers, &lt;strong&gt;how they work,&lt;/strong&gt; when to use them (or not), and the alternatives, plus &lt;strong&gt;a real fix&lt;/strong&gt; on Ubuntu 24.04 with Amazon EKS.&lt;/p&gt;


&lt;h3&gt;
  
  
  &lt;strong&gt;What Are kubectl Plugins?&lt;/strong&gt; 
&lt;/h3&gt;

&lt;p&gt;kubectl plugins are &lt;strong&gt;CLI extensions&lt;/strong&gt; that boost the standard kubectl tool with extra features. They’re lightweight, &lt;strong&gt;task-focused utilities&lt;/strong&gt; that blend seamlessly into your Kubernetes routine. They add new commands without touching the core tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Basics&lt;/strong&gt;: Plugins are &lt;strong&gt;standalone executables&lt;/strong&gt; that introduce &lt;strong&gt;sub-commands&lt;/strong&gt; (e.g., &lt;em&gt;kubectl oidc-login&lt;/em&gt;). &lt;br&gt;&lt;br&gt;
&lt;strong&gt;Use case&lt;/strong&gt;: When built-in tools fall short, like for custom authentication or debugging. &lt;br&gt;&lt;br&gt;
&lt;strong&gt;Why are they good?&lt;/strong&gt; Modular design, simple setup, community-backed, and they speed up workflows without adding complexity.&lt;/p&gt;


&lt;h3&gt;
  
  
  Why Use kubectl Plugins? 
&lt;/h3&gt;

&lt;p&gt;Kubernetes is built for &lt;strong&gt;modularity&lt;/strong&gt;, and its CLI follows suit. A plugin is an &lt;strong&gt;executable&lt;/strong&gt; named &lt;strong&gt;&lt;em&gt;kubectl-&amp;lt;name&amp;gt;&lt;/em&gt;&lt;/strong&gt;. &lt;br&gt;&lt;br&gt;
Typing &lt;em&gt;kubectl &amp;lt;name&amp;gt;&lt;/em&gt; &lt;strong&gt;triggers the CLI to scan your PATH&lt;/strong&gt; and &lt;strong&gt;run&lt;/strong&gt; it like a native &lt;strong&gt;command&lt;/strong&gt;. &lt;br&gt;&lt;br&gt;
&lt;strong&gt;Use cases:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Custom authentication&lt;/strong&gt;: like &lt;em&gt;kubectl oidc-login&lt;/em&gt; for OIDC (Dex, &lt;strong&gt;Keycloak&lt;/strong&gt;, Azure AD,etc.) 
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inspection&lt;/strong&gt;: &lt;em&gt;kubectl tree&lt;/em&gt;, &lt;em&gt;ctx&lt;/em&gt; for operators/developers. 
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation&lt;/strong&gt;: &lt;em&gt;kubectl neat&lt;/em&gt; cleans manifests, sort-manifests optimizes. 
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: &lt;em&gt;kubectl who-can&lt;/em&gt; for access audits, &lt;em&gt;trace&lt;/em&gt; for runtime analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Benefit&lt;/strong&gt;: Bridges basic kubectl to robust production setups.&lt;/p&gt;
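&lt;p&gt;To make the naming convention concrete, here is a minimal toy plugin (the plugin name &lt;em&gt;hello&lt;/em&gt; and the &lt;em&gt;~/bin&lt;/em&gt; location are just examples):&lt;/p&gt;

```shell
# A plugin is simply an executable on your PATH whose name starts
# with "kubectl-". Create one:
mkdir -p "$HOME/bin"
printf '#!/bin/sh\necho "Hello from a kubectl plugin!"\n' \
  > "$HOME/bin/kubectl-hello"
chmod +x "$HOME/bin/kubectl-hello"
export PATH="$HOME/bin:$PATH"

# With kubectl installed, `kubectl hello` now dispatches to this script.
# Since a plugin is an ordinary executable, it also runs standalone:
kubectl-hello
```

&lt;p&gt;That is the whole mechanism: no registration, no configuration file, just an executable discovered via PATH.&lt;/p&gt;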


&lt;h3&gt;
  
  
  When to Use (or Not)?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Use&lt;/strong&gt;: to fill gaps in the core tool, boosting efficiency and speed. &lt;br&gt;&lt;br&gt;
&lt;strong&gt;Not&lt;/strong&gt;: when the built-in commands are enough, and avoid untrusted plugins (&lt;strong&gt;plugins run with your kubectl permissions&lt;/strong&gt;, a potential security risk). For scripting, alternatives such as SDKs are often the better fit.&lt;/p&gt;


&lt;h3&gt;
  
  
  What Happens When a Plugin Is Missing? (a real use-case)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Real example&lt;/strong&gt; on &lt;strong&gt;Ubuntu&lt;/strong&gt; 24.04 LTS &lt;strong&gt;with&lt;/strong&gt; Amazon &lt;strong&gt;EKS&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;I ran this&lt;/strong&gt; command after importing the relevant kubeconfig file and &lt;strong&gt;got an error&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl get nodes - context &amp;lt;YOUR-CLUSTER-CONTEXT&amp;gt; 
error: unknown command "oidc-login" for "kubectl" E1104 07:21:55.470872XXXXX memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://XXXXXXXX.XX.XX-XXXX-X.eks.amazonaws.com/api?timeout=32s\": getting credentials: exec: executable kubectl failed with exit code 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Reason&lt;/strong&gt;? The OIDC plugin (&lt;em&gt;kubectl oidc-login&lt;/em&gt;) wasn’t installed. Without it, Kubernetes can’t fetch OIDC tokens, failing the API call.&lt;/p&gt;




&lt;h4&gt;
  
  
  How to Fix It? 
&lt;/h4&gt;

&lt;p&gt;Use &lt;strong&gt;Krew&lt;/strong&gt;, the plugin manager. &lt;br&gt;&lt;br&gt;
&lt;strong&gt;What is Krew?&lt;/strong&gt; &lt;br&gt;&lt;br&gt;
Krew is the &lt;strong&gt;official plugin manager&lt;/strong&gt; for the kubectl command-line tool. It lets you &lt;strong&gt;discover&lt;/strong&gt;, &lt;strong&gt;install&lt;/strong&gt;, and &lt;strong&gt;manage plugins&lt;/strong&gt; (update) directly on your system, keeping tools organized and current. &lt;/p&gt;

&lt;p&gt;Let’s go and &lt;strong&gt;install Krew&lt;/strong&gt; (follow the official Krew docs or run these commands):&lt;br&gt;
&lt;br&gt;
   &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# One-time install of krew 
set -e 
cd "$(mktemp -d)" 
OS="$(uname | tr '[:upper:]' '[:lower:]' )" 
ARCH="$(uname -m | sed -e 's/x86_64/amd64/' -e 's/aarch64/arm64/' -e 's/armv.*/arm/' )" 
curl -fsSLO "https://github.com/kubernetes-sigs/krew/releases/latest/download/krew-${OS}_${ARCH}.tar.gz" 
tar zxvf "krew-${OS}_${ARCH}.tar.gz" 
./krew-"${OS}_${ARCH}" install krew 
# Make krew plugins available on PATH (persist for your shell) 
echo 'export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"' &amp;gt;&amp;gt; ~/.bashrc 
export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Then run&lt;/strong&gt;: &lt;em&gt;kubectl krew install oidc-login&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;After that&lt;/strong&gt;, kubectl recognizes the plugin, and OIDC authentication should work.&lt;/p&gt;




&lt;h3&gt;
  
  
  Benefits of Using Plugins via Krew 
&lt;/h3&gt;

&lt;p&gt;Krew &lt;strong&gt;simplifies management&lt;/strong&gt; like apt/brew/pip. &lt;br&gt;&lt;br&gt;
&lt;strong&gt;Advantages&lt;/strong&gt;:   &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A central index of trusted plugins, 
&lt;/li&gt;
&lt;li&gt;Easy updates: &lt;em&gt;kubectl krew upgrade&lt;/em&gt;, 
&lt;/li&gt;
&lt;li&gt;Per-user installation, &lt;strong&gt;no root required&lt;/strong&gt;, 
&lt;/li&gt;
&lt;li&gt;Seamless PATH integration.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Security Note!&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Always verify sources&lt;/strong&gt; or &lt;strong&gt;stick to the official Krew Index&lt;/strong&gt; to avoid untrusted binaries, as &lt;strong&gt;plugins&lt;/strong&gt; inherit the same &lt;strong&gt;access permissions as kubectl&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Alternatives to kubectl Plugins 
&lt;/h2&gt;

&lt;p&gt;While plugins enhance functionality flexibly, &lt;strong&gt;alternatives exist&lt;/strong&gt; for extending Kubernetes interaction. &lt;br&gt;&lt;br&gt;
Each &lt;strong&gt;for different purposes&lt;/strong&gt;:   &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Plugins&lt;/strong&gt; → Extend kubectl directly, minimal overhead 
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wrappers&lt;/strong&gt; / SDKs → Ideal for CI/CD automation 
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud CLIs&lt;/strong&gt; → Best for provider-specific operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvvqiz4kk7wixir48v5v2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvvqiz4kk7wixir48v5v2.png" alt="Alternatives to kubectl Plugins" width="786" height="407"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Final Thoughts 
&lt;/h3&gt;

&lt;p&gt;kubectl &lt;strong&gt;plugins&lt;/strong&gt; demonstrate Kubernetes’ extensibility, &lt;strong&gt;customizing the CLI&lt;/strong&gt; for any Kubernetes environment or workflow. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which is the best plugin?&lt;/strong&gt; &lt;br&gt;&lt;br&gt;
No “best”: &lt;strong&gt;find the most suitable for your use cases!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let me &lt;strong&gt;recommend a few&lt;/strong&gt; worth trying: &lt;br&gt;&lt;br&gt;
&lt;strong&gt;For Ops:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;kubectl preq&lt;/strong&gt;: Analyze apps for bugs, misconfigs, anti-patterns. 
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;kubectl tail:&lt;/strong&gt; Real-time log streaming that aggregates logs from multiple pods, more flexible than &lt;em&gt;logs -f&lt;/em&gt;. 
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;kubens&lt;/strong&gt;: Switch namespaces instantly, companion to kubectx. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For Devs:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;kubectl score&lt;/strong&gt;: Static analysis for YAML, lints/validates manifests. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For both:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;kubectl sniff:&lt;/strong&gt; Run “tcpdump” inside Pods with Wireshark, capture network traffic. &lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Learn More!&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes &lt;strong&gt;plugins&lt;/strong&gt;: &lt;a href="https://kubernetes.io/docs/tasks/extend-kubectl/kubectl-plugins/" rel="noopener noreferrer"&gt;Extend kubectl with plugins&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Krew&lt;/strong&gt;: &lt;a href="https://krew.sigs.k8s.io/" rel="noopener noreferrer"&gt;Plugin manager for kubectl&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;kubelogin&lt;/strong&gt;: &lt;a href="https://github.com/int128/kubelogin" rel="noopener noreferrer"&gt;Plugin for Kubernetes OIDC auth&lt;/a&gt;
&lt;/li&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;About the Author&lt;/strong&gt;&lt;br&gt;
I’m &lt;strong&gt;Róbert Zsótér&lt;/strong&gt;, Kubernetes &amp;amp; AWS architect.&lt;br&gt;
If you’re into &lt;strong&gt;Kubernetes, EKS, Terraform&lt;/strong&gt;, and &lt;strong&gt;cloud-native security&lt;/strong&gt;, follow my latest posts here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LinkedIn: &lt;a href="https://www.linkedin.com/in/r%C3%B3bert-zs%C3%B3t%C3%A9r-34541464/" rel="noopener noreferrer"&gt;Róbert Zsótér&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Substack: &lt;a href="https://cloudskillshu.substack.com/" rel="noopener noreferrer"&gt;CSHU&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s build secure, scalable clusters, together.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Originally published on Medium &lt;a href="https://medium.com/@zs77.robert/extending-your-kubernetes-cli-kubectl-plugins-and-fixes-fc5755d0b8d2" rel="noopener noreferrer"&gt;Extending Your Kubernetes CLI: kubectl Plugins and Fixes&lt;/a&gt;&lt;/p&gt;





</description>
      <category>kubernetes</category>
      <category>kubectl</category>
      <category>troubleshooting</category>
      <category>devops</category>
    </item>
    <item>
      <title>Hands-On with Kubernetes 1.33: My PoC on In-Place Vertical Scaling</title>
      <dc:creator>Robert Zsoter</dc:creator>
      <pubDate>Tue, 11 Nov 2025 08:04:00 +0000</pubDate>
      <link>https://dev.to/robert_r_7c237256b7614328/hands-on-with-kubernetes-133-my-poc-on-in-place-vertical-scaling-147</link>
      <guid>https://dev.to/robert_r_7c237256b7614328/hands-on-with-kubernetes-133-my-poc-on-in-place-vertical-scaling-147</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Kubernetes 1.33 &lt;strong&gt;introduced&lt;/strong&gt; a game-changing feature: &lt;strong&gt;&lt;em&gt;in-place vertical pod scaling&lt;/em&gt;&lt;/strong&gt;, now in beta and &lt;strong&gt;enabled by default&lt;/strong&gt;. As a cloud engineer, I was eager to &lt;strong&gt;test its ability&lt;/strong&gt; to &lt;strong&gt;dynamically adjust CPU&lt;/strong&gt; and memory for running pods &lt;strong&gt;without restarts&lt;/strong&gt;, a potential win for &lt;strong&gt;resource optimization&lt;/strong&gt;. Inspired by its promise, I set up a Proof of Concept (&lt;strong&gt;PoC&lt;/strong&gt;) on &lt;strong&gt;AWS EC2&lt;/strong&gt; using &lt;strong&gt;Minikube&lt;/strong&gt; to explore its practical applications.&lt;/p&gt;

&lt;p&gt;In this guide, &lt;strong&gt;I’ll walk you&lt;/strong&gt; through my &lt;strong&gt;step-by-step&lt;/strong&gt; process, from cluster setup to scaling automation, and share &lt;strong&gt;insights&lt;/strong&gt; for leveraging this feature in your own test environments.&lt;br&gt;&lt;br&gt;
Let’s dive in!&lt;/p&gt;


&lt;h2&gt;
  
  
  What is In-Place Vertical Pod Scaling?
&lt;/h2&gt;

&lt;p&gt;Kubernetes &lt;strong&gt;1.33&lt;/strong&gt; brings &lt;strong&gt;&lt;em&gt;in-place vertical pod scaling&lt;/em&gt;&lt;/strong&gt; as a default feature that lets you &lt;strong&gt;&lt;em&gt;adjust a pod’s CPU and memory resources on the fly&lt;/em&gt;&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
Unlike traditional scaling, this approach &lt;strong&gt;avoids pod restarts&lt;/strong&gt;, making it ideal for maintaining &lt;strong&gt;application availability&lt;/strong&gt;. I was curious about its potential to optimize my Customer’s application efficiently.&lt;br&gt;&lt;br&gt;
This guide will explain &lt;strong&gt;how&lt;/strong&gt; this works and &lt;strong&gt;why&lt;/strong&gt; it matters for your Kubernetes journey.&lt;/p&gt;
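&lt;p&gt;As a preview of what this looks like in practice (a sketch only: the pod name &lt;em&gt;nginx-pod&lt;/em&gt;, container name &lt;em&gt;nginx&lt;/em&gt;, and resource values are made up, and this requires a cluster with the feature enabled), a running pod’s resources are changed through the &lt;em&gt;resize&lt;/em&gt; subresource:&lt;/p&gt;

```shell
# Resize CPU in place via the "resize" subresource (kubectl v1.32+):
kubectl patch pod nginx-pod --subresource resize --patch \
  '{"spec":{"containers":[{"name":"nginx","resources":{"requests":{"cpu":"800m"},"limits":{"cpu":"800m"}}}]}}'

# The pod keeps running; verify the applied values:
kubectl get pod nginx-pod -o jsonpath='{.spec.containers[0].resources}'
```

&lt;p&gt;We will build up to exactly this kind of operation in the PoC below.&lt;/p&gt;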


&lt;h2&gt;
  
  
  Prerequisites for the PoC
&lt;/h2&gt;

&lt;p&gt;Before diving into my PoC, you’ll need the &lt;strong&gt;right tools&lt;/strong&gt; to replicate my setup on &lt;strong&gt;AWS EC2&lt;/strong&gt;. This includes &lt;strong&gt;installing Minikube&lt;/strong&gt; with Kubernetes &lt;strong&gt;1.33&lt;/strong&gt; and setting up the &lt;strong&gt;Metrics Server&lt;/strong&gt; for resource monitoring. I ensured these were in place to make the scaling process smooth and measurable.&lt;br&gt;&lt;br&gt;
Let’s cover the essentials to get you started!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
I don’t want to go into &lt;strong&gt;how to install a Minikube&lt;/strong&gt; cluster: there are lots of good tutorials on the Internet. It is a very simple step.&lt;/p&gt;



&lt;p&gt;In my PoC, I used an &lt;strong&gt;EC2&lt;/strong&gt; instance on &lt;strong&gt;AWS&lt;/strong&gt;, but this is &lt;strong&gt;not a limitation&lt;/strong&gt;: you &lt;strong&gt;can use&lt;/strong&gt; any &lt;strong&gt;other&lt;/strong&gt; environment, there is no dependency in this respect.&lt;br&gt;&lt;br&gt;
For the &lt;strong&gt;Minikube instance&lt;/strong&gt; I chose an EC2 &lt;strong&gt;instance type&lt;/strong&gt; (t3a.medium) with the following “hardware” parameters: vCPU: 2, RAM: 4 GiB, storage: 40 GB (with the default storage type)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;More&lt;/strong&gt; information: &lt;a href="https://aws.amazon.com/ec2/instance-types/t3/" rel="noopener noreferrer"&gt;https://aws.amazon.com/ec2/instance-types/t3/&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Setting Up the Test Environment
&lt;/h2&gt;

&lt;p&gt;Creating a solid test environment was the &lt;strong&gt;first step&lt;/strong&gt; in my PoC journey with &lt;strong&gt;Kubernetes 1.33&lt;/strong&gt;. I &lt;strong&gt;set up a Minikube&lt;/strong&gt; cluster on AWS EC2 and &lt;strong&gt;deployed a Metrics Server&lt;/strong&gt; to track pod resources. This setup allowed me to test the vertical scaling feature with a sample Nginx pod.&lt;/p&gt;

&lt;p&gt;Here, I’ll guide you through building this foundation.&lt;/p&gt;


&lt;h3&gt;
  
  
  Start a Minikube cluster
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: There are &lt;strong&gt;two key points&lt;/strong&gt; here.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; We need to use &lt;strong&gt;containerd&lt;/strong&gt; instead of Docker.
The reason: there &lt;strong&gt;was a problem with resizing the Pod when I used Docker&lt;/strong&gt;. If you need &lt;strong&gt;more information&lt;/strong&gt; about this issue, &lt;strong&gt;contact&lt;/strong&gt; me; I won’t go into detail here, to keep this story from getting too long.&lt;/li&gt;
&lt;li&gt; I should &lt;strong&gt;also mention&lt;/strong&gt; that after extensive testing and checking, I found that although this is &lt;strong&gt;supposed to be an enabled&lt;/strong&gt; feature in this version, it &lt;strong&gt;works a bit “differently”&lt;/strong&gt; in Minikube: it is not enabled for all components.
&lt;strong&gt;The easiest way&lt;/strong&gt; in this case is to &lt;strong&gt;start&lt;/strong&gt; the cluster with an &lt;strong&gt;additional&lt;/strong&gt; command-line option: “&lt;em&gt;--feature-gates=InPlacePodVerticalScaling=true&lt;/em&gt;”&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Start&lt;/strong&gt; the Minikube:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;minikube start --kubernetes-version=v1.33.1 --container-runtime=containerd --feature-gates=InPlacePodVerticalScaling=true&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before we go deeper,&lt;/strong&gt; make sure that your &lt;strong&gt;Minikube cluster is up and running&lt;/strong&gt; and you are &lt;strong&gt;interacting with this cluster&lt;/strong&gt; (&lt;em&gt;kubeconfig&lt;/em&gt; is configured for the Minikube)&lt;/p&gt;

&lt;p&gt;&lt;code&gt;minikube status  &lt;br&gt;
alias k=kubectl  &lt;br&gt;
k config current-context&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If&lt;/strong&gt; your &lt;em&gt;kubeconfig&lt;/em&gt; &lt;strong&gt;is not configured&lt;/strong&gt; correctly run this command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl config use-context minikube  &lt;br&gt;
k get nodes&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Expected output is:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;NAME       STATUS   ROLES           AGE     VERSION  &lt;br&gt;
minikube   Ready    control-plane   5m30s   v1.33.1&lt;/code&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Enable Metrics server
&lt;/h3&gt;

&lt;p&gt;We need metrics, of course. We &lt;strong&gt;can install&lt;/strong&gt; the Metrics Server by downloading its manifest file and applying it (using the &lt;em&gt;kubectl apply -f&lt;/em&gt; command), but the easiest way is to &lt;strong&gt;enable it&lt;/strong&gt; as an addon:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;minikube addons enable metrics-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected &lt;strong&gt;output&lt;/strong&gt; is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* metrics-server is an addon maintained by Kubernetes. For any concerns contact minikube on GitHub.
You can view the list of minikube maintainers at: https://github.com/kubernetes/minikube/blob/master/OWNERS
  - Using image registry.k8s.io/metrics-server/metrics-server:v0.7.2
* The 'metrics-server' addon is enabled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Wait a few minutes&lt;/strong&gt; and make sure that you have metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;k top pods -A
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected &lt;strong&gt;output&lt;/strong&gt; is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAMESPACE     NAME                               CPU(cores)   MEMORY(bytes)  
kube-system   coredns-674b8bbfcf-m7jqr           2m           12Mi  
kube-system   etcd-minikube                      16m          26Mi  
kube-system   kindnet-2n29r                      1m           7Mi  
kube-system   kube-apiserver-minikube            33m          205Mi  
kube-system   kube-controller-manager-minikube   14m          41Mi  
kube-system   kube-proxy-jh6gm                   1m           11Mi  
kube-system   kube-scheduler-minikube            7m           19Mi  
kube-system   metrics-server-7fbb699795-bsb8v    3m           15Mi  
kube-system   storage-provisioner                2m           7Mi  
test          test-pod                           0m           2Mi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Create namespace and manifest file
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Next&lt;/strong&gt; create a new &lt;strong&gt;namespace&lt;/strong&gt; and &lt;strong&gt;start a Pod&lt;/strong&gt; within it:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl create namespace test&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create&lt;/strong&gt; a file named “&lt;em&gt;test-pod.yaml&lt;/em&gt;” and &lt;strong&gt;save&lt;/strong&gt; it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1  
kind: Pod  
metadata:  
  name: test-pod  
  namespace: test  
spec:  
  containers:  
  - name: nginx  
    image: nginx  
    resources:  
      requests:  
        cpu: "200m"  
        memory: "128Mi"  
      limits:  
        cpu: "500m"  
        memory: "256Mi"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and apply it: &lt;br&gt;
&lt;code&gt;kubectl apply -f test-pod.yaml&lt;/code&gt;&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: don’t forget that you should &lt;strong&gt;match your kubectl version with the running Kubernetes version&lt;/strong&gt;! (There is some tolerance for skew, but I always suggest keeping them at the same version level.)&lt;/p&gt;



&lt;p&gt;That means that you should &lt;strong&gt;avoid a similar situation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ kubectl version&lt;br&gt;
Client Version: v1.28.3&lt;br&gt;
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3&lt;br&gt;
Server Version: v1.33.1&lt;br&gt;
WARNING: version difference between client (1.28) and server (1.33) exceeds the supported minor version skew of +/-1&lt;/code&gt;&lt;/p&gt;
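&lt;p&gt;As a quick sanity check, you can compare the client and server minor versions yourself. A minimal sketch in plain bash; the version strings below are illustrative (in practice, take them from the &lt;em&gt;kubectl version&lt;/em&gt; output):&lt;/p&gt;

```shell
#!/bin/bash
# Illustrative version strings; in practice extract them from `kubectl version`.
client_version="v1.28.3"
server_version="v1.33.1"

# Take the middle field of vMAJOR.MINOR.PATCH.
client_minor=$(echo "$client_version" | cut -d. -f2)
server_minor=$(echo "$server_version" | cut -d. -f2)

# Absolute minor-version skew
skew=$(( server_minor - client_minor ))
if [ "$skew" -lt 0 ]; then skew=$(( -skew )); fi

if [ "$skew" -gt 1 ]; then
  echo "WARNING: minor version skew of $skew exceeds the supported +/-1"
else
  echo "Version skew OK ($skew)"
fi
```

&lt;p&gt;With the example values above, the skew is 5, so the script prints the warning, mirroring the kubectl message.&lt;/p&gt;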

&lt;p&gt;&lt;strong&gt;Confirm&lt;/strong&gt; that the test pod is up and &lt;strong&gt;running&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;k get pods -ntest&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Expected &lt;strong&gt;output&lt;/strong&gt; is:&lt;br&gt;
&lt;code&gt;NAME       READY   STATUS    RESTARTS   AGE&lt;br&gt;
test-pod   1/1     Running   0          17m&lt;/code&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Testing manual resizing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Validate&lt;/strong&gt; that the resizing works well: first, we will test it &lt;strong&gt;manually&lt;/strong&gt;. Later, we will move forward and &lt;strong&gt;automate&lt;/strong&gt; these steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First&lt;/strong&gt;, run these commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl patch pod test-pod -n test --subresource resize --patch '{"spec": {"containers": [{"name": "nginx", "resources": {"requests": {"cpu": "300m", "memory": "256Mi"}, "limits": {"cpu": "600m", "memory": "512Mi"}}}]}}'
kubectl get pod test-pod -n test -o yaml
kubectl get pod test-pod -n test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
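&lt;p&gt;For longer patches, it can be easier to keep the JSON in a file and pass it with &lt;em&gt;--patch-file&lt;/em&gt; instead of an inline string. A minimal sketch using the same values as the manual test above (the file path is just an example):&lt;/p&gt;

```shell
#!/bin/bash
# Write the resize patch to a file (values match the manual patch above).
cat <<'EOF' > /tmp/resize-patch.json
{
  "spec": {
    "containers": [
      {
        "name": "nginx",
        "resources": {
          "requests": {"cpu": "300m", "memory": "256Mi"},
          "limits": {"cpu": "600m", "memory": "512Mi"}
        }
      }
    ]
  }
}
EOF

# Then apply it with:
#   kubectl patch pod test-pod -n test --subresource resize --patch-file /tmp/resize-patch.json
echo "Patch written to /tmp/resize-patch.json"
```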



&lt;p&gt;&lt;strong&gt;Make sure&lt;/strong&gt; that the pod wasn’t restarted and &lt;strong&gt;limits&lt;/strong&gt; and &lt;strong&gt;requests&lt;/strong&gt; are &lt;strong&gt;updated&lt;/strong&gt; correctly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pod test-pod -n test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;NAME       READY   STATUS    RESTARTS   AGE  &lt;br&gt;
test-pod   1/1     Running   0          30m&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;k get pods -ntest -oyaml |grep -i -E "limits|requests"  -A4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected &lt;strong&gt;output&lt;/strong&gt; is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt; &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"apiVersion"&lt;/span&gt;:&lt;span class="s2"&gt;"v1"&lt;/span&gt;,&lt;span class="s2"&gt;"kind"&lt;/span&gt;:&lt;span class="s2"&gt;"Pod"&lt;/span&gt;,&lt;span class="s2"&gt;"metadata"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"annotations"&lt;/span&gt;:&lt;span class="o"&gt;{}&lt;/span&gt;,&lt;span class="s2"&gt;"name"&lt;/span&gt;:&lt;span class="s2"&gt;"test-pod"&lt;/span&gt;,&lt;span class="s2"&gt;"namespace"&lt;/span&gt;:&lt;span class="s2"&gt;"test"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;,&lt;span class="s2"&gt;"spec"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"containers"&lt;/span&gt;:&lt;span class="se"&gt;\[&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"image"&lt;/span&gt;:&lt;span class="s2"&gt;"nginx"&lt;/span&gt;,&lt;span class="s2"&gt;"name"&lt;/span&gt;:&lt;span class="s2"&gt;"nginx"&lt;/span&gt;,&lt;span class="s2"&gt;"resources"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"limits"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"cpu"&lt;/span&gt;:&lt;span class="s2"&gt;"500m"&lt;/span&gt;,&lt;span class="s2"&gt;"memory"&lt;/span&gt;:&lt;span class="s2"&gt;"256Mi"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;,&lt;span class="s2"&gt;"requests"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"cpu"&lt;/span&gt;:&lt;span class="s2"&gt;"200m"&lt;/span&gt;,&lt;span class="s2"&gt;"memory"&lt;/span&gt;:&lt;span class="s2"&gt;"128Mi"&lt;/span&gt;&lt;span class="o"&gt;}}}&lt;/span&gt;&lt;span class="se"&gt;\]&lt;/span&gt;&lt;span class="o"&gt;}}&lt;/span&gt;  
    creationTimestamp: &lt;span class="s2"&gt;"2025-05-28T11:54:23Z"&lt;/span&gt;  
    generation: 2  
    name: test-pod  
    namespace: &lt;span class="nb"&gt;test  

        &lt;/span&gt;limits:  
          cpu: 600m  
          memory: 512Mi  
        requests:  
          cpu: 300m  
          memory: 256Mi  
      terminationMessagePath: /dev/termination-log  
      terminationMessagePolicy: File  

        limits:  
          cpu: 600m  
          memory: 512Mi  
        requests:  
          cpu: 300m  
          memory: 256Mi  
      restartCount: 0  
      started: &lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
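&lt;p&gt;The &lt;em&gt;grep -i -E "limits|requests" -A4&lt;/em&gt; filter used above simply prints each matching line plus the four lines that follow it (overlapping ranges are merged). A small self-contained illustration on an inline YAML fragment with sample values:&lt;/p&gt;

```shell
#!/bin/bash
# Sample fragment of pod YAML (values are illustrative).
yaml='    limits:
      cpu: 600m
      memory: 512Mi
    requests:
      cpu: 300m
      memory: 256Mi
    restartCount: 0'

# Print each matching line and the 4 lines after it.
echo "$yaml" | grep -i -E "limits|requests" -A4
```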






&lt;h2&gt;
  
  
  Configuring Access and Permissions
&lt;/h2&gt;

&lt;p&gt;Set up &lt;strong&gt;RBAC&lt;/strong&gt; permissions to enable your monitoring script to &lt;strong&gt;access pod metrics&lt;/strong&gt; and perform resize operations. This is necessary in our case because it supports our automation goal (using a &lt;em&gt;CronJob&lt;/em&gt; with the &lt;em&gt;monitor.sh&lt;/em&gt; script to &lt;strong&gt;resize pods based on resource usage&lt;/strong&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First&lt;/strong&gt;, we need to “restore” the previous state of our test pod (we already resized it):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Delete&lt;/strong&gt; the pod and &lt;strong&gt;recreate&lt;/strong&gt; it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;k delete pod test-pod &lt;span class="nt"&gt;-ntest&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; k apply &lt;span class="nt"&gt;-f&lt;/span&gt; test-pod.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;pod/test-pod created&lt;/code&gt;&lt;br&gt;
and&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;k get pods &lt;span class="nt"&gt;-ntest&lt;/span&gt; &lt;span class="nt"&gt;-oyaml&lt;/span&gt; |grep &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s2"&gt;"limits|requests"&lt;/span&gt;  &lt;span class="nt"&gt;-A4&lt;/span&gt;     
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected &lt;strong&gt;output&lt;/strong&gt; is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;        &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"apiVersion"&lt;/span&gt;:&lt;span class="s2"&gt;"v1"&lt;/span&gt;,&lt;span class="s2"&gt;"kind"&lt;/span&gt;:&lt;span class="s2"&gt;"Pod"&lt;/span&gt;,&lt;span class="s2"&gt;"metadata"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"annotations"&lt;/span&gt;:&lt;span class="o"&gt;{}&lt;/span&gt;,&lt;span class="s2"&gt;"name"&lt;/span&gt;:&lt;span class="s2"&gt;"test-pod"&lt;/span&gt;,&lt;span class="s2"&gt;"namespace"&lt;/span&gt;:&lt;span class="s2"&gt;"test"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;,&lt;span class="s2"&gt;"spec"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"containers"&lt;/span&gt;:[&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"image"&lt;/span&gt;:&lt;span class="s2"&gt;"nginx"&lt;/span&gt;,&lt;span class="s2"&gt;"name"&lt;/span&gt;:&lt;span class="s2"&gt;"nginx"&lt;/span&gt;,&lt;span class="s2"&gt;"resources"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"limits"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"cpu"&lt;/span&gt;:&lt;span class="s2"&gt;"500m"&lt;/span&gt;,&lt;span class="s2"&gt;"memory"&lt;/span&gt;:&lt;span class="s2"&gt;"256Mi"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;,&lt;span class="s2"&gt;"requests"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"cpu"&lt;/span&gt;:&lt;span class="s2"&gt;"200m"&lt;/span&gt;,&lt;span class="s2"&gt;"memory"&lt;/span&gt;:&lt;span class="s2"&gt;"128Mi"&lt;/span&gt;&lt;span class="o"&gt;}}}]}}&lt;/span&gt;
    creationTimestamp: &lt;span class="s2"&gt;"2025-05-28T11:54:23Z"&lt;/span&gt;
    generation: 1
    name: test-pod
    namespace: &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;span class="nt"&gt;--&lt;/span&gt;
        limits:
          cpu: 500m
          memory: 256Mi
        requests:
          cpu: 200m
          memory: 128Mi
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
&lt;span class="nt"&gt;--&lt;/span&gt;
        limits:
          cpu: 500m
          memory: 256Mi
        requests:
          cpu: 200m
          memory: 128Mi
      restartCount: 0
      started: &lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;strong&gt;Next&lt;/strong&gt;, &lt;strong&gt;create&lt;/strong&gt; the necessary Kubernetes objects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create and save&lt;/strong&gt; these manifest files and &lt;strong&gt;apply&lt;/strong&gt; them:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Name&lt;/strong&gt;: &lt;em&gt;pod-scaler-sa.yaml&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1  
kind: ServiceAccount  
metadata:  
  name: pod-scaler  
  namespace: test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and&lt;br&gt;
&lt;strong&gt;Name&lt;/strong&gt;: &lt;em&gt;pod-scaler-role.yaml&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-scaler-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "patch"]
- apiGroups: [""]
  resources: ["pods/resize"]
  verbs: ["get", "patch"]
  resourceNames: ["test-pod"]
- apiGroups: ["metrics.k8s.io"]
  resources: ["pods", "podmetrics"]
  verbs: ["get", "list"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and&lt;br&gt;
&lt;strong&gt;Name&lt;/strong&gt;: &lt;em&gt;pod-scaler-binding.yaml&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: pod-scaler-binding
subjects:
- kind: ServiceAccount
  name: pod-scaler
  namespace: test
roleRef:
  kind: ClusterRole
  name: pod-scaler-role
  apiGroup: rbac.authorization.k8s.io
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Automating Scaling with a Monitor Script
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Next&lt;/strong&gt;, we will &lt;strong&gt;verify the RBAC&lt;/strong&gt; setup from inside a pod, to simulate the environment where &lt;em&gt;monitor.sh&lt;/em&gt; will run in the CronJob.&lt;br&gt;&lt;br&gt;
This &lt;strong&gt;ensures&lt;/strong&gt; that the &lt;em&gt;ServiceAccount token&lt;/em&gt; is correctly mounted and &lt;strong&gt;can access&lt;/strong&gt; the Kubernetes &lt;strong&gt;API&lt;/strong&gt; (both metrics.k8s.io for metrics and pods/resize for resizing) from within a pod, mimicking the CronJob’s runtime behaviour.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is critical to confirm the automation will work&lt;/strong&gt; end-to-end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First&lt;/strong&gt;, &lt;strong&gt;create&lt;/strong&gt; a new pod, named “&lt;em&gt;test-access&lt;/em&gt;”:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Name&lt;/strong&gt;: “&lt;em&gt;test-access-pod.yaml&lt;/em&gt;”&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: test-access
  namespace: test
spec:
  serviceAccountName: pod-scaler
  containers:
  - name: test
    image: bitnami/kubectl:1.33
    command: ["sleep", "infinity"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Apply&lt;/strong&gt; it and &lt;strong&gt;make sure&lt;/strong&gt; that both pods are up and running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;k apply -f test-access-pod.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;k get pods -ntest  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;NAME          READY   STATUS    RESTARTS   AGE  &lt;br&gt;
test-access   1/1     Running   0          8s  &lt;br&gt;
test-pod      1/1     Running   0          66m&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validate&lt;/strong&gt; it:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffd8k4a8dg9srwbwvbepi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffd8k4a8dg9srwbwvbepi.png" alt="Validate RBAC permission and patch the Pod" width="800" height="197"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Availability of metrics
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Jump&lt;/strong&gt; into the pod and &lt;strong&gt;run&lt;/strong&gt; these commands to &lt;strong&gt;make sure&lt;/strong&gt; you get metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl exec -it test-access -n test -- bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and within the pod run the commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apt-get update &amp;amp;&amp;amp; apt-get install -y curl
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -sSk -H "Authorization: Bearer $TOKEN" https://kubernetes.default.svc/apis/metrics.k8s.io/v1beta1/namespaces/test/pods/test-pod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected &lt;strong&gt;output&lt;/strong&gt; is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Get:1 http://deb.debian.org/debian bookworm InRelease [151 kB]
Get:2 http://deb.debian.org/debian bookworm-updates InRelease [55.4 kB]
Get:3 http://deb.debian.org/debian-security bookworm-security InRelease [48.0 kB]
Get:4 http://deb.debian.org/debian bookworm/main amd64 Packages [8793 kB]
Get:5 http://deb.debian.org/debian bookworm-updates/main amd64 Packages [512 B]
Get:6 http://deb.debian.org/debian-security bookworm-security/main amd64 Packages [261 kB]
Fetched 9309 kB in 3s (3079 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
curl is already the newest version (7.88.1-10+deb12u12).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
{
  "kind": "PodMetrics",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {
    "name": "test-pod",
    "namespace": "test",
    "creationTimestamp": "2025-05-28T13:01:02Z"
  },
  "timestamp": "2025-05-28T12:59:54Z",
  "window": "1m4.584s",
  "containers": [
    {
      "name": "nginx",
      "usage": {
        "cpu": "0",
        "memory": "3004Ki"
      }
    }
  ]
}root@test-access:/#
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9rvttmi9ypd58rju0wo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9rvttmi9ypd58rju0wo.png" alt="Get metrics using TOKEN variable within the Pod" width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Test Pod resizing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Jump&lt;/strong&gt; into the &lt;strong&gt;2nd&lt;/strong&gt; pod again, the one &lt;strong&gt;named&lt;/strong&gt; “&lt;em&gt;test-access&lt;/em&gt;”, and &lt;strong&gt;run&lt;/strong&gt; these commands &lt;strong&gt;inside&lt;/strong&gt; it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
kubectl --token=$TOKEN patch pod test-pod -n test --subresource resize --patch '{"spec": {"containers": [{"name": "nginx", "resources": {"requests": {"cpu": "400m", "memory": "384Mi"}, "limits": {"cpu": "800m", "memory": "768Mi"}}}]}}'
kubectl --token=$TOKEN get pod test-pod -n test -o jsonpath='{.spec.containers[0].resources}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Like this:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgj3bs2bhvjnyn389tijd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgj3bs2bhvjnyn389tijd.png" alt="Calling API endpoint from the Pod" width="800" height="387"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl exec -it test-access -n test -- bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and within the pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
kubectl --token=$TOKEN patch pod test-pod -n test --subresource resize --patch '{"spec": {"containers": [{"name": "nginx", "resources": {"requests": {"cpu": "400m", "memory": "384Mi"}, "limits": {"cpu": "800m", "memory": "768Mi"}}}]}}'
kubectl --token=$TOKEN get pod test-pod -n test -o jsonpath='{.spec.containers[0].resources}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected &lt;strong&gt;output&lt;/strong&gt; is:&lt;br&gt;
&lt;code&gt;pod/test-pod patched&lt;br&gt;
{"limits":{"cpu":"800m","memory":"768Mi"},"requests":{"cpu":"400m","memory":"384Mi"}}&lt;br&gt;
I have no name!@test-access:/$&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;and&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -sSk -H "Authorization: Bearer $TOKEN" https://kubernetes.default.svc/apis/metrics.k8s.io/v1beta1/namespaces/test/pods/test-pod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected &lt;strong&gt;output&lt;/strong&gt; is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "kind": "PodMetrics",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {
    "name": "test-pod",
    "namespace": "test",
    "creationTimestamp": "2025-05-28T13:24:42Z"
  },
  "timestamp": "2025-05-28T13:24:01Z",
  "window": "1m20.148s",
  "containers": [
    {
      "name": "nginx",
      "usage": {
        "cpu": "0",
        "memory": "3004Ki"
      }
    }
  ]
}I have no name!@test-access:/$ exit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Check&lt;/strong&gt; the pods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;k get pods -ntest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;NAME          READY   STATUS    RESTARTS   AGE&lt;br&gt;
test-access   1/1     Running   0          9m59s&lt;br&gt;
test-pod      1/1     Running   0          91m&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You can &lt;strong&gt;see&lt;/strong&gt; that the affected pod’s configuration has been &lt;strong&gt;patched&lt;/strong&gt; successfully and it &lt;strong&gt;wasn’t restarted&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check&lt;/strong&gt; it again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;k get pods test-pod &lt;span class="nt"&gt;-ntest&lt;/span&gt; &lt;span class="nt"&gt;-oyaml&lt;/span&gt; |grep &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s2"&gt;"limits|requests"&lt;/span&gt;  &lt;span class="nt"&gt;-A4&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected &lt;strong&gt;output&lt;/strong&gt; is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
  {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"test-pod","namespace":"test"},"spec":{"containers":[{"image":"nginx","name":"nginx","resources":{"limits":{"cpu":"500m","memory":"256Mi"},"requests":{"cpu":"200m","memory":"128Mi"}}}]}}
  creationTimestamp: "2025-05-28T13:55:30Z"
  generation: 2
  name: test-pod
  namespace: test
--
      limits:
        cpu: 800m
        memory: 768Mi
      requests:
        cpu: 400m
        memory: 384Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
--
      limits:
        cpu: 800m
        memory: 768Mi
      requests:
        cpu: 400m
        memory: 384Mi
    restartCount: 0
    started: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt; we move on, we need to “&lt;strong&gt;restore&lt;/strong&gt;” the &lt;strong&gt;original values&lt;/strong&gt; of the affected pod. &lt;strong&gt;Delete&lt;/strong&gt; and &lt;strong&gt;recreate&lt;/strong&gt; it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;k delete &lt;span class="nt"&gt;-f&lt;/span&gt; test-pod.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;pod "test-pod" deleted&lt;/code&gt;&lt;br&gt;
and&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;k apply -f test-pod.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;pod/test-pod created&lt;/code&gt;&lt;br&gt;
and&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;k get pods -ntest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;NAME          READY   STATUS    RESTARTS   AGE&lt;br&gt;
test-access   1/1     Running   0          48m&lt;br&gt;
test-pod      1/1     Running   0          8m41s&lt;/code&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Create a monitor.sh script file
&lt;/h3&gt;

&lt;p&gt;We need a Linux shell &lt;strong&gt;script&lt;/strong&gt;, so let’s create it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: &lt;strong&gt;Create&lt;/strong&gt; a &lt;em&gt;monitor.sh&lt;/em&gt; script to check test-pod’s &lt;strong&gt;resource usage&lt;/strong&gt; via &lt;em&gt;metrics.k8s.io&lt;/em&gt; and &lt;strong&gt;resize&lt;/strong&gt; it if usage &lt;strong&gt;exceeds thresholds&lt;/strong&gt; (e.g., CPU &amp;gt; 80% of request). We will use this script via a &lt;strong&gt;CronJob resource,&lt;/strong&gt; with the pod-scaler &lt;em&gt;ServiceAccount&lt;/em&gt;.&lt;/p&gt;
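&lt;p&gt;Before writing the script, it helps to be clear about the units: &lt;em&gt;metrics.k8s.io&lt;/em&gt; reports CPU usage in nanocores, while requests are usually expressed in millicores (1 millicore = 1,000,000 nanocores). A minimal sketch of the threshold arithmetic the script relies on; the request and usage values below are illustrative:&lt;/p&gt;

```shell
#!/bin/bash
# Threshold arithmetic used by the monitoring logic (illustrative values).
request_millicores=400   # the pod's CPU request: 400m
threshold_percent=80     # resize when usage exceeds 80% of the request

# 1 millicore = 1,000,000 nanocores, so 80% of 400m = 320,000,000n
threshold_nanocores=$(( request_millicores * 1000000 * threshold_percent / 100 ))
echo "Threshold: $threshold_nanocores nanocores"

# Example usage value as returned by the metrics API (e.g. "350000000n")
usage_raw="350000000n"
usage_nanocores=${usage_raw%n}   # strip the trailing "n" suffix

if [ "$usage_nanocores" -gt "$threshold_nanocores" ]; then
  echo "Usage exceeds threshold - a resize would be triggered"
fi
```

&lt;p&gt;This is exactly why the script below sets &lt;em&gt;CPU_THRESHOLD=320000000&lt;/em&gt; and strips the trailing “n” from the parsed usage before comparing.&lt;/p&gt;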

&lt;p&gt;&lt;strong&gt;Name&lt;/strong&gt;: &lt;em&gt;“monitor.sh&lt;/em&gt;”&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Content&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;

&lt;span class="c"&gt;# Install curl, jq, and kubectl&lt;/span&gt;
apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; curl jq wget
wget &lt;span class="nt"&gt;-q&lt;/span&gt; https://dl.k8s.io/release/v1.33.0/bin/linux/amd64/kubectl
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x kubectl
&lt;span class="nb"&gt;mv &lt;/span&gt;kubectl /usr/local/bin/

&lt;span class="c"&gt;# Configuration&lt;/span&gt;
&lt;span class="nv"&gt;POD_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"test-pod"&lt;/span&gt;
&lt;span class="nv"&gt;NAMESPACE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"test"&lt;/span&gt;
&lt;span class="nv"&gt;CPU_THRESHOLD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;320000000 &lt;span class="c"&gt;# 80% of 400m (in nanocores)&lt;/span&gt;
&lt;span class="nv"&gt;NEW_REQUESTS_CPU&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"600m"&lt;/span&gt;
&lt;span class="nv"&gt;NEW_REQUESTS_MEMORY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"512Mi"&lt;/span&gt;
&lt;span class="nv"&gt;NEW_LIMITS_CPU&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"1200m"&lt;/span&gt;
&lt;span class="nv"&gt;NEW_LIMITS_MEMORY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"1024Mi"&lt;/span&gt;

&lt;span class="c"&gt;# Get metrics&lt;/span&gt;
&lt;span class="nv"&gt;TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /var/run/secrets/kubernetes.io/serviceaccount/token&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;METRICS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-sSk&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; https://kubernetes.default.svc/apis/metrics.k8s.io/v1beta1/namespaces/&lt;span class="nv"&gt;$NAMESPACE&lt;/span&gt;/pods/&lt;span class="nv"&gt;$POD_NAME&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Parse CPU usage (in nanocores)&lt;/span&gt;
&lt;span class="nv"&gt;CPU_USAGE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$METRICS&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.containers[0].usage.cpu'&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s/n$//'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CPU_USAGE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Error: Could not retrieve CPU usage"&lt;/span&gt;
  &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi

&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"CPU Usage: &lt;/span&gt;&lt;span class="nv"&gt;$CPU_USAGE&lt;/span&gt;&lt;span class="s2"&gt; nanocores, Threshold: &lt;/span&gt;&lt;span class="nv"&gt;$CPU_THRESHOLD&lt;/span&gt;&lt;span class="s2"&gt; nanocores"&lt;/span&gt;

&lt;span class="c"&gt;# Check if CPU usage exceeds threshold&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CPU_USAGE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-gt&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CPU_THRESHOLD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"CPU usage exceeds threshold, resizing pod..."&lt;/span&gt;
  &lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; &amp;gt; patch.json
{
  "spec": {
    "containers": [
      {
        "name": "nginx",
        "resources": {
          "requests": {
            "cpu": "&lt;/span&gt;&lt;span class="nv"&gt;$NEW_REQUESTS_CPU&lt;/span&gt;&lt;span class="sh"&gt;",
            "memory": "&lt;/span&gt;&lt;span class="nv"&gt;$NEW_REQUESTS_MEMORY&lt;/span&gt;&lt;span class="sh"&gt;"
          },
          "limits": {
            "cpu": "&lt;/span&gt;&lt;span class="nv"&gt;$NEW_LIMITS_CPU&lt;/span&gt;&lt;span class="sh"&gt;",
            "memory": "&lt;/span&gt;&lt;span class="nv"&gt;$NEW_LIMITS_MEMORY&lt;/span&gt;&lt;span class="sh"&gt;"
          }
        }
      }
    ]
  }
}
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;  kubectl &lt;span class="nt"&gt;--token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$TOKEN&lt;/span&gt; patch pod &lt;span class="nv"&gt;$POD_NAME&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nv"&gt;$NAMESPACE&lt;/span&gt; &lt;span class="nt"&gt;--subresource&lt;/span&gt; resize &lt;span class="nt"&gt;--patch-file&lt;/span&gt; patch.json
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;$?&lt;/span&gt; &lt;span class="nt"&gt;-eq&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Pod resized successfully"&lt;/span&gt;
  &lt;span class="k"&gt;else
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Error resizing pod"&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
  &lt;span class="k"&gt;fi
else
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"CPU usage below threshold, no action needed."&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: focus on the specified &lt;strong&gt;variables&lt;/strong&gt; and &lt;strong&gt;modify&lt;/strong&gt; them if necessary.&lt;/p&gt;
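&lt;p&gt;The script compares CPU usage in &lt;strong&gt;nanocores&lt;/strong&gt;, while Kubernetes manifests usually express CPU in millicores. If you tune &lt;code&gt;CPU_THRESHOLD&lt;/code&gt;, a minimal conversion sketch may help (the helper name is illustrative, not part of the original script):&lt;/p&gt;

```shell
# Hedged sketch: convert a Kubernetes millicore string (e.g. "320m") into
# nanocores, the unit the metrics API returns. The helper name is illustrative.
millicores_to_nanocores() {
  local value="${1%m}"          # strip the trailing "m" suffix
  echo $(( value * 1000000 ))   # 1 millicore = 1,000,000 nanocores
}

CPU_THRESHOLD=$(millicores_to_nanocores "320m")
echo "$CPU_THRESHOLD"   # prints 320000000
```

&lt;p&gt;A &lt;em&gt;320m&lt;/em&gt; threshold thus becomes &lt;em&gt;320000000&lt;/em&gt; nanocores, the format the script expects.&lt;/p&gt;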

&lt;p&gt;Make it &lt;strong&gt;executable&lt;/strong&gt; with this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;chmod +x monitor.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;strong&gt;Create&lt;/strong&gt; a &lt;strong&gt;configmap&lt;/strong&gt; and &lt;strong&gt;store&lt;/strong&gt; the &lt;em&gt;monitor.sh&lt;/em&gt; file within it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;k create configmap monitor-script --from-file=monitor.s -n test  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;configmap/monitor-script created&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;strong&gt;next&lt;/strong&gt; step, &lt;strong&gt;create&lt;/strong&gt; a &lt;strong&gt;CronJob&lt;/strong&gt; manifest that runs the script on a schedule.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Name&lt;/strong&gt;: “&lt;em&gt;pod-scaler-cronjob.yaml&lt;/em&gt;”&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Content&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: batch/v1
kind: CronJob
metadata:
  name: pod-scaler-cronjob
  namespace: test
spec:
  schedule: "* * * * *" # Every minute
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-scaler
          containers:
          - name: scaler
            image: debian:bookworm-slim
            command: ["/bin/bash", "/scripts/monitor.sh"]
            volumeMounts:
            - name: script
              mountPath: /scripts
          volumes:
          - name: script
            configMap:
              name: monitor-script
          restartPolicy: OnFailure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and apply it&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;k apply &lt;span class="nt"&gt;-f&lt;/span&gt; pod-scaler-cronjob.yaml 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;cronjob.batch/pod-scaler-cronjob created&lt;/code&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Use Case: Dynamic Pod Resizing in Action
&lt;/h2&gt;

&lt;p&gt;Testing the in-place vertical pod scaling feature in action was a key goal of my PoC. &lt;strong&gt;I used it to dynamically resize an Nginx pod based on CPU thresholds&lt;/strong&gt;, simulating real-world demand. This experiment showcased its practicality for test environments.&lt;br&gt;&lt;br&gt;
Let’s explore how you can apply this in your own setup!&lt;/p&gt;
&lt;h3&gt;
  
  
  Generate workload
&lt;/h3&gt;

&lt;p&gt;We will &lt;strong&gt;simulate&lt;/strong&gt; a workload (for 2 minutes) on the &lt;strong&gt;affected&lt;/strong&gt; pod, named “&lt;em&gt;test-pod&lt;/em&gt;”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpk90b0cbnf65cq2nvein.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpk90b0cbnf65cq2nvein.png" alt="Stress test on the test Pod" width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jump&lt;/strong&gt; into the pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; test-pod &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and &lt;strong&gt;run&lt;/strong&gt; these commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apt-get update &amp;amp;&amp;amp; apt-get install -y stress
stress --cpu 2 --timeout 120
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected &lt;strong&gt;output&lt;/strong&gt; is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hit:1 http://deb.debian.org/debian bookworm InRelease
Hit:2 http://deb.debian.org/debian bookworm-updates InRelease
Hit:3 http://deb.debian.org/debian-security bookworm-security InRelease
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
stress is already the newest version (1.0.7-1).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
stress: info: [257] dispatching hogs: 2 cpu, 0 io, 0 vm, 0 hdd
stress: info: [257] successful run completed in 120s
root@test-pod:/# exit
exit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Validation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Check&lt;/strong&gt; the pods and confirm that &lt;em&gt;test-pod&lt;/em&gt; was not restarted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;NAME                                READY   STATUS    RESTARTS   AGE&lt;br&gt;
pod-scaler-cronjob-29140986-z2h98   1/1     Running   0          12s&lt;br&gt;
test-access                         1/1     Running   0          5h50m&lt;br&gt;
test-pod                            1/1     Running   0          22m&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;and run it again a few seconds later:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods -n test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;NAME                                READY   STATUS      RESTARTS   AGE&lt;br&gt;
pod-scaler-cronjob-29140986-z2h98   0/1     Completed   0          14s&lt;br&gt;
test-access                         1/1     Running     0          5h50m&lt;br&gt;
test-pod                            1/1     Running     0          22m&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check&lt;/strong&gt; the logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;k logs pod-scaler-cronjob-29140986-z2h98  &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;...&lt;br&gt;
CPU Usage: 990630785 nanocores, Threshold: 320000000 nanocores&lt;br&gt;
CPU usage exceeds threshold, resizing pod...&lt;br&gt;
pod/test-pod patched (no change)&lt;br&gt;
Pod resized successfully&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check&lt;/strong&gt; the &lt;strong&gt;limits&lt;/strong&gt; and &lt;strong&gt;requests&lt;/strong&gt; values of affected pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl describe pod test-pod &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s2"&gt;"limits|requests"&lt;/span&gt; &lt;span class="nt"&gt;-A4&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected &lt;strong&gt;output&lt;/strong&gt; is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    Limits:
      cpu:     1200m
      memory:  1Gi
    Requests:
      cpu:        600m
      memory:     512Mi
    Environment:  &amp;lt;none&amp;gt;
    Mounts:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Great! This is what we wanted!&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitations and Production Considerations
&lt;/h2&gt;

&lt;p&gt;While my PoC was exciting, it highlighted &lt;strong&gt;some limitations&lt;/strong&gt; of in-place scaling in its beta state. It &lt;strong&gt;works well&lt;/strong&gt; for test environments but requires &lt;strong&gt;refinement&lt;/strong&gt; for &lt;strong&gt;production&lt;/strong&gt; use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I plan to share&lt;/strong&gt; these insights to help you plan ahead.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion and Next Steps
&lt;/h2&gt;

&lt;p&gt;My &lt;strong&gt;PoC&lt;/strong&gt; with Kubernetes 1.33’s &lt;strong&gt;in-place vertical pod scaling&lt;/strong&gt; opened my eyes to its &lt;strong&gt;potential&lt;/strong&gt; and &lt;strong&gt;challenges&lt;/strong&gt;. This guide walked you &lt;strong&gt;through the process&lt;/strong&gt;, from setup to automation, with real-world insights.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About the Author&lt;/strong&gt;&lt;br&gt;
I’m &lt;strong&gt;Róbert Zsótér&lt;/strong&gt;, Kubernetes &amp;amp; AWS architect.&lt;br&gt;
If you’re into &lt;strong&gt;Kubernetes, EKS, Terraform&lt;/strong&gt;, and &lt;strong&gt;cloud-native security&lt;/strong&gt;, follow my latest posts here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LinkedIn: &lt;a href="https://www.linkedin.com/in/r%C3%B3bert-zs%C3%B3t%C3%A9r-34541464/" rel="noopener noreferrer"&gt;Róbert Zsótér&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Substack: &lt;a href="https://cloudskillshu.substack.com"&gt;CSHU&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s build secure, scalable clusters together.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Originally&lt;/strong&gt; published on &lt;strong&gt;Medium&lt;/strong&gt;: &lt;a href="https://medium.com/@zs77.robert/hands-on-with-kubernetes-1-33-my-poc-on-in-place-vertical-scaling-0b731dd3e9e5" rel="noopener noreferrer"&gt;Hands-On with Kubernetes 1.33: My PoC on In-Place Vertical Scaling&lt;/a&gt;&lt;/p&gt;




</description>
      <category>tutorial</category>
      <category>devops</category>
      <category>performance</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Chaos Engineering on AWS: Using Fault Injection Simulator (FIS) for Resilience</title>
      <dc:creator>Robert Zsoter</dc:creator>
      <pubDate>Thu, 02 Oct 2025 05:07:00 +0000</pubDate>
      <link>https://dev.to/robert_r_7c237256b7614328/chaos-engineering-on-aws-using-fault-injection-simulator-fis-for-resilience-hap</link>
      <guid>https://dev.to/robert_r_7c237256b7614328/chaos-engineering-on-aws-using-fault-injection-simulator-fis-for-resilience-hap</guid>
      <description>&lt;p&gt;&lt;strong&gt;Part I&lt;/strong&gt;: Building Resilient Systems on AWS: EC2 service and Auto Scaling Group&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction to Chaos Engineering
&lt;/h2&gt;


&lt;p&gt;&lt;strong&gt;Why do we need&lt;/strong&gt; Chaos Engineering?&lt;br&gt;&lt;br&gt;
Historically, &lt;strong&gt;disaster preparedness focused on catastrophic events&lt;/strong&gt; like earthquakes or power outages, with organizations investing in disaster recovery (&lt;strong&gt;DR&lt;/strong&gt;) plans to &lt;em&gt;restore services from backup&lt;/em&gt; data centers after major disruptions.&lt;br&gt;&lt;br&gt;
While effective for large-scale outages, &lt;strong&gt;this approach fails to address the frequent, smaller-scale failures&lt;/strong&gt; prevalent in modern systems, rendering traditional DR insufficient as infrastructure evolves.&lt;/p&gt;

&lt;p&gt;The shift &lt;strong&gt;from monolithic applications to distributed, microservices-based&lt;/strong&gt;, cloud-native architectures on platforms &lt;strong&gt;like AWS and Kubernetes&lt;/strong&gt; has brought scalability and agility but also increased fragility. &lt;strong&gt;A single misbehaving microservice&lt;/strong&gt; can trigger &lt;strong&gt;cascading failures&lt;/strong&gt;, a &lt;strong&gt;misconfigured network route&lt;/strong&gt; can isolate critical components, or a &lt;strong&gt;faulty deployment&lt;/strong&gt; can exhaust resources like CPU or memory on a single Kubernetes node, &lt;strong&gt;disrupting entire applications&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Imagine proactively simulating failures&lt;/strong&gt; &lt;strong&gt;in a controlled environment&lt;/strong&gt;: injecting latency into an AWS RDS database, terminating a Kubernetes pod mid-operation, or mimicking an AWS region outage.&lt;br&gt;&lt;br&gt;
By deliberately introducing faults, engineers can &lt;strong&gt;identify weaknesses&lt;/strong&gt;, &lt;strong&gt;redesign&lt;/strong&gt; for fault tolerance, &lt;strong&gt;measure&lt;/strong&gt; blast radius, and &lt;strong&gt;validate monitoring&lt;/strong&gt;, &lt;strong&gt;alerting&lt;/strong&gt;, and &lt;strong&gt;auto-remediation&lt;/strong&gt; mechanisms.&lt;/p&gt;

&lt;p&gt;Chaos Engineering is the practice of intentionally &lt;strong&gt;injecting controlled failures&lt;/strong&gt; to study system behavior and enhance resilience. It involves &lt;strong&gt;safely introducing faults&lt;/strong&gt;, limiting their impact, observing outcomes, and iteratively &lt;strong&gt;applying architectural improvements&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
This is not reckless disruption but a disciplined, &lt;strong&gt;test-driven approach to hardening systems.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Downtime carries significant financial and operational costs&lt;/strong&gt; across industries like e-commerce, healthcare, and finance. In cloud-native environments, unpredictability is inherent.&lt;br&gt;&lt;br&gt;
While chaos cannot be eliminated, &lt;strong&gt;Chaos Engineering enables teams to prepare, test, and adapt&lt;/strong&gt; systems to be &lt;strong&gt;self-healing and robust&lt;/strong&gt;, ensuring resilience in the face of inevitable failures.&lt;/p&gt;




&lt;h2&gt;
  
  
  AWS FIS service
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Overview about AWS Fault Injection Service (FIS)
&lt;/h3&gt;

&lt;p&gt;Chaos Engineering has evolved into a critical practice for building resilient cloud systems, with various tools emerging to support controlled failure testing. The concept gained prominence with &lt;strong&gt;Netflix’s Chaos Monkey&lt;/strong&gt;, which randomly terminated instances to validate system robustness. Today, the ecosystem includes tools like &lt;strong&gt;Gremlin&lt;/strong&gt;, &lt;strong&gt;Azure Chaos Studio&lt;/strong&gt;, and, within the AWS platform, the &lt;strong&gt;AWS Fault Injection Service (FIS)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this guide, we focus on &lt;strong&gt;AWS FIS&lt;/strong&gt;, &lt;strong&gt;Amazon’s managed solution&lt;/strong&gt; for conducting Chaos Engineering experiments in AWS environments.&lt;br&gt;&lt;br&gt;
FIS enables teams to &lt;strong&gt;simulate real-world failures,&lt;/strong&gt; &lt;strong&gt;identify vulnerabilities&lt;/strong&gt;, and &lt;strong&gt;strengthen system resilience&lt;/strong&gt; before issues impact users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Fault Injection Service&lt;/strong&gt; is a &lt;strong&gt;fully managed platform&lt;/strong&gt; designed to execute controlled fault injection experiments across AWS workloads.&lt;br&gt;&lt;br&gt;
By introducing failures like instance terminations or network disruptions, &lt;strong&gt;FIS helps engineers assess how systems handle unexpected conditions&lt;/strong&gt;, &lt;strong&gt;allowing them to address weaknesses proactively and enhance reliability&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why should we use AWS FIS?
&lt;/h3&gt;

&lt;p&gt;AWS FIS empowers teams to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Simulate failures such as EC2 instance crashes, EKS pod evictions, or RDS failovers.&lt;/li&gt;
&lt;li&gt;  Test complex scenarios, including Availability Zone (AZ) outages or cross-region disruptions.&lt;/li&gt;
&lt;li&gt;  Validate recovery mechanisms like auto-scaling, failover, and monitoring alerts.&lt;/li&gt;
&lt;li&gt;  Build confidence in system resilience, reducing &lt;strong&gt;Mean Time to Recovery (MTTR)&lt;/strong&gt; and ensuring robust performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How FIS Works: Architecture Overview
&lt;/h3&gt;

&lt;p&gt;FIS integrates seamlessly with AWS services to provide comprehensive observability and control:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Amazon CloudWatch&lt;/strong&gt;: Monitors metrics and triggers rollbacks if experiment thresholds are exceeded.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;AWS X-Ray&lt;/strong&gt;: Traces the impact of failures across services for detailed analysis.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;AWS IAM&lt;/strong&gt;: Enforces granular permissions to manage who can run or configure experiments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;FIS offers &lt;strong&gt;predefined failure scenarios&lt;/strong&gt; and &lt;strong&gt;experiment templates&lt;/strong&gt; for services like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;EC2&lt;/strong&gt;: Instance termination, CPU or memory stress.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;EKS&lt;/strong&gt;: Pod or node disruptions for Kubernetes workloads.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;RDS&lt;/strong&gt;: Database reboot or failover simulations.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;S3&lt;/strong&gt;: Network latency or access disruptions.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Multi-AZ/Region&lt;/strong&gt;: Simulating large-scale outages.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Managing FIS
&lt;/h3&gt;

&lt;p&gt;Experiments (an &lt;strong&gt;experiment&lt;/strong&gt; is a controlled test that uses FIS to simulate a real-world failure) can be configured and executed via:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;AWS Management Console&lt;/strong&gt; for interactive setup.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;AWS CLI&lt;/strong&gt; for scripted workflows.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;AWS CloudFormation&lt;/strong&gt; or &lt;strong&gt;AWS SDKs&lt;/strong&gt; for infrastructure-as-code integration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This flexibility allows FIS to fit into both manual testing processes and &lt;strong&gt;automated CI/CD pipelines&lt;/strong&gt;, embedding Chaos Engineering into the Software Development Lifecycle (&lt;strong&gt;SDLC&lt;/strong&gt;).&lt;/p&gt;
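&lt;p&gt;Whichever interface you choose, an experiment is defined by a &lt;strong&gt;template&lt;/strong&gt;. As a hedged sketch (the account ID and role name in &lt;code&gt;roleArn&lt;/code&gt; are placeholders), an EC2-termination template passed to &lt;code&gt;aws fis create-experiment-template --cli-input-json&lt;/code&gt; could look roughly like this:&lt;/p&gt;

```json
{
  "description": "Terminate one EC2 instance tagged fis=true",
  "roleArn": "arn:aws:iam::123456789012:role/FIS-EC2",
  "targets": {
    "tagged-instances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "fis": "true" },
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "terminate-instance": {
      "actionId": "aws:ec2:terminate-instances",
      "targets": { "Instances": "tagged-instances" }
    }
  },
  "stopConditions": [
    { "source": "none" }
  ]
}
```

&lt;p&gt;In production you would typically replace the &lt;code&gt;"source": "none"&lt;/code&gt; stop condition with a CloudWatch alarm, so the experiment rolls back automatically if a threshold is breached.&lt;/p&gt;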




&lt;h2&gt;
  
  
  Pre-Requirements: IAM Roles for AWS FIS
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt; you can run experiments with AWS Fault Injection Service (FIS), you &lt;strong&gt;must configure two essential IAM roles&lt;/strong&gt;, each serving a distinct purpose in the security model.&lt;/p&gt;

&lt;h3&gt;
  
  
  About the roles
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. User Role: Who Can Control the FIS Service&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;This is the&lt;/strong&gt; IAM role or IAM identity (user/group/role) used to log in to the AWS Console or interact with the AWS CLI. This role defines &lt;em&gt;who can view, create, modify, or start FIS experiments&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Because FIS &lt;em&gt;experiments&lt;/em&gt; may impact availability or cause downtime, &lt;strong&gt;it is critical to strictly control who can access these capabilities.&lt;/strong&gt; You should assign FIS-related permissions only to trusted users with chaos engineering or platform engineering responsibilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. FIS Service Role: What FIS Can Do&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;This is the IAM role&lt;/strong&gt; assumed by AWS FIS itself when running an &lt;em&gt;experiment&lt;/em&gt;. It governs &lt;strong&gt;what actions the FIS engine is allowed to perform on AWS resources&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For example&lt;/strong&gt;, it defines whether FIS can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Terminate an EC2 instance&lt;/li&gt;
&lt;li&gt;  Reboot or failover an RDS database&lt;/li&gt;
&lt;li&gt;  Inject faults into EKS or simulate an Availability Zone (AZ) outage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This role must include the exact &lt;strong&gt;permissions required to perform those actions&lt;/strong&gt;, and must also &lt;strong&gt;trust&lt;/strong&gt; the FIS service to &lt;strong&gt;assume the role&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
This is called the “&lt;em&gt;FIS Service Role&lt;/em&gt;”.&lt;/p&gt;
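&lt;p&gt;For tighter control than a broad managed policy, the service role’s permissions can be scoped so FIS may only terminate instances carrying the experiment tag. A hedged sketch (the account ID is a placeholder; &lt;code&gt;ec2:DescribeInstances&lt;/code&gt; does not support resource-level restrictions, hence the wildcard):&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TerminateOnlyTaggedInstances",
      "Effect": "Allow",
      "Action": "ec2:TerminateInstances",
      "Resource": "arn:aws:ec2:*:123456789012:instance/*",
      "Condition": {
        "StringEquals": { "ec2:ResourceTag/fis": "true" }
      }
    },
    {
      "Sid": "ReadInstances",
      "Effect": "Allow",
      "Action": "ec2:DescribeInstances",
      "Resource": "*"
    }
  ]
}
```

&lt;p&gt;This way, even if an experiment is misconfigured, FIS cannot terminate instances outside the tagged blast radius.&lt;/p&gt;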

&lt;h3&gt;
  
  
  Creating the AWS FIS Service Role
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scope&lt;/strong&gt;: IAM Role for FIS experiments&lt;/p&gt;

&lt;p&gt;In this walkthrough, we’ll create the &lt;strong&gt;AWS FIS service role,&lt;/strong&gt; an &lt;strong&gt;IAM role that the Fault Injection Simulator (FIS)&lt;/strong&gt; will &lt;strong&gt;assume&lt;/strong&gt; when performing its &lt;strong&gt;experiments&lt;/strong&gt; (such as EC2 termination).&lt;/p&gt;

&lt;p&gt;This role controls what &lt;strong&gt;actions FIS can take&lt;/strong&gt; on AWS resources during a fault injection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Navigate to IAM in AWS Console&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Go to the &lt;strong&gt;IAM (Identity and Access Management)&lt;/strong&gt; section of the AWS Console.&lt;/li&gt;
&lt;li&gt;  Click on &lt;strong&gt;“Roles”&lt;/strong&gt; from the left menu.&lt;/li&gt;
&lt;li&gt;  Choose &lt;strong&gt;“Create role”&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Set Trusted Entity Type&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Under &lt;strong&gt;Trusted entity type&lt;/strong&gt;, select:
➤ &lt;strong&gt;“AWS service”&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  This is because the role will be assumed by an &lt;strong&gt;AWS-managed service&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  “&lt;strong&gt;Use Case&lt;/strong&gt;”: In the list of services, choose:
➤ &lt;strong&gt;“FIS: Fault Injection Simulator”&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This tells IAM that only AWS FIS will be allowed to assume this role when performing &lt;strong&gt;experiments&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Select Use Case for experiment&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;AWS offers predefined permission sets depending on what type of experiment&lt;/strong&gt; you’re planning to run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  For EC2 instance termination: choose the &lt;strong&gt;EC2 use case&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  For network-level disruptions: choose &lt;strong&gt;VPC/network-related permissions&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  For RDS failover testing: choose &lt;strong&gt;RDS&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  For ECS/EKS faults: select the corresponding container service&lt;/li&gt;
&lt;li&gt;  etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this guide, we’ll focus on &lt;strong&gt;terminating EC2 instances&lt;/strong&gt;, so select the &lt;strong&gt;EC2&lt;/strong&gt; use case: “&lt;em&gt;AWSFaultInjectionSimulatorEC2Access&lt;/em&gt;”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Review Permissions and Trust Policy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After &lt;strong&gt;selecting&lt;/strong&gt; the EC2 scenario:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  IAM automatically attaches a predefined &lt;strong&gt;AWS managed policy&lt;/strong&gt; that grants permissions for EC2-related fault injection actions.&lt;/li&gt;
&lt;li&gt;  Review the &lt;strong&gt;Trust policy&lt;/strong&gt;, which should contain:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{  
  "Version": "2012-10-17",  
  "Statement": \[  
    {  
      "Effect": "Allow",  
      "Principal": {  
        "Service": "fis.amazonaws.com"  
      },  
      "Action": "sts:AssumeRole"  
    }  
  \]  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This policy allows &lt;strong&gt;FIS to assume the role&lt;/strong&gt; during the experiment execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Name the Role&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Provide a meaningful name like:
&lt;code&gt;FIS-EC2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  Click &lt;strong&gt;“Create role”&lt;/strong&gt; to finalize the creation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;6. Add CloudWatch Logging Permissions&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;To allow FIS to write logs&lt;/strong&gt;, metrics, and diagnostic output, attach an &lt;strong&gt;additional permission&lt;/strong&gt; policy for &lt;strong&gt;CloudWatch&lt;/strong&gt; access:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Go&lt;/strong&gt; to the newly created role&lt;/li&gt;
&lt;li&gt;  Click &lt;strong&gt;“Add permissions” → “Attach policies”&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  Search for and attach:
➤ &lt;code&gt;CloudWatchLogsFullAccess&lt;/code&gt; &lt;em&gt;(or define a scoped custom policy if preferred)&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows the FIS experiments to send logging information to &lt;strong&gt;Amazon CloudWatch Logs&lt;/strong&gt;, enabling you to &lt;strong&gt;monitor experiment results and rollbacks&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Understanding the Components&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In this section, we will analyze the key components of the &lt;strong&gt;first fault injection experiment&lt;/strong&gt; that we’ll execute using AWS FIS.&lt;/p&gt;

&lt;p&gt;This experiment is designed to simulate &lt;strong&gt;EC2 instance termination&lt;/strong&gt; within an AWS environment managed by an &lt;strong&gt;Auto Scaling Group (ASG)&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The “Given”, Our Known Architecture
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;first component&lt;/strong&gt; of any FIS experiment is the &lt;strong&gt;“given”,&lt;/strong&gt; a &lt;strong&gt;clear understanding of the system’s current behavior and architecture&lt;/strong&gt; under normal (steady-state) conditions.&lt;/p&gt;

&lt;p&gt;In this case, we know the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The application is hosted on &lt;strong&gt;EC2 instances&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  These instances are deployed &lt;strong&gt;across multiple Availability Zones (AZs)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  An &lt;strong&gt;Auto Scaling Group (ASG)&lt;/strong&gt; is configured to manage these instances&lt;/li&gt;
&lt;li&gt;  The ASG is expected to maintain &lt;strong&gt;a defined minimum capacity&lt;/strong&gt; and automatically &lt;strong&gt;replace any unhealthy or terminated instance&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This &lt;strong&gt;architectural context is essential.&lt;/strong&gt; It establishes what we expect to happen &lt;strong&gt;before any fault is introduced&lt;/strong&gt;, and serves as the baseline.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The Hypothesis, What We Expect to Happen
&lt;/h3&gt;

&lt;p&gt;The second component is the &lt;strong&gt;hypothesis,&lt;/strong&gt; a prediction of &lt;strong&gt;how the application should behave&lt;/strong&gt; when a specific &lt;strong&gt;failure occurs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For this experiment, our hypothesis is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“If an EC2 instance is terminated, the Auto Scaling Group will detect the loss and automatically provision a new instance. As a result, the application will continue running without any disruption.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This hypothesis is based on the expected behavior of ASGs in AWS, which are designed to maintain the desired capacity at all times.&lt;/p&gt;

&lt;h3&gt;
  
  
  Objective of the Experiment
&lt;/h3&gt;

&lt;p&gt;By executing this &lt;strong&gt;AWS FIS experiment&lt;/strong&gt;, we aim to test whether this hypothesis holds true under real conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We want to observe&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Whether the ASG &lt;strong&gt;replaces the instance fast enough&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  Whether there is &lt;strong&gt;any downtime&lt;/strong&gt; during the replacement&lt;/li&gt;
&lt;li&gt;  How other application components behave during this replacement window&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This test will give us a &lt;strong&gt;clear understanding of the actual resilience&lt;/strong&gt; of the EC2 + ASG setup.&lt;/p&gt;




&lt;h2&gt;
  
  
  Preparing the AWS Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Affected services&lt;/strong&gt;: EC2 + ASG&lt;/p&gt;

&lt;p&gt;Before we initiate our first &lt;strong&gt;AWS Fault Injection Service (FIS)&lt;/strong&gt; experiment, let’s define and build a &lt;strong&gt;simple and controlled architecture&lt;/strong&gt; in your AWS account to support the experiment scenario.&lt;/p&gt;

&lt;p&gt;To keep things &lt;strong&gt;straightforward and beginner-friendly&lt;/strong&gt;, we will use the &lt;strong&gt;AWS Management Console&lt;/strong&gt; for this setup.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture Components Overview
&lt;/h3&gt;

&lt;p&gt;We will create &lt;strong&gt;three essential components&lt;/strong&gt; in this setup:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. EC2 Launch Template&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;We begin by creating an EC2 Launch Template&lt;/strong&gt;, which defines the configuration for EC2 instances launched by the Auto Scaling Group.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This includes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  AMI ID (for example: &lt;em&gt;Amazon Linux 2023 AMI, under the Free Tier&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;  Instance type (any type works; &lt;em&gt;t3.micro&lt;/em&gt; is fine)&lt;/li&gt;
&lt;li&gt;  Security groups (no changes are needed here)&lt;/li&gt;
&lt;li&gt;  Subnets: any subnet of your VPC&lt;/li&gt;
&lt;li&gt;  Key pair (optional): not needed in this case&lt;/li&gt;
&lt;li&gt;  User data (optional)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As part of this configuration, we’ll &lt;strong&gt;add&lt;/strong&gt; a &lt;strong&gt;Resource&lt;/strong&gt; &lt;strong&gt;tag&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
➤ &lt;code&gt;Key: fis&lt;/code&gt;&lt;br&gt;&lt;br&gt;
➤ &lt;code&gt;Value: true&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  This tag can later be used in FIS filters to &lt;strong&gt;target specific instances&lt;/strong&gt; for termination.&lt;/li&gt;
&lt;li&gt;  Ensure that “&lt;em&gt;Instances&lt;/em&gt;” has been added under “&lt;em&gt;Resource types&lt;/em&gt;”&lt;/li&gt;
&lt;/ul&gt;
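&lt;p&gt;For reference, the same launch template can also be created from the AWS CLI. This is a minimal sketch; the template name &lt;code&gt;fis-lab-lt&lt;/code&gt; and the AMI ID are placeholders, not values from this lab:&lt;/p&gt;

```shell
# Sketch: launch template whose instances carry the fis=true tag.
# fis-lab-lt and ami-xxxxxxxxxxxx are placeholders - adjust for your account.
aws ec2 create-launch-template \
  --launch-template-name fis-lab-lt \
  --launch-template-data '{
    "ImageId": "ami-xxxxxxxxxxxx",
    "InstanceType": "t3.micro",
    "TagSpecifications": [
      {"ResourceType": "instance", "Tags": [{"Key": "fis", "Value": "true"}]}
    ]
  }'
```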

&lt;p&gt;&lt;strong&gt;2. Auto Scaling Group (ASG)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Next&lt;/strong&gt;, we will &lt;strong&gt;create&lt;/strong&gt; an &lt;strong&gt;Auto Scaling Group&lt;/strong&gt; that uses the launch template above. On the next page, click the “&lt;em&gt;Create an Auto Scaling group from your template&lt;/em&gt;” link and &lt;strong&gt;create the ASG&lt;/strong&gt; (or navigate to &lt;em&gt;EC2/Auto Scaling Groups&lt;/em&gt; in the AWS Console and select the created launch template from the list).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key configuration&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Select your affected &lt;strong&gt;VPC&lt;/strong&gt; (the VPC that contains the subnet selected in the launch template)&lt;/li&gt;
&lt;li&gt;  Spread instances across at least &lt;strong&gt;two Availability Zones&lt;/strong&gt; (AZs): select the same subnet plus at least one more&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Leave&lt;/strong&gt; the other options at their &lt;strong&gt;defaults&lt;/strong&gt; on this page and the next, then&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Set the values&lt;/strong&gt; like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Desired capacity&lt;/strong&gt; = &lt;code&gt;1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Minimum capacity&lt;/strong&gt; = &lt;code&gt;1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Maximum capacity&lt;/strong&gt; = &lt;code&gt;4&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Click &lt;strong&gt;Next&lt;/strong&gt; through the remaining pages, and on the “&lt;em&gt;Review&lt;/em&gt;” page &lt;strong&gt;click&lt;/strong&gt; “&lt;strong&gt;&lt;em&gt;Create Auto Scaling group&lt;/em&gt;&lt;/strong&gt;”&lt;/p&gt;

&lt;p&gt;This setup ensures that &lt;strong&gt;when an EC2 instance is terminated&lt;/strong&gt;, the ASG will automatically &lt;strong&gt;launch a new one&lt;/strong&gt; to maintain the desired capacity.&lt;/p&gt;
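&lt;p&gt;A CLI equivalent of this ASG setup could look like the following sketch; the group name &lt;code&gt;fis-lab-asg&lt;/code&gt;, template name, and subnet IDs are placeholders:&lt;/p&gt;

```shell
# Sketch: ASG from the launch template, spread across two subnets/AZs.
# Desired/min = 1, max = 4, matching the console values in this lab.
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name fis-lab-asg \
  --launch-template LaunchTemplateName=fis-lab-lt,Version='$Latest' \
  --min-size 1 --max-size 4 --desired-capacity 1 \
  --vpc-zone-identifier "subnet-aaaa1111,subnet-bbbb2222"
```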

&lt;p&gt;&lt;strong&gt;3. CloudWatch Log Group for FIS&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Lastly, we will create&lt;/strong&gt; a &lt;strong&gt;CloudWatch Log Group&lt;/strong&gt; named:&lt;br&gt;&lt;br&gt;
➤ &lt;code&gt;test-fs&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Navigate&lt;/strong&gt; to &lt;em&gt;CloudWatch&lt;/em&gt; in the AWS Console and &lt;strong&gt;create&lt;/strong&gt; a new log group under “&lt;em&gt;Log Groups&lt;/em&gt;”&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Set expiration&lt;/strong&gt; to &lt;strong&gt;1 day&lt;/strong&gt;; leave the other options at their defaults&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This log group &lt;strong&gt;will be used by AWS FIS&lt;/strong&gt; to write execution logs, &lt;strong&gt;helping you&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Track fault injection events&lt;/li&gt;
&lt;li&gt;  Monitor system responses&lt;/li&gt;
&lt;li&gt;  Audit the sequence of actions&lt;/li&gt;
&lt;/ul&gt;
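&lt;p&gt;The same log group can be created and given a one-day retention from the CLI (assuming the &lt;code&gt;test-fs&lt;/code&gt; name used in the template configuration later in this lab):&lt;/p&gt;

```shell
# Create the FIS log group and expire its events after one day.
aws logs create-log-group --log-group-name test-fs
aws logs put-retention-policy --log-group-name test-fs --retention-in-days 1
```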

&lt;p&gt;Once these three components are in place, we’ll be ready to define and run our &lt;strong&gt;FIS experiment template&lt;/strong&gt; targeting the EC2 instance within this Auto Scaling setup.&lt;/p&gt;




&lt;h2&gt;
  
  
  Creating the AWS FIS Experiment Template
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use case&lt;/strong&gt;: “controlled” EC2 Termination in ASG&lt;/p&gt;

&lt;p&gt;As part of &lt;strong&gt;this LAB&lt;/strong&gt; we will now create our &lt;strong&gt;first AWS Fault Injection Service (FIS) experiment template&lt;/strong&gt;. This experiment is designed to test the behavior of an Auto Scaling Group (ASG) when &lt;strong&gt;50% of its EC2 instances are terminated&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Navigate to FIS Experiment Templates&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Open the &lt;strong&gt;AWS Management Console&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  Go to &lt;strong&gt;AWS Fault Injection Simulator (FIS)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  Click on &lt;strong&gt;“Experiment templates”&lt;/strong&gt; in the left navigation pane&lt;/li&gt;
&lt;li&gt;  Click &lt;strong&gt;“Create an experiment template”&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. General Settings&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Experiment Template Name&lt;/strong&gt;:
➤ Example: &lt;code&gt;fis-ec2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;AWS Account&lt;/strong&gt;: Select your current AWS account&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Description&lt;/strong&gt;:
➤ &lt;code&gt;"Terminate 50% of instances in the Auto Scaling Group"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Define the Action&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;An action&lt;/strong&gt; in AWS FIS defines &lt;em&gt;what fault is injected during the experiment&lt;/em&gt;. In our case, this will be &lt;strong&gt;terminating EC2 instances&lt;/strong&gt;: &lt;strong&gt;click&lt;/strong&gt; on “&lt;em&gt;Add Action&lt;/em&gt;” under “&lt;em&gt;Actions and Targets&lt;/em&gt;”&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Action Name&lt;/strong&gt;: &lt;code&gt;TerminateEC2Instances&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Action Description&lt;/strong&gt;: &lt;code&gt;"Terminate EC2 instance(s) to test ASG recovery"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Service&lt;/strong&gt;: &lt;code&gt;EC2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Action Type&lt;/strong&gt;: select &lt;strong&gt;EC2&lt;/strong&gt; and choose &lt;code&gt;aws:ec2:terminate-instances&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Start after:&lt;/strong&gt; optional&lt;br&gt;
&lt;strong&gt;Note&lt;/strong&gt;: You may also define action sequencing here (for multi-step experiments), but we will skip this for now.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Target&lt;/strong&gt;: now, leave as default&lt;/li&gt;
&lt;li&gt;  Click on “&lt;strong&gt;Save&lt;/strong&gt;”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Target Definition&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
We now specify &lt;strong&gt;which EC2 instances&lt;/strong&gt; the action should apply to.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Click&lt;/strong&gt; on “&lt;em&gt;Instances-Target-1&lt;/em&gt;” in the generated diagram and configure which instances will be affected&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Click on the dots&lt;/strong&gt; of the “&lt;em&gt;Instances-Target-1&lt;/em&gt;” and &lt;strong&gt;edit the target&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Target Name&lt;/strong&gt;: &lt;code&gt;fis-ec2-target&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Resource Type&lt;/strong&gt;: &lt;code&gt;aws:ec2:instance&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Target Method&lt;/strong&gt;: select “&lt;em&gt;Resource tags, filters and parameters&lt;/em&gt;” and configure the tag (as you did in the launch template): under “&lt;em&gt;Resource tags&lt;/em&gt;”, add the same key and value&lt;br&gt;
&lt;strong&gt;Key&lt;/strong&gt;: &lt;code&gt;fis&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Value&lt;/strong&gt;: &lt;code&gt;true&lt;/code&gt;&lt;br&gt;
➤ This ensures only EC2 instances explicitly tagged &lt;code&gt;fis=true&lt;/code&gt; are targeted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Apply two filters&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
We will &lt;strong&gt;configure&lt;/strong&gt; this because we want to &lt;strong&gt;terminate 50% of running instances&lt;/strong&gt;, which means we need &lt;strong&gt;multiple filters&lt;/strong&gt; here.&lt;/p&gt;

&lt;p&gt;Under the “&lt;strong&gt;Resources filters&lt;/strong&gt;”, configure this filter:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Attribute&lt;/strong&gt;: &lt;code&gt;State.Name&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Value&lt;/strong&gt;: &lt;code&gt;running&lt;/code&gt;&lt;br&gt;&lt;br&gt;
➤ This ensures the experiment &lt;strong&gt;only affects&lt;/strong&gt; &lt;strong&gt;running instances&lt;/strong&gt;, skipping stopped, initializing, or terminated ones.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  and &lt;strong&gt;set the selection mode:&lt;/strong&gt; &lt;code&gt;PERCENT(50)&lt;/code&gt; → this ensures &lt;strong&gt;only 50%&lt;/strong&gt; of the matching instances will be affected.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Click&lt;/strong&gt; on “&lt;em&gt;Save&lt;/em&gt;”.&lt;/p&gt;
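&lt;p&gt;For the CLI-inclined, the whole template built so far can be expressed in one call. This is a hedged sketch: the role ARN is a placeholder, the stop condition is disabled as in this lab, and log delivery is left out for brevity:&lt;/p&gt;

```shell
# Sketch: FIS experiment template - terminate 50% of running instances
# tagged fis=true. The role ARN is a placeholder for your own account.
aws fis create-experiment-template \
  --description "Terminate 50% of instances in the Auto Scaling Group" \
  --role-arn arn:aws:iam::123456789012:role/FIS-EC2 \
  --stop-conditions '[{"source": "none"}]' \
  --targets '{
    "fis-ec2-target": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {"fis": "true"},
      "filters": [{"path": "State.Name", "values": ["running"]}],
      "selectionMode": "PERCENT(50)"
    }
  }' \
  --actions '{
    "TerminateEC2Instances": {
      "actionId": "aws:ec2:terminate-instances",
      "targets": {"Instances": "fis-ec2-target"}
    }
  }'
```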

&lt;p&gt;&lt;strong&gt;5. Assign IAM Role&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Select the IAM role&lt;/strong&gt; that AWS FIS will assume when executing the experiment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Example: &lt;code&gt;FIS-EC2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  This &lt;strong&gt;role&lt;/strong&gt; (which has been created earlier) &lt;strong&gt;must have permissions&lt;/strong&gt; to terminate EC2 instances and write to CloudWatch Logs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Leave&lt;/strong&gt; other options here on &lt;strong&gt;default&lt;/strong&gt;, Next.&lt;/p&gt;
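&lt;p&gt;As a rough illustration of “least privilege” for this role, an inline policy could be attached like this; the policy name is an assumption, and the exact log-delivery permissions may vary with your setup:&lt;/p&gt;

```shell
# Sketch: minimal inline policy for the FIS-EC2 service role -
# terminate/describe EC2 instances plus CloudWatch Logs delivery.
aws iam put-role-policy \
  --role-name FIS-EC2 \
  --policy-name fis-ec2-minimal \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {"Effect": "Allow",
       "Action": ["ec2:DescribeInstances", "ec2:TerminateInstances"],
       "Resource": "*"},
      {"Effect": "Allow",
       "Action": ["logs:CreateLogDelivery", "logs:PutLogEvents",
                  "logs:DescribeLogGroups", "logs:DescribeResourcePolicies"],
       "Resource": "*"}
    ]
  }'
```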

&lt;p&gt;&lt;strong&gt;6. Add Stop Conditions&lt;/strong&gt; (Optional but Recommended, especially in &lt;strong&gt;Production/Live environment&lt;/strong&gt;)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: &lt;br&gt;
Although we will &lt;strong&gt;not define stop conditions&lt;/strong&gt; in this lab, it’s considered &lt;strong&gt;best practice&lt;/strong&gt; in &lt;strong&gt;production&lt;/strong&gt; environments.&lt;br&gt;&lt;br&gt;
Stop conditions allow you to configure &lt;strong&gt;CloudWatch alarms&lt;/strong&gt; that will immediately &lt;strong&gt;stop the FIS experiment&lt;/strong&gt; if a critical threshold is breached (e.g., CPU drops below a threshold or latency spikes).&lt;/p&gt;

&lt;p&gt;In a “&lt;strong&gt;PROD/LIVE&lt;/strong&gt;” environment you should consider configuring it: “&lt;em&gt;AWS FIS helps you run experiments on your workloads safely. You can set a limit, known as a stop condition, to end the experiment if it reaches the threshold defined by a CloudWatch alarm. If a stop condition is reached during an experiment, you can’t resume the experiment&lt;/em&gt;.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. CloudWatch Logging&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Enable log delivery&lt;/strong&gt; to an existing CloudWatch Log Group: &lt;strong&gt;configure&lt;/strong&gt; the created log group:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Log Group Name&lt;/strong&gt;: &lt;code&gt;test-fs&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This allows you to monitor:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Action start/end times&lt;/li&gt;
&lt;li&gt;  Success/failure of the experiments&lt;/li&gt;
&lt;li&gt;  Target resource IDs and outcomes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;8. Create the Template&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Scroll to the bottom and click &lt;strong&gt;“Create experiment template”&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  AWS &lt;strong&gt;may show a warning&lt;/strong&gt; that no stop condition was defined, &lt;strong&gt;acknowledge&lt;/strong&gt; this for the demo&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Type&lt;/strong&gt; “&lt;em&gt;create&lt;/em&gt;”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The template is now ready to use.&lt;/p&gt;




&lt;h2&gt;
  
  
  Running Your First AWS FIS Experiment
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Affected components&lt;/strong&gt;: EC2 + ASG (EC2 Termination)&lt;/p&gt;

&lt;p&gt;In this section, we will &lt;strong&gt;execute the AWS FIS experiment&lt;/strong&gt; that was created earlier. &lt;strong&gt;The goal is to test&lt;/strong&gt; how the Auto Scaling Group (ASG) handles EC2 instance termination and &lt;strong&gt;validate&lt;/strong&gt; whether our hypothesis holds true.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Review the Target Instances
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt; triggering the experiment, it’s important to &lt;strong&gt;confirm which instances will be targeted&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Navigate to the &lt;strong&gt;EC2 Dashboard&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; Apply &lt;strong&gt;two filters&lt;/strong&gt;:&lt;br&gt;
&lt;strong&gt;Instance state&lt;/strong&gt;: &lt;code&gt;running&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Tag filter&lt;/strong&gt;: &lt;code&gt;Key=fis&lt;/code&gt;, &lt;code&gt;Value=true&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In our example, a &lt;strong&gt;single EC2 instance&lt;/strong&gt; meets both conditions. This would be the affected instance if the experiment proceeds.&lt;/p&gt;

&lt;p&gt;Because we configured the experiment to affect &lt;strong&gt;50% of the matching instances&lt;/strong&gt; (in the FIS template), we need to &lt;strong&gt;increase&lt;/strong&gt; the number of instances from one to &lt;strong&gt;two&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adjusting ASG Capacity&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;The current setup&lt;/strong&gt; means there’s only &lt;strong&gt;one EC2 instance&lt;/strong&gt; running. If that instance is terminated, it takes time for a replacement to launch, leading to &lt;strong&gt;potential downtime&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modify&lt;/strong&gt; the affected Auto Scaling Group:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Set &lt;strong&gt;Desired capacity&lt;/strong&gt; = &lt;code&gt;2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  Set &lt;strong&gt;Minimum capacity&lt;/strong&gt; = &lt;code&gt;2&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures that &lt;strong&gt;two EC2 instances&lt;/strong&gt; are always running, and terminating one will not interrupt service.&lt;/p&gt;

&lt;p&gt;Wait for the &lt;strong&gt;second instance&lt;/strong&gt; to launch before continuing. &lt;strong&gt;Ensure that&lt;/strong&gt; both instances are up and running.&lt;/p&gt;
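&lt;p&gt;The capacity change above can also be applied from the CLI; the group name &lt;code&gt;fis-lab-asg&lt;/code&gt; is a placeholder for whatever you named your ASG:&lt;/p&gt;

```shell
# Raise the ASG to two instances so terminating one leaves one serving.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name fis-lab-asg \
  --min-size 2 --desired-capacity 2
```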

&lt;h3&gt;
  
  
  Step 2: Start the Experiment
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; Go to &lt;strong&gt;AWS FIS&lt;/strong&gt; in the Console&lt;/li&gt;
&lt;li&gt; Select &lt;strong&gt;Experiment templates&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Locate&lt;/strong&gt; the experiment template created earlier, and have a look at the “&lt;em&gt;Targets&lt;/em&gt;” &lt;strong&gt;tab&lt;/strong&gt;.
You can see an &lt;strong&gt;EC2 instance&lt;/strong&gt; (as “&lt;em&gt;Resource&lt;/em&gt;”) under the “&lt;em&gt;Preview&lt;/em&gt;”: this shows &lt;strong&gt;which actual resource(s)&lt;/strong&gt; &lt;strong&gt;will be targeted&lt;/strong&gt; when you start this experiment.
The “&lt;em&gt;Target information&lt;/em&gt;” will be something like: “&lt;strong&gt;&lt;em&gt;arn:aws:ec2:-x:xxxxxxxxxx:instance/i-xxxxxxxxxxxx&lt;/em&gt;&lt;/strong&gt;”
&lt;strong&gt;If not&lt;/strong&gt; (if you can’t see anything), &lt;strong&gt;click&lt;/strong&gt; “&lt;em&gt;Generate preview&lt;/em&gt;” on the right and &lt;strong&gt;wait&lt;/strong&gt; a few seconds.&lt;/li&gt;
&lt;li&gt; Click &lt;strong&gt;Start experiment&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Confirm&lt;/strong&gt; the action (since it may cause disruption)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At this point, the experiment transitions to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;State&lt;/strong&gt;: &lt;code&gt;initiating&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Status&lt;/strong&gt;: &lt;code&gt;pending&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
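&lt;p&gt;If you prefer to start and watch the experiment from the CLI, a sketch looks like this; the experiment template ID is a placeholder from your own account:&lt;/p&gt;

```shell
# Start the experiment from the template and read back its status.
EXPERIMENT_ID=$(aws fis start-experiment \
  --experiment-template-id EXTxxxxxxxxxxxx \
  --query 'experiment.id' --output text)
aws fis get-experiment --id "$EXPERIMENT_ID" \
  --query 'experiment.state.status' --output text
```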

&lt;h3&gt;
  
  
  Step 3: Validation
&lt;/h3&gt;

&lt;p&gt;Now, &lt;strong&gt;navigate&lt;/strong&gt; to “&lt;em&gt;EC2 Console&lt;/em&gt;” and check the affected instance(s). The affected instance &lt;strong&gt;should be terminated&lt;/strong&gt; now.&lt;/p&gt;

&lt;p&gt;You &lt;strong&gt;should now see:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;State&lt;/strong&gt;: &lt;code&gt;Running → Shutting-down → Terminated&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  The instance count drops from &lt;strong&gt;2 to 1&lt;/strong&gt; (one is terminated)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Because &lt;strong&gt;one instance remains running&lt;/strong&gt;, the application should continue functioning &lt;strong&gt;without interruption&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  Due to the ASG configuration values, the &lt;strong&gt;ASG will start a new instance&lt;/strong&gt; (it will “replace” the terminated instance).&lt;/li&gt;
&lt;li&gt;  Under “&lt;em&gt;EC2/Auto Scaling groups/test-fs&lt;/em&gt;”, &lt;strong&gt;navigate&lt;/strong&gt; to the “&lt;em&gt;Instance management&lt;/em&gt;” and “&lt;em&gt;Activity&lt;/em&gt;” &lt;strong&gt;tabs&lt;/strong&gt; and &lt;strong&gt;check&lt;/strong&gt; the instance details; you will find the replacement events there&lt;/li&gt;
&lt;/ul&gt;
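&lt;p&gt;The same validation can be done from the CLI, which pairs nicely with the console checks above; the ASG name is again a placeholder:&lt;/p&gt;

```shell
# Confirm the replacement: list running, fis-tagged instances and the
# recent ASG scaling activity that launched the new one.
aws ec2 describe-instances \
  --filters Name=tag:fis,Values=true Name=instance-state-name,Values=running \
  --query 'Reservations[].Instances[].InstanceId'
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name fis-lab-asg --max-items 5
```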




&lt;h2&gt;
  
  
  Best Practices and Takeaways
&lt;/h2&gt;

&lt;p&gt;Implementing fault injection with AWS FIS is not just about testing for failure, &lt;strong&gt;it’s about building confidence in recovery&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Based on this covered scenario, here are the &lt;strong&gt;most important lessons and recommendations&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FIS Experiment Planning&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Define a Clear Hypothesis&lt;/strong&gt;
Every FIS experiment must start with a clearly defined &lt;em&gt;expected behavior&lt;/em&gt; based on your architecture. Avoid blind testing.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Understand Your Architecture’s Limits&lt;/strong&gt;
Knowing that ASGs maintain desired capacity isn’t enough, understand &lt;strong&gt;boot time, warm-up latency&lt;/strong&gt;, and how it affects service availability.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Start Small&lt;/strong&gt;
Begin with &lt;strong&gt;safe experiments&lt;/strong&gt; that target only a portion of your resources (e.g., 50%) and gradually expand the scope once confidence builds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Targeting and Blast Radius Control&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Use Tags and Filters for Target Isolation&lt;/strong&gt;
Always apply precise filters (like &lt;code&gt;tag=fis:true&lt;/code&gt; and &lt;code&gt;state=running&lt;/code&gt;) to limit the impact of experiments to approved resources.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Leverage Selection Modes&lt;/strong&gt;
Use &lt;code&gt;PERCENT&lt;/code&gt; or &lt;code&gt;COUNT&lt;/code&gt; to control how many instances are affected, this is crucial for minimizing unintended disruptions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Security and Access&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Separate IAM Roles&lt;/strong&gt;
Use &lt;strong&gt;two&lt;/strong&gt; distinct IAM roles:
- One for users/automation to manage/run experiments
- One for FIS to assume during execution (with minimal required permissions)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Enable Least Privilege&lt;/strong&gt;
The FIS service role should only have the permissions it needs to execute the specific fault scenario (e.g., &lt;code&gt;ec2:TerminateInstances&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Observability and Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Use CloudWatch Logs and Alarms&lt;/strong&gt;
Send all FIS logs to a dedicated log group for traceability. In production, always define &lt;strong&gt;stop conditions&lt;/strong&gt; tied to CloudWatch alarms.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Preview Before Running&lt;/strong&gt;
Use the &lt;strong&gt;&lt;em&gt;Preview Target&lt;/em&gt; feature&lt;/strong&gt; to verify that FIS has resolved the correct resources, catch misconfigurations before runtime.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Iteration and Learning&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Rerun with Adjustments&lt;/strong&gt;
Use each failed experiment as a learning opportunity. Adjust configurations (e.g., ASG size), then rerun and observe improvements.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Document Everything&lt;/strong&gt;
Maintain a knowledge base of all FIS templates, their assumptions, outcomes, and resulting architecture changes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This initial FIS experiment with &lt;strong&gt;EC2 instances managed by an Auto Scaling Group&lt;/strong&gt; demonstrates how structured fault injection reveals architectural weaknesses, and helps teams improve system resilience with confidence.&lt;/p&gt;

&lt;p&gt;By applying best practices in targeting, IAM, observability, and post-experiment review, chaos engineering becomes a &lt;strong&gt;controlled, safe, and highly effective practice&lt;/strong&gt; in modern AWS environments.&lt;/p&gt;

&lt;p&gt;This is just the start. &lt;strong&gt;In future posts, I would like to explore&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Testing more complex architectures (like EKS, or RDS, multi-AZ)&lt;/li&gt;
&lt;li&gt;  Using stop conditions and CloudWatch alarms&lt;/li&gt;
&lt;li&gt;  Automating chaos experiments in CI/CD workflows&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;About the Author&lt;/strong&gt;&lt;br&gt;
I’m &lt;strong&gt;Róbert Zsótér&lt;/strong&gt;, Kubernetes &amp;amp; AWS architect.&lt;br&gt;
If you’re into &lt;strong&gt;Kubernetes, EKS, Terraform&lt;/strong&gt;, and &lt;strong&gt;cloud-native security&lt;/strong&gt;, follow my latest posts here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LinkedIn: &lt;a href="https://www.linkedin.com/in/r%C3%B3bert-zs%C3%B3t%C3%A9r-34541464/" rel="noopener noreferrer"&gt;Róbert Zsótér&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Substack: &lt;a href="https://cloudskillshu.substack.com"&gt;CSHU&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s build secure, scalable clusters, together.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Originally&lt;/strong&gt; published on &lt;strong&gt;Medium&lt;/strong&gt;: &lt;a href="https://medium.com/@zs77.robert/chaos-engineering-eafd3d7af03d" rel="noopener noreferrer"&gt;Chaos Engineering on AWS - Part 1&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>chaosengineering</category>
      <category>fis</category>
      <category>security</category>
    </item>
  </channel>
</rss>
