<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Robert Zsoter</title>
    <description>The latest articles on DEV Community by Robert Zsoter (@robert_r_7c237256b7614328).</description>
    <link>https://dev.to/robert_r_7c237256b7614328</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2090137%2F205fff6c-da14-416b-ba8e-400b5e8c11e9.png</url>
      <title>DEV Community: Robert Zsoter</title>
      <link>https://dev.to/robert_r_7c237256b7614328</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/robert_r_7c237256b7614328"/>
    <language>en</language>
    <item>
      <title>Browser-Based kubectl Access: Managing Kubernetes Without Bastion Hosts</title>
      <dc:creator>Robert Zsoter</dc:creator>
      <pubDate>Thu, 08 Jan 2026 16:17:00 +0000</pubDate>
      <link>https://dev.to/robert_r_7c237256b7614328/browser-based-kubectl-access-managing-kubernetes-without-bastion-hosts-1b4a</link>
      <guid>https://dev.to/robert_r_7c237256b7614328/browser-based-kubectl-access-managing-kubernetes-without-bastion-hosts-1b4a</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;This article presents a &lt;strong&gt;browser-based kubectl access pattern&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Designed for &lt;strong&gt;temporary, auditable cluster interaction&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No bastion host, no SSH, no heavy management tools&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;All actions go through the &lt;strong&gt;Kubernetes API and RBAC&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Not intended for daily production operations&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Accessing Kubernetes clusters securely is a recurring challenge, especially in environments where &lt;strong&gt;SSH access, bastion hosts, or heavy management tools are discouraged&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this article, I’ll walk through a &lt;strong&gt;browser-based kubectl access pattern&lt;/strong&gt; that enables &lt;strong&gt;temporary, auditable interaction&lt;/strong&gt; with a Kubernetes cluster, &lt;strong&gt;without&lt;/strong&gt; relying on &lt;strong&gt;jump hosts or&lt;/strong&gt; always-on &lt;strong&gt;management platforms&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This approach is intentionally &lt;strong&gt;not designed for daily production operations&lt;/strong&gt;. Its value lies in &lt;strong&gt;controlled access&lt;/strong&gt;, not convenience.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Kubernetes Access Is Hard to Get Right
&lt;/h2&gt;

&lt;p&gt;Most &lt;strong&gt;teams rely on&lt;/strong&gt; one or more of these approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bastion hosts&lt;/strong&gt; with SSH access&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;kubectl&lt;/strong&gt; configured on &lt;strong&gt;local laptops&lt;/strong&gt;/machines&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Full-featured &lt;strong&gt;Kubernetes management tools&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud-provider shell&lt;/strong&gt; environments&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of &lt;strong&gt;them work&lt;/strong&gt;, &lt;strong&gt;but&lt;/strong&gt; they &lt;strong&gt;come with&lt;/strong&gt; trade-offs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;credential sprawl &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;infrastructure overhead/increased attack surface&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;long-lived access paths/credentials&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;limited auditability/unclear audit boundaries&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In regulated or security-sensitive environments, these trade-offs become unacceptable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Yet teams still need&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;break-glass access&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;short-lived troubleshooting&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;training and workshop environments&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;controlled/restricted support access&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the gap the browser-based approach addresses.&lt;/p&gt;




&lt;h2&gt;
  
  
  What “Browser-Based kubectl” Actually Means
&lt;/h2&gt;

&lt;p&gt;This pattern does &lt;strong&gt;not&lt;/strong&gt; introduce a new Kubernetes UI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instead&lt;/strong&gt;, it exposes a &lt;strong&gt;restricted web terminal&lt;/strong&gt; that runs &lt;em&gt;kubectl&lt;/em&gt; inside the cluster, using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;a dedicated &lt;strong&gt;ServiceAccount&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;strict &lt;strong&gt;RBAC&lt;/strong&gt; permissions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;native&lt;/strong&gt; Kubernetes &lt;strong&gt;audit logging&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
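
&lt;p&gt;As a rough sketch, the ServiceAccount and least-privilege RBAC might look like the following (the names, namespace, resources, and verbs are illustrative assumptions, not part of any standard setup):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative only -- adjust names, namespace, resources, and verbs
kubectl apply -f - &lt;&lt;'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  name: web-terminal
  namespace: web-terminal
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: web-terminal-readonly
  namespace: web-terminal
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log", "services"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: web-terminal-readonly
  namespace: web-terminal
subjects:
- kind: ServiceAccount
  name: web-terminal
  namespace: web-terminal
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: web-terminal-readonly
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The terminal Pod then references this ServiceAccount via &lt;em&gt;spec.serviceAccountName&lt;/em&gt;, so every kubectl call it makes is bounded by the Role above.&lt;/p&gt;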

&lt;p&gt;All &lt;strong&gt;access&lt;/strong&gt; happens through &lt;strong&gt;HTTP(S)&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;There is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;no SSH&lt;/strong&gt; access,&lt;/li&gt;
&lt;li&gt;no node-level access or login,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;no user kubeconfig&lt;/strong&gt; distribution.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  High-Level Architecture
&lt;/h2&gt;

&lt;p&gt;Conceptually, the flow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Browser
     |
     v
HTTP(S) (restricted)
     |
     v
Ingress / Load Balancer
     |
     v
Service
     |
     v
Web Terminal Pod
     |
     v
kubectl
     |
     v
Kubernetes API Server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important details&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;terminal runs as a Pod&lt;/strong&gt; inside the cluster&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Authorization&lt;/strong&gt; is enforced &lt;strong&gt;by&lt;/strong&gt; Kubernetes &lt;strong&gt;RBAC&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;All kubectl actions result in Kubernetes API requests, which can be captured by Kubernetes audit logs depending on the audit policy &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Access can be disabled instantly&lt;/strong&gt; by removing the Pod or Service&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;No persistent external access paths remain&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The same flow, represented &lt;strong&gt;as an ASCII diagram&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozebhrmf3o19a83hivg3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozebhrmf3o19a83hivg3.png" alt="Web Terminal for kubectl - ascii diagram" width="353" height="710"&gt;&lt;/a&gt;&lt;/p&gt;
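
&lt;p&gt;Because everything lives inside the cluster, revoking access is one deletion away. A sketch, assuming all components were deployed into a dedicated namespace (the &lt;em&gt;web-terminal&lt;/em&gt; name is an assumption):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Tears down the Pod, Service, Ingress, ServiceAccount, and RBAC bindings at once
kubectl delete namespace web-terminal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;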




&lt;h2&gt;
  
  
  A Few Screenshots in Operation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;basic commands:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67dixpvb99mrgg06xx1t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67dixpvb99mrgg06xx1t.png" alt="Web Terminal for kubectl commands" width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;jumping into a pod with the &lt;em&gt;kubectl exec&lt;/em&gt; command, within the defined namespace:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9n90cb0royc881pxft3e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9n90cb0royc881pxft3e.png" alt="Web Terminal for kubectl exec command" width="800" height="665"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Security Model: Why This Is Auditable by Design
&lt;/h2&gt;

&lt;p&gt;This pattern relies on &lt;strong&gt;layered security&lt;/strong&gt;, not a single control.&lt;/p&gt;

&lt;h3&gt;
  
  
  Network layer
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;TLS termination&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;IP allowlists&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;No direct node exposure&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
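
&lt;p&gt;With the NGINX Ingress Controller, for example, the IP allowlist can be declared as an annotation (a sketch; the hostname, CIDR range, and service name are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f - &lt;&lt;'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-terminal
  namespace: web-terminal
  annotations:
    # Only these source ranges may reach the terminal
    nginx.ingress.kubernetes.io/whitelist-source-range: "203.0.113.0/24"
spec:
  ingressClassName: nginx
  tls:
  - hosts: ["terminal.example.com"]
    secretName: terminal-tls
  rules:
  - host: terminal.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-terminal
            port:
              number: 80
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;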

&lt;h3&gt;
  
  
  Kubernetes authorization
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Dedicated ServiceAccount&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Least-privilege RBAC&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Optional read-only mode&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Auditability
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Every action flows through the Kubernetes API&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Native audit logs capture requests&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Auditability: What Is (and Is Not) Logged
&lt;/h2&gt;

&lt;p&gt;This is an important clarification.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;kubectl commands themselves are not logged&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kubernetes API requests are&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a user executes a command:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;kubectl get pods&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;kubectl describe deployment ...&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;kubectl apply -f …&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The resulting API calls &lt;strong&gt;can be recorded in Kubernetes audit logs&lt;/strong&gt;, depending on the configured audit policy.&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Resource access and mutations are traceable&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;RBAC enforcement is preserved&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;No hidden or opaque access paths exist (unlike SSH sessions)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
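
&lt;p&gt;What actually lands in the audit log is decided by the API server's audit policy (configured with the &lt;em&gt;--audit-policy-file&lt;/em&gt; flag). A minimal sketch that records request metadata for the terminal's identity (the ServiceAccount name is an assumption):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat &lt;&lt;'EOF' &gt; audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Record metadata for every request made by the web terminal's ServiceAccount
- level: Metadata
  users: ["system:serviceaccount:web-terminal:web-terminal"]
# Keep all other traffic out of this sketch
- level: None
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;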

&lt;p&gt;This pattern relies on &lt;strong&gt;Kubernetes’ native security model&lt;/strong&gt;, not on custom logging logic.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Important&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
This approach does not bypass Kubernetes security controls; it depends on them.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  How This Compares to Other Access Patterns
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vixf2ybrznbsb6dkg3u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vixf2ybrznbsb6dkg3u.png" alt="How web based kubectl Compares to Other Access Patterns" width="800" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This pattern is not a replacement; it fills a &lt;strong&gt;specific operational niche&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  When You Should (and Should Not) Use This
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Recommended
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Break-glass scenarios&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Training and workshops&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Restricted production environments&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Short-lived support access&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Not recommended
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Daily production operations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CI/CD automation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Persistent admin workflows&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The limitations are intentional. They help prevent accidental misuse.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;After experimenting with this pattern in real environments, &lt;strong&gt;a few things became clear&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Kubernetes RBAC remains the single most important control&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Auditability improves when access paths are explicit&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Removing SSH simplifies security reviews&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Temporary access patterns reduce long-term risk&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Convenience is easy to add. Removing access later is much harder.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Secure Kubernetes access is less about tools and more about &lt;strong&gt;boundaries&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Browser-based kubectl access provides a &lt;strong&gt;minimal, auditable, and intentionally constrained&lt;/strong&gt; way to interact with a cluster &lt;strong&gt;when traditional approaches are unavailable or undesirable&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Used correctly, it solves a real problem, without becoming a new one.&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Reference Implementation
&lt;/h3&gt;

&lt;p&gt;The repository demonstrating this pattern is available here: &lt;a href="https://github.com/zsoterr/k8s-web-terminal-kubectl" rel="noopener noreferrer"&gt;https://github.com/zsoterr/k8s-web-terminal-kubectl&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  And What’s Next?
&lt;/h2&gt;

&lt;p&gt;I am planning a number of modifications and additions. You can find more information about these in README.md in the GitHub repository.&lt;/p&gt;




&lt;h2&gt;
  
  
  Note
&lt;/h2&gt;

&lt;p&gt;This DEV.to post is a concise version of a longer, experience-based guide.&lt;/p&gt;

&lt;p&gt;If you’re &lt;strong&gt;interested in deeper technical details&lt;/strong&gt;, you can find it among &lt;a href="https://medium.com/@zs77.robert/" rel="noopener noreferrer"&gt;my Medium stories&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  About the Author
&lt;/h2&gt;

&lt;p&gt;I’m &lt;strong&gt;Róbert Zsótér&lt;/strong&gt;, Kubernetes &amp;amp; AWS architect. If you’re into &lt;strong&gt;Kubernetes, EKS, Terraform&lt;/strong&gt;, &lt;strong&gt;AI&lt;/strong&gt; and &lt;strong&gt;cloud-native security&lt;/strong&gt;, follow my latest posts here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;LinkedIn: &lt;a href="https://www.linkedin.com/in/r%C3%B3bert-zs%C3%B3t%C3%A9r-34541464/" rel="noopener noreferrer"&gt;Róbert Zsótér&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Substack: &lt;a href="https://cloudskillshu.substack.com/" rel="noopener noreferrer"&gt;CSHU&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Let’s build secure, scalable clusters together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Originally published on Medium &lt;a href="https://medium.com/@zs77.robert/browser-based-kubectl-access-managing-kubernetes-without-bastion-hosts-or-heavy-tools-1b6c939ce8ee" rel="noopener noreferrer"&gt;Browser-Based kubectl Access: Managing Kubernetes Without Bastion Hosts or Heavy Tools&lt;/a&gt;&lt;/p&gt;




</description>
      <category>architecture</category>
      <category>devops</category>
      <category>kubernetes</category>
      <category>security</category>
    </item>
    <item>
      <title>Using Amazon Q for AI-Assisted Debugging in Amazon EKS</title>
      <dc:creator>Robert Zsoter</dc:creator>
      <pubDate>Tue, 23 Dec 2025 15:16:00 +0000</pubDate>
      <link>https://dev.to/robert_r_7c237256b7614328/using-amazon-q-for-ai-assisted-debugging-in-amazon-eks-fm8</link>
      <guid>https://dev.to/robert_r_7c237256b7614328/using-amazon-q-for-ai-assisted-debugging-in-amazon-eks-fm8</guid>
      <description>&lt;h2&gt;
  
  
  Using Amazon Q for AI-Assisted Debugging in Amazon EKS
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Practical insights for Kubernetes engineers&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;The first step: &lt;em&gt;Using Amazon Q capabilities in an EKS environment&lt;/em&gt; - &lt;strong&gt;Part 1&lt;/strong&gt;:&lt;br&gt;
Fixing AWS IAM permission issues and configuring the EKS environment.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Q enables AI-assisted debugging directly in the AWS Console for EKS&lt;/li&gt;
&lt;li&gt;It accelerates root-cause analysis but does not replace kubectl or observability tools&lt;/li&gt;
&lt;li&gt;Correct IAM and EKS access configuration is critical: most Amazon Q “issues” are access-related&lt;/li&gt;
&lt;li&gt;Best used as a diagnostic accelerator, not an automated fix engine&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Debugging Amazon EKS&lt;/strong&gt; environments is rarely straightforward. Even experienced Kubernetes engineers often need to &lt;strong&gt;correlate information across multiple layers&lt;/strong&gt;: pod logs, node health, IAM permissions, control plane behavior, networking, AWS-managed integrations, etc.&lt;/p&gt;

&lt;p&gt;AWS introduced &lt;strong&gt;Amazon Q&lt;/strong&gt; some time ago: an &lt;strong&gt;AI assistant&lt;/strong&gt; embedded into the AWS &lt;strong&gt;Console&lt;/strong&gt; that brings a new operational model to &lt;strong&gt;EKS troubleshooting&lt;/strong&gt;: &lt;strong&gt;context-aware, AI-assisted reasoning&lt;/strong&gt; directly where engineers already work.&lt;/p&gt;

&lt;p&gt;This article &lt;strong&gt;explains&lt;/strong&gt; what Amazon Q adds to &lt;strong&gt;EKS debugging&lt;/strong&gt;, where it fits into real-world workflows, and why &lt;strong&gt;access configuration&lt;/strong&gt; - not AI - is the real &lt;strong&gt;key&lt;/strong&gt; to success.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why EKS Debugging Is Still Challenging
&lt;/h2&gt;

&lt;p&gt;Although EKS abstracts much of the Kubernetes control plane, &lt;strong&gt;operational debugging&lt;/strong&gt; remains &lt;strong&gt;complex&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pod failures often involve IAM, networking, or node capacity&lt;/li&gt;
&lt;li&gt;Cluster events and logs are spread across services&lt;/li&gt;
&lt;li&gt;Kubernetes RBAC and AWS IAM must both align&lt;/li&gt;
&lt;li&gt;Engineers switch constantly between tools and consoles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Traditional&lt;/strong&gt; workflows rely heavily on &lt;em&gt;kubectl&lt;/em&gt;, CloudWatch Logs, metrics dashboards, and deep platform knowledge. This is &lt;strong&gt;effective, but slow&lt;/strong&gt; and cognitively expensive.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Amazon Q Brings to the EKS Console
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Amazon Q Developer&lt;/strong&gt; is an AI-powered assistant &lt;strong&gt;integrated&lt;/strong&gt; into the AWS &lt;strong&gt;Console&lt;/strong&gt; UI. When used with EKS, &lt;strong&gt;it can&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inspect cluster state and related AWS resources&lt;/li&gt;
&lt;li&gt;Explain error conditions in natural language&lt;/li&gt;
&lt;li&gt;Correlate Kubernetes symptoms with AWS infrastructure&lt;/li&gt;
&lt;li&gt;Suggest likely causes and remediation paths&lt;/li&gt;
&lt;li&gt;Generate Kubernetes YAML examples&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike external AI tools, Amazon Q &lt;strong&gt;operates within AWS context&lt;/strong&gt;, meaning its answers are tied to what it can actually see in your account and cluster. &lt;strong&gt;No CLI&lt;/strong&gt; installation is required; the interaction happens directly &lt;strong&gt;inside the console&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Example Queries You Can Ask
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Why is this pod in CrashLoopBackOff?"
"Explain why my node group isn't scaling up."
"Generate a Deployment YAML for NGINX with a LoadBalancer Service."
"Check if my cluster is using deprecated APIs before upgrading to 1.33."
"How do I restrict traffic between namespaces with a NetworkPolicy?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Console-Aware vs Cluster-Aware Amazon Q
&lt;/h2&gt;

&lt;p&gt;It’s &lt;strong&gt;important to understand&lt;/strong&gt; that not all Amazon Q experiences are identical.&lt;/p&gt;

&lt;p&gt;Today, engineers may encounter the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Global Console Amazon Q&lt;/strong&gt;: a general-purpose AWS assistant (broadly available)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Context-aware EKS-native Q&lt;/strong&gt;: embedded directly in EKS resource views (pods, nodes, add-ons)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;global assistant&lt;/strong&gt; works across services and regions. The &lt;strong&gt;EKS-native version&lt;/strong&gt; appears contextually on cluster pages and can inspect workloads more deeply. Both rely on the same fundamental principle: &lt;strong&gt;visibility is controlled by access permissions&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Access Configuration Matters More Than the AI
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;common misconception&lt;/strong&gt; is that Amazon Q “&lt;em&gt;doesn’t work&lt;/em&gt;” when it returns partial or vague answers. In reality, Amazon Q &lt;strong&gt;can only analyze what the console identity is allowed to access&lt;/strong&gt;. For &lt;strong&gt;EKS&lt;/strong&gt;, this involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;IAM&lt;/strong&gt; permissions (e.g., &lt;em&gt;eks:AccessKubernetesApi&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;EKS &lt;strong&gt;access mode&lt;/strong&gt; (Access Entries preferred; the legacy &lt;em&gt;aws-auth&lt;/em&gt; ConfigMap is deprecated)&lt;/li&gt;
&lt;li&gt;Kubernetes &lt;strong&gt;RBAC&lt;/strong&gt; mappings via Access Policies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern &lt;strong&gt;EKS&lt;/strong&gt; clusters (especially 1.30+) rely on &lt;em&gt;EKS Cluster Access Management&lt;/em&gt;, where access is controlled through &lt;strong&gt;Access Entries&lt;/strong&gt; and &lt;strong&gt;Access Policies&lt;/strong&gt;, &lt;strong&gt;not&lt;/strong&gt; the &lt;em&gt;aws-auth&lt;/em&gt; ConfigMap.&lt;/p&gt;

&lt;p&gt;If the &lt;strong&gt;console role lacks&lt;/strong&gt; proper EKS &lt;strong&gt;access&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Q &lt;strong&gt;cannot list&lt;/strong&gt; pods or nodes&lt;/li&gt;
&lt;li&gt;Cluster-level insights remain unavailable&lt;/li&gt;
&lt;li&gt;Errors appear as “&lt;em&gt;authorization&lt;/em&gt;” or “&lt;em&gt;insufficient access&lt;/em&gt;” messages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is expected behavior, &lt;strong&gt;not a bug&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Common IAM Pitfall: When Amazon Q “Sees Nothing”
&lt;/h2&gt;

&lt;p&gt;The user needs &lt;strong&gt;appropriate IAM permissions&lt;/strong&gt; to interact with the cluster from the &lt;strong&gt;EKS console&lt;/strong&gt;, which is typically achieved through &lt;strong&gt;EKS Access Entries&lt;/strong&gt; using the &lt;strong&gt;AdminView&lt;/strong&gt; or &lt;strong&gt;ClusterAdmin&lt;/strong&gt; policies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;By default, you may not have the correct IAM and EKS settings&lt;/strong&gt; to use it, and when you issue a command, you may get the following error.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For example&lt;/strong&gt;, when you enter a command in the chat interface, you might encounter an error message like the following: “&lt;em&gt;I encountered an authorization error when trying to access the cluster…&lt;/em&gt;”&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs73rb90t0pdy1gj7u0xk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs73rb90t0pdy1gj7u0xk.png" alt="Amazon Q with missing permissions on Amazon EKS dashboards" width="292" height="653"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reason&lt;/strong&gt;:&lt;br&gt;
The Amazon &lt;strong&gt;Q&lt;/strong&gt; panel is visible and operational, but it &lt;strong&gt;cannot access&lt;/strong&gt; Kubernetes objects within the cluster, because Amazon Q &lt;strong&gt;lacks the required permissions to read resources&lt;/strong&gt; in your &lt;strong&gt;EKS&lt;/strong&gt; cluster.&lt;/p&gt;







&lt;h2&gt;
  
  
  The Full, Detailed Guide
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Important&lt;/strong&gt;&lt;br&gt;
You &lt;strong&gt;can find&lt;/strong&gt; the &lt;strong&gt;detailed guide&lt;/strong&gt; (which &lt;strong&gt;steps&lt;/strong&gt; are &lt;strong&gt;necessary&lt;/strong&gt; for &lt;strong&gt;IAM and EKS&lt;/strong&gt;, and what to do if you &lt;strong&gt;don’t have&lt;/strong&gt; an &lt;em&gt;aws-auth&lt;/em&gt; ConfigMap), together with the solution to the error, via the link to my &lt;strong&gt;Medium&lt;/strong&gt; article &lt;strong&gt;at the end of this post&lt;/strong&gt;.&lt;/p&gt;




&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  A Brief Explanation
&lt;/h2&gt;

&lt;p&gt;One of the most common issues I’ve encountered is the &lt;strong&gt;assumption&lt;/strong&gt; that Amazon Q &lt;strong&gt;automatically has Kubernetes-level visibility&lt;/strong&gt; once you open the EKS console. In practice, this is often &lt;strong&gt;not true&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Typical&lt;/strong&gt; symptoms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Q responds with partial or generic answers&lt;/li&gt;
&lt;li&gt;Pod- or node-level questions fail silently&lt;/li&gt;
&lt;li&gt;Messages like “&lt;em&gt;insufficient access&lt;/em&gt;” or “&lt;em&gt;unable to retrieve cluster data&lt;/em&gt;”&lt;/li&gt;
&lt;li&gt;The cluster appears healthy in the console, but Q cannot explain issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This usually &lt;strong&gt;indicates&lt;/strong&gt; an &lt;strong&gt;IAM access gap&lt;/strong&gt;, not a problem with Amazon Q itself.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;most common&lt;/strong&gt; underlying &lt;strong&gt;causes&lt;/strong&gt;: the IAM role used in the AWS Console&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;doesn't have&lt;/strong&gt; the right EKS &lt;strong&gt;permissions&lt;/strong&gt; (e.g., &lt;em&gt;eks:DescribeCluster&lt;/em&gt;), or&lt;/li&gt;
&lt;li&gt;is &lt;strong&gt;not authorized to access the Kubernetes API&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern EKS clusters rely on &lt;em&gt;EKS Cluster Access Management&lt;/em&gt;, where Kubernetes access is controlled via:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Access Entries&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Access Policies&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;IAM&lt;/strong&gt; → Kubernetes RBAC mapping&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Legacy&lt;/strong&gt; &lt;em&gt;aws-auth&lt;/em&gt;-based assumptions no longer apply.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix (high level)&lt;/strong&gt;: ensure that the console role&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;has&lt;/strong&gt; &lt;em&gt;eks:AccessKubernetesApi&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;is &lt;strong&gt;mapped&lt;/strong&gt; via an &lt;strong&gt;Access Entry&lt;/strong&gt; to the appropriate Kubernetes permissions&lt;/li&gt;
&lt;li&gt;uses a &lt;em&gt;read-level&lt;/em&gt; or &lt;em&gt;admin-level&lt;/em&gt; Access &lt;strong&gt;Policy&lt;/strong&gt;, depending on the use case&lt;/li&gt;
&lt;/ul&gt;
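
&lt;p&gt;At the CLI level, that mapping can be sketched with EKS Access Entries (the cluster name and role ARN are placeholders; swap the AWS-managed policy for a narrower one if read-only visibility is enough):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Register the console role with the cluster
aws eks create-access-entry \
  --cluster-name my-cluster \
  --principal-arn arn:aws:iam::111122223333:role/ConsoleRole

# Grant cluster-wide read-only visibility via an AWS-managed Access Policy
aws eks associate-access-policy \
  --cluster-name my-cluster \
  --principal-arn arn:aws:iam::111122223333:role/ConsoleRole \
  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSAdminViewPolicy \
  --access-scope type=cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;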

&lt;p&gt;&lt;strong&gt;Once this is correctly configured&lt;/strong&gt;, Amazon Q immediately gains the visibility required to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;list&lt;/strong&gt; pods and nodes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;inspect&lt;/strong&gt; workload state&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;provide&lt;/strong&gt; accurate, context-aware &lt;strong&gt;explanations&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This behavior is expected and intentional: &lt;strong&gt;Amazon Q never bypasses IAM or RBAC&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqr5qpwhespfdb3ahbuk6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqr5qpwhespfdb3ahbuk6.png" alt="Amazon Q with the right permissions on EKS dashboards" width="369" height="722"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Amazon Q Fits in Real Operations
&lt;/h2&gt;

&lt;p&gt;Amazon Q does &lt;strong&gt;not&lt;/strong&gt; replace:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;kubectl&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;GitOps pipelines (Argo CD/Flux)&lt;/li&gt;
&lt;li&gt;Full observability platforms&lt;/li&gt;
&lt;li&gt;Incident response processes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead, it acts as a &lt;strong&gt;diagnostic accelerator&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Faster&lt;/strong&gt; understanding of failures&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reduced&lt;/strong&gt; time-to-hypothesis&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved onboarding&lt;/strong&gt; for new engineers&lt;/li&gt;
&lt;li&gt;Consistent &lt;strong&gt;explanations&lt;/strong&gt; across teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For &lt;strong&gt;platform&lt;/strong&gt; and &lt;strong&gt;SRE teams&lt;/strong&gt;, it becomes a first-stop &lt;strong&gt;reasoning tool&lt;/strong&gt;, not the final authority.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Amazon Q Is Most Useful
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Recommended scenarios&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-cluster EKS environments&lt;/li&gt;
&lt;li&gt;Teams onboarding engineers new to Kubernetes&lt;/li&gt;
&lt;li&gt;Incident triage and exploratory debugging&lt;/li&gt;
&lt;li&gt;Environments with well-defined IAM and RBAC&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Less effective scenarios&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highly restricted clusters with minimal visibility&lt;/li&gt;
&lt;li&gt;Environments expecting “automatic fixes”&lt;/li&gt;
&lt;li&gt;Poorly structured access models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Amazon Q &lt;strong&gt;improves&lt;/strong&gt; on a good &lt;strong&gt;operating model&lt;/strong&gt;; it &lt;strong&gt;does not replace&lt;/strong&gt; missing fundamentals.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Amazon Q&lt;/strong&gt; provides &lt;strong&gt;AI-assisted reasoning&lt;/strong&gt; for &lt;strong&gt;EKS&lt;/strong&gt; operations&lt;/li&gt;
&lt;li&gt;Its value &lt;strong&gt;depends&lt;/strong&gt; entirely on &lt;strong&gt;correct access configuration&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;It &lt;strong&gt;accelerates understanding&lt;/strong&gt; but does &lt;strong&gt;not replace engineering judgment&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Teams should treat it as an &lt;strong&gt;assistant&lt;/strong&gt; whose output is verified, not an autonomous operator&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Used properly, Amazon Q can significantly &lt;strong&gt;reduce&lt;/strong&gt; the &lt;strong&gt;time&lt;/strong&gt; and effort required to &lt;strong&gt;debug&lt;/strong&gt; complex &lt;strong&gt;Kubernetes issues&lt;/strong&gt; in AWS environments.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AI-assisted operations&lt;/strong&gt; are &lt;strong&gt;becoming&lt;/strong&gt; a foundational capability in modern cloud platforms. &lt;strong&gt;Amazon Q&lt;/strong&gt; represents AWS’s first serious step toward native, &lt;strong&gt;context-aware AI debugging&lt;/strong&gt; for Kubernetes.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;teams that benefit most&lt;/strong&gt; will be those who &lt;strong&gt;combine&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clean EKS access design&lt;/li&gt;
&lt;li&gt;Strong IAM and RBAC practices&lt;/li&gt;
&lt;li&gt;Realistic expectations of AI assistance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That &lt;strong&gt;combination&lt;/strong&gt; - not AI alone - is what unlocks &lt;strong&gt;operational efficiency&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Note
&lt;/h2&gt;

&lt;p&gt;This DEV.to post is a concise version of a longer, experience-based guide.&lt;/p&gt;

&lt;p&gt;If you’re &lt;strong&gt;interested in deeper technical details&lt;/strong&gt;, &lt;strong&gt;IAM configuration nuances&lt;/strong&gt;, and real-world EKS lessons learned, you can find it among &lt;a href="https://medium.com/@zs77.robert/" rel="noopener noreferrer"&gt;my Medium stories&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This article&lt;/strong&gt; is the &lt;strong&gt;first part of a series&lt;/strong&gt; where we explore &lt;strong&gt;AI-oriented debugging and operational&lt;/strong&gt; workflows in &lt;strong&gt;Kubernetes&lt;/strong&gt; and Amazon &lt;strong&gt;EKS&lt;/strong&gt; environments.&lt;/p&gt;




&lt;h2&gt;
  
  
  About the Author
&lt;/h2&gt;

&lt;p&gt;I’m &lt;strong&gt;Róbert Zsótér&lt;/strong&gt;, Kubernetes &amp;amp; AWS architect.&lt;br&gt;&lt;br&gt;
If you’re into &lt;strong&gt;Kubernetes, EKS, Terraform&lt;/strong&gt;, and &lt;strong&gt;cloud-native security&lt;/strong&gt;, follow my latest posts here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  LinkedIn: &lt;a href="https://www.linkedin.com/in/r%C3%B3bert-zs%C3%B3t%C3%A9r-34541464/" rel="noopener noreferrer"&gt;Róbert Zsótér&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Substack: &lt;a href="https://cloudskillshu.substack.com/" rel="noopener noreferrer"&gt;CSHU&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s build secure, scalable clusters together.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Originally published on Medium &lt;a href="https://medium.com/@zs77.robert/enhancing-amazon-eks-operations-with-ai-capabilities-of-amazon-q-part-1-7d01f25f01df" rel="noopener noreferrer"&gt;Enhancing Amazon EKS Operations with AI capabilities of Amazon Q -Part 1&lt;/a&gt;&lt;/p&gt;




</description>
      <category>aws</category>
      <category>eks</category>
      <category>ai</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>kubectl-ai WebUI: A Visual Way to Use AI for Kubernetes Troubleshooting</title>
      <dc:creator>Robert Zsoter</dc:creator>
      <pubDate>Tue, 25 Nov 2025 15:00:00 +0000</pubDate>
      <link>https://dev.to/robert_r_7c237256b7614328/kubectl-ai-webui-a-visual-way-to-use-ai-for-kubernetes-troubleshooting-34g6</link>
      <guid>https://dev.to/robert_r_7c237256b7614328/kubectl-ai-webui-a-visual-way-to-use-ai-for-kubernetes-troubleshooting-34g6</guid>
      <description>&lt;h2&gt;
  
  
  kubectl-ai WebUI: A Visual Interface for AI-Powered Kubernetes Troubleshooting
&lt;/h2&gt;

&lt;p&gt;If you've been experimenting with &lt;strong&gt;kubectl-ai&lt;/strong&gt; for AI-assisted troubleshooting on Kubernetes, you probably know one thing already:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It’s powerful, but strictly CLI-based.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This creates a real barrier for developers, students, or platform engineers &lt;strong&gt;who are less comfortable&lt;/strong&gt; with &lt;strong&gt;command-line&lt;/strong&gt; workflows but still want to benefit from AI-driven explanations, log analysis, issue hunting and YAML generation.&lt;/p&gt;

&lt;p&gt;To solve this, &lt;strong&gt;I built&lt;/strong&gt; something new:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A WebUI for kubectl-ai&lt;/strong&gt;, a &lt;strong&gt;browser interface&lt;/strong&gt; that makes AI-assisted Kubernetes troubleshooting accessible to everyone.&lt;/p&gt;

&lt;p&gt;This article explains &lt;strong&gt;what it does&lt;/strong&gt;, &lt;strong&gt;why it helps&lt;/strong&gt;, and &lt;strong&gt;how you can try it out.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0f6d60kaua7ofudluq7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0f6d60kaua7ofudluq7.png" alt="WebUI for kubectl-ai" width="800" height="488"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What is kubectl-ai
&lt;/h2&gt;

&lt;p&gt;In short, kubectl-ai is an &lt;strong&gt;AI-powered plugin&lt;/strong&gt; for kubectl that transforms &lt;strong&gt;natural-language questions&lt;/strong&gt; into the appropriate Kubernetes commands.&lt;/p&gt;

&lt;p&gt;It functions as a CLI extension that brings together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your active Kubernetes cluster context,&lt;/li&gt;
&lt;li&gt;AI models (such as OpenAI, Anthropic, and others),&lt;/li&gt;
&lt;li&gt;Instant command generation and analysis directly from the terminal.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are &lt;strong&gt;interested in&lt;/strong&gt; how this works in practice, let me recommend my articles on this subject that are already available on &lt;strong&gt;Medium&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://medium.com/@zs77.robert/unveiling-kubectl-ai-an-extended-exploration-of-ai-powered-kubernetes-management-3e4facd5b505" rel="noopener noreferrer"&gt;Unveiling kubectl-ai: An Extended Exploration of AI-Powered Kubernetes Management&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/@zs77.robert/issue-hunting-in-a-kubernetes-cluster-with-kubectl-ai-a-practical-step-by-step-guide-4b34d5a1b7db" rel="noopener noreferrer"&gt;Issue Hunting in a Kubernetes Cluster with kubectl-ai: A Practical step-by-step Guide&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It &lt;strong&gt;helps&lt;/strong&gt; you diagnose issues and generate solutions.&lt;/p&gt;

&lt;p&gt;Typical kubectl-ai prompts:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;"Why is my pod stuck in CrashLoopBackOff?"  &lt;br&gt;
"Explain the last 100 lines of logs for service X."  &lt;br&gt;
"Generate a correct Ingress for this Deployment."  &lt;br&gt;
"Why does my Deployment not scale to 5 replicas?"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;It’s extremely useful, especially for debugging.&lt;/p&gt;

&lt;p&gt;But…&lt;/p&gt;




&lt;h2&gt;
  
  
  The CLI-Only Limitation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;For many Kubernetes newcomers&lt;/strong&gt;, the command line is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;intimidating&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;error-prone&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;difficult to navigate&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;visually limited&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;slow to learn&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I have noticed in my environment that &lt;strong&gt;40–60%&lt;/strong&gt; of my colleagues working with Kubernetes prefer the &lt;strong&gt;visual interface&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That’s why&lt;/strong&gt; I created a browser-based experience.&lt;/p&gt;




&lt;h2&gt;
  
  
  Introducing: &lt;strong&gt;kubectl-ai WebUI&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;browser UI&lt;/strong&gt; that exposes the same kubectl-ai logic, &lt;strong&gt;without&lt;/strong&gt; requiring the &lt;strong&gt;terminal&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  No deeper CLI knowledge needed
&lt;/h3&gt;

&lt;p&gt;Just open your browser → type your question → get the answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Same capabilities as kubectl-ai CLI
&lt;/h3&gt;

&lt;p&gt;Since kubectl-ai works in the background, it provides the same functionality but on a web interface:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Log analysis&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Error explanations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;YAML generation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kubernetes troubleshooting&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Best-practice fixes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;etc.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Works on top of your existing Kubernetes context
&lt;/h3&gt;

&lt;p&gt;The WebUI simply &lt;strong&gt;forwards your prompt&lt;/strong&gt; to the CLI and &lt;strong&gt;displays&lt;/strong&gt; the &lt;strong&gt;result&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Designed for:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;beginners&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;platform teams&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DevOps onboarding&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;training rooms&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;troubleshooting sessions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;classroom labs&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa02b4ossitdm0x9oh74r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa02b4ossitdm0x9oh74r.png" alt="Architecture overview of webui for kubectl-ai" width="710" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What happens under the hood:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The WebUI sends your prompt to the backend&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The backend triggers kubectl-ai&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;kubectl-ai queries Kubernetes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AI model generates reasoning&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The response is displayed visually&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
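&lt;p&gt;As a rough sketch of that forwarding step (this is &lt;strong&gt;not&lt;/strong&gt; the project’s actual code; the function name and invocation are assumptions, so check the repository for the real implementation), the backend essentially shells out to the CLI:&lt;/p&gt;

```shell
# Hypothetical sketch: the WebUI backend receives the user's prompt
# and forwards it to the kubectl-ai CLI, returning the CLI's output.
forward_prompt() {
  prompt="$1"
  binary="${2:-kubectl-ai}"   # CLI to invoke; overridable for testing
  "$binary" "$prompt"         # one-shot invocation with the prompt
}

# Example call (substituting echo for kubectl-ai, which needs a
# cluster context and an AI provider API key to actually run):
forward_prompt "why is my pod stuck in CrashLoopBackOff?" echo
```

&lt;p&gt;In the WebUI, the captured output is then rendered as formatted HTML instead of raw terminal text.&lt;/p&gt;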

&lt;p&gt;&lt;strong&gt;No terminal interaction&lt;/strong&gt; needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Installation (Quick Start)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Every step&lt;/strong&gt; is documented in detail in the &lt;strong&gt;GitHub&lt;/strong&gt; repository, so please read it before you start. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt; repo: &lt;a href="https://github.com/zsoterr/kubernetes-ai-examples/tree/main/project-k8s-kubectl-ai-web-ui" rel="noopener noreferrer"&gt;k8s-kubectl-ai-web-ui&lt;/a&gt;&lt;br&gt;
If you find it useful, feel free to ⭐ it or share your ideas there.&lt;/p&gt;




&lt;h2&gt;
  
  
  Using the WebUI (Examples)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Troubleshoot a CrashLoopBackOff
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;"Investigate why payment-service pod is restarting."&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Explain log output
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;"Help me understand the last 20 log lines for checkout-api."&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Generate YAML
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;"Create a working Ingress for my deployment: checkout-api, port 8080."&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Diagnose deployment issues
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;"Why isn't my HPA scaling above 3 replicas?"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Everything happens visually, without needing terminal commands.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Screenshots&lt;/strong&gt;:&lt;br&gt;
With &lt;em&gt;ReadOnly&lt;/em&gt; mode:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6d08qtdnshgu1znzwf1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6d08qtdnshgu1znzwf1.png" alt="WebUI - " width="699" height="801"&gt;&lt;/a&gt;&lt;br&gt;
and&lt;br&gt;
With &lt;em&gt;"Enabled cluster changes"&lt;/em&gt;:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokss3dzoc5abeykfejjf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokss3dzoc5abeykfejjf.png" alt="WebUI - " width="800" height="602"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  When the WebUI Is Useful
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Ideal for:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;engineers who are new to Kubernetes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;anyone not yet confident or comfortable with the (Linux) shell&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;team onboarding&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;demos, training, and Kubernetes courses&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;situations where some teammates prefer a UI over the CLI&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Not ideal for:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;production environments with very strict RBAC&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;shared clusters with single API key authentication&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;environments where shell access is mandatory for auditing&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Features You Can Customize
&lt;/h2&gt;

&lt;p&gt;Inside the project, you can modify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;AI provider configuration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;namespace filters&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;safety rules for commands&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;input sanitization&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;UI wording and layout&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project is &lt;strong&gt;intended to be extended&lt;/strong&gt; (check the README file in the project folder on GitHub).&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Project Exists
&lt;/h2&gt;

&lt;p&gt;As a Kubernetes enthusiast and AWS/K8s architect, I’ve seen again and again:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;people want to use AI for K8s&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;kubectl-ai is great&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;but the CLI stops many people from even trying&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This WebUI removes that friction and makes AI-powered troubleshooting available to the entire team, not just the CLI-native power users.&lt;/p&gt;




&lt;h2&gt;
  
  
  Repository
&lt;/h2&gt;

&lt;p&gt;Full project: &lt;a href="https://github.com/zsoterr/kubernetes-ai-examples/tree/main/project-k8s-kubectl-ai-web-ui" rel="noopener noreferrer"&gt;k8s-kubectl-ai-web-ui&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you find it useful, feel free to ⭐ it or share your ideas there.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About the Author&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
I’m &lt;strong&gt;Róbert Zsótér&lt;/strong&gt;, Kubernetes &amp;amp; AWS architect.&lt;br&gt;&lt;br&gt;
If you’re into &lt;strong&gt;Kubernetes, EKS, Terraform&lt;/strong&gt;, and &lt;strong&gt;cloud-native security&lt;/strong&gt;, follow my latest posts here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  LinkedIn: &lt;a href="https://www.linkedin.com/in/r%C3%B3bert-zs%C3%B3t%C3%A9r-34541464/" rel="noopener noreferrer"&gt;Róbert Zsótér&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Substack: &lt;a href="https://cloudskillshu.substack.com/" rel="noopener noreferrer"&gt;CSHU&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s build secure, scalable clusters, together.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Originally published on Medium &lt;a href="https://medium.com/p/cc5dcf78d41f" rel="noopener noreferrer"&gt;Kubectl-ai WebUI: Making AI Kubernetes Debugging Browser-Friendly for All&lt;/a&gt;&lt;/p&gt;




</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>kubectl</category>
      <category>devops</category>
    </item>
    <item>
      <title>Extending Your Kubernetes CLI: kubectl Plugins and Fixes</title>
      <dc:creator>Robert Zsoter</dc:creator>
      <pubDate>Tue, 18 Nov 2025 15:06:00 +0000</pubDate>
      <link>https://dev.to/robert_r_7c237256b7614328/extending-your-kubernetes-cli-kubectl-plugins-and-fixes-4h9p</link>
      <guid>https://dev.to/robert_r_7c237256b7614328/extending-your-kubernetes-cli-kubectl-plugins-and-fixes-4h9p</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;The Power of kubectl Plugins&lt;/strong&gt; and Fixing Common Errors.
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Learn kubectl plugin&lt;/strong&gt; use cases, how they work, alternatives, and how to troubleshoot errors like ‘&lt;em&gt;unknown command&lt;/em&gt;’.&lt;/p&gt;




&lt;p&gt;As a Kubernetes user, have you ever &lt;strong&gt;wished your kubectl could do more&lt;/strong&gt;? &lt;strong&gt;Or hit an error&lt;/strong&gt; like “&lt;em&gt;error: unknown command ‘oidc-login’ for ‘kubectl’&lt;/em&gt;”? &lt;br&gt;&lt;br&gt;
In this article, we’ll break down why &lt;strong&gt;plugins&lt;/strong&gt; are game-changers, &lt;strong&gt;how they work,&lt;/strong&gt; when to use them (or not), and the alternatives, plus &lt;strong&gt;a real fix&lt;/strong&gt; on Ubuntu 24.04 with Amazon EKS.&lt;/p&gt;


&lt;h3&gt;
  
  
  &lt;strong&gt;What Are kubectl Plugins?&lt;/strong&gt; 
&lt;/h3&gt;

&lt;p&gt;kubectl plugins are &lt;strong&gt;CLI extensions&lt;/strong&gt; that boost the standard kubectl tool with extra features. They’re lightweight, &lt;strong&gt;task-focused utilities&lt;/strong&gt; that blend seamlessly into your Kubernetes routine. They add new commands without touching the core tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Basics&lt;/strong&gt;: Plugins are &lt;strong&gt;standalone executables&lt;/strong&gt; that introduce &lt;strong&gt;sub-commands&lt;/strong&gt; (e.g., &lt;em&gt;kubectl oidc-login&lt;/em&gt;). &lt;br&gt;&lt;br&gt;
&lt;strong&gt;Use case&lt;/strong&gt;: When built-in tools fall short, like for custom authentication or debugging. &lt;br&gt;&lt;br&gt;
&lt;strong&gt;Why are they good?&lt;/strong&gt; Modular design, simple setup, community-backed, and they speed up workflows without adding complexity.&lt;/p&gt;


&lt;h3&gt;
  
  
  Why Use kubectl Plugins? 
&lt;/h3&gt;

&lt;p&gt;Kubernetes is built for &lt;strong&gt;modularity&lt;/strong&gt;, and its CLI follows suit. A plugin is an &lt;strong&gt;executable&lt;/strong&gt; named &lt;strong&gt;&lt;em&gt;kubectl-&amp;lt;name&amp;gt;&lt;/em&gt;&lt;/strong&gt;. &lt;br&gt;&lt;br&gt;
Typing &lt;em&gt;kubectl &amp;lt;name&amp;gt;&lt;/em&gt; &lt;strong&gt;triggers the CLI to scan your PATH&lt;/strong&gt; and &lt;strong&gt;run&lt;/strong&gt; it like a native &lt;strong&gt;command&lt;/strong&gt;. &lt;br&gt;&lt;br&gt;
&lt;strong&gt;Use cases:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Custom authentication&lt;/strong&gt;: like &lt;em&gt;kubectl oidc-login&lt;/em&gt; for OIDC (Dex, &lt;strong&gt;Keycloak&lt;/strong&gt;, Azure AD,etc.) 
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inspection&lt;/strong&gt;: &lt;em&gt;kubectl tree&lt;/em&gt;, &lt;em&gt;ctx&lt;/em&gt; for operators/developers. 
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation&lt;/strong&gt;: &lt;em&gt;kubectl neat&lt;/em&gt; cleans manifests, sort-manifests optimizes. 
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: &lt;em&gt;kubectl who-can&lt;/em&gt; for access audits, &lt;em&gt;trace&lt;/em&gt; for runtime analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Benefit&lt;/strong&gt;: Bridges basic kubectl to robust production setups.&lt;/p&gt;
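&lt;p&gt;To make the naming convention concrete, here is a minimal toy plugin (the plugin name &lt;em&gt;hello&lt;/em&gt; and the &lt;em&gt;~/bin&lt;/em&gt; location are just examples):&lt;/p&gt;

```shell
# A plugin is simply an executable on your PATH whose name starts
# with "kubectl-". Create one:
mkdir -p "$HOME/bin"
printf '#!/bin/sh\necho "Hello from a kubectl plugin!"\n' \
  > "$HOME/bin/kubectl-hello"
chmod +x "$HOME/bin/kubectl-hello"
export PATH="$HOME/bin:$PATH"

# With kubectl installed, `kubectl hello` now dispatches to this script.
# Since a plugin is an ordinary executable, it also runs standalone:
kubectl-hello
```

&lt;p&gt;That is the whole mechanism: no registration, no configuration file, just an executable discovered via PATH.&lt;/p&gt;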


&lt;h3&gt;
  
  
  When to Use (or Not)?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Use&lt;/strong&gt;: to fill gaps in the core tool, boosting efficiency and speed. &lt;br&gt;&lt;br&gt;
&lt;strong&gt;Not&lt;/strong&gt;: when the built-in commands are enough, and avoid untrusted plugins (&lt;strong&gt;plugins run with your kubectl permissions&lt;/strong&gt;, a potential security risk). For scripting, alternatives such as SDKs are often the better fit.&lt;/p&gt;


&lt;h3&gt;
  
  
  What Happens When a Plugin Is Missing? (a real use-case)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Real example&lt;/strong&gt; on &lt;strong&gt;Ubuntu&lt;/strong&gt; 24.04 LTS &lt;strong&gt;with&lt;/strong&gt; Amazon &lt;strong&gt;EKS&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;I ran this&lt;/strong&gt; command after importing the relevant kubeconfig file and &lt;strong&gt;got an error&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl get nodes - context &amp;lt;YOUR-CLUSTER-CONTEXT&amp;gt; 
error: unknown command "oidc-login" for "kubectl" E1104 07:21:55.470872XXXXX memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://XXXXXXXX.XX.XX-XXXX-X.eks.amazonaws.com/api?timeout=32s\": getting credentials: exec: executable kubectl failed with exit code 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Reason&lt;/strong&gt;? The OIDC plugin (&lt;em&gt;kubectl oidc-login&lt;/em&gt;) wasn’t installed. Without it, Kubernetes can’t fetch OIDC tokens, failing the API call.&lt;/p&gt;




&lt;h4&gt;
  
  
  How to Fix It? 
&lt;/h4&gt;

&lt;p&gt;Use &lt;strong&gt;Krew&lt;/strong&gt;, the plugin manager. &lt;br&gt;&lt;br&gt;
&lt;strong&gt;What is Krew?&lt;/strong&gt; &lt;br&gt;&lt;br&gt;
Krew is the &lt;strong&gt;official plugin manager&lt;/strong&gt; for the kubectl command-line tool. It lets you &lt;strong&gt;discover&lt;/strong&gt;, &lt;strong&gt;install&lt;/strong&gt;, and &lt;strong&gt;manage plugins&lt;/strong&gt; (update) directly on your system, keeping tools organized and current. &lt;/p&gt;

&lt;p&gt;Let’s go and &lt;strong&gt;install Krew&lt;/strong&gt; (follow the official Krew docs or run these commands):&lt;br&gt;
&lt;br&gt;
   &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# One-time install of krew 
set -e 
cd "$(mktemp -d)" 
OS="$(uname | tr '[:upper:]' '[:lower:]' )" 
ARCH="$(uname -m | sed -e 's/x86_64/amd64/' -e 's/aarch64/arm64/' -e 's/armv.*/arm/' )" 
curl -fsSLO "https://github.com/kubernetes-sigs/krew/releases/latest/download/krew-${OS}_${ARCH}.tar.gz" 
tar zxvf "krew-${OS}_${ARCH}.tar.gz" 
./krew-"${OS}_${ARCH}" install krew 
# Make krew plugins available on PATH (persist for your shell) 
echo 'export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"' &amp;gt;&amp;gt; ~/.bashrc 
export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Then run&lt;/strong&gt;: &lt;em&gt;kubectl krew install oidc-login&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;After that&lt;/strong&gt;, kubectl recognizes the plugin, and OIDC authentication should work.&lt;/p&gt;




&lt;h3&gt;
  
  
  Benefits of Using Plugins via Krew 
&lt;/h3&gt;

&lt;p&gt;Krew &lt;strong&gt;simplifies management&lt;/strong&gt; like apt/brew/pip. &lt;br&gt;&lt;br&gt;
&lt;strong&gt;Advantages&lt;/strong&gt;:   &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A central index of trusted plugins, 
&lt;/li&gt;
&lt;li&gt;Easy updates: &lt;em&gt;kubectl krew upgrade&lt;/em&gt;, 
&lt;/li&gt;
&lt;li&gt;Per-user installation, &lt;strong&gt;no root required&lt;/strong&gt;, 
&lt;/li&gt;
&lt;li&gt;Seamless PATH integration.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Security Note!&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Always verify sources&lt;/strong&gt; or &lt;strong&gt;stick to the official Krew Index&lt;/strong&gt; to avoid untrusted binaries, as &lt;strong&gt;plugins&lt;/strong&gt; inherit the same &lt;strong&gt;access permissions as kubectl&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Alternatives to kubectl Plugins 
&lt;/h2&gt;

&lt;p&gt;While plugins enhance functionality flexibly, &lt;strong&gt;alternatives exist&lt;/strong&gt; for extending Kubernetes interaction. &lt;br&gt;&lt;br&gt;
Each &lt;strong&gt;for different purposes&lt;/strong&gt;:   &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Plugins&lt;/strong&gt; → Extend kubectl directly, minimal overhead 
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wrappers&lt;/strong&gt; / SDKs → Ideal for CI/CD automation 
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud CLIs&lt;/strong&gt; → Best for provider-specific operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvvqiz4kk7wixir48v5v2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvvqiz4kk7wixir48v5v2.png" alt="Alternatives to kubectl Plugins" width="786" height="407"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Final Thoughts 
&lt;/h3&gt;

&lt;p&gt;kubectl &lt;strong&gt;plugins&lt;/strong&gt; demonstrate Kubernetes’ extensibility, &lt;strong&gt;customizing the CLI&lt;/strong&gt; for any Kubernetes environment or workflow. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which is the best plugin?&lt;/strong&gt; &lt;br&gt;&lt;br&gt;
No “best”: &lt;strong&gt;find the most suitable for your use cases!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let me &lt;strong&gt;recommend a few&lt;/strong&gt; worth trying: &lt;br&gt;&lt;br&gt;
&lt;strong&gt;For Ops:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;kubectl preq&lt;/strong&gt;: Analyze apps for bugs, misconfigs, anti-patterns. 
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;kubectl tail:&lt;/strong&gt; Real-time log streaming that aggregates logs from multiple pods, more flexible than &lt;em&gt;logs -f&lt;/em&gt;. 
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;kubens&lt;/strong&gt;: Switch namespaces instantly, companion to kubectx. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For Devs:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;kubectl score&lt;/strong&gt;: Static analysis for YAML, lints/validates manifests. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For both:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;kubectl sniff:&lt;/strong&gt; Run “tcpdump” inside Pods with Wireshark, capture network traffic. &lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Learn More!&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes &lt;strong&gt;plugins&lt;/strong&gt;: &lt;a href="https://kubernetes.io/docs/tasks/extend-kubectl/kubectl-plugins/" rel="noopener noreferrer"&gt;Extend kubectl with plugins&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Krew&lt;/strong&gt;: &lt;a href="https://krew.sigs.k8s.io/" rel="noopener noreferrer"&gt;Plugin manager for kubectl&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;kubelogin&lt;/strong&gt;: &lt;a href="https://github.com/int128/kubelogin" rel="noopener noreferrer"&gt;Plugin for Kubernetes OIDC auth&lt;/a&gt;
&lt;/li&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;About the Author&lt;/strong&gt;&lt;br&gt;
I’m &lt;strong&gt;Róbert Zsótér&lt;/strong&gt;, Kubernetes &amp;amp; AWS architect.&lt;br&gt;
If you’re into &lt;strong&gt;Kubernetes, EKS, Terraform&lt;/strong&gt;, and &lt;strong&gt;cloud-native security&lt;/strong&gt;, follow my latest posts here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LinkedIn: &lt;a href="https://www.linkedin.com/in/r%C3%B3bert-zs%C3%B3t%C3%A9r-34541464/" rel="noopener noreferrer"&gt;Róbert Zsótér&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Substack: &lt;a href="https://cloudskillshu.substack.com/" rel="noopener noreferrer"&gt;CSHU&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s build secure, scalable clusters, together.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Originally published on Medium &lt;a href="https://medium.com/@zs77.robert/extending-your-kubernetes-cli-kubectl-plugins-and-fixes-fc5755d0b8d2" rel="noopener noreferrer"&gt;Extending Your Kubernetes CLI: kubectl Plugins and Fixes&lt;/a&gt;&lt;/p&gt;





</description>
      <category>kubernetes</category>
      <category>kubectl</category>
      <category>troubleshooting</category>
      <category>devops</category>
    </item>
    <item>
      <title>Hands-On with Kubernetes 1.33: My PoC on In-Place Vertical Scaling</title>
      <dc:creator>Robert Zsoter</dc:creator>
      <pubDate>Tue, 11 Nov 2025 08:04:00 +0000</pubDate>
      <link>https://dev.to/robert_r_7c237256b7614328/hands-on-with-kubernetes-133-my-poc-on-in-place-vertical-scaling-147</link>
      <guid>https://dev.to/robert_r_7c237256b7614328/hands-on-with-kubernetes-133-my-poc-on-in-place-vertical-scaling-147</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Kubernetes 1.33 &lt;strong&gt;introduced&lt;/strong&gt; a game-changing feature: &lt;strong&gt;&lt;em&gt;in-place vertical pod scaling&lt;/em&gt;&lt;/strong&gt;, now in beta and &lt;strong&gt;enabled by default&lt;/strong&gt;. As a cloud engineer, I was eager to &lt;strong&gt;test its ability&lt;/strong&gt; to &lt;strong&gt;dynamically adjust CPU&lt;/strong&gt; and memory for running pods &lt;strong&gt;without restarts&lt;/strong&gt;, a potential win for &lt;strong&gt;resource optimization&lt;/strong&gt;. Inspired by its promise, I set up a Proof of Concept (&lt;strong&gt;PoC&lt;/strong&gt;) on &lt;strong&gt;AWS EC2&lt;/strong&gt; using &lt;strong&gt;Minikube&lt;/strong&gt; to explore its practical applications.&lt;/p&gt;

&lt;p&gt;In this guide, &lt;strong&gt;I’ll walk you&lt;/strong&gt; through my &lt;strong&gt;step-by-step&lt;/strong&gt; process, from cluster setup to scaling automation, and share &lt;strong&gt;insights&lt;/strong&gt; for leveraging this feature in your own test environments.&lt;br&gt;&lt;br&gt;
Let’s dive in!&lt;/p&gt;


&lt;h2&gt;
  
  
  What is In-Place Vertical Pod Scaling?
&lt;/h2&gt;

&lt;p&gt;Kubernetes &lt;strong&gt;1.33&lt;/strong&gt; brings &lt;strong&gt;&lt;em&gt;in-place vertical pod scaling&lt;/em&gt;&lt;/strong&gt; as a default feature that lets you &lt;strong&gt;&lt;em&gt;adjust a pod’s CPU and memory resources on the fly&lt;/em&gt;&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
Unlike traditional scaling, this approach &lt;strong&gt;avoids pod restarts&lt;/strong&gt;, making it ideal for maintaining &lt;strong&gt;application availability&lt;/strong&gt;. I was curious about its potential to optimize my Customer’s application efficiently.&lt;br&gt;&lt;br&gt;
This guide will explain &lt;strong&gt;how&lt;/strong&gt; this works and &lt;strong&gt;why&lt;/strong&gt; it matters for your Kubernetes journey.&lt;/p&gt;
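&lt;p&gt;As a preview of what this looks like in practice (a sketch only: the pod name &lt;em&gt;nginx-pod&lt;/em&gt;, container name &lt;em&gt;nginx&lt;/em&gt;, and resource values are made up, and this requires a cluster with the feature enabled), a running pod’s resources are changed through the &lt;em&gt;resize&lt;/em&gt; subresource:&lt;/p&gt;

```shell
# Resize CPU in place via the "resize" subresource (kubectl v1.32+):
kubectl patch pod nginx-pod --subresource resize --patch \
  '{"spec":{"containers":[{"name":"nginx","resources":{"requests":{"cpu":"800m"},"limits":{"cpu":"800m"}}}]}}'

# The pod keeps running; verify the applied values:
kubectl get pod nginx-pod -o jsonpath='{.spec.containers[0].resources}'
```

&lt;p&gt;We will build up to exactly this kind of operation in the PoC below.&lt;/p&gt;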


&lt;h2&gt;
  
  
  Prerequisites for the PoC
&lt;/h2&gt;

&lt;p&gt;Before diving into my PoC, you’ll need the &lt;strong&gt;right tools&lt;/strong&gt; to replicate my setup on &lt;strong&gt;AWS EC2&lt;/strong&gt;. This includes &lt;strong&gt;installing Minikube&lt;/strong&gt; with Kubernetes &lt;strong&gt;1.33&lt;/strong&gt; and setting up the &lt;strong&gt;Metrics Server&lt;/strong&gt; for resource monitoring. I ensured these were in place to make the scaling process smooth and measurable.&lt;br&gt;&lt;br&gt;
Let’s cover the essentials to get you started!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
I don’t want to go into &lt;strong&gt;how to install a Minikube&lt;/strong&gt; cluster: there are lots of good tutorials on the Internet. It is a very simple step.&lt;/p&gt;



&lt;p&gt;In my PoC, I used an &lt;strong&gt;EC2&lt;/strong&gt; instance on &lt;strong&gt;AWS&lt;/strong&gt;, but this is &lt;strong&gt;not a limitation&lt;/strong&gt;: you &lt;strong&gt;can use&lt;/strong&gt; any &lt;strong&gt;other&lt;/strong&gt; environment, there is no dependency in this respect.&lt;br&gt;&lt;br&gt;
For the &lt;strong&gt;Minikube instance&lt;/strong&gt; I chose an EC2 &lt;strong&gt;instance type&lt;/strong&gt; (t3a.medium) with the following “hardware” parameters: vCPU: 2, RAM: 4 GiB, storage: 40 GB (with the default storage type)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;More&lt;/strong&gt; information: &lt;a href="https://aws.amazon.com/ec2/instance-types/t3/" rel="noopener noreferrer"&gt;https://aws.amazon.com/ec2/instance-types/t3/&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Setting Up the Test Environment
&lt;/h2&gt;

&lt;p&gt;Creating a solid test environment was the &lt;strong&gt;first step&lt;/strong&gt; in my PoC journey with &lt;strong&gt;Kubernetes 1.33&lt;/strong&gt;. I &lt;strong&gt;set up a Minikube&lt;/strong&gt; cluster on AWS EC2 and &lt;strong&gt;deployed a Metrics Server&lt;/strong&gt; to track pod resources. This setup allowed me to test the vertical scaling feature with a sample Nginx pod.&lt;/p&gt;

&lt;p&gt;Here, I’ll guide you through building this foundation.&lt;/p&gt;


&lt;h3&gt;
  
  
  Start a Minikube cluster
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: There are &lt;strong&gt;two key points&lt;/strong&gt; here.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; We need to use &lt;strong&gt;containerd&lt;/strong&gt; instead of Docker.
The reason: there &lt;strong&gt;was a problem with resizing the Pod when I used Docker&lt;/strong&gt;. If you need &lt;strong&gt;more information&lt;/strong&gt; about this issue, &lt;strong&gt;contact&lt;/strong&gt; me; I won’t go into detail here, to keep this story from getting too long.&lt;/li&gt;
&lt;li&gt; I should &lt;strong&gt;also mention&lt;/strong&gt; that after extensive testing and checking, I found that although this is &lt;strong&gt;supposed to be an enabled&lt;/strong&gt; feature in this version, it &lt;strong&gt;works a bit “differently”&lt;/strong&gt; in Minikube: it is not enabled for all components.
&lt;strong&gt;The easiest way&lt;/strong&gt; in this case is to &lt;strong&gt;start&lt;/strong&gt; the cluster with an &lt;strong&gt;additional&lt;/strong&gt; command-line option: “&lt;em&gt;--feature-gates=InPlacePodVerticalScaling=true&lt;/em&gt;”&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Start&lt;/strong&gt; the Minikube:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;minikube start --kubernetes-version=v1.33.1 --container-runtime=containerd --feature-gates=InPlacePodVerticalScaling=true&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before we go deeper,&lt;/strong&gt; make sure that your &lt;strong&gt;Minikube cluster is up and running&lt;/strong&gt; and you are &lt;strong&gt;interacting with this cluster&lt;/strong&gt; (&lt;em&gt;kubeconfig&lt;/em&gt; is configured for the Minikube)&lt;/p&gt;

&lt;p&gt;&lt;code&gt;minikube status  &lt;br&gt;
alias k=kubectl  &lt;br&gt;
k config current-context&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If&lt;/strong&gt; your &lt;em&gt;kubeconfig&lt;/em&gt; &lt;strong&gt;is not configured&lt;/strong&gt; correctly run this command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl config use-context minikube  &lt;br&gt;
k get nodes&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Expected output is:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;NAME       STATUS   ROLES           AGE     VERSION  &lt;br&gt;
minikube   Ready    control-plane   5m30s   v1.33.1&lt;/code&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Enable Metrics server
&lt;/h3&gt;

&lt;p&gt;We need metrics, of course. We &lt;strong&gt;can install&lt;/strong&gt; the Metrics Server by downloading its manifest file and applying it (using the &lt;em&gt;kubectl apply -f&lt;/em&gt; command), but the easiest way is to &lt;strong&gt;enable it&lt;/strong&gt; as an addon:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;minikube addons enable metrics-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected &lt;strong&gt;output&lt;/strong&gt; is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* metrics-server is an addon maintained by Kubernetes. For any concerns contact minikube on GitHub.
You can view the list of minikube maintainers at: https://github.com/kubernetes/minikube/blob/master/OWNERS
  - Using image registry.k8s.io/metrics-server/metrics-server:v0.7.2
* The 'metrics-server' addon is enabled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Wait a few minutes&lt;/strong&gt; and make sure that you have metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;k top pods -A
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected &lt;strong&gt;output&lt;/strong&gt; is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAMESPACE     NAME                               CPU(cores)   MEMORY(bytes)  
kube-system   coredns-674b8bbfcf-m7jqr           2m           12Mi  
kube-system   etcd-minikube                      16m          26Mi  
kube-system   kindnet-2n29r                      1m           7Mi  
kube-system   kube-apiserver-minikube            33m          205Mi  
kube-system   kube-controller-manager-minikube   14m          41Mi  
kube-system   kube-proxy-jh6gm                   1m           11Mi  
kube-system   kube-scheduler-minikube            7m           19Mi  
kube-system   metrics-server-7fbb699795-bsb8v    3m           15Mi  
kube-system   storage-provisioner                2m           7Mi  
test          test-pod                           0m           2Mi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Create namespace and manifest file
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Next&lt;/strong&gt; create a new &lt;strong&gt;namespace&lt;/strong&gt; and &lt;strong&gt;start a Pod&lt;/strong&gt; within it:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl create namespace test&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create&lt;/strong&gt; a file named “&lt;em&gt;test-pod.yaml&lt;/em&gt;” and &lt;strong&gt;save&lt;/strong&gt; it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1  
kind: Pod  
metadata:  
  name: test-pod  
  namespace: test  
spec:  
  containers:  
  - name: nginx  
    image: nginx  
    resources:  
      requests:  
        cpu: "200m"  
        memory: "128Mi"  
      limits:  
        cpu: "500m"  
        memory: "256Mi"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and apply it: &lt;br&gt;
&lt;code&gt;kubectl apply -f test-pod.yaml&lt;/code&gt;&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: don’t forget that you should &lt;strong&gt;match your kubectl version with the running Kubernetes version&lt;/strong&gt;! (There is some tolerance for skew, but I always suggest keeping them at the same version level.)&lt;/p&gt;



&lt;p&gt;That means that you should &lt;strong&gt;avoid a similar situation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ kubectl version&lt;br&gt;
Client Version: v1.28.3&lt;br&gt;
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3&lt;br&gt;
Server Version: v1.33.1&lt;br&gt;
WARNING: version difference between client (1.28) and server (1.33) exceeds the supported minor version skew of +/-1&lt;/code&gt;&lt;/p&gt;
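&lt;p&gt;As a quick sanity check, you can compare the client and server minor versions yourself. A minimal sketch in plain bash; the version strings below are illustrative (in practice, take them from the &lt;em&gt;kubectl version&lt;/em&gt; output):&lt;/p&gt;

```shell
#!/bin/bash
# Illustrative version strings; in practice extract them from `kubectl version`.
client_version="v1.28.3"
server_version="v1.33.1"

# Take the middle field of vMAJOR.MINOR.PATCH.
client_minor=$(echo "$client_version" | cut -d. -f2)
server_minor=$(echo "$server_version" | cut -d. -f2)

# Absolute minor-version skew
skew=$(( server_minor - client_minor ))
if [ "$skew" -lt 0 ]; then skew=$(( -skew )); fi

if [ "$skew" -gt 1 ]; then
  echo "WARNING: minor version skew of $skew exceeds the supported +/-1"
else
  echo "Version skew OK ($skew)"
fi
```

&lt;p&gt;With the example values above, the skew is 5, so the script prints the warning, mirroring the kubectl message.&lt;/p&gt;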

&lt;p&gt;&lt;strong&gt;Confirm&lt;/strong&gt; that the test pod is up and &lt;strong&gt;running&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;k get pods -ntest&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Expected &lt;strong&gt;output&lt;/strong&gt; is:&lt;br&gt;
&lt;code&gt;NAME       READY   STATUS    RESTARTS   AGE&lt;br&gt;
test-pod   1/1     Running   0          17m&lt;/code&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Testing manual resizing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Validate&lt;/strong&gt; that the resizing works well: first, we will test it &lt;strong&gt;manually&lt;/strong&gt;. Later, we will move forward and &lt;strong&gt;automate&lt;/strong&gt; these steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First&lt;/strong&gt;, run these commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl patch pod test-pod -n test --subresource resize --patch '{"spec": {"containers": [{"name": "nginx", "resources": {"requests": {"cpu": "300m", "memory": "256Mi"}, "limits": {"cpu": "600m", "memory": "512Mi"}}}]}}'
kubectl get pod test-pod -n test -o yaml
kubectl get pod test-pod -n test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
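&lt;p&gt;For longer patches, it can be easier to keep the JSON in a file and pass it with &lt;em&gt;--patch-file&lt;/em&gt; instead of an inline string. A minimal sketch using the same values as the manual test above (the file path is just an example):&lt;/p&gt;

```shell
#!/bin/bash
# Write the resize patch to a file (values match the manual patch above).
cat <<'EOF' > /tmp/resize-patch.json
{
  "spec": {
    "containers": [
      {
        "name": "nginx",
        "resources": {
          "requests": {"cpu": "300m", "memory": "256Mi"},
          "limits": {"cpu": "600m", "memory": "512Mi"}
        }
      }
    ]
  }
}
EOF

# Then apply it with:
#   kubectl patch pod test-pod -n test --subresource resize --patch-file /tmp/resize-patch.json
echo "Patch written to /tmp/resize-patch.json"
```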



&lt;p&gt;&lt;strong&gt;Make sure&lt;/strong&gt; that the pod wasn’t restarted and &lt;strong&gt;limits&lt;/strong&gt; and &lt;strong&gt;requests&lt;/strong&gt; are &lt;strong&gt;updated&lt;/strong&gt; correctly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pod test-pod -n test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;NAME       READY   STATUS    RESTARTS   AGE  &lt;br&gt;
test-pod   1/1     Running   0          30m&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;k get pods -ntest -oyaml |grep -i -E "limits|requests"  -A4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected &lt;strong&gt;output&lt;/strong&gt; is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt; &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"apiVersion"&lt;/span&gt;:&lt;span class="s2"&gt;"v1"&lt;/span&gt;,&lt;span class="s2"&gt;"kind"&lt;/span&gt;:&lt;span class="s2"&gt;"Pod"&lt;/span&gt;,&lt;span class="s2"&gt;"metadata"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"annotations"&lt;/span&gt;:&lt;span class="o"&gt;{}&lt;/span&gt;,&lt;span class="s2"&gt;"name"&lt;/span&gt;:&lt;span class="s2"&gt;"test-pod"&lt;/span&gt;,&lt;span class="s2"&gt;"namespace"&lt;/span&gt;:&lt;span class="s2"&gt;"test"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;,&lt;span class="s2"&gt;"spec"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"containers"&lt;/span&gt;:&lt;span class="se"&gt;\[&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"image"&lt;/span&gt;:&lt;span class="s2"&gt;"nginx"&lt;/span&gt;,&lt;span class="s2"&gt;"name"&lt;/span&gt;:&lt;span class="s2"&gt;"nginx"&lt;/span&gt;,&lt;span class="s2"&gt;"resources"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"limits"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"cpu"&lt;/span&gt;:&lt;span class="s2"&gt;"500m"&lt;/span&gt;,&lt;span class="s2"&gt;"memory"&lt;/span&gt;:&lt;span class="s2"&gt;"256Mi"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;,&lt;span class="s2"&gt;"requests"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"cpu"&lt;/span&gt;:&lt;span class="s2"&gt;"200m"&lt;/span&gt;,&lt;span class="s2"&gt;"memory"&lt;/span&gt;:&lt;span class="s2"&gt;"128Mi"&lt;/span&gt;&lt;span class="o"&gt;}}}&lt;/span&gt;&lt;span class="se"&gt;\]&lt;/span&gt;&lt;span class="o"&gt;}}&lt;/span&gt;  
    creationTimestamp: &lt;span class="s2"&gt;"2025-05-28T11:54:23Z"&lt;/span&gt;  
    generation: 2  
    name: test-pod  
    namespace: &lt;span class="nb"&gt;test  

        &lt;/span&gt;limits:  
          cpu: 600m  
          memory: 512Mi  
        requests:  
          cpu: 300m  
          memory: 256Mi  
      terminationMessagePath: /dev/termination-log  
      terminationMessagePolicy: File  

        limits:  
          cpu: 600m  
          memory: 512Mi  
        requests:  
          cpu: 300m  
          memory: 256Mi  
      restartCount: 0  
      started: &lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
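&lt;p&gt;The &lt;em&gt;grep -i -E "limits|requests" -A4&lt;/em&gt; filter used above simply prints each matching line plus the four lines that follow it (overlapping ranges are merged). A small self-contained illustration on an inline YAML fragment with sample values:&lt;/p&gt;

```shell
#!/bin/bash
# Sample fragment of pod YAML (values are illustrative).
yaml='    limits:
      cpu: 600m
      memory: 512Mi
    requests:
      cpu: 300m
      memory: 256Mi
    restartCount: 0'

# Print each matching line and the 4 lines after it.
echo "$yaml" | grep -i -E "limits|requests" -A4
```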






&lt;h2&gt;
  
  
  Configuring Access and Permissions
&lt;/h2&gt;

&lt;p&gt;Set up &lt;strong&gt;RBAC&lt;/strong&gt; permissions to enable your monitoring script to &lt;strong&gt;access pod metrics&lt;/strong&gt; and perform resize operations. This is necessary in our case because it supports our automation goal (using a &lt;em&gt;CronJob&lt;/em&gt; with the &lt;em&gt;monitor.sh&lt;/em&gt; script to &lt;strong&gt;resize pods based on resource usage&lt;/strong&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First&lt;/strong&gt;, we need to “restore” the previous state of our test pod (we already resized it):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Delete&lt;/strong&gt; the pod and &lt;strong&gt;recreate&lt;/strong&gt; it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;k delete pod test-pod &lt;span class="nt"&gt;-ntest&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; k apply &lt;span class="nt"&gt;-f&lt;/span&gt; test-pod.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;pod/test-pod created&lt;/code&gt;&lt;br&gt;
and&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;k get pods &lt;span class="nt"&gt;-ntest&lt;/span&gt; &lt;span class="nt"&gt;-oyaml&lt;/span&gt; |grep &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s2"&gt;"limits|requests"&lt;/span&gt;  &lt;span class="nt"&gt;-A4&lt;/span&gt;     
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected &lt;strong&gt;output&lt;/strong&gt; is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;        &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"apiVersion"&lt;/span&gt;:&lt;span class="s2"&gt;"v1"&lt;/span&gt;,&lt;span class="s2"&gt;"kind"&lt;/span&gt;:&lt;span class="s2"&gt;"Pod"&lt;/span&gt;,&lt;span class="s2"&gt;"metadata"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"annotations"&lt;/span&gt;:&lt;span class="o"&gt;{}&lt;/span&gt;,&lt;span class="s2"&gt;"name"&lt;/span&gt;:&lt;span class="s2"&gt;"test-pod"&lt;/span&gt;,&lt;span class="s2"&gt;"namespace"&lt;/span&gt;:&lt;span class="s2"&gt;"test"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;,&lt;span class="s2"&gt;"spec"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"containers"&lt;/span&gt;:[&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"image"&lt;/span&gt;:&lt;span class="s2"&gt;"nginx"&lt;/span&gt;,&lt;span class="s2"&gt;"name"&lt;/span&gt;:&lt;span class="s2"&gt;"nginx"&lt;/span&gt;,&lt;span class="s2"&gt;"resources"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"limits"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"cpu"&lt;/span&gt;:&lt;span class="s2"&gt;"500m"&lt;/span&gt;,&lt;span class="s2"&gt;"memory"&lt;/span&gt;:&lt;span class="s2"&gt;"256Mi"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;,&lt;span class="s2"&gt;"requests"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"cpu"&lt;/span&gt;:&lt;span class="s2"&gt;"200m"&lt;/span&gt;,&lt;span class="s2"&gt;"memory"&lt;/span&gt;:&lt;span class="s2"&gt;"128Mi"&lt;/span&gt;&lt;span class="o"&gt;}}}]}}&lt;/span&gt;
    creationTimestamp: &lt;span class="s2"&gt;"2025-05-28T11:54:23Z"&lt;/span&gt;
    generation: 1
    name: test-pod
    namespace: &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;span class="nt"&gt;--&lt;/span&gt;
        limits:
          cpu: 500m
          memory: 256Mi
        requests:
          cpu: 200m
          memory: 128Mi
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
&lt;span class="nt"&gt;--&lt;/span&gt;
        limits:
          cpu: 500m
          memory: 256Mi
        requests:
          cpu: 200m
          memory: 128Mi
      restartCount: 0
      started: &lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;strong&gt;Next&lt;/strong&gt;, &lt;strong&gt;create&lt;/strong&gt; the necessary Kubernetes objects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create and save&lt;/strong&gt; these manifest files and &lt;strong&gt;apply&lt;/strong&gt; them:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Name&lt;/strong&gt;: &lt;em&gt;pod-scaler-sa.yaml&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1  
kind: ServiceAccount  
metadata:  
  name: pod-scaler  
  namespace: test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and&lt;br&gt;
&lt;strong&gt;Name&lt;/strong&gt;: &lt;em&gt;pod-scaler-role.yaml&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-scaler-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "patch"]
- apiGroups: [""]
  resources: ["pods/resize"]
  verbs: ["get", "patch"]
  resourceNames: ["test-pod"]
- apiGroups: ["metrics.k8s.io"]
  resources: ["pods", "podmetrics"]
  verbs: ["get", "list"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and&lt;br&gt;
&lt;strong&gt;Name&lt;/strong&gt;: &lt;em&gt;pod-scaler-binding.yaml&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: pod-scaler-binding
subjects:
- kind: ServiceAccount
  name: pod-scaler
  namespace: test
roleRef:
  kind: ClusterRole
  name: pod-scaler-role
  apiGroup: rbac.authorization.k8s.io
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Automating Scaling with a Monitor Script
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Next&lt;/strong&gt;, we will &lt;strong&gt;verify the RBAC&lt;/strong&gt; setup from inside a pod, to simulate the environment where &lt;em&gt;monitor.sh&lt;/em&gt; will run in the CronJob.&lt;br&gt;&lt;br&gt;
This &lt;strong&gt;ensures&lt;/strong&gt; that the &lt;em&gt;ServiceAccount token&lt;/em&gt; is correctly mounted and &lt;strong&gt;can access&lt;/strong&gt; the Kubernetes &lt;strong&gt;API&lt;/strong&gt; (both metrics.k8s.io for metrics and pods/resize for resizing) from within a pod, mimicking the CronJob’s runtime behaviour.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is critical to confirm the automation will work&lt;/strong&gt; end-to-end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First&lt;/strong&gt;, &lt;strong&gt;create&lt;/strong&gt; a new pod, named “&lt;em&gt;test-access&lt;/em&gt;”:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Name&lt;/strong&gt;: “&lt;em&gt;test-access-pod.yaml&lt;/em&gt;”&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: test-access
  namespace: test
spec:
  serviceAccountName: pod-scaler
  containers:
  - name: test
    image: bitnami/kubectl:1.33
    command: ["sleep", "infinity"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Apply&lt;/strong&gt; it and &lt;strong&gt;make sure&lt;/strong&gt; that both pods are up and running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;k apply -f test-access-pod.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;k get pods -ntest  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;NAME          READY   STATUS    RESTARTS   AGE  &lt;br&gt;
test-access   1/1     Running   0          8s  &lt;br&gt;
test-pod      1/1     Running   0          66m&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validate&lt;/strong&gt; it:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffd8k4a8dg9srwbwvbepi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffd8k4a8dg9srwbwvbepi.png" alt="Validate RBAC permission and patch the Pod" width="800" height="197"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Availability of metrics
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Jump&lt;/strong&gt; into the pod and &lt;strong&gt;run&lt;/strong&gt; these commands to &lt;strong&gt;make sure&lt;/strong&gt; you get metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl exec -it test-access -n test -- bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and within the pod run the commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apt-get update &amp;amp;&amp;amp; apt-get install -y curl
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -sSk -H "Authorization: Bearer $TOKEN" https://kubernetes.default.svc/apis/metrics.k8s.io/v1beta1/namespaces/test/pods/test-pod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected &lt;strong&gt;output&lt;/strong&gt; is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Get:1 http://deb.debian.org/debian bookworm InRelease [151 kB]
Get:2 http://deb.debian.org/debian bookworm-updates InRelease [55.4 kB]
Get:3 http://deb.debian.org/debian-security bookworm-security InRelease [48.0 kB]
Get:4 http://deb.debian.org/debian bookworm/main amd64 Packages [8793 kB]
Get:5 http://deb.debian.org/debian bookworm-updates/main amd64 Packages [512 B]
Get:6 http://deb.debian.org/debian-security bookworm-security/main amd64 Packages [261 kB]
Fetched 9309 kB in 3s (3079 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
curl is already the newest version (7.88.1-10+deb12u12).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
{
  "kind": "PodMetrics",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {
    "name": "test-pod",
    "namespace": "test",
    "creationTimestamp": "2025-05-28T13:01:02Z"
  },
  "timestamp": "2025-05-28T12:59:54Z",
  "window": "1m4.584s",
  "containers": [
    {
      "name": "nginx",
      "usage": {
        "cpu": "0",
        "memory": "3004Ki"
      }
    }
  ]
}root@test-access:/#
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9rvttmi9ypd58rju0wo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9rvttmi9ypd58rju0wo.png" alt="Get metrics using TOKEN variable within the Pod" width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Test Pod resizing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Jump&lt;/strong&gt; into the &lt;strong&gt;2nd&lt;/strong&gt; pod again, the one &lt;strong&gt;named&lt;/strong&gt; “&lt;em&gt;test-access&lt;/em&gt;”, and &lt;strong&gt;run&lt;/strong&gt; these commands &lt;strong&gt;inside&lt;/strong&gt; it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
kubectl --token=$TOKEN patch pod test-pod -n test --subresource resize --patch '{"spec": {"containers": [{"name": "nginx", "resources": {"requests": {"cpu": "400m", "memory": "384Mi"}, "limits": {"cpu": "800m", "memory": "768Mi"}}}]}}'
kubectl --token=$TOKEN get pod test-pod -n test -o jsonpath='{.spec.containers[0].resources}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Like this:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgj3bs2bhvjnyn389tijd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgj3bs2bhvjnyn389tijd.png" alt="Calling API endpoint from the Pod" width="800" height="387"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl exec -it test-access -n test -- bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and within the pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
kubectl --token=$TOKEN patch pod test-pod -n test --subresource resize --patch '{"spec": {"containers": [{"name": "nginx", "resources": {"requests": {"cpu": "400m", "memory": "384Mi"}, "limits": {"cpu": "800m", "memory": "768Mi"}}}]}}'
kubectl --token=$TOKEN get pod test-pod -n test -o jsonpath='{.spec.containers[0].resources}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected &lt;strong&gt;output&lt;/strong&gt; is:&lt;br&gt;
&lt;code&gt;pod/test-pod patched&lt;br&gt;
{"limits":{"cpu":"800m","memory":"768Mi"},"requests":{"cpu":"400m","memory":"384Mi"}}&lt;br&gt;
I have no name!@test-access:/$&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;and&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -sSk -H "Authorization: Bearer $TOKEN" https://kubernetes.default.svc/apis/metrics.k8s.io/v1beta1/namespaces/test/pods/test-pod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected &lt;strong&gt;output&lt;/strong&gt; is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "kind": "PodMetrics",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {
    "name": "test-pod",
    "namespace": "test",
    "creationTimestamp": "2025-05-28T13:24:42Z"
  },
  "timestamp": "2025-05-28T13:24:01Z",
  "window": "1m20.148s",
  "containers": [
    {
      "name": "nginx",
      "usage": {
        "cpu": "0",
        "memory": "3004Ki"
      }
    }
  ]
}I have no name!@test-access:/$ exit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Check&lt;/strong&gt; the pods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;k get pods -ntest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;NAME          READY   STATUS    RESTARTS   AGE&lt;br&gt;
test-access   1/1     Running   0          9m59s&lt;br&gt;
test-pod      1/1     Running   0          91m&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You can &lt;strong&gt;see&lt;/strong&gt; that the affected pod’s configuration has been &lt;strong&gt;patched&lt;/strong&gt; successfully and it &lt;strong&gt;wasn’t restarted&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check&lt;/strong&gt; it again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;k get pods test-pod &lt;span class="nt"&gt;-ntest&lt;/span&gt; &lt;span class="nt"&gt;-oyaml&lt;/span&gt; |grep &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s2"&gt;"limits|requests"&lt;/span&gt;  &lt;span class="nt"&gt;-A4&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected &lt;strong&gt;output&lt;/strong&gt; is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
  {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"test-pod","namespace":"test"},"spec":{"containers":[{"image":"nginx","name":"nginx","resources":{"limits":{"cpu":"500m","memory":"256Mi"},"requests":{"cpu":"200m","memory":"128Mi"}}}]}}
  creationTimestamp: "2025-05-28T13:55:30Z"
  generation: 2
  name: test-pod
  namespace: test
--
      limits:
        cpu: 800m
        memory: 768Mi
      requests:
        cpu: 400m
        memory: 384Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
--
      limits:
        cpu: 800m
        memory: 768Mi
      requests:
        cpu: 400m
        memory: 384Mi
    restartCount: 0
    started: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt; we move on, we need to “&lt;strong&gt;restore&lt;/strong&gt;” the &lt;strong&gt;original values&lt;/strong&gt; of the affected pod. &lt;strong&gt;Delete&lt;/strong&gt; and &lt;strong&gt;recreate&lt;/strong&gt; it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;k delete &lt;span class="nt"&gt;-f&lt;/span&gt; test-pod.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;pod "test-pod" deleted&lt;/code&gt;&lt;br&gt;
and&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;k apply -f test-pod.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;pod/test-pod created&lt;/code&gt;&lt;br&gt;
and&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;k get pods -ntest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;NAME          READY   STATUS    RESTARTS   AGE&lt;br&gt;
test-access   1/1     Running   0          48m&lt;br&gt;
test-pod      1/1     Running   0          8m41s&lt;/code&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Create a monitor.sh script file
&lt;/h3&gt;

&lt;p&gt;We need a Linux shell &lt;strong&gt;script&lt;/strong&gt;, so let’s create it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: &lt;strong&gt;Create&lt;/strong&gt; a &lt;em&gt;monitor.sh&lt;/em&gt; script to check test-pod’s &lt;strong&gt;resource usage&lt;/strong&gt; via &lt;em&gt;metrics.k8s.io&lt;/em&gt; and &lt;strong&gt;resize&lt;/strong&gt; it if usage &lt;strong&gt;exceeds thresholds&lt;/strong&gt; (e.g., CPU &amp;gt; 80% of request). We will use this script via a &lt;strong&gt;CronJob resource,&lt;/strong&gt; with the pod-scaler &lt;em&gt;ServiceAccount&lt;/em&gt;.&lt;/p&gt;
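&lt;p&gt;Before writing the script, it helps to be clear about the units: &lt;em&gt;metrics.k8s.io&lt;/em&gt; reports CPU usage in nanocores, while requests are usually expressed in millicores (1 millicore = 1,000,000 nanocores). A minimal sketch of the threshold arithmetic the script relies on; the request and usage values below are illustrative:&lt;/p&gt;

```shell
#!/bin/bash
# Threshold arithmetic used by the monitoring logic (illustrative values).
request_millicores=400   # the pod's CPU request: 400m
threshold_percent=80     # resize when usage exceeds 80% of the request

# 1 millicore = 1,000,000 nanocores, so 80% of 400m = 320,000,000n
threshold_nanocores=$(( request_millicores * 1000000 * threshold_percent / 100 ))
echo "Threshold: $threshold_nanocores nanocores"

# Example usage value as returned by the metrics API (e.g. "350000000n")
usage_raw="350000000n"
usage_nanocores=${usage_raw%n}   # strip the trailing "n" suffix

if [ "$usage_nanocores" -gt "$threshold_nanocores" ]; then
  echo "Usage exceeds threshold - a resize would be triggered"
fi
```

&lt;p&gt;This is exactly why the script below sets &lt;em&gt;CPU_THRESHOLD=320000000&lt;/em&gt; and strips the trailing “n” from the parsed usage before comparing.&lt;/p&gt;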

&lt;p&gt;&lt;strong&gt;Name&lt;/strong&gt;: &lt;em&gt;“monitor.sh&lt;/em&gt;”&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Content&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;

&lt;span class="c"&gt;# Install curl, jq, and kubectl&lt;/span&gt;
apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; curl jq wget
wget &lt;span class="nt"&gt;-q&lt;/span&gt; https://dl.k8s.io/release/v1.33.0/bin/linux/amd64/kubectl
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x kubectl
&lt;span class="nb"&gt;mv &lt;/span&gt;kubectl /usr/local/bin/

&lt;span class="c"&gt;# Configuration&lt;/span&gt;
&lt;span class="nv"&gt;POD_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"test-pod"&lt;/span&gt;
&lt;span class="nv"&gt;NAMESPACE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"test"&lt;/span&gt;
&lt;span class="nv"&gt;CPU_THRESHOLD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;320000000 &lt;span class="c"&gt;# 80% of 400m (in nanocores)&lt;/span&gt;
&lt;span class="nv"&gt;NEW_REQUESTS_CPU&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"600m"&lt;/span&gt;
&lt;span class="nv"&gt;NEW_REQUESTS_MEMORY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"512Mi"&lt;/span&gt;
&lt;span class="nv"&gt;NEW_LIMITS_CPU&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"1200m"&lt;/span&gt;
&lt;span class="nv"&gt;NEW_LIMITS_MEMORY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"1024Mi"&lt;/span&gt;

&lt;span class="c"&gt;# Get metrics&lt;/span&gt;
&lt;span class="nv"&gt;TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /var/run/secrets/kubernetes.io/serviceaccount/token&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;METRICS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-sSk&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; https://kubernetes.default.svc/apis/metrics.k8s.io/v1beta1/namespaces/&lt;span class="nv"&gt;$NAMESPACE&lt;/span&gt;/pods/&lt;span class="nv"&gt;$POD_NAME&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Parse CPU usage (in nanocores)&lt;/span&gt;
&lt;span class="nv"&gt;CPU_USAGE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$METRICS&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.containers[0].usage.cpu'&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s/n$//'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CPU_USAGE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Error: Could not retrieve CPU usage"&lt;/span&gt;
  &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi

&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"CPU Usage: &lt;/span&gt;&lt;span class="nv"&gt;$CPU_USAGE&lt;/span&gt;&lt;span class="s2"&gt; nanocores, Threshold: &lt;/span&gt;&lt;span class="nv"&gt;$CPU_THRESHOLD&lt;/span&gt;&lt;span class="s2"&gt; nanocores"&lt;/span&gt;

&lt;span class="c"&gt;# Check if CPU usage exceeds threshold&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CPU_USAGE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-gt&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CPU_THRESHOLD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"CPU usage exceeds threshold, resizing pod..."&lt;/span&gt;
  &lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; &amp;gt; patch.json
{
  "spec": {
    "containers": [
      {
        "name": "nginx",
        "resources": {
          "requests": {
            "cpu": "&lt;/span&gt;&lt;span class="nv"&gt;$NEW_REQUESTS_CPU&lt;/span&gt;&lt;span class="sh"&gt;",
            "memory": "&lt;/span&gt;&lt;span class="nv"&gt;$NEW_REQUESTS_MEMORY&lt;/span&gt;&lt;span class="sh"&gt;"
          },
          "limits": {
            "cpu": "&lt;/span&gt;&lt;span class="nv"&gt;$NEW_LIMITS_CPU&lt;/span&gt;&lt;span class="sh"&gt;",
            "memory": "&lt;/span&gt;&lt;span class="nv"&gt;$NEW_LIMITS_MEMORY&lt;/span&gt;&lt;span class="sh"&gt;"
          }
        }
      }
    ]
  }
}
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;  kubectl &lt;span class="nt"&gt;--token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$TOKEN&lt;/span&gt; patch pod &lt;span class="nv"&gt;$POD_NAME&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nv"&gt;$NAMESPACE&lt;/span&gt; &lt;span class="nt"&gt;--subresource&lt;/span&gt; resize &lt;span class="nt"&gt;--patch-file&lt;/span&gt; patch.json
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;$?&lt;/span&gt; &lt;span class="nt"&gt;-eq&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Pod resized successfully"&lt;/span&gt;
  &lt;span class="k"&gt;else
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Error resizing pod"&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
  &lt;span class="k"&gt;fi
else
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"CPU usage below threshold, no action needed."&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: focus on the specified &lt;strong&gt;variables&lt;/strong&gt; and &lt;strong&gt;modify&lt;/strong&gt; them if necessary.&lt;/p&gt;
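&lt;p&gt;The script compares CPU usage in &lt;strong&gt;nanocores&lt;/strong&gt;, while Kubernetes manifests usually express CPU in millicores. If you tune &lt;code&gt;CPU_THRESHOLD&lt;/code&gt;, a minimal conversion sketch may help (the helper name is illustrative, not part of the original script):&lt;/p&gt;

```shell
# Hedged sketch: convert a Kubernetes millicore string (e.g. "320m") into
# nanocores, the unit the metrics API returns. The helper name is illustrative.
millicores_to_nanocores() {
  local value="${1%m}"          # strip the trailing "m" suffix
  echo $(( value * 1000000 ))   # 1 millicore = 1,000,000 nanocores
}

CPU_THRESHOLD=$(millicores_to_nanocores "320m")
echo "$CPU_THRESHOLD"   # prints 320000000
```

&lt;p&gt;A &lt;em&gt;320m&lt;/em&gt; threshold thus becomes &lt;em&gt;320000000&lt;/em&gt; nanocores, the format the script expects.&lt;/p&gt;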

&lt;p&gt;Make it &lt;strong&gt;executable&lt;/strong&gt; with this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;chmod +x monitor.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;strong&gt;Create&lt;/strong&gt; a &lt;strong&gt;configmap&lt;/strong&gt; and &lt;strong&gt;store&lt;/strong&gt; the &lt;em&gt;monitor.sh&lt;/em&gt; file within it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;k create configmap monitor-script --from-file=monitor.s -n test  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;configmap/monitor-script created&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;strong&gt;next&lt;/strong&gt; step, &lt;strong&gt;create&lt;/strong&gt; a &lt;strong&gt;CronJob&lt;/strong&gt; manifest that runs the script on a schedule.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Name&lt;/strong&gt;: “&lt;em&gt;pod-scaler-cronjob.yaml&lt;/em&gt;”&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Content&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: batch/v1
kind: CronJob
metadata:
  name: pod-scaler-cronjob
  namespace: test
spec:
  schedule: "* * * * *" # Every minute
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-scaler
          containers:
          - name: scaler
            image: debian:bookworm-slim
            command: ["/bin/bash", "/scripts/monitor.sh"]
            volumeMounts:
            - name: script
              mountPath: /scripts
          volumes:
          - name: script
            configMap:
              name: monitor-script
          restartPolicy: OnFailure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and apply it&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;k apply &lt;span class="nt"&gt;-f&lt;/span&gt; pod-scaler-cronjob.yaml 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;cronjob.batch/pod-scaler-cronjob created&lt;/code&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Use Case: Dynamic Pod Resizing in Action
&lt;/h2&gt;

&lt;p&gt;Testing the in-place vertical pod scaling feature in action was a key goal of my PoC. &lt;strong&gt;I used it to dynamically resize an Nginx pod based on CPU thresholds&lt;/strong&gt;, simulating real-world demand. This experiment showcased its practicality for test environments.&lt;br&gt;&lt;br&gt;
Let’s explore how you can apply this in your own setup!&lt;/p&gt;
&lt;h3&gt;
  
  
  Generate workload
&lt;/h3&gt;

&lt;p&gt;We will &lt;strong&gt;simulate&lt;/strong&gt; a workload (for 2 minutes) on the &lt;strong&gt;affected&lt;/strong&gt; pod, named “&lt;em&gt;test-pod&lt;/em&gt;”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpk90b0cbnf65cq2nvein.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpk90b0cbnf65cq2nvein.png" alt="Stress test on the test Pod" width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jump&lt;/strong&gt; into the pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; test-pod &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and &lt;strong&gt;run&lt;/strong&gt; these commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apt-get update &amp;amp;&amp;amp; apt-get install -y stress
stress --cpu 2 --timeout 120
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected &lt;strong&gt;output&lt;/strong&gt; is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hit:1 http://deb.debian.org/debian bookworm InRelease
Hit:2 http://deb.debian.org/debian bookworm-updates InRelease
Hit:3 http://deb.debian.org/debian-security bookworm-security InRelease
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
stress is already the newest version (1.0.7-1).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
stress: info: [257] dispatching hogs: 2 cpu, 0 io, 0 vm, 0 hdd
stress: info: [257] successful run completed in 120s
root@test-pod:/# exit
exit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Validation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Check&lt;/strong&gt; the pods and confirm that &lt;em&gt;test-pod&lt;/em&gt; was not restarted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;NAME                                READY   STATUS    RESTARTS   AGE&lt;br&gt;
pod-scaler-cronjob-29140986-z2h98   1/1     Running   0          12s&lt;br&gt;
test-access                         1/1     Running   0          5h50m&lt;br&gt;
test-pod                            1/1     Running   0          22m&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;and run it again a few seconds later:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods -n test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;NAME                                READY   STATUS      RESTARTS   AGE&lt;br&gt;
pod-scaler-cronjob-29140986-z2h98   0/1     Completed   0          14s&lt;br&gt;
test-access                         1/1     Running     0          5h50m&lt;br&gt;
test-pod                            1/1     Running     0          22m&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check&lt;/strong&gt; the logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;k logs pod-scaler-cronjob-29140986-z2h98  &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;...&lt;br&gt;
CPU Usage: 990630785 nanocores, Threshold: 320000000 nanocores&lt;br&gt;
CPU usage exceeds threshold, resizing pod...&lt;br&gt;
pod/test-pod patched (no change)&lt;br&gt;
Pod resized successfully&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check&lt;/strong&gt; the &lt;strong&gt;limits&lt;/strong&gt; and &lt;strong&gt;requests&lt;/strong&gt; values of affected pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl describe pod test-pod &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s2"&gt;"limits|requests"&lt;/span&gt; &lt;span class="nt"&gt;-A4&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected &lt;strong&gt;output&lt;/strong&gt; is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    Limits:
      cpu:     1200m
      memory:  1Gi
    Requests:
      cpu:        600m
      memory:     512Mi
    Environment:  &amp;lt;none&amp;gt;
    Mounts:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Great! This is what we wanted!&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitations and Production Considerations
&lt;/h2&gt;

&lt;p&gt;While my PoC was exciting, it highlighted &lt;strong&gt;some limitations&lt;/strong&gt; of in-place scaling in its beta state. It &lt;strong&gt;works well&lt;/strong&gt; for test environments but requires &lt;strong&gt;refinement&lt;/strong&gt; for &lt;strong&gt;production&lt;/strong&gt; use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I plan to share&lt;/strong&gt; these insights to help you plan ahead.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion and Next Steps
&lt;/h2&gt;

&lt;p&gt;My &lt;strong&gt;PoC&lt;/strong&gt; with Kubernetes 1.33’s &lt;strong&gt;in-place vertical pod scaling&lt;/strong&gt; opened my eyes to its &lt;strong&gt;potential&lt;/strong&gt; and &lt;strong&gt;challenges&lt;/strong&gt;. This guide walked you &lt;strong&gt;through the process&lt;/strong&gt;, from setup to automation, with real-world insights.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About the Author&lt;/strong&gt;&lt;br&gt;
I’m &lt;strong&gt;Róbert Zsótér&lt;/strong&gt;, Kubernetes &amp;amp; AWS architect.&lt;br&gt;
If you’re into &lt;strong&gt;Kubernetes, EKS, Terraform&lt;/strong&gt;, and &lt;strong&gt;cloud-native security&lt;/strong&gt;, follow my latest posts here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LinkedIn: &lt;a href="https://www.linkedin.com/in/r%C3%B3bert-zs%C3%B3t%C3%A9r-34541464/" rel="noopener noreferrer"&gt;Róbert Zsótér&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Substack: &lt;a href="https://cloudskillshu.substack.com"&gt;CSHU&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s build secure, scalable clusters together.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Originally&lt;/strong&gt; published on &lt;strong&gt;Medium&lt;/strong&gt;: &lt;a href="https://medium.com/@zs77.robert/hands-on-with-kubernetes-1-33-my-poc-on-in-place-vertical-scaling-0b731dd3e9e5" rel="noopener noreferrer"&gt;Hands-On with Kubernetes 1.33: My PoC on In-Place Vertical Scaling&lt;/a&gt;&lt;/p&gt;




</description>
      <category>tutorial</category>
      <category>devops</category>
      <category>performance</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Chaos Engineering on AWS: Using Fault Injection Simulator (FIS) for Resilience</title>
      <dc:creator>Robert Zsoter</dc:creator>
      <pubDate>Thu, 02 Oct 2025 05:07:00 +0000</pubDate>
      <link>https://dev.to/robert_r_7c237256b7614328/chaos-engineering-on-aws-using-fault-injection-simulator-fis-for-resilience-hap</link>
      <guid>https://dev.to/robert_r_7c237256b7614328/chaos-engineering-on-aws-using-fault-injection-simulator-fis-for-resilience-hap</guid>
      <description>&lt;p&gt;&lt;strong&gt;Part I&lt;/strong&gt;: Building Resilient Systems on AWS: EC2 service and Auto Scaling Group&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction to Chaos Engineering
&lt;/h2&gt;


&lt;p&gt;&lt;strong&gt;Why do we need&lt;/strong&gt; Chaos Engineering?&lt;br&gt;&lt;br&gt;
Historically, &lt;strong&gt;disaster preparedness focused on catastrophic events&lt;/strong&gt; like earthquakes or power outages, with organizations investing in disaster recovery (&lt;strong&gt;DR&lt;/strong&gt;) plans to &lt;em&gt;restore services from backup&lt;/em&gt; data centers after major disruptions.&lt;br&gt;&lt;br&gt;
While effective for large-scale outages, &lt;strong&gt;this approach fails to address the frequent, smaller-scale failures&lt;/strong&gt; prevalent in modern systems, rendering traditional DR insufficient as infrastructure evolves.&lt;/p&gt;

&lt;p&gt;The shift &lt;strong&gt;from monolithic applications to distributed, microservices-based&lt;/strong&gt;, cloud-native architectures on platforms &lt;strong&gt;like AWS and Kubernetes&lt;/strong&gt; has brought scalability and agility but also increased fragility. &lt;strong&gt;A single misbehaving microservice&lt;/strong&gt; can trigger &lt;strong&gt;cascading failures&lt;/strong&gt;, a &lt;strong&gt;misconfigured network route&lt;/strong&gt; can isolate critical components, or a &lt;strong&gt;faulty deployment&lt;/strong&gt; can exhaust resources like CPU or memory on a single Kubernetes node, &lt;strong&gt;disrupting entire applications&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Imagine proactively simulating failures&lt;/strong&gt; &lt;strong&gt;in a controlled environment&lt;/strong&gt;: injecting latency into an AWS RDS database, terminating a Kubernetes pod mid-operation, or mimicking an AWS region outage.&lt;br&gt;&lt;br&gt;
By deliberately introducing faults, engineers can &lt;strong&gt;identify weaknesses&lt;/strong&gt;, &lt;strong&gt;redesign&lt;/strong&gt; for fault tolerance, &lt;strong&gt;measure&lt;/strong&gt; blast radius, and &lt;strong&gt;validate monitoring&lt;/strong&gt;, &lt;strong&gt;alerting&lt;/strong&gt;, and &lt;strong&gt;auto-remediation&lt;/strong&gt; mechanisms.&lt;/p&gt;

&lt;p&gt;Chaos Engineering is the practice of intentionally &lt;strong&gt;injecting controlled failures&lt;/strong&gt; to study system behavior and enhance resilience. It involves &lt;strong&gt;safely introducing faults&lt;/strong&gt;, limiting their impact, observing outcomes, and iteratively &lt;strong&gt;applying architectural improvements&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
This is not reckless disruption but a disciplined, &lt;strong&gt;test-driven approach to hardening systems.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Downtime carries significant financial and operational costs&lt;/strong&gt; across industries like e-commerce, healthcare, and finance. In cloud-native environments, unpredictability is inherent.&lt;br&gt;&lt;br&gt;
While chaos cannot be eliminated, &lt;strong&gt;Chaos Engineering enables teams to prepare, test, and adapt&lt;/strong&gt; systems to be &lt;strong&gt;self-healing and robust&lt;/strong&gt;, ensuring resilience in the face of inevitable failures.&lt;/p&gt;




&lt;h2&gt;
  
  
  AWS FIS service
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Overview about AWS Fault Injection Service (FIS)
&lt;/h3&gt;

&lt;p&gt;Chaos Engineering has evolved into a critical practice for building resilient cloud systems, with various tools emerging to support controlled failure testing. The concept gained prominence with &lt;strong&gt;Netflix’s Chaos Monkey&lt;/strong&gt;, which randomly terminated instances to validate system robustness. Today, the ecosystem includes tools like &lt;strong&gt;Gremlin&lt;/strong&gt;, &lt;strong&gt;Azure Chaos Studio&lt;/strong&gt;, and, within the AWS platform, the &lt;strong&gt;AWS Fault Injection Service (FIS)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this guide, we focus on &lt;strong&gt;AWS FIS&lt;/strong&gt;, &lt;strong&gt;Amazon’s managed solution&lt;/strong&gt; for conducting Chaos Engineering experiments in AWS environments.&lt;br&gt;&lt;br&gt;
FIS enables teams to &lt;strong&gt;simulate real-world failures,&lt;/strong&gt; &lt;strong&gt;identify vulnerabilities&lt;/strong&gt;, and &lt;strong&gt;strengthen system resilience&lt;/strong&gt; before issues impact users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Fault Injection Service&lt;/strong&gt; is a &lt;strong&gt;fully managed platform&lt;/strong&gt; designed to execute controlled fault injection experiments across AWS workloads.&lt;br&gt;&lt;br&gt;
By introducing failures like instance terminations or network disruptions, &lt;strong&gt;FIS helps engineers assess how systems handle unexpected conditions&lt;/strong&gt;, &lt;strong&gt;allowing them to address weaknesses proactively and enhance reliability&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why should we use AWS FIS?
&lt;/h3&gt;

&lt;p&gt;AWS FIS empowers teams to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Simulate failures such as EC2 instance crashes, EKS pod evictions, or RDS failovers.&lt;/li&gt;
&lt;li&gt;  Test complex scenarios, including Availability Zone (AZ) outages or cross-region disruptions.&lt;/li&gt;
&lt;li&gt;  Validate recovery mechanisms like auto-scaling, failover, and monitoring alerts.&lt;/li&gt;
&lt;li&gt;  Build confidence in system resilience, reducing &lt;strong&gt;Mean Time to Recovery (MTTR)&lt;/strong&gt; and ensuring robust performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How FIS Works: Architecture Overview
&lt;/h3&gt;

&lt;p&gt;FIS integrates seamlessly with AWS services to provide comprehensive observability and control:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Amazon CloudWatch&lt;/strong&gt;: Monitors metrics and triggers rollbacks if experiment thresholds are exceeded.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;AWS X-Ray&lt;/strong&gt;: Traces the impact of failures across services for detailed analysis.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;AWS IAM&lt;/strong&gt;: Enforces granular permissions to manage who can run or configure experiments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;FIS offers &lt;strong&gt;predefined failure scenarios&lt;/strong&gt; and &lt;strong&gt;experiment templates&lt;/strong&gt; for services like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;EC2&lt;/strong&gt;: Instance termination, CPU or memory stress.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;EKS&lt;/strong&gt;: Pod or node disruptions for Kubernetes workloads.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;RDS&lt;/strong&gt;: Database reboot or failover simulations.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;S3&lt;/strong&gt;: Network latency or access disruptions.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Multi-AZ/Region&lt;/strong&gt;: Simulating large-scale outages.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Managing FIS
&lt;/h3&gt;

&lt;p&gt;Experiments (an &lt;strong&gt;experiment&lt;/strong&gt; is a controlled test that uses FIS to simulate a real-world failure) can be configured and executed via:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;AWS Management Console&lt;/strong&gt; for interactive setup.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;AWS CLI&lt;/strong&gt; for scripted workflows.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;AWS CloudFormation&lt;/strong&gt; or &lt;strong&gt;AWS SDKs&lt;/strong&gt; for infrastructure-as-code integration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This flexibility allows FIS to fit into both manual testing processes and &lt;strong&gt;automated CI/CD pipelines&lt;/strong&gt;, embedding Chaos Engineering into the Software Development Lifecycle (&lt;strong&gt;SDLC&lt;/strong&gt;).&lt;/p&gt;
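&lt;p&gt;Whichever interface you choose, an experiment is defined by a &lt;strong&gt;template&lt;/strong&gt;. As a hedged sketch (the account ID and role name in &lt;code&gt;roleArn&lt;/code&gt; are placeholders), an EC2-termination template passed to &lt;code&gt;aws fis create-experiment-template --cli-input-json&lt;/code&gt; could look roughly like this:&lt;/p&gt;

```json
{
  "description": "Terminate one EC2 instance tagged fis=true",
  "roleArn": "arn:aws:iam::123456789012:role/FIS-EC2",
  "targets": {
    "tagged-instances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "fis": "true" },
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "terminate-instance": {
      "actionId": "aws:ec2:terminate-instances",
      "targets": { "Instances": "tagged-instances" }
    }
  },
  "stopConditions": [
    { "source": "none" }
  ]
}
```

&lt;p&gt;In production you would typically replace the &lt;code&gt;"source": "none"&lt;/code&gt; stop condition with a CloudWatch alarm, so the experiment rolls back automatically if a threshold is breached.&lt;/p&gt;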




&lt;h2&gt;
  
  
  Pre-Requirements: IAM Roles for AWS FIS
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt; you can run experiments with AWS Fault Injection Service (FIS), you &lt;strong&gt;must configure two essential IAM roles&lt;/strong&gt;, each serving a distinct purpose in the security model.&lt;/p&gt;

&lt;h3&gt;
  
  
  About the roles
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. User Role: Who Can Control the FIS Service&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;This is the&lt;/strong&gt; IAM role or IAM identity (user/group/role) used to log in to the AWS Console or interact with the AWS CLI. This role defines &lt;em&gt;who can view, create, modify, or start FIS experiments&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Because FIS &lt;em&gt;experiments&lt;/em&gt; may impact availability or cause downtime, &lt;strong&gt;it is critical to strictly control who can access these capabilities.&lt;/strong&gt; You should assign FIS-related permissions only to trusted users with chaos engineering or platform engineering responsibilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. FIS Service Role: What FIS Can Do&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;This is the IAM role&lt;/strong&gt; assumed by AWS FIS itself when running an &lt;em&gt;experiment&lt;/em&gt;. It governs &lt;strong&gt;what actions the FIS engine is allowed to perform on AWS resources&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For example&lt;/strong&gt;, it defines whether FIS can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Terminate an EC2 instance&lt;/li&gt;
&lt;li&gt;  Reboot or failover an RDS database&lt;/li&gt;
&lt;li&gt;  Inject faults into EKS or simulate an Availability Zone (AZ) outage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This role must include the exact &lt;strong&gt;permissions required to perform those actions&lt;/strong&gt;, and must also &lt;strong&gt;trust&lt;/strong&gt; the FIS service to &lt;strong&gt;assume the role&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
This is called the “&lt;em&gt;FIS Service Role&lt;/em&gt;”.&lt;/p&gt;
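&lt;p&gt;For tighter control than a broad managed policy, the service role’s permissions can be scoped so FIS may only terminate instances carrying the experiment tag. A hedged sketch (the account ID is a placeholder; &lt;code&gt;ec2:DescribeInstances&lt;/code&gt; does not support resource-level restrictions, hence the wildcard):&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TerminateOnlyTaggedInstances",
      "Effect": "Allow",
      "Action": "ec2:TerminateInstances",
      "Resource": "arn:aws:ec2:*:123456789012:instance/*",
      "Condition": {
        "StringEquals": { "ec2:ResourceTag/fis": "true" }
      }
    },
    {
      "Sid": "ReadInstances",
      "Effect": "Allow",
      "Action": "ec2:DescribeInstances",
      "Resource": "*"
    }
  ]
}
```

&lt;p&gt;This way, even if an experiment is misconfigured, FIS cannot terminate instances outside the tagged blast radius.&lt;/p&gt;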

&lt;h3&gt;
  
  
  Creating the AWS FIS Service Role
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scope&lt;/strong&gt;: IAM Role for FIS experiments&lt;/p&gt;

&lt;p&gt;In this walkthrough, we’ll create the &lt;strong&gt;AWS FIS service role,&lt;/strong&gt; an &lt;strong&gt;IAM role that the Fault Injection Simulator (FIS)&lt;/strong&gt; will &lt;strong&gt;assume&lt;/strong&gt; when performing its &lt;strong&gt;experiments&lt;/strong&gt; (such as EC2 termination).&lt;/p&gt;

&lt;p&gt;This role controls what &lt;strong&gt;actions FIS can take&lt;/strong&gt; on AWS resources during a fault injection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Navigate to IAM in AWS Console&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Go to the &lt;strong&gt;IAM (Identity and Access Management)&lt;/strong&gt; section of the AWS Console.&lt;/li&gt;
&lt;li&gt;  Click on &lt;strong&gt;“Roles”&lt;/strong&gt; from the left menu.&lt;/li&gt;
&lt;li&gt;  Choose &lt;strong&gt;“Create role”&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Set Trusted Entity Type&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Under &lt;strong&gt;Trusted entity type&lt;/strong&gt;, select:
➤ &lt;strong&gt;“AWS service”&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  This is because the role will be assumed by an &lt;strong&gt;AWS-managed service&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  “&lt;strong&gt;Use Case&lt;/strong&gt;”: In the list of services, choose:
➤ &lt;strong&gt;“FIS: Fault Injection Simulator”&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This tells IAM that only AWS FIS will be allowed to assume this role when performing &lt;strong&gt;experiments&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Select Use Case for experiment&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;AWS offers predefined permission sets depending on what type of experiment&lt;/strong&gt; you’re planning to run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  For EC2 instance termination: choose the &lt;strong&gt;EC2 use case&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  For network-level disruptions: choose &lt;strong&gt;VPC/network-related permissions&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  For RDS failover testing: choose &lt;strong&gt;RDS&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  For ECS/EKS faults: select the corresponding container service&lt;/li&gt;
&lt;li&gt;  etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this guide, we’ll focus on &lt;strong&gt;terminating EC2 instances&lt;/strong&gt;, so select the &lt;strong&gt;EC2&lt;/strong&gt; use case: “&lt;em&gt;AWSFaultInjectionSimulatorEC2Access&lt;/em&gt;”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Review Permissions and Trust Policy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After &lt;strong&gt;selecting&lt;/strong&gt; the EC2 scenario:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  IAM automatically attaches a predefined &lt;strong&gt;AWS managed policy&lt;/strong&gt; that grants permissions for EC2-related fault injection actions.&lt;/li&gt;
&lt;li&gt;  Review the &lt;strong&gt;Trust policy&lt;/strong&gt;, which should contain:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{  
  "Version": "2012-10-17",  
  "Statement": \[  
    {  
      "Effect": "Allow",  
      "Principal": {  
        "Service": "fis.amazonaws.com"  
      },  
      "Action": "sts:AssumeRole"  
    }  
  \]  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This policy allows &lt;strong&gt;FIS to assume the role&lt;/strong&gt; during the experiment execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Name the Role&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Provide a meaningful name like:
&lt;code&gt;FIS-EC2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  Click &lt;strong&gt;“Create role”&lt;/strong&gt; to finalize the creation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;6. Add CloudWatch Logging Permissions&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;To allow FIS to write logs&lt;/strong&gt;, metrics, and diagnostic output, attach an &lt;strong&gt;additional permission&lt;/strong&gt; policy for &lt;strong&gt;CloudWatch&lt;/strong&gt; access:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Go&lt;/strong&gt; to the newly created role&lt;/li&gt;
&lt;li&gt;  Click &lt;strong&gt;“Add permissions” → “Attach policies”&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  Search for and attach:
➤ &lt;code&gt;CloudWatchLogsFullAccess&lt;/code&gt; &lt;em&gt;(or define a scoped custom policy if preferred)&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows the FIS experiments to send logging information to &lt;strong&gt;Amazon CloudWatch Logs&lt;/strong&gt;, enabling you to &lt;strong&gt;monitor experiment results and rollbacks&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Understanding the Components&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In this section, we will analyze the key components of the &lt;strong&gt;first fault injection experiment&lt;/strong&gt; that we’ll execute using AWS FIS.&lt;/p&gt;

&lt;p&gt;This experiment is designed to simulate &lt;strong&gt;EC2 instance termination&lt;/strong&gt; within an AWS environment managed by an &lt;strong&gt;Auto Scaling Group (ASG)&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The “Given”, Our Known Architecture
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;first component&lt;/strong&gt; of any FIS experiment is the &lt;strong&gt;“given”,&lt;/strong&gt; a &lt;strong&gt;clear understanding of the system’s current behavior and architecture&lt;/strong&gt; under normal (steady-state) conditions.&lt;/p&gt;

&lt;p&gt;In this case, we know the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The application is hosted on &lt;strong&gt;EC2 instances&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  These instances are deployed &lt;strong&gt;across multiple Availability Zones (AZs)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  An &lt;strong&gt;Auto Scaling Group (ASG)&lt;/strong&gt; is configured to manage these instances&lt;/li&gt;
&lt;li&gt;  The ASG is expected to maintain &lt;strong&gt;a defined minimum capacity&lt;/strong&gt; and automatically &lt;strong&gt;replace any unhealthy or terminated instance&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This &lt;strong&gt;architectural context is essential.&lt;/strong&gt; It establishes what we expect to happen &lt;strong&gt;before any fault is introduced&lt;/strong&gt;, and serves as the baseline.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The Hypothesis, What We Expect to Happen
&lt;/h3&gt;

&lt;p&gt;The second component is the &lt;strong&gt;hypothesis,&lt;/strong&gt; a prediction of &lt;strong&gt;how the application should behave&lt;/strong&gt; when a specific &lt;strong&gt;failure occurs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For this experiment, our hypothesis is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“If an EC2 instance is terminated, the Auto Scaling Group will detect the loss and automatically provision a new instance. As a result, the application will continue running without any disruption.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This hypothesis is based on the expected behavior of ASGs in AWS, which are designed to maintain the desired capacity at all times.&lt;/p&gt;

&lt;h3&gt;
  
  
  Objective of the Experiment
&lt;/h3&gt;

&lt;p&gt;By executing this &lt;strong&gt;AWS FIS experiment&lt;/strong&gt;, we aim to test whether this hypothesis holds true under real conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We want to observe&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Whether the ASG &lt;strong&gt;replaces the instance fast enough&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  Whether there is &lt;strong&gt;any downtime&lt;/strong&gt; during the replacement&lt;/li&gt;
&lt;li&gt;  How other application components behave during this replacement window&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This test will give us a &lt;strong&gt;clear understanding of the actual resilience&lt;/strong&gt; of the EC2 + ASG setup.&lt;/p&gt;




&lt;h2&gt;
  
  
  Preparing the AWS Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Affected services&lt;/strong&gt;: EC2 + ASG&lt;/p&gt;

&lt;p&gt;Before we initiate our first &lt;strong&gt;AWS Fault Injection Service (FIS)&lt;/strong&gt; experiment, let’s define and build a &lt;strong&gt;simple and controlled architecture&lt;/strong&gt; in your AWS account to support the experiment scenario.&lt;/p&gt;

&lt;p&gt;To keep things &lt;strong&gt;straightforward and beginner-friendly&lt;/strong&gt;, we will use the &lt;strong&gt;AWS Management Console&lt;/strong&gt; for this setup.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture Components Overview
&lt;/h3&gt;

&lt;p&gt;We will create &lt;strong&gt;three essential components&lt;/strong&gt; in this setup:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. EC2 Launch Template&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;We begin by creating an EC2 Launch Template&lt;/strong&gt;, which defines the configuration for EC2 instances launched by the Auto Scaling Group.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This includes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  AMI ID (for example: &lt;em&gt;Amazon Linux 2023 AMI, under the Free Tier&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;  Instance type (any type works; &lt;em&gt;t3.micro&lt;/em&gt; is fine)&lt;/li&gt;
&lt;li&gt;  Security groups (no changes are needed here)&lt;/li&gt;
&lt;li&gt;  Subnets: any subnet of your VPC&lt;/li&gt;
&lt;li&gt;  Key pair (optional): not needed in this case&lt;/li&gt;
&lt;li&gt;  User data (optional)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As part of this configuration, we’ll &lt;strong&gt;add&lt;/strong&gt; a &lt;strong&gt;Resource&lt;/strong&gt; &lt;strong&gt;tag&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
➤ &lt;code&gt;Key: fis&lt;/code&gt;&lt;br&gt;&lt;br&gt;
➤ &lt;code&gt;Value: true&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  This tag can later be used in FIS filters to &lt;strong&gt;target specific instances&lt;/strong&gt; for termination.&lt;/li&gt;
&lt;li&gt;  Ensure that “&lt;em&gt;Instances&lt;/em&gt;” has been added under “&lt;em&gt;Resource types&lt;/em&gt;”&lt;/li&gt;
&lt;/ul&gt;
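&lt;p&gt;For reference, the same launch template can also be created from the AWS CLI. This is a minimal sketch; the template name &lt;code&gt;fis-lab-lt&lt;/code&gt; and the AMI ID are placeholders, not values from this lab:&lt;/p&gt;

```shell
# Sketch: launch template whose instances carry the fis=true tag.
# fis-lab-lt and ami-xxxxxxxxxxxx are placeholders - adjust for your account.
aws ec2 create-launch-template \
  --launch-template-name fis-lab-lt \
  --launch-template-data '{
    "ImageId": "ami-xxxxxxxxxxxx",
    "InstanceType": "t3.micro",
    "TagSpecifications": [
      {"ResourceType": "instance", "Tags": [{"Key": "fis", "Value": "true"}]}
    ]
  }'
```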

&lt;p&gt;&lt;strong&gt;2. Auto Scaling Group (ASG)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Next&lt;/strong&gt;, we will &lt;strong&gt;create&lt;/strong&gt; an &lt;strong&gt;Auto Scaling Group&lt;/strong&gt; that uses the launch template above. On the next page, click the “&lt;em&gt;Create an Auto Scaling group from your template&lt;/em&gt;” link and &lt;strong&gt;create the ASG&lt;/strong&gt; (or navigate to &lt;em&gt;EC2/Auto Scaling Groups&lt;/em&gt; in the AWS Console and select the created launch template from the list).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key configuration&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Select your affected &lt;strong&gt;VPC&lt;/strong&gt; (the VPC that contains the subnet selected in the launch template)&lt;/li&gt;
&lt;li&gt;  Spread instances across at least &lt;strong&gt;two Availability Zones&lt;/strong&gt; (AZs): select the same subnet plus at least one more&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Leave&lt;/strong&gt; the other options at their &lt;strong&gt;defaults&lt;/strong&gt; on this page and the next, then&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Set the values&lt;/strong&gt; like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Desired capacity&lt;/strong&gt; = &lt;code&gt;1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Minimum capacity&lt;/strong&gt; = &lt;code&gt;1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Maximum capacity&lt;/strong&gt; = &lt;code&gt;4&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Click &lt;strong&gt;Next&lt;/strong&gt; through the remaining pages, and on the “&lt;em&gt;Review&lt;/em&gt;” page &lt;strong&gt;click&lt;/strong&gt; “&lt;strong&gt;&lt;em&gt;Create Auto Scaling group&lt;/em&gt;&lt;/strong&gt;”&lt;/p&gt;

&lt;p&gt;This setup ensures that &lt;strong&gt;when an EC2 instance is terminated&lt;/strong&gt;, the ASG will automatically &lt;strong&gt;launch a new one&lt;/strong&gt; to maintain the desired capacity.&lt;/p&gt;
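&lt;p&gt;A CLI equivalent of this ASG setup could look like the following sketch; the group name &lt;code&gt;fis-lab-asg&lt;/code&gt;, template name, and subnet IDs are placeholders:&lt;/p&gt;

```shell
# Sketch: ASG from the launch template, spread across two subnets/AZs.
# Desired/min = 1, max = 4, matching the console values in this lab.
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name fis-lab-asg \
  --launch-template LaunchTemplateName=fis-lab-lt,Version='$Latest' \
  --min-size 1 --max-size 4 --desired-capacity 1 \
  --vpc-zone-identifier "subnet-aaaa1111,subnet-bbbb2222"
```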

&lt;p&gt;&lt;strong&gt;3. CloudWatch Log Group for FIS&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Lastly, we will create&lt;/strong&gt; a &lt;strong&gt;CloudWatch Log Group&lt;/strong&gt; named:&lt;br&gt;&lt;br&gt;
➤ &lt;code&gt;test-fs&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Navigate&lt;/strong&gt; to &lt;em&gt;CloudWatch&lt;/em&gt; in the AWS Console and &lt;strong&gt;create&lt;/strong&gt; a new log group under “&lt;em&gt;Log Groups&lt;/em&gt;”&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Set expiration&lt;/strong&gt; to &lt;strong&gt;1 day&lt;/strong&gt;; leave the other options at their defaults&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This log group &lt;strong&gt;will be used by AWS FIS&lt;/strong&gt; to write execution logs, &lt;strong&gt;helping you&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Track fault injection events&lt;/li&gt;
&lt;li&gt;  Monitor system responses&lt;/li&gt;
&lt;li&gt;  Audit the sequence of actions&lt;/li&gt;
&lt;/ul&gt;
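&lt;p&gt;The same log group can be created and given a one-day retention from the CLI (assuming the &lt;code&gt;test-fs&lt;/code&gt; name used in the template configuration later in this lab):&lt;/p&gt;

```shell
# Create the FIS log group and expire its events after one day.
aws logs create-log-group --log-group-name test-fs
aws logs put-retention-policy --log-group-name test-fs --retention-in-days 1
```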

&lt;p&gt;Once these three components are in place, we’ll be ready to define and run our &lt;strong&gt;FIS experiment template&lt;/strong&gt; targeting the EC2 instance within this Auto Scaling setup.&lt;/p&gt;




&lt;h2&gt;
  
  
  Creating the AWS FIS Experiment Template
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use case&lt;/strong&gt;: “controlled” EC2 Termination in ASG&lt;/p&gt;

&lt;p&gt;As part of &lt;strong&gt;this LAB&lt;/strong&gt; we will now create our &lt;strong&gt;first AWS Fault Injection Service (FIS) experiment template&lt;/strong&gt;. This experiment is designed to test the behavior of an Auto Scaling Group (ASG) when &lt;strong&gt;50% of its EC2 instances are terminated&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Navigate to FIS Experiment Templates&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Open the &lt;strong&gt;AWS Management Console&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  Go to &lt;strong&gt;AWS Fault Injection Simulator (FIS)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  Click on &lt;strong&gt;“Experiment templates”&lt;/strong&gt; in the left navigation pane&lt;/li&gt;
&lt;li&gt;  Click &lt;strong&gt;“Create an experiment template”&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. General Settings&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Experiment Template Name&lt;/strong&gt;:
➤ Example: &lt;code&gt;fis-ec2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;AWS Account&lt;/strong&gt;: Select your current AWS account&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Description&lt;/strong&gt;:
➤ &lt;code&gt;"Terminate 50% of instances in the Auto Scaling Group"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Define the Action&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;An action&lt;/strong&gt; in AWS FIS defines &lt;em&gt;what fault is injected during the experiment&lt;/em&gt;. In our case, this will be &lt;strong&gt;terminating EC2 instances&lt;/strong&gt;: &lt;strong&gt;click&lt;/strong&gt; on “&lt;em&gt;Add Action&lt;/em&gt;” under “&lt;em&gt;Actions and Targets&lt;/em&gt;”&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Action Name&lt;/strong&gt;: &lt;code&gt;TerminateEC2Instances&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Action Description&lt;/strong&gt;: &lt;code&gt;"Terminate EC2 instance(s) to test ASG recovery"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Service&lt;/strong&gt;: &lt;code&gt;EC2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Action Type&lt;/strong&gt;: select &lt;strong&gt;EC2&lt;/strong&gt; and choose &lt;code&gt;aws:ec2:terminate-instances&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Start after:&lt;/strong&gt; optional&lt;br&gt;
&lt;strong&gt;Note&lt;/strong&gt;: You may also define action sequencing here (for multi-step experiments), but we will skip this for now.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Target&lt;/strong&gt;: now, leave as default&lt;/li&gt;
&lt;li&gt;  Click on “&lt;strong&gt;Save&lt;/strong&gt;”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Target Definition&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
We now specify &lt;strong&gt;which EC2 instances&lt;/strong&gt; the action should apply to.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Click&lt;/strong&gt; on “&lt;em&gt;Instances-Target-1&lt;/em&gt;” in the generated diagram and configure which instances will be affected&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Click on the dots&lt;/strong&gt; of the “&lt;em&gt;Instances-Target-1&lt;/em&gt;” and &lt;strong&gt;edit the target&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Target Name&lt;/strong&gt;: &lt;code&gt;fis-ec2-target&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Resource Type&lt;/strong&gt;: &lt;code&gt;aws:ec2:instance&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Target Method&lt;/strong&gt;: select “&lt;em&gt;Resource tags, filters and parameters&lt;/em&gt;” and configure the tag (as you did in the launch template): under “&lt;em&gt;Resource tags&lt;/em&gt;”, add the same key and value&lt;br&gt;
&lt;strong&gt;Key&lt;/strong&gt;: &lt;code&gt;fis&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Value&lt;/strong&gt;: &lt;code&gt;true&lt;/code&gt;&lt;br&gt;
➤ This ensures only EC2 instances explicitly tagged &lt;code&gt;fis=true&lt;/code&gt; are targeted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Apply two filters&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
We will &lt;strong&gt;configure&lt;/strong&gt; this because we want to &lt;strong&gt;terminate 50% of running instances&lt;/strong&gt;, which means we need &lt;strong&gt;multiple filters&lt;/strong&gt; here.&lt;/p&gt;

&lt;p&gt;Under the “&lt;strong&gt;Resources filters&lt;/strong&gt;”, configure this filter:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Attribute&lt;/strong&gt;: &lt;code&gt;State.Name&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Value&lt;/strong&gt;: &lt;code&gt;running&lt;/code&gt;&lt;br&gt;&lt;br&gt;
➤ This ensures the experiment &lt;strong&gt;only affects&lt;/strong&gt; &lt;strong&gt;running instances&lt;/strong&gt;, skipping stopped, initializing, or terminated ones.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  and &lt;strong&gt;set the selection mode:&lt;/strong&gt; &lt;code&gt;PERCENT(50)&lt;/code&gt; → this ensures &lt;strong&gt;only 50%&lt;/strong&gt; of the matching instances will be affected.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Click&lt;/strong&gt; on “&lt;em&gt;Save&lt;/em&gt;”.&lt;/p&gt;
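&lt;p&gt;For the CLI-inclined, the whole template built so far can be expressed in one call. This is a hedged sketch: the role ARN is a placeholder, the stop condition is disabled as in this lab, and log delivery is left out for brevity:&lt;/p&gt;

```shell
# Sketch: FIS experiment template - terminate 50% of running instances
# tagged fis=true. The role ARN is a placeholder for your own account.
aws fis create-experiment-template \
  --description "Terminate 50% of instances in the Auto Scaling Group" \
  --role-arn arn:aws:iam::123456789012:role/FIS-EC2 \
  --stop-conditions '[{"source": "none"}]' \
  --targets '{
    "fis-ec2-target": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {"fis": "true"},
      "filters": [{"path": "State.Name", "values": ["running"]}],
      "selectionMode": "PERCENT(50)"
    }
  }' \
  --actions '{
    "TerminateEC2Instances": {
      "actionId": "aws:ec2:terminate-instances",
      "targets": {"Instances": "fis-ec2-target"}
    }
  }'
```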

&lt;p&gt;&lt;strong&gt;5. Assign IAM Role&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Select the IAM role&lt;/strong&gt; that AWS FIS will assume when executing the experiment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Example: &lt;code&gt;FIS-EC2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  This &lt;strong&gt;role&lt;/strong&gt; (which has been created earlier) &lt;strong&gt;must have permissions&lt;/strong&gt; to terminate EC2 instances and write to CloudWatch Logs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Leave&lt;/strong&gt; other options here on &lt;strong&gt;default&lt;/strong&gt;, Next.&lt;/p&gt;
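&lt;p&gt;As a rough illustration of “least privilege” for this role, an inline policy could be attached like this; the policy name is an assumption, and the exact log-delivery permissions may vary with your setup:&lt;/p&gt;

```shell
# Sketch: minimal inline policy for the FIS-EC2 service role -
# terminate/describe EC2 instances plus CloudWatch Logs delivery.
aws iam put-role-policy \
  --role-name FIS-EC2 \
  --policy-name fis-ec2-minimal \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {"Effect": "Allow",
       "Action": ["ec2:DescribeInstances", "ec2:TerminateInstances"],
       "Resource": "*"},
      {"Effect": "Allow",
       "Action": ["logs:CreateLogDelivery", "logs:PutLogEvents",
                  "logs:DescribeLogGroups", "logs:DescribeResourcePolicies"],
       "Resource": "*"}
    ]
  }'
```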

&lt;p&gt;&lt;strong&gt;6. Add Stop Conditions&lt;/strong&gt; (Optional but Recommended, especially in &lt;strong&gt;Production/Live environment&lt;/strong&gt;)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: &lt;br&gt;
Although we will &lt;strong&gt;not define stop conditions&lt;/strong&gt; in this lab, it’s considered &lt;strong&gt;best practice&lt;/strong&gt; in &lt;strong&gt;production&lt;/strong&gt; environments.&lt;br&gt;&lt;br&gt;
Stop conditions allow you to configure &lt;strong&gt;CloudWatch alarms&lt;/strong&gt; that will immediately &lt;strong&gt;stop the FIS experiment&lt;/strong&gt; if a critical threshold is breached (e.g., CPU drops below a threshold or latency spikes).&lt;/p&gt;

&lt;p&gt;In a “&lt;strong&gt;PROD/LIVE&lt;/strong&gt;” environment you should consider configuring it: “&lt;em&gt;AWS FIS helps you run experiments on your workloads safely. You can set a limit, known as a stop condition, to end the experiment if it reaches the threshold defined by a CloudWatch alarm. If a stop condition is reached during an experiment, you can’t resume the experiment&lt;/em&gt;.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. CloudWatch Logging&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Enable log delivery&lt;/strong&gt; to an existing CloudWatch Log Group: &lt;strong&gt;configure&lt;/strong&gt; the created log group:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Log Group Name&lt;/strong&gt;: &lt;code&gt;test-fs&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This allows you to monitor:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Action start/end times&lt;/li&gt;
&lt;li&gt;  Success/failure of the experiments&lt;/li&gt;
&lt;li&gt;  Target resource IDs and outcomes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;8. Create the Template&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Scroll to the bottom and click &lt;strong&gt;“Create experiment template”&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  AWS &lt;strong&gt;may show a warning&lt;/strong&gt; that no stop condition was defined, &lt;strong&gt;acknowledge&lt;/strong&gt; this for the demo&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Type&lt;/strong&gt; “&lt;em&gt;create&lt;/em&gt;”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The template is now ready to use.&lt;/p&gt;




&lt;h2&gt;
  
  
  Running Your First AWS FIS Experiment
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Affected components&lt;/strong&gt;: EC2 + ASG (EC2 Termination)&lt;/p&gt;

&lt;p&gt;In this section, we will &lt;strong&gt;execute the AWS FIS experiment&lt;/strong&gt; that was created earlier. &lt;strong&gt;The goal is to test&lt;/strong&gt; how the Auto Scaling Group (ASG) handles EC2 instance termination and &lt;strong&gt;validate&lt;/strong&gt; whether our hypothesis holds true.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Review the Target Instances
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt; triggering the experiment, it’s important to &lt;strong&gt;confirm which instances will be targeted&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Navigate to the &lt;strong&gt;EC2 Dashboard&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; Apply &lt;strong&gt;two filters&lt;/strong&gt;:&lt;br&gt;
&lt;strong&gt;Instance state&lt;/strong&gt;: &lt;code&gt;running&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Tag filter&lt;/strong&gt;: &lt;code&gt;Key=fis&lt;/code&gt;, &lt;code&gt;Value=true&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In our example, a &lt;strong&gt;single EC2 instance&lt;/strong&gt; meets both conditions. This would be the affected instance if the experiment proceeds.&lt;/p&gt;

&lt;p&gt;Because we configured the experiment to affect &lt;strong&gt;50% of the matching instances&lt;/strong&gt; (in the FIS template), we need to &lt;strong&gt;increase&lt;/strong&gt; the number of instances from one to &lt;strong&gt;two&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adjusting ASG Capacity&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;The current setup&lt;/strong&gt; means there’s only &lt;strong&gt;one EC2 instance&lt;/strong&gt; running. If that instance is terminated, it takes time for a replacement to launch, leading to &lt;strong&gt;potential downtime&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modify&lt;/strong&gt; the affected Auto Scaling Group:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Set &lt;strong&gt;Desired capacity&lt;/strong&gt; = &lt;code&gt;2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  Set &lt;strong&gt;Minimum capacity&lt;/strong&gt; = &lt;code&gt;2&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures that &lt;strong&gt;two EC2 instances&lt;/strong&gt; are always running, and terminating one will not interrupt service.&lt;/p&gt;

&lt;p&gt;Wait for the &lt;strong&gt;second instance&lt;/strong&gt; to launch before continuing. &lt;strong&gt;Ensure that&lt;/strong&gt; both instances are up and running.&lt;/p&gt;
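&lt;p&gt;The capacity change above can also be applied from the CLI; the group name &lt;code&gt;fis-lab-asg&lt;/code&gt; is a placeholder for whatever you named your ASG:&lt;/p&gt;

```shell
# Raise the ASG to two instances so terminating one leaves one serving.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name fis-lab-asg \
  --min-size 2 --desired-capacity 2
```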

&lt;h3&gt;
  
  
  Step 2: Start the Experiment
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; Go to &lt;strong&gt;AWS FIS&lt;/strong&gt; in the Console&lt;/li&gt;
&lt;li&gt; Select &lt;strong&gt;Experiment templates&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Locate&lt;/strong&gt; the experiment template created earlier, and have a look at the “&lt;em&gt;Targets&lt;/em&gt;” &lt;strong&gt;tab&lt;/strong&gt;.
You can see an &lt;strong&gt;EC2 instance&lt;/strong&gt; (as “&lt;em&gt;Resource&lt;/em&gt;”) under the “&lt;em&gt;Preview&lt;/em&gt;”: this shows &lt;strong&gt;which actual resource(s)&lt;/strong&gt; &lt;strong&gt;will be targeted&lt;/strong&gt; when you start this experiment.
The “&lt;em&gt;Target information&lt;/em&gt;” will be something like: “&lt;strong&gt;&lt;em&gt;arn:aws:ec2:-x:xxxxxxxxxx:instance/i-xxxxxxxxxxxx&lt;/em&gt;&lt;/strong&gt;”
&lt;strong&gt;If not&lt;/strong&gt; (if you can’t see anything), &lt;strong&gt;click&lt;/strong&gt; “&lt;em&gt;Generate preview&lt;/em&gt;” on the right and &lt;strong&gt;wait&lt;/strong&gt; a few seconds.&lt;/li&gt;
&lt;li&gt; Click &lt;strong&gt;Start experiment&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Confirm&lt;/strong&gt; the action (since it may cause disruption)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At this point, the experiment transitions to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;State&lt;/strong&gt;: &lt;code&gt;initiating&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Status&lt;/strong&gt;: &lt;code&gt;pending&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
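&lt;p&gt;If you prefer to start and watch the experiment from the CLI, a sketch looks like this; the experiment template ID is a placeholder from your own account:&lt;/p&gt;

```shell
# Start the experiment from the template and read back its status.
EXPERIMENT_ID=$(aws fis start-experiment \
  --experiment-template-id EXTxxxxxxxxxxxx \
  --query 'experiment.id' --output text)
aws fis get-experiment --id "$EXPERIMENT_ID" \
  --query 'experiment.state.status' --output text
```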

&lt;h3&gt;
  
  
  Step 3: Validation
&lt;/h3&gt;

&lt;p&gt;Now, &lt;strong&gt;navigate&lt;/strong&gt; to “&lt;em&gt;EC2 Console&lt;/em&gt;” and check the affected instance(s). The affected instance &lt;strong&gt;should be terminated&lt;/strong&gt; now.&lt;/p&gt;

&lt;p&gt;You &lt;strong&gt;should now see:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;State&lt;/strong&gt;: &lt;code&gt;Running → Shutting-down → Terminated&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  The instance count drops from &lt;strong&gt;2 to 1&lt;/strong&gt; (one is terminated)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Because &lt;strong&gt;one instance remains running&lt;/strong&gt;, the application should continue functioning &lt;strong&gt;without interruption&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  Due to the ASG configuration values, the &lt;strong&gt;ASG will start a new instance&lt;/strong&gt; (it will “replace” the terminated instance).&lt;/li&gt;
&lt;li&gt;  Under “&lt;em&gt;EC2/Auto Scaling groups/test-fs&lt;/em&gt;”, &lt;strong&gt;navigate&lt;/strong&gt; to the “&lt;em&gt;Instance management&lt;/em&gt;” and “&lt;em&gt;Activity&lt;/em&gt;” &lt;strong&gt;tabs&lt;/strong&gt; and &lt;strong&gt;check&lt;/strong&gt; the instance details; you will find the replacement events there&lt;/li&gt;
&lt;/ul&gt;
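&lt;p&gt;The same validation can be done from the CLI, which pairs nicely with the console checks above; the ASG name is again a placeholder:&lt;/p&gt;

```shell
# Confirm the replacement: list running, fis-tagged instances and the
# recent ASG scaling activity that launched the new one.
aws ec2 describe-instances \
  --filters Name=tag:fis,Values=true Name=instance-state-name,Values=running \
  --query 'Reservations[].Instances[].InstanceId'
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name fis-lab-asg --max-items 5
```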




&lt;h2&gt;
  
  
  Best Practices and Takeaways
&lt;/h2&gt;

&lt;p&gt;Implementing fault injection with AWS FIS is not just about testing for failure, &lt;strong&gt;it’s about building confidence in recovery&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Based on this covered scenario, here are the &lt;strong&gt;most important lessons and recommendations&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FIS Experiment Planning&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Define a Clear Hypothesis&lt;/strong&gt;
Every FIS experiment must start with a clearly defined &lt;em&gt;expected behavior&lt;/em&gt; based on your architecture. Avoid blind testing.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Understand Your Architecture’s Limits&lt;/strong&gt;
Knowing that ASGs maintain desired capacity isn’t enough, understand &lt;strong&gt;boot time, warm-up latency&lt;/strong&gt;, and how it affects service availability.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Start Small&lt;/strong&gt;
Begin with &lt;strong&gt;safe experiments&lt;/strong&gt; that target only a portion of your resources (e.g., 50%) and gradually expand the scope once confidence builds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Targeting and Blast Radius Control&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Use Tags and Filters for Target Isolation&lt;/strong&gt;
Always apply precise filters (like &lt;code&gt;tag=fis:true&lt;/code&gt; and &lt;code&gt;state=running&lt;/code&gt;) to limit the impact of experiments to approved resources.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Leverage Selection Modes&lt;/strong&gt;
Use &lt;code&gt;PERCENT&lt;/code&gt; or &lt;code&gt;COUNT&lt;/code&gt; to control how many instances are affected, this is crucial for minimizing unintended disruptions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Security and Access&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Separate IAM Roles&lt;/strong&gt;
Use &lt;strong&gt;two&lt;/strong&gt; distinct IAM roles:
- One for users/automation to manage/run experiments
- One for FIS to assume during execution (with minimal required permissions)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Enable Least Privilege&lt;/strong&gt;
The FIS service role should only have the permissions it needs to execute the specific fault scenario (e.g., &lt;code&gt;ec2:TerminateInstances&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Observability and Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Use CloudWatch Logs and Alarms&lt;/strong&gt;
Send all FIS logs to a dedicated log group for traceability. In production, always define &lt;strong&gt;stop conditions&lt;/strong&gt; tied to CloudWatch alarms.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Preview Before Running&lt;/strong&gt;
Use the &lt;strong&gt;&lt;em&gt;Preview Target&lt;/em&gt; feature&lt;/strong&gt; to verify that FIS has resolved the correct resources, catch misconfigurations before runtime.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Iteration and Learning&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Rerun with Adjustments&lt;/strong&gt;
Use each failed experiment as a learning opportunity. Adjust configurations (e.g., ASG size), then rerun and observe improvements.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Document Everything&lt;/strong&gt;
Maintain a knowledge base of all FIS templates, their assumptions, outcomes, and resulting architecture changes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This initial FIS experiment with &lt;strong&gt;EC2 instances managed by an Auto Scaling Group&lt;/strong&gt; demonstrates how structured fault injection reveals architectural weaknesses, and helps teams improve system resilience with confidence.&lt;/p&gt;

&lt;p&gt;By applying best practices in targeting, IAM, observability, and post-experiment review, chaos engineering becomes a &lt;strong&gt;controlled, safe, and highly effective practice&lt;/strong&gt; in modern AWS environments.&lt;/p&gt;

&lt;p&gt;This is just the start. &lt;strong&gt;In future posts, I would like to explore&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Testing more complex architectures (like EKS, or RDS, multi-AZ)&lt;/li&gt;
&lt;li&gt;  Using stop conditions and CloudWatch alarms&lt;/li&gt;
&lt;li&gt;  Automating chaos experiments in CI/CD workflows&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;About the Author&lt;/strong&gt;&lt;br&gt;
I’m &lt;strong&gt;Róbert Zsótér&lt;/strong&gt;, Kubernetes &amp;amp; AWS architect.&lt;br&gt;
If you’re into &lt;strong&gt;Kubernetes, EKS, Terraform&lt;/strong&gt;, and &lt;strong&gt;cloud-native security&lt;/strong&gt;, follow my latest posts here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LinkedIn: &lt;a href="https://www.linkedin.com/in/r%C3%B3bert-zs%C3%B3t%C3%A9r-34541464/" rel="noopener noreferrer"&gt;Róbert Zsótér&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Substack: &lt;a href="https://cloudskillshu.substack.com"&gt;CSHU&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s build secure, scalable clusters, together.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Originally&lt;/strong&gt; published on &lt;strong&gt;Medium&lt;/strong&gt;: &lt;a href="https://medium.com/@zs77.robert/chaos-engineering-eafd3d7af03d" rel="noopener noreferrer"&gt;Chaos Engineering on AWS - Part 1&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>chaosengineering</category>
      <category>fis</category>
      <category>security</category>
    </item>
  </channel>
</rss>
