<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Amar Tinawi</title>
    <description>The latest articles on DEV Community by Amar Tinawi (@amartinawi).</description>
    <link>https://dev.to/amartinawi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3835350%2F9a38aad8-4304-49b0-8ded-8b6c774cabf6.jpg</url>
      <title>DEV Community: Amar Tinawi</title>
      <link>https://dev.to/amartinawi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/amartinawi"/>
    <language>en</language>
    <item>
      <title>EKS Diagnostics: The Swiss Army Knife</title>
      <dc:creator>Amar Tinawi</dc:creator>
      <pubDate>Fri, 20 Mar 2026 22:40:35 +0000</pubDate>
      <link>https://dev.to/aws-builders/eks-diagnoses-the-swiss-knife-1no9</link>
      <guid>https://dev.to/aws-builders/eks-diagnoses-the-swiss-knife-1no9</guid>
      <description>&lt;h2&gt;
  
  
  The 2 AM wake-up call every Kubernetes engineer dreads — and the tool I built to make it less painful
&lt;/h2&gt;

&lt;p&gt;It's 2 AM. PagerDuty fires. Pods are crashing across your EKS cluster. You SSH in, bleary-eyed, and start the ritual:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; Running
kubectl describe pod &amp;lt;that-one-pod&amp;gt;
kubectl get events &lt;span class="nt"&gt;--sort-by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.lastTimestamp
kubectl logs &amp;lt;pod&amp;gt; &lt;span class="nt"&gt;--previous&lt;/span&gt;
kubectl top nodes
kubectl describe node &amp;lt;node&amp;gt;
aws eks describe-cluster ...
aws logs filter-log-events ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Twenty minutes in, you're staring at a wall of YAML. You've checked six things. You still don't know the &lt;em&gt;root cause&lt;/em&gt;. You don't even know if the node pressure caused the evictions, or if the evictions caused the node pressure.&lt;/p&gt;

&lt;p&gt;After living this loop for two years across dozens of EKS clusters, I built a tool to automate the entire diagnostic process. It runs &lt;strong&gt;73 analysis methods in parallel&lt;/strong&gt;, correlates findings across data sources, identifies the root cause with a confidence score, and hands you a single interactive report — in about 60 seconds.&lt;/p&gt;

&lt;p&gt;This is the &lt;strong&gt;EKS Comprehensive Debugger&lt;/strong&gt;, and here's why it exists and how it works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnlzbl04div4x7zqyym7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnlzbl04div4x7zqyym7.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The Problem: EKS Troubleshooting Is a Scavenger Hunt
&lt;/h2&gt;

&lt;p&gt;Kubernetes failures are rarely isolated. A single root cause — say, a node running out of memory — cascades into a chain of symptoms: pod evictions, rescheduling failures, service endpoint gaps, and eventually user-facing 5xx errors. By the time you're paged, you're looking at the &lt;em&gt;end&lt;/em&gt; of that chain.&lt;/p&gt;

&lt;p&gt;The diagnostic challenge isn't running &lt;code&gt;kubectl&lt;/code&gt;. It's knowing &lt;strong&gt;which&lt;/strong&gt; of the 50+ things to check, &lt;strong&gt;in what order&lt;/strong&gt;, and then &lt;strong&gt;correlating&lt;/strong&gt; findings across completely different data sources — Kubernetes events, pod status, node conditions, CloudWatch metrics, control plane logs, VPC networking, IAM roles, and AWS service quotas — to find the one thing that started it all.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dqndr4x9z31xue4kpt3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dqndr4x9z31xue4kpt3.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most teams solve this in one of three ways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Tribal knowledge.&lt;/strong&gt; The senior engineer who's "seen this before" runs their mental playbook. Works great until they're on vacation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Runbooks.&lt;/strong&gt; Documented checklists. Better, but they go stale, they can't correlate across data sources, and nobody reads a 40-step runbook at 2 AM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Observability platforms.&lt;/strong&gt; Datadog, Grafana, New Relic. Excellent for monitoring, but they show you dashboards — they don't &lt;em&gt;diagnose&lt;/em&gt;. You still need to interpret the data and connect the dots yourself.&lt;/p&gt;

&lt;p&gt;None of these answer the question an on-call engineer actually needs answered: &lt;strong&gt;"What broke, why, and what do I do about it?"&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  What the Tool Does
&lt;/h2&gt;

&lt;p&gt;The EKS Comprehensive Debugger is a single Python script that connects to your cluster and systematically checks everything. It pulls data from four sources simultaneously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes API (via &lt;code&gt;kubectl&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;AWS EKS API&lt;/li&gt;
&lt;li&gt;CloudWatch Logs&lt;/li&gt;
&lt;li&gt;CloudWatch Metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It runs 73 analysis methods in parallel, correlates findings to identify root causes, and generates two output files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An &lt;strong&gt;interactive HTML dashboard&lt;/strong&gt; for humans&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;LLM-ready JSON file&lt;/strong&gt; for AI analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One command, one minute, complete picture.&lt;/p&gt;
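&lt;p&gt;The parallel fan-out is conceptually simple. A minimal sketch, assuming hypothetical collector functions as stand-ins for the tool's real data-source calls:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical collectors: stand-ins for the tool's real data-source calls.
def collect_kubectl():
    return {"source": "kubernetes_api", "pods_not_running": 3}

def collect_eks_api():
    return {"source": "eks_api", "cluster_status": "ACTIVE"}

def collect_cw_logs():
    return {"source": "cloudwatch_logs", "error_lines": 42}

def collect_cw_metrics():
    return {"source": "cloudwatch_metrics", "node_cpu_p95": 0.87}

def run_all(collectors):
    # Fan out every data-source call at once; gather results in submission order.
    with ThreadPoolExecutor(max_workers=len(collectors)) as pool:
        futures = [pool.submit(c) for c in collectors]
        return [f.result() for f in futures]

findings = run_all([collect_kubectl, collect_eks_api,
                    collect_cw_logs, collect_cw_metrics])
print([f["source"] for f in findings])
# ['kubernetes_api', 'eks_api', 'cloudwatch_logs', 'cloudwatch_metrics']
```

The same pattern scales from four collectors to 73 analysis methods: each check is independent, so the wall-clock time is bounded by the slowest call rather than the sum of all of them.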


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/amartinawi" rel="noopener noreferrer"&gt;
        amartinawi
      &lt;/a&gt; / &lt;a href="https://github.com/amartinawi/EKS_Dubugger" rel="noopener noreferrer"&gt;
        EKS_Dubugger
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Production-grade Python diagnostic tool for Amazon EKS cluster troubleshooting
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;EKS Health Check Dashboard&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a href="https://www.python.org/downloads/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/bd7bcdc70784bad7073b66850c51f4fed5dc3b2fc782277551b9013c7d27f043/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f707974686f6e2d332e382b2d626c75652e737667" alt="Python 3.8+"&gt;&lt;/a&gt;
&lt;a href="https://opensource.org/licenses/MIT" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/fdf2982b9f5d7489dcf44570e714e3a15fce6253e0cc6b5aa61a075aac2ff71b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d79656c6c6f772e737667" alt="License: MIT"&gt;&lt;/a&gt;
&lt;a href="https://aws.amazon.com/eks/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/c4b298e68a1680baa3694e791a271783162df364b877f02d48d0ec1a5b326ca2/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4157532d454b532d6f72616e67652e737667" alt="AWS EKS"&gt;&lt;/a&gt;
&lt;a href="https://github.com/amartinawi/EKS_Dubugger#catalog-coverage" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/e80ec49eb622898f05ee5c727595462f0589a67f2698768570a297ccbe27338e/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f636174616c6f67253230636f7665726167652d3130302532352d677265656e2e737667" alt="Catalog Coverage"&gt;&lt;/a&gt;
&lt;a href="https://github.com/amartinawi/EKS_Dubugger#unit-tests" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/de532a4babc278b10679435c7bd59898c70bee4276af45185c406c7446d67a11/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f74657374732d32303725323070617373696e672d627269676874677265656e2e737667" alt="Tests"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A production-grade Python diagnostic tool for Amazon EKS cluster troubleshooting. Analyzes pod evictions, node conditions, OOM kills, CloudWatch metrics, control plane logs, and generates interactive HTML reports with LLM-ready JSON for AI analysis.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Version:&lt;/strong&gt; 3.8.0 | &lt;strong&gt;Analysis Methods:&lt;/strong&gt; 73 | &lt;strong&gt;Catalog Coverage:&lt;/strong&gt; 100% | &lt;strong&gt;Tests:&lt;/strong&gt; 215&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Features&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Comprehensive Issue Detection (73 Analysis Methods)&lt;/h3&gt;
&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h4 class="heading-element"&gt;Pod &amp;amp; Workload Issues&lt;/h4&gt;

&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CrashLoopBackOff&lt;/strong&gt; - Container crash detection with exit code analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ImagePullBackOff&lt;/strong&gt; - Registry authentication, rate limits, network issues&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OOMKilled&lt;/strong&gt; - Memory limit exceeded detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pod Evictions&lt;/strong&gt; - Memory, disk, PID pressure analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Probe Failures&lt;/strong&gt; - Liveness/readiness probe failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Init Container Failures&lt;/strong&gt; - Init container crash, timeout, dependency issues (v3.6.0)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sidecar Health&lt;/strong&gt; - Istio, Envoy, Fluentd sidecar failures (v3.6.0)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stuck Terminating&lt;/strong&gt; - Finalizer and volume detach issues&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment Rollouts&lt;/strong&gt; - ProgressDeadlineExceeded detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jobs/CronJobs&lt;/strong&gt; - BackoffLimitExceeded, missed schedules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;StatefulSets&lt;/strong&gt; - PVC issues, ordinal failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PDB Violations&lt;/strong&gt; - Pod Disruption Budget blocking drains…&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/amartinawi/EKS_Dubugger" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;





&lt;h2&gt;
  
  
  The 73 Checks: What It Actually Analyzes
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6t21bmxo8mw1izipiyv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6t21bmxo8mw1izipiyv.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The tool covers the full EKS stack in categories that map to how failures actually cascade:&lt;/p&gt;

&lt;h3&gt;
  
  
  Pod &amp;amp; Workload Issues
&lt;/h3&gt;

&lt;p&gt;CrashLoopBackOff, OOMKilled, ImagePullBackOff, stuck terminating pods, failed init containers, broken sidecar proxies, deployment rollout failures, StatefulSet issues, PDB violations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Node Health
&lt;/h3&gt;

&lt;p&gt;NotReady nodes, disk/memory/PID pressure, resource saturation &lt;em&gt;(a leading indicator at 90% allocation, before kubelet pressure triggers)&lt;/em&gt;, PLEG issues, container runtime health, kubelet version skew, outdated AMIs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Networking
&lt;/h3&gt;

&lt;p&gt;VPC CNI IP exhaustion, CoreDNS failures, DNS ndots:5 amplification, services with no endpoints, missing Ingress backends, ALB health, conntrack table exhaustion, security group misconfigurations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Control Plane
&lt;/h3&gt;

&lt;p&gt;API server latency and rate limiting, etcd health, controller manager reconciliation failures, admission webhook timeouts, scheduler issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  Storage
&lt;/h3&gt;

&lt;p&gt;Pending PVCs, EBS CSI attachment failures, EFS mount issues, failed volume snapshots.&lt;/p&gt;

&lt;h3&gt;
  
  
  IAM &amp;amp; Security
&lt;/h3&gt;

&lt;p&gt;RBAC errors, IRSA/Pod Identity credential failures, privileged containers, sensitive host path mounts, PSA violations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Autoscaling
&lt;/h3&gt;

&lt;p&gt;Cluster Autoscaler issues, Karpenter provisioning and drift, HPA metrics source health, topology spread constraint violations.&lt;/p&gt;

&lt;p&gt;Each finding is classified as either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Historical Event&lt;/strong&gt; — something that happened during your scan window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Current State&lt;/strong&gt; — what the cluster looks like right now&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This distinction separates "what's happening now" from "what happened during the incident window."&lt;/p&gt;
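&lt;p&gt;In essence, the split is a timestamp check against the scan window. A sketch with hypothetical field names (the tool's real schema differs):&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

def classify(finding, window_start, now):
    # A finding with an event timestamp inside the scan window is historical;
    # one with no timestamp describes present resource status (current state).
    ts = finding.get("event_time")  # hypothetical field name
    if ts is None:
        return "current_state"
    if ts >= window_start and now >= ts:
        return "historical_event"
    return "outside_window"

now = datetime.now(timezone.utc)
window_start = now - timedelta(days=1)

evicted = {"event_time": now - timedelta(hours=2)}  # happened during the window
pending_pvc = {}  # no event timestamp: a present-state observation

print(classify(evicted, window_start, now))      # historical_event
print(classify(pending_pvc, window_start, now))  # current_state
```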




&lt;h2&gt;
  
  
  The Part That Actually Matters: Root Cause Detection
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjaxdxv3h8303ypnjrzj8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjaxdxv3h8303ypnjrzj8.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Listing problems is easy. Any monitoring tool can tell you "5 pods crashed." The hard part is answering &lt;em&gt;why&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The debugger's correlation engine connects findings across data sources using a &lt;strong&gt;5-dimensional confidence scoring system&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Weight&lt;/th&gt;
&lt;th&gt;What It Measures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Temporal&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;Did the cause happen &lt;em&gt;before&lt;/em&gt; the effect?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spatial&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;Same node, namespace, or pod?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mechanism&lt;/td&gt;
&lt;td&gt;25%&lt;/td&gt;
&lt;td&gt;Known causal relationship?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Exclusivity&lt;/td&gt;
&lt;td&gt;15%&lt;/td&gt;
&lt;td&gt;Only plausible explanation?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reproducibility&lt;/td&gt;
&lt;td&gt;10%&lt;/td&gt;
&lt;td&gt;Pattern occurred multiple times?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These combine into a composite confidence score mapped to a tier: &lt;strong&gt;high&lt;/strong&gt; (≥75%), &lt;strong&gt;medium&lt;/strong&gt; (≥50%), or &lt;strong&gt;low&lt;/strong&gt; (&amp;lt;50%).&lt;/p&gt;
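&lt;p&gt;Assuming a straightforward weighted sum over the table above (the tool's actual aggregation may differ), the scoring reduces to a few lines:&lt;/p&gt;

```python
# Weights from the 5-dimensional scoring table; they sum to 1.0.
WEIGHTS = {
    "temporal": 0.30, "spatial": 0.20, "mechanism": 0.25,
    "exclusivity": 0.15, "reproducibility": 0.10,
}

def composite(scores):
    # Weighted sum of the five dimensions, each scored in [0, 1].
    return sum(WEIGHTS[d] * scores.get(d, 0.0) for d in WEIGHTS)

def tier(score):
    # Map the composite score to the high / medium / low tiers.
    if score >= 0.75:
        return "high"
    if score >= 0.50:
        return "medium"
    return "low"

scores = {"temporal": 0.9, "spatial": 0.6, "mechanism": 0.8,
          "exclusivity": 0.5, "reproducibility": 0.3}
c = composite(scores)
print(round(c, 3), tier(c))  # 0.695 medium
```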

&lt;p&gt;Here's what a real detection looks like in the JSON output — a cluster upgrade identified as root cause with 92% confidence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"potential_root_causes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"correlation_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cluster_upgrade"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"root_cause"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Cluster version upgrade in progress or recently completed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"confidence_tier"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"composite_confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"confidence_5d"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"temporal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"spatial"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"mechanism"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"exclusivity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"reproducibility"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;temporal: 1.0&lt;/code&gt; — the AWS API confirmed the upgrade timestamp preceded all other findings.&lt;br&gt;
&lt;code&gt;mechanism: 1.0&lt;/code&gt; — "cluster upgrade causes transient failures" is a well-established causal relationship.&lt;br&gt;
&lt;code&gt;reproducibility: 0.0&lt;/code&gt; — upgrades are one-time events. The other four dimensions still provide strong evidence.&lt;/p&gt;

&lt;p&gt;This is the reasoning a senior SRE does intuitively. The tool makes it systematic, consistent, and available at 2 AM without waking anyone up.&lt;/p&gt;


&lt;h2&gt;
  
  
  Actionable Output: Not Just What's Wrong — What to Do
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx6dytn8zd69b0dn18m5r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx6dytn8zd69b0dn18m5r.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every finding includes contextual remediation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Diagnostic commands to investigate further&lt;/li&gt;
&lt;li&gt;Fix commands to resolve the issue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both are &lt;strong&gt;pre-populated with actual resource names&lt;/strong&gt; from your cluster.&lt;/p&gt;

&lt;p&gt;The commands aren't generic templates: in the report above, &lt;code&gt;sg-0af46ef489f81f6d0&lt;/code&gt; is the actual security group ID pulled from the cluster. Copy, paste, run.&lt;/p&gt;
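&lt;p&gt;Under the hood, pre-populating is string templating over the resource inventory collected during the scan. A hypothetical sketch (template keys and finding fields invented for illustration):&lt;/p&gt;

```python
# Hypothetical remediation templates keyed by finding type.
TEMPLATES = {
    "sg_misconfigured": (
        "aws ec2 describe-security-groups --group-ids {sg_id} "
        "--query 'SecurityGroups[0].IpPermissions'"
    ),
}

def render(finding):
    # Substitute real resource names captured during the scan.
    return TEMPLATES[finding["type"]].format(**finding["resources"])

finding = {"type": "sg_misconfigured",
           "resources": {"sg_id": "sg-0af46ef489f81f6d0"}}
print(render(finding))
```

Because the substitution happens at report-generation time, every rendered command is already scoped to the exact cluster, node, or security group the finding came from.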


&lt;h2&gt;
  
  
  The LLM-Ready JSON: Built for AI Analysis
&lt;/h2&gt;

&lt;p&gt;The HTML report is for humans. The JSON is for AI.&lt;/p&gt;

&lt;p&gt;Every run produces a structured JSON file optimized for feeding into an LLM. The schema includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full analysis context (cluster, region, time range)&lt;/li&gt;
&lt;li&gt;Findings with severity classifications&lt;/li&gt;
&lt;li&gt;5D confidence-scored correlations&lt;/li&gt;
&lt;li&gt;Prioritized recommendations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Practical workflow: paste the JSON into Claude or GPT and ask, &lt;em&gt;"What's the most important thing to fix first and why?"&lt;/em&gt; The confidence tiers and spatial evidence give the model enough context to prioritize correctly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run the analysis&lt;/span&gt;
python eks_comprehensive_debugger.py &lt;span class="nt"&gt;--profile&lt;/span&gt; prod &lt;span class="nt"&gt;--region&lt;/span&gt; eu-west-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cluster-name&lt;/span&gt; production &lt;span class="nt"&gt;--days&lt;/span&gt; 1

&lt;span class="c"&gt;# Two files generated:&lt;/span&gt;
&lt;span class="c"&gt;# production-eks-report-20260301-035821.html    ← for humans&lt;/span&gt;
&lt;span class="c"&gt;# production-eks-findings-20260301-035821.json   ← for AI&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  How to Run It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone and install&lt;/span&gt;
git clone https://github.com/amartinawi/EKS_Dubugger
&lt;span class="nb"&gt;cd &lt;/span&gt;eks-debugger
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Basic usage (auto-detects cluster)&lt;/span&gt;
python eks_comprehensive_debugger.py &lt;span class="nt"&gt;--profile&lt;/span&gt; prod &lt;span class="nt"&gt;--region&lt;/span&gt; eu-west-1

&lt;span class="c"&gt;# Incident investigation: last 2 hours&lt;/span&gt;
python eks_comprehensive_debugger.py &lt;span class="nt"&gt;--profile&lt;/span&gt; prod &lt;span class="nt"&gt;--region&lt;/span&gt; eu-west-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cluster-name&lt;/span&gt; my-cluster &lt;span class="nt"&gt;--hours&lt;/span&gt; 2

&lt;span class="c"&gt;# Post-mortem: specific time window&lt;/span&gt;
python eks_comprehensive_debugger.py &lt;span class="nt"&gt;--profile&lt;/span&gt; prod &lt;span class="nt"&gt;--region&lt;/span&gt; eu-west-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cluster-name&lt;/span&gt; my-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-date&lt;/span&gt; &lt;span class="s2"&gt;"2026-01-26T08:00:00"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-date&lt;/span&gt; &lt;span class="s2"&gt;"2026-01-27T18:00:00"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--timezone&lt;/span&gt; &lt;span class="s2"&gt;"America/New_York"&lt;/span&gt;

&lt;span class="c"&gt;# Private cluster via SSM tunnel&lt;/span&gt;
python eks_comprehensive_debugger.py &lt;span class="nt"&gt;--profile&lt;/span&gt; prod &lt;span class="nt"&gt;--region&lt;/span&gt; eu-west-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cluster-name&lt;/span&gt; my-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--kube-context&lt;/span&gt; my-cluster-ssm-tunnel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt; Python 3.8+, &lt;code&gt;kubectl&lt;/code&gt; configured, AWS CLI with credentials. Read-only access to EKS, CloudWatch, EC2 — no write permissions required.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned Building This
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Most EKS issues are knowable from existing data.&lt;/strong&gt; The Kubernetes API and CloudWatch already have the information to diagnose 90% of problems. The bottleneck isn't data collection — it's knowing what to look for and how to connect the dots.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause detection is about correlation, not classification.&lt;/strong&gt; Classifying a finding as "critical" is easy. Determining that &lt;em&gt;this&lt;/em&gt; node pressure caused &lt;em&gt;those&lt;/em&gt; pod evictions on &lt;em&gt;that&lt;/em&gt; node during &lt;em&gt;this&lt;/em&gt; time window requires reasoning across multiple dimensions. Spatial correlation — matching cause and effect by node/pod/namespace identity — was the single biggest accuracy improvement.&lt;/p&gt;
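&lt;p&gt;Spatial correlation can be sketched as bucketing findings by the resource they touch, so a node-level cause is only paired with effects on the same node. A toy example with a hypothetical finding structure:&lt;/p&gt;

```python
from collections import defaultdict

findings = [
    {"kind": "memory_pressure", "node": "ip-10-0-1-23", "role": "cause"},
    {"kind": "pod_evicted",     "node": "ip-10-0-1-23", "role": "effect"},
    {"kind": "pod_evicted",     "node": "ip-10-0-2-99", "role": "effect"},
]

def pair_by_node(findings):
    # Bucket findings per node, then pair causes with co-located effects only.
    by_node = defaultdict(list)
    for f in findings:
        by_node[f["node"]].append(f)
    pairs = []
    for node, group in by_node.items():
        causes  = [f for f in group if f["role"] == "cause"]
        effects = [f for f in group if f["role"] == "effect"]
        pairs.extend((c["kind"], e["kind"], node)
                     for c in causes for e in effects)
    return pairs

print(pair_by_node(findings))
# [('memory_pressure', 'pod_evicted', 'ip-10-0-1-23')]
```

The eviction on &lt;code&gt;ip-10-0-2-99&lt;/code&gt; has no co-located cause, so it is never paired: that refusal to link findings on different nodes is what keeps the correlation from inventing causal chains.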

&lt;p&gt;&lt;strong&gt;The tool is most valuable when nothing is wrong.&lt;/strong&gt; Running it proactively — after an upgrade, after a config change, as part of a weekly health check — catches issues before they page you. Version skew detection, deprecated API scanning, and resource saturation warnings are all leading indicators.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI-ready output changes the workflow.&lt;/strong&gt; Structured JSON with confidence-scored root causes means you can ask an AI assistant to explain findings, draft an incident report, or suggest an architecture change — and it has enough structured evidence to do it well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffa8h40vtj7nx24rty1r8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffa8h40vtj7nx24rty1r8.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD integration&lt;/strong&gt; — Run as a post-deployment check; fail the deployment if critical issues are detected.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduled health reports&lt;/strong&gt; — Weekly automated runs with delta reporting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-cluster support&lt;/strong&gt; — Aggregate findings across clusters for fleet-wide visibility.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;If you're managing EKS clusters and spending too much time on diagnostics, give it a try. The tool is open source, a single Python file, no infrastructure dependencies — just point it at your cluster and run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/amartinawi/EKS_Dubugger" rel="noopener noreferrer"&gt;https://github.com/amartinawi/EKS_Dubugger&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Issues, PRs, and feature ideas welcome.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>eks</category>
      <category>kubernetes</category>
      <category>k8s</category>
    </item>
  </channel>
</rss>
