DEV Community

EKS Diagnostics: The Swiss Army Knife

The 2 AM wake-up call every Kubernetes engineer dreads — and the tool I built to make it less painful

It's 2 AM. PagerDuty fires. Pods are crashing across your EKS cluster. You SSH in, bleary-eyed, and start the ritual:

kubectl get pods --all-namespaces | grep -v Running
kubectl describe pod <that-one-pod>
kubectl get events --sort-by=.lastTimestamp
kubectl logs <pod> --previous
kubectl top nodes
kubectl describe node <node>
aws eks describe-cluster ...
aws logs filter-log-events ...

Twenty minutes in, you're staring at a wall of YAML. You've checked six things. You still don't know the root cause. You don't even know if the node pressure caused the evictions, or if the evictions caused the node pressure.

After living this loop for two years across dozens of EKS clusters, I built a tool to automate the entire diagnostic process. It runs 73 analysis methods in parallel, correlates findings across data sources, identifies the root cause with a confidence score, and hands you a single interactive report — in about 60 seconds.

This is the EKS Comprehensive Debugger, and here's why it exists and how it works.


The Problem: EKS Troubleshooting Is a Scavenger Hunt

Kubernetes failures are rarely isolated. A single root cause — say, a node running out of memory — cascades into a chain of symptoms: pod evictions, rescheduling failures, service endpoint gaps, and eventually user-facing 5xx errors. By the time you're paged, you're looking at the end of that chain.

The diagnostic challenge isn't running kubectl. It's knowing which of the 50+ things to check, in what order, and then correlating findings across completely different data sources — Kubernetes events, pod status, node conditions, CloudWatch metrics, control plane logs, VPC networking, IAM roles, and AWS service quotas — to find the one thing that started it all.

Most teams solve this in one of three ways:

1. Tribal knowledge. The senior engineer who's "seen this before" runs their mental playbook. Works great until they're on vacation.

2. Runbooks. Documented checklists. Better, but they go stale, they can't correlate across data sources, and nobody reads a 40-step runbook at 2 AM.

3. Observability platforms. Datadog, Grafana, New Relic. Excellent for monitoring, but they show you dashboards — they don't diagnose. You still need to interpret the data and connect the dots yourself.

None of these answer the question an on-call engineer actually needs answered: "What broke, why, and what do I do about it?"


What the Tool Does

The EKS Comprehensive Debugger is a single Python script that connects to your cluster and systematically checks everything. It pulls data from four sources simultaneously:

  • Kubernetes API (via kubectl)
  • AWS EKS API
  • CloudWatch Logs
  • CloudWatch Metrics

It runs 73 analysis methods in parallel, correlates findings to identify root causes, and generates two output files:

  • An interactive HTML dashboard for humans
  • An LLM-ready JSON file for AI analysis

One command, one minute, complete picture.
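Under the hood, "73 methods in parallel" is the classic fan-out pattern: each check is an independent callable, run concurrently, and a failure in one collector must not abort the rest. A minimal sketch (function names and finding shapes are illustrative, not the tool's actual API):

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative stand-ins for two of the 73 analysis methods.
def check_pod_crashloops():
    return [{"check": "crashloop", "severity": "critical"}]

def check_node_pressure():
    return [{"check": "node_pressure", "severity": "warning"}]

ANALYSIS_METHODS = [check_pod_crashloops, check_node_pressure]

def run_all_checks(methods, max_workers=8):
    """Run every analysis method concurrently; never let one crash the run."""
    findings = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(m): m.__name__ for m in methods}
        for future, name in futures.items():
            try:
                findings.extend(future.result(timeout=60))
            except Exception as exc:
                # A failed collector becomes a finding itself, not a crash.
                findings.append({"check": name, "error": str(exc)})
    return findings
```

The key design choice is in the `except` branch: when a data source is unreachable, that fact is itself diagnostic signal and belongs in the report.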

GitHub: amartinawi/EKS_Dubugger

A production-grade Python diagnostic tool for Amazon EKS cluster troubleshooting. Analyzes pod evictions, node conditions, OOM kills, CloudWatch metrics, and control plane logs, and generates interactive HTML reports with LLM-ready JSON for AI analysis.

Version: 3.8.0 | Analysis Methods: 73 | Catalog Coverage: 100% | Tests: 215



The 73 Checks: What It Actually Analyzes

The tool covers the full EKS stack in categories that map to how failures actually cascade:

Pod & Workload Issues

CrashLoopBackOff, OOMKilled, ImagePullBackOff, stuck terminating pods, failed init containers, broken sidecar proxies, deployment rollout failures, StatefulSet issues, PDB violations.

Node Health

NotReady nodes, disk/memory/PID pressure, resource saturation (a leading indicator at 90% allocation, before kubelet pressure triggers), PLEG issues, container runtime health, kubelet version skew, outdated AMIs.

Networking

VPC CNI IP exhaustion, CoreDNS failures, DNS ndots:5 amplification, services with no endpoints, missing Ingress backends, ALB health, conntrack table exhaustion, security group misconfigurations.

Control Plane

API server latency and rate limiting, etcd health, controller manager reconciliation failures, admission webhook timeouts, scheduler issues.

Storage

Pending PVCs, EBS CSI attachment failures, EFS mount issues, failed volume snapshots.

IAM & Security

RBAC errors, IRSA/Pod Identity credential failures, privileged containers, sensitive host path mounts, PSA violations.

Autoscaling

Cluster Autoscaler issues, Karpenter provisioning and drift, HPA metrics source health, topology spread constraint violations.

Each finding is classified as either:

  • Historical Event — something that happened during your scan window
  • Current State — what the cluster looks like right now

The distinction matters because the response differs: a pressure event that has already cleared informs the post-mortem, while a condition that still holds needs action now.
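The classification above can be sketched in a few lines (an illustrative schema, not the tool's actual one):

```python
from datetime import datetime, timedelta, timezone

def classify(finding, window_start, window_end):
    """Label a finding as Current State or Historical Event.

    Assumes each finding carries a timestamp and a still_present flag
    (hypothetical field names for illustration).
    """
    if finding.get("still_present"):
        return "current_state"
    if window_start <= finding["timestamp"] <= window_end:
        return "historical_event"
    return "out_of_window"
```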


The Part That Actually Matters: Root Cause Detection

Listing problems is easy. Any monitoring tool can tell you "5 pods crashed." The hard part is answering why.

The debugger's correlation engine connects findings across data sources using a 5-dimensional confidence scoring system:

Dimension         Weight   What It Measures
Temporal          30%      Did the cause happen before the effect?
Spatial           20%      Same node, namespace, or pod?
Mechanism         25%      Known causal relationship?
Exclusivity       15%      Only plausible explanation?
Reproducibility   10%      Pattern occurred multiple times?

These combine into a composite confidence score mapped to a tier: high (≥75%), medium (≥50%), or low (<50%).
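The scheme can be sketched as a plain weighted sum plus tier thresholds (an illustration of the weights and cutoffs described above; the tool's actual aggregation may apply adjustments this sketch omits, e.g. when a dimension is inapplicable):

```python
# Weights from the 5-dimensional scoring table above.
WEIGHTS = {
    "temporal": 0.30,
    "spatial": 0.20,
    "mechanism": 0.25,
    "exclusivity": 0.15,
    "reproducibility": 0.10,
}

def composite_confidence(scores):
    """Weighted sum of per-dimension scores in [0, 1]."""
    return sum(WEIGHTS[dim] * scores.get(dim, 0.0) for dim in WEIGHTS)

def tier(composite):
    """Map a composite score to the high/medium/low tier."""
    if composite >= 0.75:
        return "high"
    if composite >= 0.50:
        return "medium"
    return "low"
```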

Here's what a real detection looks like in the JSON output — a cluster upgrade identified as root cause with 92% confidence:

{
  "potential_root_causes": [{
    "correlation_type": "cluster_upgrade",
    "root_cause": "Cluster version upgrade in progress or recently completed",
    "confidence_tier": "high",
    "composite_confidence": 0.92,
    "confidence_5d": {
      "temporal": 1.0,
      "spatial": 0.8,
      "mechanism": 1.0,
      "exclusivity": 0.9,
      "reproducibility": 0.0
    }
  }]
}

temporal: 1.0 — the AWS API confirmed the upgrade timestamp preceded all other findings.
mechanism: 1.0 — "cluster upgrade causes transient failures" is a well-established causal relationship.
reproducibility: 0.0 — upgrades are one-time events. The other four dimensions still provide strong evidence.

This is the reasoning a senior SRE does intuitively. The tool makes it systematic, consistent, and available at 2 AM without waking anyone up.


Actionable Output: Not Just What's Wrong — What to Do

Every finding includes contextual remediation with:

  • Diagnostic commands to investigate further
  • Fix commands to resolve the issue
  • Both pre-populated with actual resource names from your cluster

The commands aren't generic templates. A generated fix references the actual resource identifiers from your cluster, for instance a real security group ID like sg-0af46ef489f81f6d0, not a <placeholder>. Copy, paste, run.
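The pre-population amounts to template rendering over finding metadata. A hypothetical sketch (template text and finding fields are invented for illustration, not taken from the tool):

```python
# Each issue type maps to diagnostic/fix command templates with named slots.
TEMPLATES = {
    "sg_misconfig": {
        "diagnose": "aws ec2 describe-security-groups --group-ids {sg_id}",
        "fix": (
            "aws ec2 authorize-security-group-ingress --group-id {sg_id} "
            "--protocol tcp --port 443 --source-group {peer_sg_id}"
        ),
    }
}

def render_remediation(finding):
    """Fill the templates with the concrete resource names from a finding."""
    tpl = TEMPLATES[finding["type"]]
    return {k: v.format(**finding["resources"]) for k, v in tpl.items()}

cmds = render_remediation({
    "type": "sg_misconfig",
    "resources": {
        "sg_id": "sg-0af46ef489f81f6d0",
        "peer_sg_id": "sg-0123456789abcdef0",
    },
})
```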


The LLM-Ready JSON: Built for AI Analysis

The HTML report is for humans. The JSON is for AI.

Every run produces a structured JSON file optimized for feeding into an LLM. The schema includes:

  • Full analysis context (cluster, region, time range)
  • Findings with severity classifications
  • 5D confidence-scored correlations
  • Prioritized recommendations

Practical workflow: paste the JSON into Claude or GPT and ask, "What's the most important thing to fix first and why?" The confidence tiers and spatial evidence give the model enough context to prioritize correctly.
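That workflow is easy to script rather than paste by hand. A sketch that turns the findings file into a prompt (field names follow the JSON example earlier; adapt to the actual schema):

```python
import json

def build_prompt(findings_path):
    """Summarize confidence-scored root causes into an LLM prompt."""
    with open(findings_path) as f:
        data = json.load(f)
    causes = data.get("potential_root_causes", [])
    summary = "\n".join(
        f"- [{c['confidence_tier']}] {c['root_cause']} "
        f"(confidence {c['composite_confidence']:.2f})"
        for c in causes
    )
    return (
        "You are an SRE assistant. Given these confidence-scored root causes "
        "from an EKS diagnostic run, state the most important thing to fix "
        "first and why.\n" + summary
    )
```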

# Run the analysis
python eks_comprehensive_debugger.py --profile prod --region eu-west-1 \
  --cluster-name production --days 1

# Two files generated:
# production-eks-report-20260301-035821.html    ← for humans
# production-eks-findings-20260301-035821.json   ← for AI

How to Run It

# Clone and install
git clone https://github.com/amartinawi/EKS_Dubugger
cd EKS_Dubugger
pip install -r requirements.txt

# Basic usage (auto-detects cluster)
python eks_comprehensive_debugger.py --profile prod --region eu-west-1

# Incident investigation: last 2 hours
python eks_comprehensive_debugger.py --profile prod --region eu-west-1 \
  --cluster-name my-cluster --hours 2

# Post-mortem: specific time window
python eks_comprehensive_debugger.py --profile prod --region eu-west-1 \
  --cluster-name my-cluster \
  --start-date "2026-01-26T08:00:00" \
  --end-date "2026-01-27T18:00:00" \
  --timezone "America/New_York"

# Private cluster via SSM tunnel
python eks_comprehensive_debugger.py --profile prod --region eu-west-1 \
  --cluster-name my-cluster \
  --kube-context my-cluster-ssm-tunnel

Prerequisites: Python 3.8+, kubectl configured, AWS CLI with credentials. Read-only access to EKS, CloudWatch, EC2 — no write permissions required.
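If you want to scope the credentials tightly, a read-only IAM policy along these lines should cover those services (a sketch; verify the action list against the API calls the tool actually makes):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "eks:DescribeCluster",
        "eks:ListClusters",
        "eks:ListNodegroups",
        "eks:DescribeNodegroup",
        "logs:DescribeLogGroups",
        "logs:FilterLogEvents",
        "cloudwatch:GetMetricData",
        "cloudwatch:ListMetrics",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSubnets",
        "ec2:DescribeInstances"
      ],
      "Resource": "*"
    }
  ]
}
```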


What I Learned Building This

Most EKS issues are knowable from existing data. The Kubernetes API and CloudWatch already have the information to diagnose 90% of problems. The bottleneck isn't data collection — it's knowing what to look for and how to connect the dots.

Root cause detection is about correlation, not classification. Classifying a finding as "critical" is easy. Determining that this node pressure caused those pod evictions on that node during this time window requires reasoning across multiple dimensions. Spatial correlation — matching cause and effect by node/pod/namespace identity — was the single biggest accuracy improvement.

The tool is most valuable when nothing is wrong. Running it proactively — after an upgrade, after a config change, as part of a weekly health check — catches issues before they page you. Version skew detection, deprecated API scanning, and resource saturation warnings are all leading indicators.

AI-ready output changes the workflow. Structured JSON with confidence-scored root causes means you can ask an AI assistant to explain findings, draft an incident report, or suggest an architecture change — and it has enough structured evidence to do it well.


What's Next

  1. CI/CD integration — Run as a post-deployment check; fail the deployment if critical issues are detected.
  2. Scheduled health reports — Weekly automated runs with delta reporting.
  3. Multi-cluster support — Aggregate findings across clusters for fleet-wide visibility.

If you're managing EKS clusters and spending too much time on diagnostics, give it a try. The tool is open source, a single Python file, no infrastructure dependencies — just point it at your cluster and run.

GitHub: https://github.com/amartinawi/EKS_Dubugger

Issues, PRs, and feature ideas welcome.
