The 2 AM wake-up call every Kubernetes engineer dreads — and the tool I built to make it less painful
It's 2 AM. PagerDuty fires. Pods are crashing across your EKS cluster. You SSH in, bleary-eyed, and start the ritual:
```shell
kubectl get pods --all-namespaces | grep -v Running
kubectl describe pod <that-one-pod>
kubectl get events --sort-by=.lastTimestamp
kubectl logs <pod> --previous
kubectl top nodes
kubectl describe node <node>
aws eks describe-cluster ...
aws logs filter-log-events ...
```
Twenty minutes in, you're staring at a wall of YAML. You've checked six things. You still don't know the root cause. You don't even know if the node pressure caused the evictions, or if the evictions caused the node pressure.
After living this loop for two years across dozens of EKS clusters, I built a tool to automate the entire diagnostic process. It runs 73 analysis methods in parallel, correlates findings across data sources, identifies the root cause with a confidence score, and hands you a single interactive report — in about 60 seconds.
This is the EKS Comprehensive Debugger, and here's why it exists and how it works.
The Problem: EKS Troubleshooting Is a Scavenger Hunt
Kubernetes failures are rarely isolated. A single root cause — say, a node running out of memory — cascades into a chain of symptoms: pod evictions, rescheduling failures, service endpoint gaps, and eventually user-facing 5xx errors. By the time you're paged, you're looking at the end of that chain.
The diagnostic challenge isn't running kubectl. It's knowing which of the 50+ things to check, in what order, and then correlating findings across completely different data sources — Kubernetes events, pod status, node conditions, CloudWatch metrics, control plane logs, VPC networking, IAM roles, and AWS service quotas — to find the one thing that started it all.
Most teams solve this in one of three ways:
1. Tribal knowledge. The senior engineer who's "seen this before" runs their mental playbook. Works great until they're on vacation.
2. Runbooks. Documented checklists. Better, but they go stale, they can't correlate across data sources, and nobody reads a 40-step runbook at 2 AM.
3. Observability platforms. Datadog, Grafana, New Relic. Excellent for monitoring, but they show you dashboards — they don't diagnose. You still need to interpret the data and connect the dots yourself.
None of these answer the question an on-call engineer actually needs answered: "What broke, why, and what do I do about it?"
What the Tool Does
The EKS Comprehensive Debugger is a single Python script that connects to your cluster and systematically checks everything. It pulls data from four sources simultaneously:
- Kubernetes API (via `kubectl`)
- AWS EKS API
- CloudWatch Logs
- CloudWatch Metrics
It runs 73 analysis methods in parallel, correlates findings to identify root causes, and generates two output files:
- An interactive HTML dashboard for humans
- An LLM-ready JSON file for AI analysis
One command, one minute, complete picture.
The 73 Checks: What It Actually Analyzes
The tool covers the full EKS stack in categories that map to how failures actually cascade:
Pod & Workload Issues
CrashLoopBackOff, OOMKilled, ImagePullBackOff, stuck terminating pods, failed init containers, broken sidecar proxies, deployment rollout failures, StatefulSet issues, PDB violations.
Node Health
NotReady nodes, disk/memory/PID pressure, resource saturation (a leading indicator at 90% allocation, before kubelet pressure triggers), PLEG issues, container runtime health, kubelet version skew, outdated AMIs.
Networking
VPC CNI IP exhaustion, CoreDNS failures, DNS ndots:5 amplification, services with no endpoints, missing Ingress backends, ALB health, conntrack table exhaustion, security group misconfigurations.
Control Plane
API server latency and rate limiting, etcd health, controller manager reconciliation failures, admission webhook timeouts, scheduler issues.
Storage
Pending PVCs, EBS CSI attachment failures, EFS mount issues, failed volume snapshots.
IAM & Security
RBAC errors, IRSA/Pod Identity credential failures, privileged containers, sensitive host path mounts, PSA violations.
Autoscaling
Cluster Autoscaler issues, Karpenter provisioning and drift, HPA metrics source health, topology spread constraint violations.
Each finding is classified as either:
- Historical Event — something that happened during your scan window
- Current State — what the cluster looks like right now
The split matters because the two call for different responses: current-state findings need action now, while historical events explain how the incident unfolded.
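One way to picture the split in code. This is a minimal sketch, not the tool's actual schema: the function name, field semantics, and the 5-minute freshness cutoff are all my own assumptions.

```python
# Illustrative sketch of the historical-vs-current split.
# The freshness cutoff and classification logic are assumptions,
# not the debugger's actual implementation.
from datetime import datetime, timedelta, timezone

def classify_finding(event_time, scan_end, freshness=timedelta(minutes=5)):
    """Findings observed near the end of the scan window describe current state;
    older ones are historical events from the incident window."""
    if scan_end - event_time <= freshness:
        return "current_state"
    return "historical_event"

scan_end = datetime(2026, 3, 1, 3, 58, tzinfo=timezone.utc)
print(classify_finding(scan_end - timedelta(hours=2), scan_end))    # historical_event
print(classify_finding(scan_end - timedelta(minutes=1), scan_end))  # current_state
```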
The Part That Actually Matters: Root Cause Detection
Listing problems is easy. Any monitoring tool can tell you "5 pods crashed." The hard part is answering why.
The debugger's correlation engine connects findings across data sources using a 5-dimensional confidence scoring system:
| Dimension | Weight | What It Measures |
|---|---|---|
| Temporal | 30% | Did the cause happen before the effect? |
| Spatial | 20% | Same node, namespace, or pod? |
| Mechanism | 25% | Known causal relationship? |
| Exclusivity | 15% | Only plausible explanation? |
| Reproducibility | 10% | Pattern occurred multiple times? |
These combine into a composite confidence score mapped to a tier: high (≥75%), medium (≥50%), or low (<50%).
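As a rough sketch of the arithmetic: the weights come from the table above, but the helper names are mine, and the real implementation may combine or normalize dimensions differently (a plain weighted sum of the example scores below yields 0.845, which is not necessarily what the tool would report).

```python
# Sketch of 5-dimensional confidence scoring (illustrative, not the tool's code).
WEIGHTS = {
    "temporal": 0.30,         # did the cause precede the effect?
    "spatial": 0.20,          # same node, namespace, or pod?
    "mechanism": 0.25,        # known causal relationship?
    "exclusivity": 0.15,      # only plausible explanation?
    "reproducibility": 0.10,  # pattern occurred multiple times?
}

def composite_confidence(scores):
    """Weighted sum of per-dimension scores, each in [0, 1]."""
    return sum(WEIGHTS[dim] * scores.get(dim, 0.0) for dim in WEIGHTS)

def confidence_tier(composite):
    """Map a composite score onto the high / medium / low tiers."""
    if composite >= 0.75:
        return "high"
    if composite >= 0.50:
        return "medium"
    return "low"

scores = {"temporal": 1.0, "spatial": 0.8, "mechanism": 1.0,
          "exclusivity": 0.9, "reproducibility": 0.0}
c = composite_confidence(scores)
print(f"{c:.3f} -> {confidence_tier(c)}")  # 0.845 -> high
```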
Here's what a real detection looks like in the JSON output — a cluster upgrade identified as root cause with 92% confidence:
```json
{
  "potential_root_causes": [{
    "correlation_type": "cluster_upgrade",
    "root_cause": "Cluster version upgrade in progress or recently completed",
    "confidence_tier": "high",
    "composite_confidence": 0.92,
    "confidence_5d": {
      "temporal": 1.0,
      "spatial": 0.8,
      "mechanism": 1.0,
      "exclusivity": 0.9,
      "reproducibility": 0.0
    }
  }]
}
```
- `temporal: 1.0` — the AWS API confirmed the upgrade timestamp preceded all other findings.
- `mechanism: 1.0` — "cluster upgrade causes transient failures" is a well-established causal relationship.
- `reproducibility: 0.0` — upgrades are one-time events. The other four dimensions still provide strong evidence.
This is the reasoning a senior SRE does intuitively. The tool makes it systematic, consistent, and available at 2 AM without waking anyone up.
Actionable Output: Not Just What's Wrong — What to Do
Every finding includes contextual remediation with:
- Diagnostic commands to investigate further
- Fix commands to resolve the issue
- Both pre-populated with actual resource names from your cluster
The commands aren't generic templates. When a finding involves, say, security group sg-0af46ef489f81f6d0, the remediation commands reference that exact ID from your cluster. Copy, paste, run.
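Under the hood, simple template substitution over the finding's resources would be enough to produce this. A hedged sketch: the finding type, the templates, the port, and the peer security group ID below are all hypothetical illustrations; only `sg-0af46ef489f81f6d0` is the real ID mentioned above.

```python
# Sketch: pre-populating remediation commands with real resource names.
# The finding type, templates, port, and peer SG are hypothetical;
# only sg-0af46ef489f81f6d0 is the real ID from the cluster.
REMEDIATION_TEMPLATES = {
    "sg_missing_ingress": {
        "diagnose": "aws ec2 describe-security-groups --group-ids {sg_id}",
        "fix": ("aws ec2 authorize-security-group-ingress --group-id {sg_id} "
                "--protocol tcp --port {port} --source-group {peer_sg_id}"),
    },
}

def render_remediation(finding_type, resources):
    """Fill each command template with the finding's actual resource names."""
    templates = REMEDIATION_TEMPLATES[finding_type]
    return {step: cmd.format(**resources) for step, cmd in templates.items()}

cmds = render_remediation("sg_missing_ingress", {
    "sg_id": "sg-0af46ef489f81f6d0",       # real ID from the cluster
    "port": 443,                           # hypothetical
    "peer_sg_id": "sg-0123456789abcdef0",  # hypothetical peer SG
})
print(cmds["diagnose"])
```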
The LLM-Ready JSON: Built for AI Analysis
The HTML report is for humans. The JSON is for AI.
Every run produces a structured JSON file optimized for feeding into an LLM. The schema includes:
- Full analysis context (cluster, region, time range)
- Findings with severity classifications
- 5D confidence-scored correlations
- Prioritized recommendations
Practical workflow: paste the JSON into Claude or GPT and ask, "What's the most important thing to fix first and why?" The confidence tiers and spatial evidence give the model enough context to prioritize correctly.
```shell
# Run the analysis
python eks_comprehensive_debugger.py --profile prod --region eu-west-1 \
  --cluster-name production --days 1

# Two files generated:
#   production-eks-report-20260301-035821.html   ← for humans
#   production-eks-findings-20260301-035821.json ← for AI
```
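If you'd rather script the triage step than paste the whole file into a chat, the same JSON is trivial to post-process. A small sketch using the keys from the excerpt above (in practice you'd `json.load` the generated findings file; the second, lower-confidence entry here is a made-up example):

```python
# Sketch: pull the highest-confidence root cause out of the findings JSON.
# Keys match the excerpt shown earlier; the CoreDNS entry is hypothetical.

def top_root_cause(report):
    """Return the root-cause entry with the highest composite confidence."""
    causes = report.get("potential_root_causes", [])
    return max(causes, key=lambda c: c.get("composite_confidence", 0.0), default=None)

# Inline stand-in for: report = json.load(open("production-eks-findings-<ts>.json"))
report = {
    "potential_root_causes": [
        {"root_cause": "Cluster version upgrade in progress or recently completed",
         "confidence_tier": "high", "composite_confidence": 0.92},
        {"root_cause": "CoreDNS latency spike",  # hypothetical second finding
         "confidence_tier": "medium", "composite_confidence": 0.61},
    ]
}

cause = top_root_cause(report)
print(f'{cause["root_cause"]} ({cause["confidence_tier"]}, '
      f'{cause["composite_confidence"]:.0%})')
# -> Cluster version upgrade in progress or recently completed (high, 92%)
```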
How to Run It
```shell
# Clone and install
git clone https://github.com/amartinawi/EKS_Dubugger
cd EKS_Dubugger
pip install -r requirements.txt

# Basic usage (auto-detects cluster)
python eks_comprehensive_debugger.py --profile prod --region eu-west-1

# Incident investigation: last 2 hours
python eks_comprehensive_debugger.py --profile prod --region eu-west-1 \
  --cluster-name my-cluster --hours 2

# Post-mortem: specific time window
python eks_comprehensive_debugger.py --profile prod --region eu-west-1 \
  --cluster-name my-cluster \
  --start-date "2026-01-26T08:00:00" \
  --end-date "2026-01-27T18:00:00" \
  --timezone "America/New_York"

# Private cluster via SSM tunnel
python eks_comprehensive_debugger.py --profile prod --region eu-west-1 \
  --cluster-name my-cluster \
  --kube-context my-cluster-ssm-tunnel
```
Prerequisites: Python 3.8+, kubectl configured, AWS CLI with credentials. Read-only access to EKS, CloudWatch, EC2 — no write permissions required.
What I Learned Building This
Most EKS issues are knowable from existing data. The Kubernetes API and CloudWatch already have the information to diagnose 90% of problems. The bottleneck isn't data collection — it's knowing what to look for and how to connect the dots.
Root cause detection is about correlation, not classification. Classifying a finding as "critical" is easy. Determining that this node pressure caused those pod evictions on that node during this time window requires reasoning across multiple dimensions. Spatial correlation — matching cause and effect by node/pod/namespace identity — was the single biggest accuracy improvement.
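To make the spatial idea concrete, here is a minimal sketch of identity-based matching. The tiered weights (pod > node > namespace) and the function shape are my own illustration, not the tool's implementation:

```python
# Sketch: spatial correlation between a candidate cause and an effect,
# scored by shared identity. Weights are illustrative, not the tool's.
def spatial_score(cause, effect):
    """0.0 = unrelated locations, 1.0 = same pod."""
    if cause.get("pod") and cause.get("pod") == effect.get("pod"):
        return 1.0
    if cause.get("node") and cause.get("node") == effect.get("node"):
        return 0.8
    if cause.get("namespace") and cause.get("namespace") == effect.get("namespace"):
        return 0.4
    return 0.0

node_pressure = {"node": "ip-10-0-1-23.eu-west-1.compute.internal"}
eviction = {"node": "ip-10-0-1-23.eu-west-1.compute.internal",
            "pod": "api-7d4b9c-xk2lp", "namespace": "prod"}
print(spatial_score(node_pressure, eviction))  # 0.8: same node, pod unknown for cause
```

A node-pressure finding carries no pod identity, so it can never reach 1.0 against an eviction; the node match alone is what links the two.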
The tool is most valuable when nothing is wrong. Running it proactively — after an upgrade, after a config change, as part of a weekly health check — catches issues before they page you. Version skew detection, deprecated API scanning, and resource saturation warnings are all leading indicators.
AI-ready output changes the workflow. Structured JSON with confidence-scored root causes means you can ask an AI assistant to explain findings, draft an incident report, or suggest an architecture change — and it has enough structured evidence to do it well.
What's Next
- CI/CD integration — Run as a post-deployment check; fail the deployment if critical issues are detected.
- Scheduled health reports — Weekly automated runs with delta reporting.
- Multi-cluster support — Aggregate findings across clusters for fleet-wide visibility.
If you're managing EKS clusters and spending too much time on diagnostics, give it a try. The tool is open source, a single Python file, no infrastructure dependencies — just point it at your cluster and run.
GitHub: https://github.com/amartinawi/EKS_Dubugger
Issues, PRs, and feature ideas welcome.