The DevOps vs. Forensics Mindset: Tracing Unauthorized kubectl Access on EKS

Most days, I live in the world of high availability, pipelines, and speed. But lately, I’ve been wearing two caps: one as a Senior DevOps Engineer keeping production alive, and the other as a Master’s student in Digital Forensic Science at UNIFESSPA (Marabá, Brazil).

In DevOps, we are trained to restore service at all costs. If a server acts up, we kill it and replace it. But the Forensics side of my brain has started to whisper:

"Wait. If you delete that server now, you delete the evidence forever."

This is the story of a real incident on an AWS EKS cluster where those two worlds collided.

The Alert: "Someone is in the house"

Date: 2026-01-20 | Region: us-east-1

It started with a CloudWatch alert. The system detected unusual commands in our payment-processing namespace.

For the non-experts: In Kubernetes, a command called kubectl exec lets you open a shell directly inside a running container. In a production environment, this is rarely necessary. It is like a bank teller bypassing the counter and walking directly into the vault.

Even worse, the intruder was listing Kubernetes Secrets, essentially reading the passwords for our databases.
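
To make that concrete, this is roughly what those two actions look like from a terminal (the pod and Secret names below are hypothetical, not the attacker's actual targets):

# Open an interactive shell inside a running container (pod name is hypothetical)
kubectl exec -it payments-api-7d9f8 -n payment-processing -- /bin/sh

# List the Secrets in the namespace, then read one (Secret name is hypothetical)
kubectl get secrets -n payment-processing
kubectl get secret db-credentials -n payment-processing -o yaml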

My DevOps instinct kicked in: "Stop the bleeding. Kill the pods. Rotate the keys."

But my Forensic training stopped me. If I nuked the environment, I would never find out how they got in. The attacker would just come back tomorrow. I had to investigate while the trail was still warm.

Step 1: Freezing the Crime Scene

If a burglar breaks into your house, you don't start cleaning up the broken glass immediately. You take photos.

Before I changed a single firewall rule or blocked a user, I captured the cluster's current reality. I needed to see exactly what was running before the attacker (or our auto-scaling tools) changed it.

# I saved the timeline of events to a file
kubectl get events -A --sort-by=.lastTimestamp > incident_timeline.txt

# I took a snapshot of every running pod
kubectl get pods -o wide -A > pod_snapshot.txt

# I checked if they gave themselves permanent admin rights (Backdoors)
kubectl get clusterrolebindings -o yaml > crb_snapshot.yaml
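
One extra habit from the forensics side, not something kubectl does for you: hash and timestamp the snapshots the moment you capture them, so you can later show they were not altered. A minimal sketch, assuming a Linux workstation with sha256sum available:

# Record when the evidence was collected (UTC)
date -u +"%Y-%m-%dT%H:%M:%SZ" > capture_timestamp.txt

# Fingerprint each snapshot so any later modification is detectable
sha256sum incident_timeline.txt pod_snapshot.txt crb_snapshot.yaml > evidence_hashes.sha256

# At any point later, verify the files still match their hashes
sha256sum -c evidence_hashes.sha256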

Killing a pod is easy. Explaining to your CTO how the attacker got there, after you deleted the evidence, is impossible.

Step 2: Dusting for Fingerprints (The Logs)

Kubernetes activity is just API traffic. Every kubectl command is an API request, and the API server records it in its audit log. Since we had EKS Control Plane Logging enabled, every one of those requests left a fingerprint in AWS CloudWatch.

I didn't need a fancy security tool. I just needed to ask the logs the right question. I used CloudWatch Logs Insights to look for the "connect" verb (which is how K8s logs an exec attempt).

fields @timestamp, @message
| parse @message '"username":"*"' as user
| parse @message '"verb":"*"' as action
| parse @message '"objectRef":{"resource":"*"' as res
| filter action in ["connect", "create", "patch", "delete"] 
| filter res in ["pods", "secrets", "clusterrolebindings"]
| filter user not like "system:node"
| sort @timestamp desc
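
If you prefer the terminal to the console, the same hunt can be run with the AWS CLI. A sketch, assuming the query above is saved to a local file called exec_hunt.query (a name I made up) and a Linux-style date command:

# Start the Logs Insights query against the EKS control plane log group (last 24 hours)
QUERY_ID=$(aws logs start-query \
  --log-group-name "/aws/eks/prod-cluster/cluster" \
  --start-time $(date -d '-24 hours' +%s) \
  --end-time $(date +%s) \
  --query-string "$(cat exec_hunt.query)" \
  --query 'queryId' --output text)

# Give it a few seconds, then fetch the results
aws logs get-query-results --query-id "$QUERY_ID"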

The results were clear. An IAM role was being used to open these connections: arn:aws:sts::123456789012:assumed-role/ReadOnly-Developer/Session-XYZ

Step 3: Finding the Person Behind the Role

This is where it gets tricky. Kubernetes knew which key was used (the IAM Role), but it didn't know who was holding the key.

To find the human, I had to switch to AWS CloudTrail. This is the master log of everything happening in your AWS account. I needed to find the exact moment someone "put on the mask" of that Developer Role.

-- Assumes the standard Athena table for CloudTrail, where requestParameters is stored as a JSON string
SELECT eventTime, sourceIPAddress, userIdentity.arn,
       json_extract_scalar(requestParameters, '$.roleArn') AS role_arn
FROM cloudtrail_logs
WHERE eventName = 'AssumeRole'
  AND json_extract_scalar(requestParameters, '$.roleSessionName') = 'Session-XYZ'
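
For a quick first pass, CloudTrail's lookup-events API works too; it only covers the last 90 days and can't filter on the session name directly, so you grep the output, but it's often enough to surface the source IP. A sketch, with an example time window around the incident date:

# Pull recent AssumeRole events and grep around the suspicious session name
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRole \
  --start-time 2026-01-19T00:00:00Z \
  --max-results 50 \
  --output json | grep -B 5 -A 5 "Session-XYZ"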

The Twist: The source IP address didn't come from a hacker in a hoodie halfway across the world. It wasn't one of our VPN addresses either.

It turned out that a developer’s local laptop credentials had been compromised. The attacker wasn't breaking in through a window; they were walking through the front door using a stolen keycard.

Step 4: The Fix (Moving to Modern Security)

In the past, we managed access using a Kubernetes ConfigMap called aws-auth. It’s messy, hard to read, and easy to break.

To fix this properly, I moved us to EKS Access Entries. Think of this like upgrading from a physical guestbook to a digital badge system. It allows us to grant (and revoke) access directly through the AWS API.

Here is the Terraform code I used to lock it down:

// 1. We create a "Security Boundary" (Access Entry)
resource "aws_eks_access_entry" "security_boundary" {
  cluster_name  = "prod-cluster"
  principal_arn = "arn:aws:iam::123456789012:role/ReadOnly-Developer"
  type          = "STANDARD"
}

// 2. We attach a strictly limited policy (View Only)
resource "aws_eks_access_policy_association" "view_only" {
  cluster_name  = "prod-cluster"
  policy_arn    = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSViewPolicy"
  principal_arn = aws_eks_access_entry.security_boundary.principal_arn
  access_scope {
    type = "cluster"
  }
}

Now, if this role gets compromised again, I can revoke it with a single API call, and that action is instantly logged.
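
For the record, that "single API call" is literally one CLI command, using the same cluster and role names as the Terraform above:

# Cut this role's access to the cluster immediately; the call itself is logged in CloudTrail
aws eks delete-access-entry \
  --cluster-name prod-cluster \
  --principal-arn arn:aws:iam::123456789012:role/ReadOnly-Developer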

Step 5: The Alarm System

I never want to manually hunt for this again. I set up a "tripwire." If anyone tries to use kubectl exec in production, effectively opening a shell into a container, I get a Slack notification immediately.

We used a CloudWatch Metric Filter for this:

resource "aws_cloudwatch_log_metric_filter" "exec_tripwire" {
  name           = "EKSExecDetection"
  // This pattern looks for the 'connect' verb on the 'exec' subresource
  pattern        = "{ ($.verb = \"connect\") && ($.objectRef.subresource = \"exec\") }"
  log_group_name = "/aws/eks/prod-cluster/cluster"

  metric_transformation {
    name      = "ExecAttempt"
    namespace = "ClusterSecurity"
    value     = "1"
  }
}
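
The metric on its own doesn't ping anyone; it needs a CloudWatch alarm wired to a notification channel. In our setup that channel is an SNS topic that forwards to Slack; the topic ARN below is a placeholder, not our real one. A sketch of the alarm with the AWS CLI:

# Fire on any ExecAttempt datapoint and notify the SNS topic that feeds Slack
aws cloudwatch put-metric-alarm \
  --alarm-name "EKS-Exec-Tripwire" \
  --namespace "ClusterSecurity" \
  --metric-name "ExecAttempt" \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 1 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:security-alerts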

The Takeaway

Wearing two caps, DevOps and Forensics, is a balancing act.

DevOps is about building fast and keeping the lights on.

Forensics is about proving what happened and ensuring it doesn't happen again.

If you only wear the DevOps cap, you might "fix" the problem but leave the door wide open. If you can’t answer Who did it, Where they came from, and What they changed, you haven't really finished the job.

Next time you see something weird in your cluster, don't just kubectl delete. Stop. Capture. Then fix.
