A real-world incident: an automated cleanup Lambda deleted the EBS volume backing our Rancher installation in our development environment. Here's how we recovered using Velero and prevented it from happening again.
You know that Nigerian saying: “The same broom you use to sweep your house can sweep away your blessings”?
Well, our cost-optimization Lambda function just swept away our entire Rancher installation at midnight. 😅
It was 4:40 AM when I got the alert. Rancher UI showing 502 errors. My first thought? “Ah, network wahala again.” But this time, it was different. Our automation had turned against us.
Let me tell you how we went from “Jesus take the wheel” to “we move!” in just 20 minutes.
🔥 The Incident: When Automation Bites Back
The Setup:
- Rancher running on AWS EKS in eu-west-1
- A Lambda function scheduled to run at midnight to delete unused EBS volumes (saving costs, you know the drill)
- Everything working fine... until it wasn’t
The Symptoms:
- Rancher UI returning 502 Bad Gateway
- Rancher pod stuck in ContainerCreating for 8+ hours
- My coffee getting cold while I investigated ☕️
🔍 Investigation: Finding the Culprit
First things first, let’s see what’s happening with our Rancher pod:
kubectl get pods -n cattle-system -o wide
NAME READY STATUS RESTARTS AGE
rancher-57774d445-9hqcb 0/1 ContainerCreating 0 8h
Eight hours in ContainerCreating? Something is seriously wrong. Time to dig deeper:
kubectl describe pod rancher-57774d445-9hqcb -n cattle-system | grep -A 20 "Events:"
And there it was, the smoking gun:
Warning FailedAttachVolume AttachVolume.Attach failed for volume "pvc-c343abb9-0df7-4c91-b6fa-11da4a13ac00"
rpc error: code = Internal desc = Could not attach volume "vol-0c9a7de28533e72a4" to node
api error InvalidVolume.NotFound: The volume 'vol-0c9a7de28533e72a4' does not exist.
The volume doesn’t exist. 😱
At this point, I remembered our Lambda function. The one that was supposed to save us money by cleaning up unused volumes. The same one that runs at midnight...
Let’s check the PVC:
kubectl get pvc -n cattle-system
NAME STATUS VOLUME CAPACITY ACCESS MODES
rancher-pvc Bound pvc-c343abb9-0df7-4c91-b6fa-11da4a13ac00 10Gi RWO
The PVC thinks it’s bound to a volume that AWS says doesn’t exist. Classic case of “the automation did exactly what we told it to do.”
🕵️ The Root Cause: Too Smart for Our Own Good
Here’s what our Lambda function looked like:
import boto3

def lambda_handler(event, context):
    ec2_client = boto3.client('ec2', region_name='eu-west-1')

    # Get all available (unused) volumes
    response = ec2_client.describe_volumes(
        Filters=[{'Name': 'status', 'Values': ['available']}]
    )

    # Delete them all! What could go wrong? 🤷‍♂️
    for volume in response['Volumes']:
        volume_id = volume['VolumeId']
        print(f"Deleting {volume_id}...")
        ec2_client.delete_volume(VolumeId=volume_id)
The problem? It deleted ALL volumes with status available, including Kubernetes persistent volumes that were temporarily detached during pod restarts or node maintenance.
In Nigerian parlance: We used a sledgehammer to kill a mosquito, and we broke the wall. 😂
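If you want to see how easy this mistake is to make, here's a read-only check (just a sketch; swap in your own region). Any Kubernetes PV whose pod happens to be rescheduling at that moment shows up in this list as "available":

# Read-only: list "available" volumes with their tags so you can see
# which ones actually belong to Kubernetes before anything deletes them
aws ec2 describe-volumes --region eu-west-1 \
  --filters Name=status,Values=available \
  --query 'Volumes[].{Id:VolumeId,Size:Size,Tags:Tags}' \
  --output json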
🚑 The Recovery: From Panic to Peace in 20 Minutes
Step 1: Check for Backups (Please, Please Have Backups!)
First question in any disaster: Do we have backups?
We were using Velero for cluster backups:
kubectl get backups -n velero
NAME AGE
daily-backup-20260130010007 6h47m
daily-backup-20260129040219 27h
Thank God! We have a backup from 01:00 AM. Let’s verify it includes our cattle-system namespace:
kubectl get backup daily-backup-20260130010007 -n velero \
-o jsonpath='{.spec.includedNamespaces}'
["cattle-system","data-workload","observability"]
Perfect! ✅ Time to bring Rancher back from the dead.
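For context, a daily backup like ours is defined with a Velero Schedule. Here's a sketch of what it can look like; the schedule name, cron expression, and retention below are assumptions based on the backup names above, not our exact config:

kubectl create -f - <<EOF
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 1 * * *"   # every day at 01:00
  template:
    includedNamespaces:
      - cattle-system
      - data-workload
      - observability
    ttl: 720h             # keep backups for 30 days (assumed retention)
EOF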
Step 2: Clear the Broken Resources
Before we can restore, we need to clean up the broken PVC and PV.
Scale down Rancher:
kubectl scale deployment rancher -n cattle-system --replicas=0
Now delete the broken resources:
kubectl delete pvc rancher-pvc -n cattle-system
kubectl delete pv pvc-c343abb9-0df7-4c91-b6fa-11da4a13ac00 --force --grace-period=0
Tip: If the PV refuses to delete due to finalizers, remove them:
kubectl patch pv pvc-c343abb9-0df7-4c91-b6fa-11da4a13ac00 -p '{"metadata":{"finalizers":null}}'
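Before moving on, it's worth a quick sanity check that the broken objects are really gone:

# Both of these should now return "NotFound"
kubectl get pvc rancher-pvc -n cattle-system
kubectl get pv pvc-c343abb9-0df7-4c91-b6fa-11da4a13ac00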
Step 3: Velero to the Rescue
Time to restore from our backup. Create a Velero Restore:
kubectl create -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: rancher-restore-$(date +%Y%m%d-%H%M%S)
  namespace: velero
spec:
  backupName: daily-backup-20260130010007
  includedNamespaces:
    - cattle-system
  restorePVs: true
  existingResourcePolicy: update
EOF
Check the restore status:
kubectl get restore -n velero
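If you have the velero CLI installed and pointed at the cluster, it gives a friendlier view than the raw custom resource (substitute the restore name that was generated above):

# Substitute the restore name created above
velero restore describe rancher-restore-<timestamp> --details
velero restore logs rancher-restore-<timestamp>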
Verify the PVC was recreated:
kubectl get pvc -n cattle-system
NAME STATUS VOLUME CAPACITY ACCESS MODES
rancher-pvc Bound pvc-e4849291-fdc3-4a07-a8b9-e14fa93dbbd1 10Gi RWO
New volume, new life! Now scale Rancher back up:
kubectl scale deployment rancher -n cattle-system --replicas=1
Wait about a minute and check:
kubectl get pods -n cattle-system -l app=rancher
NAME READY STATUS RESTARTS AGE
rancher-57774d445-z9wj9 1/1 Running 0 3m7s
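If you'd rather not eyeball it, you can also block until the rollout finishes (same check, just hands-off):

kubectl rollout status deployment/rancher -n cattle-system --timeout=5m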
🎉 Rancher is back! We move!
🛡️ Prevention: Making Sure This Never Happens Again
Now that we’ve recovered, let’s fix that Lambda function so it doesn’t delete our infrastructure again.
Key insight: Kubernetes (and the EBS CSI driver) tags EBS volumes. We can use those tags to protect K8s-managed volumes.
Here’s the updated Lambda function (with guardrails):
import boto3
import json
from datetime import datetime

def lambda_handler(event, context):
    ec2_client = boto3.client('ec2', region_name='eu-west-1')

    deleted_volumes = []
    skipped_volumes = []
    total_size_gb = 0

    response = ec2_client.describe_volumes(
        Filters=[{'Name': 'status', 'Values': ['available']}]
    )
    volumes = response['Volumes']
    print(f"Found {len(volumes)} available volumes")

    for volume in volumes:
        volume_id = volume['VolumeId']
        volume_size = volume['Size']
        volume_type = volume['VolumeType']
        tags = {tag['Key']: tag['Value'] for tag in volume.get('Tags', [])}
        volume_name = tags.get('Name', 'N/A')

        # 🛡️ PROTECTION RULE 1: Skip Kubernetes volumes
        if any(key.startswith('kubernetes.io/') for key in tags.keys()):
            print(f"⊘ Skipping Kubernetes volume: {volume_id} ({volume_name})")
            skipped_volumes.append({'VolumeId': volume_id, 'Name': volume_name, 'Reason': 'Kubernetes volume'})
            continue

        # 🛡️ PROTECTION RULE 2: Skip volumes tagged as critical
        if tags.get('critical') == 'true':
            print(f"⊘ Skipping critical volume: {volume_id} ({volume_name})")
            skipped_volumes.append({'VolumeId': volume_id, 'Name': volume_name, 'Reason': 'Critical volume'})
            continue

        # 🛡️ PROTECTION RULE 3: Skip production volumes
        if tags.get('environment') == 'production':
            print(f"⊘ Skipping production volume: {volume_id} ({volume_name})")
            skipped_volumes.append({'VolumeId': volume_id, 'Name': volume_name, 'Reason': 'Production environment'})
            continue

        # Safe to delete
        try:
            print(f"Deleting: {volume_id} ({volume_name}) - {volume_size}GB")
            ec2_client.delete_volume(VolumeId=volume_id)
            deleted_volumes.append({'VolumeId': volume_id, 'Name': volume_name, 'Size': volume_size, 'Type': volume_type})
            total_size_gb += volume_size
            print(f"✓ Deleted: {volume_id}")
        except Exception as e:
            print(f"✗ Failed: {volume_id} - {str(e)}")

    summary = {
        'Status': 'Success',
        'VolumesDeleted': len(deleted_volumes),
        'VolumesSkipped': len(skipped_volumes),
        'TotalSizeDeleted_GB': total_size_gb,
        'MonthlySavings_USD': round(total_size_gb * 0.11, 2),
        'Timestamp': datetime.utcnow().isoformat()
    }

    print(f"\n✓ Successfully deleted {len(deleted_volumes)} volumes ({total_size_gb}GB)")
    print(f"⊘ Skipped {len(skipped_volumes)} protected volumes")
    print(f"💰 Monthly savings: ~${round(total_size_gb * 0.11, 2)}")

    return {'statusCode': 200, 'body': json.dumps(summary, default=str)}
Let’s confirm our new Rancher volume has Kubernetes tags:
aws ec2 describe-volumes --volume-ids vol-05c365c66b6c1fed0 \
--region eu-west-1 --query 'Volumes[0].Tags'
Example tags you’ll typically see:
kubernetes.io/created-for/pvc/namespace: cattle-system
kubernetes.io/created-for/pvc/name: rancher-pvc
kubernetes.io/created-for/pv/name: pvc-e4849291-fdc3-4a07-a8b9-e14fa93dbbd1
ebs.csi.aws.com/cluster: true
Perfect — our updated Lambda will now skip these volumes.
📊 The Timeline: How Fast Can You Recover?
00:00 — Lambda deletes the volume (while we’re sleeping)
01:00 — Velero backup runs (thank God for automation!)
04:40 — Issue discovered (502 errors everywhere)
04:48 — Velero restore initiated
04:55 — Restore completed, new volume created
04:58 — Rancher fully operational (1/1 Ready)
05:04 — Lambda function patched with protection rules
05:06 — Incident documentation completed
Total downtime: ~5 hours (mostly unnoticed)
Active recovery time: ~20 minutes
Lessons learned: priceless
💡 Key Takeaways (The Real Gist)
1) Backups Are Your Best Friend
Without Velero backups, this would have been a “start updating your CV” moment.
Action: Set up automated backups TODAY. Test restores regularly.
2) Automation Without Guardrails Is Dangerous
Our Lambda did exactly what we told it to do. We forgot to tell it what NOT to do.
Action: Add explicit exclusion rules + dry runs + approvals for destructive actions.
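As a sketch of what "dry run first" could look like: give the function a dry_run flag (not in the code above; purely hypothetical, as is the function name) so it only logs what it would delete, and invoke it manually before the schedule ever runs it for real:

# Hypothetical sketch: assumes the function reads a "dry_run" key from the
# event and only logs candidates when it's true; "ebs-cleanup" is a made-up name
aws lambda invoke \
  --function-name ebs-cleanup \
  --cli-binary-format raw-in-base64-out \
  --payload '{"dry_run": true}' \
  /tmp/dry-run-output.json
cat /tmp/dry-run-output.json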
3) Tag Everything Like Your Job Depends On It
Use Kubernetes tags, and add your own:
- critical=true
- environment=production
- managed-by=kubernetes
- backup=daily
Action: Enforce tagging via AWS Config / SCPs / policy-as-code.
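Tagging a volume by hand is a single command (the volume ID below is a placeholder); the hard part is enforcing it everywhere, which is where AWS Config rules or policy-as-code earn their keep:

# Placeholder volume ID; tag anything a cleanup job must never touch
aws ec2 create-tags \
  --resources vol-0123456789abcdef0 \
  --tags Key=critical,Value=true Key=environment,Value=production Key=backup,Value=daily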
4) Test Your Disaster Recovery Plan
We got lucky. Don’t rely on luck.
Action: Run quarterly DR drills. Restore to a test environment. Time it.
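A drill doesn't have to touch the real namespace. Velero can restore into a renamed one; here's a sketch (the drill namespace name is an assumption, the backup name is the one from this incident):

# Restore the latest backup into a throwaway namespace and time it
time velero restore create dr-drill-$(date +%Y%m%d) \
  --from-backup daily-backup-20260130010007 \
  --namespace-mappings cattle-system:cattle-system-drill \
  --wait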
5) Monitor Everything
We could’ve caught this earlier.
Action: Alert on the following (a quick check sketch follows this list):
- PV attach failures
- Pods stuck in ContainerCreating for more than 5 minutes
- Lambda deletion logs
- Backup failures
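Proper alerting belongs in Prometheus or CloudWatch, but even a crude check in a cron job would have paged us hours earlier. A sketch (it doesn't implement the 5-minute threshold, it just flags anything currently stuck):

# Pods across all namespaces whose containers are still waiting;
# pipe the output into your alerting of choice if it prints anything
kubectl get pods -A --no-headers \
  | awk '$4 == "ContainerCreating" || $4 == "CreateContainerError" {print $1"/"$2" stuck in "$4}'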
6) Document Your Incidents
This post exists because we documented during recovery.
Action: Create incident templates + runbooks.
🎯 Your Action Plan (Start Today!)
- Audit your cleanup scripts: what can they delete accidentally?
- Install Velero (or any backup tool) and schedule automated backups.
- Test restore procedures (don't just "have backups", verify them).
- Enforce tagging for critical/prod resources.
- Add monitoring for K8s storage + automation jobs.
- Document runbooks and incident learnings.
🤔 Discussion: What Would You Do?
Have you had an “automation gone wrong” moment? How did you handle it? What guardrails do you use?
Drop your stories in the comments 👇
📚 Resources
- Velero Documentation: https://velero.io/docs/
- Kubernetes Persistent Volumes: https://kubernetes.io/docs/concepts/storage/persistent-volumes/
- AWS EC2 Tagging: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html
- AWS Lambda Best Practices: https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html
Remember: The best disaster recovery plan is the one you’ve tested. The second best is the one you have. The worst is the one you wish you had.
Stay safe out there, and may your volumes never be accidentally deleted! 🙏
P.S. If this saved you from a similar disaster, share it with your team; prevention is better than cure.