A real-world incident: an automated cleanup Lambda deleted the EBS volume backing our Rancher installation in our development environment. Here's how we recovered using Velero and prevented it from happening again.
You know that Nigerian saying: “The same broom you use to sweep your house can sweep away your blessings”?
Well, our cost-optimization Lambda function just swept away our entire Rancher installation at midnight. 😅
It was 4:40 AM when I got the alert. Rancher UI showing 502 errors. My first thought? “Ah, network wahala again.” But this time, it was different. Our automation had turned against us.
Let me tell you how we went from “Jesus take the wheel” to “we move!” in just 20 minutes.
🔥 The Incident: When Automation Bites Back
The Setup:
- Rancher running on AWS EKS in eu-west-1
- A Lambda function scheduled to run at midnight to delete unused EBS volumes (saving costs, you know the drill)
- Everything working fine... until it wasn’t
The Symptoms:
- Rancher UI returning 502 Bad Gateway
- Rancher pod stuck in ContainerCreating for 8+ hours
- My coffee getting cold while I investigated ☕️
🔍 Investigation: Finding the Culprit
First things first, let’s see what’s happening with our Rancher pod:
kubectl get pods -n cattle-system -o wide
NAME READY STATUS RESTARTS AGE
rancher-57774d445-9hqcb 0/1 ContainerCreating 0 8h
Eight hours in ContainerCreating? Something is seriously wrong. Time to dig deeper:
kubectl describe pod rancher-57774d445-9hqcb -n cattle-system | grep -A 20 "Events:"
And there it was, the smoking gun:
Warning FailedAttachVolume AttachVolume.Attach failed for volume "pvc-c343abb9-0df7-4c91-b6fa-11da4a13ac00"
rpc error: code = Internal desc = Could not attach volume "vol-0c9a7de28533e72a4" to node
api error InvalidVolume.NotFound: The volume 'vol-0c9a7de28533e72a4' does not exist.
The volume doesn’t exist. 😱
At this point, I remembered our Lambda function. The one that was supposed to save us money by cleaning up unused volumes. The same one that runs at midnight...
Let’s check the PVC:
kubectl get pvc -n cattle-system
NAME STATUS VOLUME CAPACITY ACCESS MODES
rancher-pvc Bound pvc-c343abb9-0df7-4c91-b6fa-11da4a13ac00 10Gi RWO
The PVC thinks it’s bound to a volume that AWS says doesn’t exist. Classic case of “the automation did exactly what we told it to do.”
🕵️ The Root Cause: Too Smart for Our Own Good
Here’s what our Lambda function looked like:
import boto3

def lambda_handler(event, context):
    ec2_client = boto3.client('ec2', region_name='eu-west-1')

    # Get all available (unused) volumes
    response = ec2_client.describe_volumes(
        Filters=[{'Name': 'status', 'Values': ['available']}]
    )

    # Delete them all! What could go wrong? 🤷‍♂️
    for volume in response['Volumes']:
        volume_id = volume['VolumeId']
        print(f"Deleting {volume_id}...")
        ec2_client.delete_volume(VolumeId=volume_id)
The problem? It deleted ALL volumes with status available, including Kubernetes persistent volumes that were temporarily detached during pod restarts or node maintenance.
In Nigerian parlance: We used a sledgehammer to kill a mosquito, and we broke the wall. 😂
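If you want to see how easy this mistake is to make, here's a read-only check (just a sketch; swap in your own region). Any Kubernetes PV whose pod happens to be rescheduling at that moment shows up in this list as "available":

# Read-only: list "available" volumes with their tags so you can see
# which ones actually belong to Kubernetes before anything deletes them
aws ec2 describe-volumes --region eu-west-1 \
  --filters Name=status,Values=available \
  --query 'Volumes[].{Id:VolumeId,Size:Size,Tags:Tags}' \
  --output json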
🚑 The Recovery: From Panic to Peace in 20 Minutes
Step 1: Check for Backups (Please, Please Have Backups!)
First question in any disaster: Do we have backups?
We were using Velero for cluster backups:
kubectl get backups -n velero
NAME AGE
daily-backup-20260130010007 6h47m
daily-backup-20260129040219 27h
Thank God! We have a backup from 01:00 AM. Let’s verify it includes our cattle-system namespace:
kubectl get backup daily-backup-20260130010007 -n velero \
-o jsonpath='{.spec.includedNamespaces}'
["cattle-system","data-workload","observability"]
Perfect! ✅ Time to bring Rancher back from the dead.
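For context, a daily backup like ours is defined with a Velero Schedule. Here's a sketch of what it can look like; the schedule name, cron expression, and retention below are assumptions based on the backup names above, not our exact config:

kubectl create -f - <<EOF
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 1 * * *"   # every day at 01:00
  template:
    includedNamespaces:
      - cattle-system
      - data-workload
      - observability
    ttl: 720h             # keep backups for 30 days (assumed retention)
EOF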
Step 2: Clear the Broken Resources
Before we can restore, we need to clean up the broken PVC and PV.
Scale down Rancher:
kubectl scale deployment rancher -n cattle-system --replicas=0
Now delete the broken resources:
kubectl delete pvc rancher-pvc -n cattle-system
kubectl delete pv pvc-c343abb9-0df7-4c91-b6fa-11da4a13ac00 --force --grace-period=0
Tip: If the PV refuses to delete due to finalizers, remove them:
kubectl patch pv pvc-c343abb9-0df7-4c91-b6fa-11da4a13ac00 -p '{"metadata":{"finalizers":null}}'
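Before moving on, it's worth a quick sanity check that the broken objects are really gone:

# Both of these should now return "NotFound"
kubectl get pvc rancher-pvc -n cattle-system
kubectl get pv pvc-c343abb9-0df7-4c91-b6fa-11da4a13ac00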
Step 3: Velero to the Rescue
Time to restore from our backup. Create a Velero Restore:
kubectl create -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: rancher-restore-$(date +%Y%m%d-%H%M%S)
  namespace: velero
spec:
  backupName: daily-backup-20260130010007
  includedNamespaces:
    - cattle-system
  restorePVs: true
  existingResourcePolicy: update
EOF
Check the restore status:
kubectl get restore -n velero
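If you have the velero CLI installed and pointed at the cluster, it gives a friendlier view than the raw custom resource (substitute the restore name that was generated above):

# Substitute the restore name created above
velero restore describe rancher-restore-<timestamp> --details
velero restore logs rancher-restore-<timestamp>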
Verify the PVC was recreated:
kubectl get pvc -n cattle-system
NAME STATUS VOLUME CAPACITY ACCESS MODES
rancher-pvc Bound pvc-e4849291-fdc3-4a07-a8b9-e14fa93dbbd1 10Gi RWO
New volume, new life! Now scale Rancher back up:
kubectl scale deployment rancher -n cattle-system --replicas=1
Wait about a minute and check:
kubectl get pods -n cattle-system -l app=rancher
NAME READY STATUS RESTARTS AGE
rancher-57774d445-z9wj9 1/1 Running 0 3m7s
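If you'd rather not eyeball it, you can also block until the rollout finishes (same check, just hands-off):

kubectl rollout status deployment/rancher -n cattle-system --timeout=5m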
🎉 Rancher is back! We move!
🛡️ Prevention: Making Sure This Never Happens Again
Now that we’ve recovered, let’s fix that Lambda function so it doesn’t delete our infrastructure again.
Key insight: Kubernetes (and the EBS CSI driver) tags EBS volumes. We can use those tags to protect K8s-managed volumes.
Here’s the updated Lambda function (with guardrails):
import boto3
import json
from datetime import datetime

def lambda_handler(event, context):
    ec2_client = boto3.client('ec2', region_name='eu-west-1')

    deleted_volumes = []
    skipped_volumes = []
    total_size_gb = 0

    response = ec2_client.describe_volumes(
        Filters=[{'Name': 'status', 'Values': ['available']}]
    )
    volumes = response['Volumes']
    print(f"Found {len(volumes)} available volumes")

    for volume in volumes:
        volume_id = volume['VolumeId']
        volume_size = volume['Size']
        volume_type = volume['VolumeType']
        tags = {tag['Key']: tag['Value'] for tag in volume.get('Tags', [])}
        volume_name = tags.get('Name', 'N/A')

        # 🛡️ PROTECTION RULE 1: Skip Kubernetes volumes
        if any(key.startswith('kubernetes.io/') for key in tags.keys()):
            print(f"⊘ Skipping Kubernetes volume: {volume_id} ({volume_name})")
            skipped_volumes.append({'VolumeId': volume_id, 'Name': volume_name, 'Reason': 'Kubernetes volume'})
            continue

        # 🛡️ PROTECTION RULE 2: Skip volumes tagged as critical
        if tags.get('critical') == 'true':
            print(f"⊘ Skipping critical volume: {volume_id} ({volume_name})")
            skipped_volumes.append({'VolumeId': volume_id, 'Name': volume_name, 'Reason': 'Critical volume'})
            continue

        # 🛡️ PROTECTION RULE 3: Skip production volumes
        if tags.get('environment') == 'production':
            print(f"⊘ Skipping production volume: {volume_id} ({volume_name})")
            skipped_volumes.append({'VolumeId': volume_id, 'Name': volume_name, 'Reason': 'Production environment'})
            continue

        # Safe to delete
        try:
            print(f"Deleting: {volume_id} ({volume_name}) - {volume_size}GB")
            ec2_client.delete_volume(VolumeId=volume_id)
            deleted_volumes.append({'VolumeId': volume_id, 'Name': volume_name, 'Size': volume_size, 'Type': volume_type})
            total_size_gb += volume_size
            print(f"✓ Deleted: {volume_id}")
        except Exception as e:
            print(f"✗ Failed: {volume_id} - {str(e)}")

    summary = {
        'Status': 'Success',
        'VolumesDeleted': len(deleted_volumes),
        'VolumesSkipped': len(skipped_volumes),
        'TotalSizeDeleted_GB': total_size_gb,
        'MonthlySavings_USD': round(total_size_gb * 0.11, 2),
        'Timestamp': datetime.utcnow().isoformat()
    }

    print(f"\n✓ Successfully deleted {len(deleted_volumes)} volumes ({total_size_gb}GB)")
    print(f"⊘ Skipped {len(skipped_volumes)} protected volumes")
    print(f"💰 Monthly savings: ~${round(total_size_gb * 0.11, 2)}")

    return {'statusCode': 200, 'body': json.dumps(summary, default=str)}
Let’s confirm our new Rancher volume has Kubernetes tags:
aws ec2 describe-volumes --volume-ids vol-05c365c66b6c1fed0 \
--region eu-west-1 --query 'Volumes[0].Tags'
Example tags you’ll typically see:
kubernetes.io/created-for/pvc/namespace: cattle-system
kubernetes.io/created-for/pvc/name: rancher-pvc
kubernetes.io/created-for/pv/name: pvc-e4849291-fdc3-4a07-a8b9-e14fa93dbbd1
ebs.csi.aws.com/cluster: true
Perfect — our updated Lambda will now skip these volumes.
📊 The Timeline: How Fast Can You Recover?
00:00 — Lambda deletes the volume (while we’re sleeping)
01:00 — Velero backup runs (thank God for automation!)
04:40 — Issue discovered (502 errors everywhere)
04:48 — Velero restore initiated
04:55 — Restore completed, new volume created
04:58 — Rancher fully operational (1/1 Ready)
05:04 — Lambda function patched with protection rules
05:06 — Incident documentation completed
Total downtime: ~5 hours (mostly unnoticed)
Active recovery time: ~20 minutes
Lessons learned: priceless
💡 Key Takeaways (The Real Gist)
1) Backups Are Your Best Friend
Without Velero backups, this would have been a “start updating your CV” moment.
Action: Set up automated backups TODAY. Test restores regularly.
2) Automation Without Guardrails Is Dangerous
Our Lambda did exactly what we told it to do. We forgot to tell it what NOT to do.
Action: Add explicit exclusion rules + dry runs + approvals for destructive actions.
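As a sketch of what "dry run first" could look like: give the function a dry_run flag (not in the code above; purely hypothetical, as is the function name) so it only logs what it would delete, and invoke it manually before the schedule ever runs it for real:

# Hypothetical sketch: assumes the function reads a "dry_run" key from the
# event and only logs candidates when it's true; "ebs-cleanup" is a made-up name
aws lambda invoke \
  --function-name ebs-cleanup \
  --cli-binary-format raw-in-base64-out \
  --payload '{"dry_run": true}' \
  /tmp/dry-run-output.json
cat /tmp/dry-run-output.json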
3) Tag Everything Like Your Job Depends On It
Use Kubernetes tags, and add your own:
- critical=true
- environment=production
- managed-by=kubernetes
- backup=daily
Action: Enforce tagging via AWS Config / SCPs / policy-as-code.
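Tagging a volume by hand is a single command (the volume ID below is a placeholder); the hard part is enforcing it everywhere, which is where AWS Config rules or policy-as-code earn their keep:

# Placeholder volume ID; tag anything a cleanup job must never touch
aws ec2 create-tags \
  --resources vol-0123456789abcdef0 \
  --tags Key=critical,Value=true Key=environment,Value=production Key=backup,Value=daily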
4) Test Your Disaster Recovery Plan
We got lucky. Don’t rely on luck.
Action: Run quarterly DR drills. Restore to a test environment. Time it.
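A drill doesn't have to touch the real namespace. Velero can restore into a renamed one; here's a sketch (the drill namespace name is an assumption, the backup name is the one from this incident):

# Restore the latest backup into a throwaway namespace and time it
time velero restore create dr-drill-$(date +%Y%m%d) \
  --from-backup daily-backup-20260130010007 \
  --namespace-mappings cattle-system:cattle-system-drill \
  --wait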
5) Monitor Everything
We could’ve caught this earlier.
Action: Alert on the following (a quick check sketch follows this list):
- PV attach failures
- Pods stuck in ContainerCreating for more than 5 minutes
- Lambda deletion logs
- Backup failures
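Proper alerting belongs in Prometheus or CloudWatch, but even a crude check in a cron job would have paged us hours earlier. A sketch (it doesn't implement the 5-minute threshold, it just flags anything currently stuck):

# Pods across all namespaces whose containers are still waiting;
# pipe the output into your alerting of choice if it prints anything
kubectl get pods -A --no-headers \
  | awk '$4 == "ContainerCreating" || $4 == "CreateContainerError" {print $1"/"$2" stuck in "$4}'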
6) Document Your Incidents
This post exists because we documented during recovery.
Action: Create incident templates + runbooks.
🎯 Your Action Plan (Start Today!)
- Audit your cleanup scripts: what can they delete accidentally?
- Install Velero (or any backup tool) and schedule automated backups.
- Test restore procedures (don't just "have backups", verify them).
- Enforce tagging for critical/prod resources.
- Add monitoring for K8s storage + automation jobs.
- Document runbooks and incident learnings.
🤔 Discussion: What Would You Do?
Have you had an “automation gone wrong” moment? How did you handle it? What guardrails do you use?
Drop your stories in the comments 👇
📚 Resources
- Velero Documentation: https://velero.io/docs/
- Kubernetes Persistent Volumes: https://kubernetes.io/docs/concepts/storage/persistent-volumes/
- AWS EC2 Tagging: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html
- AWS Lambda Best Practices: https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html
Remember: The best disaster recovery plan is the one you’ve tested. The second best is the one you have. The worst is the one you wish you had.
Stay safe out there, and may your volumes never be accidentally deleted! 🙏
P.S. If this saved you from a similar disaster, share it with your team; prevention is better than cure.