A step-by-step guide to implementing disaster recovery (DR) for Azure Kubernetes Service (AKS) clusters, with detailed commands and explanations:
Step 1: Prerequisites Setup
- Install Required Tools
# Azure CLI
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
# Kubernetes CLI
sudo az aks install-cli
# Velero CLI (Linux/Mac via Homebrew; alternatively download a release tarball from https://github.com/vmware-tanzu/velero/releases)
brew install velero
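Before moving on, it can help to confirm that every tool is actually on the PATH. The following is a minimal sketch; the function names `missing_tools` and `check_prerequisites` are hypothetical helpers, not part of any of the CLIs above.

```shell
# Print the name of every listed command that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Return non-zero (and list what is missing on stderr) if any tool is absent.
check_prerequisites() {
  local missing
  missing=$(missing_tools "$@")
  if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing" >&2
    return 1
  fi
}
```

A typical call would be `check_prerequisites az kubectl velero jq || exit 1` at the top of any script built from this guide (jq is used later in Step 7).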
- Login to Azure
az login
az account set --subscription "YOUR_SUBSCRIPTION_NAME"
Step 2: Create Backup Resources
- Create Storage Account for Backups
RESOURCE_GROUP="DR-RG"
LOCATION="eastus"
STORAGE_ACCOUNT="velerobackups$(date +%s)" # Unique name
CONTAINER="aks-backups"
az group create -n $RESOURCE_GROUP -l $LOCATION
az storage account create -g $RESOURCE_GROUP -n $STORAGE_ACCOUNT
az storage container create -n $CONTAINER --account-name $STORAGE_ACCOUNT
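Azure storage account names must be 3 to 24 characters long and contain only lowercase letters and digits, so a generated name is worth checking before calling `az storage account create`. A small sketch, with `valid_storage_account_name` as a hypothetical helper:

```shell
# Succeeds only if the name is 3-24 characters of lowercase letters and digits.
valid_storage_account_name() {
  local name="$1"
  local LC_ALL=C   # keep the a-z range strictly ASCII
  case $name in
    ''|*[!a-z0-9]*) return 1 ;;
  esac
  [ "${#name}" -ge 3 ] && [ "${#name}" -le 24 ]
}
```

For example, `valid_storage_account_name "$STORAGE_ACCOUNT" || echo "invalid name" >&2` would catch a prefix that pushes the timestamped name past 24 characters.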
- Generate Velero Credentials File
STORAGE_KEY=$(az storage account keys list -g "$RESOURCE_GROUP" --account-name "$STORAGE_ACCOUNT" --query "[0].value" -o tsv)
cat << EOF > credentials-velero
AZURE_STORAGE_ACCOUNT_ACCESS_KEY=$STORAGE_KEY
AZURE_CLOUD_NAME=AzurePublicCloud
EOF
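Because the credentials file holds a storage account key, it is reasonable to create it with owner-only permissions. The sketch below writes the same two lines as above; `write_velero_credentials` is a hypothetical helper name, and the default output path is an assumption matching the file name used in this guide.

```shell
# Write the Velero credentials file so that only the current user can read it.
# Arguments: <storage-key> [output-path]
write_velero_credentials() {
  local key="$1" path="${2:-credentials-velero}"
  (
    umask 077   # a newly created file gets mode 600
    cat > "$path" << EOF
AZURE_STORAGE_ACCOUNT_ACCESS_KEY=$key
AZURE_CLOUD_NAME=AzurePublicCloud
EOF
  )
  chmod 600 "$path"   # also tighten the mode if the file already existed
}
```

It would be called as `write_velero_credentials "$STORAGE_KEY"`. Keeping this file out of version control is also advisable.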
Step 3: Install Velero on Primary Cluster
- Connect to AKS Cluster
az aks get-credentials -g $RESOURCE_GROUP -n primary-aks-cluster
- Install Velero
velero install \
--provider azure \
--plugins velero/velero-plugin-for-microsoft-azure:v1.6.0 \
--bucket $CONTAINER \
--secret-file ./credentials-velero \
--backup-location-config resourceGroup=$RESOURCE_GROUP,storageAccount=$STORAGE_ACCOUNT,storageAccountKeyEnvVar=AZURE_STORAGE_ACCOUNT_ACCESS_KEY \
--snapshot-location-config apiTimeout=15m
- Verify Installation
velero version
kubectl get pods -n velero
Step 4: Configure Backups
- Schedule Daily Backups
velero schedule create daily-backup \
--schedule="@every 24h" \
--include-namespaces=prod \
--snapshot-volumes
- Manual Backup (Test)
velero backup create manual-backup-01 \
--include-namespaces=prod \
--snapshot-volumes
- Check Backups
velero backup get
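A backup is only useful once it reaches the Completed phase, so a script may want to block until that happens. The sketch below polls the Backup custom resource with kubectl; `wait_for_backup` is a hypothetical helper, and the default of 60 attempts at 10-second intervals is an arbitrary assumption.

```shell
# Poll a Velero backup until it finishes.
# Arguments: <backup-name> [max-attempts] [sleep-seconds]
# Returns 0 on Completed, 1 on a failed phase, 2 on timeout.
wait_for_backup() {
  local name="$1" attempts="${2:-60}" delay="${3:-10}" phase i=0
  while [ "$i" -lt "$attempts" ]; do
    phase=$(kubectl get backups.velero.io "$name" -n velero -o jsonpath='{.status.phase}')
    case $phase in
      Completed) return 0 ;;
      Failed|PartiallyFailed|FailedValidation)
        echo "Backup $name ended in phase $phase" >&2
        return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "Timed out waiting for backup $name" >&2
  return 2
}
```

For interactive use, passing `--wait` to `velero backup create` is a simpler alternative; the polling form is mainly useful for backups started by a schedule.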
Step 5: Set Up DR Cluster
- Create DR Cluster in Secondary Region
DR_LOCATION="westus"
DR_CLUSTER="dr-aks-cluster"
az aks create \
-g $RESOURCE_GROUP \
-n $DR_CLUSTER \
-l $DR_LOCATION \
--node-count 1 \
--node-vm-size Standard_B2s \
--generate-ssh-keys
- Connect to DR Cluster
az aks get-credentials -g $RESOURCE_GROUP -n $DR_CLUSTER
- Install Velero on DR Cluster
velero install \
--provider azure \
--plugins velero/velero-plugin-for-microsoft-azure:v1.6.0 \
--bucket $CONTAINER \
--secret-file ./credentials-velero \
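--backup-location-config resourceGroup=$RESOURCE_GROUP,storageAccount=$STORAGE_ACCOUNT,storageAccountKeyEnvVar=AZURE_STORAGE_ACCOUNT_ACCESS_KEY
# Make the backup location read-only on the DR cluster so it cannot modify or delete backups
kubectl patch backupstoragelocation default -n velero --type merge --patch '{"spec":{"accessMode":"ReadOnly"}}'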
Step 6: Configure DNS Failover
- Create Traffic Manager Profile
TRAFFIC_MANAGER="aks-dr-profile"
az network traffic-manager profile create \
-g $RESOURCE_GROUP \
-n $TRAFFIC_MANAGER \
--routing-method Priority \
--unique-dns-name $TRAFFIC_MANAGER
- Add Primary Cluster Endpoint
PRIMARY_IP=$(kubectl get svc myapp-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}' --context=primary-aks-cluster)
az network traffic-manager endpoint create \
-g $RESOURCE_GROUP \
--profile-name $TRAFFIC_MANAGER \
-n "primary-endpoint" \
--type externalEndpoints \
--target $PRIMARY_IP \
--priority 1
- Add DR Cluster Endpoint (the service only has an IP once the application is deployed or restored on the DR cluster)
DR_IP=$(kubectl get svc myapp-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}' --context=dr-aks-cluster)
az network traffic-manager endpoint create \
-g $RESOURCE_GROUP \
--profile-name $TRAFFIC_MANAGER \
-n "dr-endpoint" \
--type externalEndpoints \
--target $DR_IP \
--priority 2
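If a LoadBalancer service has not been assigned an address yet, the jsonpath lookups above return an empty string, and the endpoint would be created with a bad target. A minimal guard, assuming a hypothetical helper named `is_ipv4`:

```shell
# Succeeds only for a dotted-quad IPv4 address with each octet in 0-255.
is_ipv4() {
  local ip="$1" octet count=0
  case $ip in
    ''|*[!0-9.]*|.*|*.|*..*) return 1 ;;   # empty, bad characters, or empty octets
  esac
  local IFS=.
  for octet in $ip; do
    count=$((count + 1))
    [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
  done
  [ "$count" -eq 4 ]
}
```

A script could then run `is_ipv4 "$DR_IP" || { echo "DR service has no external IP yet" >&2; exit 1; }` before creating the endpoint.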
Step 7: Failover Simulation (Disaster Recovery)
- Trigger Failover
# Disable primary endpoint
az network traffic-manager endpoint update \
-g $RESOURCE_GROUP \
--profile-name $TRAFFIC_MANAGER \
-n primary-endpoint \
--type externalEndpoints \
--endpoint-status Disabled
# Scale DR cluster
az aks nodepool scale \
--cluster-name $DR_CLUSTER \
-g $RESOURCE_GROUP \
-n nodepool1 \
--node-count 3
- Restore Backup on DR Cluster
# Get the most recent backup name (sorted by creation time)
BACKUP_NAME=$(velero backup get -o json | jq -r '.items | sort_by(.metadata.creationTimestamp) | last | .metadata.name')
# Restore application
velero restore create \
--from-backup $BACKUP_NAME \
--namespace-mappings prod:prod-dr
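During a real failover the newest backup may itself be incomplete or failed, so it can be safer to restore from the newest backup whose phase is Completed. The sketch below reads the Backup custom resources through kubectl instead of the Velero CLI; both function names are hypothetical.

```shell
# Read "name phase" lines (oldest first) on stdin and print the last Completed name.
pick_latest_completed() {
  awk '$2 == "Completed" { name = $1 } END { if (name != "") print name }'
}

# List backups oldest-first as "name phase" pairs, then pick the newest Completed one.
latest_completed_backup() {
  kubectl get backups.velero.io -n velero \
    --sort-by=.metadata.creationTimestamp \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}' |
    pick_latest_completed
}
```

With this in place, `BACKUP_NAME=$(latest_completed_backup)` yields an empty string when no usable backup exists, which a script should treat as a hard stop.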
- Verify Deployment
kubectl get pods -n prod-dr
kubectl get svc -n prod-dr
Step 8: Failback to Primary
- Re-enable Primary Endpoint
az network traffic-manager endpoint update \
-g $RESOURCE_GROUP \
--profile-name $TRAFFIC_MANAGER \
-n primary-endpoint \
--type externalEndpoints \
--endpoint-status Enabled
- Scale Down DR Cluster
az aks nodepool scale \
--cluster-name $DR_CLUSTER \
-g $RESOURCE_GROUP \
-n nodepool1 \
--node-count 1
Step 9: Test Your DR Plan
- Test Backup Restoration
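# Use an actual backup name from `velero backup get`; scheduled backups are named <schedule>-<YYYYMMDDhhmmss>
velero restore create --from-backup daily-backup-20230804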
- Simulate Pod Failure
# Force-delete the first pod listed in the namespace
kubectl delete pod -n prod --force --grace-period=0 "$(kubectl get pods -n prod -o jsonpath='{.items[0].metadata.name}')"
- Monitor Traffic Manager
az network traffic-manager endpoint list \
-g $RESOURCE_GROUP \
--profile-name $TRAFFIC_MANAGER \
-o table
Key Maintenance Tasks
- Check Backup Status Daily
velero backup get
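The daily check can be automated so that a cron job or pipeline fails when any backup needs attention. This is a rough sketch, with `failed_backups` and `daily_backup_check` as hypothetical helper names; it treats the Failed, PartiallyFailed, and FailedValidation phases as problems.

```shell
# Read "name phase" lines on stdin and print those in a failed phase.
failed_backups() {
  awk '$2 == "Failed" || $2 == "PartiallyFailed" || $2 == "FailedValidation" { print $1 " " $2 }'
}

# Exit status 1 (with a list on stderr) if any backup is in a failed phase.
daily_backup_check() {
  local failed
  failed=$(kubectl get backups.velero.io -n velero \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}' |
    failed_backups)
  if [ -n "$failed" ]; then
    printf 'Backups needing attention:\n%s\n' "$failed" >&2
    return 1
  fi
  echo "No failed backups found"
}
```

Each name it reports can be passed to `velero backup describe` and `velero backup logs` for details.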
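- Test DR Quarterly
Repeat Steps 7-8 during maintenance windows
- Update Backup Configuration
# Velero has no "schedule update" subcommand; delete and recreate the schedule
velero schedule delete daily-backup --confirm
velero schedule create daily-backup \
--schedule="@every 24h" \
--include-namespaces=prod,staging \
--snapshot-volumes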
Troubleshooting Tips
- Failed Backups
velero backup describe BACKUP_NAME
velero backup logs BACKUP_NAME
- Restoration Issues
kubectl get backuprepositories -n velero # named resticrepositories before Velero 1.10
velero restore describe RESTORE_NAME
- Connectivity Problems
az network traffic-manager profile show \
-g $RESOURCE_GROUP \
-n $TRAFFIC_MANAGER \
--query endpoints
Note: Replace placeholder names (prod, myapp-service, etc.) with your actual resource names. Test all commands in a non-production environment first.
This end-to-end guide provides a production-ready DR solution for AKS. Remember to:
- Regularly validate backups
- Document procedures for your team
- Automate where possible using Azure Pipelines
- Monitor cluster health with Azure Monitor