Goodluck Ekeoma Adiole

Disaster Recovery for Azure Kubernetes Service (AKS) clusters

This step-by-step guide shows how to implement disaster recovery (DR) for Azure Kubernetes Service (AKS) clusters, with detailed commands and explanations:


Step 1: Prerequisites Setup

  1. Install Required Tools
   # Azure CLI
   curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash

   # Kubernetes CLI
   sudo az aks install-cli

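   # Velero CLI (Linux amd64): download a release binary from GitHub (macOS: brew install velero).
   # v1.10.0 is only an example; pick a release compatible with your plugin version.
   VELERO_VERSION="v1.10.0"
   curl -sL "https://github.com/vmware-tanzu/velero/releases/download/${VELERO_VERSION}/velero-${VELERO_VERSION}-linux-amd64.tar.gz" | tar -xz
   sudo mv "velero-${VELERO_VERSION}-linux-amd64/velero" /usr/local/bin/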
  2. Log in to Azure
   az login
   az account set --subscription "YOUR_SUBSCRIPTION_NAME"
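Before moving on, it can help to confirm that every CLI used in this guide is actually on your PATH. The snippet below is a minimal sketch of such a check; the function name `missing_tools` is made up for illustration and is not part of the Azure CLI, kubectl, or Velero.

```shell
# Hypothetical helper: print each command from the argument list
# that cannot be found on PATH (one name per line, nothing if all exist).
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

For example, `missing_tools az kubectl velero jq` prints nothing when all four tools are installed (jq is used in Step 7).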

Step 2: Create Backup Resources

  1. Create Storage Account for Backups
   RESOURCE_GROUP="DR-RG"
   LOCATION="eastus"
   STORAGE_ACCOUNT="velerobackups$(date +%s)"  # Unique name
   CONTAINER="aks-backups"

   az group create -n $RESOURCE_GROUP -l $LOCATION
   az storage account create -g $RESOURCE_GROUP -n $STORAGE_ACCOUNT
   az storage container create -n $CONTAINER --account-name $STORAGE_ACCOUNT
  2. Generate Velero Credentials File
   STORAGE_KEY=$(az storage account keys list -g $RESOURCE_GROUP --account-name $STORAGE_ACCOUNT --query "[0].value" -o tsv)
   cat << EOF > credentials-velero
   AZURE_STORAGE_ACCOUNT_ACCESS_KEY=$STORAGE_KEY
   AZURE_CLOUD_NAME=AzurePublicCloud
   EOF
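Azure storage account names must be 3 to 24 characters long and contain only lowercase letters and digits. The generated name above ("velerobackups" plus a 10-digit timestamp) comes to 23 characters, so it fits, but a custom prefix can easily break the rule. Here is a small sketch of a local check; `valid_storage_account_name` is a hypothetical helper, not an Azure CLI command.

```shell
# Hypothetical helper: succeed only if the name is 3-24 characters
# of lowercase letters and digits.
valid_storage_account_name() {
  local LC_ALL=C   # keep the a-z range strictly ASCII
  case "$1" in
    *[!a-z0-9]*) return 1 ;;   # contains a character outside a-z / 0-9
  esac
  [ "${#1}" -ge 3 ] && [ "${#1}" -le 24 ]
}
```

You could call it right after setting the variable, for example `valid_storage_account_name "$STORAGE_ACCOUNT" || echo "invalid storage account name"`.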

Step 3: Install Velero on Primary Cluster

  1. Connect to AKS Cluster
   az aks get-credentials -g $RESOURCE_GROUP -n primary-aks-cluster
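  2. Install Velero
   # storageAccountKeyEnvVar tells the plugin to authenticate to blob storage with the
   # access key stored in credentials-velero.
   # Note: the plugin documentation describes access-key authentication as covering only the
   # backup storage location; Azure disk snapshots need service principal or identity-based credentials.
   velero install \
     --provider azure \
     --plugins velero/velero-plugin-for-microsoft-azure:v1.6.0 \
     --bucket $CONTAINER \
     --secret-file ./credentials-velero \
     --backup-location-config resourceGroup=$RESOURCE_GROUP,storageAccount=$STORAGE_ACCOUNT,storageAccountKeyEnvVar=AZURE_STORAGE_ACCOUNT_ACCESS_KEY \
     --snapshot-location-config apiTimeout=15m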
  3. Verify Installation
   velero version
   kubectl get pods -n velero
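Listing the pods shows their state at one moment only. If you script the installation, you may prefer to block until the rollout finishes. The sketch below assumes the defaults created by `velero install` (a deployment named `velero` in the `velero` namespace); `wait_for_velero` is an illustrative name.

```shell
# Hypothetical helper: wait until the Velero deployment has rolled out,
# or fail once the timeout (default 120s) expires.
wait_for_velero() {
  local timeout="${1:-120s}"
  kubectl rollout status deployment/velero -n velero --timeout="$timeout"
}
```

Once it returns successfully, `velero backup-location get` should list the default backup location.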

Step 4: Configure Backups

  1. Schedule Daily Backups
   velero schedule create daily-backup \
     --schedule="@every 24h" \
     --include-namespaces=prod \
     --snapshot-volumes
  2. Manual Backup (Test)
   velero backup create manual-backup-01 \
     --include-namespaces=prod \
     --snapshot-volumes
  3. Check Backups
   velero backup get
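A backup that appears in the list is not necessarily finished or successful. As a rough sketch, you could poll the Backup resource until it reaches a final phase. This assumes Velero runs in the default `velero` namespace; the function names are hypothetical, and the list of failure phases may not be exhaustive for every Velero release.

```shell
# Read the phase of a Velero Backup resource (e.g. InProgress, Completed).
backup_phase() {
  kubectl get backups.velero.io "$1" -n velero -o jsonpath='{.status.phase}'
}

# Poll until the backup completes.
# Usage: wait_for_backup NAME [ATTEMPTS] [DELAY_SECONDS]
# Returns 0 on Completed, 1 on a failure phase, 2 on timeout.
wait_for_backup() {
  local name="$1" attempts="${2:-60}" delay="${3:-10}" phase
  while [ "$attempts" -gt 0 ]; do
    phase="$(backup_phase "$name")"
    case "$phase" in
      Completed) return 0 ;;
      Failed|PartiallyFailed|FailedValidation)
        echo "backup $name ended in phase $phase" >&2
        return 1 ;;
    esac
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  echo "timed out waiting for backup $name" >&2
  return 2
}
```

For the manual test backup above, that would be `wait_for_backup manual-backup-01`.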

Step 5: Set Up DR Cluster

  1. Create DR Cluster in Secondary Region
   DR_LOCATION="westus"
   DR_CLUSTER="dr-aks-cluster"

   az aks create \
     -g $RESOURCE_GROUP \
     -n $DR_CLUSTER \
     -l $DR_LOCATION \
     --node-count 1 \
     --node-vm-size Standard_B2s
  2. Connect to DR Cluster
   az aks get-credentials -g $RESOURCE_GROUP -n $DR_CLUSTER
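  3. Install Velero on DR Cluster
   velero install \
     --provider azure \
     --plugins velero/velero-plugin-for-microsoft-azure:v1.6.0 \
     --bucket $CONTAINER \
     --secret-file ./credentials-velero \
     --backup-location-config resourceGroup=$RESOURCE_GROUP,storageAccount=$STORAGE_ACCOUNT,storageAccountKeyEnvVar=AZURE_STORAGE_ACCOUNT_ACCESS_KEY

   # Make the shared backup location read-only on the DR cluster so that it
   # cannot create or delete backup objects there.
   kubectl patch backupstoragelocation default -n velero \
     --type merge --patch '{"spec":{"accessMode":"ReadOnly"}}'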
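Because this guide switches between two clusters, it is easy to run a command against the wrong one. A simple guard is to check the active kubectl context first. The sketch below assumes the context names match the cluster names, which is what `az aks get-credentials` produces by default; `require_context` is a made-up helper name.

```shell
# Hypothetical guard: fail unless kubectl currently points at the expected context.
require_context() {
  local expected="$1" current
  current="$(kubectl config current-context)"
  if [ "$current" != "$expected" ]; then
    echo "expected kubectl context '$expected' but found '$current'" >&2
    return 1
  fi
}
```

For example, `require_context "$DR_CLUSTER" && velero backup get` only lists backups when you are connected to the DR cluster.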

Step 6: Configure DNS Failover

  1. Create Traffic Manager Profile
   TRAFFIC_MANAGER="aks-dr-profile"
   az network traffic-manager profile create \
     -g $RESOURCE_GROUP \
     -n $TRAFFIC_MANAGER \
     --routing-method Priority \
     --unique-dns-name $TRAFFIC_MANAGER
  2. Add Primary Cluster Endpoint
   PRIMARY_IP=$(kubectl get svc myapp-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}' --context=primary-aks-cluster)

   az network traffic-manager endpoint create \
     -g $RESOURCE_GROUP \
     --profile-name $TRAFFIC_MANAGER \
     -n "primary-endpoint" \
     --type externalEndpoints \
     --target $PRIMARY_IP \
     --priority 1
  3. Add DR Cluster Endpoint
   # myapp-service must already exist on the DR cluster, otherwise DR_IP will be empty
   DR_IP=$(kubectl get svc myapp-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}' --context=dr-aks-cluster)

   az network traffic-manager endpoint create \
     -g $RESOURCE_GROUP \
     --profile-name $TRAFFIC_MANAGER \
     -n "dr-endpoint" \
     --type externalEndpoints \
     --target $DR_IP \
     --priority 2
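The endpoint commands above depend on `PRIMARY_IP` and `DR_IP` holding real addresses. If a service has no load balancer IP yet, the variable is empty and the endpoint is created with a bad target. A minimal sketch of a sanity check follows; `is_ipv4` is a hypothetical helper written for this guide.

```shell
# Hypothetical helper: succeed only for a dotted-quad IPv4 address
# with four numeric octets in the range 0-255.
is_ipv4() {
  local ip="$1" octet rest count=0
  case "$ip" in
    ''|*[!0-9.]*|.*|*.|*..*) return 1 ;;   # empty, bad characters, or empty octets
  esac
  rest="$ip."
  while [ -n "$rest" ]; do
    octet="${rest%%.*}"      # text before the first dot
    rest="${rest#*.}"        # everything after the first dot
    count=$((count + 1))
    [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
  done
  [ "$count" -eq 4 ]
}
```

A typical use would be `is_ipv4 "$DR_IP" || echo "DR service has no external IP yet"` before creating the endpoint.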

Step 7: Failover Simulation (Disaster Recovery)

  1. Trigger Failover
   # Disable primary endpoint
   az network traffic-manager endpoint update \
     -g $RESOURCE_GROUP \
     --profile-name $TRAFFIC_MANAGER \
     -n primary-endpoint \
     --type externalEndpoints \
     --endpoint-status Disabled

   # Scale DR cluster
   az aks nodepool scale \
     --cluster-name $DR_CLUSTER \
     -g $RESOURCE_GROUP \
     -n nodepool1 \
     --node-count 3
  2. Restore Backup on DR Cluster
   # Get the most recent backup name (sorted by creation time)
   BACKUP_NAME=$(velero backup get -o json | jq -r '.items | sort_by(.metadata.creationTimestamp) | last | .metadata.name')

   # Restore application
   velero restore create \
     --from-backup $BACKUP_NAME \
     --namespace-mappings prod:prod-dr
  3. Verify Deployment
   kubectl get pods -n prod-dr
   kubectl get svc -n prod-dr
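A restore should start from a backup that actually finished, not just the newest entry in the list. As one possible approach, the sketch below filters a plain-text listing for backups in the Completed phase and returns the newest one. The `latest_completed` name and the three-column input format are assumptions made for this example.

```shell
# Hypothetical filter: read lines of "NAME PHASE CREATED" on stdin and print
# the name of the newest backup whose phase is Completed (nothing if none).
# ISO 8601 timestamps sort correctly as plain text.
latest_completed() {
  awk '$2 == "Completed" { print $3, $1 }' | sort | tail -n 1 | awk '{ print $2 }'
}

# Example usage against a cluster (assumes Velero runs in the velero namespace):
# BACKUP_NAME=$(kubectl get backups.velero.io -n velero --no-headers \
#   -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,CREATED:.metadata.creationTimestamp \
#   | latest_completed)
```

If the result is empty, no completed backup exists and the restore should not be attempted.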

Step 8: Failback to Primary

  1. Re-enable Primary Endpoint
   az network traffic-manager endpoint update \
     -g $RESOURCE_GROUP \
     --profile-name $TRAFFIC_MANAGER \
     -n primary-endpoint \
     --type externalEndpoints \
     --endpoint-status Enabled
  2. Scale Down DR Cluster
   az aks nodepool scale \
     --cluster-name $DR_CLUSTER \
     -g $RESOURCE_GROUP \
     -n nodepool1 \
     --node-count 1

Step 9: Test Your DR Plan

  1. Test Backup Restoration
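   # Example name only; scheduled backups are named <schedule>-<timestamp>, so copy the exact name from `velero backup get`
   velero restore create --from-backup daily-backup-20230804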
  2. Simulate Pod Failure
   # Force-delete the first pod in the namespace
   kubectl delete pod -n prod --force $(kubectl get pods -n prod -o jsonpath='{.items[0].metadata.name}')
  3. Monitor Traffic Manager
   az network traffic-manager endpoint list \
     -g $RESOURCE_GROUP \
     --profile-name $TRAFFIC_MANAGER \
     -o table
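For scripted DR tests, reading the table by eye is not enough; you usually want a yes/no answer per endpoint. The sketch below wraps `az network traffic-manager endpoint show` and checks the `endpointMonitorStatus` property. The function names are hypothetical, and the variables are the ones defined in Step 6.

```shell
# Print the monitor status Traffic Manager reports for one endpoint
# (for example Online, Degraded, or Disabled).
endpoint_monitor_status() {
  az network traffic-manager endpoint show \
    -g "${RESOURCE_GROUP:?set RESOURCE_GROUP first}" \
    --profile-name "${TRAFFIC_MANAGER:?set TRAFFIC_MANAGER first}" \
    -n "$1" --type externalEndpoints \
    --query endpointMonitorStatus -o tsv
}

# Succeed only if the endpoint is currently reported as Online.
endpoint_is_online() {
  [ "$(endpoint_monitor_status "$1")" = "Online" ]
}
```

For example, `endpoint_is_online dr-endpoint || echo "DR endpoint is not healthy"` can serve as a pre-failover check.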

Key Maintenance Tasks

  1. Check Backup Status Daily
   velero backup get
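  2. Test DR Quarterly

    • Repeat Steps 7-8 during maintenance windows
  3. Update Backup Configuration

   # The Velero CLI has no "schedule update" subcommand; patch the Schedule resource in place.
   kubectl patch schedules.velero.io daily-backup -n velero --type merge \
     -p '{"spec":{"template":{"includedNamespaces":["prod","staging"]}}}'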
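The daily status check can be partly automated by flagging a backup that is older than expected. The sketch below works on Unix epoch seconds so that it stays portable; the function names and the hour threshold are assumptions for illustration.

```shell
# Age of a backup in whole hours.
# Usage: backup_age_hours CREATED_EPOCH [NOW_EPOCH]
backup_age_hours() {
  local created_epoch="$1" now_epoch="${2:-$(date +%s)}"
  echo $(( ($now_epoch - $created_epoch) / 3600 ))
}

# Succeed if the backup is older than MAX_HOURS.
# Usage: backup_is_stale CREATED_EPOCH MAX_HOURS [NOW_EPOCH]
backup_is_stale() {
  local now_epoch="${3:-$(date +%s)}"
  [ "$(backup_age_hours "$1" "$now_epoch")" -gt "$2" ]
}
```

With GNU date (an assumption; BSD and macOS date use different flags), a backup's creation timestamp can be converted with `date -u -d "$CREATED" +%s`. For a 24-hour schedule, a threshold slightly above 24 hours avoids false alarms while a backup is still running.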

Troubleshooting Tips

  1. Failed Backups
   velero backup describe BACKUP_NAME
   velero backup logs BACKUP_NAME
  2. Restoration Issues
   # On Velero releases before v1.10 this resource is called resticrepositories
   kubectl get backuprepositories -n velero
   velero restore describe RESTORE_NAME
  3. Connectivity Problems
   az network traffic-manager profile show \
     -g $RESOURCE_GROUP \
     -n $TRAFFIC_MANAGER \
     --query endpoints
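Velero logs can be long, and most lines are informational. Since each line carries a `level=` field, a small filter can narrow the output to the entries that usually matter. This is only a sketch; `log_problems` is a hypothetical helper name.

```shell
# Hypothetical filter: keep only warning and error lines from Velero log output.
# "|| true" keeps the exit status at 0 when nothing matches.
log_problems() {
  grep -E 'level=(error|warning)' || true
}
```

For example: `velero backup logs BACKUP_NAME | log_problems`.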

Note: Replace placeholder names (prod, myapp-service, etc.) with your actual resource names. Test all commands in a non-production environment first.

This end-to-end guide provides a production-ready DR solution for AKS. Remember to:

  1. Regularly validate backups
  2. Document procedures for your team
  3. Automate where possible using Azure Pipelines
  4. Monitor cluster health with Azure Monitor
