Matthew

Part 10: Resilience — Karpenter, HPA, Argo Rollouts, and Velero

Part of the series: Building a Production-Grade DevSecOps Pipeline on AWS


Introduction

Resilience engineering is about building systems that degrade gracefully, recover automatically, and deploy safely. This final part brings together four capabilities:

  • Karpenter: automatically provisions the right EC2 instances for pending pods, and removes them when they are no longer needed
  • HPA: scales pod replicas based on CPU and memory utilization
  • Argo Rollouts: deploys new versions with controlled canary traffic and automatic rollback on errors
  • Velero: backs up Kubernetes resources to S3 and snapshots persistent volumes for disaster recovery
┌─────────────────────────────────────────────────────────────────────┐
│  RESILIENCE LAYERS                                                  │
│                                                                     │
│  Traffic Spike                                                      │
│  └─► HPA: scale pods from 3 → 8                                     │
│       └─► Karpenter: provision new nodes to fit pending pods        │
│                                                                     │
│  New Deployment                                                     │
│  └─► Argo Rollouts: 20% canary → analysis → promote or rollback     │
│                                                                     │
│  Cluster Disaster                                                   │
│  └─► Velero: restore from S3 backup to new cluster                  │
└─────────────────────────────────────────────────────────────────────┘

Karpenter — Next-Generation Node Autoscaler

Why Karpenter over Cluster Autoscaler?

Feature        | Cluster Autoscaler             | Karpenter
Node selection | Pre-defined ASG instance types | Picks cheapest EC2 that fits pod requests
Spot support   | Limited                        | Native, with fallback ordering
Speed          | 3–5 minutes                    | 30–90 seconds
Consolidation  | Basic                          | Active: moves pods to fewer nodes, terminates unused
Configuration  | Per-ASG scaling groups         | Declarative NodePool CRD

Karpenter provisions EC2 instances directly through the EC2 Fleet API, so scaling decisions do not go through an Auto Scaling Group. It picks an instance type sized to the resource requests of the pending pods, chosen from the types the NodePool allows, which keeps you from paying for capacity that pre-sized node groups would leave idle.
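
To make "cheapest instance that fits" concrete, here is a toy sketch in Python. It is not Karpenter's scheduler: real provisioning also accounts for daemonsets, allocatable (not raw) capacity, Spot availability, and packing across several nodes. The catalog, the `cheapest_fit` helper, and especially the prices are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpu: float        # vCPUs
    memory_gib: float  # GiB of RAM
    hourly_usd: float  # placeholder price, NOT a real quote


# Hypothetical catalog: shapes follow the public instance specs,
# prices are made-up numbers that only preserve a plausible ordering.
CATALOG = [
    InstanceType("c5.large", 2, 4, 0.08),
    InstanceType("m5.large", 2, 8, 0.10),
    InstanceType("r5.large", 2, 16, 0.13),
    InstanceType("m5.xlarge", 4, 16, 0.19),
]


def cheapest_fit(pending_pods, catalog=CATALOG):
    """Return the cheapest instance type that fits all pending pods on one node.

    pending_pods is a list of (cpu_cores, memory_gib) requests.
    Returns None when no single instance type is large enough.
    """
    cpu_needed = sum(cpu for cpu, _ in pending_pods)
    mem_needed = sum(mem for _, mem in pending_pods)
    candidates = [
        it for it in catalog
        if it.vcpu >= cpu_needed and it.memory_gib >= mem_needed
    ]
    return min(candidates, key=lambda it: it.hourly_usd, default=None)


# Five pending pods, each requesting 250m CPU and 1 GiB of memory.
choice = cheapest_fit([(0.25, 1.0)] * 5)
print(choice.name if choice else "no single instance fits")
```

With these placeholder numbers the five pods need 1.25 vCPU and 5 GiB in total, so the 4 GiB type is ruled out and the next cheapest one is selected.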

Installation (Production Only)

Karpenter is deployed only on production clusters where cost optimization and fast scaling matter. Dev/staging use fixed 2-node groups.

# infrastructure/karpenter/applicationset.yaml
source:
  repoURL:        public.ecr.aws/karpenter   # OCI registry; charts.karpenter.sh only hosts legacy releases (up to v0.16)
  chart:          karpenter
  targetRevision: "0.37.0"
  helm:
    values: |
      serviceAccount:
        annotations:
          eks.amazonaws.com/role-arn: "{{karpenterRoleArn}}"
      settings:
        clusterName: "{{cluster}}"
        clusterEndpoint: "{{clusterEndpoint}}"
        interruptionQueue: "{{cluster}}-interruption"
      controller:
        resources:
          requests:
            cpu: 1
            memory: 1Gi

Karpenter IAM

# _modules/karpenter/main.tf

resource "aws_iam_role_policy" "karpenter" {
  name = "karpenter-policy"
  role = aws_iam_role.karpenter.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "EC2Management"
        Effect = "Allow"
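        Action = [
          "ec2:RunInstances",
          "ec2:TerminateInstances",
          "ec2:DescribeInstances",
          "ec2:DescribeInstanceTypes",
          "ec2:DescribeInstanceTypeOfferings",
          "ec2:DescribeAvailabilityZones",
          "ec2:DescribeImages",
          "ec2:DescribeSubnets",
          "ec2:DescribeSecurityGroups",
          "ec2:DescribeLaunchTemplates",
          "ec2:CreateLaunchTemplate",
          "ec2:DeleteLaunchTemplate",
          "ec2:CreateFleet",
          "ec2:CreateTags",
          "ec2:DescribeSpotPriceHistory",
          # Read-only lookups from Karpenter's reference policy:
          # AMI resolution via SSM parameters and on-demand pricing data
          "ssm:GetParameter",
          "pricing:GetProducts"
        ]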
        Resource = "*"
      },
      {
        Sid    = "EKSDescribe"
        Effect = "Allow"
        Action = [
          # REQUIRED: Karpenter needs this to discover the cluster endpoint and CA
          # Without it Karpenter cannot configure new nodes to join the cluster
          "eks:DescribeCluster"
        ]
        Resource = "arn:aws:eks:*:${var.account_id}:cluster/${var.cluster_name}"
      },
      {
        Sid    = "IAMPassRole"
        Effect = "Allow"
        Action = ["iam:PassRole"]
        Resource = var.node_role_arn
      },
      {
        Sid    = "SQSInterruption"
        Effect = "Allow"
        Action = [
          "sqs:DeleteMessage",
          "sqs:GetQueueUrl",
          "sqs:ReceiveMessage"
        ]
        Resource = aws_sqs_queue.interruption.arn
      }
    ]
  })
}

Lesson learned: eks:DescribeCluster is not optional. Without it, Karpenter cannot discover the cluster endpoint and certificate authority, so newly provisioned nodes cannot join the cluster. Pods remain Pending indefinitely. Always include this permission.

NodePool CRD

# infrastructure/karpenter/nodepools/templates/nodepool.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general
spec:
  template:
    metadata:
      labels:
        provisioner: karpenter
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: general

      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: [amd64]
        - key: karpenter.sh/capacity-type
          operator: In
          values: [spot, on-demand]   # Prefer spot, fall back to on-demand
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: [c, m, r]           # Compute-optimized, general-purpose, memory-optimized families
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values: [nano, micro, small] # Minimum medium instances

  limits:
    cpu: "100"
    memory: 400Gi

  disruption:
    consolidationPolicy: WhenUnderutilized   # Actively consolidate empty and underused nodes
    # consolidateAfter is omitted: in v1beta1 it is only valid with consolidationPolicy: WhenEmpty
    expireAfter: 720h   # Rotate nodes every 30 days (security hygiene)

EC2NodeClass CRD

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: general
spec:
  amiFamily: AL2   # Amazon Linux 2 — EKS-optimized AMI

  # Karpenter discovers subnets and security groups by tags
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "{{cluster}}"

  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "{{cluster}}"

  instanceProfile: "{{cluster}}-karpenter-node"

  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        encrypted: true
        kmsKeyID: "{{kmsKeyArn}}"

Verifying Karpenter

# Watch for new NodeClaims as pods scale up
kubectl --context prod-use1 get nodeclaims -w

# Check NodePool status
kubectl --context prod-use1 get nodepool general

# See which nodes Karpenter provisioned vs the initial managed node group
kubectl --context prod-use1 get nodes -L karpenter.sh/capacity-type,node.kubernetes.io/instance-type

HPA — Horizontal Pod Autoscaler

HPA watches CPU and memory metrics from the metrics-server and scales pod replicas up or down automatically.

# apps/myapp/templates/hpa.yaml
{{- if .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ include "myapp.fullname" . }}
  namespace: {{ .Release.Namespace }}
spec:
  scaleTargetRef:
    apiVersion: {{ if .Values.rollout.enabled }}argoproj.io/v1alpha1{{ else }}apps/v1{{ end }}
    kind: {{ if .Values.rollout.enabled }}Rollout{{ else }}Deployment{{ end }}
    name: {{ include "myapp.fullname" . }}
  minReplicas: {{ .Values.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage | default 70 }}
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
{{- end }}

Production values:

# values-production.yaml
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 60

Note the conditional scaleTargetRef: In production, HPA targets a Rollout resource (Argo Rollouts). In dev/staging, it targets a standard Deployment. The Helm template handles both cases via .Values.rollout.enabled.
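
The replica count HPA settles on follows the formula in the Kubernetes documentation: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance (10% by default) and clamped to minReplicas/maxReplicas. The Python sketch below reproduces that arithmetic for a single metric. The function name `desired_replicas` is an illustrative assumption, and the sketch ignores stabilization windows and pods that are not yet ready.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas, max_replicas, tolerance=0.1):
    """Approximate the HPA calculation for one Utilization metric."""
    ratio = current_utilization / target_utilization
    # Inside the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    # ceil(currentReplicas * currentMetric / targetMetric)
    wanted = math.ceil(current_replicas * current_utilization / target_utilization)
    # The result is clamped to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


# Production values from above: target 60%, min 3, max 10.
print(desired_replicas(3, 160, 60, min_replicas=3, max_replicas=10))  # 8
print(desired_replicas(8, 20, 60, min_replicas=3, max_replicas=10))   # 3
```

When several metrics are configured, as with CPU and memory here, the HPA computes a count per metric and uses the largest.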

HPA + Karpenter Interaction

Traffic spike arrives
      │
      ▼
HPA: CPU > 60% → scale pods from 3 to 8
      │
      ▼
5 new pods: Pending (not enough node capacity)
      │
      ▼
Karpenter: detects Pending pods → evaluates requests
           → finds cheapest EC2 that fits → provisions 2x m5.large
      │
      ▼ ~60 seconds
New nodes join cluster → pods schedule → Running
      │
Traffic normalizes
      │
HPA: CPU < 60% → scale pods from 8 back to 3
      │
5 pods terminate
      │
Karpenter: 2 nodes underutilized → consolidate → terminate nodes

Argo Rollouts — Canary Deployments

Argo Rollouts replaces the standard Kubernetes Deployment with a Rollout resource that supports progressive delivery strategies. In production, every deployment goes through a canary phase.

Installation (Production Only)

# infrastructure/argo-rollouts/applicationset.yaml
source:
  repoURL:        https://argoproj.github.io/argo-helm
  chart:          argo-rollouts
  targetRevision: "2.35.3"
  helm:
    values: |
      installCRDs: true
      dashboard:
        enabled: true
        service:
          type: ClusterIP

Rollout CRD (replaces Deployment in production)

# apps/myapp/templates/deployment.yaml
{{- if .Values.rollout.enabled }}
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: {{ include "myapp.fullname" . }}
  namespace: {{ .Release.Namespace }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      {{- include "myapp.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "myapp.selectorLabels" . | nindent 8 }}
    spec:
      containers:
        - name: myapp
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          ports:
            - containerPort: 8080
              name: http
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi

  strategy:
    canary:
      canaryService: {{ include "myapp.fullname" . }}-canary
      stableService: {{ include "myapp.fullname" . }}
      steps:
        - setWeight: 20          # Step 1: 20% of traffic to canary
        - pause: { duration: 5m } # Step 2: Bake for 5 minutes
        - analysis:               # Step 3: Automated metric check
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: {{ include "myapp.fullname" . }}-canary
        - setWeight: 100         # Step 4: Promote to 100%

{{- else }}
# Standard Deployment for dev/staging
apiVersion: apps/v1
kind: Deployment
...
{{- end }}

AnalysisTemplate — Automated Promotion Gate

# apps/myapp/templates/analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
  namespace: {{ .Release.Namespace }}
spec:
  args:
    - name: service-name

  metrics:
    - name: success-rate
      interval: 60s
      count: 5             # Take 5 measurements, one every 60s
      failureLimit: 1      # Tolerate one failed measurement; a second fails the analysis and aborts the rollout

      provider:
        prometheus:
          address: http://prometheus-operated.monitoring.svc.cluster.local:9090
          query: |
            sum(
              rate(myapp_http_requests_total{
                service="{{`{{args.service-name}}`}}",
                status_code!~"5.."
              }[5m])
            )
            /
            sum(
              rate(myapp_http_requests_total{
                service="{{`{{args.service-name}}`}}"
              }[5m])
            )

      successCondition: result[0] >= 0.99   # 99%+ success rate required
      failureCondition: result[0] < 0.99
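
As a rough illustration of how the gate behaves, the Python sketch below replays a list of success-rate samples against the same threshold. It assumes the documented Argo Rollouts semantics in which a metric fails once the number of failed measurements exceeds failureLimit; the function `evaluate_metric` and the sample values are hypothetical.

```python
def evaluate_metric(measurements, threshold=0.99, count=5, failure_limit=1):
    """Return "Successful" or "Failed" for a series of success-rate samples."""
    failures = 0
    for value in measurements[:count]:
        if value < threshold:              # mirrors failureCondition
            failures += 1
            if failures > failure_limit:
                return "Failed"            # stops early; the rollout would abort
    return "Successful"


# One dip below 99% is tolerated with failureLimit: 1 ...
print(evaluate_metric([0.997, 0.995, 0.98, 0.999, 0.996]))
# ... but a second one fails the analysis.
print(evaluate_metric([0.997, 0.98, 0.97, 0.999, 0.999]))
```

Setting `failure_limit=0` in the sketch corresponds to failing on the first bad measurement.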

Canary Service

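Traffic splitting uses two Services: stable (the regular Service) and canary. Both start from the same selector; at runtime Argo Rollouts adds the pod-template hash to each, so the canary Service routes only to canary pods. An exact percentage split also needs a trafficRouting provider (for example ALB, NGINX, or Istio) in the canary strategy, which the Rollout above does not show; without one, Argo Rollouts approximates setWeight through the ratio of canary to stable replicas.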

# apps/myapp/templates/service-canary.yaml
{{- if .Values.rollout.enabled }}
apiVersion: v1
kind: Service
metadata:
  name: {{ include "myapp.fullname" . }}-canary
  namespace: {{ .Release.Namespace }}
spec:
  selector:
    {{- include "myapp.selectorLabels" . | nindent 4 }}
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http
{{- end }}

Canary Deployment Walkthrough

1. CI pushes new image: sha-abc123 → myapp-gitops updated
2. ArgoCD detects diff → triggers Rollout update
3. Argo Rollouts creates new ReplicaSet with sha-abc123

   Traffic: 80% → stable (sha-xyz789), 20% → canary (sha-abc123)

4. 5-minute pause
   → Monitor Grafana: error rate on canary service?
   → Logs in CloudWatch: any exceptions?

5. AnalysisRun queries Prometheus (5 measurements × 1 min)
   → success_rate = 0.997 (99.7% success) ✓ PASS

6. setWeight: 100% → all traffic to sha-abc123
7. Old ReplicaSet (sha-xyz789) scaled to 0 after stability window
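
When no trafficRouting provider is configured, the Argo Rollouts documentation describes setWeight as a best-effort replica ratio with both sides rounded up. The Python sketch below shows that arithmetic only; it assumes a fixed replica count and ignores maxSurge, maxUnavailable, and HPA activity. `canary_replica_counts` is a hypothetical helper name.

```python
def canary_replica_counts(replicas, weight):
    """Return (canary, stable) pod counts for a setWeight step, rounding up."""
    canary = -(-replicas * weight // 100)          # ceil(replicas * weight / 100)
    stable = -(-replicas * (100 - weight) // 100)  # ceil of the remaining share
    return canary, stable


print(canary_replica_counts(3, 20))    # (1, 3)
print(canary_replica_counts(10, 20))   # (2, 8)
```

At 3 replicas a 20% step therefore yields one canary pod next to three stable pods, so the canary is a quarter of the running pods rather than exactly a fifth.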

Manual Intervention

# Watch rollout progress
kubectl argo rollouts status myapp -n myapp --watch

# Manually promote past the current pause step (add --full to skip all remaining steps and analysis)
kubectl argo rollouts promote myapp -n myapp

# Manually abort (rollback to stable immediately)
kubectl argo rollouts abort myapp -n myapp

# Access the Argo Rollouts dashboard
kubectl port-forward svc/argo-rollouts-dashboard -n argo-rollouts 3100:3100
# Open http://localhost:3100

Velero — Backup and Disaster Recovery

Velero backs up Kubernetes resource definitions to S3 and takes EBS snapshots of persistent volumes. If a cluster is accidentally deleted or corrupted, you can restore both to a new cluster.

What Velero Backs Up

  • All Kubernetes objects (Deployments, Services, ConfigMaps, Secrets, etc.)
  • PersistentVolumeClaim snapshots (Prometheus data, Grafana dashboards, etc.)
  • Namespace structure

Installation

# infrastructure/velero/applicationset.yaml
source:
  repoURL:        https://vmware-tanzu.github.io/helm-charts
  chart:          velero
  targetRevision: "6.4.0"
  helm:
    values: |
      serviceAccount:
        server:
          annotations:
            eks.amazonaws.com/role-arn: "{{veleroRoleArn}}"
      configuration:
        backupStorageLocation:
          - name: default
            provider: aws
            bucket: "myapp-velero-{{cluster}}"
            config:
              region: "{{region}}"
        volumeSnapshotLocation:
          - name: default
            provider: aws
            config:
              region: "{{region}}"
      initContainers:
        - name: velero-plugin-for-aws
          image: velero/velero-plugin-for-aws:v1.9.0
          volumeMounts:
            - mountPath: /target
              name: plugins

Scheduled Backup

# infrastructure/velero/schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"   # 2 AM UTC daily
  template:
    includedNamespaces:
      - myapp
      - monitoring
      - argocd
    excludedResources:
      - events
      - events.events.k8s.io
    snapshotVolumes: true
    ttl: 720h   # 30 days retention
    storageLocation: default
    volumeSnapshotLocations:
      - default
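
To see what the 720h TTL means for a given backup, the Python sketch below derives an approximate expiry from a backup name. It assumes scheduled backups are named `<schedule>-<YYYYMMDDhhmmss>` with the timestamp in UTC; Velero itself computes expiry from the Backup object's creation time, so treat this as an estimate. `backup_expiry` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def backup_expiry(backup_name, ttl_hours=720):
    """Parse the trailing YYYYMMDDhhmmss timestamp and add the TTL."""
    stamp = backup_name.rsplit("-", 1)[1]
    created = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return created + timedelta(hours=ttl_hours)


print(backup_expiry("daily-backup-20260308020000").isoformat())
# 2026-04-07T02:00:00+00:00
```

A restore drill can use the same check to confirm that the backup it plans to restore from has not yet aged out.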

Restore Procedure

# List available backups
velero backup get

# Restore from a specific backup
velero restore create --from-backup daily-backup-20260308020000

# Watch restore progress
velero restore describe <restore-name> --details

# Verify restored resources
kubectl get all -n myapp
kubectl get pvc -n monitoring

Critical reminder: An untested backup is not a backup. Run a restore drill at least quarterly into a temporary cluster. The restore procedure should be documented and rehearsed so it is not being learned for the first time during an actual outage.


PodDisruptionBudget

Karpenter's consolidation can evict pods to move them to fewer nodes. Without a PDB, it might evict too many pods at once and cause downtime.

# apps/myapp/templates/pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: {{ include "myapp.fullname" . }}
  namespace: {{ .Release.Namespace }}
spec:
  minAvailable: 2   # Always keep at least 2 pods running during disruption
  selector:
    matchLabels:
      {{- include "myapp.selectorLabels" . | nindent 6 }}

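With minReplicas: 3 and minAvailable: 2, Karpenter can evict only one pod at a time while the app runs at its minimum of 3 replicas. The remaining two continue serving traffic while the evicted pod reschedules on another node. When HPA has scaled out, more pods can be evicted at once, but never so many that fewer than 2 remain available.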
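
The budget arithmetic is simple enough to check by hand. This Python sketch computes how many voluntary evictions a minAvailable budget permits for a given number of healthy pods; `allowed_disruptions` is an illustrative name, not a Kubernetes API.

```python
def allowed_disruptions(healthy_pods, min_available=2):
    """Voluntary evictions the budget permits right now."""
    return max(0, healthy_pods - min_available)


for replicas in (2, 3, 8, 10):
    print(replicas, allowed_disruptions(replicas))
# 2 0
# 3 1
# 8 6
# 10 8
```

The first row shows the edge case to watch: if only 2 pods are healthy, the budget blocks every voluntary eviction, and node consolidation waits until a third pod is ready.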


Complete Resilience Picture

NORMAL OPERATION
─────────────────
3 pods (minReplicas) on 2 managed nodes
HPA watching CPU (target: 60%)
Karpenter watching for pending pods
Velero backing up daily at 2 AM UTC

TRAFFIC SPIKE
─────────────────
CPU > 60% → HPA scales to 8 pods
2 pods pending (no capacity) → Karpenter provisions m5.large
All 8 pods running → serving traffic
Spike ends → HPA scales to 3 pods
5 pods terminate → Karpenter consolidates → terminates extra node

NEW DEPLOYMENT (production)
─────────────────
ArgoCD syncs new image → Rollout starts
20% canary → 5min bake → AnalysisRun → 100% promote
OR: error rate > 1% → automatic rollback to previous version

DISASTER RECOVERY
─────────────────
Cluster accidentally deleted → restore from Velero backup
velero restore create --from-backup <last-good-backup>
Kubernetes objects restored → EBS snapshots attached → running in ~15 min

Summary

By the end of Part 10 — and the entire series — you have:

  • Karpenter provisioning right-sized EC2 instances on demand, consolidating when idle
  • HPA scaling pods 3→10 based on CPU and memory utilization, targeting a Rollout in production
  • Argo Rollouts deploying every production change as a canary with automated Prometheus-based promotion gates
  • Velero running scheduled daily backups with 30-day retention to S3
  • PodDisruptionBudget preventing Karpenter from evicting too many pods at once

Series Conclusion

You have now built a complete production-grade DevSecOps platform:

Layer          | What You Built
Foundation     | AWS Organizations, 4 accounts, SSO, SCPs
Infrastructure | Terraform modules, Terragrunt DRY configs, 6 VPCs
Compute        | 6 EKS clusters (k8s 1.29) across 3 environments and 2 regions
GitOps         | ArgoCD hub-spoke, 35+ ApplicationSets, automated sync
CI/CD          | GitHub Actions + OIDC + Trivy + Cosign + ECR
Secrets        | AWS Secrets Manager + ESO + IRSA
Security       | Kyverno policies + Falco runtime detection + WAF + GuardDuty
Observability  | Prometheus + Grafana + Fluent Bit + CloudWatch
Resilience     | Karpenter + HPA + Argo Rollouts canary + Velero DR
Networking     | Route53 latency routing + ACM TLS + ALB + NetworkPolicies

This is the platform that a growing engineering team with 50–500 developers would build and operate. Each component was chosen for a reason, wired to the others, and tested against real failures.


Thank you for following this series. Source code:
