Matthew

Posted on May 20

Part 10: Resilience — Karpenter, HPA, Argo Rollouts, and Velero

#kubernetes #aws #devops #cloud

Part of the series: Building a Production-Grade DevSecOps Pipeline on AWS

Introduction

Resilience engineering is about building systems that degrade gracefully, recover automatically, and deploy safely. This final part brings together four capabilities:

Karpenter: provisions right-sized EC2 instances for pending pods and removes them when they are no longer needed
HPA: scales pod replicas based on CPU and memory utilization
Argo Rollouts: shifts traffic to new versions in controlled canary steps and rolls back automatically when analysis fails
Velero: backs up Kubernetes resources to S3 and snapshots persistent volumes for disaster recovery

┌─────────────────────────────────────────────────────────────────────┐
│  RESILIENCE LAYERS                                                  │
│                                                                     │
│  Traffic Spike                                                      │
│  └─► HPA: scale pods from 3 → 8                                     │
│       └─► Karpenter: provision new nodes to fit pending pods        │
│                                                                     │
│  New Deployment                                                     │
│  └─► Argo Rollouts: 20% canary → analysis → promote or rollback     │
│                                                                     │
│  Cluster Disaster                                                   │
│  └─► Velero: restore from S3 backup to new cluster                  │
└─────────────────────────────────────────────────────────────────────┘

Karpenter — Next-Generation Node Autoscaler

Why Karpenter over Cluster Autoscaler?

Feature	Cluster Autoscaler	Karpenter
Node selection	Pre-defined ASG instance types	Picks cheapest EC2 that fits pod requests
Spot support	Limited	Native, with fallback ordering
Speed	3–5 minutes	30–90 seconds
Consolidation	Basic	Active: moves pods to fewer nodes, terminates unused
Configuration	Per-ASG scaling groups	Declarative NodePool CRD

Karpenter provisions EC2 instances directly through the EC2 Fleet API, so no Auto Scaling Group sits between a pending pod and a new node. It chooses an instance type sized to the combined resource requests of the pending pods, which reduces the idle capacity you pay for.

Installation (Production Only)

Karpenter is deployed only on production clusters where cost optimization and fast scaling matter. Dev/staging use fixed 2-node groups.

# infrastructure/karpenter/applicationset.yaml
source:
  repoURL:        public.ecr.aws/karpenter   # OCI registry (no oci:// prefix in Argo CD); charts.karpenter.sh has no releases from v0.17.0 onward
  chart:          karpenter
  targetRevision: "0.37.0"
  helm:
    values: |
      serviceAccount:
        annotations:
          eks.amazonaws.com/role-arn: "{{karpenterRoleArn}}"
      settings:
        clusterName: "{{cluster}}"
        clusterEndpoint: "{{clusterEndpoint}}"
        interruptionQueue: "{{cluster}}-interruption"
      controller:
        resources:
          requests:
            cpu: 1
            memory: 1Gi

Karpenter IAM

# _modules/karpenter/main.tf

resource "aws_iam_role_policy" "karpenter" {
  name = "karpenter-policy"
  role = aws_iam_role.karpenter.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "EC2Management"
        Effect = "Allow"
        Action = [
          "ec2:RunInstances",
          "ec2:TerminateInstances",
          "ec2:DescribeInstances",
          "ec2:DescribeInstanceTypes",
          "ec2:DescribeSubnets",
          "ec2:DescribeSecurityGroups",
          "ec2:DescribeLaunchTemplates",
          "ec2:CreateLaunchTemplate",
          "ec2:CreateFleet",
          "ec2:CreateTags",
          "ec2:DescribeSpotPriceHistory"
        ]
        Resource = "*"
      },
      {
        Sid    = "EKSDescribe"
        Effect = "Allow"
        Action = [
          # REQUIRED: Karpenter needs this to discover the cluster endpoint and CA
          # Without it Karpenter cannot configure new nodes to join the cluster
          "eks:DescribeCluster"
        ]
        Resource = "arn:aws:eks:*:${var.account_id}:cluster/${var.cluster_name}"
      },
      {
        Sid    = "IAMPassRole"
        Effect = "Allow"
        Action = ["iam:PassRole"]
        Resource = var.node_role_arn
      },
      {
        Sid    = "SQSInterruption"
        Effect = "Allow"
        Action = [
          "sqs:DeleteMessage",
          "sqs:GetQueueUrl",
          "sqs:ReceiveMessage"
        ]
        Resource = aws_sqs_queue.interruption.arn
      }
    ]
  })
}

Lesson learned: eks:DescribeCluster is not optional. Without it, Karpenter cannot discover the cluster endpoint and certificate authority, so newly provisioned nodes cannot join the cluster. Pods remain Pending indefinitely. Always include this permission.

NodePool CRD

# infrastructure/karpenter/nodepools/templates/nodepool.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general
spec:
  template:
    metadata:
      labels:
        provisioner: karpenter
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: general

      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: [amd64]
        - key: karpenter.sh/capacity-type
          operator: In
          values: [spot, on-demand]   # Prefer spot, fall back to on-demand
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: [c, m, r]           # Compute-optimized, general-purpose, memory-optimized families
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values: [nano, micro, small] # Exclude sizes below medium

  limits:
    cpu: "100"
    memory: 400Gi

  disruption:
    consolidationPolicy: WhenUnderutilized   # Actively consolidate empty and underused nodes
    # consolidateAfter is only valid with consolidationPolicy: WhenEmpty in v1beta1, so it is omitted here
    expireAfter: 720h   # Rotate nodes every 30 days (security hygiene)
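
Consolidation and expiry both replace nodes, so it can be useful to cap how many nodes Karpenter disrupts at once. The fragment below is a sketch of the budgets field as documented for the v1beta1 NodePool API in recent 0.3x releases. The 10% cap and the weekday window are illustrative assumptions, not settings from this series.

```yaml
# Fragment of NodePool spec (karpenter.sh/v1beta1); values are illustrative
disruption:
  consolidationPolicy: WhenUnderutilized
  expireAfter: 720h
  budgets:
    - nodes: "10%"                  # never disrupt more than 10% of this pool's nodes at once
    - nodes: "0"                    # block voluntary disruption entirely...
      schedule: "0 9 * * mon-fri"   # ...starting at 09:00 on weekdays
      duration: 8h                  # ...for eight hours
```

When several budgets are active at the same time, the most restrictive one applies, so the second entry overrides the first during its window.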

EC2NodeClass CRD

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: general
spec:
  amiFamily: AL2   # Amazon Linux 2 — EKS-optimized AMI

  # Karpenter discovers subnets and security groups by tags
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "{{cluster}}"

  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "{{cluster}}"

  instanceProfile: "{{cluster}}-karpenter-node"

  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        encrypted: true
        kmsKeyID: "{{kmsKeyArn}}"

Verifying Karpenter

# Watch for new NodeClaims as pods scale up
kubectl --context prod-use1 get nodeclaims -w

# Check NodePool status
kubectl --context prod-use1 get nodepool general

# See which nodes Karpenter provisioned vs the initial managed node group
kubectl --context prod-use1 get nodes -L karpenter.sh/capacity-type,node.kubernetes.io/instance-type
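
For those commands to show anything, the cluster needs pods that do not fit on the existing nodes. A minimal sketch of a throwaway workload for that purpose follows. The name inflate, the default namespace, the pause image tag, and the 1-CPU request are all assumptions chosen for illustration and are not part of this series' repositories.

```yaml
# Hypothetical test workload: each replica requests a full CPU, so scaling
# it up leaves pods Pending and gives Karpenter something to provision for.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate            # assumed name, not from the series repos
  namespace: default
spec:
  replicas: 0              # start at zero; scale up manually to trigger provisioning
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9   # does nothing; only reserves resources
          resources:
            requests:
              cpu: "1"
```

Scaling it with kubectl --context prod-use1 scale deployment inflate --replicas 5 should produce NodeClaims in the watch above, and scaling back to 0 should let consolidation remove the extra nodes again.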

HPA — Horizontal Pod Autoscaler

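HPA reads CPU and memory metrics from the metrics-server and scales pod replicas up or down automatically. Utilization targets are percentages of each container's resource requests, so the target pods must declare CPU and memory requests.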

# apps/myapp/templates/hpa.yaml
{{- if .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ include "myapp.fullname" . }}
  namespace: {{ .Release.Namespace }}
spec:
  scaleTargetRef:
    apiVersion: {{ if .Values.rollout.enabled }}argoproj.io/v1alpha1{{ else }}apps/v1{{ end }}
    kind: {{ if .Values.rollout.enabled }}Rollout{{ else }}Deployment{{ end }}
    name: {{ include "myapp.fullname" . }}
  minReplicas: {{ .Values.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage | default 70 }}
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
{{- end }}

Production values:

# values-production.yaml
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 60

Note the conditional scaleTargetRef: In production, HPA targets a Rollout resource (Argo Rollouts). In dev/staging, it targets a standard Deployment. The Helm template handles both cases via .Values.rollout.enabled.
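
How quickly the HPA scales out and how cautiously it scales back in can be tuned through the optional behavior field of autoscaling/v2. A minimal sketch follows. The resource names are hypothetical, and the window lengths and rate limits are illustrative assumptions rather than values from this series.

```yaml
# Standalone example HPA (hypothetical names) showing the behavior field
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-example
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp-example
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0      # react to spikes immediately
      policies:
        - type: Percent
          value: 100                     # at most double the replica count...
          periodSeconds: 60              # ...per minute
    scaleDown:
      stabilizationWindowSeconds: 300    # act on the highest recommendation of the last 5 minutes
      policies:
        - type: Pods
          value: 1                       # then remove at most one pod...
          periodSeconds: 60              # ...per minute
```

A slower scale-down tends to reduce churn for Karpenter as well, since fewer short-lived nodes need to be provisioned and then consolidated.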

HPA + Karpenter Interaction

Traffic spike arrives
      │
      ▼
HPA: CPU > 60% → scale pods from 3 to 8
      │
      ▼
5 new pods: Pending (not enough node capacity)
      │
      ▼
Karpenter: detects Pending pods → evaluates requests
           → finds cheapest EC2 that fits → provisions 2x m5.large
      │
      ▼ ~60 seconds
New nodes join cluster → pods schedule → Running
      │
Traffic normalizes
      │
HPA: CPU < 60% → scale pods from 8 back to 3
      │
5 pods terminate
      │
Karpenter: 2 nodes underutilized → consolidate → terminate nodes

Argo Rollouts — Canary Deployments

Argo Rollouts replaces the standard Kubernetes Deployment with a Rollout resource that supports progressive delivery strategies. In production, every deployment goes through a canary phase.

Installation (Production Only)

# infrastructure/argo-rollouts/applicationset.yaml
source:
  repoURL:        https://argoproj.github.io/argo-helm
  chart:          argo-rollouts
  targetRevision: "2.35.3"
  helm:
    values: |
      installCRDs: true
      dashboard:
        enabled: true
        service:
          type: ClusterIP

Rollout CRD (replaces Deployment in production)

# apps/myapp/templates/deployment.yaml
{{- if .Values.rollout.enabled }}
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: {{ include "myapp.fullname" . }}
  namespace: {{ .Release.Namespace }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}   # Omitted when the HPA owns the replica count
  {{- end }}
  selector:
    matchLabels:
      {{- include "myapp.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "myapp.selectorLabels" . | nindent 8 }}
    spec:
      containers:
        - name: myapp
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          ports:
            - containerPort: 8080
              name: http
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi

  strategy:
    canary:
      canaryService: {{ include "myapp.fullname" . }}-canary
      stableService: {{ include "myapp.fullname" . }}
      steps:
        - setWeight: 20          # Step 1: 20% of traffic to canary
        - pause: { duration: 5m } # Step 2: Bake for 5 minutes
        - analysis:               # Step 3: Automated metric check
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: {{ include "myapp.fullname" . }}-canary
        - setWeight: 100         # Step 4: Promote to 100%

{{- else }}
# Standard Deployment for dev/staging
apiVersion: apps/v1
kind: Deployment
...
{{- end }}

AnalysisTemplate — Automated Promotion Gate

# apps/myapp/templates/analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
  namespace: {{ .Release.Namespace }}
spec:
  args:
    - name: service-name

  metrics:
    - name: success-rate
      interval: 60s
      count: 5             # Take 5 measurements, one per interval
      failureLimit: 1      # Tolerates one failed measurement; a second fails the analysis and triggers rollback

      provider:
        prometheus:
          address: http://prometheus-operated.monitoring.svc.cluster.local:9090
          query: |
            sum(
              rate(myapp_http_requests_total{
                service="{{`{{args.service-name}}`}}",
                status_code!~"5.."
              }[5m])
            )
            /
            sum(
              rate(myapp_http_requests_total{
                service="{{`{{args.service-name}}`}}"
              }[5m])
            )

      successCondition: result[0] >= 0.99   # 99%+ success rate required
      failureCondition: result[0] < 0.99
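
If the canary receives no requests during a measurement window, the query can return an empty result, and indexing result[0] then makes the measurement error out instead of passing or failing. One way to handle that is sketched below. The len(result) guard treats "no data" as a pass, which is an assumption you may prefer to invert for critical services. The template name is hypothetical, and the manifest is written as plain YAML rather than a Helm template, so the args placeholder is not escaped.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate-guarded     # hypothetical name
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 60s
      count: 5
      failureLimit: 0            # any failed measurement fails the analysis
      # Pass when there is no data, otherwise require a 99% success rate
      successCondition: len(result) == 0 || result[0] >= 0.99
      failureCondition: len(result) > 0 && result[0] < 0.99
      provider:
        prometheus:
          address: http://prometheus-operated.monitoring.svc.cluster.local:9090
          query: |
            sum(rate(myapp_http_requests_total{service="{{args.service-name}}",status_code!~"5.."}[5m]))
            /
            sum(rate(myapp_http_requests_total{service="{{args.service-name}}"}[5m]))
```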

Canary Service

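Argo Rollouts manages two Services: the stable Service and a canary Service. Both are defined with the same selector; at runtime the controller adds a rollouts-pod-template-hash label to each selector so that the canary Service reaches only canary pods. Without a trafficRouting block, as in the Rollout above, setWeight is approximated by the ratio of canary to stable pods rather than enforced at the load balancer.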

# apps/myapp/templates/service-canary.yaml
{{- if .Values.rollout.enabled }}
apiVersion: v1
kind: Service
metadata:
  name: {{ include "myapp.fullname" . }}-canary
  namespace: {{ .Release.Namespace }}
spec:
  selector:
    {{- include "myapp.selectorLabels" . | nindent 4 }}
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http
{{- end }}
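
The Rollout above names the two Services but does not configure a traffic router. Since this series fronts the application with an ALB, one option is the Argo Rollouts ALB integration. The fragment below is a sketch that assumes the AWS Load Balancer Controller is installed and that an Ingress named myapp already points at the stable Service through a use-annotation action backend. The Ingress name, Service names, and port are assumptions.

```yaml
# Fragment of Rollout spec (argoproj.io/v1alpha1); names are assumptions
strategy:
  canary:
    canaryService: myapp-canary
    stableService: myapp
    trafficRouting:
      alb:
        ingress: myapp        # hypothetical Ingress managed by the AWS Load Balancer Controller
        servicePort: 80       # Service port the Ingress forwards to
    steps:
      - setWeight: 20         # applied as a weighted forward action on the ALB
      - pause: { duration: 5m }
      - setWeight: 100
```

With a traffic router in place, the 20% step no longer depends on how many canary pods happen to be running.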

Canary Deployment Walkthrough

1. CI pushes new image: sha-abc123 → myapp-gitops updated
2. ArgoCD detects diff → triggers Rollout update
3. Argo Rollouts creates new ReplicaSet with sha-abc123

   Traffic: 80% → stable (sha-xyz789), 20% → canary (sha-abc123)

4. 5-minute pause
   → Monitor Grafana: error rate on canary service?
   → Logs in CloudWatch: any exceptions?

5. AnalysisRun queries Prometheus (5 measurements × 1 min)
   → success_rate = 0.997 (99.7% success) ✓ PASS

6. setWeight: 100% → all traffic to sha-abc123
7. Old ReplicaSet (sha-xyz789) scaled to 0 after stability window

Manual Intervention

# Watch rollout progress
kubectl argo rollouts status myapp -n myapp --watch

# Manually promote past the current pause step
# (add --full to skip all remaining steps and analysis)
kubectl argo rollouts promote myapp -n myapp

# Manually abort (rollback to stable immediately)
kubectl argo rollouts abort myapp -n myapp

# Access the Argo Rollouts dashboard
kubectl port-forward svc/argo-rollouts-dashboard -n argo-rollouts 3100:3100
# Open http://localhost:3100

Velero — Backup and Disaster Recovery

Velero backs up Kubernetes resource definitions to S3 and takes EBS snapshots of persistent volumes. If a cluster is accidentally deleted or corrupted, you can restore both into a new cluster.

What Velero Backs Up

All Kubernetes objects (Deployments, Services, ConfigMaps, Secrets, etc.)
EBS snapshots of the volumes behind PersistentVolumeClaims (Prometheus data, Grafana dashboards, etc.)
Namespace structure

Installation

# infrastructure/velero/applicationset.yaml
source:
  repoURL:        https://vmware-tanzu.github.io/helm-charts
  chart:          velero
  targetRevision: "6.4.0"
  helm:
    values: |
      serviceAccount:
        server:
          annotations:
            eks.amazonaws.com/role-arn: "{{veleroRoleArn}}"
      configuration:
        backupStorageLocation:
          - name: default
            provider: aws
            bucket: "myapp-velero-{{cluster}}"
            config:
              region: "{{region}}"
        volumeSnapshotLocation:
          - name: default
            provider: aws
            config:
              region: "{{region}}"
      initContainers:
        - name: velero-plugin-for-aws
          image: velero/velero-plugin-for-aws:v1.9.0
          volumeMounts:
            - mountPath: /target
              name: plugins

Scheduled Backup

# infrastructure/velero/schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"   # 2 AM UTC daily
  template:
    includedNamespaces:
      - myapp
      - monitoring
      - argocd
    excludedResources:
      - events
      - events.events.k8s.io
    snapshotVolumes: true
    ttl: 720h   # 30 days retention
    storageLocation: default
    volumeSnapshotLocations:
      - default

Restore Procedure

# List available backups
velero backup get

# Restore from a specific backup
velero restore create --from-backup daily-backup-20260308020000

# Watch restore progress
velero restore describe <restore-name> --details

# Verify restored resources
kubectl get all -n myapp
kubectl get pvc -n monitoring

Critical reminder: An untested backup is not a backup. Run a restore drill at least quarterly into a temporary cluster. The restore procedure should be documented and rehearsed so it is not being learned for the first time during an actual outage.
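
Between full-cluster rehearsals, a backup can also be restored into a scratch namespace on an existing cluster as a cheaper smoke test. The Restore resource below is a sketch of that. The restore name and the myapp-restore-drill target namespace are hypothetical, and the backup name reuses the example from the commands above.

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: myapp-restore-drill          # hypothetical name
  namespace: velero
spec:
  backupName: daily-backup-20260308020000
  includedNamespaces:
    - myapp
  namespaceMapping:
    myapp: myapp-restore-drill       # restore into a scratch namespace, leaving myapp untouched
  restorePVs: true                   # also recreate volumes from their snapshots
```

After applying it, velero restore describe myapp-restore-drill --details should show the same progress output as a CLI-created restore. This does not replace the temporary-cluster drill, because it does not exercise cluster bootstrap or Velero installation.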

PodDisruptionBudget

Karpenter's consolidation can evict pods to move them to fewer nodes. Without a PDB, it might evict too many pods at once and cause downtime.

# apps/myapp/templates/pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: {{ include "myapp.fullname" . }}
  namespace: {{ .Release.Namespace }}
spec:
  minAvailable: 2   # Always keep at least 2 pods running during disruption
  selector:
    matchLabels:
      {{- include "myapp.selectorLabels" . | nindent 6 }}

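At the minimum of 3 replicas, minAvailable: 2 means Karpenter can evict only one pod at a time. The remaining two continue serving traffic while the evicted pod reschedules on another node. At higher replica counts the same budget allows more simultaneous evictions, because only 2 pods must stay available.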
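
If you would rather cap evictions at one pod at a time regardless of how far the HPA has scaled out, maxUnavailable expresses that directly. A sketch follows. The resource name and label are placeholders standing in for the Helm helpers used in the template above.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp                        # placeholder for the templated fullname
spec:
  maxUnavailable: 1                  # at most one pod may be voluntarily evicted at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: myapp  # placeholder for the chart's selector labels
```

A PodDisruptionBudget accepts either minAvailable or maxUnavailable, not both, so this would replace the budget above rather than sit alongside it.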

Complete Resilience Picture

NORMAL OPERATION
─────────────────
3 pods (minReplicas) on 2 managed nodes
HPA watching CPU (target: 60%)
Karpenter watching for pending pods
Velero backing up daily at 2 AM UTC

TRAFFIC SPIKE
─────────────────
CPU > 60% → HPA scales to 8 pods
5 pods pending (no capacity) → Karpenter provisions 2x m5.large
All 8 pods running → serving traffic
Spike ends → HPA scales to 3 pods
5 pods terminate → Karpenter consolidates → terminates extra nodes

NEW DEPLOYMENT (production)
─────────────────
ArgoCD syncs new image → Rollout starts
20% canary → 5min bake → AnalysisRun → 100% promote
OR: error rate > 1% → automatic rollback to previous version

DISASTER RECOVERY
─────────────────
Cluster accidentally deleted → restore from Velero backup
velero restore create --from-backup <last-good-backup>
Kubernetes objects restored → EBS snapshots attached → running in ~15 min

Summary

By the end of Part 10 — and the entire series — you have:

✅ Karpenter provisioning right-sized EC2 instances on demand, consolidating when idle
✅ HPA scaling pods 3→10 based on CPU utilization, targeting a Rollout in production
✅ Argo Rollouts deploying every production change as a canary with automated Prometheus-based promotion gates
✅ Velero running scheduled daily backups with 30-day retention to S3
✅ PodDisruptionBudget preventing Karpenter from evicting too many pods at once

Series Conclusion

You have now built a complete production-grade DevSecOps platform:

Layer	What You Built
Foundation	AWS Organizations, 4 accounts, SSO, SCPs
Infrastructure	Terraform modules, Terragrunt DRY configs, 6 VPCs
Compute	6 EKS clusters (k8s 1.29) across 3 environments and 2 regions
GitOps	ArgoCD hub-spoke, 35+ ApplicationSets, automated sync
CI/CD	GitHub Actions + OIDC + Trivy + Cosign + ECR
Secrets	AWS Secrets Manager + ESO + IRSA
Security	Kyverno policies + Falco runtime detection + WAF + GuardDuty
Observability	Prometheus + Grafana + Fluent Bit + CloudWatch
Resilience	Karpenter + HPA + Argo Rollouts canary + Velero DR
Networking	Route53 latency routing + ACM TLS + ALB + NetworkPolicies

This is the kind of platform that a growing engineering organization of 50–500 developers would build and operate. Each component was chosen for a reason, wired to the others, and tested against real failures.

Thank you for following this series. Source code:

Infrastructure: github.com/MatthewDipo/myapp-infra
GitOps manifests: github.com/MatthewDipo/myapp-gitops
Application: github.com/MatthewDipo/myapp
