Part of the series: Building a Production-Grade DevSecOps Pipeline on AWS
Introduction
Resilience engineering is about building systems that degrade gracefully, recover automatically, and deploy safely. This final part brings together four capabilities:
- Karpenter — automatically provisions the right EC2 instances for pending pods, and removes them when no longer needed
- HPA — scales pod replicas based on CPU/memory pressure
- Argo Rollouts — deploys new versions with controlled canary traffic and automatic rollback on errors
- Velero — backs up Kubernetes resources and PVC data to S3 for disaster recovery
┌─────────────────────────────────────────────────────────────────────┐
│  RESILIENCE LAYERS                                                  │
│                                                                     │
│  Traffic Spike                                                      │
│    └─► HPA: scale pods from 3 → 8                                   │
│    └─► Karpenter: provision new nodes to fit pending pods           │
│                                                                     │
│  New Deployment                                                     │
│    └─► Argo Rollouts: 20% canary → analysis → promote or rollback   │
│                                                                     │
│  Cluster Disaster                                                   │
│    └─► Velero: restore from S3 backup to new cluster                │
└─────────────────────────────────────────────────────────────────────┘
Karpenter — Next-Generation Node Autoscaler
Why Karpenter over Cluster Autoscaler?
| Feature | Cluster Autoscaler | Karpenter |
|---|---|---|
| Node selection | Pre-defined ASG instance types | Picks cheapest EC2 that fits pod requests |
| Spot support | Limited | Native, with fallback ordering |
| Speed | 3–5 minutes | 30–90 seconds |
| Consolidation | Basic | Active: moves pods to fewer nodes, terminates unused |
| Configuration | Per-ASG scaling groups | Declarative NodePool CRD |
Karpenter provisions EC2 instances directly through the EC2 Fleet API, so no Auto Scaling Group is involved in scaling decisions. It picks an instance type sized to the pending pods' resource requests, which keeps you from paying for capacity the workload does not need.
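To make the "cheapest instance that fits" idea concrete, here is a minimal Python sketch. It is not Karpenter's actual scheduler: it sums the pending requests and picks the cheapest entry from a tiny catalog. The `InstanceType` class, the `cheapest_fit` helper, and the hourly prices are all hypothetical; only the vCPU and memory sizes match the public instance specs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpu: float        # vCPUs
    memory_gib: float  # GiB of RAM
    hourly_usd: float  # placeholder price, NOT real AWS pricing


# Hypothetical catalog: sizes follow the public specs, prices are invented.
CATALOG = [
    InstanceType("c5.large", 2, 4, 0.08),
    InstanceType("m5.large", 2, 8, 0.10),
    InstanceType("r5.large", 2, 16, 0.13),
    InstanceType("m5.xlarge", 4, 16, 0.19),
]


def cheapest_fit(pending, catalog=CATALOG):
    """Return the cheapest instance type that can hold every pending pod.

    pending: list of (cpu_cores, memory_gib) resource requests.
    Simplification: ignores DaemonSet overhead and kubelet-reserved capacity.
    """
    need_cpu = sum(cpu for cpu, _ in pending)
    need_mem = sum(mem for _, mem in pending)
    fits = [t for t in catalog if t.vcpu >= need_cpu and t.memory_gib >= need_mem]
    return min(fits, key=lambda t: t.hourly_usd) if fits else None


# Five pods requesting 0.25 CPU / 1 GiB each need 1.25 CPU and 5 GiB in total.
choice = cheapest_fit([(0.25, 1.0)] * 5)
print(choice.name if choice else "no single instance fits")
```

The real controller also weighs Spot availability, zones, and bin-packing across several nodes, so treat this only as an illustration of the selection principle.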
Installation (Production Only)
Karpenter is deployed only on production clusters where cost optimization and fast scaling matter. Dev/staging use fixed 2-node groups.
# infrastructure/karpenter/applicationset.yaml
source:
  # Karpenter charts newer than v0.16 are published as OCI artifacts on ECR Public;
  # the legacy https://charts.karpenter.sh repository does not carry 0.37.0.
  repoURL: public.ecr.aws/karpenter
  chart: karpenter
  targetRevision: "0.37.0"
  helm:
    values: |
      serviceAccount:
        annotations:
          eks.amazonaws.com/role-arn: "{{karpenterRoleArn}}"
      settings:
        clusterName: "{{cluster}}"
        clusterEndpoint: "{{clusterEndpoint}}"
        interruptionQueue: "{{cluster}}-interruption"
      controller:
        resources:
          requests:
            cpu: 1
            memory: 1Gi
Karpenter IAM
# _modules/karpenter/main.tf
resource "aws_iam_role_policy" "karpenter" {
  name = "karpenter-policy"
  role = aws_iam_role.karpenter.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "EC2Management"
        Effect = "Allow"
        Action = [
          "ec2:RunInstances",
          "ec2:TerminateInstances",
          "ec2:DescribeInstances",
          "ec2:DescribeInstanceTypes",
          "ec2:DescribeSubnets",
          "ec2:DescribeSecurityGroups",
          "ec2:DescribeLaunchTemplates",
          "ec2:CreateLaunchTemplate",
          "ec2:CreateFleet",
          "ec2:CreateTags",
          "ec2:DescribeSpotPriceHistory"
        ]
        Resource = "*"
      },
      {
        Sid    = "EKSDescribe"
        Effect = "Allow"
        Action = [
          # REQUIRED: Karpenter needs this to discover the cluster endpoint and CA
          # Without it Karpenter cannot configure new nodes to join the cluster
          "eks:DescribeCluster"
        ]
        Resource = "arn:aws:eks:*:${var.account_id}:cluster/${var.cluster_name}"
      },
      {
        Sid      = "IAMPassRole"
        Effect   = "Allow"
        Action   = ["iam:PassRole"]
        Resource = var.node_role_arn
      },
      {
        Sid    = "SQSInterruption"
        Effect = "Allow"
        Action = [
          "sqs:DeleteMessage",
          "sqs:GetQueueUrl",
          "sqs:ReceiveMessage"
        ]
        Resource = aws_sqs_queue.interruption.arn
      }
    ]
  })
}
Lesson learned:
eks:DescribeCluster is not optional. Without it, Karpenter cannot discover the cluster endpoint and certificate authority, so newly provisioned nodes cannot join the cluster. Pods remain Pending indefinitely. Always include this permission.
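One way to catch a missing permission before it reaches a cluster is a small pre-deploy check on the rendered policy document. The Python sketch below is an assumption about how such a check could look, not part of the series' tooling: `REQUIRED_ACTIONS`, `allowed_actions`, and `missing_actions` are hypothetical names, the required set is only a sample, and wildcard actions such as `ec2:*` are deliberately not expanded.

```python
import json

# Sample of actions this article treats as mandatory (hypothetical check list).
REQUIRED_ACTIONS = {
    "eks:DescribeCluster",
    "ec2:CreateFleet",
    "ec2:RunInstances",
    "iam:PassRole",
}


def allowed_actions(policy: dict) -> set:
    """Collect every action named in an Allow statement (no wildcard expansion)."""
    actions = set()
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        act = stmt.get("Action", [])
        actions.update([act] if isinstance(act, str) else act)
    return actions


def missing_actions(policy: dict, required=REQUIRED_ACTIONS) -> list:
    """Return the required actions that the policy does not allow, sorted."""
    return sorted(set(required) - allowed_actions(policy))


# Example policy that forgot the EKS statement.
example = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow", "Action": ["ec2:RunInstances", "ec2:CreateFleet"], "Resource": "*"},
    {"Effect": "Allow", "Action": "iam:PassRole", "Resource": "arn:aws:iam::123456789012:role/example"}
  ]
}
""")
print(missing_actions(example))
```

Running a check like this in CI against the policy JSON would fail the build before pods ever get stuck in Pending.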
NodePool CRD
# infrastructure/karpenter/nodepools/templates/nodepool.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general
spec:
  template:
    metadata:
      labels:
        provisioner: karpenter
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: general
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: [amd64]
        - key: karpenter.sh/capacity-type
          operator: In
          values: [spot, on-demand]  # Prefer spot, fall back to on-demand
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: [c, m, r]  # Compute-optimized, general-purpose, memory-optimized families
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values: [nano, micro, small]  # Minimum medium instances
  limits:
    cpu: "100"
    memory: 400Gi
  disruption:
    consolidationPolicy: WhenUnderutilized  # Actively consolidate idle nodes
    # consolidateAfter is omitted: the v1beta1 API only accepts it together with
    # consolidationPolicy: WhenEmpty
    expireAfter: 720h  # Rotate nodes every 30 days (security hygiene)
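The `In` and `NotIn` operators in the requirements list behave like simple set filters over node labels. The following Python sketch mimics that filtering for two of the requirements above. It is an illustration, not Karpenter code: `labels_for` and `matches` are hypothetical helpers, and deriving the category from the leading letters of the family name is a simplification.

```python
import re

# Two of the NodePool requirements, expressed as plain dictionaries.
REQUIREMENTS = [
    {"key": "karpenter.k8s.aws/instance-category", "operator": "In",
     "values": ["c", "m", "r"]},
    {"key": "karpenter.k8s.aws/instance-size", "operator": "NotIn",
     "values": ["nano", "micro", "small"]},
]


def labels_for(instance_type: str) -> dict:
    """Derive category and size labels from a name such as 'm5.large'."""
    family, size = instance_type.split(".", 1)
    category = re.match(r"[a-z]+", family).group(0)  # 'm5' -> 'm', 'r6g' -> 'r'
    return {
        "karpenter.k8s.aws/instance-category": category,
        "karpenter.k8s.aws/instance-size": size,
    }


def matches(labels: dict, requirements=REQUIREMENTS) -> bool:
    """True when the labels satisfy every In / NotIn requirement."""
    for req in requirements:
        value = labels.get(req["key"])
        if req["operator"] == "In" and value not in req["values"]:
            return False
        if req["operator"] == "NotIn" and value in req["values"]:
            return False
    return True


for name in ["m5.large", "r6g.xlarge", "t3.medium", "m1.small"]:
    print(name, matches(labels_for(name)))
```

Here `t3.medium` is rejected by the category filter and `m1.small` by the size filter, while the `m5` and `r6g` types pass both.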
EC2NodeClass CRD
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: general
spec:
  amiFamily: AL2  # Amazon Linux 2 (EKS-optimized AMI)
  # Karpenter discovers subnets and security groups by tags
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "{{cluster}}"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "{{cluster}}"
  instanceProfile: "{{cluster}}-karpenter-node"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        encrypted: true
        kmsKeyID: "{{kmsKeyArn}}"
Verifying Karpenter
# Watch for new NodeClaims as pods scale up
kubectl --context prod-use1 get nodeclaims -w
# Check NodePool status
kubectl --context prod-use1 get nodepool general
# See which nodes Karpenter provisioned vs the initial managed node group
kubectl --context prod-use1 get nodes -L karpenter.sh/capacity-type,node.kubernetes.io/instance-type
HPA — Horizontal Pod Autoscaler
HPA watches CPU and memory metrics from the metrics-server and scales pod replicas up or down automatically.
# apps/myapp/templates/hpa.yaml
{{- if .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ include "myapp.fullname" . }}
  namespace: {{ .Release.Namespace }}
spec:
  scaleTargetRef:
    apiVersion: {{ if .Values.rollout.enabled }}argoproj.io/v1alpha1{{ else }}apps/v1{{ end }}
    kind: {{ if .Values.rollout.enabled }}Rollout{{ else }}Deployment{{ end }}
    name: {{ include "myapp.fullname" . }}
  minReplicas: {{ .Values.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage | default 70 }}
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
{{- end }}
Production values:
# values-production.yaml
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 60
Note the conditional scaleTargetRef: In production, HPA targets a Rollout resource (Argo Rollouts). In dev/staging, it targets a standard Deployment. The Helm template handles both cases via .Values.rollout.enabled.
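How HPA arrives at a replica count is worth seeing in numbers. The Kubernetes documentation gives the core rule as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance (0.1 by default) and clamped to the min/max bounds. The Python sketch below reproduces only that rule; `desired_replicas` is a hypothetical helper, and it leaves out stabilization windows, pod readiness handling, and the multi-metric "take the largest" step.

```python
def desired_replicas(current_replicas: int, current_util: int, target_util: int,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Core HPA rule: ceil(current_replicas * current_util / target_util).

    Utilization values are integer percentages of the pods' CPU requests.
    """
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling action
    # Integer ceiling division avoids floating-point rounding surprises.
    desired = -(-current_replicas * current_util // target_util)
    return max(min_replicas, min(max_replicas, desired))


# Production values from this article: min 3, max 10, CPU target 60%.
print(desired_replicas(3, 160, 60, 3, 10))  # spike: average CPU at 160% of requests
print(desired_replicas(8, 20, 60, 3, 10))   # spike over: average CPU at 20%
```

With three pods averaging 160% of their CPU request against a 60% target, the rule asks for eight replicas; once load drops to 20% across eight pods, it falls back to the floor of three.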
HPA + Karpenter Interaction
Traffic spike arrives
│
▼
HPA: CPU > 60% → scale pods from 3 to 8
│
▼
5 new pods: Pending (not enough node capacity)
│
▼
Karpenter: detects Pending pods → evaluates requests
→ finds cheapest EC2 that fits → provisions 2x m5.large
│
▼ ~60 seconds
New nodes join cluster → pods schedule → Running
│
Traffic normalizes
│
HPA: CPU < 60% → scale pods from 8 back to 3
│
5 pods terminate
│
Karpenter: 2 nodes underutilized → consolidate → terminate nodes
Argo Rollouts — Canary Deployments
Argo Rollouts replaces the standard Kubernetes Deployment with a Rollout resource that supports progressive delivery strategies. In production, every deployment goes through a canary phase.
Installation (Production Only)
# infrastructure/argo-rollouts/applicationset.yaml
source:
  repoURL: https://argoproj.github.io/argo-helm
  chart: argo-rollouts
  targetRevision: "2.35.3"
  helm:
    values: |
      installCRDs: true
      dashboard:
        enabled: true
        service:
          type: ClusterIP
Rollout CRD (replaces Deployment in production)
# apps/myapp/templates/deployment.yaml
{{- if .Values.rollout.enabled }}
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: {{ include "myapp.fullname" . }}
  namespace: {{ .Release.Namespace }}
spec:
  # Leave replicas unset when HPA owns the count, so ArgoCD syncs do not reset it
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  selector:
    matchLabels:
      {{- include "myapp.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "myapp.selectorLabels" . | nindent 8 }}
    spec:
      containers:
        - name: myapp
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          ports:
            - containerPort: 8080
              name: http
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
  strategy:
    canary:
      canaryService: {{ include "myapp.fullname" . }}-canary
      stableService: {{ include "myapp.fullname" . }}
      steps:
        - setWeight: 20            # Step 1: 20% of traffic to canary
        - pause: { duration: 5m }  # Step 2: Bake for 5 minutes
        - analysis:                # Step 3: Automated metric check
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: {{ include "myapp.fullname" . }}-canary
        - setWeight: 100           # Step 4: Promote to 100%
{{- else }}
# Standard Deployment for dev/staging
apiVersion: apps/v1
kind: Deployment
...
{{- end }}
AnalysisTemplate — Automated Promotion Gate
# apps/myapp/templates/analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
  namespace: {{ .Release.Namespace }}
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 60s
      count: 5         # Take 5 measurements, one per minute
      failureLimit: 1  # Tolerate one failed measurement; a second one fails the run and aborts the rollout
      provider:
        prometheus:
          address: http://prometheus-operated.monitoring.svc.cluster.local:9090
          query: |
            sum(
              rate(myapp_http_requests_total{
                service="{{`{{args.service-name}}`}}",
                status_code!~"5.."
              }[5m])
            )
            /
            sum(
              rate(myapp_http_requests_total{
                service="{{`{{args.service-name}}`}}"
              }[5m])
            )
      successCondition: result[0] >= 0.99  # 99%+ success rate required
      failureCondition: result[0] < 0.99
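The interaction between count, failureLimit, and the 0.99 threshold is easy to misread, so here is a small Python sketch of the pass/fail bookkeeping. It is a simplified model, not the Argo Rollouts controller: `evaluate_analysis` is a hypothetical function, and it ignores inconclusive results and measurement errors. The key point is that an AnalysisRun fails once the number of failed measurements exceeds failureLimit.

```python
def evaluate_analysis(measurements, threshold=0.99, failure_limit=1):
    """Return 'Failed' once failures exceed failure_limit, else 'Successful'.

    measurements: success-rate values, one per interval, in order.
    """
    failures = 0
    for value in measurements:
        if value < threshold:  # mirrors failureCondition: result[0] < 0.99
            failures += 1
            if failures > failure_limit:
                return "Failed"  # the Rollout would abort and shift back to stable
    return "Successful"


# One dip below 99% is tolerated with failure_limit=1 ...
print(evaluate_analysis([0.997, 0.98, 0.995, 0.999, 0.996]))
# ... but a second dip fails the run.
print(evaluate_analysis([0.997, 0.98, 0.97, 0.999, 0.996]))
# With failure_limit=0, the first dip is enough.
print(evaluate_analysis([0.997, 0.98, 0.995, 0.999, 0.996], failure_limit=0))
```

If the intent is for a single bad measurement to trigger a rollback, failureLimit: 0 is the setting that expresses it.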
Canary Service
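The canary strategy references two Services: stable (the regular Service) and canary. Argo Rollouts manages the selectors of both so that the stable Service selects only stable pods and the canary Service only canary pods. Exact percentage splits also require a trafficRouting provider on the Rollout (for example ALB, NGINX, or a service mesh); without one, setWeight is approximated by the ratio of canary to stable replicas.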
# apps/myapp/templates/service-canary.yaml
{{- if .Values.rollout.enabled }}
apiVersion: v1
kind: Service
metadata:
  name: {{ include "myapp.fullname" . }}-canary
  namespace: {{ .Release.Namespace }}
spec:
  selector:
    {{- include "myapp.selectorLabels" . | nindent 4 }}
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http
{{- end }}
Canary Deployment Walkthrough
1. CI pushes new image: sha-abc123 → myapp-gitops updated
2. ArgoCD detects diff → triggers Rollout update
3. Argo Rollouts creates new ReplicaSet with sha-abc123
   Traffic: 80% → stable (sha-xyz789), 20% → canary (sha-abc123)
4. 5-minute pause
   → Monitor Grafana: error rate on canary service?
   → Logs in CloudWatch: any exceptions?
5. AnalysisRun queries Prometheus (5 measurements, 1 min apart)
   → success_rate = 0.997 (99.7% success) ✓ PASS
6. setWeight: 100% → all traffic to sha-abc123
7. Old ReplicaSet (sha-xyz789) scaled to 0 after stability window
Manual Intervention
# Watch rollout progress
kubectl argo rollouts status myapp -n myapp --watch
# Manually promote past the current pause step
kubectl argo rollouts promote myapp -n myapp
# Fully promote (skip all remaining pause/analysis steps)
kubectl argo rollouts promote myapp -n myapp --full
# Manually abort (rollback to stable immediately)
kubectl argo rollouts abort myapp -n myapp
# Access the Argo Rollouts dashboard
kubectl port-forward svc/argo-rollouts-dashboard -n argo-rollouts 3100:3100
# Open http://localhost:3100
Velero — Backup and Disaster Recovery
Velero backs up Kubernetes resource definitions to S3 and takes EBS snapshots of the persistent volumes. If a cluster is accidentally deleted or corrupted, you can restore everything to a new cluster.
What Velero Backs Up
- All Kubernetes objects (Deployments, Services, ConfigMaps, Secrets, etc.)
- Snapshots of the EBS volumes behind PersistentVolumeClaims (Prometheus data, Grafana dashboards, etc.)
- Namespace structure
Installation
# infrastructure/velero/applicationset.yaml
source:
  repoURL: https://vmware-tanzu.github.io/helm-charts
  chart: velero
  targetRevision: "6.4.0"
  helm:
    values: |
      serviceAccount:
        server:
          annotations:
            eks.amazonaws.com/role-arn: "{{veleroRoleArn}}"
      configuration:
        backupStorageLocation:
          - name: default
            provider: aws
            bucket: "myapp-velero-{{cluster}}"
            config:
              region: "{{region}}"
        volumeSnapshotLocation:
          - name: default
            provider: aws
            config:
              region: "{{region}}"
      initContainers:
        - name: velero-plugin-for-aws
          image: velero/velero-plugin-for-aws:v1.9.0
          volumeMounts:
            - mountPath: /target
              name: plugins
Scheduled Backup
# infrastructure/velero/schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM UTC daily
  template:
    includedNamespaces:
      - myapp
      - monitoring
      - argocd
    excludedResources:
      - events
      - events.events.k8s.io
    snapshotVolumes: true
    ttl: 720h  # 30 days retention
    storageLocation: default
    volumeSnapshotLocations:
      - default
Restore Procedure
# List available backups
velero backup get
# Restore from a specific backup
velero restore create --from-backup daily-backup-20260308020000
# Watch restore progress
velero restore describe <restore-name> --details
# Verify restored resources
kubectl get all -n myapp
kubectl get pvc -n monitoring
Critical reminder: An untested backup is not a backup. Run a restore drill at least quarterly into a temporary cluster. The restore procedure should be documented and rehearsed so it is not being learned for the first time during an actual outage.
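When planning a drill it helps to know how long a given backup will stay restorable. Scheduled backups are named after the schedule plus a timestamp, as in daily-backup-20260308020000, and the 720h ttl sets when each one expires. The Python sketch below derives both dates from a backup name. The helper names are hypothetical, and it assumes the timestamp in the name is UTC.

```python
from datetime import datetime, timedelta, timezone


def backup_created_at(backup_name: str) -> datetime:
    """Parse the trailing YYYYMMDDhhmmss stamp of a scheduled backup name."""
    stamp = backup_name.rsplit("-", 1)[1]
    # Assumption: the timestamp in the name is expressed in UTC.
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def backup_expires_at(backup_name: str, ttl_hours: int = 720) -> datetime:
    """Creation time plus the Schedule's ttl (720h, i.e. 30 days, in this article)."""
    return backup_created_at(backup_name) + timedelta(hours=ttl_hours)


name = "daily-backup-20260308020000"
print("created:", backup_created_at(name).isoformat())
print("expires:", backup_expires_at(name).isoformat())
```

With one backup per day and a 30-day ttl, roughly 30 backups exist at any time, so a quarterly restore drill always has a recent one to work from.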
PodDisruptionBudget
Karpenter's consolidation can evict pods to move them to fewer nodes. Without a PDB, it might evict too many pods at once and cause downtime.
# apps/myapp/templates/pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: {{ include "myapp.fullname" . }}
  namespace: {{ .Release.Namespace }}
spec:
  minAvailable: 2  # Always keep at least 2 pods running during disruption
  selector:
    matchLabels:
      {{- include "myapp.selectorLabels" . | nindent 6 }}
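With minReplicas: 3 and minAvailable: 2, Karpenter can evict only one pod at a time while the app sits at its three-replica floor. The remaining two continue serving traffic while the evicted pod reschedules on a new node. At higher replica counts the same budget permits more simultaneous evictions, so use maxUnavailable: 1 instead if you need one-at-a-time evictions at every scale.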
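The eviction headroom a minAvailable budget leaves is plain subtraction: healthy pods minus minAvailable, never below zero. A tiny Python sketch (with a hypothetical `allowed_disruptions` helper) shows how that number changes as HPA scales the app:

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions a minAvailable PodDisruptionBudget permits right now."""
    return max(0, healthy_pods - min_available)


# minAvailable: 2, evaluated at different replica counts
for healthy in (2, 3, 8):
    print(f"{healthy} healthy pods -> {allowed_disruptions(healthy, 2)} evictions allowed")
```

At the three-replica floor only one eviction is allowed, at eight replicas six are, and if only two pods are healthy the budget blocks voluntary evictions entirely.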
Complete Resilience Picture
NORMAL OPERATION
─────────────────
3 pods (minReplicas) on 2 managed nodes
HPA watching CPU (target: 60%)
Karpenter watching for pending pods
Velero backing up daily at 2 AM UTC
TRAFFIC SPIKE
─────────────────
CPU > 60% → HPA scales to 8 pods
5 pods pending (no capacity) → Karpenter provisions 2x m5.large
All 8 pods running → serving traffic
Spike ends → HPA scales to 3 pods
5 pods terminate → Karpenter consolidates → terminates extra nodes
NEW DEPLOYMENT (production)
─────────────────
ArgoCD syncs new image → Rollout starts
20% canary → 5min bake → AnalysisRun → 100% promote
OR: error rate > 1% → automatic rollback to previous version
DISASTER RECOVERY
─────────────────
Cluster accidentally deleted → restore from Velero backup
velero restore create --from-backup <last-good-backup>
Kubernetes objects restored → EBS snapshots attached → running in ~15 min
Summary
By the end of Part 10 — and the entire series — you have:
- ✅ Karpenter provisioning right-sized EC2 instances on demand, consolidating when idle
- ✅ HPA scaling pods 3→10 based on CPU utilization, targeting a Rollout in production
- ✅ Argo Rollouts deploying every production change as a canary with automated Prometheus-based promotion gates
- ✅ Velero running scheduled daily backups with 30-day retention to S3
- ✅ PodDisruptionBudget preventing Karpenter from evicting too many pods at once
Series Conclusion
You have now built a complete production-grade DevSecOps platform:
| Layer | What You Built |
|---|---|
| Foundation | AWS Organizations, 4 accounts, SSO, SCPs |
| Infrastructure | Terraform modules, Terragrunt DRY configs, 6 VPCs |
| Compute | 6 EKS clusters (k8s 1.29) across 3 environments and 2 regions |
| GitOps | ArgoCD hub-spoke, 35+ ApplicationSets, automated sync |
| CI/CD | GitHub Actions + OIDC + Trivy + Cosign + ECR |
| Secrets | AWS Secrets Manager + ESO + IRSA |
| Security | Kyverno policies + Falco runtime detection + WAF + GuardDuty |
| Observability | Prometheus + Grafana + Fluent Bit + CloudWatch |
| Resilience | Karpenter + HPA + Argo Rollouts canary + Velero DR |
| Networking | Route53 latency routing + ACM TLS + ALB + NetworkPolicies |
This is the platform that a growing engineering team with 50–500 developers would build and operate. Each component was chosen for a reason, wired to the others, and tested against real failures.
Thank you for following this series. Source code:
- Infrastructure: github.com/MatthewDipo/myapp-infra
- GitOps manifests: github.com/MatthewDipo/myapp-gitops
- Application: github.com/MatthewDipo/myapp

