Matthew

Posted on May 28

Production DevSecOps Pipeline — The Complete Day-2 Operations Runbook

#devops #cicd #kubernetes #terraform

DevSecOps Pipeline — Completion Runbook

All code is written and pushed to GitHub. This runbook covers the remaining
operational steps: Terraform applies, GitOps ARN updates, and ArgoCD deployment.

Prerequisites

Install these tools if not already present:

# AWS CLI v2
winget install Amazon.AWSCLI

# Terraform 1.6+
winget install HashiCorp.Terraform

# Terragrunt
# Download from https://github.com/gruntwork-io/terragrunt/releases
# Place in C:\Windows\System32\ or add to PATH

# kubectl
winget install Kubernetes.kubectl

# ArgoCD CLI
winget install argoproj.argocd

AWS Profile Setup

The root terragrunt.hcl uses profiles named myapp-{env}-{region_alias}.
Configure them in ~/.aws/config:

[profile myapp-production-use1]
region = us-east-1
role_arn = arn:aws:iam::591120834781:role/AdministratorAccess
source_profile = default

[profile myapp-production-usw2]
region = us-west-2
role_arn = arn:aws:iam::591120834781:role/AdministratorAccess
source_profile = default

[profile myapp-staging-use1]
region = us-east-1
role_arn = arn:aws:iam::690687753178:role/AdministratorAccess
source_profile = default

[profile myapp-staging-usw2]
region = us-west-2
role_arn = arn:aws:iam::690687753178:role/AdministratorAccess
source_profile = default

[profile myapp-dev-use1]
region = us-east-1
role_arn = arn:aws:iam::557702566877:role/AdministratorAccess
source_profile = default

[profile myapp-dev-usw2]
region = us-west-2
role_arn = arn:aws:iam::557702566877:role/AdministratorAccess
source_profile = default
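Because all six stanzas follow the same myapp-{env}-{region_alias} pattern, they could be generated instead of typed by hand. The sketch below is one way to do that in POSIX shell. The function names (region_alias, profile_name, print_profile) are hypothetical helpers, not part of the repo, and only the two regions used in this runbook are mapped.

```shell
# Map an AWS region to the short alias used in profile names.
# Only the two regions from this runbook are covered.
region_alias() {
  case "$1" in
    us-east-1) echo "use1" ;;
    us-west-2) echo "usw2" ;;
    *) echo "unknown region: $1" >&2; return 1 ;;
  esac
}

# Build a profile name following myapp-{env}-{region_alias}.
profile_name() {
  pn_alias=$(region_alias "$2") || return 1
  echo "myapp-$1-$pn_alias"
}

# Print one ~/.aws/config stanza. Arguments: env, region, account id.
print_profile() {
  pp_name=$(profile_name "$1" "$2") || return 1
  printf '[profile %s]\n' "$pp_name"
  printf 'region = %s\n' "$2"
  printf 'role_arn = arn:aws:iam::%s:role/AdministratorAccess\n' "$3"
  printf 'source_profile = default\n\n'
}
```

For example, print_profile dev us-west-2 557702566877 prints the last stanza shown above.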

PHASE 1 — Terraform Applies

Work from the myapp-infra/ directory. Run in the order shown — capture outputs
for updating GitOps files in Phase 2.

1.1 WAF (production + staging)

# Production us-east-1
terragrunt apply --terragrunt-working-dir live/production/us-east-1/waf
# Output → webacl_arn  (copy this value)

# Production us-west-2
terragrunt apply --terragrunt-working-dir live/production/us-west-2/waf
# Output → webacl_arn  (copy this value)

# Staging (no GitOps ARN needed, but good to have)
terragrunt apply --terragrunt-working-dir live/staging/us-east-1/waf
terragrunt apply --terragrunt-working-dir live/staging/us-west-2/waf

1.2 GuardDuty (all regions — no outputs needed)

terragrunt apply --terragrunt-working-dir live/production/us-east-1/guardduty
terragrunt apply --terragrunt-working-dir live/production/us-west-2/guardduty
terragrunt apply --terragrunt-working-dir live/staging/us-east-1/guardduty
terragrunt apply --terragrunt-working-dir live/staging/us-west-2/guardduty

GuardDuty has no GitOps dependency. Findings appear in the GuardDuty console and,
optionally, in CloudWatch.

1.3 ESO IRSA for Staging

# Staging us-east-1
terragrunt apply --terragrunt-working-dir live/staging/us-east-1/eso-irsa
# Output → role_arn  (copy → used in environments/staging/applicationset.yaml)

# Staging us-west-2
terragrunt apply --terragrunt-working-dir live/staging/us-west-2/eso-irsa
# Output → role_arn  (copy → used in environments/staging/applicationset.yaml)

NOTE: The ESO operator ApplicationSet (infrastructure/eso/applicationset.yaml)
already includes staging clusters. Once ESO is running on staging and the
ExternalSecret IRSA role is set, ExternalSecrets will sync automatically.

1.4 Fluent Bit IRSA (all 6 clusters)

terragrunt apply --terragrunt-working-dir live/production/us-east-1/fluent-bit-irsa
# → role_arn for myapp-production-use1

terragrunt apply --terragrunt-working-dir live/production/us-west-2/fluent-bit-irsa
# → role_arn for myapp-production-usw2

terragrunt apply --terragrunt-working-dir live/staging/us-east-1/fluent-bit-irsa
# → role_arn for myapp-staging-use1

terragrunt apply --terragrunt-working-dir live/staging/us-west-2/fluent-bit-irsa
# → role_arn for myapp-staging-usw2

terragrunt apply --terragrunt-working-dir live/dev/us-east-1/fluent-bit-irsa
# → role_arn for myapp-dev-use1

terragrunt apply --terragrunt-working-dir live/dev/us-west-2/fluent-bit-irsa
# → role_arn for myapp-dev-usw2
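The six applies above differ only in environment and region, so they could be driven by a loop. This is a minimal sketch, assuming it is run from myapp-infra/ and that every environment has both regions. apply_module and the DRY_RUN variable are hypothetical names introduced here. By default it only prints the commands, so the order can be reviewed before anything is applied.

```shell
# Print (default) or run `terragrunt apply` for one module in both regions
# of each environment given. Set DRY_RUN=0 to actually execute.
apply_module() {
  am_module="$1"; shift
  for am_env in "$@"; do
    for am_region in us-east-1 us-west-2; do
      am_dir="live/$am_env/$am_region/$am_module"
      if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "terragrunt apply --terragrunt-working-dir $am_dir"
      else
        # Stop at the first failed apply so later steps do not run blind.
        terragrunt apply --terragrunt-working-dir "$am_dir" || return 1
      fi
    done
  done
}
```

For example, apply_module fluent-bit-irsa production staging dev prints the same six commands in the same order as listed above.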

1.5 Karpenter (production only)

terragrunt apply --terragrunt-working-dir live/production/us-east-1/karpenter
# Outputs:
#   controller_role_arn   → for karpenter applicationset.yaml
#   node_role_arn         → for verification (name = myapp-production-use1-karpenter-node)
#   node_instance_profile → for verification
#   interruption_queue_name → should be "myapp-production-use1-karpenter"

terragrunt apply --terragrunt-working-dir live/production/us-west-2/karpenter
# Outputs same structure for usw2

The nodeRoleName values in karpenter/nodepool-applicationset.yaml are
pre-set to myapp-production-use1-karpenter-node and myapp-production-usw2-karpenter-node.
These match what Terraform creates, so no update is needed there.

1.6 Velero (all 6 clusters)

# Production
terragrunt apply --terragrunt-working-dir live/production/us-east-1/velero
# → role_arn for myapp-production-use1

terragrunt apply --terragrunt-working-dir live/production/us-west-2/velero
# → role_arn for myapp-production-usw2

# Staging
terragrunt apply --terragrunt-working-dir live/staging/us-east-1/velero
# → role_arn for myapp-staging-use1

terragrunt apply --terragrunt-working-dir live/staging/us-west-2/velero
# → role_arn for myapp-staging-usw2

# Dev
terragrunt apply --terragrunt-working-dir live/dev/us-east-1/velero
# → role_arn for myapp-dev-use1

terragrunt apply --terragrunt-working-dir live/dev/us-west-2/velero
# → role_arn for myapp-dev-usw2
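Copying ARNs by hand from twelve or more applies is error-prone, so the outputs could be collected in one pass once the applies have finished. The sketch below assumes it runs from myapp-infra/, that each module exposes an output named role_arn (as the comments above suggest), and that the -raw flag is passed through to terraform output (available since Terraform 0.15). collect_role_arns is a hypothetical helper name.

```shell
# Print "<cluster> <role_arn>" for one module in each given environment.
collect_role_arns() {
  cra_module="$1"; shift
  for cra_env in "$@"; do
    for cra_pair in us-east-1:use1 us-west-2:usw2; do
      cra_region=${cra_pair%%:*}   # text before the colon
      cra_alias=${cra_pair##*:}    # text after the colon
      cra_arn=$(terragrunt output -raw role_arn \
        --terragrunt-working-dir "live/$cra_env/$cra_region/$cra_module") || return 1
      echo "myapp-$cra_env-$cra_alias $cra_arn"
    done
  done
}
```

For example, collect_role_arns velero production staging dev would print one line per cluster, ready to paste into Phase 2.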

PHASE 2 — Update GitOps ARNs

After collecting all outputs from Phase 1, update the GitOps repo
(myapp-gitops/) and push.

2.1 Production WAF ARNs

Edit environments/production/applicationset.yaml — replace "PENDING" with
real WAF ACL ARNs from Step 1.1:

elements:
  - cluster: myapp-production-use1
    ...
    wafAclArn: "arn:aws:wafv2:us-east-1:591120834781:regional/webacl/myapp-production-use1-waf/XXXXXXXX"
  - cluster: myapp-production-usw2
    ...
    wafAclArn: "arn:aws:wafv2:us-west-2:591120834781:regional/webacl/myapp-production-usw2-waf/XXXXXXXX"
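A pasted value can be sanity-checked before it goes into the file. The following is a rough sketch of such a check, assuming the regional WAFv2 ARN layout shown above (region, 12-digit account, name, then an ID made of hex digits and dashes). is_wafv2_regional_arn is a hypothetical helper, and the pattern is deliberately loose.

```shell
# Return 0 if the argument looks like a regional WAFv2 web ACL ARN.
# Rejects leftovers such as "PENDING" or an unreplaced XXXXXXXX id.
is_wafv2_regional_arn() {
  printf '%s\n' "$1" | grep -Eq \
    '^arn:aws:wafv2:[a-z0-9-]+:[0-9]{12}:regional/webacl/[A-Za-z0-9_-]+/[0-9a-f-]+$'
}
```

It only checks the shape of the string, not that the web ACL exists. The webacl_arn output from Step 1.1 remains the source of truth.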

2.2 Staging ESO IRSA ARNs

Edit environments/staging/applicationset.yaml — replace "PENDING" with
role ARNs from Step 1.3:

elements:
  - cluster: myapp-staging-use1
    ...
    irsaRoleArn: "arn:aws:iam::690687753178:role/myapp-staging-use1-eso"
  - cluster: myapp-staging-usw2
    ...
    irsaRoleArn: "arn:aws:iam::690687753178:role/myapp-staging-usw2-eso"

2.3 Fluent Bit IRSA ARNs

Edit infrastructure/logging/applicationset.yaml — replace all 6 "PENDING" values:

elements:
  - cluster: myapp-production-use1
    roleArn: "arn:aws:iam::591120834781:role/myapp-production-use1-fluent-bit"
  - cluster: myapp-production-usw2
    roleArn: "arn:aws:iam::591120834781:role/myapp-production-usw2-fluent-bit"
  - cluster: myapp-staging-use1
    roleArn: "arn:aws:iam::690687753178:role/myapp-staging-use1-fluent-bit"
  - cluster: myapp-staging-usw2
    roleArn: "arn:aws:iam::690687753178:role/myapp-staging-usw2-fluent-bit"
  - cluster: myapp-dev-use1
    roleArn: "arn:aws:iam::557702566877:role/myapp-dev-use1-fluent-bit"
  - cluster: myapp-dev-usw2
    roleArn: "arn:aws:iam::557702566877:role/myapp-dev-usw2-fluent-bit"

TIP: Role names follow the pattern {cluster_name}-fluent-bit. Verify with
terragrunt output role_arn in each fluent-bit-irsa directory.
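Since the role names follow a fixed pattern, the expected ARN can be derived from the cluster name and compared with what Terraform reports. This is a small sketch under that assumption. account_for_env and expected_role_arn are hypothetical helpers, and the account IDs are the ones from the AWS profile setup.

```shell
# Account IDs per environment, as listed in the AWS profile setup.
account_for_env() {
  case "$1" in
    production) echo "591120834781" ;;
    staging)    echo "690687753178" ;;
    dev)        echo "557702566877" ;;
    *) echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Derive the expected IRSA role ARN.
# Arguments: cluster name (e.g. myapp-staging-usw2), component suffix.
expected_role_arn() {
  era_env=${1#myapp-}      # strip the "myapp-" prefix
  era_env=${era_env%-*}    # strip the "-use1" / "-usw2" suffix
  era_account=$(account_for_env "$era_env") || return 1
  echo "arn:aws:iam::$era_account:role/$1-$2"
}
```

If expected_role_arn myapp-dev-use1 fluent-bit differs from the terragrunt output for that directory, trust the Terraform output and treat the mismatch as a naming drift worth investigating.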

2.4 Karpenter Controller Role ARNs

Edit infrastructure/karpenter/applicationset.yaml — replace 2 "PENDING" values:

elements:
  - cluster: myapp-production-use1
    controllerRole: "arn:aws:iam::591120834781:role/myapp-production-use1-karpenter"
  - cluster: myapp-production-usw2
    controllerRole: "arn:aws:iam::591120834781:role/myapp-production-usw2-karpenter"

2.5 Velero Role ARNs

Edit infrastructure/velero/applicationset.yaml — replace all 6 "PENDING" values:

elements:
  - cluster: myapp-production-use1
    roleArn: "arn:aws:iam::591120834781:role/myapp-production-use1-velero"
  - cluster: myapp-production-usw2
    roleArn: "arn:aws:iam::591120834781:role/myapp-production-usw2-velero"
  - cluster: myapp-staging-use1
    roleArn: "arn:aws:iam::690687753178:role/myapp-staging-use1-velero"
  - cluster: myapp-staging-usw2
    roleArn: "arn:aws:iam::690687753178:role/myapp-staging-usw2-velero"
  - cluster: myapp-dev-use1
    roleArn: "arn:aws:iam::557702566877:role/myapp-dev-use1-velero"
  - cluster: myapp-dev-usw2
    roleArn: "arn:aws:iam::557702566877:role/myapp-dev-usw2-velero"

2.6 Slack Webhooks + Grafana Password

Edit infrastructure/monitoring/prometheus-values.yaml:

Replace both occurrences of https://hooks.slack.com/services/CHANGE_ME with real Slack incoming webhook URLs.
Replace change-me-grafana with a real password, or source it from an ExternalSecret so the value never lands in Git.

2.7 Commit + Push GitOps changes

cd myapp-gitops
git add environments/ infrastructure/
git commit -m "chore: fill in real ARNs from terraform outputs"
git push origin HEAD:main
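A leftover placeholder is easy to miss across this many files, so a quick scan before the git commit can help. The sketch below searches for the three placeholder strings named in Phase 2 ("PENDING", CHANGE_ME, change-me-grafana). no_placeholders_left is a hypothetical helper, and grep -r is assumed to be available, as it is on most systems.

```shell
# Return 0 when none of the Phase 2 placeholders remain under the given paths.
# Matching lines are printed with file name and line number.
no_placeholders_left() {
  if grep -rnE '"PENDING"|CHANGE_ME|change-me-grafana' "$@"; then
    echo "placeholders remain; replace them before committing" >&2
    return 1
  fi
}
```

Run from myapp-gitops as no_placeholders_left environments/ infrastructure/ and commit only when it exits with status 0.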

2.8 Create staging Secrets Manager secret

Run this once to seed the staging ExternalSecret:

AWS_PROFILE=myapp-staging-use1 aws secretsmanager create-secret \
  --name staging/myapp/db-password \
  --secret-string '{"password":"change-me-staging"}' \
  --region us-east-1

AWS_PROFILE=myapp-staging-usw2 aws secretsmanager create-secret \
  --name staging/myapp/db-password \
  --secret-string '{"password":"change-me-staging"}' \
  --region us-west-2

PHASE 3 — ArgoCD Setup

3.1 Bootstrap ArgoCD (App of Apps)

The argocd/ directory in myapp-gitops now contains the AppProject and a
bootstrap Application. Apply the bootstrap once — after that ArgoCD manages
itself and will also pick up the AppProject automatically.

# Point kubectl at production cluster (where ArgoCD runs)
kubectl config use-context myapp-production-use1

cd myapp-gitops

# One-time bootstrap — creates the self-managing Application
kubectl apply -f argocd/bootstrap.yaml -n argocd

# ArgoCD will now sync argocd/project-production.yaml automatically.
# Watch until it's healthy:
argocd app wait bootstrap --health

The argocd/project-production.yaml AppProject already includes every
namespace and source repo needed by all components. No kubectl patch needed.

3.2 Apply new ApplicationSets to ArgoCD

After the bootstrap Application syncs (it only manages the argocd/ directory),
apply the infrastructure ApplicationSets manually once:

cd myapp-gitops

kubectl apply -f infrastructure/eso/applicationset.yaml
kubectl apply -f infrastructure/monitoring/applicationset.yaml
kubectl apply -f infrastructure/monitoring/alert-rules-applicationset.yaml
kubectl apply -f infrastructure/logging/applicationset.yaml
kubectl apply -f infrastructure/karpenter/applicationset.yaml
kubectl apply -f infrastructure/karpenter/nodepool-applicationset.yaml
kubectl apply -f infrastructure/velero/applicationset.yaml
kubectl apply -f infrastructure/falco/applicationset.yaml
kubectl apply -f infrastructure/argo-rollouts/applicationset.yaml

After this, the ApplicationSet controller generates one Application per cluster,
and automated sync on those generated Applications keeps them reconciled with Git.

PHASE 4 — ArgoCD Sync Order (Production)

Sync in this exact order to respect CRD dependencies:

# Step 1: Prometheus stack (creates CRDs for PrometheusRule, ServiceMonitor, etc.)
argocd app sync prometheus-myapp-production-use1 prometheus-myapp-production-usw2
argocd app wait prometheus-myapp-production-use1 --health
argocd app wait prometheus-myapp-production-usw2 --health

# Step 2: Alert rules (needs Prometheus CRDs)
argocd app sync alert-rules-myapp-production-use1 alert-rules-myapp-production-usw2

# Step 3: Parallel infra components (no inter-dependency)
argocd app sync \
  fluent-bit-myapp-production-use1 fluent-bit-myapp-production-usw2 \
  velero-myapp-production-use1 velero-myapp-production-usw2 \
  falco-myapp-production-use1 falco-myapp-production-usw2

# Step 4: Karpenter controller (needs ECR access to pull image from public.ecr.aws)
argocd app sync karpenter-myapp-production-use1 karpenter-myapp-production-usw2
argocd app wait karpenter-myapp-production-use1 --health
argocd app wait karpenter-myapp-production-usw2 --health

# Step 5: Karpenter NodePools (needs Karpenter CRDs installed by Step 4)
argocd app sync karpenter-nodepool-myapp-production-use1 karpenter-nodepool-myapp-production-usw2

# Step 6: Argo Rollouts controller
argocd app sync argo-rollouts-myapp-production-use1 argo-rollouts-myapp-production-usw2
argocd app wait argo-rollouts-myapp-production-use1 --health
argocd app wait argo-rollouts-myapp-production-usw2 --health

# Step 7: App (uses Rollout CR — needs argo-rollouts controller running)
argocd app sync myapp-production-myapp-production-use1 myapp-production-myapp-production-usw2

Staging sync (can run in parallel with production steps 3+)

argocd app sync \
  eso-myapp-staging-use1 eso-myapp-staging-usw2 \
  fluent-bit-myapp-staging-use1 fluent-bit-myapp-staging-usw2 \
  velero-myapp-staging-use1 velero-myapp-staging-usw2 \
  falco-myapp-staging-use1 falco-myapp-staging-usw2 \
  prometheus-myapp-staging-use1 prometheus-myapp-staging-usw2

# Wait for staging ESO to become healthy; ExternalSecrets then sync automatically
argocd app wait eso-myapp-staging-use1 eso-myapp-staging-usw2 --health
argocd app sync myapp-staging-myapp-staging-use1 myapp-staging-myapp-staging-usw2
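The production steps repeat one shape: sync a component on both clusters, then wait until both are healthy before moving on. A small wrapper can capture that. This is a sketch, assuming the Application names follow the {component}-{cluster} pattern used above. sync_component is a hypothetical helper name.

```shell
# Sync one component on both production clusters, then block until both
# report healthy. Argument: the name prefix, e.g. "karpenter".
sync_component() {
  sc_apps="$1-myapp-production-use1 $1-myapp-production-usw2"
  # $sc_apps is intentionally unquoted so it splits into two app names.
  argocd app sync $sc_apps || return 1
  argocd app wait $sc_apps --health
}

# Example (not executed here): walk the dependency order and stop on failure.
#   for c in prometheus alert-rules karpenter karpenter-nodepool \
#            argo-rollouts myapp-production; do
#     sync_component "$c" || break
#   done
```

Waiting on both clusters at every step is slower than the listing above, but it avoids syncing a dependent component on a cluster whose CRDs are not installed yet.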

PHASE 5 — Verification

Monitoring

kubectl get pods -n monitoring --context myapp-production-use1
kubectl get prometheusrule -n monitoring --context myapp-production-use1
kubectl get alertmanager -n monitoring --context myapp-production-use1
# Access Grafana: kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring

Logging

kubectl get pods -n logging --context myapp-production-use1
# Verify log groups were created:
AWS_PROFILE=myapp-production-use1 aws logs describe-log-groups \
  --log-group-name-prefix /eks/myapp-production-use1 --region us-east-1

Karpenter

kubectl get pods -n karpenter --context myapp-production-use1
kubectl get nodepool --context myapp-production-use1
kubectl get ec2nodeclass --context myapp-production-use1
# Trigger a scale test (assumes a test Deployment named stress already exists in default):
kubectl scale deploy/stress --replicas=50 -n default --context myapp-production-use1
kubectl get nodes -w --context myapp-production-use1

Velero

kubectl get pods -n velero --context myapp-production-use1
kubectl get schedule -n velero --context myapp-production-use1
# Trigger manual backup:
velero backup create manual-test --kubecontext myapp-production-use1
velero backup describe manual-test --kubecontext myapp-production-use1

Falco

kubectl get pods -n falco --context myapp-production-use1
# Check CloudWatch for events:
AWS_PROFILE=myapp-production-use1 aws logs describe-log-groups \
  --log-group-name-prefix /falco --region us-east-1

Argo Rollouts (canary deploy)

kubectl get rollout -n production --context myapp-production-use1
kubectl argo rollouts get rollout myapp-production-use1-myapp -n production \
  --context myapp-production-use1 --watch

ESO Staging

kubectl get externalsecret -n staging --context myapp-staging-use1
kubectl describe externalsecret myapp-staging-use1-myapp-secrets -n staging \
  --context myapp-staging-use1

WAF

AWS_PROFILE=myapp-production-use1 aws wafv2 list-web-acls \
  --scope REGIONAL --region us-east-1 | grep myapp

GuardDuty

AWS_PROFILE=myapp-production-use1 aws guardduty list-detectors --region us-east-1
AWS_PROFILE=myapp-production-usw2 aws guardduty list-detectors --region us-west-2
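The kubectl get pods checks above still need a human to read the STATUS column. A filter like the one below could surface only the pods that need attention. It is a minimal sketch that relies on the default kubectl get pods column order (NAME, READY, STATUS, ...). unhealthy_pods is a hypothetical helper name.

```shell
# List pods in a namespace whose STATUS is neither Running nor Completed.
# Arguments: namespace, kubectl context.
unhealthy_pods() {
  kubectl get pods -n "$1" --context "$2" --no-headers \
    | awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

For example, unhealthy_pods monitoring myapp-production-use1 prints nothing when every pod is Running or Completed. It does not catch pods that are Running but not ready, so the READY column is still worth a glance.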

Troubleshooting Notes

Issue	Fix
Karpenter fails to pull image	Ensure the node IAM role has ECR pull-through cache configured or use `public.ecr.aws` directly. Karpenter controller image is on `public.ecr.aws/karpenter/karpenter`.
Falco `modern_ebpf` not supported	The modern eBPF probe needs a recent kernel with BTF support, which some EKS AMIs/kernel versions lack. Fall back to `driver.kind: ebpf` or `driver.kind: module` in `infrastructure/falco/values.yaml`.
Velero backup fails	Ensure S3 bucket lifecycle rule and encryption config applied. Check IRSA trust policy `sub` matches `system:serviceaccount:velero:velero`.
Alert rules not picked up	The PrometheusRule must have label `release: kube-prometheus-stack` (already set in `alert-rules.yaml`). Verify with `kubectl get prometheusrule -n monitoring -o yaml`.
Rollout stuck at 20%	Check AnalysisTemplate — if `myapp_http_requests_total` metric doesn't exist yet (app not instrumented), the analysis will fail. Set `failureLimit: 3` or temporarily disable analysis by removing the `analysis` step from the canary steps.
Karpenter NodePool not scheduling	Verify subnet and SG tags: `aws ec2 describe-subnets --filters "Name=tag:karpenter.sh/discovery,Values=myapp-production-use1"`.

Day-2 Operations Runbook

For: Anyone operating this pipeline after initial setup is complete
Live system: https://api.matthewoladipupo.dev/health

Quick Reference

URLs and Credentials

Service	URL	Notes
Application	`https://api.matthewoladipupo.dev/health`	Public
ArgoCD UI	`http://a0c3c1ea43b294c4d8f5c2a7c514f6f2-1678928976.us-east-1.elb.amazonaws.com`	admin / see Secrets Manager
Grafana	`https://grafana.matthewoladipupo.dev`	admin / see Secrets Manager
AWS SSO Portal	`https://d-9a6757fb3c.awsapps.com/start`	IAM Identity Center

Cluster → Profile Map

Cluster	kubectl context	AWS Profile	Endpoint
`myapp-production-use1`	`myapp-production-use1`	`myapp-prod-use1`	private
`myapp-production-usw2`	`myapp-production-usw2`	`myapp-prod-usw2`	private
`myapp-staging-use1`	`myapp-staging-use1`	`myapp-staging-use1`	private
`myapp-staging-usw2`	`myapp-staging-usw2`	`myapp-staging-usw2`	private
`myapp-dev-use1`	`myapp-dev-use1`	`myapp-dev-use1`	public
`myapp-dev-usw2`	`myapp-dev-usw2`	`myapp-dev-usw2`	public
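The map above can also be encoded as lookups, which avoids the most common mistake of passing the cluster name where the shorter production profile name is expected. This is a sketch of that table in shell. The three function names are hypothetical.

```shell
# AWS profile for a cluster, following the Cluster → Profile Map.
profile_for_cluster() {
  case "$1" in
    myapp-production-use1) echo "myapp-prod-use1" ;;
    myapp-production-usw2) echo "myapp-prod-usw2" ;;
    myapp-staging-use1|myapp-staging-usw2|myapp-dev-use1|myapp-dev-usw2)
      echo "$1" ;;
    *) echo "unknown cluster: $1" >&2; return 1 ;;
  esac
}

# AWS region for a cluster, derived from its suffix.
region_for_cluster() {
  case "$1" in
    *-use1) echo "us-east-1" ;;
    *-usw2) echo "us-west-2" ;;
    *) echo "unknown cluster: $1" >&2; return 1 ;;
  esac
}

# Return 0 for clusters whose API endpoint is private (production, staging).
is_private_cluster() {
  case "$1" in
    myapp-production-*|myapp-staging-*) return 0 ;;
    *) return 1 ;;
  esac
}
```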

OPS-1: Start of Session

Run at the start of each working session. The SSO token is cached on disk and shared across terminals; it lasts 8 hours.

# 1. Authenticate
aws sso login --sso-session admin --no-browser
# → prints a sign-in URL and code (--no-browser suppresses auto-open) → open the URL, click Allow → wait for "Successfully logged in"

# 2. Verify
aws sts get-caller-identity --profile myapp-prod-use1
# → should return Account: 591120834781

# 3. Quick health check
curl https://api.matthewoladipupo.dev/health
# → {"status":"healthy","region":"us-east-1"}

If you see Token has expired and refresh failed at any point, re-run step 1.
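Steps 1 and 2 can be folded into a guard that fails loudly when the session is missing or the profile resolves to the wrong account. The following is a minimal sketch. check_account is a hypothetical helper, and the expected account ID is passed in rather than hard-coded.

```shell
# Verify that a profile has a valid session and resolves to the expected account.
# Arguments: profile name, expected 12-digit account id.
check_account() {
  ca_actual=$(aws sts get-caller-identity --profile "$1" \
    --query Account --output text) || {
    echo "no valid session for $1; run: aws sso login --sso-session admin --no-browser" >&2
    return 1
  }
  if [ "$ca_actual" != "$2" ]; then
    echo "profile $1 resolved to account $ca_actual, expected $2" >&2
    return 1
  fi
  echo "ok: $1 -> $2"
}
```

For example, check_account myapp-prod-use1 591120834781 at the top of a script stops it before any command runs against the wrong account.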

OPS-2: Accessing Private Clusters (Production + Staging)

Production and staging clusters have endpointPublicAccess: false. You must temporarily enable public access, do your work, then lock it back. Never leave production with a public endpoint.

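# Enable (wait ~3 minutes after running this)
# Note: 0.0.0.0/0 exposes the API endpoint to the whole internet while it is open;
# where possible, set publicAccessCidrs to your own public IP as a /32 instead.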
aws eks update-cluster-config \
  --name myapp-production-use1 \
  --region us-east-1 \
  --profile myapp-prod-use1 \
  --resources-vpc-config endpointPublicAccess=true,endpointPrivateAccess=true,publicAccessCidrs=0.0.0.0/0

# Confirm it's ready
aws eks describe-cluster \
  --name myapp-production-use1 \
  --region us-east-1 \
  --profile myapp-prod-use1 \
  --query 'cluster.resourcesVpcConfig.endpointPublicAccess'
# Must return: true

# --- Do your kubectl work here ---

# Lock back immediately after
aws eks update-cluster-config \
  --name myapp-production-use1 \
  --region us-east-1 \
  --profile myapp-prod-use1 \
  --resources-vpc-config endpointPublicAccess=false,endpointPrivateAccess=true

Replace myapp-production-use1 / myapp-prod-use1 / us-east-1 with the appropriate values for other private clusters.
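The risk with the manual sequence is forgetting the final lock step, or never reaching it because a command in between failed. One way to reduce that risk is a wrapper that re-locks the endpoint whenever it exits. This is a sketch, not a hardened tool. with_public_endpoint and the PUBLIC_CIDR variable are hypothetical names, and it assumes aws eks wait cluster-active is a good enough signal that the config update has finished.

```shell
# Usage: with_public_endpoint <cluster> <region> <profile> <command> [args...]
# The parentheses make the function body a subshell, so the EXIT trap below
# fires when the body finishes, whether the wrapped command succeeded or not.
with_public_endpoint() (
  cluster="$1"; region="$2"; profile="$3"; shift 3

  lock_endpoint() {
    aws eks update-cluster-config --name "$cluster" --region "$region" \
      --profile "$profile" \
      --resources-vpc-config endpointPublicAccess=false,endpointPrivateAccess=true
  }
  trap lock_endpoint EXIT

  # PUBLIC_CIDR defaults to the value used in this runbook; a /32 is safer.
  aws eks update-cluster-config --name "$cluster" --region "$region" \
    --profile "$profile" \
    --resources-vpc-config "endpointPublicAccess=true,endpointPrivateAccess=true,publicAccessCidrs=${PUBLIC_CIDR:-0.0.0.0/0}" \
    || exit 1

  # Block until the cluster reports ACTIVE again instead of sleeping ~3 minutes.
  aws eks wait cluster-active --name "$cluster" --region "$region" \
    --profile "$profile" || exit 1

  "$@"
)
```

For example: with_public_endpoint myapp-production-use1 us-east-1 myapp-prod-use1 kubectl get nodes --context myapp-production-use1. The lock request is sent even when the wrapped command fails, but it is still worth confirming afterwards with the describe-cluster query above.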

OPS-3: Standard Deployment

Normal deployments are fully automated — zero manual steps required:

1. Developer pushes to the main branch of MatthewDipo/myapp
2. GitHub Actions: test → Trivy scan → build → push ECR → Cosign sign → update gitops values
3. ArgoCD detects the gitops change within 3 minutes → syncs
4. In production: Argo Rollout starts a canary (20% traffic for 5 minutes → analysis → 100%)

Monitor progress:

# Watch rollout steps (enable public endpoint first)
kubectl get rollouts -n production --context myapp-production-use1 -w

# Detailed view (requires kubectl-argo-rollouts plugin)
kubectl argo rollouts get rollout myapp -n production \
  --context myapp-production-use1 --watch

OPS-4: Promote or Abort a Canary

Promote immediately (skip the 5-minute pause):

kubectl argo rollouts promote myapp -n production --context myapp-production-use1

Abort (shift all traffic back to stable version instantly):

kubectl argo rollouts abort myapp -n production --context myapp-production-use1

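# After abort the rollout shows Degraded. Retry re-runs the canary for the same new
# version; to stay on the stable version instead, revert the change in GitOps (OPS-5).
kubectl argo rollouts retry rollout myapp -n production --context myapp-production-use1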

OPS-5: Roll Back a Deployment

Option A — GitOps revert (preferred, keeps git history clean):

cd /path/to/myapp-gitops
git revert HEAD --no-edit
git push origin main
# ArgoCD auto-syncs the revert within 3 minutes

Option B — ArgoCD rollback to a previous revision:

argocd app history myapp-production-myapp-production-use1
# Rollback is rejected while automated sync is enabled; disable auto-sync on the app first
argocd app rollback myapp-production-myapp-production-use1 <revision-number>

Option C — Emergency direct image update (bypasses GitOps, use only in outage):

# Get previous image tag from ECR
aws ecr describe-images \
  --repository-name myapp \
  --region us-east-1 \
  --profile myapp-prod-use1 \
  --query 'sort_by(imageDetails,&imagePushedAt)[-2].imageTags' \
  --output json

# Force update the rollout
kubectl argo rollouts set image myapp \
  myapp=206617159586.dkr.ecr.us-east-1.amazonaws.com/myapp:sha-<previous-sha> \
  -n production --context myapp-production-use1

# After incident: update values-production.yaml in gitops to match, then push

OPS-6: Rotate a Secret

The External Secrets Operator (ESO) syncs from AWS Secrets Manager on a 1-hour cycle.

Step 1 — Update value in Secrets Manager:

aws secretsmanager put-secret-value \
  --secret-id production/myapp/db-password \
  --secret-string '{"password":"new-value-here"}' \
  --region us-east-1 \
  --profile myapp-prod-use1

Step 2 — Force immediate ESO refresh:

# Enable public endpoint first (OPS-2)
kubectl annotate externalsecret myapp-db-password \
  force-sync=$(date +%s) -n production \
  --context myapp-production-use1 --overwrite

# Verify
kubectl get externalsecret myapp-db-password -n production \
  --context myapp-production-use1
# STATUS: SecretSynced

Step 3 — Restart pods to pick up new value:

kubectl argo rollouts restart myapp -n production --context myapp-production-use1

OPS-7: Incident Response

Falco Security Alert

# 1. Identify what triggered the alert
# (daemonset/falco returns logs from a single pod; repeat per Falco pod to cover every node)
kubectl logs -n falco daemonset/falco --context myapp-production-use1 \
  | grep -E "Warning|Critical" | tail -20

# 2. Inspect the affected pod
kubectl describe pod <pod-name> -n production --context myapp-production-use1
kubectl logs <pod-name> -n production --context myapp-production-use1 --tail=100

# 3. Preserve logs first: they are lost once the pod is deleted
kubectl logs <pod-name> -n production --context myapp-production-use1 \
  > /tmp/incident-$(date +%Y%m%d-%H%M%S).log

# 4. Contain if confirmed malicious: delete the pod (the Rollout replaces it with a clean copy)
kubectl delete pod <pod-name> -n production --context myapp-production-use1

# 5. Check GuardDuty for correlated findings
aws guardduty list-findings \
  --detector-id $(aws guardduty list-detectors \
    --region us-east-1 --profile myapp-prod-use1 \
    --query 'DetectorIds[0]' --output text) \
  --region us-east-1 --profile myapp-prod-use1
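During an incident it is easy to delete a pod before its evidence has been saved. A helper that captures everything in one call could make the safe order the default. This is a sketch. capture_pod_evidence is a hypothetical name, and the output location (/tmp unless a directory is given) is an assumption.

```shell
# Save a pod's description and logs to a timestamped directory and print
# the directory path. Usage:
#   capture_pod_evidence <pod> <namespace> <context> [output-dir]
capture_pod_evidence() {
  cpe_out="${4:-/tmp}/incident-$1-$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$cpe_out" || return 1
  kubectl describe pod "$1" -n "$2" --context "$3" \
    > "$cpe_out/describe.txt" || return 1
  kubectl logs "$1" -n "$2" --context "$3" --all-containers \
    > "$cpe_out/logs.txt" || return 1
  # Logs from a previous container instance only exist after a restart,
  # so a failure here is not treated as fatal.
  kubectl logs "$1" -n "$2" --context "$3" --all-containers --previous \
    > "$cpe_out/logs-previous.txt" 2>/dev/null || true
  echo "$cpe_out"
}
```

Chaining it with the delete keeps the order safe: capture_pod_evidence <pod-name> production myapp-production-use1 && kubectl delete pod <pod-name> -n production --context myapp-production-use1.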

Kyverno Blocked a Pod

# See the rejection reason
kubectl describe pod <pod-name> -n <namespace> --context myapp-production-use1
# Look for: "admission webhook" in Events

# List policy violations
kubectl get policyreport -n <namespace> --context myapp-production-use1

Common violations and fixes:

Policy	Violation	Fix
`block-privileged`	`privileged: true` in spec	Remove the privileged flag
`require-non-root`	Running as root	Add `runAsNonRoot: true`, `runAsUser: 1000`
`block-host-path`	hostPath volume	Replace with PVC
`require-resource-limits`	No CPU/memory limits	Add `resources.limits`
`verify-image-signature`	Image not Cosign-signed	Must go through CI/CD pipeline

Application Down

# Check pods
kubectl get pods -n production --context myapp-production-use1

# Check pod logs
kubectl logs -n production -l app=myapp --context myapp-production-use1 --tail=50

# Check rollout
kubectl get rollouts -n production --context myapp-production-use1

# Check HPA — is it maxed out?
kubectl get hpa -n production --context myapp-production-use1

# Check if Karpenter is provisioning nodes
kubectl get nodes --context myapp-production-use1 -w

OPS-8: Routine Health Check

# Application
curl https://api.matthewoladipupo.dev/health

# ArgoCD: show apps that are not Synced and Healthy (only the header row left = all good)
argocd app list | grep -v "Synced.*Healthy"

# Enable public endpoint, then:
kubectl get nodes --context myapp-production-use1
kubectl get pods -n production --context myapp-production-use1
kubectl get hpa -n production --context myapp-production-use1
kubectl get externalsecret -n production --context myapp-production-use1
kubectl get schedule -n velero --context myapp-production-use1

# Lock endpoint back
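For a scripted version of the ArgoCD check, the header row needs to be skipped so that an empty result really means all good. The sketch below does that with awk. unhealthy_apps is a hypothetical helper, and it uses the same Synced.*Healthy text match as the grep above rather than parsing columns.

```shell
# Print apps that are not both Synced and Healthy; return 1 if any exist.
unhealthy_apps() {
  ua_list=$(argocd app list) || return 1
  # NR > 1 skips the header row that `argocd app list` prints.
  ua_bad=$(printf '%s\n' "$ua_list" | awk 'NR > 1 && !/Synced.*Healthy/')
  if [ -n "$ua_bad" ]; then
    printf '%s\n' "$ua_bad"
    return 1
  fi
  echo "all apps Synced and Healthy"
}
```

Because it returns a non-zero status when something is off, it can gate a cron job or a chat notification.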

OPS-9: Restore from Velero Backup

# Enable public endpoint first (OPS-2), then:

# List available backups
kubectl get backups -n velero --context myapp-production-use1

# Restore a namespace
kubectl create --context myapp-production-use1 -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-$(date +%Y%m%d-%H%M)
  namespace: velero
spec:
  backupName: <backup-name-from-list>
  includedNamespaces:
    - production
  restorePVs: true
EOF

# Watch progress
kubectl get restore -n velero --context myapp-production-use1 -w
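To avoid editing the heredoc by hand under pressure, the Restore manifest can be rendered from two arguments and reviewed before it is applied. This is a small sketch that produces the same manifest as above. render_restore is a hypothetical helper name.

```shell
# Render a Velero Restore manifest to stdout.
# Arguments: backup name, namespace to restore.
render_restore() {
  if [ -z "$1" ] || [ -z "$2" ]; then
    echo "usage: render_restore <backup-name> <namespace>" >&2
    return 1
  fi
  cat <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-$(date +%Y%m%d-%H%M)
  namespace: velero
spec:
  backupName: $1
  includedNamespaces:
    - $2
  restorePVs: true
EOF
}
```

Pipe the output into kubectl create --context myapp-production-use1 -f - once the rendered YAML looks right.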

OPS-10: Common Error Reference

Error	Cause	Fix
`Token has expired and refresh failed`	SSO session expired	`aws sso login --sso-session admin --no-browser`
`dial tcp 10.x.x.x:443: i/o timeout`	Private cluster endpoint	Enable public access temporarily (OPS-2)
`config profile (X) could not be found`	Wrong profile name	Use `myapp-prod-use1` not `production`
`unknown command "argo" for "kubectl"`	Plugin not installed	Install `kubectl-argo-rollouts` binary
`SecretSynced: False` on ExternalSecret	IRSA role or secret missing	Check IAM role exists, check secret path in Secrets Manager
Pod stuck in `Pending`	Karpenter provisioning	Wait 2 min; check `kubectl get nodeclaims`
ArgoCD app `OutOfSync` after ESO sync	ESO writes `status.refreshTime`	Known false positive — safe to ignore or force-sync
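The first rows of the table lend themselves to a lookup that turns an error message into the matching fix. The following is a sketch covering only the four CLI errors whose text is quoted in the table. suggest_fix is a hypothetical helper name.

```shell
# Print the OPS-10 fix for a known CLI error message; return 1 if unknown.
suggest_fix() {
  case "$1" in
    *"Token has expired and refresh failed"*)
      echo "SSO session expired: aws sso login --sso-session admin --no-browser" ;;
    *"i/o timeout"*)
      echo "Private cluster endpoint: enable public access temporarily (OPS-2)" ;;
    *"config profile"*"could not be found"*)
      echo "Wrong profile name: use myapp-prod-use1, not production" ;;
    *'unknown command "argo" for "kubectl"'*)
      echo "Plugin missing: install the kubectl-argo-rollouts binary" ;;
    *)
      echo "No entry in the OPS-10 table for this message" >&2
      return 1 ;;
  esac
}
```

For example, suggest_fix "dial tcp 10.0.1.5:443: i/o timeout" points back to OPS-2.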