Matthew

Production DevSecOps Pipeline — The Complete Day-2 Operations Runbook

DevSecOps Pipeline — Completion Runbook

All code is written and pushed to GitHub. This runbook covers the remaining
operational steps: Terraform applies, GitOps ARN updates, and ArgoCD deployment.


Prerequisites

Install these tools if not already present:

# AWS CLI v2
winget install Amazon.AWSCLI

# Terraform 1.6+
winget install HashiCorp.Terraform

# Terragrunt
# Download from https://github.com/gruntwork-io/terragrunt/releases
# Place in C:\Windows\System32\ or add to PATH

# kubectl
winget install Kubernetes.kubectl

# ArgoCD CLI
winget install argoproj.argocd

AWS Profile Setup

The root terragrunt.hcl uses profiles named myapp-{env}-{region_alias}.
Configure them in ~/.aws/config:

[profile myapp-production-use1]
region = us-east-1
role_arn = arn:aws:iam::591120834781:role/AdministratorAccess
source_profile = default

[profile myapp-production-usw2]
region = us-west-2
role_arn = arn:aws:iam::591120834781:role/AdministratorAccess
source_profile = default

[profile myapp-staging-use1]
region = us-east-1
role_arn = arn:aws:iam::690687753178:role/AdministratorAccess
source_profile = default

[profile myapp-staging-usw2]
region = us-west-2
role_arn = arn:aws:iam::690687753178:role/AdministratorAccess
source_profile = default

[profile myapp-dev-use1]
region = us-east-1
role_arn = arn:aws:iam::557702566877:role/AdministratorAccess
source_profile = default

[profile myapp-dev-usw2]
region = us-west-2
role_arn = arn:aws:iam::557702566877:role/AdministratorAccess
source_profile = default
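The six stanzas above differ only in environment, region, and account ID, so they can be generated rather than typed. The following is a minimal sketch of the myapp-{env}-{region_alias} convention; the function names are hypothetical and not part of the repository.

```shell
# Map an AWS region to the short alias used in profile names.
region_alias() {
  case "$1" in
    us-east-1) echo "use1" ;;
    us-west-2) echo "usw2" ;;
    *) echo "unknown region: $1" >&2; return 1 ;;
  esac
}

# Build a profile name following myapp-{env}-{region_alias}.
profile_name() {
  ra=$(region_alias "$2") || return 1
  echo "myapp-$1-$ra"
}

# Print one ~/.aws/config stanza. Arguments: env, region, account ID.
profile_stanza() {
  name=$(profile_name "$1" "$2") || return 1
  printf '[profile %s]\nregion = %s\nrole_arn = arn:aws:iam::%s:role/AdministratorAccess\nsource_profile = default\n' \
    "$name" "$2" "$3"
}
```

For example, `profile_stanza production us-east-1 591120834781 >> ~/.aws/config` would append the first stanza shown above.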

PHASE 1 — Terraform Applies

Work from the myapp-infra/ directory. Run in the order shown — capture outputs
for updating GitOps files in Phase 2.
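Copying each output by hand is error-prone across this many modules. As a sketch, the values could instead be read back with `terragrunt output` after each apply and collected in one file. The helper names and the phase1-outputs.txt file name are assumptions for illustration; `-raw` is passed through to `terraform output`.

```shell
# Build the Terragrunt working directory for an environment, region and module.
tg_dir() {
  echo "live/$1/$2/$3"
}

# Read one output from a module that has already been applied and append
# "<dir> <output>=<value>" to $OUT_FILE (hypothetical default: phase1-outputs.txt).
# Arguments: env, region, module, output name.
capture_output() {
  dir=$(tg_dir "$1" "$2" "$3")
  value=$(terragrunt output -raw "$4" --terragrunt-working-dir "$dir") || return 1
  echo "$dir $4=$value" >> "${OUT_FILE:-phase1-outputs.txt}"
}
```

For example, `capture_output production us-east-1 waf webacl_arn` would record the ARN needed in Step 2.1.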

1.1 WAF (production + staging)

# Production us-east-1
terragrunt apply --terragrunt-working-dir live/production/us-east-1/waf
# Output → webacl_arn  (copy this value)

# Production us-west-2
terragrunt apply --terragrunt-working-dir live/production/us-west-2/waf
# Output → webacl_arn  (copy this value)

# Staging (no GitOps ARN needed, but good to have)
terragrunt apply --terragrunt-working-dir live/staging/us-east-1/waf
terragrunt apply --terragrunt-working-dir live/staging/us-west-2/waf

1.2 GuardDuty (all regions — no outputs needed)

terragrunt apply --terragrunt-working-dir live/production/us-east-1/guardduty
terragrunt apply --terragrunt-working-dir live/production/us-west-2/guardduty
terragrunt apply --terragrunt-working-dir live/staging/us-east-1/guardduty
terragrunt apply --terragrunt-working-dir live/staging/us-west-2/guardduty

GuardDuty has no GitOps dependency. Alerts appear in the AWS console and
optionally in CloudWatch.

1.3 ESO IRSA for Staging

# Staging us-east-1
terragrunt apply --terragrunt-working-dir live/staging/us-east-1/eso-irsa
# Output → role_arn  (copy → used in environments/staging/applicationset.yaml)

# Staging us-west-2
terragrunt apply --terragrunt-working-dir live/staging/us-west-2/eso-irsa
# Output → role_arn  (copy → used in environments/staging/applicationset.yaml)

NOTE: The ESO operator ApplicationSet (infrastructure/eso/applicationset.yaml)
already includes staging clusters. Once ESO is running on staging and the
ExternalSecret IRSA role is set, ExternalSecrets will sync automatically.

1.4 Fluent Bit IRSA (all 6 clusters)

terragrunt apply --terragrunt-working-dir live/production/us-east-1/fluent-bit-irsa
# → role_arn for myapp-production-use1

terragrunt apply --terragrunt-working-dir live/production/us-west-2/fluent-bit-irsa
# → role_arn for myapp-production-usw2

terragrunt apply --terragrunt-working-dir live/staging/us-east-1/fluent-bit-irsa
# → role_arn for myapp-staging-use1

terragrunt apply --terragrunt-working-dir live/staging/us-west-2/fluent-bit-irsa
# → role_arn for myapp-staging-usw2

terragrunt apply --terragrunt-working-dir live/dev/us-east-1/fluent-bit-irsa
# → role_arn for myapp-dev-use1

terragrunt apply --terragrunt-working-dir live/dev/us-west-2/fluent-bit-irsa
# → role_arn for myapp-dev-usw2

1.5 Karpenter (production only)

terragrunt apply --terragrunt-working-dir live/production/us-east-1/karpenter
# Outputs:
#   controller_role_arn   → for karpenter applicationset.yaml
#   node_role_arn         → for verification (name = myapp-production-use1-karpenter-node)
#   node_instance_profile → for verification
#   interruption_queue_name → should be "myapp-production-use1-karpenter"

terragrunt apply --terragrunt-working-dir live/production/us-west-2/karpenter
# Outputs same structure for usw2

The nodeRoleName values in karpenter/nodepool-applicationset.yaml are
pre-set to myapp-production-use1-karpenter-node and myapp-production-usw2-karpenter-node.
These match what Terraform creates, so no update is needed there.

1.6 Velero (all 6 clusters)

# Production
terragrunt apply --terragrunt-working-dir live/production/us-east-1/velero
# → role_arn for myapp-production-use1

terragrunt apply --terragrunt-working-dir live/production/us-west-2/velero
# → role_arn for myapp-production-usw2

# Staging
terragrunt apply --terragrunt-working-dir live/staging/us-east-1/velero
# → role_arn for myapp-staging-use1

terragrunt apply --terragrunt-working-dir live/staging/us-west-2/velero
# → role_arn for myapp-staging-usw2

# Dev
terragrunt apply --terragrunt-working-dir live/dev/us-east-1/velero
# → role_arn for myapp-dev-use1

terragrunt apply --terragrunt-working-dir live/dev/us-west-2/velero
# → role_arn for myapp-dev-usw2

PHASE 2 — Update GitOps ARNs

After collecting all outputs from Phase 1, update the GitOps repo
(myapp-gitops/) and push.

2.1 Production WAF ARNs

Edit environments/production/applicationset.yaml — replace "PENDING" with
real WAF ACL ARNs from Step 1.1:

elements:
  - cluster: myapp-production-use1
    ...
    wafAclArn: "arn:aws:wafv2:us-east-1:591120834781:regional/webacl/myapp-production-use1-waf/XXXXXXXX"
  - cluster: myapp-production-usw2
    ...
    wafAclArn: "arn:aws:wafv2:us-west-2:591120834781:regional/webacl/myapp-production-usw2-waf/XXXXXXXX"

2.2 Staging ESO IRSA ARNs

Edit environments/staging/applicationset.yaml — replace "PENDING" with
role ARNs from Step 1.3:

elements:
  - cluster: myapp-staging-use1
    ...
    irsaRoleArn: "arn:aws:iam::690687753178:role/myapp-staging-use1-eso"
  - cluster: myapp-staging-usw2
    ...
    irsaRoleArn: "arn:aws:iam::690687753178:role/myapp-staging-usw2-eso"

2.3 Fluent Bit IRSA ARNs

Edit infrastructure/logging/applicationset.yaml — replace all 6 "PENDING" values:

elements:
  - cluster: myapp-production-use1
    roleArn: "arn:aws:iam::591120834781:role/myapp-production-use1-fluent-bit"
  - cluster: myapp-production-usw2
    roleArn: "arn:aws:iam::591120834781:role/myapp-production-usw2-fluent-bit"
  - cluster: myapp-staging-use1
    roleArn: "arn:aws:iam::690687753178:role/myapp-staging-use1-fluent-bit"
  - cluster: myapp-staging-usw2
    roleArn: "arn:aws:iam::690687753178:role/myapp-staging-usw2-fluent-bit"
  - cluster: myapp-dev-use1
    roleArn: "arn:aws:iam::557702566877:role/myapp-dev-use1-fluent-bit"
  - cluster: myapp-dev-usw2
    roleArn: "arn:aws:iam::557702566877:role/myapp-dev-usw2-fluent-bit"

TIP: Role names follow the pattern {cluster_name}-fluent-bit. Verify with
terragrunt output role_arn in each fluent-bit-irsa directory.
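Because the role names are predictable, the expected ARN can be computed and compared against what Terraform reports. This is a sketch with hypothetical helper names; the same pattern appears to hold for the -velero and -eso roles listed in this phase.

```shell
# Build the ARN expected for a role named {cluster_name}-{component}.
# Arguments: account ID, cluster name, component suffix.
expected_role_arn() {
  echo "arn:aws:iam::$1:role/$2-$3"
}

# Compare an actual ARN (e.g. from `terragrunt output -raw role_arn`)
# with the expected one. Fourth argument: the actual ARN.
check_role_arn() {
  expected=$(expected_role_arn "$1" "$2" "$3")
  if [ "$4" = "$expected" ]; then
    echo "ok: $4"
  else
    echo "mismatch: expected $expected, got $4" >&2
    return 1
  fi
}
```

A mismatch usually means the wrong directory or profile was used for the apply, so it is worth checking before the value goes into the ApplicationSet.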

2.4 Karpenter Controller Role ARNs

Edit infrastructure/karpenter/applicationset.yaml — replace 2 "PENDING" values:

elements:
  - cluster: myapp-production-use1
    controllerRole: "arn:aws:iam::591120834781:role/myapp-production-use1-karpenter"
  - cluster: myapp-production-usw2
    controllerRole: "arn:aws:iam::591120834781:role/myapp-production-usw2-karpenter"

2.5 Velero Role ARNs

Edit infrastructure/velero/applicationset.yaml — replace all 6 "PENDING" values:

elements:
  - cluster: myapp-production-use1
    roleArn: "arn:aws:iam::591120834781:role/myapp-production-use1-velero"
  - cluster: myapp-production-usw2
    roleArn: "arn:aws:iam::591120834781:role/myapp-production-usw2-velero"
  - cluster: myapp-staging-use1
    roleArn: "arn:aws:iam::690687753178:role/myapp-staging-use1-velero"
  - cluster: myapp-staging-usw2
    roleArn: "arn:aws:iam::690687753178:role/myapp-staging-usw2-velero"
  - cluster: myapp-dev-use1
    roleArn: "arn:aws:iam::557702566877:role/myapp-dev-use1-velero"
  - cluster: myapp-dev-usw2
    roleArn: "arn:aws:iam::557702566877:role/myapp-dev-usw2-velero"

2.6 Slack Webhooks + Grafana Password

Edit infrastructure/monitoring/prometheus-values.yaml:

  • Replace both https://hooks.slack.com/services/CHANGE_ME with real Slack incoming webhook URLs
  • Replace change-me-grafana with a real password (or use an ExternalSecret)

2.7 Commit + Push GitOps changes

cd myapp-gitops
git add environments/ infrastructure/
git commit -m "chore: fill in real ARNs from terraform outputs"
git push origin HEAD:main
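Before committing, it may help to confirm that no placeholder from Steps 2.1 to 2.6 was missed. The sketch below searches for the three placeholder strings named in this phase; the function name is hypothetical.

```shell
# Print every line under the given paths that still contains a placeholder.
# Returns non-zero if any placeholder remains, zero if the tree is clean.
find_placeholders() {
  if grep -rnE '"PENDING"|CHANGE_ME|change-me-grafana' "$@"; then
    return 1
  fi
  return 0
}
```

Run from myapp-gitops, `find_placeholders environments/ infrastructure/ && git add environments/ infrastructure/` stages the files only when nothing is left to fill in.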

2.8 Create staging Secrets Manager secret

Run this once to seed the staging ExternalSecret:

AWS_PROFILE=myapp-staging-use1 aws secretsmanager create-secret \
  --name staging/myapp/db-password \
  --secret-string '{"password":"change-me-staging"}' \
  --region us-east-1

AWS_PROFILE=myapp-staging-usw2 aws secretsmanager create-secret \
  --name staging/myapp/db-password \
  --secret-string '{"password":"change-me-staging"}' \
  --region us-west-2

PHASE 3 — ArgoCD Setup

3.1 Bootstrap ArgoCD (App of Apps)

The argocd/ directory in myapp-gitops now contains the AppProject and a
bootstrap Application. Apply the bootstrap once — after that ArgoCD manages
itself and will also pick up the AppProject automatically.

# Point kubectl at production cluster (where ArgoCD runs)
kubectl config use-context myapp-production-use1

cd myapp-gitops

# One-time bootstrap — creates the self-managing Application
kubectl apply -f argocd/bootstrap.yaml -n argocd

# ArgoCD will now sync argocd/project-production.yaml automatically.
# Watch until it's healthy:
argocd app wait bootstrap --health

The argocd/project-production.yaml AppProject already includes every
namespace and source repo needed by all components. No kubectl patch needed.

3.2 Apply new ApplicationSets to ArgoCD

After the bootstrap Application syncs (it only manages the argocd/ directory),
apply the infrastructure ApplicationSets manually once:

cd myapp-gitops

kubectl apply -f infrastructure/eso/applicationset.yaml
kubectl apply -f infrastructure/monitoring/applicationset.yaml
kubectl apply -f infrastructure/monitoring/alert-rules-applicationset.yaml
kubectl apply -f infrastructure/logging/applicationset.yaml
kubectl apply -f infrastructure/karpenter/applicationset.yaml
kubectl apply -f infrastructure/karpenter/nodepool-applicationset.yaml
kubectl apply -f infrastructure/velero/applicationset.yaml
kubectl apply -f infrastructure/falco/applicationset.yaml
kubectl apply -f infrastructure/argo-rollouts/applicationset.yaml

After this, ArgoCD self-manages all ApplicationSets via the automated sync
on the generated Applications.


PHASE 4 — ArgoCD Sync Order (Production)

Sync in this exact order to respect CRD dependencies:

# Step 1: Prometheus stack (creates CRDs for PrometheusRule, ServiceMonitor, etc.)
argocd app sync prometheus-myapp-production-use1 prometheus-myapp-production-usw2
argocd app wait prometheus-myapp-production-use1 --health
argocd app wait prometheus-myapp-production-usw2 --health

# Step 2: Alert rules (needs Prometheus CRDs)
argocd app sync alert-rules-myapp-production-use1 alert-rules-myapp-production-usw2

# Step 3: Parallel infra components (no inter-dependency)
argocd app sync \
  fluent-bit-myapp-production-use1 fluent-bit-myapp-production-usw2 \
  velero-myapp-production-use1 velero-myapp-production-usw2 \
  falco-myapp-production-use1 falco-myapp-production-usw2

# Step 4: Karpenter controller (needs ECR access to pull image from public.ecr.aws)
argocd app sync karpenter-myapp-production-use1 karpenter-myapp-production-usw2
argocd app wait karpenter-myapp-production-use1 karpenter-myapp-production-usw2 --health

# Step 5: Karpenter NodePools (needs Karpenter CRDs installed by Step 4)
argocd app sync karpenter-nodepool-myapp-production-use1 karpenter-nodepool-myapp-production-usw2

# Step 6: Argo Rollouts controller
argocd app sync argo-rollouts-myapp-production-use1 argo-rollouts-myapp-production-usw2
argocd app wait argo-rollouts-myapp-production-use1 argo-rollouts-myapp-production-usw2 --health

# Step 7: App (uses Rollout CR — needs argo-rollouts controller running)
argocd app sync myapp-production-myapp-production-use1 myapp-production-myapp-production-usw2
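Every step above repeats the same pattern: sync a component in both regions, then wait for health before the dependent step. A small wrapper can make that ordering harder to get wrong. This is a sketch with hypothetical function names; it relies on `argocd app sync` and `argocd app wait` accepting several application names.

```shell
# Expand a component into its Application names for both regions, following
# the {component}-myapp-{env}-{region_alias} pattern used in the commands above.
apps_for() {
  echo "$1-myapp-$2-use1 $1-myapp-$2-usw2"
}

# Sync a component in both regions, then block until both report Healthy.
# Arguments: component, environment.
sync_and_wait() {
  # Word splitting is intentional: apps_for returns two space-separated names.
  # shellcheck disable=SC2046
  argocd app sync $(apps_for "$1" "$2") &&
    argocd app wait $(apps_for "$1" "$2") --health
}
```

With this, Steps 4 and 5 would read `sync_and_wait karpenter production && sync_and_wait karpenter-nodepool production`.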

Staging sync (can run in parallel with production steps 3+)

argocd app sync \
  eso-myapp-staging-use1 eso-myapp-staging-usw2 \
  fluent-bit-myapp-staging-use1 fluent-bit-myapp-staging-usw2 \
  velero-myapp-staging-use1 velero-myapp-staging-usw2 \
  falco-myapp-staging-use1 falco-myapp-staging-usw2 \
  prometheus-myapp-staging-use1 prometheus-myapp-staging-usw2

# Wait for staging ESO to become healthy; ExternalSecrets then sync automatically
argocd app wait eso-myapp-staging-use1 eso-myapp-staging-usw2 --health
argocd app sync myapp-staging-myapp-staging-use1 myapp-staging-myapp-staging-usw2

PHASE 5 — Verification

Monitoring

kubectl get pods -n monitoring --context myapp-production-use1
kubectl get prometheusrule -n monitoring --context myapp-production-use1
kubectl get alertmanager -n monitoring --context myapp-production-use1
# Access Grafana: kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring

Logging

kubectl get pods -n logging --context myapp-production-use1
# Verify log groups were created:
AWS_PROFILE=myapp-production-use1 aws logs describe-log-groups \
  --log-group-name-prefix /eks/myapp-production-use1 --region us-east-1

Karpenter

kubectl get pods -n karpenter --context myapp-production-use1
kubectl get nodepool --context myapp-production-use1
kubectl get ec2nodeclass --context myapp-production-use1
# Trigger a scale test:
kubectl scale deploy/stress --replicas=50 -n default --context myapp-production-use1
kubectl get nodes -w --context myapp-production-use1

Velero

kubectl get pods -n velero --context myapp-production-use1
kubectl get schedule -n velero --context myapp-production-use1
# Trigger manual backup:
# The velero CLI selects a kubeconfig context with --kubecontext, not --context
velero backup create manual-test --kubecontext myapp-production-use1
velero backup describe manual-test --kubecontext myapp-production-use1

Falco

kubectl get pods -n falco --context myapp-production-use1
# Check CloudWatch for events:
AWS_PROFILE=myapp-production-use1 aws logs describe-log-groups \
  --log-group-name-prefix /falco --region us-east-1

Argo Rollouts (canary deploy)

kubectl get rollout -n production --context myapp-production-use1
kubectl argo rollouts get rollout myapp-production-use1-myapp -n production \
  --context myapp-production-use1 --watch

ESO Staging

kubectl get externalsecret -n staging --context myapp-staging-use1
# Name assumed to follow the {cluster}-myapp-secrets pattern; confirm it in the list above
kubectl describe externalsecret myapp-staging-use1-myapp-secrets -n staging \
  --context myapp-staging-use1

WAF

AWS_PROFILE=myapp-production-use1 aws wafv2 list-web-acls \
  --scope REGIONAL --region us-east-1 | grep myapp

GuardDuty

AWS_PROFILE=myapp-production-use1 aws guardduty list-detectors --region us-east-1
AWS_PROFILE=myapp-production-usw2 aws guardduty list-detectors --region us-west-2
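Most of the kubectl checks above target myapp-production-use1 only. To repeat a check against both production clusters, a small loop is enough. This is a sketch with a hypothetical function name; it suits simple commands that accept a trailing --context flag, not pipelines or aws commands.

```shell
# Run a command once per production context, appending --context <name> each time.
# Stops at the first failure.
for_each_prod_context() {
  for ctx in myapp-production-use1 myapp-production-usw2; do
    "$@" --context "$ctx" || return 1
  done
}
```

For example, `for_each_prod_context kubectl get pods -n monitoring` runs the monitoring check in both regions.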

Troubleshooting Notes

| Issue | Fix |
| --- | --- |
| Karpenter fails to pull image | Ensure the node IAM role has ECR pull-through cache configured, or use public.ecr.aws directly. The Karpenter controller image is on public.ecr.aws/karpenter/karpenter. |
| Falco modern_ebpf not supported | Some EKS AMIs/kernel versions don't support the modern eBPF probe. Fall back to driver.kind: ebpf or driver.kind: module in infrastructure/falco/values.yaml. |
| Velero backup fails | Ensure the S3 bucket lifecycle rule and encryption config are applied. Check that the IRSA trust policy sub matches system:serviceaccount:velero:velero. |
| Alert rules not picked up | The PrometheusRule must have the label release: kube-prometheus-stack (already set in alert-rules.yaml). Verify with kubectl get prometheusrule -n monitoring -o yaml. |
| Rollout stuck at 20% | Check the AnalysisTemplate: if the myapp_http_requests_total metric doesn't exist yet (app not instrumented), the analysis will fail. Set failureLimit: 3 or temporarily disable analysis by removing the analysis step from the canary steps. |
| Karpenter NodePool not scheduling | Verify subnet and SG tags: aws ec2 describe-subnets --filters "Name=tag:karpenter.sh/discovery,Values=myapp-production-use1". |


Day-2 Operations Runbook

For: Anyone operating this pipeline after initial setup is complete
Live system: https://api.matthewoladipupo.dev/health


Quick Reference

URLs and Credentials

| Service | URL | Notes |
| --- | --- | --- |
| Application | https://api.matthewoladipupo.dev/health | Public |
| ArgoCD UI | http://a0c3c1ea43b294c4d8f5c2a7c514f6f2-1678928976.us-east-1.elb.amazonaws.com | admin / see Secrets Manager |
| Grafana | https://grafana.matthewoladipupo.dev | admin / see Secrets Manager |
| AWS SSO Portal | https://d-9a6757fb3c.awsapps.com/start | IAM Identity Center |

Cluster → Profile Map

| Cluster | kubectl context | AWS Profile | Endpoint |
| --- | --- | --- | --- |
| myapp-production-use1 | myapp-production-use1 | myapp-prod-use1 | private |
| myapp-production-usw2 | myapp-production-usw2 | myapp-prod-usw2 | private |
| myapp-staging-use1 | myapp-staging-use1 | myapp-staging-use1 | private |
| myapp-staging-usw2 | myapp-staging-usw2 | myapp-staging-usw2 | private |
| myapp-dev-use1 | myapp-dev-use1 | myapp-dev-use1 | public |
| myapp-dev-usw2 | myapp-dev-usw2 | myapp-dev-usw2 | public |
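The mapping above has one irregularity that is easy to trip over: production profiles shorten "production" to "prod", while staging and dev profiles match the cluster name. A lookup helper, sketched here with hypothetical function names, keeps scripts from hard-coding that rule.

```shell
# Map a cluster name (also its kubectl context) to its AWS profile.
# Per the table above, only production profiles use the short "prod" form.
profile_for_cluster() {
  case "$1" in
    myapp-production-*) echo "myapp-prod-${1#myapp-production-}" ;;
    myapp-staging-*|myapp-dev-*) echo "$1" ;;
    *) echo "unknown cluster: $1" >&2; return 1 ;;
  esac
}

# Map a cluster name to its AWS region using the use1/usw2 suffix.
region_for_cluster() {
  case "$1" in
    *-use1) echo "us-east-1" ;;
    *-usw2) echo "us-west-2" ;;
    *) echo "unknown region alias in: $1" >&2; return 1 ;;
  esac
}
```

For example, `aws sts get-caller-identity --profile "$(profile_for_cluster myapp-production-usw2)"` resolves to the myapp-prod-usw2 profile.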

OPS-1: Start of Session

Run every time you open a new terminal. SSO tokens last 8 hours.

# 1. Authenticate
aws sso login --sso-session admin --no-browser
# → prints a sign-in URL and code (--no-browser does not open a browser for you)
# → open the URL → click Allow → wait for "Successfully logged in"

# 2. Verify
aws sts get-caller-identity --profile myapp-prod-use1
# → should return Account: 591120834781

# 3. Quick health check
curl https://api.matthewoladipupo.dev/health
# → {"status":"healthy","region":"us-east-1"}

If you see Token has expired and refresh failed at any point, re-run step 1.


OPS-2: Accessing Private Clusters (Production + Staging)

Production and staging clusters have endpointPublicAccess: false. You must temporarily enable public access, do your work, then lock it back. Never leave production with a public endpoint.

# Enable (wait ~3 minutes after running this)
aws eks update-cluster-config \
  --name myapp-production-use1 \
  --region us-east-1 \
  --profile myapp-prod-use1 \
  --resources-vpc-config endpointPublicAccess=true,endpointPrivateAccess=true,publicAccessCidrs=0.0.0.0/0

# Confirm it's ready
aws eks describe-cluster \
  --name myapp-production-use1 \
  --region us-east-1 \
  --profile myapp-prod-use1 \
  --query 'cluster.resourcesVpcConfig.endpointPublicAccess'
# Must return: true

# --- Do your kubectl work here ---

# Lock back immediately after
aws eks update-cluster-config \
  --name myapp-production-use1 \
  --region us-east-1 \
  --profile myapp-prod-use1 \
  --resources-vpc-config endpointPublicAccess=false,endpointPrivateAccess=true

Replace myapp-production-use1 / myapp-prod-use1 / us-east-1 with the appropriate values for other private clusters.
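The enable command above opens the API endpoint to 0.0.0.0/0 for the duration of the session. A narrower variant, sketched below, limits access to the operator's current public IP. The function names are hypothetical, and using https://checkip.amazonaws.com to discover the IP is an assumption; any trusted IP-echo service would do.

```shell
# Build the --resources-vpc-config value that opens the endpoint to one IPv4 address.
public_access_config() {
  echo "endpointPublicAccess=true,endpointPrivateAccess=true,publicAccessCidrs=$1/32"
}

# Enable public access for the caller's current IP only.
# Arguments: cluster name, region, AWS profile.
open_endpoint_for_me() {
  my_ip=$(curl -s https://checkip.amazonaws.com) || return 1
  aws eks update-cluster-config \
    --name "$1" --region "$2" --profile "$3" \
    --resources-vpc-config "$(public_access_config "$my_ip")"
}
```

For example, `open_endpoint_for_me myapp-production-use1 us-east-1 myapp-prod-use1`. The lock-back command stays the same as shown above.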


OPS-3: Standard Deployment

Normal deployments are fully automated — zero manual steps required:

  1. Developer pushes to main branch of MatthewDipo/myapp
  2. GitHub Actions: test → Trivy scan → build → push ECR → Cosign sign → update gitops values
  3. ArgoCD detects gitops change within 3 minutes → syncs
  4. In production: Argo Rollout starts canary (20% traffic for 5 minutes → analysis → 100%)

Monitor progress:

# Watch rollout steps (enable public endpoint first)
kubectl get rollouts -n production --context myapp-production-use1 -w

# Detailed view (requires kubectl-argo-rollouts plugin)
kubectl argo rollouts get rollout myapp -n production \
  --context myapp-production-use1 --watch

OPS-4: Promote or Abort a Canary

Promote immediately (skip the 5-minute pause):

kubectl argo rollouts promote myapp -n production --context myapp-production-use1

Abort (shift all traffic back to stable version instantly):

kubectl argo rollouts abort myapp -n production --context myapp-production-use1

# After abort the rollout shows Degraded — retry to return to Healthy
kubectl argo rollouts retry rollout myapp -n production --context myapp-production-use1

OPS-5: Roll Back a Deployment

Option A — GitOps revert (preferred, keeps git history clean):

cd /path/to/myapp-gitops
git revert HEAD --no-edit
git push origin main
# ArgoCD auto-syncs the revert within 3 minutes

Option B — ArgoCD rollback to a previous revision:

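# Note: ArgoCD refuses a rollback while automated sync is enabled on the
# Application; disable auto-sync first if the command reports that error
argocd app history myapp-production-myapp-production-use1
argocd app rollback myapp-production-myapp-production-use1 <revision-number>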

Option C — Emergency direct image update (bypasses GitOps, use only in outage):

# Get previous image tag from ECR
aws ecr describe-images \
  --repository-name myapp \
  --region us-east-1 \
  --profile myapp-prod-use1 \
  --query 'sort_by(imageDetails,&imagePushedAt)[-2].imageTags' \
  --output json

# Force update the rollout
kubectl argo rollouts set image myapp \
  myapp=206617159586.dkr.ecr.us-east-1.amazonaws.com/myapp:sha-<previous-sha> \
  -n production --context myapp-production-use1

# After incident: update values-production.yaml in gitops to match, then push

OPS-6: Rotate a Secret

The External Secrets Operator (ESO) syncs from AWS Secrets Manager on a 1-hour cycle.

Step 1 — Update value in Secrets Manager:

aws secretsmanager put-secret-value \
  --secret-id production/myapp/db-password \
  --secret-string '{"password":"new-value-here"}' \
  --region us-east-1 \
  --profile myapp-prod-use1

Step 2 — Force immediate ESO refresh:

# Enable public endpoint first (OPS-2)
kubectl annotate externalsecret myapp-db-password \
  force-sync=$(date +%s) -n production \
  --context myapp-production-use1 --overwrite

# Verify
kubectl get externalsecret myapp-db-password -n production \
  --context myapp-production-use1
# STATUS: SecretSynced

Step 3 — Restart pods to pick up new value:

kubectl argo rollouts restart myapp -n production --context myapp-production-use1

OPS-7: Incident Response

Falco Security Alert

# 1. Identify what triggered the alert
kubectl logs -n falco daemonset/falco --context myapp-production-use1 \
  | grep -E "Warning|Critical" | tail -20

# 2. Inspect the affected pod
kubectl describe pod <pod-name> -n production --context myapp-production-use1
kubectl logs <pod-name> -n production --context myapp-production-use1 --tail=100

# 3. Preserve logs first (they are lost once the pod is deleted)
kubectl logs <pod-name> -n production --context myapp-production-use1 \
  > /tmp/incident-$(date +%Y%m%d-%H%M%S).log

# 4. Contain if confirmed malicious: delete the pod (the Rollout replaces it with a clean copy)
kubectl delete pod <pod-name> -n production --context myapp-production-use1

# 5. Check GuardDuty for correlated findings
aws guardduty list-findings \
  --detector-id $(aws guardduty list-detectors \
    --region us-east-1 --profile myapp-prod-use1 \
    --query 'DetectorIds[0]' --output text) \
  --region us-east-1 --profile myapp-prod-use1

Kyverno Blocked a Pod

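# See the rejection reason
kubectl describe pod <pod-name> -n <namespace> --context myapp-production-use1
# Look for: "admission webhook" in Events
# A pod denied at admission is never created; in that case check namespace events instead
kubectl get events -n <namespace> --context myapp-production-use1 --sort-by=.lastTimestamp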

# List policy violations
kubectl get policyreport -n <namespace> --context myapp-production-use1

Common violations and fixes:

| Policy | Violation | Fix |
| --- | --- | --- |
| block-privileged | privileged: true in spec | Remove the privileged flag |
| require-non-root | Running as root | Add runAsNonRoot: true, runAsUser: 1000 |
| block-host-path | hostPath volume | Replace with PVC |
| require-resource-limits | No CPU/memory limits | Add resources.limits |
| verify-image-signature | Image not Cosign-signed | Must go through CI/CD pipeline |

Application Down

# Check pods
kubectl get pods -n production --context myapp-production-use1

# Check pod logs
kubectl logs -n production -l app=myapp --context myapp-production-use1 --tail=50

# Check rollout
kubectl get rollouts -n production --context myapp-production-use1

# Check HPA — is it maxed out?
kubectl get hpa -n production --context myapp-production-use1

# Check if Karpenter is provisioning nodes
kubectl get nodes --context myapp-production-use1 -w

OPS-8: Routine Health Check

# Application
curl https://api.matthewoladipupo.dev/health

# ArgoCD: hide apps that are Synced and Healthy (only the header row left = all good)
argocd app list | grep -v "Synced.*Healthy"

# Enable public endpoint, then:
kubectl get nodes --context myapp-production-use1
kubectl get pods -n production --context myapp-production-use1
kubectl get hpa -n production --context myapp-production-use1
kubectl get externalsecret -n production --context myapp-production-use1
kubectl get schedule -n velero --context myapp-production-use1

# Lock endpoint back
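The `grep -v` filter in the ArgoCD check still prints the header row, so its output is never truly empty. For scripted checks, a filter that also drops the header makes "no output" mean "all good". This is a sketch; the function name is hypothetical.

```shell
# Read `argocd app list` output on stdin and print only the rows that are not
# both Synced and Healthy, skipping the header row.
filter_unhealthy() {
  awk 'NR > 1 && !/Synced.*Healthy/'
}
```

Usage: `argocd app list | filter_unhealthy`. An empty result means every application is Synced and Healthy.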

OPS-9: Restore from Velero Backup

# Enable public endpoint first (OPS-2), then:

# List available backups
kubectl get backups -n velero --context myapp-production-use1

# Restore a namespace
kubectl create --context myapp-production-use1 -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-$(date +%Y%m%d-%H%M)
  namespace: velero
spec:
  backupName: <backup-name-from-list>
  includedNamespaces:
    - production
  restorePVs: true
EOF

# Watch progress
kubectl get restore -n velero --context myapp-production-use1 -w

OPS-10: Common Error Reference

| Error | Cause | Fix |
| --- | --- | --- |
| Token has expired and refresh failed | SSO session expired | aws sso login --sso-session admin --no-browser |
| dial tcp 10.x.x.x:443: i/o timeout | Private cluster endpoint | Enable public access temporarily (OPS-2) |
| config profile (X) could not be found | Wrong profile name | Use myapp-prod-use1 not production |
| unknown command "argo" for "kubectl" | Plugin not installed | Install kubectl-argo-rollouts binary |
| SecretSynced: False on ExternalSecret | IRSA role or secret missing | Check IAM role exists, check secret path in Secrets Manager |
| Pod stuck in Pending | Karpenter provisioning | Wait 2 min; check kubectl get nodeclaims |
| ArgoCD app OutOfSync after ESO sync | ESO writes status.refreshTime | Known false positive; safe to ignore or force-sync |
