DevSecOps Pipeline — Completion Runbook
All code is written and pushed to GitHub. This runbook covers the remaining
operational steps: Terraform applies, GitOps ARN updates, and ArgoCD deployment.
Prerequisites
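Install these tools if not already present. The install commands below are for Windows (winget); the rest of this runbook uses POSIX shell syntax such as inline VAR=value prefixes and $(...), so run those commands from Git Bash or WSL. Later sections also use the velero CLI and the kubectl-argo-rollouts plugin, which the commands below do not install: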
# AWS CLI v2
winget install Amazon.AWSCLI
# Terraform 1.6+
winget install HashiCorp.Terraform
# Terragrunt
# Download from https://github.com/gruntwork-io/terragrunt/releases
# Place in C:\Windows\System32\ or add to PATH
# kubectl
winget install Kubernetes.kubectl
# ArgoCD CLI
winget install argoproj.argocd
AWS Profile Setup
The root terragrunt.hcl uses profiles named myapp-{env}-{region_alias}.
Configure them in ~/.aws/config:
[profile myapp-production-use1]
region = us-east-1
role_arn = arn:aws:iam::591120834781:role/AdministratorAccess
source_profile = default
[profile myapp-production-usw2]
region = us-west-2
role_arn = arn:aws:iam::591120834781:role/AdministratorAccess
source_profile = default
[profile myapp-staging-use1]
region = us-east-1
role_arn = arn:aws:iam::690687753178:role/AdministratorAccess
source_profile = default
[profile myapp-staging-usw2]
region = us-west-2
role_arn = arn:aws:iam::690687753178:role/AdministratorAccess
source_profile = default
[profile myapp-dev-use1]
region = us-east-1
role_arn = arn:aws:iam::557702566877:role/AdministratorAccess
source_profile = default
[profile myapp-dev-usw2]
region = us-west-2
role_arn = arn:aws:iam::557702566877:role/AdministratorAccess
source_profile = default
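The naming convention above can be expressed as a small helper, which is handy when scripting later phases. This is a minimal sketch: the function names region_alias and profile_name are hypothetical, and it only knows the two regions used in this runbook.

```shell
#!/usr/bin/env bash
# Sketch: derive the AWS profile name myapp-{env}-{region_alias}.
# Function names are illustrative, not part of the repo.

region_alias() {
  # Map a full region name to the short alias used in profile names.
  case "$1" in
    us-east-1) echo "use1" ;;
    us-west-2) echo "usw2" ;;
    *) echo "unknown region: $1" >&2; return 1 ;;
  esac
}

profile_name() {
  local env="$1" region="$2" short
  short="$(region_alias "$region")" || return 1
  echo "myapp-${env}-${short}"
}

profile_name production us-east-1   # prints myapp-production-use1
profile_name dev us-west-2          # prints myapp-dev-usw2
```

An unknown region returns a non-zero status instead of producing a profile name that does not exist in ~/.aws/config.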
PHASE 1 — Terraform Applies
Work from the myapp-infra/ directory. Run in the order shown — capture outputs
for updating GitOps files in Phase 2.
1.1 WAF (production + staging)
# Production us-east-1
terragrunt apply --terragrunt-working-dir live/production/us-east-1/waf
# Output → webacl_arn (copy this value)
# Production us-west-2
terragrunt apply --terragrunt-working-dir live/production/us-west-2/waf
# Output → webacl_arn (copy this value)
# Staging (no GitOps ARN needed, but good to have)
terragrunt apply --terragrunt-working-dir live/staging/us-east-1/waf
terragrunt apply --terragrunt-working-dir live/staging/us-west-2/waf
1.2 GuardDuty (all regions — no outputs needed)
terragrunt apply --terragrunt-working-dir live/production/us-east-1/guardduty
terragrunt apply --terragrunt-working-dir live/production/us-west-2/guardduty
terragrunt apply --terragrunt-working-dir live/staging/us-east-1/guardduty
terragrunt apply --terragrunt-working-dir live/staging/us-west-2/guardduty
GuardDuty has no GitOps dependency. Alerts appear in the AWS console and
optionally in CloudWatch.
1.3 ESO IRSA for Staging
# Staging us-east-1
terragrunt apply --terragrunt-working-dir live/staging/us-east-1/eso-irsa
# Output → role_arn (copy → used in environments/staging/applicationset.yaml)
# Staging us-west-2
terragrunt apply --terragrunt-working-dir live/staging/us-west-2/eso-irsa
# Output → role_arn (copy → used in environments/staging/applicationset.yaml)
NOTE: The ESO operator ApplicationSet (infrastructure/eso/applicationset.yaml) already includes staging clusters. Once ESO is running on staging and the ExternalSecret IRSA role is set, ExternalSecrets will sync automatically.
1.4 Fluent Bit IRSA (all 6 clusters)
terragrunt apply --terragrunt-working-dir live/production/us-east-1/fluent-bit-irsa
# → role_arn for myapp-production-use1
terragrunt apply --terragrunt-working-dir live/production/us-west-2/fluent-bit-irsa
# → role_arn for myapp-production-usw2
terragrunt apply --terragrunt-working-dir live/staging/us-east-1/fluent-bit-irsa
# → role_arn for myapp-staging-use1
terragrunt apply --terragrunt-working-dir live/staging/us-west-2/fluent-bit-irsa
# → role_arn for myapp-staging-usw2
terragrunt apply --terragrunt-working-dir live/dev/us-east-1/fluent-bit-irsa
# → role_arn for myapp-dev-use1
terragrunt apply --terragrunt-working-dir live/dev/us-west-2/fluent-bit-irsa
# → role_arn for myapp-dev-usw2
1.5 Karpenter (production only)
terragrunt apply --terragrunt-working-dir live/production/us-east-1/karpenter
# Outputs:
# controller_role_arn → for karpenter applicationset.yaml
# node_role_arn → for verification (name = myapp-production-use1-karpenter-node)
# node_instance_profile → for verification
# interruption_queue_name → should be "myapp-production-use1-karpenter"
terragrunt apply --terragrunt-working-dir live/production/us-west-2/karpenter
# Outputs same structure for usw2
The nodeRoleName values in karpenter/nodepool-applicationset.yaml are pre-set to myapp-production-use1-karpenter-node and myapp-production-usw2-karpenter-node. These match what Terraform creates, so no update is needed there.
1.6 Velero (all 6 clusters)
# Production
terragrunt apply --terragrunt-working-dir live/production/us-east-1/velero
# → role_arn for myapp-production-use1
terragrunt apply --terragrunt-working-dir live/production/us-west-2/velero
# → role_arn for myapp-production-usw2
# Staging
terragrunt apply --terragrunt-working-dir live/staging/us-east-1/velero
# → role_arn for myapp-staging-use1
terragrunt apply --terragrunt-working-dir live/staging/us-west-2/velero
# → role_arn for myapp-staging-usw2
# Dev
terragrunt apply --terragrunt-working-dir live/dev/us-east-1/velero
# → role_arn for myapp-dev-use1
terragrunt apply --terragrunt-working-dir live/dev/us-west-2/velero
# → role_arn for myapp-dev-usw2
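Every apply in Phase 1 uses the same path layout, live/{env}/{region}/{component}. The sketch below prints the commands for one component so they can be reviewed before anything runs. The function name print_applies is hypothetical, and the two regions are hard-coded on the assumption that every environment uses both.

```shell
#!/usr/bin/env bash
# Sketch: print (not run) the terragrunt apply commands for one component
# across the given environments. Nothing here touches AWS.

print_applies() {
  local component="$1"; shift
  local env region
  for env in "$@"; do
    for region in us-east-1 us-west-2; do
      echo "terragrunt apply --terragrunt-working-dir live/${env}/${region}/${component}"
    done
  done
}

print_applies velero production staging dev   # six commands, one per cluster
print_applies karpenter production            # two commands, production only
```

Printing first keeps the review step explicit. Each apply still needs its outputs recorded, so running the commands one at a time remains the safer habit.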
PHASE 2 — Update GitOps ARNs
After collecting all outputs from Phase 1, update the GitOps repo
(myapp-gitops/) and push.
2.1 Production WAF ARNs
Edit environments/production/applicationset.yaml — replace "PENDING" with
real WAF ACL ARNs from Step 1.1:
elements:
  - cluster: myapp-production-use1
    ...
    wafAclArn: "arn:aws:wafv2:us-east-1:591120834781:regional/webacl/myapp-production-use1-waf/XXXXXXXX"
  - cluster: myapp-production-usw2
    ...
    wafAclArn: "arn:aws:wafv2:us-west-2:591120834781:regional/webacl/myapp-production-usw2-waf/XXXXXXXX"
2.2 Staging ESO IRSA ARNs
Edit environments/staging/applicationset.yaml — replace "PENDING" with
role ARNs from Step 1.3:
elements:
  - cluster: myapp-staging-use1
    ...
    irsaRoleArn: "arn:aws:iam::690687753178:role/myapp-staging-use1-eso"
  - cluster: myapp-staging-usw2
    ...
    irsaRoleArn: "arn:aws:iam::690687753178:role/myapp-staging-usw2-eso"
2.3 Fluent Bit IRSA ARNs
Edit infrastructure/logging/applicationset.yaml — replace all 6 "PENDING" values:
elements:
  - cluster: myapp-production-use1
    roleArn: "arn:aws:iam::591120834781:role/myapp-production-use1-fluent-bit"
  - cluster: myapp-production-usw2
    roleArn: "arn:aws:iam::591120834781:role/myapp-production-usw2-fluent-bit"
  - cluster: myapp-staging-use1
    roleArn: "arn:aws:iam::690687753178:role/myapp-staging-use1-fluent-bit"
  - cluster: myapp-staging-usw2
    roleArn: "arn:aws:iam::690687753178:role/myapp-staging-usw2-fluent-bit"
  - cluster: myapp-dev-use1
    roleArn: "arn:aws:iam::557702566877:role/myapp-dev-use1-fluent-bit"
  - cluster: myapp-dev-usw2
    roleArn: "arn:aws:iam::557702566877:role/myapp-dev-usw2-fluent-bit"
TIP: Role names follow the pattern {cluster_name}-fluent-bit. Verify with terragrunt output role_arn in each fluent-bit-irsa directory.
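Since the role names are predictable, the expected ARN can be built locally and compared against what Terraform reports. A minimal sketch follows; the function name expected_role_arn is hypothetical, and it assumes the {cluster_name}-{component} pattern holds for the component in question.

```shell
#!/usr/bin/env bash
# Sketch: build the expected IAM role ARN from account ID, cluster, and component.
# Treat the result as a cross-check against `terragrunt output role_arn`,
# not as a substitute for reading the real output.

expected_role_arn() {
  local account_id="$1" cluster="$2" component="$3"
  echo "arn:aws:iam::${account_id}:role/${cluster}-${component}"
}

expected_role_arn 557702566877 myapp-dev-usw2 fluent-bit
# prints arn:aws:iam::557702566877:role/myapp-dev-usw2-fluent-bit
```

If the string printed here differs from the Terraform output, the Terraform output is the value to paste into the ApplicationSet.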
2.4 Karpenter Controller Role ARNs
Edit infrastructure/karpenter/applicationset.yaml — replace 2 "PENDING" values:
elements:
  - cluster: myapp-production-use1
    controllerRole: "arn:aws:iam::591120834781:role/myapp-production-use1-karpenter"
  - cluster: myapp-production-usw2
    controllerRole: "arn:aws:iam::591120834781:role/myapp-production-usw2-karpenter"
2.5 Velero Role ARNs
Edit infrastructure/velero/applicationset.yaml — replace all 6 "PENDING" values:
elements:
  - cluster: myapp-production-use1
    roleArn: "arn:aws:iam::591120834781:role/myapp-production-use1-velero"
  - cluster: myapp-production-usw2
    roleArn: "arn:aws:iam::591120834781:role/myapp-production-usw2-velero"
  - cluster: myapp-staging-use1
    roleArn: "arn:aws:iam::690687753178:role/myapp-staging-use1-velero"
  - cluster: myapp-staging-usw2
    roleArn: "arn:aws:iam::690687753178:role/myapp-staging-usw2-velero"
  - cluster: myapp-dev-use1
    roleArn: "arn:aws:iam::557702566877:role/myapp-dev-use1-velero"
  - cluster: myapp-dev-usw2
    roleArn: "arn:aws:iam::557702566877:role/myapp-dev-usw2-velero"
2.6 Slack Webhooks + Grafana Password
Edit infrastructure/monitoring/prometheus-values.yaml:
- Replace both https://hooks.slack.com/services/CHANGE_ME values with real Slack incoming webhook URLs
- Replace change-me-grafana with a real password (or use an ExternalSecret)
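Once the edits in 2.1 through 2.6 are done, a quick scan for leftover placeholders can catch a missed value before it reaches ArgoCD. The sketch below searches for the three placeholder strings named in this phase. The function name find_placeholders is hypothetical, and the demo uses a throwaway directory rather than the real repo.

```shell
#!/usr/bin/env bash
# Sketch: report any placeholder values still present under the given paths.
# Exit status follows grep: 0 if a placeholder was found, 1 if none remain.

find_placeholders() {
  # -r recurse, -n show line numbers, -F match fixed strings (no regex)
  grep -rnF -e '"PENDING"' -e 'CHANGE_ME' -e 'change-me-grafana' "$@"
}

# Demo against a temporary directory
demo="$(mktemp -d)"
printf 'roleArn: "PENDING"\n' > "$demo/applicationset.yaml"
if find_placeholders "$demo"; then
  echo "placeholders remain" >&2
fi
rm -rf "$demo"
```

From the myapp-gitops directory the equivalent call would be find_placeholders environments infrastructure.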
2.7 Commit + Push GitOps changes
cd myapp-gitops
git add environments/ infrastructure/
git commit -m "chore: fill in real ARNs from terraform outputs"
git push origin HEAD:main
2.8 Create staging Secrets Manager secret
Run this once to seed the staging ExternalSecret:
AWS_PROFILE=myapp-staging-use1 aws secretsmanager create-secret \
--name staging/myapp/db-password \
--secret-string '{"password":"change-me-staging"}' \
--region us-east-1
AWS_PROFILE=myapp-staging-usw2 aws secretsmanager create-secret \
--name staging/myapp/db-password \
--secret-string '{"password":"change-me-staging"}' \
--region us-west-2
PHASE 3 — ArgoCD Setup
3.1 Bootstrap ArgoCD (App of Apps)
The argocd/ directory in myapp-gitops now contains the AppProject and a
bootstrap Application. Apply the bootstrap once — after that ArgoCD manages
itself and will also pick up the AppProject automatically.
# Point kubectl at production cluster (where ArgoCD runs)
kubectl config use-context myapp-production-use1
cd myapp-gitops
# One-time bootstrap — creates the self-managing Application
kubectl apply -f argocd/bootstrap.yaml -n argocd
# ArgoCD will now sync argocd/project-production.yaml automatically.
# Watch until it's healthy:
argocd app wait bootstrap --health
The argocd/project-production.yaml AppProject already includes every namespace and source repo needed by all components. No kubectl patch needed.
3.2 Apply new ApplicationSets to ArgoCD
After the bootstrap Application syncs (it only manages the argocd/ directory),
apply the infrastructure ApplicationSets manually once:
cd myapp-gitops
kubectl apply -f infrastructure/eso/applicationset.yaml
kubectl apply -f infrastructure/monitoring/applicationset.yaml
kubectl apply -f infrastructure/monitoring/alert-rules-applicationset.yaml
kubectl apply -f infrastructure/logging/applicationset.yaml
kubectl apply -f infrastructure/karpenter/applicationset.yaml
kubectl apply -f infrastructure/karpenter/nodepool-applicationset.yaml
kubectl apply -f infrastructure/velero/applicationset.yaml
kubectl apply -f infrastructure/falco/applicationset.yaml
kubectl apply -f infrastructure/argo-rollouts/applicationset.yaml
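After this, the ApplicationSet controller generates one Application per target cluster, and automated sync on those generated Applications keeps them current. Because the bootstrap Application only tracks argocd/, re-apply an ApplicationSet manifest with kubectl whenever you edit it.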
PHASE 4 — ArgoCD Sync Order (Production)
Sync in this exact order to respect CRD dependencies:
# Step 1: Prometheus stack (creates CRDs for PrometheusRule, ServiceMonitor, etc.)
argocd app sync prometheus-myapp-production-use1 prometheus-myapp-production-usw2
argocd app wait prometheus-myapp-production-use1 --health
argocd app wait prometheus-myapp-production-usw2 --health
# Step 2: Alert rules (needs Prometheus CRDs)
argocd app sync alert-rules-myapp-production-use1 alert-rules-myapp-production-usw2
# Step 3: Parallel infra components (no inter-dependency)
argocd app sync \
fluent-bit-myapp-production-use1 fluent-bit-myapp-production-usw2 \
velero-myapp-production-use1 velero-myapp-production-usw2 \
falco-myapp-production-use1 falco-myapp-production-usw2
# Step 4: Karpenter controller (needs ECR access to pull image from public.ecr.aws)
argocd app sync karpenter-myapp-production-use1 karpenter-myapp-production-usw2
argocd app wait karpenter-myapp-production-use1 --health
argocd app wait karpenter-myapp-production-usw2 --health
# Step 5: Karpenter NodePools (needs Karpenter CRDs installed by Step 4)
argocd app sync karpenter-nodepool-myapp-production-use1 karpenter-nodepool-myapp-production-usw2
# Step 6: Argo Rollouts controller
argocd app sync argo-rollouts-myapp-production-use1 argo-rollouts-myapp-production-usw2
argocd app wait argo-rollouts-myapp-production-use1 --health
argocd app wait argo-rollouts-myapp-production-usw2 --health
# Step 7: App (uses Rollout CR — needs argo-rollouts controller running)
argocd app sync myapp-production-myapp-production-use1 myapp-production-myapp-production-usw2
Staging sync (can run in parallel with production steps 3+)
argocd app sync \
eso-myapp-staging-use1 eso-myapp-staging-usw2 \
fluent-bit-myapp-staging-use1 fluent-bit-myapp-staging-usw2 \
velero-myapp-staging-use1 velero-myapp-staging-usw2 \
falco-myapp-staging-use1 falco-myapp-staging-usw2 \
prometheus-myapp-staging-use1 prometheus-myapp-staging-usw2
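# Wait for staging ESO to be healthy; ExternalSecrets then sync automatically
argocd app wait eso-myapp-staging-use1 --health
argocd app wait eso-myapp-staging-usw2 --health
argocd app sync myapp-staging-myapp-staging-use1 myapp-staging-myapp-staging-usw2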
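The Application names used in this phase follow the pattern {component}-myapp-{env}-{region_alias}, and each dependent step should wait on both regions before moving on. The sketch below prints a sync command followed by a wait command for each component, in the order given. The function name sync_and_wait_cmds is hypothetical, and it assumes every component exists in both use1 and usw2.

```shell
#!/usr/bin/env bash
# Sketch: print (not run) "sync then wait" command pairs for each component,
# covering both regions of one environment.

sync_and_wait_cmds() {
  local env="$1"; shift
  local component apps
  for component in "$@"; do
    apps="${component}-myapp-${env}-use1 ${component}-myapp-${env}-usw2"
    echo "argocd app sync ${apps}"
    echo "argocd app wait ${apps} --health"
  done
}

# CRD providers first, then the things that depend on them
sync_and_wait_cmds production prometheus alert-rules karpenter karpenter-nodepool
```

Passing the components in dependency order reproduces the ordering constraint of this phase, with the wait applied to both regions at every step.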
PHASE 5 — Verification
Monitoring
kubectl get pods -n monitoring --context myapp-production-use1
kubectl get prometheusrule -n monitoring --context myapp-production-use1
kubectl get alertmanager -n monitoring --context myapp-production-use1
# Access Grafana: kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring
Logging
kubectl get pods -n logging --context myapp-production-use1
# Verify log groups were created:
AWS_PROFILE=myapp-production-use1 aws logs describe-log-groups \
--log-group-name-prefix /eks/myapp-production-use1 --region us-east-1
Karpenter
kubectl get pods -n karpenter --context myapp-production-use1
kubectl get nodepool --context myapp-production-use1
kubectl get ec2nodeclass --context myapp-production-use1
# Trigger a scale test:
kubectl scale deploy/stress --replicas=50 -n default --context myapp-production-use1
kubectl get nodes -w --context myapp-production-use1
Velero
kubectl get pods -n velero --context myapp-production-use1
kubectl get schedule -n velero --context myapp-production-use1
# Trigger manual backup:
velero backup create manual-test --kubecontext myapp-production-use1
velero backup describe manual-test --kubecontext myapp-production-use1
Falco
kubectl get pods -n falco --context myapp-production-use1
# Check CloudWatch for events:
AWS_PROFILE=myapp-production-use1 aws logs describe-log-groups \
--log-group-name-prefix /falco --region us-east-1
Argo Rollouts (canary deploy)
kubectl get rollout -n production --context myapp-production-use1
kubectl argo rollouts get rollout myapp-production-use1-myapp -n production \
--context myapp-production-use1 --watch
ESO Staging
kubectl get externalsecret -n staging --context myapp-staging-use1
# Use a name from the previous command's output
kubectl describe externalsecret <externalsecret-name> -n staging \
--context myapp-staging-use1
WAF
AWS_PROFILE=myapp-production-use1 aws wafv2 list-web-acls \
--scope REGIONAL --region us-east-1 | grep myapp
GuardDuty
AWS_PROFILE=myapp-production-use1 aws guardduty list-detectors --region us-east-1
AWS_PROFILE=myapp-production-usw2 aws guardduty list-detectors --region us-west-2
Troubleshooting Notes
| Issue | Fix |
|---|---|
| Karpenter fails to pull image | Ensure the node IAM role has ECR pull-through cache configured or use public.ecr.aws directly. Karpenter controller image is on public.ecr.aws/karpenter/karpenter. |
| Falco modern_ebpf not supported | Some EKS AMIs/kernel versions don't support eBPF. Fall back to driver.kind: ebpf or driver.kind: module in infrastructure/falco/values.yaml. |
| Velero backup fails | Ensure S3 bucket lifecycle rule and encryption config applied. Check IRSA trust policy sub matches system:serviceaccount:velero:velero. |
| Alert rules not picked up | The PrometheusRule must have label release: kube-prometheus-stack (already set in alert-rules.yaml). Verify with kubectl get prometheusrule -n monitoring -o yaml. |
| Rollout stuck at 20% | Check AnalysisTemplate — if myapp_http_requests_total metric doesn't exist yet (app not instrumented), the analysis will fail. Set failureLimit: 3 or temporarily disable analysis by removing the analysis step from the canary steps. |
| Karpenter NodePool not scheduling | Verify subnet and SG tags: aws ec2 describe-subnets --filters "Name=tag:karpenter.sh/discovery,Values=myapp-production-use1". |
Day-2 Operations Runbook
For: Anyone operating this pipeline after initial setup is complete
Live system: https://api.matthewoladipupo.dev/health
Quick Reference
URLs and Credentials
| Service | URL | Notes |
|---|---|---|
| Application | https://api.matthewoladipupo.dev/health | Public |
| ArgoCD UI | http://a0c3c1ea43b294c4d8f5c2a7c514f6f2-1678928976.us-east-1.elb.amazonaws.com | admin / see Secrets Manager |
| Grafana | https://grafana.matthewoladipupo.dev | admin / see Secrets Manager |
| AWS SSO Portal | https://d-9a6757fb3c.awsapps.com/start | IAM Identity Center |
Cluster → Profile Map
| Cluster | kubectl context | AWS Profile | Endpoint |
|---|---|---|---|
| myapp-production-use1 | myapp-production-use1 | myapp-prod-use1 | private |
| myapp-production-usw2 | myapp-production-usw2 | myapp-prod-usw2 | private |
| myapp-staging-use1 | myapp-staging-use1 | myapp-staging-use1 | private |
| myapp-staging-usw2 | myapp-staging-usw2 | myapp-staging-usw2 | private |
| myapp-dev-use1 | myapp-dev-use1 | myapp-dev-use1 | public |
| myapp-dev-usw2 | myapp-dev-usw2 | myapp-dev-usw2 | public |
OPS-1: Start of Session
Run every time you open a new terminal. SSO tokens last 8 hours.
# 1. Authenticate
aws sso login --sso-session admin --no-browser
# → with --no-browser the CLI prints a URL and code; open the URL in any browser, click Allow, and wait for "Successfully logged in"
# 2. Verify
aws sts get-caller-identity --profile myapp-prod-use1
# → should return Account: 591120834781
# 3. Quick health check
curl https://api.matthewoladipupo.dev/health
# → {"status":"healthy","region":"us-east-1"}
If you see Token has expired and refresh failed at any point, re-run step 1.
OPS-2: Accessing Private Clusters (Production + Staging)
Production and staging clusters have endpointPublicAccess: false. You must temporarily enable public access, do your work, then lock it back. Never leave production with a public endpoint.
# Enable (wait ~3 minutes after running this).
# Where possible, replace 0.0.0.0/0 below with your own public IP as a /32.
aws eks update-cluster-config \
--name myapp-production-use1 \
--region us-east-1 \
--profile myapp-prod-use1 \
--resources-vpc-config endpointPublicAccess=true,endpointPrivateAccess=true,publicAccessCidrs=0.0.0.0/0
# Confirm it's ready
aws eks describe-cluster \
--name myapp-production-use1 \
--region us-east-1 \
--profile myapp-prod-use1 \
--query 'cluster.resourcesVpcConfig.endpointPublicAccess'
# Must return: true
# --- Do your kubectl work here ---
# Lock back immediately after
aws eks update-cluster-config \
--name myapp-production-use1 \
--region us-east-1 \
--profile myapp-prod-use1 \
--resources-vpc-config endpointPublicAccess=false,endpointPrivateAccess=true
Replace myapp-production-use1 / myapp-prod-use1 / us-east-1 with the appropriate values for other private clusters.
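The enable, work, lock-back sequence is easy to leave half-finished, so one option is a wrapper that re-locks the endpoint on every exit path. The following is a sketch only: the function name with_public_endpoint is hypothetical, and it uses aws eks wait cluster-active in place of the manual three-minute wait. It also takes the allowed CIDR as an argument so that access can be limited to a single address.

```shell
#!/usr/bin/env bash
# Sketch: open the EKS public endpoint to one CIDR, run a command, then re-lock.
# Usage: with_public_endpoint <cluster> <region> <profile> <cidr> <command...>

with_public_endpoint() {
  local cluster="$1" region="$2" profile="$3" cidr="$4"
  shift 4
  (
    # The EXIT trap fires when this subshell ends, whether the command
    # succeeded, failed, or the enable step itself errored out.
    trap 'aws eks update-cluster-config --name "$cluster" --region "$region" \
            --profile "$profile" \
            --resources-vpc-config endpointPublicAccess=false,endpointPrivateAccess=true' EXIT

    aws eks update-cluster-config --name "$cluster" --region "$region" \
        --profile "$profile" \
        --resources-vpc-config "endpointPublicAccess=true,endpointPrivateAccess=true,publicAccessCidrs=${cidr}" \
      && aws eks wait cluster-active --name "$cluster" --region "$region" --profile "$profile" \
      && "$@"
  )
}

# Example (not executed here):
# with_public_endpoint myapp-production-use1 us-east-1 myapp-prod-use1 203.0.113.7/32 \
#   kubectl get nodes --context myapp-production-use1
```

The wrapper returns the exit status of the wrapped command. If the session is interrupted while the cluster is still updating, the lock-back call can be rejected, so confirm the endpoint state with describe-cluster afterwards.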
OPS-3: Standard Deployment
Normal deployments are fully automated — zero manual steps required:
- Developer pushes to main branch of MatthewDipo/myapp
- GitHub Actions: test → Trivy scan → build → push ECR → Cosign sign → update gitops values
- ArgoCD detects gitops change within 3 minutes → syncs
- In production: Argo Rollout starts canary (20% traffic for 5 minutes → analysis → 100%)
Monitor progress:
# Watch rollout steps (enable public endpoint first)
kubectl get rollouts -n production --context myapp-production-use1 -w
# Detailed view (requires kubectl-argo-rollouts plugin)
kubectl argo rollouts get rollout myapp -n production \
--context myapp-production-use1 --watch
OPS-4: Promote or Abort a Canary
Promote immediately (skip the 5-minute pause):
kubectl argo rollouts promote myapp -n production --context myapp-production-use1
Abort (shift all traffic back to stable version instantly):
kubectl argo rollouts abort myapp -n production --context myapp-production-use1
# After abort the rollout shows Degraded — retry to return to Healthy
kubectl argo rollouts retry rollout myapp -n production --context myapp-production-use1
OPS-5: Roll Back a Deployment
Option A — GitOps revert (preferred, keeps git history clean):
cd /path/to/myapp-gitops
git revert HEAD --no-edit
git push origin main
# ArgoCD auto-syncs the revert within 3 minutes
Option B — ArgoCD rollback to a previous revision:
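# Note: ArgoCD rejects a rollback while automated sync is enabled on the app;
# disable automated sync for this app first if the command is refused.
argocd app history myapp-production-myapp-production-use1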
argocd app rollback myapp-production-myapp-production-use1 <revision-number>
Option C — Emergency direct image update (bypasses GitOps, use only in outage):
# Get previous image tag from ECR
aws ecr describe-images \
--repository-name myapp \
--region us-east-1 \
--profile myapp-prod-use1 \
--query 'sort_by(imageDetails,&imagePushedAt)[-2].imageTags' \
--output json
# Force update the rollout
kubectl argo rollouts set image myapp \
myapp=206617159586.dkr.ecr.us-east-1.amazonaws.com/myapp:sha-<previous-sha> \
-n production --context myapp-production-use1
# After incident: update values-production.yaml in gitops to match, then push
OPS-6: Rotate a Secret
The External Secrets Operator (ESO) syncs from AWS Secrets Manager on a 1-hour cycle.
Step 1 — Update value in Secrets Manager:
aws secretsmanager put-secret-value \
--secret-id production/myapp/db-password \
--secret-string '{"password":"new-value-here"}' \
--region us-east-1 \
--profile myapp-prod-use1
Step 2 — Force immediate ESO refresh:
# Enable public endpoint first (OPS-2)
kubectl annotate externalsecret myapp-db-password \
force-sync=$(date +%s) -n production \
--context myapp-production-use1 --overwrite
# Verify
kubectl get externalsecret myapp-db-password -n production \
--context myapp-production-use1
# STATUS: SecretSynced
Step 3 — Restart pods to pick up new value:
kubectl argo rollouts restart myapp -n production --context myapp-production-use1
OPS-7: Incident Response
Falco Security Alert
# 1. Identify what triggered the alert
kubectl logs -n falco daemonset/falco --context myapp-production-use1 \
| grep -E "Warning|Critical" | tail -20
# 2. Inspect the affected pod
kubectl describe pod <pod-name> -n production --context myapp-production-use1
kubectl logs <pod-name> -n production --context myapp-production-use1 --tail=100
# 3. Preserve logs first (they are lost once the pod is deleted)
kubectl logs <pod-name> -n production --context myapp-production-use1 \
> /tmp/incident-$(date +%Y%m%d-%H%M%S).log
# 4. Contain if confirmed malicious: delete pod (Rollout replaces it with a clean copy)
kubectl delete pod <pod-name> -n production --context myapp-production-use1
# 5. Check GuardDuty for correlated findings
aws guardduty list-findings \
--detector-id $(aws guardduty list-detectors \
--region us-east-1 --profile myapp-prod-use1 \
--query 'DetectorIds[0]' --output text) \
--region us-east-1 --profile myapp-prod-use1
Kyverno Blocked a Pod
# See the rejection reason
kubectl describe pod <pod-name> -n <namespace> --context myapp-production-use1
# Look for: "admission webhook" in Events
# List policy violations
kubectl get policyreport -n <namespace> --context myapp-production-use1
Common violations and fixes:
| Policy | Violation | Fix |
|---|---|---|
| block-privileged | privileged: true in spec | Remove the privileged flag |
| require-non-root | Running as root | Add runAsNonRoot: true, runAsUser: 1000 |
| block-host-path | hostPath volume | Replace with PVC |
| require-resource-limits | No CPU/memory limits | Add resources.limits |
| verify-image-signature | Image not Cosign-signed | Must go through CI/CD pipeline |
Application Down
# Check pods
kubectl get pods -n production --context myapp-production-use1
# Check pod logs
kubectl logs -n production -l app=myapp --context myapp-production-use1 --tail=50
# Check rollout
kubectl get rollouts -n production --context myapp-production-use1
# Check HPA — is it maxed out?
kubectl get hpa -n production --context myapp-production-use1
# Check if Karpenter is provisioning nodes
kubectl get nodes --context myapp-production-use1 -w
OPS-8: Routine Health Check
# Application
curl https://api.matthewoladipupo.dev/health
# ArgoCD: show apps that are not both Synced and Healthy (header row only = all good)
argocd app list | grep -v "Synced.*Healthy"
# Enable public endpoint, then:
kubectl get nodes --context myapp-production-use1
kubectl get pods -n production --context myapp-production-use1
kubectl get hpa -n production --context myapp-production-use1
kubectl get externalsecret -n production --context myapp-production-use1
kubectl get schedule -n velero --context myapp-production-use1
# Lock endpoint back
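The grep filter in the ArgoCD check above also lets the column header through, so its output is never truly empty. A small awk filter that skips the header makes "no output" mean "all good". This is a sketch: the function name unhealthy_apps is hypothetical, it assumes the STATUS column sits directly before HEALTH in the argocd app list output, and the sample rows are illustrative rather than captured from a real cluster.

```shell
#!/usr/bin/env bash
# Sketch: read `argocd app list` output on stdin and print only rows that are
# not "Synced" immediately followed by "Healthy". The header row is skipped.

unhealthy_apps() {
  awk 'NR > 1 && !/Synced[[:space:]]+Healthy/'
}

# Demo with illustrative rows; in practice: argocd app list | unhealthy_apps
unhealthy_apps <<'EOF'
NAME                                 PROJECT     STATUS     HEALTH
argocd/velero-myapp-production-use1  production  Synced     Healthy
argocd/falco-myapp-production-use1   production  OutOfSync  Degraded
EOF
```

With the sample input, only the falco row is printed. A row that is Synced but Progressing would also be printed, which is usually what a health check wants.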
OPS-9: Restore from Velero Backup
# Enable public endpoint first (OPS-2), then:
# List available backups
kubectl get backups -n velero --context myapp-production-use1
# Restore a namespace
kubectl create --context myapp-production-use1 -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-$(date +%Y%m%d-%H%M)
  namespace: velero
spec:
  backupName: <backup-name-from-list>
  includedNamespaces:
    - production
  restorePVs: true
EOF
# Watch progress
kubectl get restore -n velero --context myapp-production-use1 -w
OPS-10: Common Error Reference
| Error | Cause | Fix |
|---|---|---|
| Token has expired and refresh failed | SSO session expired | aws sso login --sso-session admin --no-browser |
| dial tcp 10.x.x.x:443: i/o timeout | Private cluster endpoint | Enable public access temporarily (OPS-2) |
| config profile (X) could not be found | Wrong profile name | Use myapp-prod-use1 not production |
| unknown command "argo" for "kubectl" | Plugin not installed | Install kubectl-argo-rollouts binary |
| SecretSynced: False on ExternalSecret | IRSA role or secret missing | Check IAM role exists, check secret path in Secrets Manager |
| Pod stuck in Pending | Karpenter provisioning | Wait 2 min; check kubectl get nodeclaims |
| ArgoCD app OutOfSync after ESO sync | ESO writes status.refreshTime | Known false positive; safe to ignore or force-sync |