Init container cascade recovery when patches keep getting reverted
K8s reliability | 9 min read
This is for engineering leads staring at a microservice stuck in Init:0/3, three init containers failing for three different reasons, and every kubectl patch reverting within 10 seconds. We cover the diagnostic path, the source-of-truth fix, and how to keep the recovery durable across a rollout.
Problem signals:
- Pods sit in Init:0/3 or Init:CrashLoopBackOff for hours because nothing bounds the wait — no timeout wrapping the init container commands, no Pod-level activeDeadlineSeconds.
- kubectl patch deployment holds for under a minute, then reverts to the broken state.
- Each init container fails differently — TCP timeout to a stale ClusterIP, NXDOMAIN on a hostname that is one character off, AMQP ACCESS_REFUSED after a successful auth.
- Broker topology declared in a ConfigMap does not match what the live broker reports — required bindings are quietly missing.
- The pre-deployment validation Job was dropped during a release rush and nobody noticed for two months.
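A first pass from the CLI to confirm the first two signals — the fanout name, app namespace, and app=fanout label are assumptions carried from the diagnostics later in this post:
# Which init container is each Pod stuck on, and in what state?
kubectl get pods -n app -l app=fanout
kubectl get pods -n app -l app=fanout -o jsonpath='{range .items[*].status.initContainerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'
# Events surface the backoff loop that availableReplicas never pages on.
kubectl describe pods -n app -l app=fanout | grep -A6 'Events:'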
Three protocols fail at once, and a reconciler hides the fix
Why this happens
A service that gates on a cache, a document store, and a message broker usually uses three init containers as dependency gates. When all three are quietly broken, the Pod looks like it is making progress — controllers see Pods scheduled, init containers running, logs accumulating — but availableReplicas stays at zero and nothing pages.
- wait-for-cache dials a hardcoded ClusterIP from the cluster Service range that no longer routes. Kubernetes assigns a new ClusterIP whenever a Service is recreated, and stored configs from an earlier life of the cluster keep pointing at the dead address.
- wait-for-document-store resolves a hostname that is one character off from the live Service (e.g. svc-mongo versus svc-mongodb). DNS returns NXDOMAIN, which in the log line looks indistinguishable from a transient resolver failure.
- wait-for-broker opens the AMQP connection successfully, authenticates, and then fails on vhost open with ACCESS_REFUSED. Most people read access-refused as bad credentials and chase the wrong fix for an hour.
None of these init containers wrap their wait loops in a timeout, so they retry forever instead of failing the Pod. Pod-level activeDeadlineSeconds is also missing, so the kubelet has no upper bound on the whole exercise. (There is no per-container activeDeadlineSeconds field — the bound has to live in the command and on the Pod spec.) The signal you are looking for is the absence of a timeout, not the failures themselves.
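A minimal sketch of the bounded shape, using one of the three init containers — the cache port, timeout values, and busybox tooling are illustrative, and nc -z availability depends on the busybox build:
# Sketch: bound the wait at both levels. Names and values illustrative.
spec:
  activeDeadlineSeconds: 300   # Pod-level kill switch for the whole init phase
  initContainers:
    - name: wait-for-cache
      image: busybox:1.36
      command: ['sh', '-c']
      args:
        # timeout bounds one attempt; the kubelet restart then shows up as Init:CrashLoopBackOff
        - timeout 60 sh -c 'until nc -z cache.app.svc.cluster.local 6379; do sleep 2; done'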
When you finally identify the broken hostname and run kubectl patch on the Deployment, the change sticks for under a minute and reverts. Something downstream of the Deployment spec — an admission webhook, a custom operator, an ArgoCD self-heal sync, an external configuration reconciler — is re-applying the broken init-container spec from a ConfigMap. The live spec is downstream. The fix has to land on the source-of-truth ConfigMap.
Six moves, in order
Recovery sequence
- 1. Read each init container log separately — They fail for different reasons, and kubectl logs deploy/ shows only one container by default. Use -c for each init container — TCP timeout, NXDOMAIN, and AMQP ACCESS_REFUSED need different fixes.
- 2. Confirm whether your patch evaporates — Apply a known harmless change to the live Deployment. Watch for 30 seconds. If it reverts, find the reconciler — usually a ConfigMap behind an admission webhook, an ArgoCD Application, or a custom operator — and edit there.
- 3. Move every init container to DNS plus a deadline — Service ClusterIPs are not stable across recreations. Use service.namespace.svc.cluster.local. Wrap each wait loop in a command-level timeout of 60–300 seconds and set activeDeadlineSeconds on the Pod spec, as sketched above. Both levels — the kubelet restarts failed init containers, so a per-attempt timeout alone never bounds the whole exercise.
- 4. Reconcile broker topology against the live broker — If your topology is declared in a ConfigMap, diff it against the broker's management API and apply the missing exchanges, queues, and bindings. Use a Job that runs the broker admin tool — not a kubectl exec, which leaves no audit trail.
- 5. Restore the pre-deploy validation Job — Endpoints non-empty for every upstream Service, broker reports every required binding, Job-level activeDeadlineSeconds at or under 120, runs to Completed. Label it validation: predeploy so the next failure is caught before the rollout. The full manifest is below.
- 6. Verify durably, not just once — Two green observations 20 seconds apart. No BackOff events in the last 60 seconds. A single green snapshot during a Pod creation flicker means nothing — the reconciler may be mid-tick. A check sketch follows this list.
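The check in move 6, sketched with the same assumed fanout/app names:
# Two observations 20 seconds apart — a single snapshot can catch a flicker.
kubectl get deploy fanout -n app -o jsonpath='{.status.availableReplicas}/{.spec.replicas}{"\n"}'
sleep 20
kubectl get deploy fanout -n app -o jsonpath='{.status.availableReplicas}/{.spec.replicas}{"\n"}'
# Then sweep recent BackOff events — eyeball lastTimestamp for anything under 60s old.
kubectl get events -n app --field-selector reason=BackOff --sort-by=.lastTimestamp | tail -5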
Where the patch evaporates is where the fix belongs
Recovery flow
flowchart TD
A[Pods Init:0/3] --> B[Read each init container log]
B --> C[Three distinct failure modes]
C --> D[Patch the Deployment]
D --> E{Survived 30s?}
E -->|reverted| F[Find the reconciler: ConfigMap, GitOps, operator, webhook]
E -->|stayed| G[Live-spec patch is enough]
F --> H[Edit the source-of-truth ConfigMap]
H --> I[Trigger rollout restart]
G --> I
I --> J[Reconcile broker topology via Job]
J --> K[Restore pre-deploy validation Job]
K --> L[Two green checks 20s apart, no BackOff]
L --> M[Durable convergence]
The branch at E is the entire diagnostic story — does the live spec hold, or is it downstream of something else?
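The API server already records who keeps rewriting the spec — every writer leaves a field manager entry in managedFields. A quick read, assuming the fanout Deployment from the diagnostics below:
# List every writer of the Deployment and when it last wrote.
kubectl get deploy fanout -n app --show-managed-fields \
  -o jsonpath='{range .metadata.managedFields[*]}{.manager}{"\t"}{.operation}{"\t"}{.time}{"\n"}{end}'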
Diagnostics and a durable pre-deploy validation Job
Artifact
# Each init container fails differently — read them one by one.
kubectl logs -n app deploy/fanout -c wait-for-cache
kubectl logs -n app deploy/fanout -c wait-for-document-store
kubectl logs -n app deploy/fanout -c wait-for-broker
# Test whether a live-spec patch holds, or gets reverted.
kubectl set image deployment/fanout -n app wait-for-cache=busybox:1.36
kubectl get deploy fanout -n app -o jsonpath='{.spec.template.spec.initContainers[0].image}'
# Wait 30 seconds and check again — if it changed back, a reconciler owns this.
# Hunt the reconciler.
kubectl get cm -n app | grep -i init-config
kubectl get applications -A 2>/dev/null
kubectl get mutatingwebhookconfigurations
kubectl get pods -A -l app.kubernetes.io/component=operator
# Compare the stored host against the live Service.
kubectl get svc cache -n app -o jsonpath='{.spec.clusterIP}'
kubectl get cm fanout-init-config -n app -o yaml | grep -E 'cache|host|uri'
# Diff declared broker topology against the live broker.
kubectl exec -n app deploy/broker -- rabbitmqadmin list bindings -f tsv | sort
kubectl get cm fanout-topology-state -n app -o jsonpath='{.data.bindings}'
Diagnostic loop. The key step is the patch-survives-30s test — it tells you whether the source-of-truth is the Deployment or a ConfigMap behind it.
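The apply half of move 4, sketched as the script a topology-reconcile Job would run — the admin user and routing key are illustrative; only the binding names come from the validation manifest below:
# Declare what the diff found missing — run this inside a Job, not kubectl exec.
rabbitmqadmin -H broker -u admin -p "$BROKER_PW" declare queue name=fanout-q durable=true
rabbitmqadmin -H broker -u admin -p "$BROKER_PW" declare binding \
  source=events destination=fanout-q routing_key=fanout-q
Next, the validation Job from move 5: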
apiVersion: batch/v1
kind: Job
metadata:
  name: fanout-predeploy-validate-2026w20
  namespace: app
  labels:
    validation: predeploy
spec:
  activeDeadlineSeconds: 120
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      serviceAccountName: predeploy-validator
      containers:
        - name: validate
          # The script needs kubectl, curl, and jq — swap in an image that ships all three if this tag lacks any.
          image: bitnami/kubectl:1.30
          env:
            - name: REQUIRED_BINDINGS
              value: 'events:fanout-q events:audit-q events:dlq events:retry-q events:metrics-q'
            - name: BROKER_PW
              valueFrom:
                secretKeyRef:
                  name: broker-creds
                  key: password
          # bash, not sh: `set -o pipefail` is not POSIX.
          command: ['/bin/bash', '-c']
          args:
            - |
              set -euo pipefail
              # Every upstream Service must have at least one ready endpoint.
              for svc in cache document-store broker; do
                ips=$(kubectl get endpoints "$svc" -n app -o jsonpath='{.subsets[*].addresses[*].ip}')
                [ -n "$ips" ] || { echo "no endpoints for $svc"; exit 1; }
              done
              # Every required binding must exist on the live broker, not just in the ConfigMap.
              live=$(curl -sfu "admin:${BROKER_PW}" http://broker:15672/api/bindings | jq -r '.[] | .source + ":" + .destination' | sort -u)
              for required in $REQUIRED_BINDINGS; do
                echo "$live" | grep -qF "$required" || { echo "missing binding $required"; exit 1; }
              done
              echo "predeploy validation passed"
A pre-deploy validation Job that actually validates. Endpoints non-empty for every upstream, every required binding present on the live broker, fails fast under 2 minutes.
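Gating a rollout on it is one apply and one wait — the filename is an assumption; the Job name matches the manifest above:
# Block the rollout until the validation Job completes; the deadline caps the wait.
kubectl apply -f predeploy-validate.yaml
kubectl wait --for=condition=complete --timeout=150s \
  job/fanout-predeploy-validate-2026w20 -n app || { echo 'validation failed: do not roll out'; exit 1; }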
Failure modes that turn a recovery into an outage
Common mistakes
- Patching the Deployment when an admission webhook, ArgoCD self-heal, or operator owns the source-of-truth ConfigMap. The live spec is downstream of something. Find what, then edit there.
- Bulk-substituting hardcoded IPs across every matching ConfigMap with a one-liner. There is almost always a deliberately broken canary or chaos-test resource sitting next to the real config — same shape, same broken hostname — that a drift-detection job reads to confirm detection logic is still alive. Check labels before you patch (see the label check after this list).
- Bounding the wait only inside each init container's command and skipping activeDeadlineSeconds on the Pod spec. The kubelet restarts failed init containers with backoff, indefinitely; the Pod-level deadline is the only durable kill switch.
- Declaring broker topology in a ConfigMap and never asserting it against the live broker. The declaration is fiction until a Job proves it on every deploy.
- Calling a rollout fixed on a single green observation during a Pod creation flicker. Two checks 20 seconds apart, with no BackOff events in the last 60 seconds, or it is not durable.
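The label check from the bulk-substitution mistake, as a sketch — the canary/chaos label values are assumptions:
# Check for canary or chaos-test markers before bulk-patching ConfigMaps.
kubectl get cm -n app --show-labels | grep -Ei 'canary|chaos'
kubectl get cm fanout-init-config -n app -o jsonpath='{.metadata.labels}{"\n"}'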
Originally published at https://infraforge.agency/insights/init-container-cascade-recovery/.
If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements — see /review.