Init container cascade recovery when patches keep getting reverted
K8s reliability | 9 min read
This is for engineering leads staring at a microservice stuck in Init:0/3, three init containers failing for three different reasons, and every kubectl patch reverting within 10 seconds. We cover the diagnostic path, the source-of-truth fix, and how to keep the recovery durable across a rollout.
Problem signals:
- Pods sit in Init:0/3 or Init:CrashLoopBackOff for hours because nothing bounds the wait — no timeout wrapping the init container commands, no Pod-level activeDeadlineSeconds.
- kubectl patch deployment holds for under a minute, then reverts to the broken state.
- Each init container fails differently — TCP timeout to a stale ClusterIP, NXDOMAIN on a hostname that is one character off, AMQP ACCESS_REFUSED after a successful auth.
- Broker topology declared in a ConfigMap does not match what the live broker reports — required bindings are quietly missing.
- The pre-deployment validation Job was dropped during a release rush and nobody noticed for two months.
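A first pass from the CLI to confirm the first two signals — the fanout name, app namespace, and app=fanout label are assumptions carried from the diagnostics later in this post:
# Which init container is each Pod stuck on, and in what state?
kubectl get pods -n app -l app=fanout
kubectl get pods -n app -l app=fanout -o jsonpath='{range .items[*].status.initContainerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'
# Events surface the backoff loop that availableReplicas never pages on.
kubectl describe pods -n app -l app=fanout | grep -A6 'Events:'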
Three protocols fail at once, and a reconciler hides the fix
Why this happens
A service that gates on a cache, a document store, and a message broker usually uses three init containers as dependency gates. When all three are quietly broken, the Pod looks like it is making progress — controllers see Pods scheduled, init containers running, logs accumulating — but availableReplicas stays at zero and nothing pages.
- wait-for-cache dials a hardcoded ClusterIP from the cluster Service range that no longer routes. Kubernetes assigns a new ClusterIP whenever a Service is recreated, and stored configs from an earlier life of the cluster keep pointing at the dead address.
- wait-for-document-store resolves a hostname that is one character off from the live Service (e.g. svc-mongo versus svc-mongodb). DNS returns NXDOMAIN, which in the log line looks indistinguishable from a transient resolver failure.
- wait-for-broker opens the AMQP connection successfully, authenticates, and then fails on vhost open with ACCESS_REFUSED. Most people read access-refused as bad credentials and chase the wrong fix for an hour.
None of these init containers wrap their wait loops in a timeout, so they retry forever instead of failing the Pod. Pod-level activeDeadlineSeconds is also missing, so the kubelet has no upper bound on the whole exercise. (There is no per-container activeDeadlineSeconds field — the bound has to live in the command and on the Pod spec.) The signal you are looking for is the absence of a timeout, not the failures themselves.
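A minimal sketch of the bounded shape, using one of the three init containers — the cache port, timeout values, and busybox tooling are illustrative, and nc -z availability depends on the busybox build:
# Sketch: bound the wait at both levels. Names and values illustrative.
spec:
  activeDeadlineSeconds: 300   # Pod-level kill switch for the whole init phase
  initContainers:
    - name: wait-for-cache
      image: busybox:1.36
      command: ['sh', '-c']
      args:
        # timeout bounds one attempt; the kubelet restart then shows up as Init:CrashLoopBackOff
        - timeout 60 sh -c 'until nc -z cache.app.svc.cluster.local 6379; do sleep 2; done'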
When you finally identify the broken hostname and run kubectl patch on the Deployment, the change sticks for under a minute and reverts. Something downstream of the Deployment spec — an admission webhook, a custom operator, an ArgoCD self-heal sync, an external configuration reconciler — is re-applying the broken init-container spec from a ConfigMap. The live spec is downstream. The fix has to land on the source-of-truth ConfigMap.
Six moves, in order
Recovery sequence
- 1. Read each init container log separately — They fail for different reasons, and kubectl logs deploy/ shows only one container by default. Use -c for each init container — TCP timeout, NXDOMAIN, and AMQP ACCESS_REFUSED need different fixes.
- 2. Confirm whether your patch evaporates — Apply a known harmless change to the live Deployment. Watch for 30 seconds. If it reverts, find the reconciler — usually a ConfigMap behind an admission webhook, an ArgoCD Application, or a custom operator — and edit there.
- 3. Move every init container to DNS plus a deadline — Service ClusterIPs are not stable across recreations. Use service.namespace.svc.cluster.local. Wrap each wait loop in a command-level timeout of 60–300 seconds and set activeDeadlineSeconds on the Pod spec, as sketched above. Both levels — the kubelet restarts failed init containers, so a per-attempt timeout alone never bounds the whole exercise.
- 4. Reconcile broker topology against the live broker — If your topology is declared in a ConfigMap, diff it against the broker's management API and apply the missing exchanges, queues, and bindings. Use a Job that runs the broker admin tool — not a kubectl exec, which leaves no audit trail.
- 5. Restore the pre-deploy validation Job — Endpoints non-empty for every upstream Service, broker reports every required binding, Job-level activeDeadlineSeconds at or under 120, runs to Completed. Label it validation: predeploy so the next failure is caught before the rollout. The full manifest is below.
- 6. Verify durably, not just once — Two green observations 20 seconds apart. No BackOff events in the last 60 seconds. A single green snapshot during a Pod creation flicker means nothing — the reconciler may be mid-tick. A check sketch follows this list.
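The check in move 6, sketched with the same assumed fanout/app names:
# Two observations 20 seconds apart — a single snapshot can catch a flicker.
kubectl get deploy fanout -n app -o jsonpath='{.status.availableReplicas}/{.spec.replicas}{"\n"}'
sleep 20
kubectl get deploy fanout -n app -o jsonpath='{.status.availableReplicas}/{.spec.replicas}{"\n"}'
# Then sweep recent BackOff events — eyeball lastTimestamp for anything under 60s old.
kubectl get events -n app --field-selector reason=BackOff --sort-by=.lastTimestamp | tail -5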
Where the patch evaporates is where the fix belongs
Recovery flow
flowchart TD
A[Pods Init:0/3] --> B[Read each init container log]
B --> C[Three distinct failure modes]
C --> D[Patch the Deployment]
D --> E{Survived 30s?}
E -->|reverted| F[Find the reconciler: ConfigMap, GitOps, operator, webhook]
E -->|stayed| G[Live-spec patch is enough]
F --> H[Edit the source-of-truth ConfigMap]
H --> I[Trigger rollout restart]
G --> I
I --> J[Reconcile broker topology via Job]
J --> K[Restore pre-deploy validation Job]
K --> L[Two green checks 20s apart, no BackOff]
L --> M[Durable convergence]
The branch at E is the entire diagnostic story — does the live spec hold, or is it downstream of something else?
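The API server already records who keeps rewriting the spec — every writer leaves a field manager entry in managedFields. A quick read, assuming the fanout Deployment from the diagnostics below:
# List every writer of the Deployment and when it last wrote.
kubectl get deploy fanout -n app --show-managed-fields \
  -o jsonpath='{range .metadata.managedFields[*]}{.manager}{"\t"}{.operation}{"\t"}{.time}{"\n"}{end}'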
Diagnostics and a durable pre-deploy validation Job
Artifact
# Each init container fails differently — read them one by one.
kubectl logs -n app deploy/fanout -c wait-for-cache
kubectl logs -n app deploy/fanout -c wait-for-document-store
kubectl logs -n app deploy/fanout -c wait-for-broker
# Test whether a live-spec patch holds, or gets reverted.
kubectl set image deployment/fanout -n app wait-for-cache=busybox:1.36
kubectl get deploy fanout -n app -o jsonpath='{.spec.template.spec.initContainers[0].image}'
# Wait 30 seconds and check again — if it changed back, a reconciler owns this.
# Hunt the reconciler.
kubectl get cm -n app | grep -i init-config
kubectl get applications -A 2>/dev/null
kubectl get mutatingwebhookconfigurations
kubectl get pods -A -l app.kubernetes.io/component=operator
# Compare the stored host against the live Service.
kubectl get svc cache -n app -o jsonpath='{.spec.clusterIP}'
kubectl get cm fanout-init-config -n app -o yaml | grep -E 'cache|host|uri'
# Diff declared broker topology against the live broker.
kubectl exec -n app deploy/broker -- rabbitmqadmin list bindings -f tsv | sort
kubectl get cm fanout-topology-state -n app -o jsonpath='{.data.bindings}'
Diagnostic loop. The key step is the patch-survives-30s test — it tells you whether the source-of-truth is the Deployment or a ConfigMap behind it.
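The apply half of move 4, sketched as the script a topology-reconcile Job would run — the admin user and routing key are illustrative; only the binding names come from the validation manifest below:
# Declare what the diff found missing — run this inside a Job, not kubectl exec.
rabbitmqadmin -H broker -u admin -p "$BROKER_PW" declare queue name=fanout-q durable=true
rabbitmqadmin -H broker -u admin -p "$BROKER_PW" declare binding \
  source=events destination=fanout-q routing_key=fanout-q
Next, the validation Job from move 5: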
apiVersion: batch/v1
kind: Job
metadata:
  name: fanout-predeploy-validate-2026w20
  namespace: app
  labels:
    validation: predeploy
spec:
  activeDeadlineSeconds: 120
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      serviceAccountName: predeploy-validator
      containers:
        - name: validate
          # The script needs kubectl, curl, and jq — swap in an image that ships all three if this tag lacks any.
          image: bitnami/kubectl:1.30
          env:
            - name: REQUIRED_BINDINGS
              value: 'events:fanout-q events:audit-q events:dlq events:retry-q events:metrics-q'
            - name: BROKER_PW
              valueFrom:
                secretKeyRef:
                  name: broker-creds
                  key: password
          # bash, not sh: `set -o pipefail` is not POSIX.
          command: ['/bin/bash', '-c']
          args:
            - |
              set -euo pipefail
              # Every upstream Service must have at least one ready endpoint.
              for svc in cache document-store broker; do
                ips=$(kubectl get endpoints "$svc" -n app -o jsonpath='{.subsets[*].addresses[*].ip}')
                [ -n "$ips" ] || { echo "no endpoints for $svc"; exit 1; }
              done
              # Every required binding must exist on the live broker, not just in the ConfigMap.
              live=$(curl -sfu "admin:${BROKER_PW}" http://broker:15672/api/bindings | jq -r '.[] | .source + ":" + .destination' | sort -u)
              for required in $REQUIRED_BINDINGS; do
                echo "$live" | grep -qF "$required" || { echo "missing binding $required"; exit 1; }
              done
              echo "predeploy validation passed"
A pre-deploy validation Job that actually validates. Endpoints non-empty for every upstream, every required binding present on the live broker, fails fast under 2 minutes.
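Gating a rollout on it is one apply and one wait — the filename is an assumption; the Job name matches the manifest above:
# Block the rollout until the validation Job completes; the deadline caps the wait.
kubectl apply -f predeploy-validate.yaml
kubectl wait --for=condition=complete --timeout=150s \
  job/fanout-predeploy-validate-2026w20 -n app || { echo 'validation failed: do not roll out'; exit 1; }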
Failure modes that turn a recovery into an outage
Common mistakes
- Patching the Deployment when an admission webhook, ArgoCD self-heal, or operator owns the source-of-truth ConfigMap. The live spec is downstream of something. Find what, then edit there.
- Bulk-substituting hardcoded IPs across every matching ConfigMap with a one-liner. There is almost always a deliberately broken canary or chaos-test resource sitting next to the real config — same shape, same broken hostname — that a drift-detection job reads to confirm detection logic is still alive. Check labels before you patch (see the label check after this list).
- Bounding the wait only inside each init container's command and skipping activeDeadlineSeconds on the Pod spec. The kubelet restarts failed init containers with backoff, indefinitely; the Pod-level deadline is the only durable kill switch.
- Declaring broker topology in a ConfigMap and never asserting it against the live broker. The declaration is fiction until a Job proves it on every deploy.
- Calling a rollout fixed on a single green observation during a Pod creation flicker. Two checks 20 seconds apart, with no BackOff events in the last 60 seconds, or it is not durable.
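The label check from the bulk-substitution mistake, as a sketch — the canary/chaos label values are assumptions:
# Check for canary or chaos-test markers before bulk-patching ConfigMaps.
kubectl get cm -n app --show-labels | grep -Ei 'canary|chaos'
kubectl get cm fanout-init-config -n app -o jsonpath='{.metadata.labels}{"\n"}'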
Originally published at https://infraforge.agency/insights/init-container-cascade-recovery/.
If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements — see /review.