DEV Community: Infraforge

When a hardening rollout breaks 8 layers and your own reconciler fights you

Muhammad Hassaan Javed — Mon, 13 Jul 2026 14:00:47 +0000

The first thing the on-call team tried was patching the status ConfigMap. Five apps showed Progressing in the platform's bleater-status object, the dashboard had been red for four hours, and somebody figured a kubectl patch on the status keys would at least quiet the pages while they investigated. The patch lasted about ten seconds. The internal reconciler running in the bleater-system namespace rewrote the ConfigMap on its next tick, every key back to Progressing, and the pages started again. That was the moment they called us. The hardening rollout that had run the night before had not broken one thing. It had broken eight.

Problem signals:

Apps stuck in Progressing or Degraded for hours after a hardening or security pass, with no single obvious cause in the events stream
Status ConfigMaps written by an in-cluster reconciler get rewritten within 10 to 15 seconds of any kubectl patch
kubectl delete job hangs on suspended PreSync Jobs because a hook-cleanup finalizer is still attached
Migration init containers crash-loop with pg_isready succeeding but a schema verification step failing
kubectl patch on a RoleBinding fails with 'cannot change roleRef' because roleRef is immutable, so the binding has to be recreated

Why editing the status ConfigMap was the wrong instinct

The patch that survived ten seconds

The team had built a small in-house control plane the year before. A Python reconciler Pod in bleater-system watched the managed workloads, computed health from live cluster signals, and wrote a bleater-status ConfigMap every ten to fifteen seconds. Five apps reported there: an auth service, a profile service, a timeline service, a fanout service, and a primary application that handled the user-facing API. None of them used a full GitOps platform. The reconciler followed GitOps conventions (PreSync hook Jobs, sync windows, hook-cleanup finalizers), but it operated on raw Kubernetes primitives: ConfigMaps, Jobs, Roles. No CRDs.

That detail matters because when the on-call lead patched the status ConfigMap to mark the apps healthy, the reconciler was doing its job. It read the live cluster, saw the upstream signals were still bad, and rewrote the status. The patch was not wrong because patching ConfigMaps is wrong. It was wrong because the bleater-status object was not an input to the system. It was an output. Editing an output to fix a system is the same shape of mistake as editing a Prometheus metric to fix a service.

We have seen this pattern enough times to write it down as a rule. If a controller is rewriting your patches in under a minute, the object you are patching is derived state. Find the inputs. The reconciler source was eighty lines of Python and it took two minutes to read. The health predicate was an AND-chain across eight signals: lock state, orphan hook Job (the finalizer), schema version, RBAC capability, PVC bound, ResourceQuota headroom, NetworkPolicy egress, and init-container health. Any one of the first seven returning bad meant Progressing; a failing init container returned Degraded instead. All eight were bad.

# the AND-chain we found in the reconciler
def app_health(app):
    if lock_status() == 'locked':
        return 'Progressing'
    if orphan_hook_present(app):
        return 'Progressing'
    if schema_declared_version() < required_version():
        return 'Progressing'
    if not migration_rbac_capable():
        return 'Progressing'
    if not pvc_bound(app):
        return 'Progressing'
    if quota_exhausted():
        return 'Progressing'
    if not egress_allows_db(app):
        return 'Progressing'
    if init_container_failing(app):
        return 'Degraded'
    return 'Healthy'

The reconciler's health function. Eight independent signals, all gating. Every patch to the output ConfigMap was wasted work until every signal flipped.
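A chain like this returns on the first bad signal, so it only ever names one fault at a time. A minimal sketch of the "find the inputs" step, assuming each signal can be wrapped as a callable that returns True when it is bad; the names in example_checks are hypothetical stand-ins, not the reconciler's real functions:

```python
from typing import Callable, Dict, List


def failing_signals(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Return the name of every check that currently reports a problem.

    Each check returns True when its signal is bad, so the result is the
    full list of gating inputs rather than only the first one hit.
    """
    return [name for name, is_bad in checks.items() if is_bad()]


# Hypothetical stand-ins for the reconciler's live lookups.
example_checks = {
    "lock": lambda: True,
    "orphan_hook": lambda: True,
    "schema_version": lambda: False,
    "rbac": lambda: True,
}

print(failing_signals(example_checks))  # ['lock', 'orphan_hook', 'rbac']
```

Listing every failing input up front is what turns one red dashboard into a countable set of repairs.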

What the inventory pass turned up in the bleater namespace

Eight failures wearing one hat

We started with the inventory, because the ticket told us almost nothing. A real P1 page rarely enumerates faults; it tells you what is on fire and gives you the namespace. We ran the kind of get-everything pass we always run on a strange namespace.

kubectl get pods,configmaps,jobs,deployments,roles,rolebindings,serviceaccounts,pvc,resourcequota,networkpolicy -n bleater
kubectl get events -n bleater --sort-by=.lastTimestamp | tail -40
kubectl describe pod -n bleater | grep -A5 'Init Containers\|Events:'

The first three commands we ran. The namespace had about sixteen pre-existing platform workloads from other teams sharing label values with the five managed apps.

What came back was a layered mess. A suspended PreSync Job named auth-presync-migrate-legacy7r2x with a hook-cleanup finalizer and no hook-delete-policy. A second suspended Job named fanout-presync-validate that looked identical but carried the hook-delete-policy annotation and a bleater.io/owner label pointing at platform-team. A hook-reconciliation-lock ConfigMap with status: locked and a stale lock-reason from the night of the rollout. The primary application's pod in Init:CrashLoopBackOff with kubectl logs --previous showing the init container failing after pg_isready returned ok. A bleat-db-schema ConfigMap declaring version=2 with no tables-v3 key. A migration script that contained psql ... || exit 0 and had no set -e.

And then the governance layer, which is where the rollout had really gotten out of hand. A RoleBinding named migration-runner-binding pointed at migration-runner-role-v1, which had read-only verbs. A migration-runner-role-v2 existed alongside it, unbound, with create:jobs and patch:configmaps. A PersistentVolumeClaim named bleat-migration-pvc was Pending with an event saying storageclass.storage.k8s.io "fast-ssd-tier" not found, on a k3s cluster where the only storage class was local-path. A ResourceQuota set to pods: 1. A NetworkPolicy with egress: [] denying everything outbound including DNS.

Each one of those, taken alone, was a small fix. Taken together, they gated each other. The migration could not run because the RBAC was wrong. The repair Pods could not schedule because the quota was at one. The init container could not reach Postgres because the NetworkPolicy denied egress. The schema could not advance because the script swallowed errors. The reconciler refused to mark anything healthy until all of them resolved. The hardening rollout had tightened every knob at once and the knobs were not independent.

The dependency graph we drew on the bridge call. Cascade order falls out of the arrows.
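One way to turn a dependency graph like this into a repair order is a topological sort. The sketch below is illustrative only: the edges are an assumption reconstructed from the description above, not a copy of the graph drawn on the call, and the sort returns one valid order rather than the only one.

```python
from graphlib import TopologicalSorter

# Each fix maps to the fixes that must land before it.
# These edges are assumptions reconstructed from the prose.
DEPENDS_ON = {
    "quota": set(),
    "rbac": set(),
    "network_policy": set(),
    "migration_script": set(),
    "pvc": {"quota"},  # binding needs a consumer Pod to schedule
    "migration_job": {"quota", "rbac", "pvc", "network_policy", "migration_script"},
    "schema_configmap": {"migration_job"},
}


def repair_order(depends_on):
    """Return one order in which every fix follows its prerequisites."""
    return list(TopologicalSorter(depends_on).static_order())


order = repair_order(DEPENDS_ON)
print(order)
```

Anything that has to schedule a Pod sits downstream of the quota, which is why the quota ends up early in any valid order.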

The order of repair when faults gate each other

Why we raised the quota before anything else

The instinct on a multi-fault incident is to start with the most visible symptom. The CrashLoopBackOff is loud. The lock is loud. The orphan Job is loud. None of those were the right first move. The right first move was the boring one: raise the ResourceQuota, because every other fix needed to schedule a Pod. A ResourceQuota of pods: 1 caps the total number of non-terminal Pods in the namespace, not the number beyond what is already running, and about sixteen were already running. The quota was over-committed the moment the rollout landed it, so the API server admitted nothing new. We raised it to pods: 24, cpu: 8, memory: 16Gi, sized above the existing workloads plus the five managed apps and the repair Pods. Production limits. We did not delete the quota.
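The arithmetic behind that is small enough to sketch. The figure of sixteen running Pods is the approximate count from the inventory pass, used here only as an illustrative input:

```python
def pod_headroom(hard_limit: int, non_terminal_pods: int) -> int:
    """Pods that can still be admitted; negative means over-committed."""
    return hard_limit - non_terminal_pods


def can_admit(hard_limit: int, non_terminal_pods: int, new_pods: int = 1) -> bool:
    """True if new_pods more Pods fit under the quota's hard limit."""
    return non_terminal_pods + new_pods <= hard_limit


print(pod_headroom(1, 16))    # -15: nothing new is admitted
print(pod_headroom(24, 16))   # 8
print(can_admit(24, 16, 5))   # True
```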

Then the RBAC. We described both Roles and confirmed v2 had the verbs the migration Job needed. Patching the existing RoleBinding to swing roleRef to v2 returned the error we expected.

$ kubectl patch rolebinding migration-runner-binding -n bleater \
    --type='json' -p='[{"op":"replace","path":"/roleRef/name","value":"migration-runner-role-v2"}]'
The RoleBinding "migration-runner-binding" is invalid: roleRef: Invalid value: rbac.RoleRef{...}: cannot change roleRef

$ kubectl get rolebinding migration-runner-binding -n bleater -o yaml > /tmp/rb.yaml
# edit /tmp/rb.yaml: set roleRef.name to migration-runner-role-v2 and drop
# server-set metadata (resourceVersion, uid) so the create is accepted
$ kubectl delete rolebinding migration-runner-binding -n bleater
$ kubectl apply -f /tmp/rb.yaml

roleRef is immutable. The only path is delete-and-recreate, with the existing object as a template so you do not lose subjects.
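The edit step can also be scripted. This is a minimal sketch that treats the exported RoleBinding as a plain dict (for example, parsed from kubectl get -o json); the ServiceAccount name in the example is hypothetical. It keeps the subjects, swaps the role name, and strips metadata fields the API server sets itself.

```python
import copy

# Fields populated by the API server that should not be sent on create.
SERVER_SET_METADATA = (
    "resourceVersion", "uid", "creationTimestamp", "managedFields", "selfLink",
)


def rebind(exported: dict, new_role: str) -> dict:
    """Return a create-ready copy of a RoleBinding pointing at new_role."""
    manifest = copy.deepcopy(exported)
    metadata = manifest.get("metadata", {})
    for key in SERVER_SET_METADATA:
        metadata.pop(key, None)
    manifest["roleRef"] = {**manifest["roleRef"], "name": new_role}
    return manifest


exported = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "RoleBinding",
    "metadata": {
        "name": "migration-runner-binding",
        "namespace": "bleater",
        "resourceVersion": "12345",
        "uid": "00000000-0000-0000-0000-000000000000",
    },
    "roleRef": {
        "apiGroup": "rbac.authorization.k8s.io",
        "kind": "Role",
        "name": "migration-runner-role-v1",
    },
    # Hypothetical subject; the real binding's subjects are preserved as-is.
    "subjects": [
        {"kind": "ServiceAccount", "name": "migration-runner", "namespace": "bleater"}
    ],
}

print(rebind(exported, "migration-runner-role-v2")["roleRef"]["name"])
```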

PVC next. We listed storage classes, saw local-path was the only one, exported the existing PVC, changed storageClassName, deleted, reapplied. The PVC sat Pending for a few more seconds until we scheduled a consumer Pod against it, because local-path on k3s binds on first consumer. Then the NetworkPolicy. We did not delete it. The deny-by-default posture was the right posture; the rollout had just forgotten to allow anything. We added three explicit egress rules: same-namespace for the Postgres reach, kube-system on UDP 53 for DNS, and the metrics endpoints the reconciler scraped. The deny-all stayed in place for everything else.
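For illustration, here is the shape of a default-deny egress policy with explicit allows, built as a Python dict that could be serialized and applied. The policy name is hypothetical, the kube-system selector assumes the kubernetes.io/metadata.name label that recent Kubernetes versions set on namespaces, and the metrics rule is left out because its endpoints are not described above.

```python
import json


def egress_policy(name: str, namespace: str) -> dict:
    """Deny all egress except same-namespace traffic and DNS on UDP 53."""
    same_namespace = {"to": [{"podSelector": {}}]}
    dns = {
        "to": [
            {
                "namespaceSelector": {
                    "matchLabels": {"kubernetes.io/metadata.name": "kube-system"}
                }
            }
        ],
        "ports": [{"protocol": "UDP", "port": 53}],
    }
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {},           # applies to every Pod in the namespace
            "policyTypes": ["Egress"],   # anything not listed below stays denied
            "egress": [same_namespace, dns],
        },
    }


print(json.dumps(egress_policy("default-deny-egress", "bleater"), indent=2))
```

Because policyTypes still lists Egress, traffic that matches none of the rules remains blocked, which is the point of adding allows instead of deleting the policy.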

Then the lock and the orphan. The lock was a one-line patch to set status: unlocked and to replace lock-reason with resolved-2024-hardening-rollback. We left an audit value rather than blanking the field. The orphan Job hung on delete because of the finalizer. The strip-then-delete sequence is muscle memory at this point but it is worth showing because plenty of teams reach for --force first, which is the wrong tool.

# strip the finalizer first, then delete cleanly
kubectl patch job auth-presync-migrate-legacy7r2x -n bleater \
  --type=json -p='[{"op":"remove","path":"/metadata/finalizers"}]'
kubectl delete job auth-presync-migrate-legacy7r2x -n bleater

# do NOT touch fanout-presync-validate. it has hook-delete-policy set,
# carries bleater.io/owner=platform-team, and the reconciler manages it.
kubectl get job fanout-presync-validate -n bleater \
  -o jsonpath='{.metadata.annotations.argocd\.argoproj\.io/hook-delete-policy}'
# => HookSucceeded

Strip then delete. The decoy Job looks identical to the orphan from a distance; the discriminator is the hook-delete-policy annotation and the ownership label.
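The discriminator can be written down as a predicate over the Job's metadata. This is a sketch under assumptions: the Job is a dict parsed from kubectl get -o json, the finalizer name in the example is hypothetical, and the rule (suspended, has a finalizer, no hook-delete-policy, no owner label) is our reading of the description above rather than the reconciler's own logic.

```python
HOOK_DELETE_POLICY = "argocd.argoproj.io/hook-delete-policy"
OWNER_LABEL = "bleater.io/owner"


def looks_orphaned(job: dict) -> bool:
    """True for a suspended Job with a finalizer but no policy or owner."""
    metadata = job.get("metadata", {})
    annotations = metadata.get("annotations") or {}
    labels = metadata.get("labels") or {}
    suspended = bool(job.get("spec", {}).get("suspend"))
    has_finalizer = bool(metadata.get("finalizers"))
    return (
        suspended
        and has_finalizer
        and HOOK_DELETE_POLICY not in annotations
        and OWNER_LABEL not in labels
    )


orphan = {
    "metadata": {
        "name": "auth-presync-migrate-legacy7r2x",
        "finalizers": ["example.invalid/hook-cleanup"],  # hypothetical name
    },
    "spec": {"suspend": True},
}
managed = {
    "metadata": {
        "name": "fanout-presync-validate",
        "finalizers": ["example.invalid/hook-cleanup"],  # hypothetical name
        "annotations": {HOOK_DELETE_POLICY: "HookSucceeded"},
        "labels": {OWNER_LABEL: "platform-team"},
    },
    "spec": {"suspend": True},
}

print(looks_orphaned(orphan), looks_orphaned(managed))  # True False
```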

We have written more on cleaning up GitOps-style state safely in our Kubernetes and CI/CD stabilization playbook, including the finalizer-strip pattern and how to tell a managed Job from an orphaned one without guessing.

Don't weaken governance to silence alarms

The fixes that had to be repairs, not deletes

Halfway through the recovery the client's platform lead asked the obvious question. Why not just delete the ResourceQuota and the NetworkPolicy until things stabilize, then put them back? It would have shaved twenty minutes. We said no, and the reason is worth writing down, because it is the part of incident work that teams under pressure get wrong most often.

Governance controls exist for a reason. Someone put a ResourceQuota on that namespace originally because something had blown up the namespace before. Someone put the deny-all egress on because the auth service should not be able to call random external endpoints. The rollout had mangled the values, not the intent. Deleting the controls would have restored the workloads and silenced the alarms. It would have also removed two of the few real defenses that namespace had, with no scheduled work item to put them back. We have watched teams do this in March and find the controls still missing in November. The graveyard of post-incident TODOs is full of governance restore tickets that never got worked.

So we repaired. The quota went up to production limits in place. The NetworkPolicy got explicit allow rules added while the default-deny stayed. The PVC got a real storage class while the claim itself stayed at the same name and the same size. The orphan Job got deleted, because a stale suspended PreSync Job genuinely is garbage, but the cascade infrastructure stayed. Same controls, working values.

The migration script was the other repair-not-delete case. The version we found had this pattern:

# what we found
#!/bin/bash
psql -h $DB_HOST -U $DB_USER -d $DB_NAME -f /migrations/v3.sql || exit 0
echo "migration complete"

# what we replaced it with
#!/bin/bash
set -euo pipefail
psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" \
  -v ON_ERROR_STOP=1 \
  -f /migrations/v3.sql
echo "migration complete"

|| exit 0 is the single worst line in any migration script. set -e and ON_ERROR_STOP=1 together mean a failing SQL statement actually fails the Job, which is what the reconciler was waiting to see.

After the script was patched, the migration Job ran successfully under the new RoleBinding, applied the v3 schema, and we read the tables back out of Postgres directly rather than trusting the script's exit code. The bleat-db-schema ConfigMap got tables-v3 written from observed pg_tables output. Not from the migration's stated intent. From the live database. If you ever find yourself writing schema declarations from anything other than what is actually in the database, you are setting up the next incident.

When in-house reconcilers and hardening rollouts collide

If your control plane is gaslighting your operators

The hard part of this kind of incident is not any single fault. The hard part is that an internal control plane is opinionated about state in ways that are not documented anywhere except in the reconciler's source code. When five apps are red and the dashboard says nothing changed, your team can spend an hour patching outputs that get reverted before they understand the inputs. Hardening rollouts make this worse, because they touch ResourceQuotas and NetworkPolicies and RBAC in the same change window, and the rollback path almost never accounts for the case where the controls themselves were the right idea but the values were wrong.

We run these recovery engagements every week. The in-house reconciler pattern shows up at almost every SaaS company past Series A that decided not to run ArgoCD or Flux directly. The shape of the failure is always the same: a small Python or Go service that watches a namespace and writes a status object, an operations team that does not own the reconciler code, and a control plane that fights every cosmetic fix because that is what it was built to do. We have seen the RoleBinding immutability case four times this quarter alone. The NetworkPolicy egress-without-DNS case shows up after every security audit cycle.

If you are watching a namespace where the status object keeps reverting your changes, or where a hardening pass cascaded across half a dozen layers and your team is debating whether to delete the controls to get back to green, book an infrastructure review with our team and we will be on a bridge call with you the same day. We will read your reconciler, draw the dependency graph for the cascade, and walk the repair order with your on-call. The goal is not to get the dashboard green by morning. The goal is to get it green without leaving a graveyard of governance restore tickets behind it.

Originally published at https://infraforge.agency/insights/internal-control-plane-cascade-recovery/.

If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements — see /review.

Recovering a status page from a half-finished schema migration

Muhammad Hassaan Javed — Mon, 22 Jun 2026 22:19:29 +0000

The log line was 'database schema version 23 is dirty, refusing to start' and the pod exited immediately after printing it. The team had already tried a Helm rollback to the previous chart version. That pod did not get further: it printed 'database schema version 23 found, expected 21' against a binary that wanted 21. One binary refused because the schema was mid-migration, the other because the schema was now ahead of it. Both refused to start against the same database, and the company's only public status page had been down for 38 minutes. The migration job had been OOMKilled mid-run during the upgrade, and Postgres was now in a state neither binary recognized.

Problem signals:

Application logs 'schema version N found, expected M' and exits before serving traffic
Helm rollback to the previous chart version fails with the same or inverse schema error
A migration job pod shows exit code 137 or OOMKilled in kubectl describe
The schema_migrations (or equivalent) table reports a version the table DDL does not actually match
Restoring from the most recent Postgres backup would lose hours of production data the team needs to keep

Both the old and new binary refused to start against the same database

The log line that ruled out a rollback

The on-call had done the obvious thing first. The new chart was failing, so they ran helm rollback to the previous revision. The previous revision's pod came up, hit the database, and crashed with the mirror image of the same error. New code wanted a clean schema 23 and saw a dirty one. Old code wanted 21 and saw 23. Both were sort of right.

That symmetry is what told us the database was the problem, not the chart. A clean rollback should have produced a running pod. If both binaries reject the same database, the database is not in either of the states they expect. It is in a third state nobody coded for.

$ kubectl logs -n statuspage statuspage-app-7b9f-xq2vk
INFO  starting statuspage v0.90.78
INFO  connecting to postgres at postgres.statuspage.svc:5432
ERROR database schema version 23 found, expected 21
FATAL refusing to start with schema version mismatch

$ kubectl rollout history deployment/statuspage-app -n statuspage
REVISION  CHANGE-CAUSE
14        helm upgrade statuspage statuspage/statuspage --version 0.91.2
15        helm rollback statuspage 14

$ kubectl logs -n statuspage statuspage-app-6c4d-7m9pz   # rolled-back pod
ERROR database schema version 23 found, expected 21
FATAL refusing to start with schema version mismatch

Same error from both chart revisions. The database, not the chart, was the problem.

The version row claimed 23. The table DDL was still at 22.

What the schema_migrations table actually said

We dropped into psql against the application database and pulled the migration tracking table. The row said version 23, applied. The dirty flag was true, which on most migration libraries means 'a migration started running and never reported success'. That single boolean was the thread we pulled on for the next hour.

statuspage=> select * from schema_migrations;
 version | dirty
---------+-------
      23 | t
(1 row)

statuspage=> \d incidents
          Table "public.incidents"
   Column    |  Type   | Nullable | Default
-------------+---------+----------+---------
 id          | bigint  | not null |
 service_id  | bigint  | not null |
 started_at  | timestamp without time zone |          |
 resolved_at | timestamp without time zone |          |
 title       | text    |          |
-- expected per migration 0023: severity column, incident_updates FK, partial index on resolved_at IS NULL
-- present: none of the above

The version row was lying. None of migration 0023 had landed in the table structure.

We pulled the migration files out of the chart's image and read them. Migration 0023 was three statements: add a severity column, create an incident_updates table with a foreign key back, create a partial index on unresolved incidents. None of the three was present in the live schema. The OOM had hit after the migration library wrote the version row and before any DDL statement actually committed, or possibly between statements in a transaction that rolled back when the process died. The order was implementation-specific and we did not care about the exact moment, because the answer was the same: none of 0023's DDL had landed, but the bookkeeping said it had.
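The comparison we did by eye can be sketched as a set difference between what a migration should have created and what the catalog reports. The expected names come from migration 0023 as described here; how the observed sets get populated (for example from information_schema and pg_indexes) is left out and assumed.

```python
def missing_objects(expected: dict, observed: dict) -> dict:
    """Return, per object kind, the expected names absent from the live schema."""
    gaps = {}
    for kind, names in expected.items():
        absent = sorted(set(names) - set(observed.get(kind, ())))
        if absent:
            gaps[kind] = absent
    return gaps


expected_0023 = {
    "columns": {"incidents.severity"},
    "tables": {"incident_updates"},
    "indexes": {"incidents_unresolved_idx"},
}
observed = {
    "columns": {
        "incidents.id", "incidents.service_id", "incidents.started_at",
        "incidents.resolved_at", "incidents.title",
    },
    "tables": {"incidents", "schema_migrations"},
    "indexes": set(),
}

print(missing_objects(expected_0023, observed))
```

An empty result would mean the version row and the schema agree; here all three objects come back missing.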

This is the specific failure mode that makes partial migrations dangerous. The migration library and the actual schema disagree, and the application trusts the library. The library trusts a row it wrote in a different transaction than the DDL it was supposed to be tracking. Many migration tools now avoid this by wrapping the version row and the DDL in a single transaction where the database supports it, but a long tail of applications still ship with the older split-transaction behaviour, and you only find out which one you have when something interrupts a migration.
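The difference between the two behaviours is easy to reproduce in miniature. The sketch below uses an in-memory SQLite database as a stand-in for Postgres and a raised exception as a stand-in for the OOM kill; the migrate function is a toy, not any real migration library's API.

```python
import sqlite3


class Killed(Exception):
    """Stands in for the migration process dying mid-run."""


def fresh_db():
    # isolation_level=None puts the connection in autocommit mode,
    # so transactions exist only where we issue BEGIN ourselves.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE schema_migrations (version INTEGER, dirty INTEGER)")
    conn.execute("INSERT INTO schema_migrations VALUES (22, 0)")
    conn.execute("CREATE TABLE incidents (id INTEGER PRIMARY KEY)")
    return conn


def migrate(conn, version, statements, single_transaction, die_before=None):
    if single_transaction:
        conn.execute("BEGIN")
    try:
        conn.execute("UPDATE schema_migrations SET version = ?, dirty = 1", (version,))
        for index, statement in enumerate(statements):
            if index == die_before:
                raise Killed()
            conn.execute(statement)
        conn.execute("UPDATE schema_migrations SET dirty = 0")
        if single_transaction:
            conn.execute("COMMIT")
    except Killed:
        if single_transaction:
            # After a real crash the database discards the open transaction itself.
            conn.execute("ROLLBACK")


def state(conn):
    version, dirty = conn.execute("SELECT version, dirty FROM schema_migrations").fetchone()
    columns = [row[1] for row in conn.execute("PRAGMA table_info(incidents)")]
    return version, bool(dirty), columns


DDL = ["ALTER TABLE incidents ADD COLUMN severity INTEGER NOT NULL DEFAULT 3"]

split = fresh_db()
migrate(split, 23, DDL, single_transaction=False, die_before=0)
print(state(split))    # (23, True, ['id'])

atomic = fresh_db()
migrate(atomic, 23, DDL, single_transaction=True, die_before=0)
print(state(atomic))   # (22, False, ['id'])
```

The split run ends in the state we found in production: the version row says 23 and dirty while the table never changed. The single-transaction run leaves version 22 untouched.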

The base backup was 6 hours stale and uptime data is the product

Why we did not restore from backup

The instinct, and the safe move on most days, is to restore Postgres from the last known-good base backup and replay WAL up to a point just before the migration started. We checked the backup. It was a nightly pg_basebackup, 6 hours old, and WAL archiving had been configured but never tested for PITR. We could probably have done it. We were not willing to bet the status page on 'probably' while the status page was already down.

More importantly, the uptime check history is the product. A status page that loses 6 hours of check data after an outage is worse than a status page that takes another hour to come back. We talked it through with the team lead and decided the database in front of us was recoverable, and recovering it was lower risk than the restore path. That decision is worth naming because it goes against the usual 'just restore from backup' instinct. When the data itself is the value, finishing a half-migration by hand is often the right call.

Option	Trade-off
Restore from base backup + WAL	Would lose up to 6 hours of uptime check history. PITR path was configured but never tested. Estimated 45-90 minutes if it worked the first try, much longer if not.
Finish migration 0023 by hand, fix version row	Three DDL statements, all idempotent-ish if we wrote them with IF NOT EXISTS guards. Preserves all data. Estimated 20 minutes including verification. Chose this.

Three DDL statements, a version-row correction, and a careful restart

Finishing the migration by hand

We pulled migration 0023 verbatim from the chart image, rewrote each statement with IF NOT EXISTS guards so a re-run could not double-apply, and ran them inside a single transaction so any failure left the database where we found it. Before touching anything we took a pg_dump of the application schema and data to a local file. That dump was our 'we can always undo this' insurance, separate from the production backup system.

-- 1. snapshot first, outside any transaction
$ kubectl exec -n statuspage postgres-0 -- \
    pg_dump -U statuspage -Fc statuspage > /tmp/statuspage-pre-repair.dump

-- 2. complete migration 0023 inside one transaction
BEGIN;

ALTER TABLE incidents
  ADD COLUMN IF NOT EXISTS severity smallint NOT NULL DEFAULT 3;

CREATE TABLE IF NOT EXISTS incident_updates (
  id           bigserial PRIMARY KEY,
  incident_id  bigint NOT NULL REFERENCES incidents(id) ON DELETE CASCADE,
  body         text NOT NULL,
  created_at   timestamp without time zone NOT NULL DEFAULT now()
);

CREATE INDEX IF NOT EXISTS incidents_unresolved_idx
  ON incidents (started_at)
  WHERE resolved_at IS NULL;

-- 3. clear the dirty flag, version row already says 23
UPDATE schema_migrations SET dirty = false WHERE version = 23;

-- 4. sanity-check before commit
SELECT version, dirty FROM schema_migrations;
\d incidents
\d incident_updates

COMMIT;

All four statements committed in 180ms: three DDL plus the version-row correction. The read-only checks commit nothing. The dirty flag was the last thing to flip.

After the commit we scaled the application deployment from 0 to 1 (we had scaled it to 0 earlier to stop the CrashLoopBackOff noise) and watched the pod logs. It connected, read schema_migrations, found 23 clean, started serving. We curled the health endpoint, got a 200, then hit /api/services and confirmed all the configured uptime checks were present with their full history intact. Total time from 'database is the problem' to 'status page is back': 51 minutes.

The thing worth saying out loud about this kind of repair: it works because the migration was small and the failure was clean. If 0023 had been a data migration that rewrote a million rows halfway, the recovery would have looked very different and the backup-restore path would have won. Always read the failed migration before you decide which recovery to attempt. We have written more about that decision in the migration recovery playbook.

A schema snapshot before every migration, and migration jobs that cannot be OOMKilled

The pre-upgrade hook we shipped the next day

Two changes went in within 24 hours. First, the Helm chart now has a pre-upgrade hook that runs a pg_dump of the schema (not the data, just the structure plus schema_migrations row) to an object storage bucket, tagged with the chart version it is about to migrate to. If a future migration goes sideways, the recovery starts from a known structure-level snapshot taken seconds before the migration began, not from the nightly backup. The hook adds about 4 seconds to every upgrade and has paid for itself once already.

Second, the migration job spec got real resource requests and limits, and the limit is now 2x the largest migration we have ever observed in staging plus a 50% buffer. The OOM happened because the chart shipped with a 256Mi limit and the migration that finally hit the production data size needed about 380Mi. We also removed the liveness probe from the migration job entirely. A migration that takes longer than expected should not be killed by Kubernetes; it should be allowed to either finish or fail on its own terms so the migration library can write a clean dirty=true and exit.
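As a worked example of that sizing rule, here is one reading of it: twice the observed peak, then a 50% buffer on top. Both the reading and the use of the roughly 380Mi production figure as the input are assumptions for illustration.

```python
import math


def migration_memory_limit_mib(peak_observed_mib: float,
                               multiplier: float = 2.0,
                               buffer: float = 0.5) -> int:
    """Memory limit in MiB: peak * multiplier, plus a fractional buffer."""
    return math.ceil(peak_observed_mib * multiplier * (1 + buffer))


def would_oom(limit_mib: float, needed_mib: float) -> bool:
    """True if the workload needs more memory than the limit allows."""
    return needed_mib > limit_mib


print(would_oom(256, 380))               # True: the original limit was too small
print(migration_memory_limit_mib(380))   # 1140
```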

The shape of every upgrade now. The branch on the right is the one that saved us.

We have stopped recommending that teams skip the pre-migration schema snapshot just because their database is 'small enough to restore from nightly'. The nightly backup answers a different question than the schema snapshot. The nightly tells you what the data looked like 6 hours ago. The schema snapshot tells you what shape the database was in seconds before this specific migration began, which is the thing you need when a migration is the thing that broke.

Recovering a partial migration without losing the data behind it

When a status page is the thing that is down

The reason this kind of incident is hard is not the SQL. The SQL is usually three or four statements you can read off the migration file. The hard part is deciding whether to finish by hand or restore from backup, and that decision depends on details most teams have not catalogued: how their migration library handles the version row, whether PITR has ever been tested, whether the failed migration is structural or data-rewriting, and whether the data between the last backup and now is recoverable some other way.

We run these recovery engagements regularly. We have seen the OOMKilled-migration-job pattern four times in the last year, two of them on status pages or monitoring tools where the data IS the product, and we have a checklist for the decision now. If you are staring at a CrashLoopBackOff with a schema version mismatch in the logs and a Helm rollback that did not help, book an infrastructure review and we will be on a bridge with you the same day to work through the finish-by-hand versus restore decision before you commit to either.

Originally published at https://infraforge.agency/insights/recovering-status-page-half-finished-schema-migration/.


When a validating webhook blocks the ConfigMap that would fix it

Muhammad Hassaan Javed — Tue, 16 Jun 2026 20:19:47 +0000

The kubectl patch came back as a webhook call failure, connection refused, not a credentials error. That was the moment the incident stopped being about a rotated MongoDB password and started being about the admission layer. A ValidatingWebhookConfiguration with failurePolicy: Fail was pointed at a webhook pod that was crash-looping on a bad liveness probe, and the only way to fix the webhook pod was to patch a ConfigMap that the webhook itself was supposed to validate. The safety mechanism had become the outage. Our profile service was down because its database credentials were stale, the fix for the credentials was a one-line patch, and that one-line patch could not be applied because the thing that was supposed to keep ConfigMaps safe was rejecting every write in the namespace.

Problem signals:

kubectl patch or apply on a ConfigMap returns a failed calling webhook error (connection refused, or a timeout) instead of a normal validation error
An admission webhook pod is CrashLoopBackOff while its ValidatingWebhookConfiguration is set to failurePolicy: Fail
ArgoCD shows sync pending or OutOfSync on resources in the affected namespace and the sync will not progress
A workload reads stale config from a ConfigMap that was supposedly already updated, and a Deployment-level env var is shadowing the mounted value
Compliance requires that admission webhooks remain failurePolicy: Fail in production, so flipping to Ignore as a workaround is itself an audit event

We thought it was a credentials incident for the first 20 minutes

The patch that came back as a webhook call failure

The page that started the call was a profile service returning 500s on /health. The cause looked obvious. The data layer team had rotated the MongoDB credentials the day before, and the live ConfigMap in the application namespace still held the old password. There was a backup ConfigMap sitting next to it with the rotated values, labelled exactly the way the runbook described. The fix was supposed to be a thirty second kubectl patch.

It was not. The patch came back with this:

$ kubectl -n app patch configmap profile-mongodb-config \
    --type merge --patch-file rotated.yaml
Error from server (InternalError): Internal error occurred:
failed calling webhook "configmap-validator.app.svc":
Post "https://configmap-validator.app.svc:443/validate?timeout=10s":
dial tcp 10.42.7.83:443: connect: connection refused

The actual error. Not a credentials problem, an admission control problem.

That error string is the whole story. The cluster had a ValidatingWebhookConfiguration named configmap-validator that intercepted every ConfigMap write in the namespace. The webhook pod was supposed to enforce a schema policy that the compliance team owned. Right now the webhook pod was not answering on its service IP, which meant every ConfigMap write was failing closed, which meant our credential fix was failing closed, which meant the profile service stayed down.
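The distinction that mattered can be captured in a small classifier over kubectl's error text. This is a sketch, assuming the two message shapes the API server commonly produces: "failed calling webhook" when the webhook cannot be reached, and 'admission webhook "..." denied the request' when it answers and rejects. The category names are our own.

```python
import re


def classify_admission_error(stderr: str) -> str:
    """Sort a kubectl error into unreachable webhook, policy denial, or other."""
    if "failed calling webhook" in stderr:
        return "webhook-unreachable"
    if re.search(r'admission webhook "[^"]+" denied the request', stderr):
        return "policy-denied"
    return "other"


observed_error = (
    "Error from server (InternalError): Internal error occurred:\n"
    'failed calling webhook "configmap-validator.app.svc":\n'
    'Post "https://configmap-validator.app.svc:443/validate?timeout=10s":\n'
    "dial tcp 10.42.7.83:443: connect: connection refused"
)

print(classify_admission_error(observed_error))  # webhook-unreachable
```

An unreachable webhook is an availability problem in the admission layer; a denial is the policy doing its job. The two call for different responders.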

We had walked into this kind of shape before, but usually on the cert-manager side. This time the trap was tighter: the webhook was supposed to validate the very ConfigMaps that controlled the workloads in its own namespace, and one of those workloads happened to be down for an unrelated reason. Two independent failures had stacked into a deadlock.

Why the pod was crash-looping and why nothing could fix it in place

The webhook was guarding its own ConfigMap

kubectl describe on the configmap-validator pod told us the liveness probe was failing. The pod was getting killed every 30 seconds, restarting, getting killed again. The probe was hitting /healthz on port 8443. The actual application served its health endpoint on /health, no z. Someone had copy-pasted a probe spec from an older service months ago and nobody had noticed because the webhook had been running fine until a recent image bump shifted the health route.
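A mismatch like this is cheap to detect mechanically. The sketch below assumes the Deployment is available as a dict (for example from kubectl get deploy -o json) and that the path the application really serves is known; the container name in the example is hypothetical.

```python
def liveness_path_mismatches(deployment: dict, served_path: str) -> list:
    """Return (container, probe_path) pairs whose HTTP liveness path differs."""
    mismatches = []
    containers = deployment["spec"]["template"]["spec"]["containers"]
    for container in containers:
        http_get = container.get("livenessProbe", {}).get("httpGet", {})
        path = http_get.get("path")
        if path is not None and path != served_path:
            mismatches.append((container["name"], path))
    return mismatches


deployment = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "validator",  # hypothetical container name
                        "livenessProbe": {"httpGet": {"path": "/healthz", "port": 8443}},
                    }
                ]
            }
        }
    }
}

print(liveness_path_mismatches(deployment, "/health"))  # [('validator', '/healthz')]
```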

Fixing a Deployment in Kubernetes is normally a kubectl edit deploy or a kubectl patch on the probe spec and you are done. That was not available to us. The webhook configuration intercepted ConfigMap writes, not Deployment writes, so technically we could have patched the Deployment directly. Except the Deployment mounted a ConfigMap for its own startup arguments, and our platform team had a hard rule against in-cluster edits that drifted from the GitOps source. ArgoCD would self-heal the Deployment back to the broken probe spec inside 90 seconds.

We needed the correct health path, and the compliance team kept the canonical values in a separate namespace. Their ConfigMap held the approved liveness path, the approved annotation policy, and the compliance acknowledgement token that any incident response was required to reference. We pulled it:

$ kubectl -n compliance get configmap webhook-standards -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: webhook-standards
  namespace: compliance
data:
  liveness-path: "/health"
  readiness-path: "/ready"
  failure-policy: "Fail"
  ack-token: "COMP-ACK-7c3f9a-2024Q4"
  required-annotations: |
    incident.compliance/id
    incident.compliance/services
    incident.compliance/ack-token

The compliance source of truth. We read these values; we did not retype them.
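
As a rough illustration of reading the approved value rather than retyping it, the Python sketch below compares a Deployment's liveness probe path with the liveness-path key from the ConfigMap above. The Deployment fragment and the container name validator are assumptions made for the example; only the /healthz and /health paths and port 8443 come from the incident.

```python
def liveness_path_mismatches(deployment: dict, approved_path: str) -> list[str]:
    """Return 'container: path' for each HTTP liveness probe that differs from the approved path."""
    mismatches = []
    containers = deployment["spec"]["template"]["spec"]["containers"]
    for container in containers:
        http_get = container.get("livenessProbe", {}).get("httpGet")
        if http_get and http_get.get("path") != approved_path:
            mismatches.append(f'{container["name"]}: {http_get.get("path")}')
    return mismatches


# Values as they appear in the compliance ConfigMap's data section.
standards = {"liveness-path": "/health", "readiness-path": "/ready"}

# Hypothetical Deployment fragment shaped like the broken webhook Deployment.
deployment = {"spec": {"template": {"spec": {"containers": [
    {"name": "validator",
     "livenessProbe": {"httpGet": {"path": "/healthz", "port": 8443}}},
]}}}}

print(liveness_path_mismatches(deployment, standards["liveness-path"]))
```

A check like this could run in CI against rendered manifests, so a probe path that drifts from the compliance value is caught before an image bump exposes it.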

Flip failurePolicy to Ignore, or delete the ValidatingWebhookConfiguration?

Choosing the smaller blast radius

There were two ways to unblock ConfigMap writes. We could delete the ValidatingWebhookConfiguration entirely, fix everything, and recreate it from the GitOps source. Or we could patch failurePolicy from Fail to Ignore for the length of the recovery and patch it back when we were done.

Option	What it does and what it risks
Option A. Delete the webhook configuration	Cleanest cut. ConfigMap writes unblock instantly. Risk: an unrelated team applies an out-of-policy ConfigMap during the window and we do not catch it. Also generates a louder audit event because the object disappears from etcd.
Option B. Patch failurePolicy to Ignore	Webhook is still called; if the pod is up it still validates; if it is down, writes pass. Smaller blast radius. Audit log shows a field change, not a delete. We picked this one.

Option B won because of the audit trail. The compliance team would rather see one field flip and one field flip back on the same object, with the same metadata.uid across the incident, than see a delete and a recreate that produces a new object with a new UID. That is the kind of preference you only learn by sitting through an audit. We have written more about this kind of constraint in our infrastructure audit readiness work.

# Step 1. Snapshot the current webhook config so we can prove what we changed.
kubectl get validatingwebhookconfiguration configmap-validator \
  -o yaml > /tmp/vwc-before.yaml

# Step 2. Flip failurePolicy to Ignore, scoped to this single webhook entry.
kubectl patch validatingwebhookconfiguration configmap-validator \
  --type='json' \
  -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'

# Step 3. Confirm the change before touching any ConfigMap.
kubectl get validatingwebhookconfiguration configmap-validator \
  -o jsonpath='{.webhooks[0].failurePolicy}'
# expect: Ignore

The unblock. Three commands, one of which is a snapshot for the post-incident review.
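
One way to make the flip and the flip back harder to get wrong is to generate both JSON patches from the snapshot, looking the webhook entry up by name instead of assuming index 0. The sketch below is an assumption-laden illustration, not the script we ran: the entry name configmaps.validator.example.com is hypothetical, and the extra test operation (standard JSON Patch, RFC 6902) is our addition so that a patch fails if the field is not in the expected state.

```python
import json


def failure_policy_patches(vwc: dict, webhook_name: str, temporary: str = "Ignore"):
    """Build (forward, restore) JSON Patch documents for one webhook entry."""
    for index, hook in enumerate(vwc.get("webhooks", [])):
        if hook["name"] == webhook_name:
            path = f"/webhooks/{index}/failurePolicy"
            original = hook["failurePolicy"]
            # 'test' aborts the whole patch if the current value is unexpected.
            forward = [
                {"op": "test", "path": path, "value": original},
                {"op": "replace", "path": path, "value": temporary},
            ]
            restore = [
                {"op": "test", "path": path, "value": temporary},
                {"op": "replace", "path": path, "value": original},
            ]
            return forward, restore
    raise KeyError(f"no webhook entry named {webhook_name!r}")


# Hypothetical snapshot, trimmed to the fields the function reads.
snapshot = {"webhooks": [
    {"name": "configmaps.validator.example.com", "failurePolicy": "Fail"},
]}

forward, restore = failure_policy_patches(snapshot, "configmaps.validator.example.com")
print(json.dumps(forward))
print(json.dumps(restore))
```

Each printed document can be passed to kubectl patch --type=json -p. Keeping the restore patch next to the snapshot means the person who ends the incident does not have to reconstruct the original value from memory.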

We patched the ConfigMap and the service was still broken

The env var that made the credential fix invisible

With failurePolicy on Ignore, the credential patch went through. We pulled the rotated values from the backup ConfigMap, applied them to the live one, and watched the profile service pods restart. /health still returned 500. The MongoDB connection error in the application logs still showed the old username.

That was the second moment in the incident where the model of the world had to change. The ConfigMap held the new credentials. The pod environment did not. Something else was setting MONGODB_URI on the container at runtime, and it was winning.

$ kubectl -n app get deploy profile-service \
    -o jsonpath='{.spec.template.spec.containers[0].env}' | jq
[
  { "name": "MONGODB_URI",
    "valueFrom": {
      "configMapKeyRef": {
        "name": "profile-mongodb-config",
        "key": "uri"
      }
    }
  },
  { "name": "PROFILE_MONGODB_URI_OVERRIDE",
    "value": "mongodb://oldapp:oldpw@mongo-old.app.svc:27017/profiles"
  }
]

The override. Set during a migration test six weeks earlier and never removed.

The application code read PROFILE_MONGODB_URI_OVERRIDE if it was set and otherwise read MONGODB_URI. The override had been added during a migration drill six weeks ago, never cleaned up, and was now silently shadowing every ConfigMap update we tried to apply. We have stopped accepting break-glass env overrides on production Deployments for this exact reason. If the override is worth setting, it is worth its own ConfigMap with an expiry annotation that a controller cleans up. Naked env values on the Deployment spec are invisible to the operators who do not know to look for them.

We removed the env var, the pod rolled, and /health came back as 200 on the third pod we curled.
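
The shadowing is easy to reproduce in a few lines. The sketch below models the precedence described above and adds a small audit helper that lists env entries set as literal values on a Deployment. The function names and the new-credential URI are illustrative assumptions; the variable names and the old URI are the ones from the output above.

```python
def resolve_mongo_uri(env: dict) -> str:
    """Mirror the application's precedence: the override wins whenever it is set."""
    if "PROFILE_MONGODB_URI_OVERRIDE" in env:
        return env["PROFILE_MONGODB_URI_OVERRIDE"]
    return env["MONGODB_URI"]


def literal_env_entries(deployment: dict) -> list[tuple[str, str]]:
    """List (container, variable) pairs whose value is hardcoded on the Deployment spec."""
    found = []
    for container in deployment["spec"]["template"]["spec"]["containers"]:
        for entry in container.get("env", []):
            if "value" in entry:  # literal value, not valueFrom
                found.append((container["name"], entry["name"]))
    return found


# Pod environment after the ConfigMap fix (the new URI is a made-up placeholder).
pod_env = {
    "MONGODB_URI": "mongodb://newapp:newpw@mongo.app.svc:27017/profiles",
    "PROFILE_MONGODB_URI_OVERRIDE": "mongodb://oldapp:oldpw@mongo-old.app.svc:27017/profiles",
}

deployment = {"spec": {"template": {"spec": {"containers": [{
    "name": "profile-service",
    "env": [
        {"name": "MONGODB_URI", "valueFrom": {"configMapKeyRef": {
            "name": "profile-mongodb-config", "key": "uri"}}},
        {"name": "PROFILE_MONGODB_URI_OVERRIDE",
         "value": "mongodb://oldapp:oldpw@mongo-old.app.svc:27017/profiles"},
    ],
}]}}}}

print(resolve_mongo_uri(pod_env))
print(literal_env_entries(deployment))
```

Running an audit like literal_env_entries across production Deployments is a cheap way to find overrides before a credential rotation depends on them being absent.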

Putting the webhook back, and making sure this never happens the same way again

Restoring failurePolicy: Fail without re-creating the trap

Before we restored failurePolicy to Fail, we fixed the webhook pod. The liveness probe path went from /healthz to /health, the value we had read from the compliance ConfigMap. The pod came up healthy and stayed up. We confirmed the webhook was actually answering by sending a deliberately invalid ConfigMap and watching the validation rejection come back cleanly. Only then did we flip failurePolicy back.

The harder problem was structural. A failurePolicy: Fail webhook that gates ConfigMap writes in a namespace is fine. A failurePolicy: Fail webhook that gates ConfigMap writes in a namespace that also contains the webhook itself, where the webhook's own Deployment depends on ConfigMaps in that namespace, is a bootstrap hazard. The first time something goes wrong, you cannot fix it without breaking your own policy.

The deadlock in one picture. Every fix path routes through a ConfigMap write the webhook itself is blocking.

We made two changes before we left. First, we moved the webhook's own Deployment, Service, and startup ConfigMap out of the app namespace and into a dedicated webhooks namespace, then set a namespaceSelector that excludes webhooks from validation, so the webhook can be rebuilt from its own in-cluster ConfigMaps even when it is the thing that is broken. Every ConfigMap in app is still validated, which is the whole point of the policy. Second, we added an objectSelector that excludes ConfigMaps carrying a specific break-glass label, so an on-call engineer with the right RBAC can apply a labelled emergency patch without flipping failurePolicy at all. Both changes were reviewed by the compliance team before we merged them, because relaxing the scope of a Fail-policy webhook is itself an audit decision.
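
To make the two selectors concrete, here is a simplified Python model of how label selectors decide whether the webhook is called for a given write. It is a sketch of the matching semantics, not the API server's code. The break-glass label key incident.compliance/break-glass is a hypothetical name; kubernetes.io/metadata.name is the label Kubernetes sets automatically on every namespace.

```python
def matches(selector: dict, labels: dict) -> bool:
    """Evaluate a Kubernetes-style label selector against a label set."""
    for key, value in selector.get("matchLabels", {}).items():
        if labels.get(key) != value:
            return False
    for expr in selector.get("matchExpressions", []):
        key, op, values = expr["key"], expr["operator"], expr.get("values", [])
        if op == "In" and labels.get(key) not in values:
            return False
        if op == "NotIn" and labels.get(key) in values:
            return False
        if op == "Exists" and key not in labels:
            return False
        if op == "DoesNotExist" and key in labels:
            return False
    return True


# Skip the dedicated webhooks namespace entirely.
NAMESPACE_SELECTOR = {"matchExpressions": [
    {"key": "kubernetes.io/metadata.name", "operator": "NotIn", "values": ["webhooks"]},
]}
# Skip ConfigMaps that carry the (hypothetical) break-glass label.
OBJECT_SELECTOR = {"matchExpressions": [
    {"key": "incident.compliance/break-glass", "operator": "DoesNotExist"},
]}


def webhook_is_called(namespace_labels: dict, object_labels: dict) -> bool:
    """The webhook only sees a request when both selectors match."""
    return (matches(NAMESPACE_SELECTOR, namespace_labels)
            and matches(OBJECT_SELECTOR, object_labels))


app_ns = {"kubernetes.io/metadata.name": "app"}
webhooks_ns = {"kubernetes.io/metadata.name": "webhooks"}

print(webhook_is_called(app_ns, {}))
print(webhook_is_called(webhooks_ns, {}))
print(webhook_is_called(app_ns, {"incident.compliance/break-glass": "true"}))
```

The design choice worth noting is that both exclusions are narrow: one namespace and one label. Who may set that label is a separate RBAC question that the selector does not answer.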

The recovery script we left behind reads every value it needs (the health path, the ack token, the affected service list) from cluster state rather than hardcoding. Hardcoded recovery scripts go stale within a quarter; scripts that read from a compliance-owned ConfigMap stay correct as long as the source of truth is maintained. The script is idempotent: rerunning it on an already-recovered cluster is a no-op, which matters because the on-call engineer who runs it at 3 am should not have to think about whether they are the first or the third person to run it that night.

When to call us, and what we will look at first

If your admission layer is gating its own recovery

The thing that makes this incident shape hard is not the webhook itself. It is that the recovery path is non-obvious, the audit consequences of the obvious workaround (flipping policy or deleting the webhook config) are real, and the second-order trap (an env var on a Deployment shadowing the ConfigMap you just fixed) only shows up after you have already burned the credibility from the first workaround. Teams who hit this for the first time usually solve the immediate outage but leave the structural deadlock in place, and then it happens again on a different webhook six months later.

We run these recovery engagements every week. The admission-webhook-blocks-its-own-config shape has come up four times this year for us, once with cert-manager involved, twice with policy webhooks like this one, once with a service mesh sidecar injector that depended on a ConfigMap in its own namespace. The env-override-shadowing-a-ConfigMap-fix pattern is even more common; we see some version of it in roughly half of the credential rotation incidents we are called into.

If your cluster has a Fail-policy admission webhook today and you have never tested what happens when its pod is down, book an infrastructure review with our team and we will start with a 30-minute diagnostic call this week. We will walk your webhook configurations, identify the ones that gate their own dependencies, and give you a concrete plan to break the loop before an incident finds it for you.

Originally published at https://infraforge.agency/insights/admission-webhook-configmap-deadlock-recovery/.

If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements — see /review.

ArgoCD CVE-2022-24348: a Secret leak that hid in log volume

Muhammad Hassaan Javed — Tue, 02 Jun 2026 02:05:12 +0000

The first thing we saw in Loki was a fanout service log line that contained the string 'a2V5Y2xvYWstY2xpZW50' repeated about 40 times in a single minute. Base64 decode: 'keycloak-client'. The fanout service had no business reading anything from the keycloak namespace. It had been emitting fragments of another namespace's client-secret for three days, quietly, while Grafana OnCall sat on a low-priority log volume alert that nobody clicked. The vector turned out to be CVE-2022-24348, the ArgoCD directory traversal bug, riding in on a ConfigMap key that an automation script had committed without anyone noticing.

Problem signals:

A low-priority alert on log volume spikes that nobody investigated for days
ConfigMap keys with URL values that contain '../' segments
Application logs containing base64 strings that decode to credential-shaped prefixes
ArgoCD Application source.repoURL values that point outside the expected repo root
ConfigMap changes in the cluster that have no matching Git commit

Why a fanout service was emitting Keycloak credential fragments

The log line that should not have existed

An on-call engineer was triaging an unrelated paging storm and, out of habit, ran a Loki query against the noisiest service of the previous week. The fanout service had spiked from roughly 200 log lines per minute to 6,400 per minute three days earlier and had stayed there. The lines looked like garbage. They were not garbage.

{app="fanout-service"} | logfmt | __error__=""

# Sample line (sanitized):
level=info msg="resolved source repo" repo="a2V5Y2xvYWstY2xpZW50LXNlY3JldDovL2NsaWVudC1pZD1ibGVhdGVyLWFwaQ==" component=helm-renderer

The base64 in the repo field decodes to 'keycloak-client-secret://client-id=bleater-api'. The fanout service was logging the resolved value of a config key that should never have resolved to a Secret.

We pulled the live ConfigMap. The offending key was named ARGOCD_APP_SOURCE_REPO_URL and its value was 'gitea.internal/platform/fanout/../keycloak-secrets'. That single '../' segment is the entire CVE-2022-24348 exposure. The ArgoCD Helm renderer, in vulnerable versions, would normalize the path after resolving it, walk out of the intended repo root, and read whatever Helm values or Secret references it found in the sibling directory. In this case the sibling directory was a Helm chart that templated the Keycloak client-secret Secret into its values. The fanout service's own application code, which logged its resolved configuration on startup and on every reconcile, then dumped fragments of that Secret into Loki as base64.
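
The class of bug is easy to show in isolation. The sketch below is not Argo CD's code. It is a minimal Python illustration, under the assumption that the repo root is gitea.internal/platform/fanout, of why a path has to be normalized before it is checked against its root.

```python
import posixpath


def escapes_root(repo_root: str, candidate: str) -> bool:
    """True if the candidate path, once normalized, lands outside repo_root."""
    root = posixpath.normpath(repo_root)
    resolved = posixpath.normpath(candidate)
    return resolved != root and not resolved.startswith(root + "/")


ROOT = "gitea.internal/platform/fanout"  # assumed intended root for this example
BAD = "gitea.internal/platform/fanout/../keycloak-secrets"

# A naive prefix check on the raw string is fooled by the '../' segment.
print(BAD.startswith(ROOT + "/"))
# Normalizing first shows where the path really points.
print(posixpath.normpath(BAD))
print(escapes_root(ROOT, BAD))
```

The raw string passes a prefix check because it begins with the root, but after normalization it resolves to the sibling keycloak-secrets directory. Checking containment only on the normalized path closes that gap.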

Three days. The fanout service itself was healthy the entire time. RabbitMQ consumers were running, distribution was working, the SLO board was green. The exposure was completely silent from a functional standpoint.

Log volume alerts without log content are noise generators

Why the alert sat for three days

The Grafana OnCall alert that fired three days earlier said, in effect, 'fanout-service log volume is 30x baseline'. It was tagged P3 and routed to a Slack channel that the team treats as a digest. The runbook attached to the alert said to check for retry loops. The on-call engineer at the time did check, saw no retries in the RabbitMQ metrics, and silenced the alert for 24 hours. The silence got renewed twice by the rotation handoff.

This is the part of the story we keep seeing across client engagements. A log volume alert that does not inspect log content tells you something changed, not what changed. If the alert had matched on the byte pattern of base64 strings longer than 32 characters in a service that does not normally emit base64, the page would have been P1 and would have gone to a human within minutes. Volume alone is not a signal anyone can act on in under an hour, so it gets silenced.
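
A content-aware check does not need to be elaborate. The Python sketch below is an illustrative assumption about how such a matcher could work, not the alert rule itself: it looks for base64-shaped runs of at least 32 characters and keeps only the ones that decode to printable text.

```python
import base64
import re

# Runs of base64 alphabet characters, 32 or more, plus optional padding.
B64_RUN = re.compile(r"[A-Za-z0-9+/]{32,}={0,2}")


def decoded_base64_strings(line: str) -> list[str]:
    """Return the printable decodings of long base64 runs found in a log line."""
    decoded = []
    for candidate in B64_RUN.findall(line):
        if len(candidate) % 4 != 0:
            continue  # not a complete base64 string
        try:
            text = base64.b64decode(candidate, validate=True).decode("utf-8")
        except ValueError:  # covers binascii.Error and UnicodeDecodeError
            continue
        if text.isprintable():
            decoded.append(text)
    return decoded


leak = ('level=info msg="resolved source repo" '
        'repo="a2V5Y2xvYWstY2xpZW50LXNlY3JldDovL2NsaWVudC1pZD1ibGVhdGVyLWFwaQ==" '
        'component=helm-renderer')
normal = 'level=info msg="fanout delivered" count=12'

print(decoded_base64_strings(leak))
print(decoded_base64_strings(normal))
```

Requiring the run to decode to printable text is a heuristic. It will miss binary payloads and can fire on long alphanumeric identifiers, so the threshold and the decode check would need tuning per service.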

We have written more on this in our GitOps and ArgoCD recovery cluster, where the same pattern shows up under different vectors. The constant is that GitOps systems concentrate trust in the manifest pipeline, and any leak in that pipeline tends to surface first as 'weird logs' before it surfaces as anything else.

The pod restart that the ConfigMap patch alone does not give you

Patching the ConfigMap was not the fix

The instinct, once we identified the bad key, was to kubectl edit configmap and delete the line. We did not do that, for two reasons. First, the ConfigMap was managed by ArgoCD; a live edit would last until the next sync. Second, even after the ConfigMap was clean, the existing pods would still have the malicious URL in their environment because envFrom only resolves at pod start. The leak would continue until the pods were rolled.

The correct sequence had four steps and the order mattered.

Step	What it does
1. Commit the fix to Git first	We removed the ARGOCD_APP_SOURCE_REPO_URL key from the ConfigMap manifest in the platform repo and opened a PR. ArgoCD was the source of truth, so any cluster-side edit would be reverted. The PR also added a comment explaining the CVE so the next person reading the repo would understand the deletion.
2. Sync ArgoCD with prune disabled	We forced the sync immediately rather than waiting for the next polling interval. We left prune disabled because we wanted to confirm exactly one diff: the deletion of the bad key. Surprise prunes during a security remediation are how secondary incidents start.
3. Roll the pods explicitly	kubectl rollout restart deployment/fanout-service. The ConfigMap was clean but the pod environments still held the resolved value. Until the pods restarted, every reconcile loop in the running process kept logging the leaked fragment. The rollout took 90 seconds.
4. Verify in Loki before declaring done	We ran the same Loki query that found the leak, scoped to the time window after the rollout completed. Zero matches. Then we ran it across a 30-minute window to be sure we were not just hitting log buffer lag. Still zero. That was the moment we stopped holding our breath.

# Verify the ConfigMap is clean
kubectl get configmap fanout-service-config -n bleater -o json \
  | jq '.data | keys[]' | grep -i repo_url
# (should return nothing)

# Verify pods restarted after the ConfigMap fix timestamp
kubectl get pods -n bleater -l app=fanout-service \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.startTime}{"\n"}{end}'

# Verify the ArgoCD Application source has no traversal
kubectl get application fanout-service -n argocd \
  -o jsonpath='{.spec.source.repoURL}' | grep -F '../' \
  && echo 'STILL VULNERABLE' || echo 'clean'

# Loki check, post-restart window only
logcli query --since=15m \
  '{app="fanout-service"} |~ "a2V5Y2xvYWs|keycloak"'

The four checks we ran in order. Any non-empty result on any of them would have meant the remediation was not complete.

Treating the leak as breached until proven otherwise

Auditing whether the Secret was actually read

The harder question was not 'did we stop the leak' but 'did anyone read the leaked data while it was leaking'. The fragments were in Loki, which meant anyone with Loki read access to the bleater namespace logs could have seen them. We pulled the Loki audit log for the three-day window and listed every query that matched fanout-service logs. Twelve queries from four engineers, all of them looking at unrelated debugging work, none of them filtering on the byte pattern that would have exposed the credential.

That was reassuring but not sufficient. The Keycloak client-secret had to be treated as compromised regardless, because we could not prove the absence of external log exfiltration with high confidence. We rotated the client-secret, redeployed the services that used it, and audited Keycloak's own access log for any token issuance using the old secret from an unexpected source IP in the exposure window. We found none. The rotation took about 25 minutes including service redeployment.

We then went back to the original question that nobody had asked yet: how did the malicious ConfigMap key get there in the first place. The automation script that applied it was a 'config sync' job that pulled key-value pairs from a shared spreadsheet and wrote them into the ConfigMap. The spreadsheet was editable by a wider group than the cluster was. Somebody had added the URL three days earlier, probably as a copy-paste mistake from a different document, and the sync job had faithfully applied it. There was no Git commit, no PR, no review.

Three controls that close this class of failure

What we changed so it cannot happen quietly again

We made three changes after this incident, in priority order.

The first was an admission webhook that rejects any ConfigMap apply where a string value contains '../' or matches the shape of a URL pointing outside an allowlist of internal domains. The rule is 12 lines of OPA policy. We tested it against six months of historical ConfigMap diffs and it would have caught this exact incident on day zero. It also catches the more common case of someone pasting a localhost URL into a shared config.

The second was retiring the spreadsheet-driven sync job. Every ConfigMap that lands in the cluster now comes from a Git commit, has a commit SHA annotation, and fails admission if the annotation is missing or does not match a real commit in the repo. The work to migrate the existing key-value pairs took about a week. The job is gone and is not coming back.

The third was rewriting the log volume alert. The new version fires when log lines from a service that does not normally emit base64, fanout-service included, contain base64-encoded strings longer than 24 characters at a rate above 5 per minute. It is a Loki alerting rule with a regex match and it pages a human at P1. In its first week it raised two false positives (both were legitimate JWT logging that we then removed) and zero real incidents. We consider that a healthy signal-to-noise ratio for a security alert.
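
For readers who want to see the shape of the first control, the admission rule on ConfigMap values, here is its logic restated in Python. The production rule is OPA policy and is not reproduced here. The allowlist contents and the function name are assumptions for illustration.

```python
import re
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"gitea.internal"}  # hypothetical allowlist of internal domains
HAS_SCHEME = re.compile(r"^[a-z][a-z0-9+.-]*://", re.IGNORECASE)


def configmap_violations(data: dict[str, str]) -> list[str]:
    """Return one message per ConfigMap value that the rule would reject."""
    problems = []
    for key, value in data.items():
        if "../" in value:
            problems.append(f"{key}: contains '../'")
            continue
        if HAS_SCHEME.match(value):
            host = urlsplit(value).hostname or ""
            allowed = any(host == h or host.endswith("." + h) for h in ALLOWED_HOSTS)
            if not allowed:
                problems.append(f"{key}: host '{host}' is not on the allowlist")
    return problems


print(configmap_violations({
    "ARGOCD_APP_SOURCE_REPO_URL": "gitea.internal/platform/fanout/../keycloak-secrets",
}))
print(configmap_violations({"CALLBACK_URL": "http://localhost:8080/hook"}))
print(configmap_violations({
    "REPO_URL": "https://gitea.internal/platform/fanout",
    "LOG_LEVEL": "info",
}))
```

The traversal check runs first and on the raw string, because a value like the one in this incident has no scheme and would never reach the URL branch.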

We also upgraded ArgoCD past the CVE-2022-24348 fix line. That should have happened a year earlier. If you are running an ArgoCD version older than the patched releases (2.3.0, 2.2.4, and 2.1.9 on their respective release lines), stop reading this and go check, because the same vector is sitting in your cluster waiting for an unlucky ConfigMap edit.

The control surface after the incident. The two new gates are the admission webhook on apply and the content-aware log alert at runtime.

When a CVE has been silently active in your cluster for days

If you are staring at a similar exposure window

The hard part of this kind of incident is not the patch. The patch is one line. The hard part is reconstructing the exposure window with enough confidence to know what to rotate, what to disclose, and what to audit. That work requires log retention you can query precisely, audit trails for the systems that read those logs, and the discipline to treat any leaked credential as compromised until the access logs say otherwise. Teams that have not rehearsed this work tend to do all three poorly the first time.

We run GitOps and ArgoCD recovery engagements every month. We have seen the path traversal CVE three times in the last year, all on clusters running ArgoCD versions the operators thought were current, and we have seen the same 'silent ConfigMap injection via shared spreadsheet' antipattern more often than that. The remediation pattern is the same. The audit pattern is the same. The controls that close the gap are the same.

If you suspect a similar exposure in your cluster right now, request an infrastructure review and we will start with a 30-minute diagnostic call this week to scope the audit window and the rotation list. If the exposure is active, we will be on a bridge with your team the same day.

Originally published at https://infraforge.agency/insights/argocd-cve-2022-24348-path-traversal-secret-leak-recovery/.


Why Grafana OnCall acknowledgments hang after a Helm upgrade migration

Muhammad Hassaan Javed — Mon, 01 Jun 2026 02:59:51 +0000

The call did not come from our on-call rotation. It came from a customer who noticed two unrelated degradations on their side and asked why we had not paged. We had not paged because Grafana OnCall had been silently swallowing alerts for roughly 72 hours. Every new firing alert was being deduplicated into one of two zombie incidents, and every attempt to acknowledge or resolve either of them returned HTTP 500. The on-call engineer who first tried to clear them that morning had assumed the spinner was a UI bug and moved on. The thing meant to wake us up was the thing that was broken.

Problem signals:

OnCall UI Acknowledge and Resolve buttons spin and time out with a generic 500
New alerts from real degradations get deduplicated into an incident that cannot be cleared
OnCall pod logs show ORM errors referencing a column that does not exist in the table
The Helm post-upgrade migration job reported success but Postgres logs show a lock_timeout on one ALTER TABLE
There is no Prometheus alert on OnCall's own API error rate, so the regression went undetected

72 hours of swallowed alerts and two zombie incidents absorbing all of them

The alerting platform was the incident

When we got on the bridge, OnCall's incident list looked almost healthy. Two incidents in firing state, both from three days earlier, both with zero acknowledgment events. That should have been impossible. The on-call rotation had been live the whole time, and the runbook said any firing incident over 15 minutes old gets escalated. Nothing had been escalated because nothing new had appeared. Every alert fired by Prometheus Alertmanager in those 72 hours had been deduplicated by labels and folded into one of those two zombies.

The first thing we tried was the obvious one. Click Acknowledge in the UI. The spinner ran for about 20 seconds and the page returned a 500. Same for Resolve. Same for Snooze. Same when we called the API directly with curl. The web pods were up, the database was reachable, Redis was fine. Nothing in any dashboard suggested a problem, because nobody had built a dashboard that watched OnCall itself.

$ curl -s -X POST -H "Authorization: Bearer $TOKEN" \
    https://oncall.internal/api/v1/alert_groups/I8KZ.../acknowledge/
{"detail": "Internal server error"}

# from the oncall-engine pod
$ kubectl logs deploy/oncall-engine -c engine --tail=50 | grep -A2 ERROR
DatabaseError: column alerts_alertgroup.acknowledged_by_confirmation_phone does not exist
LINE 1: ...ledged_by_user_id", "alerts_alertgroup"."acknowledged_by_co...

The ORM was reaching for a column the table did not have.

A silent ALTER TABLE timeout the Helm hook never noticed

Why the migration job exited 0 with a half-finished schema

Our first guess was a bad release. The previous Helm upgrade had bumped OnCall by a minor version, and we assumed the new application code was looking at a field that genuinely had not shipped yet. That was wrong. The release notes said the column had been added in this version, and django_migrations on the OnCall database said the migration had been applied. Both things were true, and the column was still not there.

The clue was in Postgres logs from three days earlier, exactly when the Helm post-upgrade hook ran the migration job. One line, easy to miss, in the middle of dozens of normal statement logs:

2026-05-11 02:14:07 UTC ERROR:  canceling statement due to lock timeout
2026-05-11 02:14:07 UTC STATEMENT:  ALTER TABLE alerts_alertgroup
    ADD COLUMN acknowledged_by_confirmation_phone varchar(20) NULL;
2026-05-11 02:14:07 UTC LOG:  duration: 30001.114 ms

alerts_alertgroup is one of the highest-write tables in OnCall. At 02:14 a backlog of insert transactions was holding locks on the table, so the ALTER could not acquire the ACCESS EXCLUSIVE lock it needs. It hit the lock_timeout we had set globally to 30 seconds (a sensible default we put in years ago to stop one bad migration from wedging the whole database), and Postgres cancelled the statement. The migration script caught the exception, logged it to stderr, moved on to the next statement, and finished. The Helm hook checked the job's exit code, saw 0, and marked the release Succeeded. ArgoCD synced. The new pods rolled. And from that moment, every code path that touched the new column returned 500.

A plain retry would not have worked either, for a second and separate reason. The failed attempt had left a pgtrigger object behind on alerts_alertgroup, which we only found by querying pg_trigger directly, so Django's migration would have blown up on replay trying to create a trigger that already existed. That leftover trigger did not cause the lock_timeout (an existing trigger does not block an ALTER TABLE; only conflicting locks held by other sessions do). It just meant the migration could not be re-run cleanly until we removed it, and the ALTER itself still needed a window where the write backlog was not holding the table.
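
The exit-code half of this failure fits in a few lines. The sketch below is not the migration script from the incident. It is a hypothetical Python model of the two behaviors, with a fake execute function standing in for the database driver.

```python
import sys


class LockTimeout(Exception):
    """Stand-in for the driver error raised when Postgres cancels a statement."""


def lenient_runner(statements, execute) -> int:
    """The failure mode: log the error, keep going, report success."""
    for sql in statements:
        try:
            execute(sql)
        except Exception as exc:
            print(f"statement failed: {exc}", file=sys.stderr)
    return 0


def strict_runner(statements, execute) -> int:
    """Stop at the first failed statement and return a non-zero exit code."""
    for sql in statements:
        try:
            execute(sql)
        except Exception as exc:
            print(f"statement failed, aborting: {exc}", file=sys.stderr)
            return 1
    return 0


def fake_execute(sql: str) -> None:
    # Simulate the lock timeout on the ALTER; every other statement succeeds.
    if sql.startswith("ALTER TABLE"):
        raise LockTimeout("canceling statement due to lock timeout")


STATEMENTS = [
    "ALTER TABLE alerts_alertgroup "
    "ADD COLUMN acknowledged_by_confirmation_phone varchar(20) NULL",
    "SELECT 1",
]

print(lenient_runner(STATEMENTS, fake_execute))
print(strict_runner(STATEMENTS, fake_execute))
```

A hook that only inspects the exit code sees the first runner as a success. Passing the strict runner's return value to sys.exit would let the Helm hook fail the release instead.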

Why we forward-fixed instead of rolling the Helm release back

Drop the trigger, add the column, then unstick the zombies

We considered rolling back to the previous OnCall version. It looked clean on paper: the old image did not need the missing column, so the schema would match again and acks would work. We talked ourselves out of it for two reasons. First, the new pods had been running for three days and had written data shaped for the new version, including new fields in adjacent tables. A rollback would have meant either accepting writes that the old code did not understand or restoring a 72-hour-old database snapshot, which would erase three days of incident history including the zombies we wanted to clean up. Second, the next upgrade would just hit the same lock_timeout the same way. We would be back here in a week.

Forward-fix it was. The sequence had to be careful, because the table was still taking writes and we were going to ALTER it. We picked a low-write window, paused Celery workers that wrote to alerts_alertgroup (not the web tier, which we wanted up so the API stayed responsive), and ran the work inside one transaction:

-- 1. confirm the column is genuinely missing
SELECT column_name FROM information_schema.columns
WHERE table_name = 'alerts_alertgroup'
  AND column_name = 'acknowledged_by_confirmation_phone';
-- (0 rows)

-- 2. find the leftover trigger from the failed attempt (it blocks a replay, not the ALTER)
SELECT tgname FROM pg_trigger
WHERE tgrelid = 'alerts_alertgroup'::regclass
  AND tgname LIKE 'pgtrigger_%';

-- 3. drop it inside the same transaction we ALTER in
BEGIN;
SET LOCAL lock_timeout = '5min';
DROP TRIGGER IF EXISTS pgtrigger_oncall_protect_finished
  ON alerts_alertgroup;
ALTER TABLE alerts_alertgroup
  ADD COLUMN acknowledged_by_confirmation_phone varchar(20) NULL;
COMMIT;

Raise lock_timeout for this transaction only; do not touch the global.

We did not change the global lock_timeout. Setting it LOCAL inside the transaction lets this one ALTER wait up to five minutes, and any other migration that runs in normal conditions still gets the 30-second guard. Once the column existed, we unpaused the Celery workers and watched the engine pod logs. The 500s stopped within seconds.

That left the zombies. Acknowledging them was not enough. An acknowledged incident still sits in the firing state from OnCall's deduplication perspective, so new alerts would still fold into it. We had to mark them resolved. We did it through the API first to make sure the lifecycle hooks fired and downstream integrations got the resolved webhook. That worked for one of the two. The API still refused the other for an unrelated reason: its integration had been deleted, so OnCall could not look up the routing to fire the webhook. For that single record we set resolved=TRUE and resolved_at to the current timestamp in the database directly, with a note in the incident's raw payload explaining the manual close.

We then fired a synthetic alert from Alertmanager and watched a new incident appear. We acked it from the UI in under two seconds, resolved it, and confirmed that a follow-up alert created a fresh incident instead of folding into the resolved one. That was the real all-clear.
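
The reason acknowledging was not enough is easier to see in a toy model. The Python sketch below is a simplified assumption about label-based grouping, not OnCall's implementation: an incoming alert folds into any incident with the same grouping key that is not yet resolved.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    key: str
    state: str = "firing"  # firing | acknowledged | resolved
    alerts: list = field(default_factory=list)


class Grouper:
    """Toy deduplication: one open incident per grouping key."""

    def __init__(self) -> None:
        self.incidents: list[Incident] = []

    def ingest(self, key: str, alert: str) -> Incident:
        for incident in self.incidents:
            # Acknowledged incidents still absorb alerts; only resolved ones do not.
            if incident.key == key and incident.state != "resolved":
                incident.alerts.append(alert)
                return incident
        incident = Incident(key=key, alerts=[alert])
        self.incidents.append(incident)
        return incident


grouper = Grouper()
zombie = grouper.ingest("service=api", "alert-1")
zombie.state = "acknowledged"
grouper.ingest("service=api", "alert-2")       # still folds into the zombie
zombie.state = "resolved"
fresh = grouper.ingest("service=api", "alert-3")  # opens a new incident

print(len(zombie.alerts), len(grouper.incidents), fresh is zombie)
```

This mirrors the synthetic test above: the follow-up alert only produces a new incident once the earlier one is resolved.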

Meta-monitoring for the platform that does the monitoring

What we wired up so the next silent migration trips an alarm

The thing that kept us up afterward was not the migration. Migrations fail. Database locks happen. The thing that kept us up was that OnCall had been broken for three days and not one signal in our monitoring stack had told us. We had alerts on Prometheus being down, on Alertmanager being down, on Grafana being down, on every customer-facing service. We had nothing watching the incident management platform itself.

We added two rules the same week. The first is a straight error-rate alert on OnCall's API. If more than 1% of requests to /api/v1/ return 5xx for five minutes, page the platform team at critical severity. Five minutes is short enough that a real outage gets caught but long enough that a brief burst of errors during a rolling deploy does not page. We picked critical because if OnCall is degraded, nothing else paging matters; alerts get swallowed.

groups:
- name: oncall-meta
  rules:
  - alert: OncallApiErrorRateHigh
    expr: |
      sum(rate(django_http_responses_total_by_status_total{job="oncall",status=~"5.."}[5m]))
      /
      sum(rate(django_http_responses_total_by_status_total{job="oncall"}[5m]))
      > 0.01
    for: 5m
    labels:
      severity: critical
      service: oncall
    annotations:
      summary: "OnCall API returning >1% 5xx for 5m"
      runbook: "https://internal/runbooks/oncall-api-errors"

  - alert: OncallMigrationJobStderr
    expr: |
      sum(increase(kube_job_status_failed{namespace="oncall"}[10m])) > 0
      or
      sum(increase(log_messages_total{namespace="oncall",app="migration",level="ERROR"}[10m])) > 0
    for: 1m
    labels:
      severity: critical
      service: oncall

The second rule catches a migration job that logs errors even if its exit code is 0.

The second rule is the lesson from this specific incident. Helm trusts the exit code. Our migration script caught individual statement errors and carried on. The only place the truth lives is in the job's log stream. We now alert on ERROR-level log lines from any pod with the migration label in the oncall namespace, regardless of whether the job reported success. We have caught two real issues with this rule in the months since (neither as bad as this one, both worth knowing about within minutes instead of days).

The broader pattern, and one we now apply on every recovery engagement we run, is that any tool you depend on to notice problems needs an independent way to notice when that tool itself is the problem. We have written more about this category of failure in our migration recovery work, because the same shape appears in database cutovers, queue platform upgrades, and identity provider migrations: the system you rely on to tell you the truth is the system that has stopped telling the truth, and you only find out from a customer.

When acks are silently 500ing and you cannot tell what data is real

If your OnCall is doing this right now

The hard part of this incident is not the SQL. The hard part is making the call between forward-fix and rollback when your incident history, your zombie state, and your live alert routing are all entangled in a database that is currently being written to by application code that expects a schema it does not have. Roll back without a plan and you lose three days of incident records. Forward-fix without checking for leftover triggers and migration locks and your second attempt fails the same way as the first. Run an ALTER on a hot table during business hours and you find out what your application's actual timeout tolerance is.

We do these engagements every few weeks. A partial Django migration on Grafana OnCall is the specific case we have now seen three times this year, twice from lock_timeout and once from a leftover pgtrigger object that made the migration fail on replay. Adjacent variants we have handled: Sentry post-deploy migrations that left a column nullable when the code expected NOT NULL, Mattermost upgrades where one index creation timed out, Keycloak realm migrations that completed on the primary but failed on a replica. The pattern is identical and the recovery sequence rhymes.

If your team is staring at a 500 on every ack and trying to decide whether to roll back the Helm release, book an infrastructure review with our team and we will be on a bridge with you the same day. We will help you confirm the schema delta, plan the forward-fix or the rollback with the data implications spelled out, and clean up the zombie incidents without losing the history you need for the postmortem.

Originally published at https://infraforge.agency/insights/grafana-oncall-stuck-incidents-partial-migration/.

If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements — see /review.

Why a deleted backup Lambda kept billing 9,400 EBS snapshots

Muhammad Hassaan Javed — Sat, 30 May 2026 22:26:35 +0000

The EBS Snapshot line on the monthly bill was $1,830. There was no active EBS snapshot policy on the account. The backup Lambda that had produced these snapshots had been deleted thirteen months earlier, replaced by AWS Backup, and forgotten. Nobody had deleted what it created. Two volumes, snapshotted every two hours for 392 days, came to 9,408 orphans sitting on 36 TB of storage, billed at the us-east-1 EBS Snapshot rate of $0.05 per GB-month every month since.

Problem signals:

EBS Snapshot line is several hundred dollars a month and no active EBS snapshot pipeline is running on the account
describe-snapshots --owner-ids self returns thousands of entries when you expect dozens
Sampling a few snapshot IDs shows VolumeId values (the source volume) that no longer resolve in describe-volumes
A backup Lambda or custom snapshot script was deprecated in the last 12 to 24 months
AWS Backup is the active tool and its dashboard shows normal counts, but the cost line tells a different story

$1,830 a month on a backup product the account no longer used

The line item that should have been zero

The EBS Snapshot line had been climbing slowly for thirteen months. Nobody had flagged it. The quarterly cost review surfaced it because the line item ranked sixth on the account, and the team's mental model said it should have ranked nowhere. There was no EBS snapshot policy running. AWS Backup had taken over RDS and EBS backups a year earlier, with the old Lambda plus EventBridge pipeline retired the same week.

The first instinct in the room was to pull AWS Backup's plan and see if a retention window had widened. The plan was clean. Snapshot counts there were in the low dozens, exactly what the new policy specified. So the snapshots driving the bill were coming from somewhere else.

$ aws ec2 describe-snapshots --owner-ids self \
    --query 'length(Snapshots)' --output text
9408

The number that turned a routine cost review into an incident.

That number was the moment the room got quiet. AWS Backup writes maybe forty snapshots a month on this account. Nine thousand was a different category of problem.

AWS Backup was clean, so who made these 9,408 snapshots

Ruling out the obvious suspect

With AWS Backup ruled out and no other named pipeline running, the question became: who created these 9,408 snapshots, and is anything still creating more. We pulled the StartTime field on the most recent hundred. The newest one was thirteen months old. Whatever pipeline made them had stopped, which meant we were looking at a stable population, not a leak that was still growing. That mattered because it meant the cleanup had a known size.

The next question was whether the source volumes were still around. We sampled twenty random snapshots and ran describe-volumes against each one's VolumeId, the field that records the source volume. All twenty came back InvalidVolume.NotFound. The pattern was clear: the snapshots referenced two specific volume IDs (the Lambda snapshotted two production EBS volumes every two hours), both of which had been deleted along with the EC2 instances they served when the application moved to a managed service.

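aws ec2 describe-snapshots --owner-ids self \
    --query 'Snapshots[*].[SnapshotId,VolumeId,StartTime]' \
    --output text > all-snapshots.tsv

# Check each distinct source volume once. Only InvalidVolume.NotFound counts
# as an orphan; throttling or auth errors must not be mistaken for deletion.
awk -F'\t' '{print $2}' all-snapshots.tsv | sort -u \
  | while read -r vid; do
      err=$(aws ec2 describe-volumes --volume-ids "$vid" 2>&1 >/dev/null) \
        && continue
      case "$err" in
        *InvalidVolume.NotFound*) echo "$vid orphan" ;;
        *) echo "$vid check failed: $err" >&2 ;;
      esac
    done > orphan-source-volumes.txt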

Group snapshots by their source volume, then check which source volumes still exist.

Only two volume IDs appeared in the orphan list. Two volumes, one snapshot every two hours, 392 days of runtime before the Lambda was deleted: 2 x 12 x 392 is 9,408. The arithmetic closed exactly, which told us we were looking at the whole population and not a subset of it. The Lambda that created them was gone, but AWS does not garbage-collect snapshots when their creator disappears. Snapshots are first-class objects with their own lifecycle, and that lifecycle is whatever you set when you create them. The Lambda set nothing.
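The closing arithmetic is easy to reproduce. The sketch below uses only figures quoted in this article; the implied storage figure is derived from the bill and the per-GB rate, so it is an inference rather than a measured value.

```python
VOLUMES = 2
SNAPSHOTS_PER_DAY = 24 // 2      # one snapshot every two hours
DAYS_RUNNING = 392
RATE_PER_GB_MONTH = 0.05         # us-east-1 rate quoted in the article
MONTHLY_BILL = 1830.0

expected_snapshots = VOLUMES * SNAPSHOTS_PER_DAY * DAYS_RUNNING
implied_billed_gb = MONTHLY_BILL / RATE_PER_GB_MONTH

print(expected_snapshots)        # 9408
print(round(implied_billed_gb))  # 36600
```

A count that matches the snapshot inventory exactly is what justified treating the two volumes as the whole population.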

Why we sampled twenty before touching the other 9,388

What we did before running delete-snapshot in a loop

The temptation at this point is to write a one-line loop and delete everything. The cost was real: $1,830 a month to store data that referenced infrastructure that no longer existed. But delete-snapshot is irreversible, and we held off on the loop for two reasons.

First, orphan is sometimes a transient state. A volume gets deleted on Tuesday during a planned migration. On Wednesday the orphan-finder runs. A snapshot taken two hours before the volume's deletion looks orphaned but is actually the most recent backup of a service that was just migrated. Deleting it would destroy the only remaining copy of that data. We checked the StartTime on every snapshot in our sample against the deletion date of its source volume. Every one was older than the deletion by at least nine months. The cohort was uniformly historical. No active workflow could be depending on any of them.

Second, we needed to be sure these snapshots were not being referenced as the base for any AMI or any live AWS Backup recovery point. We ran describe-images with a block-device-mapping.snapshot-id filter on the sample, expecting nothing, and got nothing. We checked the AWS Backup recovery point inventory. None of the orphan snapshot IDs appeared there. The deletion was safe.
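Both checks can be folded into a single predicate. This is a sketch under stated assumptions, not the script we ran: the `Snapshot` type, the `volume_deleted_at` mapping (assumed to come from CloudTrail or change records), and the 30-day default gap are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: str
    volume_id: str
    start_time: datetime


def safe_to_delete(
    snap: Snapshot,
    volume_deleted_at: dict[str, datetime],
    referenced_ids: set[str],
    min_gap: timedelta = timedelta(days=30),
) -> bool:
    """True only if the snapshot passes both checks described above.

    volume_deleted_at: deletion time per source volume (hypothetical input).
    referenced_ids: snapshot IDs still backing an AMI or a recovery point.
    min_gap: how much older than the volume deletion a snapshot must be
        before it stops counting as a possible last backup (assumed value).
    """
    if snap.snapshot_id in referenced_ids:
        return False
    deleted_at = volume_deleted_at.get(snap.volume_id)
    if deleted_at is None:
        # Volume still exists or its history is unknown: do not trust it.
        return False
    return deleted_at - snap.start_time >= min_gap
```

The predicate fails closed: a snapshot with unknown volume history is kept, because a wrong "keep" costs a few cents and a wrong "delete" cannot be undone.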

The delete loop was about forty minutes of actual work, and we spread it across three calendar days on purpose. delete-snapshot is rate-limited at roughly 5 requests per second per account with bursts, so 9,400 deletes plus retries on the occasional 503 is well under an hour of throughput even at a conservative pace. Nobody wanted nine thousand irreversible deletes running in one unsupervised burst, so we split the list into three batches of about 3,100, one per day, each batch gated on someone reading the previous day's failed.log and signing off before the next one started. The loop itself had a 250ms sleep and an append-only deleted.log that doubled as the checkpoint file, so we could resume after any interruption without retrying deletes that had already succeeded.

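touch deleted.log   # so the checkpoint lookup works on the very first run
while read -r sid; do
  # Skip anything a previous run already deleted.
  if grep -qxF "$sid" deleted.log; then continue; fi
  if aws ec2 delete-snapshot --snapshot-id "$sid"; then
    echo "$sid" >> deleted.log
  else
    echo "$sid" >> failed.log
  fi
  sleep 0.25
done < orphan-snapshot-ids.txt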

Resumable, rate-limited delete loop. The checkpoint file is the load-bearing part.
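The three-batch split and the forty-minute figure can be checked with a few lines. This is an illustrative sketch: `split_batches` and `sleep_minutes` are hypothetical helpers, and the runtime estimate counts only the sleeps, not API latency or retries.

```python
def split_batches(ids: list[str], n_batches: int) -> list[list[str]]:
    """Split ids into contiguous batches whose sizes differ by at most one."""
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    base, extra = divmod(len(ids), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        size = base + (1 if i < extra else 0)
        batches.append(ids[start:start + size])
        start += size
    return batches


def sleep_minutes(count: int, sleep_seconds: float = 0.25) -> float:
    """Lower bound on loop runtime: the sleeps alone, ignoring call latency."""
    return count * sleep_seconds / 60


if __name__ == "__main__":
    ids = [f"snap-{n:05d}" for n in range(9408)]
    print([len(b) for b in split_batches(ids, 3)])  # [3136, 3136, 3136]
    print(sleep_minutes(len(ids)))
```

Contiguous batches keep each day's input file a simple slice of orphan-snapshot-ids.txt, which makes the sign-off between batches easy to audit.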

After three days the EBS Snapshot line on the next monthly forecast dropped to under $20. The thirty-six terabytes of orphan storage was gone.

Tag at creation, schedule the cleanup, watch the lines that should be zero

The rule that meant the next deprecated pipeline could not do this

The deletion fixed the symptom. The interesting part of this engagement was the cause. AWS does not couple a snapshot's lifecycle to the lifecycle of whatever process created it. A Lambda gets deleted, an EventBridge rule gets removed, the IAM role goes with them, and the snapshots they made keep existing and keep being billed, forever, until something explicitly deletes them. There is no warning email. There is no dashboard widget. The only signal is the monthly bill, and the bill takes a year to be loud enough to investigate.

Two changes went in after the cleanup. The first was a tag-at-creation rule. Every snapshot the account creates now carries three tags applied at creation time: Owner (a team or service name), Retention (an ISO date past which the snapshot is safe to delete), and CreatedBy (the pipeline that made it). AWS Backup applies these automatically through its backup plan. The handful of custom Lambdas that survived the migration were rewritten to apply them. A weekly cleanup Lambda walks the account, deletes anything past its Retention date, and flags anything older than 90 days with no Retention tag. For the first 60 days the Lambda posted a Slack message and waited for a thumbs-up before deleting. After that it ran unattended.
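The decision the weekly cleanup makes for each snapshot can be sketched as a pure function. This is a minimal illustration of the policy described above, not the Lambda itself: the function name, the string return values, and the choice to flag unparseable tags are assumptions.

```python
from datetime import date, timedelta
from typing import Optional


def cleanup_action(
    created: date,
    retention_tag: Optional[str],
    today: date,
    untagged_limit: timedelta = timedelta(days=90),
) -> str:
    """Return "delete", "flag", or "keep" for one snapshot.

    retention_tag is the raw value of the Retention tag (an ISO date),
    or None when the tag is missing.
    """
    if retention_tag is None:
        return "flag" if today - created > untagged_limit else "keep"
    try:
        retention = date.fromisoformat(retention_tag)
    except ValueError:
        # Assumption: an unparseable tag goes to a human, never to deletion.
        return "flag"
    return "delete" if today > retention else "keep"
```

Keeping the decision separate from the AWS calls means the policy can be unit-tested without touching an account.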

The second change was to the quarterly cost review process. It now starts with the line items that should be zero or near zero, not the ones that are already big. The big lines get watched constantly by capacity planners. The lines that should be zero are where deleted infrastructure leaves footprints, and they are the ones least likely to be on anybody's dashboard. EBS Snapshot on a no-EBS-snapshot account. Lambda invocations on a service that was migrated to ECS six months ago. NAT Gateway hours on a workload that should not need cross-AZ egress. These are the lines where deprecated pipelines keep paying rent.

The lifecycle every snapshot now goes through. Untagged snapshots cannot live past 90 days without an explicit decision.

Cost archaeology on accounts where a deprecated pipeline is still paying rent

When the bill is the only thing telling you what you forgot

The shape of this incident is common. A pipeline gets shipped, the engineer who wrote it leaves, the policy gets replaced but the outputs survive, and the bill slowly bends upward. EBS snapshots are the most common shape we see. Detached EIPs are close behind. Idle NAT gateways and orphaned ElastiCache clusters round out the top four. None of these line items alarm on a CloudWatch dashboard because nothing is actively misbehaving. The deprecated pipeline is the misbehavior, and the pipeline no longer exists.

We run these cost-archaeology engagements regularly. In the last quarter we walked through three accounts where a single deprecated backup pipeline accounted for more than half of the account's EBS Snapshot line. We have an inventory script that finds orphan snapshots, detached volumes, unused EIPs, and idle NAT gateways across an account in about 20 minutes, plus a sample-then-delete workflow we walk the team through live so nothing irreversible happens on autopilot. The deletion is always the easy half. The work is figuring out which orphans are safe and writing the tag-at-creation policy that stops the next one.

If your bill has a line that does not match anything that should be running, the orphan audit is usually the fastest way to find out where it is going. Request an infrastructure review and we will run the audit with your team on a 30-minute diagnostic call this week. You can also see the broader pattern in our services overview for cloud cost spike work.

Originally published at https://infraforge.agency/insights/orphan-ebs-snapshots-deleted-backup-pipeline-cost-spike/.


Why one shared Terraform module made every PR a 14-service change

Muhammad Hassaan Javed — Tue, 26 May 2026 19:30:02 +0000

The PR that shipped the bug had three approvals and a comment that read "LGTM, plans look normal." The plans were not normal. They were 14 separate terraform plan outputs stacked in the CI log, each touching 80 to 120 resources, totaling around 1,400 resource changes for what the author described as a typo fix in a shared module. Buried somewhere in plan number nine was a change to an IAM policy attachment that broke three services on apply. Nobody had read past plan three. The team had spent six months congratulating themselves on collapsing 8,000 lines of Terraform into 1,200, and the bill for that consolidation had just arrived.

Problem signals:

Every PR touching a shared module shows N service plans in CI, each with 50+ resource changes
Reviewers approve with 'plans look normal' without scrolling through them
A single shared module has accumulated 25 to 40 input variables to handle per-service edge cases
CI plan time grows linearly with consumer count: one module change forces all N consumer plans to run, each loading its own remote state
A bug in the module breaks multiple unrelated services in the same apply window

How a 1,400-resource plan output stopped being read

The PR that broke three services had a clean LGTM

The original consolidation was, on paper, exactly the refactor every platform team is told to do. Fourteen service-specific Terraform configs, each maintained by a different feature team, each with its own subtle drift from the others. The platform team pulled the common shape out into one service-stack module, parameterized the differences, and pointed all 14 services at it. Eight thousand lines of HCL became twelve hundred. A change to add a shared observability sidecar landed across all 14 services in a single PR. Everyone celebrated.

The failure mode took six months to surface because the early signals looked like wins. Module changes shipped faster than the per-service changes they replaced. The platform team felt productive. What nobody tracked was that the CI plan output for every module PR had grown from one service's plan to fourteen, and the reviewers had silently adapted by reading the first plan, skimming the second, and rubber-stamping the rest.

Then a module-level change to how IAM policies were attached introduced a subtle bug: for services that overrode the default policy document, the new code path replaced the default rather than merging with it. Three of the 14 services overrode that default. The plan output showed the destruction and recreation of those policy attachments quite clearly, somewhere around line 870 of the GitHub diff view. The PR had three approvals.

Title: fix: typo in service-stack variable description

Diff: 1 line changed (a comment)

CI: terraform-plan-all ✓
  - service-a/plan: 84 changes
  - service-b/plan: 91 changes
  - service-c/plan: 102 changes
  - service-d/plan: 88 changes
  ... (10 more)

Total: 1,388 resource changes

Reviews:
  @platform-lead   approved 'LGTM, plans look normal'
  @service-b-eng   approved 'looks fine'
  @service-g-eng   approved

What the PR description looked like, paraphrased from the post-mortem

The fix that was not reviewer discipline

What we thought it was, what it actually was

The first instinct, and the one the team had spent two weeks pursuing before we got involved, was that this was a code review hygiene problem. They had written a PR template that required reviewers to acknowledge they had read each plan. They had a Slack bot that posted a daily "unreviewed plan changes" count. The platform lead had given a brown bag talk titled "Read Your Plans." None of it stuck, because none of it could stick. Asking a human to read the plan output for 1,400 resource changes in order to review a one-line comment fix is asking them to do something nobody should do, and they will not do it for long even if you make them feel guilty about it.

The actual problem was structural. The module had become a dependency surface that 14 consumers were forced to redeploy together, on every change, whether the change affected them or not. That is not a code review problem. That is the same coupling problem distributed systems people argue about with monoliths and microservices, except it had snuck in through the back door of a Terraform refactor. The cost of coupling does not show up the day you consolidate. It shows up the first time a small change has to ship and the blast radius is the entire fleet.

We have written more on the broader pattern in the Terraform and IaC debt pillar, but the specific recovery for this shape of problem has three layers, and they have to land in order.

How we cut the blast radius from 14 to 1 in an afternoon

Pinning the module per service was the bleeding stopper

The immediate move was to stop every consumer from being forced to re-plan on every module change. The mechanism is dumb and effective: pin each service's module reference to an explicit git ref instead of letting them all track main.

# Before: every consumer floats on main
module "service" {
  source = "git::https://github.com/org/modules.git//service-stack"
  name   = "auth-api"
  # ...
}

# After: every consumer is pinned to an explicit version
module "service" {
  source = "git::https://github.com/org/modules.git//service-stack?ref=v1.4.2"
  name   = "auth-api"
  # ...
}

Before and after: the module block in each service's Terraform config

After the pin, a module change ships as a tagged release in the modules repo, then ships to consumers one at a time via a per-service PR that bumps the ref. Each of those PRs shows exactly one service's plan, and that plan is short enough to read. The reviewers can do their job again. The author has to think about which services they actually want this change in, in what order, and on what schedule.
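A pin only helps if nobody quietly removes it, so a small CI check for floating module sources is a natural companion. The sketch below is an assumption-laden illustration, not tooling from this engagement: `unpinned_git_sources` is a hypothetical helper, and the regex is a rough stand-in for a real HCL parser.

```python
import re

# Matches source = "..." assignments in HCL text. A regex will miss
# unusual formatting; a real check would use an HCL parser.
SOURCE_RE = re.compile(r'^\s*source\s*=\s*"([^"]+)"', re.MULTILINE)


def unpinned_git_sources(hcl_text: str) -> list[str]:
    """Return git module sources that carry no ref query parameter."""
    unpinned = []
    for source in SOURCE_RE.findall(hcl_text):
        if not source.startswith("git::"):
            continue  # registry and local-path sources are out of scope here
        query = source.partition("?")[2]
        params = [p.split("=", 1)[0] for p in query.split("&") if p]
        if "ref" not in params:
            unpinned.append(source)
    return unpinned


if __name__ == "__main__":
    floating = 'module "service" {\n  source = "git::https://github.com/org/modules.git//service-stack"\n}\n'
    print(unpinned_git_sources(floating))
```

This only checks that a ref is present. It does not distinguish a tag from a branch name, so `?ref=main` would pass and still float.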

There is a real cost to this, and we want to name it honestly: you have given up some of the consolidation win. You can no longer ship an observability change to all 14 services in one PR. You can ship it in one tagged module release plus 14 small bump PRs, which is more clicks. We have not had a client regret the trade once they lived with it for a month. The clicks are cheap; the missed bug in plan number nine is not.

The change-propagation shape before and after pinning

Why the 30-input module became 5 inputs plus an advanced object

Tiering inputs and splitting by change velocity

Pinning bought time. It did not fix the underlying reason the module had become hard to change. We sat with the platform team and looked at all 30 inputs the module had grown. Most services used 5 of them. The other 25 existed because, over six months, individual services had asked for an escape hatch ("can the module take a custom IAM policy document?", "can we override the security group rules?", "can we set a node selector?") and the module owners had said yes, every time, because saying no felt like blocking a teammate. The module had become an everything-bagel.

We refactored the input surface into two tiers. The common five became first-class top-level inputs. The other 25 went into an optional advanced object with optional() fields, so a normal consumer never sees them and an exotic consumer has to opt in deliberately.

# Common path: every service uses these
variable "name"        { type = string }
variable "image"       { type = string }
variable "replicas"    { type = number }
variable "environment" { type = string }
variable "port"        { type = number }

# Escape hatch: explicit, optional, and visible in code review
variable "advanced" {
  type = object({
    custom_iam_policy_json = optional(string)
    extra_sg_rules         = optional(list(object({ ... })))
    node_selector          = optional(map(string))
    # ... 22 more rarely-used knobs
  })
  default = {}
}

variables.tf after the tiering refactor

Then we did the harder work: splitting the module along change-velocity boundaries. The monitoring submodule was changing roughly weekly, the database submodule once a quarter, and the networking submodule about twice a year. Bundling them together meant every monitoring tweak forced a re-plan of database and networking resources for all 14 consumers. We pulled them apart into separate modules with separate version pins, so a consumer can bump monitoring from v2.1 to v2.2 without touching database at all.

This is the same argument microservice advocates make about service boundaries, and the rule of thumb is the same: couple things that change together, decouple things that change at different rates. The cost of getting it wrong in Terraform is not latency or distributed-transaction pain. It is plan output nobody reads, and bugs that ship because of it.

Step	What it does
Pin per consumer	Each service references the module at an explicit git tag. Module changes ship as releases, then propagate per service.
Tier the inputs	Five common inputs stay first-class. The long tail moves into an optional advanced object so escape hatches are explicit.
Split by change velocity	Submodules that change at different rates become separate modules with separate version pins. A weekly change does not drag a quarterly one along.
Gate multi-service plans	An OPA or CI check fails any PR whose plan touches more than three workspaces unless the description includes allow-multi-service: yes.

The OPA gate is worth its own sentence. It is twenty lines of Rego that counts distinct workspaces touched by the plan and fails the PR over a threshold unless the author explicitly opts in. It does not prevent fleet-wide changes; it forces the author to acknowledge they are making one. That single check has caught two accidental fleet-wide PRs at the clients who have adopted it, both of which would have shipped under the old regime.
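The logic of that gate is small enough to show in full. The clients who adopted it run it as Rego; what follows is a Python stand-in for the same decision, with a hypothetical function name and message text, assuming the set of touched workspaces has already been extracted from the plan.

```python
def gate_plan(
    workspaces_touched: set[str],
    pr_description: str,
    threshold: int = 3,
) -> tuple[bool, str]:
    """Fail a PR whose plan spans too many workspaces unless the author opts in.

    Returns (allowed, message). The opt-in marker matches the one named in
    the table above; matching is case-insensitive by assumption.
    """
    opted_in = "allow-multi-service: yes" in pr_description.lower()
    count = len(workspaces_touched)
    if count > threshold and not opted_in:
        return False, (
            f"plan touches {count} workspaces (limit {threshold}); "
            "add 'allow-multi-service: yes' to the description to proceed"
        )
    return True, "ok"


if __name__ == "__main__":
    fleet = {f"service-{c}" for c in "abcdefghijklmn"}
    print(gate_plan(fleet, "fix: typo in service-stack variable description"))
```

The gate never blocks a fleet-wide change outright. It turns an accidental one into a deliberate one.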

When the consolidation win has become a coupling tax

If your shared modules feel like this right now

The hard part of this kind of recovery is not the technical work. Pinning a module ref is ten minutes of typing per service. Splitting a module along change-velocity lines is a weekend. The hard part is convincing the team that the consolidation they are proud of has become a liability, and doing the unwind in an order that does not cause an outage. We have done this engagement at four SaaS platforms in the last year, and the pattern of "reviewer fatigue followed by a buried bug" showed up in three of the four. The fourth caught it before the bug shipped, only because their CI plan output had grown past GitHub's diff size limit and forced the conversation.

We run these recovery engagements every week. If your platform team is shipping module changes that produce thousand-line plan outputs and your reviewers have started writing "plans look normal" without reading them, the next bug is already on its way. Book an infrastructure review with our team and we will spend a 30-minute diagnostic call this week mapping your module consumer graph and naming the first three pins to put in place.

Originally published at https://infraforge.agency/insights/terraform-shared-module-coupling-fleet-wide-plans/.


When ArgoCD shows Healthy but Keycloak silently strips JWT claims

Muhammad Hassaan Javed — Fri, 22 May 2026 22:19:02 +0000

ArgoCD reported Synced and Healthy. The Keycloak Helm release was green. And the downstream timeline service was returning 401 on every authenticated request. That was the call we got: every dashboard says the platform is fine, and authentication is broken across three services. The JWTs auth-service was issuing had stopped carrying the groups claim and the email_verified claim about 40 minutes earlier, right after an ArgoCD auto-sync rolled out a Keycloak chart bump. Six OIDC clients had silently lost protocol mappers and role mappings during that sync, and we did not yet know it.

Problem signals:

ArgoCD shows Synced and Healthy on the Keycloak application, but downstream services return 401 on tokens they accepted an hour ago
JWTs decoded at jwt.io are missing claims that production code depends on (groups, email_verified, audience)
Engineers have been making emergency fixes directly in the Keycloak admin console during recent incidents and not committing them back
The realm import ConfigMap in git has not been touched in weeks, yet the live realm has clearly changed
Helm values for the Keycloak chart set realm import strategy to OVERWRITE or leave it unset (which defaults to OVERWRITE on most charts)

The sync that looked clean and quietly stripped six clients

ArgoCD said Healthy. Auth said 401.

Our first guess was wrong. The team had been staring at auth-service for 25 minutes when we joined the bridge, because the tokens it was issuing were obviously malformed. The groups claim was gone. The email_verified claim was gone on a different client. Surely auth-service had shipped a bad release. Except auth-service had not shipped in nine days, and the failure had started 40 minutes ago, not nine days ago.

The shape of the failure is what gave it away. Three OIDC clients had each lost a different mapper at the same moment. Auth-service had lost a groups protocol mapper. The profile service had lost an email_verified client scope mapping. The api gateway had lost role mappings for a downstream audience. Three services do not lose three unrelated pieces of OIDC config simultaneously unless something upstream rewrote all of them at once. The only thing that had touched Keycloak in that window was an ArgoCD auto-sync of the Keycloak Helm release.

We pulled the ArgoCD sync history and found the sync 41 minutes earlier. It was a chart version bump, nothing that should have changed realm content. But the chart ships a realm import ConfigMap, and the realm JSON inside that ConfigMap had not been updated in weeks. Meanwhile the live realm in the Keycloak PostgreSQL database had been edited through the admin console at least a dozen times during recent incidents. None of those console changes had been committed back to git.

So the chart redeployed the ConfigMap. The Keycloak init container read it. And the realm import ran with the strategy set to OVERWRITE. Every console change made during the previous two weeks of incident response got reverted to the stale git version, silently, with no error and no event surfaced to ArgoCD.

Diffing live realm state against the ConfigMap before doing anything destructive

Six clients had drifted and the next sync would make it worse

The first thing we did was not a fix. The first thing we did was freeze. Auto-sync was still enabled on the Keycloak ArgoCD application. If anyone touched a Helm value for any reason in the next hour, another sync would fire and a second OVERWRITE pass would run against whatever state we had managed to reconstruct. We paused auto-sync and disabled self-heal first, then started the diagnosis.

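# 1. Freeze the ArgoCD app so the next sync cannot fire mid-recovery.
#    Self-heal is a sub-option of automated sync, so turn it off before
#    removing the automated policy.
argocd app set keycloak --self-heal=false
argocd app set keycloak --sync-policy none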

# 2. Pull live realm state from the Keycloak Admin REST API
TOKEN=$(curl -s -X POST "$KC/realms/master/protocol/openid-connect/token" \
  -d "grant_type=password" -d "client_id=admin-cli" \
  -d "username=$ADMIN_USER" -d "password=$ADMIN_PASS" | jq -r .access_token)

curl -s -H "Authorization: Bearer $TOKEN" \
  "$KC/admin/realms/primary/clients" | jq . > live-clients.json

curl -s -H "Authorization: Bearer $TOKEN" \
  "$KC/admin/realms/primary/client-scopes" | jq . > live-scopes.json

# 3. Extract the realm JSON ArgoCD just pushed
kubectl -n keycloak get cm keycloak-realm-import -o jsonpath='{.data.realm\.json}' \
  | jq . > configmap-realm.json

Snapshot live state before any reconciliation. The live API is now the source of truth, not the ConfigMap.

Diffing live-clients.json against the clients block in configmap-realm.json showed six clients with material differences. Two were missing protocol mappers entirely. Three had client scopes that had been removed. One had role mappings that were present in production but absent from the ConfigMap, because an engineer had re-added them by hand in the console during the firefight, before we joined the bridge, and had not committed them either. That last finding was the one that mattered most. The drift ran in both directions, and the ConfigMap was still armed: every future sync was another OVERWRITE pass waiting to delete config that existed only in the Keycloak database. This one had simply cascaded far enough to break downstream services.

Two write paths to the same realm. OVERWRITE makes one of them silently win.
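The per-client comparison can be sketched as follows. It assumes both inputs are lists of Keycloak client representations (the contents of live-clients.json and the clients block of configmap-realm.json) and reads only the `clientId`, `protocolMappers`, and `defaultClientScopes` fields. The function names are hypothetical and this is not the exact diff we ran.

```python
def _by_client_id(clients: list[dict]) -> dict[str, dict]:
    return {c["clientId"]: c for c in clients}


def _mapper_names(client: dict) -> set[str]:
    return {m["name"] for m in client.get("protocolMappers") or []}


def _scopes(client: dict) -> set[str]:
    return set(client.get("defaultClientScopes") or [])


def client_drift(live: list[dict], git: list[dict]) -> dict[str, dict[str, list[str]]]:
    """Report mappers and default scopes present on only one side, per client."""
    live_by_id, git_by_id = _by_client_id(live), _by_client_id(git)
    report = {}
    for client_id in sorted(set(live_by_id) | set(git_by_id)):
        a = live_by_id.get(client_id, {})
        b = git_by_id.get(client_id, {})
        entry = {
            "mappers_only_live": sorted(_mapper_names(a) - _mapper_names(b)),
            "mappers_only_git": sorted(_mapper_names(b) - _mapper_names(a)),
            "scopes_only_live": sorted(_scopes(a) - _scopes(b)),
            "scopes_only_git": sorted(_scopes(b) - _scopes(a)),
        }
        if any(entry.values()):
            report[client_id] = entry
    return report
```

Reporting each direction separately matters here, because config that exists only in the live realm is exactly what the next OVERWRITE pass would delete.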

Reconstructing realm state without invalidating active sessions

Why we did not re-import the ConfigMap

The obvious recovery path was to fix the realm JSON in git, commit it, and let ArgoCD re-sync. We did not do that, and the reason matters. A full realm re-import, even with the right content, runs through the Keycloak realm import flow on startup. Depending on the chart and the Keycloak version, that can rotate signing keys, drop active sessions, or invalidate refresh tokens. We had roughly 8,000 active user sessions at that moment. Forcing all of them to re-authenticate at 11pm during an active incident was not a recovery; it was a second outage on top of the first.

So we split the fix into two phases. Phase one was to restore live realm state using the Admin REST API directly, client by client, mapper by mapper. The REST API can add a protocol mapper or attach a client scope to a client without bouncing anything. Phase two was to update the ConfigMap in git to match the now-correct live state AND change the import strategy, so that the next ArgoCD sync would be a no-op rather than another OVERWRITE pass.

# Phase 1: restore each missing mapper live via Admin REST API
# Example: re-add the groups protocol mapper to auth-service client
CLIENT_ID=$(jq -r '.[] | select(.clientId=="auth-service") | .id' live-clients.json)

curl -s -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  "$KC/admin/realms/primary/clients/$CLIENT_ID/protocol-mappers/models" \
  -d '{
    "name": "groups",
    "protocol": "openid-connect",
    "protocolMapper": "oidc-group-membership-mapper",
    "config": {
      "claim.name": "groups",
      "full.path": "false",
      "id.token.claim": "true",
      "access.token.claim": "true",
      "userinfo.token.claim": "true"
    }
  }'

# Verify a freshly issued token now carries the claim before moving on.
# JWT segments are unpadded base64url, so map the alphabet and pad before decoding.
curl -s -X POST "$KC/realms/primary/protocol/openid-connect/token" \
  -d 'grant_type=client_credentials' \
  -d "client_id=auth-service" -d "client_secret=$SECRET" \
  | jq -r .access_token | cut -d. -f2 | tr '_-' '/+' \
  | awk '{ while (length($0) % 4) $0 = $0 "="; print }' \
  | base64 -d | jq .

Restore each mapper live, then verify the issued token actually carries the claim before moving to the next client.

We worked through the six clients in dependency order: auth-service first because every other service consumed its tokens, then the api gateway, then profile, then the rest. After each client we curl'd a fresh token and base64-decoded the payload to confirm the claim was present. Twenty-two minutes from the start of restoration, timeline-service was returning 200s again. No sessions dropped. No users re-authenticated. The Keycloak pods were never restarted.

What we changed so the next sync becomes a no-op

The one Helm value that should never be OVERWRITE

With live state correct, the dangerous artifact in the system was still the stale realm JSON in the ConfigMap and the OVERWRITE strategy that would re-apply it on any future sync. We exported the now-correct realm via the Admin API, ran it through a diff against what was in git, and committed the result. We also patched the Keycloak Helm values to set the realm import strategy to IGNORE_EXISTING.

# values.yaml for the Keycloak chart
extraEnv: |
  - name: KEYCLOAK_IMPORT_STRATEGY
    value: IGNORE_EXISTING
  # On Keycloak 22+ via Quarkus distribution:
  - name: KC_SPI_IMPORT_SINGLE_FILE_STRATEGY
    value: IGNORE_EXISTING

# For the operator/CR variant:
# spec:
#   realmImport:
#     strategy: IGNORE_EXISTING   # NOT OVERWRITE_EXISTING

IGNORE_EXISTING means the ConfigMap seeds a realm on first creation but never overwrites existing resources. This is the correct setting for any realm that humans also edit.

We re-enabled ArgoCD auto-sync and watched it run. The sync diffed clean: ConfigMap content matched live realm, import strategy was IGNORE_EXISTING, no resources were touched. Green for the right reason this time.

We changed two things in the way the team operates going forward. First, we wrote a small drift detector that runs nightly. It pulls the live realm via the Admin API, diffs it against the realm JSON in git, and posts to a Slack channel if they disagree. It is roughly 80 lines and it has caught two console-edits-not-committed in the six weeks since. Second, we now treat OVERWRITE as a forbidden value for any realm that is also editable in the admin console. If you want OVERWRITE semantics, you must also remove admin console write access for everyone except a break-glass account, because otherwise you are building a system where one of two writers silently destroys the other's work. We have written more about this category of GitOps failure in the ArgoCD and GitOps recovery cluster, and the same pattern shows up with Grafana dashboards, Argo Workflows templates, and anything else where humans and a controller both have write access to the same object.
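
The core of a drift detector like this is a structural diff between two realm exports. The sketch below is not the team's 80-line script; it is a minimal illustration that compares only protocol mapper names per client, and it assumes both inputs follow the realm export shape (a top-level clients list whose items carry clientId and protocolMappers). Fetching the live realm and posting to Slack are left out.

```python
def mapper_index(realm: dict) -> dict:
    """Map clientId -> set of protocol mapper names found in a realm export."""
    return {
        client["clientId"]: {m["name"] for m in client.get("protocolMappers", [])}
        for client in realm.get("clients", [])
    }


def realm_drift(git_realm: dict, live_realm: dict) -> list:
    """Return human-readable differences between the git and live realms."""
    git_idx = mapper_index(git_realm)
    live_idx = mapper_index(live_realm)
    findings = []
    for client_id in sorted(set(git_idx) | set(live_idx)):
        in_git = git_idx.get(client_id, set())
        in_live = live_idx.get(client_id, set())
        for name in sorted(in_live - in_git):
            findings.append(f"{client_id}: mapper '{name}' is live but not in git")
        for name in sorted(in_git - in_live):
            findings.append(f"{client_id}: mapper '{name}' is in git but not live")
    return findings


if __name__ == "__main__":
    git = {"clients": [{"clientId": "auth-service", "protocolMappers": []}]}
    live = {"clients": [{"clientId": "auth-service",
                         "protocolMappers": [{"name": "groups"}]}]}
    for line in realm_drift(git, live):
        print(line)
```

A non-empty result is the signal to alert on. A fuller version would also compare client scopes and mapper config values, which is where console edits tend to hide.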

When GitOps is silently rewriting your identity provider

If your realm config and your cluster disagree

The hard part of this kind of incident is not the Keycloak knowledge. It is recognizing that a green ArgoCD dashboard can coexist with a destroyed production configuration, and knowing which fixes preserve sessions versus which ones lock out every user in the building at midnight. The team we worked with had the Keycloak skills. What they did not have was a recovery sequence that prioritized live state capture over git reconciliation, and a clear rule about when to apply via the Admin API versus when to let ArgoCD do it.

We run these recovery engagements every week. The OVERWRITE-vs-IGNORE_EXISTING trap has hit two other teams this quarter, both on Keycloak, and we have seen the same shape on Grafana provisioning, Argo Workflows ClusterWorkflowTemplates, and a memorable case with Vault policies. The pattern is always: controller writes, human writes, controller wins on the next reconcile, nobody notices for hours.

If your identity provider, your dashboards, or any other system with human-editable state is sitting behind ArgoCD and you have ever wondered whether you are quietly losing changes, book an infrastructure review with our team and we will be on a bridge with you the same day. The first 30 minutes will tell you whether you have a drift problem, and from there we can scope a recovery that does not require kicking your users out.

Originally published at https://infraforge.agency/insights/keycloak-realm-overwrite-argocd-sync-drift/.

If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements — see /review.

Why a Terraform apply hangs 90 minutes on a custom provider with no timeout

Muhammad Hassaan Javed — Fri, 22 May 2026 10:14:24 +0000

Two hundred destroys that needed 40 seconds of real work hung for 90 minutes. The platform team kicked off a terraform apply to remove stale config entries from an internal service, watched the progress bar stop at minute 12, and then stared at a frozen terminal until someone finally ran kill -9. By that point the state file was half-updated, the DynamoDB lock was still held, and nobody was sure which of the 200 entries had actually been deleted. The custom Terraform provider doing the destroys had a synchronous HTTP call with no context timeout, and the backend behind it was rate-limiting at 5 RPS. Neither side was wrong on its own. The contract between them was broken.

Problem signals:

terraform apply prints no output for 20+ minutes after destroys begin, no progress, no errors
The backend service is healthy on its dashboard but throttling requests at a low RPS limit
kill -9 on the terraform process leaves the DynamoDB state lock held forever
After force-unlock, terraform state list shows resources that no longer exist in the cloud
The custom provider in use was written internally and has no timeouts {} block support documented

What the team thought was happening, and what was actually happening

Forty seconds of work, ninety minutes of silence

The first assumption was that the internal config service was hung. It was not. Its dashboard showed it healthy and serving requests, just slowly. The second assumption was that terraform was making progress and just not printing anything. That one was half true. Terraform was making progress, just nowhere near the rate the backend could have served. The backend was rate-limiting at 5 requests per second, so 200 entries is 40 seconds of real work at that ceiling. The team waited 90 minutes.

The reason for the gap was a custom Terraform provider written by a previous platform team. Its delete function, resourceConfigEntryDelete, looked roughly like the snippet below. No context. No timeout. No retry-with-backoff. No progress emission back to Terraform's UI layer. When the backend returned a 429, the provider's HTTP client did its own internal retry, swallowed the error, and tried again. Forever. Because the provider never returned from Delete, Terraform's supervisor saw a working call and waited.

func resourceConfigEntryDelete(d *schema.ResourceData, meta interface{}) error {
    client := meta.(*ConfigClient)
    id := d.Id()

    // No context. No timeout. No bound on retries.
    for {
        err := client.DeleteEntry(id)
        if err == nil {
            return nil
        }
        if isRateLimited(err) {
            time.Sleep(1 * time.Second)
            continue
        }
        return err
    }
}

The shape of the broken Delete function (reconstructed from the provider source)

What this should have been is below. The schema.ResourceTimeout block lets users set a timeouts {} block on the resource. The context carries that deadline. When the deadline expires, the provider returns an error, Terraform fails the destroy for that resource and leaves it in state so a later run can retry, rather than showing it as silently in-progress for the rest of human history.

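import (
    "context"

    "github.com/hashicorp/terraform-plugin-sdk/v2/diag"
    "github.com/hashicorp/terraform-plugin-sdk/v2/helper/retry"
    "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

func resourceConfigEntryDelete(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
    client := meta.(*ConfigClient)
    id := d.Id()

    // Retry until the resource's delete timeout; RetryContext returns a plain error.
    err := retry.RetryContext(ctx, d.Timeout(schema.TimeoutDelete), func() *retry.RetryError {
        err := client.DeleteEntryWithContext(ctx, id)
        if err == nil {
            return nil
        }
        if isRateLimited(err) {
            return retry.RetryableError(err)
        }
        return retry.NonRetryableError(err)
    })
    if err != nil {
        // Convert to diagnostics so Terraform surfaces the timeout or failure.
        return diag.FromErr(err)
    }
    return nil
}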

What the Delete function should look like

The half-updated state and the stuck DynamoDB lock

Why kill -9 left us worse off

When the engineer finally ran kill -9 on the terraform process, two things happened that compounded the problem. First, the DynamoDB lock entry stayed exactly where it was. Terraform releases its lock on graceful shutdown, not on SIGKILL. So the next person who ran terraform plan got the familiar error and assumed someone else was still working on it. They were not. The lock was a ghost.

Second, the deletes had been trickling through at a small fraction of the backend's ceiling. Every worker retried on a flat one-second sleep, ignoring Retry-After, and the backend's limiter counted rejected requests against the same bucket. The retry traffic kept that bucket saturated and starved the deletes that would otherwise have gone through. Observed throughput was roughly 60 of the 200 entries in the first 12 minutes, about one delete every 12 seconds against a 5 RPS ceiling. That gap between the nominal limit and the achieved rate is why 40 seconds of work turned into an afternoon. Terraform does persist state to the remote backend as an apply progresses, so most of those 60 completed deletes were recorded. What SIGKILL destroyed was the boundary: the delete that was in flight when the signal landed, and the state write the process never got to finish. State and the backend now disagreed at the edges, and the provider's Read function carried the same no-timeout bug, so a refresh could not be trusted to settle the argument on its own.

Before doing anything else we confirmed the terraform process was actually dead on the operator's machine. ps aux | grep terraform, on the actual machine, not a tmux pane from yesterday. We have force-unlocked locks that turned out to belong to a process still doing useful work, and the damage is worse than a stuck lock. Once confirmed dead, terraform force-unlock with the lock ID from the error message released DynamoDB.

# 1. Confirm no terraform process is running on the operator's machine
ssh operator-host 'ps aux | grep -v grep | grep terraform'

# 2. Release the lock (lock ID comes from the error message)
terraform force-unlock 7c4a3e22-1b9d-4e8a-b6d7-9f2a8c5e4d11

# 3. See what state thinks vs what the cloud actually has
terraform plan -refresh-only

# 4. Apply the refresh so state matches reality
terraform apply -refresh-only

The recovery sequence after confirming the process is dead

Scripting state rm and import for 200 entries

Reconciling state against a half-finished destroy

After the refresh-only apply, state and cloud appeared to agree on what existed. But the original goal, deleting all 200 entries, was still only partially done. We now had two populations to handle: entries that still existed both in tfstate and in cloud (the destroy had not gotten to them), and entries that had been removed from cloud during the hung apply and were no longer in tfstate either (the refresh had cleaned them up). The first group we could destroy normally. The second group needed nothing further.

Where it got annoying was a third population we discovered later: entries that had been deleted from cloud by the hung apply, but where the refresh had failed to notice because the provider's Read function had the same no-timeout bug and was returning stale cached data. Those entries were ghosts in tfstate. For each one we had to run terraform state rm by address. With 47 of them, we scripted it from a diff.

# Pull current tfstate resource list
terraform state list | grep config_entry > tfstate_entries.txt

# Pull live entries from the backend, one page at a time, with a pause between calls
: > live_entries.txt
page=1
while :; do
  body=$(curl -sf "$CONFIG_API/entries?page=$page&per_page=100") || { echo "fetch failed on page $page"; exit 1; }
  count=$(printf '%s' "$body" | jq '.entries | length')
  [ "$count" -eq 0 ] && break
  printf '%s' "$body" | jq -r '.entries[].id' >> live_entries.txt
  page=$((page + 1))
  sleep 1
done

# Guard: an empty live list means the fetch broke, not that the backend is empty.
# Without this, every resource in tfstate looks like a ghost.
if [ ! -s live_entries.txt ]; then
  echo "live_entries.txt is empty, refusing to generate state rm commands"
  exit 1
fi

# Entries in tfstate but not in cloud: these are ghosts
comm -23 <(sort tfstate_entries.txt) <(sort live_entries.txt | sed 's|^|module.config.config_entry.|') > ghosts.txt

# Remove them from state
while IFS= read -r addr; do
  terraform state rm "$addr"
done < ghosts.txt

Generating the state rm commands from a diff between tfstate and the live backend

For the inverse case (entry exists in cloud but not in tfstate), the recovery is terraform import. We did not hit this on this incident but we have hit it on similar ones, and the same diff approach works in the other direction. The general pattern for any half-finished Terraform operation against a custom provider is laid out in our Terraform state recovery playbook.
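
Both directions of that diff can be computed in one pass. The following is a rough Python sketch, not the script we ran: it assumes resource addresses are the module.config.config_entry. prefix followed by the entry ID (the same assumption the shell script makes), and that the import ID equals the entry ID, which depends on how the provider's importer is written.

```python
PREFIX = "module.config.config_entry."  # address prefix taken from the script above


def reconcile_plan(state_addresses, live_ids, prefix=PREFIX):
    """Split the state/backend disagreement into state rm and import work."""
    live = set(live_ids)
    if not live:
        # Same guard as the shell version: an empty live list means a broken fetch.
        raise ValueError("live id list is empty; refusing to plan state rm")
    state_ids = {a[len(prefix):] for a in state_addresses if a.startswith(prefix)}
    ghosts = sorted(state_ids - live)    # in tfstate, gone from the backend
    orphans = sorted(live - state_ids)   # in the backend, unknown to tfstate
    return {
        "state_rm": [prefix + entry_id for entry_id in ghosts],
        "import": [(prefix + entry_id, entry_id) for entry_id in orphans],
    }


if __name__ == "__main__":
    plan = reconcile_plan(
        [PREFIX + "alpha", PREFIX + "beta"],
        ["beta", "gamma"],
    )
    for address in plan["state_rm"]:
        print("terraform state rm", address)
    for address, entry_id in plan["import"]:
        print("terraform import", address, entry_id)
```

Printing the commands instead of running them keeps a human in the loop, which is worth the extra minute when the state is already in doubt.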

The contract every custom Terraform provider has to honor

What the provider should have done

A custom Terraform provider is a contract. Terraform's whole supervision model assumes the provider plays by it. The contract is short: Create, Read, Update, and Delete each accept a context, each respect the user's timeouts {} block, each emit clear errors when something goes wrong, and each return in bounded time. When a provider violates the contract, Terraform's user-facing behavior degrades in ways that look like Terraform bugs but are not.

Internal providers skip the contract more often than vendor ones, because the team that writes the provider also runs the backend it talks to, and they convince themselves they have full visibility. They do not. terraform-cli is a separate process. It cannot see your retry loop. It cannot see your in-flight HTTP call. All it sees is a function that has not yet returned. The fix for this provider was four changes:

Step	What it does
1. Accept context on every CRUD function	Migrate from the legacy schema.CreateFunc signatures to the context-aware schema.CreateContextFunc variants. terraform-plugin-sdk v2 deprecates the legacy fields, and without a context a function has no deadline to observe.
2. Declare and honor timeouts on every resource	Add a Timeouts: &schema.ResourceTimeout{Create: schema.DefaultTimeout(5 * time.Minute), Delete: schema.DefaultTimeout(5 * time.Minute)} block on every resource schema. Use d.Timeout(schema.TimeoutDelete) inside the function.
3. Replace internal retry loops with retry.RetryContext	The retry helper respects the context deadline and surfaces retryable vs non-retryable errors cleanly. Hand-rolled for-loops over time.Sleep do not.
4. Pin the fixed version via .terraform.lock.hcl	Release a new patch version of the provider, update the lockfile, and remove the old version from your internal registry so nobody can fall back to it.
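
Step 3 is about the retry behaviour more than any particular helper. As a language-neutral illustration, here is a minimal Python sketch of a bounded retry that waits for the server's Retry-After instead of a flat sleep and gives up at a deadline. RateLimited and delete_with_deadline are hypothetical names, and the clock and sleep parameters exist only so the loop can be exercised without real waiting.

```python
import time


class RateLimited(Exception):
    """Raised by a (hypothetical) client when the backend answers 429."""

    def __init__(self, retry_after: float):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


def delete_with_deadline(delete, timeout_s, clock=time.monotonic, sleep=time.sleep):
    """Call delete() until it succeeds or the deadline passes.

    Sleeps for the server-provided Retry-After rather than a fixed interval,
    and raises TimeoutError instead of looping forever.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            return delete()
        except RateLimited as exc:
            remaining = deadline - clock()
            if exc.retry_after >= remaining:
                raise TimeoutError("delete did not finish before the deadline") from exc
            sleep(exc.retry_after)


if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky_delete():
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise RateLimited(0.01)
        return "deleted"

    print(delete_with_deadline(flaky_delete, timeout_s=1.0))
```

Any error other than RateLimited propagates immediately, which mirrors the retryable versus non-retryable split in the table.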

The apply pattern itself also needed a change. Destroying 200 entries in one shot against a 5 RPS backend is asking for trouble even with a correct provider, because a 5-minute timeout per resource is generous when one resource genuinely takes 200ms but useless when the queue ahead of you is 199 other deletes. We split future bulk operations into batches of 10 using -target, or we push the backend team to expose a bulk delete endpoint. The provider then wraps the bulk endpoint as a single resource operation instead of looping.
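
The batching arithmetic is simple enough to sketch. The Python below is an illustration only: the resource addresses are made up, following the module.config.config_entry. pattern seen earlier, and the functions just slice the list, build -target arguments, and compute the lower bound the rate limit imposes.

```python
def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def target_args(addresses):
    """Build the -target arguments for one batch."""
    return [f"-target={address}" for address in addresses]


def min_seconds(n_requests, rps):
    """Lower bound on wall-clock time for n requests under a rate ceiling."""
    return n_requests / rps


if __name__ == "__main__":
    # Hypothetical addresses standing in for the 200 config entries.
    addresses = [f"module.config.config_entry.e{i}" for i in range(200)]
    groups = list(batches(addresses, 10))
    print(len(groups), "batches; floor of", min_seconds(len(addresses), 5), "seconds")
    print("terraform destroy", " ".join(target_args(groups[0])))
```

With ten deletes per run, each resource has at most nine others queued ahead of it, so a per-resource timeout means something again.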

The relationship that broke and what fixes each side

When a custom provider has left your state in an unknown shape

If you are looking at a hung apply right now

Hung Terraform applies against internal providers are the kind of incident that sounds boring in a postmortem and feels terrifying in the moment. You cannot tell if the apply is still doing useful work or stuck forever. You cannot kill it without risking a half-finished state. You cannot force-unlock until you are certain the process is dead. And once you do recover, you do not actually know which resources got modified and which did not, because the provider did not emit progress and the kill landed in the middle of a state write.

We run these recovery engagements often enough that the script above is templated. The no-timeout custom provider pattern shows up in maybe one in five of the Terraform recoveries we have done this year, almost always with internal providers written years ago by an engineer who has since left. The fix is mechanical once you know the shape of the failure: confirm process death, force-unlock, refresh-only plan, diff state against cloud, reconcile with state rm and import, then patch the provider so it cannot happen again.

If you are staring at a hung apply right now and you are not sure whether to kill it, book an infrastructure review with our team and we will be on a bridge with you the same day. If the apply is already dead and you are sorting through the wreckage, the same engagement covers the state reconciliation and the provider fix together.

Originally published at https://infraforge.agency/insights/terraform-apply-hung-custom-provider-no-timeout/.


Grafana 'No Data' after migration: 7 reconcilers we had to kill first

Muhammad Hassaan Javed — Thu, 21 May 2026 21:13:36 +0000

The first fix lasted 90 seconds. We had corrected the Grafana datasource URL from prometheus:9999 back to prometheus:9090, watched the pod roll, refreshed the dashboard, and seen one panel come alive. By the time we opened a second tab, the ConfigMap was back to 9999. That was the real incident. The 'No Data' dashboards were a symptom of an observability stack that someone, or something, was actively re-corrupting from at least seven places we had not yet found.

Problem signals:

Grafana dashboards show 'No Data' on every panel after a cluster migration, and kubectl edit fixes revert within 1-3 minutes
Prometheus targets page is empty or stuck on a namespace that does not exist anymore
ClusterRoleBindings you just recreated reference a ClusterRole name nobody on the team typed
ps aux shows kworker-looking processes with elevated CPU that hold open file descriptors to a kubeconfig
kubectl get cronjobs -A shows entries in namespaces nobody on the platform team remembers creating

Why we stopped fixing config and started looking for what was undoing it

The fix that lasted 90 seconds

The team that called us had been at this for nine hours. After a cluster migration, every Grafana dashboard was blank. The on-call had walked through the obvious things. The Prometheus datasource in Grafana pointed at port 9999. The Loki datasource pointed at port 3199. The Prometheus scrape config had annotation keys nobody recognized (prometheus_io_metrics_enabled instead of prometheus.io/scrape) and targeted a namespace that did not exist. The Grafana deployment had a config-validator init container running sleep 3600. Each one of those was a real bug. Each one of those, fixed in isolation, would revert before the next pod rolled out.

The shape of what they were describing was not a botched migration. A botched migration leaves bad state. This was bad state being re-applied. When manual kubectl edits revert in minutes, the question is no longer 'what is wrong with the manifest', it is 'what process has write access and is reconciling against a corrupt source of truth'. We told them to stop fixing config until we had inventoried every actor that could write to the cluster.

This sounds obvious written down. In the middle of an incident, with executives asking for an ETA on dashboards, the instinct is to keep patching. We have run this play enough times now to know the patching never converges. You burn three more hours and your changes still revert. The only path out is persistence-first triage.

Seven places state was being rewritten from

A kworker thread holding a kubeconfig

We started on the nodes. ps auxf on each worker showed a process named [kworker/u8:2-events_unbound]. Square brackets usually mean a kernel thread, and you learn early not to touch kernel threads. We almost moved on. The thing that snagged our attention was CPU: a real kernel worker thread on an idle-ish node should not be sitting at 12 percent. We pulled its open file descriptors.

$ ls -l /proc/$(pgrep -f 'kworker/u8:2')/fd/ 2>/dev/null | head
lr-x------ 1 root root 64 ... 3 -> /root/.kube/config
lrwx------ 1 root root 64 ... 7 -> socket:[884213]
lr-x------ 1 root root 64 ... 9 -> /opt/.reconciler/state.json
$ cat /proc/$(pgrep -f 'kworker/u8:2')/comm
kworker/u8:2-events_unbound
$ readlink /proc/$(pgrep -f 'kworker/u8:2')/exe
/opt/.reconciler/agent

Kernel threads do not hold kubeconfigs or have an exe link. This was a userspace binary with a spoofed comm name.
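
The same test can be run across every PID on a node. This is a minimal Python sketch of the idea, assuming a Linux /proc and root privileges (without root, the exe link of other users' processes is unreadable and would be missed). The list of kernel-looking name prefixes is an assumption for illustration and is far from exhaustive.

```python
import os

# Name prefixes that usually belong to kernel threads (illustrative, not complete).
KERNEL_LOOKING = ("kworker/", "ksoftirqd/", "kthreadd", "flush-", "mm_percpu_wq")


def looks_spoofed(comm: str, exe) -> bool:
    """A kernel-thread-looking name with a resolvable exe link is userspace."""
    return exe is not None and comm.startswith(KERNEL_LOOKING)


def read_proc(pid: int, proc_root: str = "/proc"):
    """Return (comm, exe) for a pid; exe is None when the link cannot be read."""
    base = os.path.join(proc_root, str(pid))
    with open(os.path.join(base, "comm")) as fh:
        comm = fh.read().strip()
    try:
        exe = os.readlink(os.path.join(base, "exe"))
    except OSError:
        exe = None  # real kernel threads have no executable behind them
    return comm, exe


def scan(proc_root: str = "/proc"):
    """List (pid, comm, exe) for processes that look like spoofed kernel threads."""
    hits = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            comm, exe = read_proc(int(entry), proc_root)
        except OSError:
            continue  # process exited between listdir and read
        if looks_spoofed(comm, exe):
            hits.append((int(entry), comm, exe))
    return hits


if __name__ == "__main__":
    for pid, comm, exe in scan():
        print(pid, comm, exe)
```

An empty result does not prove a node is clean; it only rules out this one disguise.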

That was reconciler one. The same trick was on every node, with comm names rotating through plausible kernel-thread names (flush-dm-0, mm_percpu_wq). We collected the binary, killed every instance, removed the systemd unit that was respawning it, and moved on. Then we did the boring sweep nobody wants to do in the middle of an incident.

kubectl get cronjobs -A surfaced config-audit in kube-system and prometheus-metrics-federation in cattle-monitoring-system. Neither was ours. Both ran every 60 seconds and wrote ConfigMaps. Reconcilers two and three.
systemctl list-timers on each node showed k8s-health-monitor.timer firing every two minutes against the API server with a node-local kubeconfig. Four.
ls /etc/cron.d/ had a host cron entry running a script under /opt/.reconciler/ once a minute as a belt-and-braces backup to the systemd timer. Five.
kubectl get mutatingwebhookconfigurations turned up the one that hurt us most, rbac-policy-enforcer, which rewrote roleRef on any ClusterRoleBinding we created. Six.
The second mutating webhook, pod-policy-webhook, injected the config-validator init container running sleep 3600 into every pod admitted to the monitoring namespace. Seven.

Seven writers. Any one of them, left running, would have re-corrupted the stack within minutes of our fixes. Four more findings wrote nothing at all; they existed to make the cleanup slow.

namespace-policy-webhook, a validating webhook, rejected our edits to the cattle-monitoring-system namespace object outright.
chattr +i was set on /etc/cron.d/k8s-health and on the corrupted ConfigMap manifests staged on disk. Edits failed with 'operation not permitted'.
Finalizers on the CronJobs prevented kubectl delete from completing until we patched them off.
PodSecurity labels on cattle-monitoring-system were set to enforce a baseline that blocked our debug pods from running.

Some teams have a reconciler. This cluster had a mesh of them, each one a backup for the others, with a layer of obstacles bolted on top to slow anyone trying to unpick it. That is not a thing healthy infrastructure does; it is a thing a previous incident or a hostile takeover does. Either way, the response is the same.

The order we neutralized things, and why order matters

Why we deleted the webhooks before touching RBAC

There is a trap in this kind of cleanup. If you fix the visible problem before you neutralize the actor reverting it, you have wasted a fix and burned credibility with the room. The worst version of this in our case was the RBAC webhook. The Prometheus ClusterRoleBinding had been deleted entirely, and the deployment had been swapped to the default service account. The obvious move was to recreate the CRB and patch the deployment back to a proper SA.

We tried it once, in a scratch namespace, just to see. The CRB came back with roleRef pointing at a ClusterRole that did not exist. The mutating webhook was matching anything with 'prometheus' or 'monitoring' in the name and silently rewriting the roleRef. If we had run that against the real CRB in production with the team watching, we would have looked like we did not know what we were doing, and the fix would not have worked.

Neutralize first, then fix. RBAC and any 'monitoring'-named resource go last because the webhook would mutate them on creation.

So the order was: strip finalizers from the CronJobs, chattr -i on the immutable files, delete the three webhook configurations, suspend and delete the CronJobs in kube-system and cattle-monitoring-system, mask the systemd timer, remove the host cron entry, kill the userspace reconciler processes on every node and remove their systemd unit. Then we sat for 60 seconds and watched. No ConfigMap mutations. No Deployment patches. Quiet cluster. That was the first time in nine hours the cluster had been quiet, and you could feel the room exhale.

Restoring the observability stack once writes were ours alone

The order we put it back together

With the reconcilers gone, the config fixes were the easy part. We did them top-down by data flow: scrape config, then service routing, then the consumers.

# 1. Prometheus ConfigMap: restore annotation keys, fix namespace, drop interval
kubectl -n monitoring get cm prometheus-config -o yaml > /tmp/prom-cm.yaml
# edit: prometheus_io_metrics_* -> prometheus.io/scrape, /metrics, port
#       namespaces: [bleater-nonexistent] -> the real app namespace
#       scrape_interval: 300s -> 30s
kubectl apply -f /tmp/prom-cm.yaml

# 2. Prometheus Service: targetPort 9099 -> 9090
kubectl -n monitoring patch svc prometheus --type=json \
  -p='[{"op":"replace","path":"/spec/ports/0/targetPort","value":9090}]'

# 3. Service account and RBAC (webhooks already deleted)
kubectl -n monitoring create sa prometheus
kubectl create clusterrolebinding prometheus \
  --clusterrole=prometheus --serviceaccount=monitoring:prometheus
kubectl -n monitoring set serviceaccount deploy/prometheus prometheus

# 4. Prometheus readiness probe: port 9099 /-/healthz -> 9090 /-/ready
# 5. Loki: drop -server.http-listen-port=3199 arg, fix svc selector loki-server -> loki
# 6. Grafana: remove init container, fix probe ports, drop GF_SERVER_HTTP_PORT,
#    fix volume refs (-v2 -> base name), reset admin secret
# 7. Delete NetworkPolicy grafana-egress-restrict
kubectl -n monitoring delete networkpolicy grafana-egress-restrict

We applied these as separate kubectl operations on purpose, not a single helm rollout, so we could verify each one stuck before moving on.

After every step we waited 30 seconds and re-read the resource. Nothing reverted. We rolled the Grafana deployment, watched it come up clean with no init container blocking startup, hit the Prometheus targets page and saw 11 active up series including the application pods, then loaded a dashboard. Data. The two-minute stability window passed with no drift. We held the bridge for another 20 minutes anyway, because the team needed to see it not break more than they needed us to leave.
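
The wait-and-re-read step can be automated. The sketch below is a small Python polling helper, assuming the caller supplies a function that reads the live value; stays_put is a hypothetical name, and the clock and sleep parameters are injectable only so it can be tested without waiting.

```python
import time


def stays_put(read_value, expected, window_s=30.0, interval_s=5.0,
              clock=time.monotonic, sleep=time.sleep):
    """Re-read a value for window_s seconds; return False the moment it drifts."""
    deadline = clock() + window_s
    while True:
        if read_value() != expected:
            return False  # something rewrote the resource
        if clock() >= deadline:
            return True   # value held for the whole window
        sleep(interval_s)


if __name__ == "__main__":
    # Stand-in reader; a real one would query the cluster for the patched field.
    print(stays_put(lambda: 9090, 9090, window_s=0.2, interval_s=0.05))
```

For the Service fix above, the reader could shell out to kubectl -n monitoring get svc prometheus -o jsonpath='{.spec.ports[0].targetPort}' and compare the result against 9090.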

Persistence-first triage is now the default for post-migration observability failures

What we changed in our own playbook

We have changed how we open any incident where fixes do not stick. The first 15 minutes are no longer spent on config. They are spent on a sabotage sweep: cronjobs in every namespace (not just the obvious ones, cattle-monitoring-system bit us and we have seen it bite others), systemd timers on every node, /etc/cron.d, validating and mutating webhooks, finalizers on resources we expect to delete, immutable file attributes on staged manifests, and a ps auxf on every node with an eye on anything in square brackets that has an exe link.

We also changed how we think about kubectl edit during a live incident. If a change has to land and the cluster has any chance of having a reconciler we have not yet found, we apply through git and watch the apply, not edit the live object. It is slower by 90 seconds and saves you from spending an hour wondering why your fix evaporated. We have written more on the same instinct in our notes on Kubernetes release failures and on ArgoCD self-heal traps, which is the friendly version of this same pattern.

The non-obvious lesson from this incident is that hostile or accidental reconcilers do not announce themselves. The kworker spoof was the cleverest piece; it would have survived a casual ps. The cattle-monitoring-system namespace looked legitimate to anyone who had ever run Rancher. The webhook had a name (rbac-policy-enforcer) that sounded like something a security team would install. In each case the move that surfaced it was boring: enumerate the category exhaustively, then ask which entries the team can account for. Anything they cannot account for is the answer.
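
That enumerate-then-account step reduces to a set difference per category. Here is a minimal Python sketch, assuming the inventory has already been collected into plain lists. The category names and the monitoring/grafana-backup entry are made up for illustration; the other entries are the ones found in this incident.

```python
def sweep_report(found_by_category, known_by_category):
    """Return, per category, the entries nobody on the team has accounted for."""
    report = {}
    for category, found in found_by_category.items():
        known = set(known_by_category.get(category, ()))
        unaccounted = sorted(set(found) - known)
        if unaccounted:
            report[category] = unaccounted
    return report


if __name__ == "__main__":
    found = {
        "cronjobs": [
            "kube-system/config-audit",
            "cattle-monitoring-system/prometheus-metrics-federation",
            "monitoring/grafana-backup",  # hypothetical legitimate job
        ],
        "mutatingwebhooks": ["rbac-policy-enforcer", "pod-policy-webhook"],
    }
    known = {"cronjobs": ["monitoring/grafana-backup"]}
    for category, names in sweep_report(found, known).items():
        print(category, names)
```

The allowlist is the valuable part. It has to come from the team, not from the cluster, or the hostile entries vouch for themselves.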

When fixes revert, the problem is not the fix

If your post-migration monitoring keeps un-fixing itself

The hard part of incidents like this is not the Prometheus annotation key or the Grafana port. Those take 20 minutes once the cluster stops fighting you. The hard part is having the discipline to stop patching and inventory every actor that can write to your cluster, especially when leadership is asking for an ETA and your instinct is to keep typing. The hard part is also knowing what the categories of reconciler are. If you have never had to look for a mutating webhook that rewrites RBAC, or a host process pretending to be a kworker, the search takes hours. If you have seen it before, it takes 15 minutes.

We run these recovery engagements every week. We have seen the kworker spoof twice this year, the cattle-monitoring-system CronJob trick three times, and the RBAC-mutating webhook in two unrelated post-migration incidents. The playbook is portable; the patience to run it before patching is the part teams in the middle of an outage struggle with, and that is usually why they call us.

If your dashboards are blank after a migration and your fixes are not sticking, book an infrastructure review with our team and we will be on a bridge with you the same day. Bring node SSH access, kubectl with cluster-admin, and a list of every namespace you can name. We will handle the rest.

Originally published at https://infraforge.agency/insights/grafana-no-data-after-migration-reconcilers-reverting-fixes/.


When MinIO Deny Wins Cause Silent Upload Failure

Muhammad Hassaan Javed — Thu, 21 May 2026 01:44:45 +0000

The dashboards were green. The api-gateway logged 12,400 successful media POSTs over six hours, the storage service SDK reported 200 on every PutObject, and the fanout queue happily processed every notification. The MinIO bucket had gained zero new objects in the same window. Users were seeing broken image tiles in their feeds and the on-call team had spent three hours chasing the fanout service because that was the only place the symptom was visible. The actual problem was an explicit Deny on s3:PutObject sitting inside a bucket policy that had been written during a security hardening sprint two days earlier and rolled out to the cluster at 02:14 that morning, and MinIO was doing exactly what S3 IAM semantics say it should do: deny wins, even when the user policy says Allow.

Problem signals:

Upload endpoints return HTTP 200 but the object never appears in the bucket
The app publishes upload events on its own success path and downstream consumers process phantom events
Grafana shows upload throughput as healthy because SDK success metrics dominate the panel
Users report broken image links while every service-level dashboard is green
A recent IAM or bucket policy change correlates in time with the start of phantom uploads

The discrepancy that should have been the first alert

12,400 successful uploads, zero new objects

We came in on the third hour of the incident. The team had been chasing the fanout consumer because user reports were all of the form 'my avatar is broken' and the only service touching media after upload was fanout. Their working theory was that fanout was racing the CDN, or that the notification payload was missing a key, or that signed URLs were expiring early. They had three engineers staring at fanout-service logs and finding nothing wrong, because there was nothing wrong with fanout-service.

The question we asked, which is the question we always ask first when an upload pipeline misbehaves: how many objects has the bucket actually gained in the last hour? Not how many uploads the API recorded. Not how many notifications fanout received. How many real objects exist now that did not exist sixty minutes ago. We ran the listing against the MinIO admin API and the answer was zero. The bucket had not gained a single object since 02:14 that morning, which lined up almost exactly with the rollout of a security hardening PR the platform team had merged two days prior. The policy was written on the Tuesday; it only reached the cluster when the nightly config sync ran at 02:14.

# count objects added in the last hour
mc find local/bleater-media --newer-than 1h | wc -l
# 0

# meanwhile the storage-service success counter
curl -s http://prometheus/api/v1/query \
  --data-urlencode 'query=sum(increase(storage_service_put_object_success_total[1h]))'
# {"status":"success","data":{"result":[{"value":[..., "2074"]}]}}

Two views of the same hour. The SDK was confident. The bucket was not.

Once we had that gap on a shared screen the room changed. The fanout investigation got paused. The new question was: why is the SDK reporting success for writes that never persisted?

Where the 200 came from when the object never landed

What the SDK thought, and what the server actually did

This is the part of the story that is worth understanding even if you never touch MinIO. MinIO authorizes a request in its auth handler, before the body is ever processed, so every one of those PutObject calls came back as a clean 403 AccessDenied. The server was never confused. The storage service was. Its upload handler pushed the PutObject onto a background goroutine so the HTTP response would not wait on the write, and the error that goroutine returned was assigned to a variable nobody checked; the handler returned 200 as soon as it had read the bytes off the client. Worse, the event fanout consumes was never a MinIO bucket notification at all. The storage service published it to RabbitMQ on that same success path, right after accepting the bytes and long before anything was confirmed persisted. The 403 sat in MinIO's audit log the whole time. Nobody was reading the audit log.
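The failure mode above can be sketched in a few lines. This is a minimal Python illustration, not the storage service's actual code (which the text implies is Go): `put_object`, `handle_upload`, the in-memory `store` and the `events` list are all hypothetical stand-ins. The point it shows is ordering: the event is published only after the background write has reported its result.

```python
from concurrent.futures import ThreadPoolExecutor


class AccessDenied(Exception):
    """Stand-in for the 403 AccessDenied the object store returns."""


def put_object(store, key, data, deny=False):
    # Hypothetical storage client: raises on a server-side deny.
    if deny:
        raise AccessDenied(key)
    store[key] = data


def handle_upload(store, events, key, data, deny=False):
    """Return an HTTP-like status; publish the event only after a confirmed write."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(put_object, store, key, data, deny)
        try:
            future.result(timeout=5)  # re-raises whatever the write raised
        except AccessDenied:
            return 403                # nothing persisted, so no event
    events.append(key)                # success path runs only after the write
    return 200


store, events = {}, []
print(handle_upload(store, events, "avatars/ok.jpg", b"bytes"))
print(handle_upload(store, events, "avatars/denied.jpg", b"bytes", deny=True))
print(events)
```

Waiting on the result costs the latency the background write was meant to hide. If the response genuinely cannot wait, the same rule still applies: publish the event from the write's completion path, not from the request path.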

Enabling the MinIO audit target was the diagnostic turn. Two commands and the lie unwound itself.

mc admin config set local audit_webhook:1 \
  endpoint="http://collector:8080/minio-audit" enable=on
mc admin service restart local

# tail the collector for a few seconds
# {"api":{"name":"PutObject","bucket":"bleater-media",
#        "object":"avatars/u-83421.jpg","status":"AccessDenied",
#        "statusCode":403},
#  "requestClaims":{"accessKey":"storage-service"},
#  "error":{"message":"Access Denied.",
#           "source":["cmd/auth-handler.go:checkRequestAuthTypeCredential"]}}

Audit log showed 403 AccessDenied on every PutObject from the storage-service identity. The upload handler threw it away.

The storage-service identity had a user policy that explicitly granted s3:PutObject on arn:aws:s3:::bleater-media/*. We confirmed this in two seconds. Which meant the deny had to be coming from somewhere else.

The bucket policy nobody had read since the hardening PR

Where the explicit Deny was hiding

MinIO, like S3, evaluates IAM in two layers. The user (or service account) policy attached to the identity is one layer. The bucket policy attached to the resource is the other. An explicit Deny in either layer overrides any Allow in either layer. The hardening PR had added a bucket policy intended to lock down a different identity, an analytics reader that had been overprovisioned, and the author had scoped the Deny with NotPrincipal, which inverted it. The effective rule said: deny s3:PutObject on this bucket for everyone who is not the analytics-reader identity. Which of course included the storage service.

curl -s -u "$ADMIN:$SECRET" \
  "http://minio:9000/minio/admin/v3/get-bucket-policy?bucket=bleater-media" \
  | jq .

# {
#   "Version": "2012-10-17",
#   "Statement": [
#     {
#       "Sid": "RestrictWritesToAnalyticsReader",
#       "Effect": "Deny",
#       "NotPrincipal": { "AWS": ["arn:aws:iam:::user/analytics-reader"] },
#       "Action": ["s3:PutObject"],
#       "Resource": ["arn:aws:s3:::bleater-media/*"]
#     }
#   ]
# }

The bucket policy that swallowed every write. NotPrincipal with Deny is a footgun in any S3-compatible IAM.

We have seen NotPrincipal misused in three separate engagements this year. It reads as if it means 'apply this rule to everyone except this principal' the same way a NotAction would, but the semantics interact badly with cross-account and service-account identities. If you are writing a Deny that you want scoped to a specific identity, write the Deny with Principal naming the identity you mean to block. Do not invert it. The blast radius of a wrong inversion is the entire bucket.
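To make the inversion concrete, here is a deliberately simplified evaluator, a sketch rather than MinIO's or AWS's real policy engine. It ignores resources, action wildcards and bucket-policy Allows, and the `storage-service` ARN is an assumption modeled on the analytics-reader ARN in the policy above. It only demonstrates the two rules discussed here: explicit Deny wins, and NotPrincipal applies the statement to everyone not listed.

```python
ANALYTICS = "arn:aws:iam:::user/analytics-reader"
STORAGE = "arn:aws:iam:::user/storage-service"  # hypothetical ARN for illustration


def principal_matches(stmt, identity):
    """True if the statement applies to `identity`."""
    if "NotPrincipal" in stmt:
        return identity not in stmt["NotPrincipal"]["AWS"]
    principals = stmt.get("Principal", {}).get("AWS", [])
    return "*" in principals or identity in principals


def is_allowed(identity, action, user_allows, bucket_policy):
    """An explicit Deny in the bucket policy beats any Allow in the user policy."""
    for stmt in bucket_policy["Statement"]:
        if action in stmt["Action"] and principal_matches(stmt, identity):
            if stmt["Effect"] == "Deny":
                return False
    return action in user_allows


inverted = {"Statement": [{"Effect": "Deny",
                           "NotPrincipal": {"AWS": [ANALYTICS]},
                           "Action": ["s3:PutObject"]}]}
scoped = {"Statement": [{"Effect": "Deny",
                         "Principal": {"AWS": [ANALYTICS]},
                         "Action": ["s3:PutObject"]}]}

print(is_allowed(STORAGE, "s3:PutObject", {"s3:PutObject"}, inverted))
print(is_allowed(STORAGE, "s3:PutObject", {"s3:PutObject"}, scoped))
```

The two policies differ by a single key, and the set of identities they block is the exact complement of each other.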

Before we touched anything we wanted to rule out the obvious adjacent causes, because removing a security-hardening policy at 08:15 without confirmation is the kind of fix that becomes its own incident. We checked credential expiry on the storage-service service account (valid for another 47 days), checked network policy for any new egress restrictions from the storage-service namespace (none), and confirmed bucket versioning was off so we were not chasing delete markers. The audit log had already told us the answer; we just wanted the rollback to be unambiguous when we wrote it up.

The four-minute patch and the queue we had to reconcile

Removing the Deny without re-opening the bucket

Two questions before patching. First, did we want to fix the bucket policy in place, or revert the hardening PR entirely? We chose patch in place. The hardening PR had also tightened three other identities correctly, and reverting would have undone work that was real. Second, did we want to leave the analytics-reader restriction in some form? Yes, but written correctly. We rewrote the statement as an explicit Deny on the analytics-reader principal for write actions, which is what the author had intended.

cat > /tmp/bleater-media-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BlockAnalyticsReaderWrites",
      "Effect": "Deny",
      "Principal": { "AWS": ["arn:aws:iam:::user/analytics-reader"] },
      "Action": ["s3:PutObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::bleater-media/*"]
    }
  ]
}
EOF

curl -s -u $ADMIN:$SECRET \
  -X PUT \
  --data-binary @/tmp/bleater-media-policy.json \
  "http://minio:9000/minio/admin/v3/set-bucket-policy?bucket=bleater-media"

# validate with a real write from the storage-service identity
curl -s -X PUT -T /tmp/canary.bin \
  -H "Authorization: ...storage-service-sigv4..." \
  http://minio:9000/bleater-media/canary/$(date +%s).bin

mc ls local/bleater-media/canary/ | tail -1
# [2024-...] 4.0KiB STANDARD 1717420831.bin

Replace the inverted NotPrincipal with an explicit Principal Deny, then prove with a canary that the storage-service identity can write.

The canary landed. Real uploads from the application resumed within the next minute as new requests came in. That fixed the forward path. It did not fix the past six hours.

The phantom notification problem was harder to bound. The fanout service had processed roughly 12,400 notification events for objects that did not exist, which meant roughly 12,400 timeline references pointed at media that would 404 forever. We pulled the notification log from the RabbitMQ stream and diffed it against the actual object listing in the bucket. The exact count of phantom references came in at 12,387. We pushed a one-shot reconciliation job that re-emitted upload prompts to the affected users for any media uploaded in that window, because we had no way to recover the original bytes; the storage service had dropped them the moment MinIO rejected the write and nothing upstream had buffered a copy.
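The diff itself is a set difference. A minimal sketch, assuming the notification log and the bucket listing have each been reduced to a list of object keys (the function name and the sample keys other than the audit-log avatar are made up for illustration):

```python
def phantom_references(notified_keys, bucket_keys):
    """Keys that were announced downstream but do not exist in the bucket."""
    return sorted(set(notified_keys) - set(bucket_keys))


notified = ["avatars/u-83421.jpg", "avatars/u-1.jpg", "avatars/u-2.jpg"]
in_bucket = ["avatars/u-1.jpg"]
print(phantom_references(notified, in_bucket))
```

Using sets also collapses duplicate notifications for the same key, which matters when the count is going to be quoted in a postmortem.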

MinIO answered 403 on every write. The storage service dropped the error and published the event anyway.

What we changed so the next deny-wins conflict is not silent

The synthetic that would have caught this in 90 seconds

The deeper lesson here is not about MinIO. It is that application-reported success and server persistence are different facts, and most observability stacks conflate them. Every metric on the storage service dashboard came from the upload handler's success path, which never looked at the result of the write. Every metric on the fanout dashboard came from notification receipt. Nothing in the stack was sourced from the only ground truth that mattered, which was the count of objects actually present in the bucket. The hardening PR could have done much worse than this and we would still have been blind.

We made three changes after this incident. First, a synthetic that writes a canary object every 60 seconds and then lists the bucket to confirm the canary is there. The metric is the gap between writes and confirmed reads, and it alerts at gap greater than two intervals. This is the kind of probe we now build into every object-storage path we touch. Second, the MinIO audit webhook now ships to the log aggregation pipeline with a Loki alert rule on any sustained rate of statusCode 403 for PutObject, scoped per identity. Third, we wrote a pre-merge check for bucket policy changes that flags any statement using NotPrincipal with Effect Deny and requires an explicit reviewer sign-off.
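The first of those changes, the canary synthetic, reduces to two small functions. This is a sketch under stated assumptions: `write` and `list_keys` are injected callables standing in for whatever storage client the probe uses, and the constants mirror the 60-second cadence and two-interval threshold described above.

```python
INTERVAL_S = 60         # canary cadence
MAX_GAP_INTERVALS = 2   # alert when the gap exceeds this many intervals


def probe_once(write, list_keys, key):
    """Write a canary, then confirm it through a listing.

    Returns True only if the object is really there, regardless of
    what the write call itself reported.
    """
    try:
        write(key)
    except Exception:
        return False
    return key in list_keys()


def should_alert(now, last_confirmed_at):
    """True when no canary has been confirmed for more than two intervals."""
    return (now - last_confirmed_at) > MAX_GAP_INTERVALS * INTERVAL_S
```

The listing step is the important part. A probe that trusts the return value of `write` would have been as blind as the dashboards were.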

# Loki alert: deny-wins on PutObject for any service identity
- alert: MinioPutObjectDenied
  # Loki's json parser flattens nested keys with underscores, so
  # requestClaims.accessKey becomes the label requestClaims_accessKey.
  expr: |
    sum by (requestClaims_accessKey) (
      rate({job="minio-audit"}
        | json
        | api_name = "PutObject"
        | api_statusCode = "403"
        [5m])
    ) > 0
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "MinIO denying PutObject for {{ $labels.requestClaims_accessKey }}"
    runbook: "Check bucket policy and user policy for explicit Deny statements."

The alert that would have paged the on-call within five minutes of the hardening PR rolling out.
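The pre-merge check for bucket policy changes can be equally small. A minimal sketch, assuming the policy arrives as a JSON string; the function name is hypothetical and the wiring into CI is left out:

```python
import json


def risky_statements(policy_json):
    """Return the Sid (or index) of every Deny statement that uses NotPrincipal."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be wrapped in a list
        statements = [statements]
    return [
        stmt.get("Sid", f"<statement {i}>")
        for i, stmt in enumerate(statements)
        if stmt.get("Effect") == "Deny" and "NotPrincipal" in stmt
    ]


hardening_policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "RestrictWritesToAnalyticsReader",
        "Effect": "Deny",
        "NotPrincipal": {"AWS": ["arn:aws:iam:::user/analytics-reader"]},
        "Action": ["s3:PutObject"],
        "Resource": ["arn:aws:s3:::bleater-media/*"],
    }],
})
print(risky_statements(hardening_policy))
```

A non-empty result does not mean the policy is wrong, only that a human should read the statement before it merges.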

If your upload events drive downstream business logic, you have the same shape of risk we did. The event path and the persistence path are not the same path, and one unchecked error return is all it takes to decouple them. Assume nothing about server persistence based on SDK return codes. Read the audit log.

When a hardening PR silently revokes write access in production

If your object store is quietly lying to your monitors

This class of incident is hard for a specific reason: every monitoring surface a normal team has built reports healthy, because every normal monitoring surface reads from the layer above the failure. The teams we work with that have hit this pattern were not careless. They had dashboards, they had alerts, they had error budgets. None of those instruments were positioned to see a server-side deny that the upload handler swallowed. The fix is a small synthetic and an audit log alert, and they take an afternoon to build. Getting to the point of knowing you need them usually takes one bad incident.

We run object-storage and IAM recovery engagements often enough that this exact shape, a hardening PR introducing a deny-wins conflict against a service account, has come up three times this year on three different stacks (MinIO, Ceph RGW, and AWS S3 with an SCP). The mechanics are the same in all three. If your team is staring at green dashboards and broken user reports, the gap between reported success and ground-truth persistence is the first place to look. If you want a second set of eyes on a hardening rollout before it lands, or you are inside one of these incidents right now, book an infrastructure review with our team and we will be on a bridge with you the same day. We also document the audit-log and synthetic patterns in more depth on the infrastructure audit readiness page if you want to read ahead.

Originally published at https://infraforge.agency/insights/minio-deny-wins-silent-upload-failure/.

If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements — see /review.

ArgoCD Drift: Three Namespaces, One JWT Hotfix

Muhammad Hassaan Javed — Wed, 20 May 2026 22:09:53 +0000

The on-call team had been chasing a 30% 401 rate on profile-service for two hours when we got pulled in. Only profile-service, only some pods, only authenticated requests. The shape of that number is what had thrown them off: a 30% failure rate on a 3-pod deployment looks a lot like one pod out of three running a different config, so that is where they had been digging. It was not a pod problem. All three profile-service pods were identical, and 30% was simply the share of traffic that carried a token at all. What was underneath it was a week-old JWT key rotation hotfix that had landed in the live cluster and never made it to Git, plus ArgoCD auto-sync that had been disabled across three applications and quietly left off. By the time we opened a terminal there were four copies of the same ConfigMap floating around: one in Git and one in each of three namespaces, in three different states, and the only live copy that still agreed with Git was the stale one.

Problem signals:

A service is returning 401s on a stable fraction of requests, and the fraction tracks the share of traffic that carries a token rather than any pod ratio
ArgoCD shows applications as OutOfSync but auto-sync is disabled and nobody remembers turning it off
kubectl diff against the rendered Helm or Kustomize output shows changes nobody can attribute to a recent PR
Multiple namespaces have a propagated copy of the same ConfigMap and the copies disagree
A recent incident postmortem mentions a manual kubectl edit or kubectl patch that was never followed by a Git commit

The first 20 minutes: mapping how far the drift had spread

Four ConfigMaps, three different states

The initial theory from the on-call lead was that a pod had missed the last restart and was still holding the pre-rotation JWT public key. Reasonable theory. It was wrong, but only because it was incomplete.

We ran the obvious diff first. Pull the ConfigMap from each of the three namespaces, pull the manifest from the Git repo at HEAD, compare. What we expected to find was two values: a correct one in the cluster and a stale one in Git, or the reverse. What we actually found was four copies in three different states, with only the never-patched profile-service copy still matching Git.

# auth-service namespace
$ kubectl -n auth get cm auth-config -o jsonpath='{.data.JWT_ALGORITHM} {.data.JWT_PUBLIC_KEY_ID}'
RS256 key-2024-11-rot

# like-service namespace (propagated copy)
$ kubectl -n like get cm auth-config -o jsonpath='{.data.JWT_ALGORITHM} {.data.JWT_PUBLIC_KEY_ID}'
RS256 key-2024-09

# profile-service namespace (propagated copy)
$ kubectl -n profile get cm auth-config -o jsonpath='{.data.JWT_ALGORITHM} {.data.JWT_PUBLIC_KEY_ID}'
HS256 key-2024-09

# Git, main branch
$ grep -E 'JWT_(ALGORITHM|PUBLIC_KEY_ID)' deploy/*/auth-config.yaml
deploy/auth/auth-config.yaml:  JWT_ALGORITHM: HS256
deploy/auth/auth-config.yaml:  JWT_PUBLIC_KEY_ID: key-2024-09
# (and the same stale pair in like and profile manifests)

What the diff actually showed. Four copies of the same ConfigMap, three states.
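When there are more than two copies to compare, pairwise diffs get confusing fast. A small sketch that groups copies by their value pair instead; the values are the ones printed above, while the function name and the dictionary layout are illustrative assumptions:

```python
from collections import defaultdict


def group_by_state(copies):
    """Map each distinct (algorithm, key_id) pair to the locations holding it."""
    groups = defaultdict(list)
    for location, values in copies.items():
        groups[values].append(location)
    return dict(groups)


copies = {
    "auth": ("RS256", "key-2024-11-rot"),
    "like": ("RS256", "key-2024-09"),
    "profile": ("HS256", "key-2024-09"),
    "git": ("HS256", "key-2024-09"),
}

for state, locations in group_by_state(copies).items():
    print(state, locations)
```

Any location that shares a group with Git was never patched. Any location alone in its group is either the canonical hotfix or a botched copy of it, and the incident channel has to say which.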

The story behind the four states reconstructed quickly from the previous week's incident channel. During the rotation, an SRE had patched auth-service's ConfigMap directly with the new RS256 key. They then walked the change into the like-service namespace and got the algorithm right but typo'd the key ID, leaving the old one. They ran out of focus before reaching profile-service, intended to come back to it, and did not. ArgoCD auto-sync had been disabled across all three applications during the incident as a guardrail and never re-enabled. These applications run automated sync with selfHeal: true, so that toggle is the only reason the cluster state survived a week without self-heal reverting it back to the stale Git values. With auto-sync on and selfHeal off (the default), the manual patch would have survived too, and the applications would simply have sat OutOfSync.

So the 30% 401 rate had a clean explanation, and it had nothing to do with pods. Every profile-service pod was reading the same never-patched ConfigMap, so all three were validating tokens as HS256 against the old key ID while auth-service had moved to issuing RS256-signed tokens. Every request that carried a token failed. The requests that survived were the health checks, the static reads and the unauthenticated endpoints that never touch token validation, and on this service those are roughly seven requests in ten.
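The pod-ratio theory and the token-share theory predict different breakdowns of the same 30%, and an access log can tell them apart in one pass. A sketch with synthetic data: the field names `pod`, `has_token` and `status` are assumptions about the log shape, and the sample mirrors the roughly three-in-ten token share described above.

```python
from collections import Counter


def rate_401(requests, key):
    """401 rate per distinct value of `key` in the request records."""
    total, failed = Counter(), Counter()
    for r in requests:
        k = r[key]
        total[k] += 1
        failed[k] += r["status"] == 401
    return {k: failed[k] / total[k] for k in total}


# Synthetic sample: 3 pods, 10 requests each, 3 of which carry a token.
sample_requests = [
    {"pod": pod, "has_token": i < 3, "status": 401 if i < 3 else 200}
    for pod in ("profile-0", "profile-1", "profile-2")
    for i in range(10)
]

print(rate_401(sample_requests, "pod"))
print(rate_401(sample_requests, "has_token"))
```

A bad pod would show one pod failing everything and the others failing nothing. A config problem shared by all pods shows the same rate on every pod and a clean split on whether the request carried a token.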

The decision that almost broke production a second time

Why Git was the wrong source of truth

The instinct, when you find drift between Git and a cluster, is to trust Git. That is the whole point of GitOps. The pull request is the source of truth and the cluster is downstream. Run an ArgoCD sync, let it overwrite the live state, move on.

That instinct would have broken auth-service inside of 30 seconds. Git held the pre-rotation HS256 values. The new private key that auth-service was signing tokens with did not match the public key Git was about to push into the ConfigMap. A sync from Git would have invalidated every token in flight across all three services, not just the token-bearing 30% of profile-service traffic.

We had to invert the model. For this one incident, the auth-service namespace's live ConfigMap was the canonical truth, and Git was stale. The recovery had to flow live-to-Git first, then Git-to-cluster for the other two namespaces, and only then could auto-sync be turned back on. The order mattered.

Recovery flow. Live state was canonical for one application, Git was canonical after the commit for the other two.
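The ordering can be written down as a tiny planner, which is useful mostly as a checklist that refuses to sync before the commit exists. This is a sketch: the step names and the function are hypothetical, and the namespaces and values are the ones from the diff above.

```python
def recovery_plan(live, git, canonical):
    """Ordered steps: live-to-Git for the canonical copy, then Git-to-cluster."""
    steps = []
    if git != live[canonical]:
        # Git is stale relative to the canonical live copy: commit first.
        steps.append(("commit_to_git", canonical))
    # The canonical app syncs first; it should be a no-op that validates the commit.
    order = [canonical] + sorted(ns for ns in live if ns != canonical)
    steps += [("sync", ns) for ns in order]
    steps.append(("enable_auto_sync", "all"))
    return steps


live = {
    "auth": ("RS256", "key-2024-11-rot"),
    "like": ("RS256", "key-2024-09"),
    "profile": ("HS256", "key-2024-09"),
}
git = ("HS256", "key-2024-09")

for step in recovery_plan(live, git, canonical="auth"):
    print(step)
```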

How we got the canonical values into Git and synced the stragglers

Committing a live hotfix back to Git without breaking auth

The commit itself was unremarkable once we had a clear model. We pulled the auth-service ConfigMap, extracted the two fields, and updated all three manifests in the deploy repo in a single PR with a postmortem link in the description. The PR title was 'Hotfix reconcile: commit post-rotation JWT values from live state (incident #INC-441)' because future-us was going to want to know why these values arrived without an upstream change.

# 1. Export canonical values from auth-service namespace
KID=$(kubectl -n auth get cm auth-config -o jsonpath='{.data.JWT_PUBLIC_KEY_ID}')
ALG=$(kubectl -n auth get cm auth-config -o jsonpath='{.data.JWT_ALGORITHM}')

# 2. Patch the three manifests in the Git checkout, commit, push
for d in deploy/auth deploy/like deploy/profile; do
  yq -i ".data.JWT_PUBLIC_KEY_ID = \"$KID\" | .data.JWT_ALGORITHM = \"$ALG\"" "$d/auth-config.yaml"
done
git add deploy/auth deploy/like deploy/profile
git commit -m 'Reconcile JWT config from live auth-service (post-rotation hotfix, INC-441)'
git push

# 3. Trigger ArgoCD sync per application, in order
for app in auth-service like-service profile-service; do
  argocd app sync "$app" --prune=false
  argocd app wait "$app" --health --timeout 180
done

The commit and the sync sequence. auth-service syncs first as a no-op safety check before we touch the broken ones.

We synced auth-service first deliberately. It was already correct, so the sync should be a no-op. If it had shown a diff we did not expect, that was our signal to stop and re-audit before touching like-service or profile-service. It came back clean, which told us our commit matched the live state exactly. Then like-service synced and went healthy. Then profile-service synced and within 40 seconds the 401 rate in Prometheus went from 30% to 0.

Auto-sync we left off until the 401 rate had been at zero for ten minutes and we had eyes on the Jaeger traces showing fresh successful auth flows end to end. Only then did we re-enable auto-sync on all three applications, in the same order as the sync. We have written more about the order-of-operations on multi-app reconciles in the ArgoCD and GitOps recovery playbook.

Two cheap controls that prevent the next split-state week

What we changed about hotfix discipline after this one

The technical recovery was straightforward once the model was right. The interesting part of this incident was how a one-hour rotation hotfix turned into a week of latent drift. Two things had to go wrong together: a manual change that did not get committed, and an auto-sync toggle that did not get turned back on. On its own, the uncommitted manual change would have been reverted by the self-heal loop on these applications, which is the part of automated sync that actually reverts live drift back to Git, and the resulting breakage would have been noticed within the hour. On its own, the forgotten toggle would have had no live drift to protect. Together they let a split state survive a week.

We made two changes to the platform after this. The first was a scheduled job that lists ArgoCD applications with auto-sync disabled and posts to a channel if any of them have been in that state for more than four hours. It is twelve lines of bash around argocd app list -o json. It has caught the same pattern twice in the last quarter, both times within the same incident as the original change instead of a week later.

# Posted to platform-alerts when auto-sync has been off for >4h on any app
argocd app list -o json \
  | jq -r '.[] | select(.spec.syncPolicy.automated == null)
            | [.metadata.name, .status.operationState.finishedAt] | @tsv' \
  | awk -v cutoff="$(date -u -d '4 hours ago' +%FT%TZ)" '$2 < cutoff'

The auto-sync watchdog. The cheapest control with the highest ROI we shipped this year.
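The same filter reads more explicitly in Python, with real timestamp parsing and a decision about applications that have no recorded operation at all. This is a sketch over the JSON shape used by the jq filter above; like that filter, it uses the last operation's finishedAt as a proxy for how long auto-sync has been off, which is an approximation. The function name and sample apps are made up.

```python
from datetime import datetime, timedelta, timezone


def stale_manual_apps(apps, now, max_age=timedelta(hours=4)):
    """Names of apps with auto-sync off whose last operation is older than max_age."""
    stale = []
    for app in apps:
        policy = app.get("spec", {}).get("syncPolicy") or {}
        if policy.get("automated") is not None:
            continue  # auto-sync is on
        op_state = app.get("status", {}).get("operationState") or {}
        finished = op_state.get("finishedAt")
        if finished is None:
            stale.append(app["metadata"]["name"])  # no record: flag for review
            continue
        finished_at = datetime.fromisoformat(finished.replace("Z", "+00:00"))
        if now - finished_at > max_age:
            stale.append(app["metadata"]["name"])
    return stale


apps = [
    {"metadata": {"name": "a"}, "spec": {"syncPolicy": {"automated": {}}}},
    {"metadata": {"name": "b"}, "spec": {"syncPolicy": {}},
     "status": {"operationState": {"finishedAt": "2026-05-13T10:00:00Z"}}},
    {"metadata": {"name": "c"}, "spec": {},
     "status": {"operationState": {"finishedAt": "2026-05-20T10:30:00Z"}}},
    {"metadata": {"name": "d"}, "spec": {}},
]
now = datetime(2026, 5, 20, 12, 0, tzinfo=timezone.utc)
print(stale_manual_apps(apps, now))
```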

The second change was a rule we now apply to every incident we run: if a hotfix lands in the cluster via kubectl, the same incident does not close until the change is in a merged PR. Not the next day. Not 'we'll get to it'. The incident commander treats the Git commit as a recovery step, not a follow-up. That sounds like a process rule, and it is, but it has a sharp version: the on-call's runbook for manual ConfigMap patches now includes the export-and-PR commands at the bottom of the same page. The friction to do it right is now lower than the friction to defer it.

When the cluster and Git disagree and you cannot just sync your way out

If your GitOps is in a split state right now

The hard part of this kind of incident is not the kubectl or the argocd CLI. The hard part is figuring out which system is the source of truth for which field right now, when the answer is not 'Git, always'. Get that wrong and an ArgoCD sync will take production down a second time on top of whatever is already broken. We have seen the same shape of failure four times this year: a rotation, a migration, an emergency schema change, and a CRD upgrade, each of which left some subset of clusters carrying values that Git did not yet know about.

InfraForge runs these reconciles every week. We know the order to commit, the order to sync, the checks that catch a propagated copy you forgot about, and the questions to ask before you trust Git over the live state. If your auto-sync has been off for a week and you are not sure what would happen when you turn it back on, book an infrastructure review with our team and we will be on a bridge with you the same day to walk the drift before you touch anything.

Originally published at https://infraforge.agency/insights/argocd-drift-three-namespaces-jwt-configmap-hotfix/.