The on-call team had been chasing a 30% 401 rate on profile-service for two hours when we got pulled in. Only profile-service, only some pods, only authenticated requests. The shape of that number is what gave it away: a 30% failure rate, close to one in three, on a service backed by a 3-pod deployment is what you see when one pod out of three is running with a different config. Except it was not a config rollout in flight. It was a week-old JWT key rotation hotfix that had landed in the live cluster and never made it to Git, while ArgoCD auto-sync had been disabled across three applications and quietly left off. By the time we opened a terminal there were four versions of the same ConfigMap floating around: one in Git, three in three namespaces, none of them in agreement.
Problem signals:
- A service is returning 401s on a fraction of requests that matches a pod count ratio (about 33% on 3 pods, 25% on 4 pods)
- ArgoCD shows applications as OutOfSync but auto-sync is disabled and nobody remembers turning it off
- kubectl diff against the rendered Helm or Kustomize output shows changes nobody can attribute to a recent PR
- Multiple namespaces have a propagated copy of the same ConfigMap and the copies disagree
- A recent incident postmortem mentions a manual kubectl edit or kubectl patch that was never followed by a Git commit
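The ratio check in the first signal is simple arithmetic, and a small helper makes it quick to test against a pod count. This is a sketch only: `closest_pod_ratio` is a hypothetical name, not part of the incident tooling, and it assumes traffic is spread evenly across pods.

```shell
# closest_pod_ratio RATE_PCT POD_COUNT
# Prints the pod fraction K/N whose share of traffic is closest to the
# observed failure rate, plus how far off the observed rate is.
closest_pod_ratio() {
  awk -v rate="$1" -v n="$2" 'BEGIN {
    if (n < 1) exit 1                      # a pod count below 1 makes no sense
    best = 0; bestdiff = 101
    for (k = 0; k <= n; k++) {
      d = rate - (100 * k / n)
      if (d < 0) d = -d                    # absolute distance in percentage points
      if (d < bestdiff) { bestdiff = d; best = k }
    }
    printf "%d/%d (%.1f%%, off by %.1f points)\n", best, n, 100 * best / n, bestdiff
  }'
}
```

`closest_pod_ratio 30 3` reports 1/3, which is the pattern described above. A large "off by" value suggests the failures are not tied to a subset of pods.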
The first 20 minutes: mapping how far the drift had spread
Four ConfigMaps, four different values
The initial theory from the on-call lead was that a pod had missed the last restart and was still holding the pre-rotation JWT public key. Reasonable theory. It was wrong, but only because it was incomplete.
We ran the obvious diff first. Pull the ConfigMap from each of the three namespaces, pull the manifest from the Git repo at HEAD, compare. What we expected to find was two values: a correct one in the cluster and a stale one in Git, or the reverse. What we actually found was four.
# auth-service namespace
$ kubectl -n auth get cm auth-config -o jsonpath='{.data.JWT_ALGORITHM} {.data.JWT_PUBLIC_KEY_ID}'
RS256 key-2024-11-rot
# like-service namespace (propagated copy)
$ kubectl -n like get cm auth-config -o jsonpath='{.data.JWT_ALGORITHM} {.data.JWT_PUBLIC_KEY_ID}'
RS256 key-2024-09
# profile-service namespace (propagated copy)
$ kubectl -n profile get cm auth-config -o jsonpath='{.data.JWT_ALGORITHM} {.data.JWT_PUBLIC_KEY_ID}'
HS256 key-2024-09
# Git, main branch
$ grep -E 'JWT_(ALGORITHM|PUBLIC_KEY_ID)' deploy/*/auth-config.yaml
deploy/auth/auth-config.yaml: JWT_ALGORITHM: HS256
deploy/auth/auth-config.yaml: JWT_PUBLIC_KEY_ID: key-2024-09
# (and the same stale pair in like and profile manifests)
What the diff actually showed. Four states of the same ConfigMap.
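With more than a handful of namespaces, the same comparison can be looped. A minimal sketch, assuming the ConfigMap is named `auth-config` in every namespace as in the output above; `cm_state`, `drift_report`, and `distinct_states` are illustrative names.

```shell
# cm_state NAMESPACE
# Prints "ALGORITHM KEY_ID" from that namespace's auth-config ConfigMap.
cm_state() {
  kubectl -n "$1" get cm auth-config \
    -o jsonpath='{.data.JWT_ALGORITHM} {.data.JWT_PUBLIC_KEY_ID}'
}

# drift_report NAMESPACE...
# Prints one "namespace<TAB>state" line per namespace.
drift_report() {
  for ns in "$@"; do
    printf '%s\t%s\n' "$ns" "$(cm_state "$ns")"
  done
}

# distinct_states
# Reads drift_report output on stdin and prints how many different states it saw.
distinct_states() {
  cut -f2 | sort -u | wc -l | tr -d ' '
}
```

Running `drift_report auth like profile | distinct_states` and getting anything above 1 means the live copies disagree. The Git manifests still have to be compared separately.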
The story behind the four states reconstructed quickly from the previous week's incident channel. During the rotation, an SRE had patched auth-service's ConfigMap directly with the new RS256 key. They then walked the change into the like-service namespace and got the algorithm right but typo'd the key ID, leaving the old one. They ran out of focus before reaching profile-service, intended to come back to it, and did not. ArgoCD auto-sync had been disabled across all three applications during the incident as a guardrail and never re-enabled, which is the only reason the cluster state had survived a week without ArgoCD reverting it back to the stale Git values.
So the 30% 401 rate had a clean explanation. profile-service's pods had been restarted at some point and picked up the HS256 config from the unpatched ConfigMap. The auth-service was now issuing RS256-signed tokens. profile-service was trying to validate them as HS256 with the wrong key ID. The only requests that did not 401 were the ones that happened to skip the auth path entirely.
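One way to confirm which algorithm and key ID the issuer is really using is to decode the header of a token that is being rejected. The sketch below assumes GNU `base64` and a standard three-part JWT; `jwt_header` is a hypothetical helper. Nothing is verified here, so treat the output as a diagnostic, not as proof the token is valid.

```shell
# jwt_header TOKEN
# Prints the decoded header JSON of a JWT (the part before the first dot).
jwt_header() {
  seg=${1%%.*}                                # keep only the first segment
  seg=$(printf '%s' "$seg" | tr '_-' '/+')    # base64url alphabet -> standard base64
  case $(( ${#seg} % 4 )) in                  # restore the padding JWTs strip off
    2) seg="$seg==" ;;
    3) seg="$seg=" ;;
  esac
  printf '%s' "$seg" | base64 -d
}
```

Comparing the `alg` and `kid` fields printed by `jwt_header "$TOKEN"` against the consuming service's ConfigMap shows directly whether the validator and the issuer agree.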
The decision that almost broke production a second time
Why Git was the wrong source of truth
The instinct, when you find drift between Git and a cluster, is to trust Git. That is the whole point of GitOps. The pull request is the source of truth and the cluster is downstream. Run an ArgoCD sync, let it overwrite the live state, move on.
That instinct would have broken auth-service inside of 30 seconds. Git held the pre-rotation HS256 values. The new private key that auth-service was signing tokens with did not match the public key Git was about to push into the ConfigMap. A sync from Git would have invalidated every token in flight across all three services, not just 30% of them.
We had to invert the model. For this one incident, the auth-service namespace's live ConfigMap was the canonical truth, and Git was stale. The recovery had to flow live-to-Git first, then Git-to-cluster for the other two namespaces, and only then could auto-sync be turned back on. The order mattered.
Recovery flow. Live state was canonical for one application, Git was canonical after the commit for the other two.
<!-- mermaid source:
flowchart TD
A[auth-service live ConfigMap\nRS256, key-2024-11-rot\nCANONICAL] -->|1. Export and commit| B[Git: deploy/auth, deploy/like,\ndeploy/profile manifests updated]
B -->|2. ArgoCD sync auth-service\n no-op, already correct| C[auth-service Synced]
B -->|3. ArgoCD sync like-service\n fixes stale key ID| D[like-service Synced]
B -->|4. ArgoCD sync profile-service\n fixes algorithm + key ID| E[profile-service Synced]
E -->|5. Verify 401 rate = 0| F[Re-enable auto-sync on all 3 apps]
-->
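The inverted model can be enforced with a guard that refuses to sync while Git still disagrees with the canonical live values. This is a sketch under simplifying assumptions: the manifest is flat YAML with one `KEY: value` pair per line, and `manifest_value` and `git_matches_live` are hypothetical helper names.

```shell
# manifest_value FILE KEY
# Prints the value of the first "KEY: value" line in FILE (double quotes stripped).
manifest_value() {
  awk -v key="$2" '$1 == (key ":") { v = $2; gsub(/"/, "", v); print v; exit }' "$1"
}

# git_matches_live FILE NAMESPACE
# Exits 0 only if both JWT fields in the Git manifest equal the live ConfigMap
# in NAMESPACE. Prints the first mismatch to stderr and exits 1 otherwise.
git_matches_live() {
  for key in JWT_ALGORITHM JWT_PUBLIC_KEY_ID; do
    want=$(manifest_value "$1" "$key")
    live=$(kubectl -n "$2" get cm auth-config -o jsonpath="{.data.$key}")
    if [ "$want" != "$live" ]; then
      echo "BLOCK: $key is '$want' in $1 but '$live' live in namespace $2" >&2
      return 1
    fi
  done
}
```

Run as `git_matches_live deploy/auth/auth-config.yaml auth` before any sync. A non-zero exit means Git is still stale relative to the `auth` namespace and a sync would overwrite the post-rotation values.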
How we got the canonical values into Git and synced the stragglers
Committing a live hotfix back to Git without breaking auth
The commit itself was unremarkable once we had a clear model. We pulled the auth-service ConfigMap, extracted the two fields, and updated all three manifests in the deploy repo in a single PR with a postmortem link in the description. The PR title was 'Hotfix reconcile: commit post-rotation JWT values from live state (incident #INC-441)' because future-us was going to want to know why these values arrived without an upstream change.
# 1. Export canonical values from auth-service namespace
KID=$(kubectl -n auth get cm auth-config -o jsonpath='{.data.JWT_PUBLIC_KEY_ID}')
ALG=$(kubectl -n auth get cm auth-config -o jsonpath='{.data.JWT_ALGORITHM}')
# 2. Patch the three manifests in the Git checkout, commit, push
for d in deploy/auth deploy/like deploy/profile; do
  yq -i ".data.JWT_PUBLIC_KEY_ID = \"$KID\" | .data.JWT_ALGORITHM = \"$ALG\"" "$d/auth-config.yaml"
done
git add deploy/auth deploy/like deploy/profile
git commit -m 'Reconcile JWT config from live auth-service (post-rotation hotfix, INC-441)'
git push
# 3. Trigger ArgoCD sync per application, in order
for app in auth-service like-service profile-service; do
  # Stop at the first failure so a bad sync never reaches the next app
  argocd app sync "$app" --prune=false || break
  argocd app wait "$app" --health --timeout 180 || break
done
The commit and the sync sequence. auth-service syncs first as a no-op safety check before we touch the broken ones.
We synced auth-service first deliberately. It was already correct, so the sync should be a no-op. If it had shown a diff we did not expect, that was our signal to stop and re-audit before touching like-service or profile-service. It came back clean, which told us our commit matched the live state exactly. Then like-service synced and went healthy. Then profile-service synced and within 40 seconds the 401 rate in Prometheus went from 31% to 0.
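The no-op check can also be run before the sync instead of being inferred from it. This is a sketch, assuming an ArgoCD CLI version where `argocd app diff` exits 0 when there is no diff, 1 when a diff is found, and 2 on errors (check `argocd app diff --help` for your version); `expect_noop` is a hypothetical name.

```shell
# expect_noop APP
# Exits 0 if ArgoCD sees no difference between Git and live state for APP,
# 1 if there is a diff (printed to stderr), 2 if the diff command itself failed.
expect_noop() {
  out=$(argocd app diff "$1" 2>&1) && rc=0 || rc=$?
  case $rc in
    0) echo "$1: no diff, sync will be a no-op" ;;
    1) echo "$1: unexpected diff, stop and re-audit" >&2
       printf '%s\n' "$out" >&2
       return 1 ;;
    *) echo "$1: argocd app diff failed (exit $rc)" >&2
       return 2 ;;
  esac
}
```

Chaining it as `expect_noop auth-service && argocd app sync auth-service --prune=false` keeps the sync from running at all when the commit does not match the live state.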
Auto-sync we left off until the 401 rate had been at zero for ten minutes and we had eyes on the Jaeger traces showing fresh successful auth flows end to end. Only then did we re-enable auto-sync on all three applications, in the same order as the sync. We have written more about the order-of-operations on multi-app reconciles in the ArgoCD and GitOps recovery playbook.
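The ten-minute gate can be made mechanical. The sketch below assumes something else collects one 401-rate sample per line, for example one per minute from your metrics system (that collection step is not shown), and that `argocd app set --sync-policy automated` is how auto-sync gets re-enabled in your setup. `stable_zero` and `reenable_autosync` are hypothetical names.

```shell
# stable_zero MIN_SAMPLES
# Reads one numeric sample per line on stdin. Exits 0 only if there are at
# least MIN_SAMPLES lines and every one is literally zero (0, 0.0, 0.00, ...).
stable_zero() {
  awk -v min="$1" '
    $1 !~ /^0(\.0+)?$/ { bad = 1 }        # any non-zero or malformed sample fails
    END { exit (bad || NR < min) ? 1 : 0 }
  '
}

# reenable_autosync
# Turns auto-sync back on in the same order the apps were synced,
# stopping at the first failure.
reenable_autosync() {
  for app in auth-service like-service profile-service; do
    argocd app set "$app" --sync-policy automated || return 1
  done
}
```

With ten one-minute samples piped into `stable_zero 10`, `reenable_autosync` would only run after a clean ten-minute window. Blank or malformed samples count as failures, which errs on the side of leaving auto-sync off.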
Two cheap controls that prevent the next split-state week
What we changed about hotfix discipline after this one
The technical recovery was straightforward once the model was right. The interesting part of this incident was how a one-hour rotation hotfix turned into a week of latent drift. Two things had to go wrong together: a manual change that did not get committed, and an auto-sync toggle that did not get turned back on. Either one of those failing alone would have been caught within an hour by ArgoCD's reconciliation loop.
We made two changes to the platform after this. The first was a scheduled job that lists ArgoCD applications with auto-sync disabled and posts to a channel if any of them have been in that state for more than four hours. It is twelve lines of bash around argocd app list -o json. It has caught the same pattern twice in the last quarter, both times within the same incident as the original change instead of a week later.
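# Posted to platform-alerts when auto-sync has been off for >4h on any app.
# finishedAt is the end of the app's last sync operation, used here as a proxy
# for how long auto-sync has been off. date -d requires GNU date.
argocd app list -o json \
  | jq -r '.[] | select(.spec.syncPolicy.automated == null)
      | [.metadata.name, .status.operationState.finishedAt] | @tsv' \
  | awk -F'\t' -v cutoff="$(date -u -d '4 hours ago' +%FT%TZ)" '$2 < cutoff'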
The auto-sync watchdog. The cheapest control with the highest ROI we shipped this year.
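The watchdog output still has to reach a channel. This is a sketch of that last step, assuming a chat tool that accepts a JSON `{"text": ...}` payload on an incoming webhook and a hypothetical `ALERT_WEBHOOK_URL` environment variable. `format_alert` and `post_alert` are illustrative names, not the twelve-line script described above.

```shell
# format_alert
# Reads "app<TAB>timestamp" lines on stdin and prints one alert line.
# Prints nothing when stdin is empty, so a quiet run stays quiet.
format_alert() {
  awk -F'\t' '
    { sep = (NR > 1) ? ", " : ""; names = names sep $1 }
    END { if (NR) printf "auto-sync disabled for more than 4h on: %s\n", names }
  '
}

# post_alert TEXT
# Posts TEXT to the webhook in ALERT_WEBHOOK_URL (hypothetical variable).
# No JSON escaping is done, which is only safe for plain app names.
post_alert() {
  curl -fsS -X POST -H 'Content-Type: application/json' \
    -d "{\"text\": \"$1\"}" "$ALERT_WEBHOOK_URL"
}
```

Piping the watchdog output through `format_alert` and calling `post_alert` only when the result is non-empty keeps the channel silent unless an application is actually stuck with auto-sync off.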
The second change was a rule we now apply to every incident we run: if a hotfix lands in the cluster via kubectl, the same incident does not close until the change is in a merged PR. Not the next day. Not 'we'll get to it'. The incident commander treats the Git commit as a recovery step, not a follow-up. That sounds like a process rule, and it is, but it has a sharp version: the on-call's runbook for manual ConfigMap patches now includes the export-and-PR commands at the bottom of the same page. The friction to do it right is now lower than the friction to defer it.
When the cluster and Git disagree and you cannot just sync your way out
If your GitOps is in a split state right now
The hard part of this kind of incident is not the kubectl or the argocd CLI. The hard part is figuring out which system is the source of truth for which field right now, when the answer is not 'Git, always'. Get that wrong and an ArgoCD sync will take production down a second time on top of whatever is already broken. We have seen the same shape of failure four times this year: a rotation, a migration, an emergency schema change, and a CRD upgrade, each of which left some subset of clusters carrying values that Git did not yet know about.
InfraForge runs these reconciles every week. We know the order to commit, the order to sync, the checks that catch a propagated copy you forgot about, and the questions to ask before you trust Git over the live state. If your auto-sync has been off for a week and you are not sure what would happen when you turn it back on, book an infrastructure review with our team and we will be on a bridge with you the same day to walk the drift before you touch anything.
Originally published at https://infraforge.agency/insights/argocd-drift-three-namespaces-jwt-configmap-hotfix/.