ArgoCD Drift: Three Namespaces, One JWT Hotfix

#gitops #recovery #gitopsargocd

The on-call team had been chasing a 30% 401 rate on profile-service for two hours when we got pulled in. Only profile-service, only some pods, only authenticated requests. The shape of that number is what had thrown them off: a 30% failure rate on a 3-pod deployment looks exactly like one pod out of three running a different config, so that is where they had been digging. It was not a pod problem. All three profile-service pods were identical, and 30% was simply the share of traffic that carried a token at all. What was underneath it was a week-old JWT key rotation hotfix that had landed in the live cluster and never made it to Git, while ArgoCD auto-sync had been disabled across three applications and quietly left off. By the time we opened a terminal there were four copies of the same ConfigMap floating around, one in Git and one in each of three namespaces, and they did not agree.

Problem signals:

A service is returning 401s on a stable fraction of requests, and the fraction tracks the share of traffic that carries a token rather than any pod ratio
ArgoCD shows applications as OutOfSync but auto-sync is disabled and nobody remembers turning it off
kubectl diff against the rendered Helm or Kustomize output shows changes nobody can attribute to a recent PR
Multiple namespaces have a propagated copy of the same ConfigMap and the copies disagree
A recent incident postmortem mentions a manual kubectl edit or kubectl patch that was never followed by a Git commit

The first 20 minutes: mapping how far the drift had spread

Four ConfigMaps, three different states

The initial theory from the on-call lead was that a pod had missed the last restart and was still holding the pre-rotation JWT public key. Reasonable theory. It was wrong, but only because it was incomplete.

We ran the obvious diff first. Pull the ConfigMap from each of the three namespaces, pull the manifest from the Git repo at HEAD, compare. What we expected to find was two states: a correct one in the cluster and a stale one in Git, or the reverse. What we actually found was three, spread across four copies, with only the never-patched profile-service copy still matching Git.

# auth-service namespace
$ kubectl -n auth get cm auth-config -o jsonpath='{.data.JWT_ALGORITHM} {.data.JWT_PUBLIC_KEY_ID}'
RS256 key-2024-11-rot

# like-service namespace (propagated copy)
$ kubectl -n like get cm auth-config -o jsonpath='{.data.JWT_ALGORITHM} {.data.JWT_PUBLIC_KEY_ID}'
RS256 key-2024-09

# profile-service namespace (propagated copy)
$ kubectl -n profile get cm auth-config -o jsonpath='{.data.JWT_ALGORITHM} {.data.JWT_PUBLIC_KEY_ID}'
HS256 key-2024-09

# Git, main branch
$ grep -E 'JWT_(ALGORITHM|PUBLIC_KEY_ID)' deploy/*/auth-config.yaml
deploy/auth/auth-config.yaml:  JWT_ALGORITHM: HS256
deploy/auth/auth-config.yaml:  JWT_PUBLIC_KEY_ID: key-2024-09
# (and the same stale pair in like and profile manifests)

What the diff actually showed. Four copies of the same ConfigMap in three distinct states.
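Grouping the copies by value makes it obvious which ones agree and which are on their own. The sketch below assumes the lookups above have been collapsed into lines of the form `source algorithm key_id`; the helper name `group_states` and that input format are illustrative and were not part of the tooling used during the incident.

```shell
# group_states: read "source algorithm key_id" lines on stdin and print one
# line per distinct (algorithm, key_id) pair, followed by the sources that
# hold it. Hypothetical helper; the input format is an assumption.
group_states() {
  awk '
    {
      state = $2 " " $3
      if (state in holders) {
        holders[state] = holders[state] "," $1
      } else {
        holders[state] = $1
        order[++n] = state          # keep first-seen order for stable output
      }
    }
    END {
      for (i = 1; i <= n; i++) print order[i] ": " holders[order[i]]
    }
  '
}

# Usage, with the values from the diff above:
#   printf '%s\n' \
#     'auth RS256 key-2024-11-rot' \
#     'like RS256 key-2024-09' \
#     'profile HS256 key-2024-09' \
#     'git HS256 key-2024-09' | group_states
```

Any source that shares a line with `git` was never patched; any source on a line by itself needs an explanation before anything gets synced.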

The story behind the four copies was quick to reconstruct from the previous week's incident channel. During the rotation, an SRE had patched auth-service's ConfigMap directly with the new RS256 key. They then walked the change into the like-service namespace and got the algorithm right but typo'd the key ID, leaving the old one. They ran out of focus before reaching profile-service, intended to come back to it, and did not. ArgoCD auto-sync had been disabled across all three applications during the incident as a guardrail and never re-enabled. These applications run automated sync with selfHeal: true, so that toggle is the only reason the cluster state survived a week without self-heal reverting it to the stale Git values. With auto-sync on and selfHeal off (the default), the manual patch would have survived too, and the applications would simply have sat OutOfSync.

So the 30% 401 rate had a clean explanation, and it had nothing to do with pods. Every profile-service pod was reading the same never-patched ConfigMap, so all three were validating tokens as HS256 against the old key ID while auth-service had moved to issuing RS256-signed tokens. Every request that carried a token failed. The requests that survived were the health checks, the static reads and the unauthenticated endpoints that never touch token validation, and on this service those are roughly seven requests in ten.
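The "one bad pod" and "shared config" theories predict different per-pod numbers, which is a cheap thing to check before digging into individual pods. The sketch below assumes per-pod request and 401 counts are already available as `pod total_requests count_401` lines (for example, exported from a metrics query); the helper name and the 10-point spread threshold are arbitrary choices for illustration.

```shell
# classify_401_shape: read "pod total_requests count_401" lines on stdin.
# Prints "uniform" when every pod fails at a similar rate (points at shared
# config or traffic mix) and "skewed" when pods differ sharply (points at a
# single bad pod). Optional first argument: allowed spread in percentage
# points, default 10 (an assumption, tune to taste).
classify_401_shape() {
  awk -v spread="${1:-10}" '
    $2 > 0 {
      rate = 100 * $3 / $2
      if (n == 0 || rate < min) min = rate
      if (n == 0 || rate > max) max = rate
      n++
    }
    END {
      if (n == 0) { print "no-data"; exit }
      print ((max - min) <= spread ? "uniform" : "skewed")
    }
  '
}

# Usage:
#   printf '%s\n' 'p1 1000 300' 'p2 1000 310' 'p3 1000 290' | classify_401_shape
```

Three pods each failing around 30% of their own traffic is the uniform case described above; one pod at 100% and two at zero would have supported the original theory.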

The decision that almost broke production a second time

Why Git was the wrong source of truth

The instinct, when you find drift between Git and a cluster, is to trust Git. That is the whole point of GitOps. The pull request is the source of truth and the cluster is downstream. Run an ArgoCD sync, let it overwrite the live state, move on.

That instinct would have broken auth-service inside of 30 seconds. Git held the pre-rotation HS256 values. The new private key that auth-service was signing tokens with did not match the public key Git was about to push into the ConfigMap. A sync from Git would have pushed that stale pair into all three namespaces and failed every token-bearing request on all three services, not just the 30% of profile-service traffic that was already failing.

We had to invert the model. For this one incident, the auth-service namespace's live ConfigMap was the canonical truth, and Git was stale. The recovery had to flow live-to-Git first, then Git-to-cluster for the other two namespaces, and only then could auto-sync be turned back on. The order mattered.

Recovery flow. Live state was canonical for one application, Git was canonical after the commit for the other two.

How we got the canonical values into Git and synced the stragglers

Committing a live hotfix back to Git without breaking auth

The commit itself was unremarkable once we had a clear model. We pulled the auth-service ConfigMap, extracted the two fields, and updated all three manifests in the deploy repo in a single PR with a postmortem link in the description. The PR title was 'Hotfix reconcile: commit post-rotation JWT values from live state (incident #INC-441)' because future-us was going to want to know why these values arrived without an upstream change.

# 1. Export canonical values from auth-service namespace
KID=$(kubectl -n auth get cm auth-config -o jsonpath='{.data.JWT_PUBLIC_KEY_ID}')
ALG=$(kubectl -n auth get cm auth-config -o jsonpath='{.data.JWT_ALGORITHM}')
export KID ALG

# 2. Patch the three manifests in the Git checkout, commit, push
#    (yq v4 syntax; strenv() reads the exported values as strings
#    instead of splicing them into the expression)
for d in deploy/auth deploy/like deploy/profile; do
  yq -i '.data.JWT_PUBLIC_KEY_ID = strenv(KID) | .data.JWT_ALGORITHM = strenv(ALG)' "$d/auth-config.yaml"
done
git add deploy/auth deploy/like deploy/profile
git commit -m 'Reconcile JWT config from live auth-service (post-rotation hotfix, INC-441)'
git push

# 3. Trigger ArgoCD sync per application, in order; stop at the first failure
for app in auth-service like-service profile-service; do
  argocd app sync "$app" --prune=false || break
  argocd app wait "$app" --health --timeout 180 || break
done

The commit and the sync sequence. auth-service syncs first as a no-op safety check before we touch the broken ones.

We synced auth-service first deliberately. It was already correct, so the sync should be a no-op. If it had shown a diff we did not expect, that would have been our signal to stop and re-audit before touching like-service or profile-service. It came back clean, which told us our commit matched the live state exactly. Then like-service synced and went healthy. Then profile-service synced and within 40 seconds the 401 rate in Prometheus went from 30% to 0.
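The "no-op first" check can be made explicit rather than eyeballed. The sketch below is one way to do it, not the script we ran: it relies on `argocd app diff` exiting 0 when there is no diff and non-zero when a diff is found or the command errors, and the function name `sync_in_order` is hypothetical.

```shell
# sync_in_order: the first argument is the app expected to be a no-op; the
# remaining apps are synced only if that app shows no diff against Git.
# Any non-zero exit from `argocd app diff` (diff found or error) aborts.
sync_in_order() {
  canary="$1"; shift
  if ! argocd app diff "$canary" >/dev/null; then
    echo "unexpected diff on $canary, stopping before touching: $*" >&2
    return 1
  fi
  for app in "$canary" "$@"; do
    argocd app sync "$app" --prune=false || return 1
    argocd app wait "$app" --health --timeout 180 || return 1
  done
}

# Usage:
#   sync_in_order auth-service like-service profile-service
```

Treating an error from the diff the same as a real diff is deliberate: if the canary cannot be verified, the broken applications should not be touched.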

Auto-sync we left off until the 401 rate had been at zero for ten minutes and we had eyes on the Jaeger traces showing fresh successful auth flows end to end. Only then did we re-enable auto-sync on all three applications, in the same order as the sync. We have written more about the order-of-operations on multi-app reconciles in the ArgoCD and GitOps recovery playbook.
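The ten-minute hold is easy to turn into a gate. The sketch below assumes the 401 rate has been sampled once a minute into a plain list, oldest first; where those samples come from (a Prometheus query, a dashboard export) is left open, and the helper name `stable_at_zero` is hypothetical.

```shell
# stable_at_zero: read one 401-rate sample per line (oldest first, assumed
# to be one per minute) and succeed only if the most recent N samples are
# all zero. N defaults to 10, matching the ten-minute hold described above.
stable_at_zero() {
  awk -v n="${1:-10}" '
    { v[NR] = $1 }
    END {
      if (NR < n) exit 1                 # not enough history yet
      for (i = NR - n + 1; i <= NR; i++)
        if (v[i] + 0 != 0) exit 1        # any non-zero sample fails the hold
      exit 0
    }
  '
}

# Usage (samples.txt is a hypothetical file of per-minute rates):
#   stable_at_zero < samples.txt \
#     && argocd app set auth-service --sync-policy automated --self-heal
```

The `argocd app set` line in the usage comment is how auto-sync with self-heal is typically switched back on from the CLI; it would be repeated per application in the same order as the sync.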

Two cheap controls that prevent the next split-state week

What we changed about hotfix discipline after this one

The technical recovery was straightforward once the model was right. The interesting part of this incident was how a one-hour rotation hotfix turned into a week of latent drift. Two things had to go wrong together: a manual change that did not get committed, and an auto-sync toggle that did not get turned back on. Had auto-sync simply been turned back on, the self-heal loop on these applications, which is the part of automated sync that actually reverts live drift back to Git, would have reverted the uncommitted patch within an hour, and the resulting breakage would have forced the missing commit. Disabling it is what let both survive a week.

We made two changes to the platform after this. The first was a scheduled job that lists ArgoCD applications with auto-sync disabled and posts to a channel if any of them have been in that state for more than four hours. It is twelve lines of bash around argocd app list -o json. It has caught the same pattern twice in the last quarter, both times within the same incident as the original change instead of a week later.

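# Posted to platform-alerts when auto-sync has been off for >4h on any app.
# The finish time of the last sync operation stands in for "how long it has
# been off"; apps that have never synced (empty timestamp) are listed too.
# date -d is GNU date syntax.
argocd app list -o json \
  | jq -r '.[] | select(.spec.syncPolicy.automated == null)
            | [.metadata.name, .status.operationState.finishedAt] | @tsv' \
  | awk -F'\t' -v cutoff="$(date -u -d '4 hours ago' +%FT%TZ)" '$2 < cutoff'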

The auto-sync watchdog. The cheapest control with the highest ROI we shipped this year.

The second change was a rule we now apply to every incident we run: if a hotfix lands in the cluster via kubectl, the same incident does not close until the change is in a merged PR. Not the next day. Not 'we'll get to it'. The incident commander treats the Git commit as a recovery step, not a follow-up. That sounds like a process rule, and it is, but it has a sharp version: the on-call's runbook for manual ConfigMap patches now includes the export-and-PR commands at the bottom of the same page. The friction to do it right is now lower than the friction to defer it.
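The "does not close until it is in Git" rule can be backed by a mechanical check. The sketch below is an illustration, not the runbook's actual content: it compares `KEY value` pairs taken from the live ConfigMap against a manifest in the Git checkout, and assumes flat, unquoted scalar values under `.data`. The helper name `hotfix_committed` is hypothetical.

```shell
# hotfix_committed: succeed only if every "KEY value" pair read from stdin
# appears as a "KEY: value" line in the given manifest file.
# Assumes unquoted scalar values with no embedded whitespace.
hotfix_committed() {
  manifest="$1"
  missing=0
  while read -r key value; do
    [ -n "$key" ] || continue
    if ! awk -v k="$key" -v v="$value" \
         '$1 == k ":" && $2 == v { found = 1 } END { exit !found }' "$manifest"; then
      echo "not in Git: ${key}=${value}" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
}

# Usage, run from the deploy repo checkout on the merged branch:
#   kubectl -n auth get cm auth-config \
#     -o jsonpath='JWT_ALGORITHM {.data.JWT_ALGORITHM}{"\n"}JWT_PUBLIC_KEY_ID {.data.JWT_PUBLIC_KEY_ID}{"\n"}' \
#     | hotfix_committed deploy/auth/auth-config.yaml
```

A non-zero exit means the live values are not yet in the manifest, which under the rule above means the incident stays open.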

When the cluster and Git disagree and you cannot just sync your way out

If your GitOps is in a split state right now

The hard part of this kind of incident is not the kubectl or the argocd CLI. The hard part is figuring out which system is the source of truth for which field right now, when the answer is not 'Git, always'. Get that wrong and an ArgoCD sync will take production down a second time on top of whatever is already broken. We have seen the same shape of failure four times this year: a rotation, a migration, an emergency schema change, and a CRD upgrade, each of which left some subset of clusters carrying values that Git did not yet know about.

InfraForge runs these reconciles every week. We know the order to commit, the order to sync, the checks that catch a propagated copy you forgot about, and the questions to ask before you trust Git over the live state. If your auto-sync has been off for a week and you are not sure what would happen when you turn it back on, book an infrastructure review with our team and we will be on a bridge with you the same day to walk the drift before you touch anything.

Originally published at https://infraforge.agency/insights/argocd-drift-three-namespaces-jwt-configmap-hotfix/.
