When ArgoCD shows Healthy but Keycloak silently strips JWT claims

#gitops #recovery #gitopsargocd

ArgoCD reported Synced and Healthy. The Keycloak Helm release was green. And the downstream timeline service was returning 401 on every authenticated request. That was the call we got: every dashboard says the platform is fine, and authentication is broken across three services. The JWTs auth-service was issuing had stopped carrying the groups claim and the email_verified claim about 40 minutes earlier, right after an ArgoCD auto-sync rolled out a Keycloak chart bump. Six OIDC clients had silently lost protocol mappers and role mappings during that sync, and we did not yet know it.

Problem signals:

ArgoCD shows Synced and Healthy on the Keycloak application, but downstream services return 401 on tokens they accepted an hour ago
JWTs decoded at jwt.io are missing claims that production code depends on (groups, email_verified, aud)
Engineers have been making emergency fixes directly in the Keycloak admin console during recent incidents and not committing them back
The realm import ConfigMap in git has not been touched in weeks, yet the live realm has clearly changed
Helm values for the Keycloak chart set the realm import strategy to OVERWRITE, or leave it unset on a chart whose default is to overwrite existing realms

The sync that looked clean and quietly stripped six clients

ArgoCD said Healthy. Auth said 401.

Our first guess was wrong. The team had been staring at auth-service for 25 minutes when we joined the bridge, because the tokens it was issuing were obviously malformed. The groups claim was gone. The email_verified claim was gone on a different client. Surely auth-service had shipped a bad release. Except auth-service had not shipped in nine days, and the failure had started 40 minutes ago, not nine days ago.

The shape of the failure is what gave it away. Three OIDC clients had each lost a different mapper at the same moment. Auth-service had lost a groups protocol mapper. The profile service had lost an email_verified client scope mapping. The api gateway had lost role mappings for a downstream audience. Three services do not lose three unrelated pieces of OIDC config simultaneously unless something upstream rewrote all of them at once. The only thing that had touched Keycloak in that window was an ArgoCD auto-sync of the Keycloak Helm release.

We pulled the ArgoCD sync history and found the sync 41 minutes earlier. It was a chart version bump, nothing that should have changed realm content. But the chart ships a realm import ConfigMap, and the realm JSON inside that ConfigMap had not been updated in weeks. Meanwhile the live realm in the Keycloak PostgreSQL database had been edited through the admin console at least a dozen times during recent incidents. None of those console changes had been committed back to git.

So the chart redeployed the ConfigMap. The Keycloak init container read it. And the realm import ran with the strategy set to OVERWRITE. Every console change made during the previous two weeks of incident response got reverted to the stale git version, silently, with no error and no event surfaced to ArgoCD.

Diffing live realm state against the ConfigMap before doing anything destructive

Six clients had drifted and the next sync would make it worse

The first thing we did was not a fix. The first thing we did was freeze. Auto-sync was still enabled on the Keycloak ArgoCD application. If anyone touched a Helm value for any reason in the next hour, another sync would fire and a second OVERWRITE pass would run against whatever state we had managed to reconstruct. We disabled self-heal and auto-sync on the application first, then started the diagnosis.

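# 1. Freeze the ArgoCD app so the next sync cannot fire mid-recovery.
# Self-heal is a field of the automated sync policy, so turn it off while
# that policy still exists, then drop automated sync entirely.
argocd app set keycloak --self-heal=false
argocd app set keycloak --sync-policy none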

# 2. Pull live realm state from the Keycloak Admin REST API
# ($KC is the Keycloak base URL; --data-urlencode keeps special characters in credentials intact)
TOKEN=$(curl -s -X POST "$KC/realms/master/protocol/openid-connect/token" \
  -d "grant_type=password" -d "client_id=admin-cli" \
  --data-urlencode "username=$ADMIN_USER" \
  --data-urlencode "password=$ADMIN_PASS" | jq -r .access_token)

curl -s -H "Authorization: Bearer $TOKEN" \
  "$KC/admin/realms/primary/clients" | jq . > live-clients.json

curl -s -H "Authorization: Bearer $TOKEN" \
  "$KC/admin/realms/primary/client-scopes" | jq . > live-scopes.json

# 3. Extract the realm JSON ArgoCD just pushed
kubectl -n keycloak get cm keycloak-realm-import -o jsonpath='{.data.realm\.json}' \
  | jq . > configmap-realm.json

Snapshot live state before any reconciliation. The live API is now the source of truth, not the ConfigMap.

Diffing live-clients.json against the clients block in configmap-realm.json showed six clients with material differences. Two were missing protocol mappers entirely. Three had client scopes that had been removed. One had role mappings that were present in production but absent from the ConfigMap, because an engineer had re-added them by hand in the console during the firefight, before we joined the bridge, and had not committed them either. That last finding was the one that mattered most. The drift ran in both directions, and the ConfigMap was still armed: every future sync was another OVERWRITE pass waiting to delete config that existed only in the Keycloak database. The pass that had just run was simply the one that cascaded far enough to break downstream services.
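A raw diff of two large JSON files is noisy, so it can help to reduce each client to the fields that matter first. The following is a minimal sketch of that idea, not the exact commands we ran. It assumes jq is installed, and the function names summarize_clients and diff_clients are ours for illustration. It compares only protocol mapper names and default client scopes, so role mappings would need a separate pass.

```shell
#!/usr/bin/env bash
# Reduce a JSON array of client representations to one line per client:
# clientId, sorted protocol mapper names, sorted default client scopes.
summarize_clients() {
  jq -r '
    sort_by(.clientId)[]
    | [ .clientId,
        ([.protocolMappers[]?.name] | sort | join(",")),
        ((.defaultClientScopes // []) | sort | join(",")) ]
    | @tsv' "$1"
}

# Compare the live client list ($1) with the clients block of a realm JSON ($2).
# Prints a unified diff (git side as "-", live side as "+").
# Returns 0 when the summaries match and 1 when they differ.
diff_clients() {
  local live="$1" realm_json="$2" tmp rc=0
  tmp=$(mktemp -d)
  summarize_clients "$live" > "$tmp/live.tsv"
  jq '.clients // []' "$realm_json" > "$tmp/git-clients.json"
  summarize_clients "$tmp/git-clients.json" > "$tmp/git.tsv"
  diff -u "$tmp/git.tsv" "$tmp/live.tsv" || rc=$?
  rm -rf "$tmp"
  return $rc
}

# Usage with the snapshots taken above:
# diff_clients live-clients.json configmap-realm.json
```

Lines prefixed with "+" exist only in the live realm and would be lost on the next overwrite. Lines prefixed with "-" exist only in git.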

Two write paths to the same realm. OVERWRITE makes one of them silently win.

Reconstructing realm state without invalidating active sessions

Why we did not re-import the ConfigMap

The obvious recovery path was to fix the realm JSON in git, commit it, and let ArgoCD re-sync. We did not do that, and the reason matters. A full realm re-import, even with the right content, runs through the Keycloak realm import flow on startup. Depending on the chart and the Keycloak version, that can rotate signing keys, drop active sessions, or invalidate refresh tokens. We had roughly 8,000 active user sessions at that moment. Forcing all of them to re-authenticate at 11pm during an active incident was not a recovery; it was a second outage on top of the first.

So we split the fix into two phases. Phase one was to restore live realm state using the Admin REST API directly, client by client, mapper by mapper. The REST API can add a protocol mapper or attach a client scope to a client without bouncing anything. Phase two was to update the ConfigMap in git to match the now-correct live state AND change the import strategy, so that the next ArgoCD sync would be a no-op rather than another OVERWRITE pass.

# Phase 1: restore each missing mapper live via Admin REST API
# Example: re-add the groups protocol mapper to auth-service client
CLIENT_ID=$(jq -r '.[] | select(.clientId=="auth-service") | .id' live-clients.json)

curl -s -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  "$KC/admin/realms/primary/clients/$CLIENT_ID/protocol-mappers/models" \
  -d '{
    "name": "groups",
    "protocol": "openid-connect",
    "protocolMapper": "oidc-group-membership-mapper",
    "config": {
      "claim.name": "groups",
      "full.path": "false",
      "id.token.claim": "true",
      "access.token.claim": "true",
      "userinfo.token.claim": "true"
    }
  }'

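# Verify a freshly issued token now carries the claim before moving on.
# JWT segments are base64url without padding, so translate and re-pad before decoding.
PAYLOAD=$(curl -s -X POST "$KC/realms/primary/protocol/openid-connect/token" \
  -d 'grant_type=client_credentials' \
  -d "client_id=auth-service" -d "client_secret=$SECRET" \
  | jq -r .access_token | cut -d. -f2 | tr '_-' '/+')
while [ $(( ${#PAYLOAD} % 4 )) -ne 0 ]; do PAYLOAD="$PAYLOAD="; done
printf '%s' "$PAYLOAD" | base64 -d | jq .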

Restore each mapper live, then verify the issued token actually carries the claim before moving to the next client.
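Client scopes can be restored the same way. The sketch below is an illustration, not a transcript of the incident. It attaches an existing client scope to a client as a default scope through the Admin REST API. It assumes KC and TOKEN are set as in the earlier steps. The function name attach_default_scope, the REALM and DRY_RUN variables, and the client and scope names in the usage comment are all ours. DRY_RUN=1 prints the request instead of sending it, which is a cheap safety check during an incident.

```shell
#!/usr/bin/env bash
# Attach an existing client scope to a client as a default scope.
#   $1  internal client UUID (the "id" field, not the clientId)
#   $2  client scope ID
# With DRY_RUN=1 the request is printed instead of being sent.
attach_default_scope() {
  local client_uuid="$1" scope_id="$2"
  local url="$KC/admin/realms/${REALM:-primary}/clients/$client_uuid/default-client-scopes/$scope_id"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "PUT $url"
    return 0
  fi
  curl -sf -X PUT -H "Authorization: Bearer $TOKEN" "$url"
}

# Usage (illustrative names; IDs resolved from the snapshots taken earlier):
# CLIENT_UUID=$(jq -r '.[] | select(.clientId=="profile-service") | .id' live-clients.json)
# SCOPE_ID=$(jq -r '.[] | select(.name=="email") | .id' live-scopes.json)
# DRY_RUN=1 attach_default_scope "$CLIENT_UUID" "$SCOPE_ID"
```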

We worked through the six clients in dependency order: auth-service first because every other service consumed its tokens, then the api gateway, then profile, then the rest. After each client we curl'd a fresh token and base64-decoded the payload to confirm the claim was present. Twenty-two minutes from the start of restoration, timeline-service was returning 200s again. No sessions dropped. No users re-authenticated. The Keycloak pods were never restarted.
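The per-client check is easy to script so that a missing claim stops the loop instead of relying on someone reading JSON on a bridge call. This is a minimal sketch under our own naming (jwt_payload and jwt_has_claim are hypothetical helpers). It assumes jq is installed and a base64 that accepts -d. It only inspects the payload and does not verify the token signature.

```shell
#!/usr/bin/env bash
# Print the decoded payload (second segment) of the JWT passed as $1.
# JWT segments are base64url without padding, so map the URL-safe
# alphabet back to standard base64 and restore the padding first.
jwt_payload() {
  local seg
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  while [ $(( ${#seg} % 4 )) -ne 0 ]; do seg="${seg}="; done
  printf '%s' "$seg" | base64 -d
}

# Succeed only if the payload of token $1 has the top-level claim named $2.
jwt_has_claim() {
  jwt_payload "$1" | jq -e --arg c "$2" 'has($c)' >/dev/null
}

# Usage: stop before touching the next client if the claim is still missing.
# jwt_has_claim "$ACCESS_TOKEN" groups || { echo "groups claim missing" >&2; exit 1; }
```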

What we changed so the next sync becomes a no-op

The one Helm value that should never be OVERWRITE

With live state correct, the dangerous artifact in the system was still the stale realm JSON in the ConfigMap and the OVERWRITE strategy that would re-apply it on any future sync. We exported the now-correct realm via the Admin API, ran it through a diff against what was in git, and committed the result. We also patched the Keycloak Helm values to set the realm import strategy to IGNORE_EXISTING.

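# values.yaml for the Keycloak chart
# The variable names below depend on the chart and Keycloak version;
# confirm them against your chart's documentation before relying on them.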
extraEnv: |
  - name: KEYCLOAK_IMPORT_STRATEGY
    value: IGNORE_EXISTING
  # On Keycloak 22+ via Quarkus distribution:
  - name: KC_SPI_IMPORT_SINGLE_FILE_STRATEGY
    value: IGNORE_EXISTING

# For the operator/CR variant:
# spec:
#   realmImport:
#     strategy: IGNORE_EXISTING   # NOT OVERWRITE_EXISTING

IGNORE_EXISTING means the ConfigMap seeds a realm on first creation but never overwrites existing resources. This is the correct setting for any realm that humans also edit.
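A setting like this tends to regress quietly during a later chart upgrade, so one option is a small guard in CI. The sketch below is an assumption about how such a check could look, not part of the chart or of ArgoCD. The function name check_import_strategy is ours, and the pattern is deliberately broad: it flags any value: or strategy: line that starts with OVERWRITE.

```shell
#!/usr/bin/env bash
# Fail if a values file sets an import strategy to an OVERWRITE variant.
# Only uncommented "value:" and "strategy:" lines are inspected, so a
# YAML comment that merely mentions OVERWRITE does not trip the check.
check_import_strategy() {
  local file="$1"
  if grep -nE '^[[:space:]]*(value|strategy):[[:space:]]*"?OVERWRITE' "$file"; then
    echo "ERROR: $file sets an OVERWRITE import strategy" >&2
    return 1
  fi
  return 0
}

# Usage in a pipeline step:
# check_import_strategy values.yaml || exit 1
```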

We re-enabled ArgoCD auto-sync and watched it run. The sync diffed clean: ConfigMap content matched live realm, import strategy was IGNORE_EXISTING, no resources were touched. Green for the right reason this time.

We changed two things in the way the team operates going forward. First, we wrote a small drift detector that runs nightly. It pulls the live realm via the Admin API, diffs it against the realm JSON in git, and posts to a Slack channel if they disagree. It is roughly 80 lines, and it has caught two uncommitted console edits in the six weeks since. Second, we now treat OVERWRITE as a forbidden value for any realm that is also editable in the admin console. If you want OVERWRITE semantics, you must also remove admin console write access for everyone except a break-glass account, because otherwise you are building a system where one of two writers silently destroys the other's work. We have written more about this category of GitOps failure in the ArgoCD and GitOps recovery cluster, and the same pattern shows up with Grafana dashboards, Argo Workflows templates, and anything else where humans and a controller both have write access to the same object.
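For readers who want a starting point, here is a condensed sketch of what such a detector could look like. It is not our production script. All function names are ours, and GIT_REALM_JSON, SLACK_WEBHOOK_URL and REALM are hypothetical variables. It assumes jq is installed and that KC and TOKEN are set as in the recovery steps.

```shell
#!/usr/bin/env bash
# Fetch the live realm through the partial-export endpoint.
# Partial export masks secrets, so masked fields will differ from git
# unless git stores the same placeholders.
fetch_live_realm() {
  curl -sf -X POST -H "Authorization: Bearer $TOKEN" \
    "$KC/admin/realms/${REALM:-primary}/partial-export?exportClients=true&exportGroupsAndRoles=true"
}

# Sort object keys so that key order alone never registers as drift.
normalize() { jq -S . "$1"; }

# Returns 0 when the two files match after normalization, 1 on drift
# (unified diff on stdout), 2 when either file cannot be parsed.
realm_drift() {
  local a b rc=0
  a=$(mktemp)
  b=$(mktemp)
  normalize "$1" > "$a" && normalize "$2" > "$b" || { rm -f "$a" "$b"; return 2; }
  diff -u "$a" "$b" || rc=$?
  rm -f "$a" "$b"
  return $rc
}

# Post a plain-text message to a Slack incoming webhook.
notify_slack() {
  jq -n --arg text "$1" '{text: $text}' \
    | curl -sf -X POST -H 'Content-Type: application/json' -d @- "$SLACK_WEBHOOK_URL"
}

# Entry point for the nightly job; call it from cron or a CronJob wrapper.
main() {
  local live rc=0
  live=$(mktemp)
  fetch_live_realm > "$live" || { echo "realm export failed" >&2; rm -f "$live"; return 2; }
  realm_drift "$GIT_REALM_JSON" "$live" > realm-drift.diff || rc=$?
  if [ "$rc" -eq 1 ]; then
    notify_slack "Keycloak realm drift detected: live realm differs from git"
  fi
  rm -f "$live"
  return $rc
}
```

Sorting keys with jq -S does not reorder arrays, so a reordered client list would still show up as drift. A real detector would likely sort arrays on a stable key such as clientId before comparing.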

When GitOps is silently rewriting your identity provider

If your realm config and your cluster disagree

The hard part of this kind of incident is not the Keycloak knowledge. It is recognizing that a green ArgoCD dashboard can coexist with a destroyed production configuration, and knowing which fixes preserve sessions versus which ones lock out every user in the building at midnight. The team we worked with had the Keycloak skills. What they did not have was a recovery sequence that prioritized live state capture over git reconciliation, and a clear rule about when to apply via the Admin API versus when to let ArgoCD do it.

We run these recovery engagements every week. The OVERWRITE-vs-IGNORE_EXISTING trap has hit two other teams this quarter, both on Keycloak, and we have seen the same shape on Grafana provisioning, Argo Workflows ClusterWorkflowTemplates, and a memorable case with Vault policies. The pattern is always: controller writes, human writes, controller wins on the next reconcile, nobody notices for hours.

If your identity provider, your dashboards, or any other system with human-editable state is sitting behind ArgoCD and you have ever wondered whether you are quietly losing changes, book an infrastructure review with our team and we will be on a bridge with you the same day. The first 30 minutes will tell you whether you have a drift problem, and from there we can scope a recovery that does not require kicking your users out.

Originally published at https://infraforge.agency/insights/keycloak-realm-overwrite-argocd-sync-drift/.

If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements — see /review.