ArgoCD CVE-2022-24348: a Secret leak that hid in log volume

The first thing we saw in Loki was a fanout service log line that contained the string 'a2V5Y2xvYWstY2xpZW50' repeated about 40 times in a single minute. Base64 decode: 'keycloak-client'. The fanout service had no business reading anything from the keycloak namespace. It had been emitting fragments of another namespace's client-secret for three days, quietly, while Grafana OnCall sat on a low-priority log volume alert that nobody clicked. The vector turned out to be CVE-2022-24348, the ArgoCD directory traversal bug, riding in on a ConfigMap key that an automation script had applied to the cluster without a Git commit and without anyone noticing.

Problem signals:

A low-priority alert on log volume spikes that nobody investigated for days
ConfigMap keys with URL values that contain '../' segments
Application logs containing base64 strings that decode to credential-shaped prefixes
ArgoCD Application source.repoURL values that point outside the expected repo root
ConfigMap changes in the cluster that have no matching Git commit

Why a fanout service was emitting Keycloak credential fragments

The log line that should not have existed

An on-call engineer was triaging an unrelated paging storm and, out of habit, ran a Loki query against the noisiest service of the previous week. The fanout service had spiked from roughly 200 log lines per minute to 6,400 per minute three days earlier and had stayed there. The lines looked like garbage. They were not garbage.

{app="fanout-service"} | logfmt | __error__=""

# Sample line (sanitized):
level=info msg="resolved source repo" repo="a2V5Y2xvYWstY2xpZW50LXNlY3JldDovL2NsaWVudC1pZD1ibGVhdGVyLWFwaQ==" component=helm-renderer

The base64 in the repo field decodes to 'keycloak-client-secret://client-id=bleater-api'. The fanout service was logging the resolved value of a config key that should never have resolved to a Secret.
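A suspicious token like this is safer to decode offline than to paste into a web tool. The following is a minimal Python sketch using the sample value from the log line above. The helper name decode_log_token is ours for illustration and is not part of any tooling mentioned in this write-up.

```python
import base64
import binascii
from typing import Optional


def decode_log_token(token: str) -> Optional[str]:
    """Return the UTF-8 text behind a base64 token, or None if it is not valid base64 text."""
    try:
        # validate=True rejects characters outside the base64 alphabet
        raw = base64.b64decode(token, validate=True)
        return raw.decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None


sample = "a2V5Y2xvYWstY2xpZW50LXNlY3JldDovL2NsaWVudC1pZD1ibGVhdGVyLWFwaQ=="
print(decode_log_token(sample))  # keycloak-client-secret://client-id=bleater-api
```

Returning None for undecodable input keeps the helper usable in a loop over many candidate tokens, most of which will not be credentials.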

We pulled the live ConfigMap. The offending key was named ARGOCD_APP_SOURCE_REPO_URL and its value was 'gitea.internal/platform/fanout/../keycloak-secrets'. That single '../' segment is the entire CVE-2022-24348 exposure. In vulnerable versions, the ArgoCD Helm renderer would resolve the path without confirming that the result still sat inside the intended repo root, walk out of that root, and read whatever Helm values or Secret references it found in the sibling directory. In this case the sibling directory was a Helm chart that templated the Keycloak client-secret Secret into its values. The fanout service's own application code, which logged its resolved configuration on startup and on every reconcile, then dumped fragments of that Secret into Loki as base64.
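To make the traversal concrete, here is a small Python sketch of the containment check that a '../' segment defeats when it is missing. It illustrates the general technique only and is not ArgoCD's actual code. The function name escapes_root is hypothetical, and the repo root is assumed from the ConfigMap value above.

```python
import posixpath


def escapes_root(repo_root: str, candidate: str) -> bool:
    """True if candidate, once '..' segments are collapsed, is not inside repo_root."""
    root = posixpath.normpath(repo_root)
    resolved = posixpath.normpath(candidate)
    # Compare against root + "/" so a sibling like "fanout-evil" does not pass as "fanout"
    return resolved != root and not resolved.startswith(root + "/")


root = "gitea.internal/platform/fanout"  # assumed intended repo root
value = "gitea.internal/platform/fanout/../keycloak-secrets"

print(posixpath.normpath(value))  # gitea.internal/platform/keycloak-secrets
print(escapes_root(root, value))  # True
```

The point of the sketch is the ordering: collapse the path first, then compare it to the root. A check that only inspects the unnormalized prefix would accept the value above.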

Three days. The fanout service itself was healthy the entire time. RabbitMQ consumers were running, distribution was working, the SLO board was green. The exposure was completely silent from a functional standpoint.

Log volume alerts without log content are noise generators

Why the alert sat for three days

The Grafana OnCall alert that fired three days earlier said, in effect, 'fanout-service log volume is 30x baseline'. It was tagged P3 and routed to a Slack channel that the team treats as a digest. The runbook attached to the alert said to check for retry loops. The on-call engineer at the time did check, saw no retries in the RabbitMQ metrics, and silenced the alert for 24 hours. The silence got renewed twice by the rotation handoff.

This is the part of the story we keep seeing across client engagements. A log volume alert that does not inspect log content tells you something changed, not what changed. If the alert had matched on the byte pattern of base64 strings longer than 32 characters in a service that does not normally emit base64, the page would have been P1 and would have gone to a human within minutes. Volume alone is not a signal anyone can act on in under an hour, so it gets silenced.

We have written more on this in our other GitOps and ArgoCD recovery write-ups, where the same pattern shows up under different vectors. The constant is that GitOps systems concentrate trust in the manifest pipeline, and any leak in that pipeline tends to surface first as 'weird logs' before it surfaces as anything else.

The pod restart that the ConfigMap patch alone does not give you

Patching the ConfigMap was not the fix

The instinct, once we identified the bad key, was to kubectl edit configmap and delete the line. We did not do that, for two reasons. First, the ConfigMap was managed by ArgoCD; a live edit would last until the next sync. Second, even after the ConfigMap was clean, the existing pods would still have the malicious URL in their environment because envFrom only resolves at pod start. The leak would continue until the pods were rolled.

The correct sequence had four steps and the order mattered.

1. Commit the fix to Git first: We removed the ARGOCD_APP_SOURCE_REPO_URL key from the ConfigMap manifest in the platform repo and opened a PR. ArgoCD was the source of truth, so any cluster-side edit would be reverted. The PR also added a comment explaining the CVE so the next person reading the repo would understand the deletion.

2. Sync ArgoCD with prune disabled: We forced the sync immediately rather than waiting for the next polling interval. We left prune disabled because we wanted to confirm exactly one diff: the deletion of the bad key. Surprise prunes during a security remediation are how secondary incidents start.

3. Roll the pods explicitly: kubectl rollout restart deployment/fanout-service. The ConfigMap was clean but the pod environments still held the resolved value. Until the pods restarted, every reconcile loop in the running process kept logging the leaked fragment. The rollout took 90 seconds.

4. Verify in Loki before declaring done: We ran the same Loki query that found the leak, scoped to the time window after the rollout completed. Zero matches. Then we ran it across a 30-minute window to be sure we were not just hitting log buffer lag. Still zero. That was the moment we stopped holding our breath.

# Verify the ConfigMap is clean
kubectl get configmap fanout-service-config -n bleater -o json \
  | jq '.data | keys[]' | grep -i repo_url
# (should return nothing)

# Verify pods restarted after the ConfigMap fix timestamp
kubectl get pods -n bleater -l app=fanout-service \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.startTime}{"\n"}{end}'

# Verify the ArgoCD Application source has no traversal
kubectl get application fanout-service -n argocd \
  -o jsonpath='{.spec.source.repoURL}' | grep -F '../' \
  && echo 'STILL VULNERABLE' || echo 'clean'

# Loki check, post-restart window only
logcli query --since=15m \
  '{app="fanout-service"} |~ "a2V5Y2xvYWs|keycloak"'

The four checks we ran in order. Any non-empty result on any of them would have meant the remediation was not complete.
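The pod start-time check is easy to misread by eye when several replicas are involved. A minimal Python sketch of the comparison follows, assuming the tab-separated name and startTime output of the jsonpath command above and a fix timestamp you supply yourself. The pod names and timestamps in the sample are invented for illustration.

```python
from datetime import datetime, timezone
from typing import List


def parse_k8s_time(value: str) -> datetime:
    """Parse the RFC 3339 UTC form Kubernetes prints, e.g. 2024-05-01T10:05:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stale_pods(jsonpath_output: str, fix_applied_at: str) -> List[str]:
    """Return pods that started at or before the ConfigMap fix and so still hold the old env."""
    cutoff = parse_k8s_time(fix_applied_at)
    stale = []
    for line in jsonpath_output.splitlines():
        if not line.strip():
            continue
        name, start = line.split("\t")
        if parse_k8s_time(start) <= cutoff:
            stale.append(name)
    return stale


# Hypothetical output: one pod rolled after the fix, one did not
sample_output = (
    "fanout-service-abc\t2024-05-01T10:05:00Z\n"
    "fanout-service-def\t2024-05-01T09:55:00Z\n"
)
print(stale_pods(sample_output, "2024-05-01T10:00:00Z"))  # ['fanout-service-def']
```

An empty list is the only acceptable result. Any pod name in the output is still running with the leaked value in its environment.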

Treating the leak as breached until proven otherwise

Auditing whether the Secret was actually read

The harder question was not 'did we stop the leak' but 'did anyone read the leaked data while it was leaking'. The fragments were in Loki, which meant anyone with Loki read access to the bleater namespace logs could have seen them. We pulled the Loki audit log for the three-day window and listed every query that matched fanout-service logs. Twelve queries from four engineers, all of them looking at unrelated debugging work, none of them filtering on the byte pattern that would have exposed the credential.

That was reassuring but not sufficient. The Keycloak client-secret had to be treated as compromised regardless, because we could not prove the absence of external log exfiltration with high confidence. We rotated the client-secret, redeployed the services that used it, and audited Keycloak's own access log for any token issuance using the old secret from an unexpected source IP in the exposure window. We found none. The rotation took about 25 minutes including service redeployment.
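The Keycloak side of that audit reduces to a filter: same client, inside the exposure window, from an address outside the networks you expect. Here is a hedged Python sketch of that filter. The event shape (clientId, ipAddress, time in epoch milliseconds) is an assumption about how the events were exported, and the CIDR ranges and sample events are invented, so adapt both to your own export.

```python
import ipaddress
from typing import Dict, Iterable, List


def unexpected_issuances(
    events: Iterable[Dict],
    client_id: str,
    window_start_ms: int,
    window_end_ms: int,
    allowed_cidrs: List[str],
) -> List[Dict]:
    """Return events for client_id inside the window whose source IP is outside allowed_cidrs."""
    networks = [ipaddress.ip_network(cidr) for cidr in allowed_cidrs]
    hits = []
    for event in events:
        if event["clientId"] != client_id:
            continue
        if not (window_start_ms <= event["time"] <= window_end_ms):
            continue
        addr = ipaddress.ip_address(event["ipAddress"])
        if not any(addr in net for net in networks):
            hits.append(event)
    return hits


# Hypothetical exported events; only the second one should be flagged
events = [
    {"clientId": "bleater-api", "ipAddress": "10.0.4.17", "time": 1_000_500},
    {"clientId": "bleater-api", "ipAddress": "203.0.113.9", "time": 1_000_900},
    {"clientId": "other-client", "ipAddress": "203.0.113.9", "time": 1_000_900},
    {"clientId": "bleater-api", "ipAddress": "203.0.113.9", "time": 2_000_000},
]
flagged = unexpected_issuances(events, "bleater-api", 1_000_000, 1_001_000, ["10.0.0.0/8"])
print(len(flagged))  # 1
```

An empty result supports the "no unexpected use" conclusion but does not prove it, which is why the rotation happened regardless.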

We then went back to the original question that nobody had asked yet: how did the malicious ConfigMap key get there in the first place. The automation script that applied it was a 'config sync' job that pulled key-value pairs from a shared spreadsheet and wrote them into the ConfigMap. The spreadsheet was editable by a wider group than the cluster was. Somebody had added the URL three days earlier, probably as a copy-paste mistake from a different document, and the sync job had faithfully applied it. There was no Git commit, no PR, no review.

Three controls that close this class of failure

What we changed so it cannot happen quietly again

We made three changes after this incident, in priority order.

The first was an admission webhook that rejects any ConfigMap apply where a string value contains '../' or matches the shape of a URL pointing outside an allowlist of internal domains. The rule is 12 lines of OPA policy. We tested it against six months of historical ConfigMap diffs and it would have caught this exact incident on day zero. It also catches the more common case of someone pasting a localhost URL into a shared config.
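The real rule is OPA policy and is not reproduced here. As an illustration of the same two checks, this is a Python sketch under stated assumptions: the allowlist contents and the function name violations are hypothetical, and only values containing '://' are treated as URLs.

```python
from typing import Dict, List
from urllib.parse import urlsplit

# Hypothetical allowlist of internal domains
ALLOWED_DOMAINS = ("gitea.internal",)


def violations(configmap: Dict) -> List[str]:
    """Return one message per ConfigMap data value that should be rejected."""
    problems = []
    for key, value in (configmap.get("data") or {}).items():
        # Check 1: any traversal segment, whether or not the value parses as a URL
        if "../" in value:
            problems.append(f"{key}: contains '../'")
            continue
        # Check 2: URL-shaped values must point at an allowlisted host or a subdomain of one
        if "://" in value:
            host = urlsplit(value).hostname or ""
            if not any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS):
                problems.append(f"{key}: URL host '{host}' is not allowlisted")
    return problems


incident = {"data": {"ARGOCD_APP_SOURCE_REPO_URL": "gitea.internal/platform/fanout/../keycloak-secrets"}}
pasted = {"data": {"API_URL": "http://localhost:8080/api"}}
clean = {"data": {"REPO": "https://gitea.internal/platform/fanout", "LOG_LEVEL": "info"}}

print(violations(incident))  # ["ARGOCD_APP_SOURCE_REPO_URL: contains '../'"]
print(violations(pasted))    # ["API_URL: URL host 'localhost' is not allowlisted"]
print(violations(clean))     # []
```

A webhook built on this would deny the request whenever the list is non-empty and return the messages to the caller, so the person applying the change sees why it failed.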

The second was retiring the spreadsheet-driven sync job. Every ConfigMap that lands in the cluster now comes from a Git commit, has a commit SHA annotation, and fails admission if the annotation is missing or does not match a real commit in the repo. The work to migrate the existing key-value pairs took about a week. The job is gone and is not coming back.
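The annotation gate can be sketched in a few lines of Python. The annotation key below is hypothetical, the set of known commits stands in for however you query the repo, and the pattern assumes 40-character SHA-1 commit IDs rather than a SHA-256 repository.

```python
import re
from typing import Dict, Set, Tuple

# Hypothetical annotation key; use whatever your pipeline stamps on manifests
ANNOTATION = "example.internal/git-commit-sha"
SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def admit(configmap: Dict, known_commits: Set[str]) -> Tuple[bool, str]:
    """Allow a ConfigMap only if it carries a well-formed commit SHA that exists in the repo."""
    annotations = (configmap.get("metadata") or {}).get("annotations") or {}
    sha = annotations.get(ANNOTATION)
    if sha is None:
        return False, "missing commit annotation"
    if not SHA_RE.match(sha):
        return False, "annotation is not a full 40-character SHA"
    if sha not in known_commits:
        return False, "commit not found in repo"
    return True, "ok"


known = {"a" * 40}  # placeholder for the repo's real commit set
good = {"metadata": {"annotations": {ANNOTATION: "a" * 40}}}
spreadsheet_style = {"metadata": {"name": "fanout-service-config"}}

print(admit(good, known))               # (True, 'ok')
print(admit(spreadsheet_style, known))  # (False, 'missing commit annotation')
```

The third check is the one that matters. A well-formed SHA that does not exist in the repo is exactly what a hand-applied or script-applied ConfigMap would carry if someone copied the annotation format without going through Git.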

The third was rewriting the log volume alert. The new version fires when fanout-service log lines contain base64-encoded strings longer than 24 characters at a rate above 5 per minute, scoped to services that do not normally emit base64. It is a Loki alerting rule with a regex match and it pages a human at P1. In its first week it fired twice, both false positives from legitimate JWT logging that we then removed, and caught zero real incidents. We consider that a healthy signal-to-noise ratio for a security alert.
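The alert logic itself is small enough to prototype offline against exported log lines before committing it as a rule. This is a Python sketch of that prototype, not the Loki rule. The regex is our approximation of "base64-encoded string longer than 24 characters", and the minute labels and log lines in the sample are invented.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# 25 or more base64 alphabet characters, optionally followed by padding
BASE64_RE = re.compile(r"[A-Za-z0-9+/]{25,}={0,2}")


def minutes_over_threshold(lines: Iterable[Tuple[str, str]], threshold: int = 5) -> List[str]:
    """Given (minute, log line) pairs, return minutes with more than threshold matching lines."""
    per_minute = Counter()
    for minute, line in lines:
        if BASE64_RE.search(line):
            per_minute[minute] += 1
    return sorted(m for m, n in per_minute.items() if n > threshold)


leak = 'level=info msg="resolved source repo" repo="a2V5Y2xvYWstY2xpZW50LXNlY3JldDovL2NsaWVudC1pZD1ibGVhdGVyLWFwaQ=="'
normal = 'level=info msg="delivered" queue=fanout'

sample_lines = (
    [("12:00", leak)] * 6      # six matches: above the threshold
    + [("12:01", leak)] * 5    # five matches: at the threshold, not above it
    + [("12:01", normal)] * 50
)
print(minutes_over_threshold(sample_lines))  # ['12:00']
```

A pattern this loose will also match long hex digests and JWT segments, which is consistent with the false positives described above and is the reason the alert is scoped to services that do not normally emit base64.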

We also upgraded ArgoCD past the CVE-2022-24348 fix line. That should have happened a year earlier. The fix shipped in 2.1.9, 2.2.4, and 2.3.0. If you are running anything older than the patched release on your minor line, stop reading this and go check, because the same vector is sitting in your cluster waiting for an unlucky ConfigMap edit.

The control surface after the incident. The two new gates are the admission webhook on apply and the content-aware log alert at runtime.

When a CVE has been silently active in your cluster for days

If you are staring at a similar exposure window

The hard part of this kind of incident is not the patch. The patch is one line. The hard part is reconstructing the exposure window with enough confidence to know what to rotate, what to disclose, and what to audit. That work requires log retention you can query precisely, audit trails for the systems that read those logs, and the discipline to treat any leaked credential as compromised until the access logs say otherwise. Teams that have not rehearsed this work tend to do all three poorly the first time.

We run GitOps and ArgoCD recovery engagements every month. We have seen the path traversal CVE three times in the last year, all on clusters running ArgoCD versions the operators thought were current, and we have seen the same 'silent ConfigMap injection via shared spreadsheet' antipattern more often than that. The remediation pattern is the same. The audit pattern is the same. The controls that close the gap are the same.

If you suspect a similar exposure in your cluster right now, request an infrastructure review and we will start with a 30-minute diagnostic call this week to scope the audit window and the rotation list. If the exposure is active, we will be on a bridge with your team the same day.

Originally published at https://infraforge.agency/insights/argocd-cve-2022-24348-path-traversal-secret-leak-recovery/.
