DEV Community

david

Posted on • Originally published at woitzik.dev

ArgoCD Gotchas: Cache Staleness and the SharedResourceWarning Nobody Explains

kubectl apply reports success. You check the resource, and the field you just changed is back to its old value. No error. No event. kubectl get shows the change applied, then a few seconds later shows it gone, like it never happened.

This isn't a typo or a YAML indentation bug. It's ArgoCD's selfHeal doing exactly what it's designed to do: re-applying from its own cached understanding of what the resource should be, which can lag behind a change you just made by hand, or even behind a fresh git push.

This hit the same homelab three times in one day, across three unrelated resources. Here's the pattern, the fix, and a second, related gotcha that produces a different symptom from a similar root cause.

View the complete homelab infrastructure source on GitHub

The Symptom

Three separate incidents, same shape:

  • A Tempo PersistentVolumeClaim's storageClassName kept reverting after being changed.
  • Traefik's tlsStore and dashboard configuration reverted after a Helm values update.
  • A paperless-gpt deployment's volumeMounts reverted after a direct edit.

Each time, the sequence was: edit the live resource or push a change to Git → confirm the change is live → come back later → the old value is back, with no error logged anywhere obvious.

Why This Happens: selfHeal Plus a Stale Cache

ArgoCD's selfHeal: true continuously reconciles the live cluster state against ArgoCD's rendered understanding of what the Application's manifests/Helm chart should produce. That's the entire point of GitOps: drift gets corrected automatically, so a manual kubectl edit doesn't silently become the new permanent state.

The bug isn't that selfHeal exists. It's that the rendered understanding ArgoCD reconciles against comes from the argocd-repo-server's manifest/Helm chart cache, and that cache doesn't always get invalidated promptly after a fresh git push or a fresh kubectl apply made outside ArgoCD. For a window of time (usually short, but long enough to be confusing), ArgoCD's source of truth for "what should this look like" is stale, and selfHeal faithfully reverts your change back to match it.

This is functionally indistinguishable, from the outside, from "ArgoCD is ignoring my change", but the actual mechanism is "ArgoCD is enforcing an outdated cached version of what it thinks I want."
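One way to check whether ArgoCD is lagging behind Git is to compare the revision recorded in the Application's status with the commit the tracked branch currently points to. The sketch below assumes a single-source, Git-backed Application (so .status.sync.revision holds a commit SHA); the function names, the application name, the repository URL, and the branch in the usage comment are all hypothetical.

```shell
# Sketch: compare the commit ArgoCD last synced with the tip of the tracked
# branch. Assumes a single-source Git Application; names are placeholders.

# Print the revision recorded in the Application's sync status.
argocd_synced_revision() {
  kubectl get application "$1" -n argocd -o jsonpath='{.status.sync.revision}'
}

# Print the commit SHA the remote branch currently points to.
git_remote_head() {
  git ls-remote "$1" "refs/heads/$2" | cut -f1
}

# Report whether ArgoCD's recorded revision matches the remote branch tip.
check_revision_lag() {
  synced=$(argocd_synced_revision "$1")
  remote=$(git_remote_head "$2" "$3")
  if [ "$synced" = "$remote" ]; then
    echo "in-sync: $synced"
  else
    echo "lagging: argocd=$synced git=$remote"
  fi
}

# Usage (hypothetical values):
#   check_revision_lag tempo https://example.invalid/homelab.git main
```

A "lagging" result suggests ArgoCD has not picked up the push yet. Matching revisions do not prove the rendered manifests are fresh, so treat this as a first check rather than a verdict.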

The Fix: Force a Hard Refresh

kubectl patch application <name> -n argocd --type merge \
  -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'

The hard refresh value (as opposed to normal) tells ArgoCD to bypass the repo-server's manifest cache entirely and re-render from source. Wait roughly 15 seconds, then re-check.
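Instead of waiting a fixed interval, the refresh can be polled. This is a minimal sketch that assumes the application controller removes the argocd.argoproj.io/refresh annotation once it has processed the request; hard_refresh_and_wait is a hypothetical helper name, and the 30-poll default is arbitrary.

```shell
# Sketch: request a hard refresh, then poll until the refresh annotation is
# gone. Assumption: the controller clears the annotation after processing it.
hard_refresh_and_wait() {
  app=$1
  tries=${2:-30}   # number of 1-second polls before giving up
  kubectl patch application "$app" -n argocd --type merge \
    -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}' \
    >/dev/null || return 1
  while [ "$tries" -gt 0 ]; do
    # Dots inside the annotation key are escaped for kubectl's jsonpath.
    pending=$(kubectl get application "$app" -n argocd \
      -o jsonpath='{.metadata.annotations.argocd\.argoproj\.io/refresh}')
    [ -z "$pending" ] && { echo "refreshed: $app"; return 0; }
    sleep 1
    tries=$((tries - 1))
  done
  echo "timeout: $app" >&2
  return 1
}

# Usage (hypothetical application name):
#   hard_refresh_and_wait traefik
```

If the argocd CLI is installed and logged in, argocd app get <name> --hard-refresh requests the same refresh without patching the annotation by hand.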

If that alone doesn't resolve it, the repo-server itself may need restarting, rather than just having its cache invalidated for one Application:

kubectl rollout restart deployment argocd-repo-server -n argocd

This is a bigger hammer: it affects every Application's next reconciliation, not just the one you're debugging, so try the targeted hard refresh annotation first.
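Because the restart touches every Application, it helps to wait for the rollout to finish before re-checking anything. A small sketch, with restart_repo_server as a hypothetical wrapper and the 120s default timeout chosen arbitrarily:

```shell
# Sketch: restart the repo-server and block until the new pods are ready.
# The optional first argument is a kubectl duration such as 60s or 2m.
restart_repo_server() {
  kubectl rollout restart deployment argocd-repo-server -n argocd &&
    kubectl rollout status deployment argocd-repo-server -n argocd \
      --timeout="${1:-120s}"
}

# Usage:
#   restart_repo_server 60s
```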

The StatefulSet Exception

For the Tempo PVC specifically, neither of the above fully resolved it on the first try, because volumeClaimTemplates on a StatefulSet are immutable: Kubernetes rejects any attempt to change them on an existing object. Clearing ArgoCD's stale cache fixes ArgoCD's intent going forward, but it can't retroactively fix a field that was never mutable on the live object in the first place.
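A server-side dry run is one way to confirm that immutability, not the cache, is what blocks the change. The sketch below assumes you have the rendered StatefulSet manifest in a local file (the file name in the usage comment is hypothetical) and that the API server's rejection contains the phrase "updates to statefulset spec"; the exact wording may differ between Kubernetes versions.

```shell
# Sketch: ask the API server whether it would accept a manifest, without
# changing anything. Returns 0 if accepted, 1 if rejected as an immutable
# StatefulSet change, 2 for any other failure.
check_immutable() {
  manifest=$1
  if err=$(kubectl apply --dry-run=server -f "$manifest" 2>&1); then
    echo "accepted"
    return 0
  fi
  case "$err" in
    # Assumed wording of the validation error; adjust if your cluster differs.
    *"updates to statefulset spec"*) echo "rejected: immutable field"; return 1 ;;
    *) echo "failed: $err"; return 2 ;;
  esac
}

# Usage (hypothetical file name):
#   check_immutable rendered-statefulset.yaml
```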

The fix there is to delete and recreate the StatefulSet itself (the underlying PVC and its data survive deleting the StatefulSet, as long as you don't also delete the PVC):

kubectl delete statefulset <name> -n <namespace> --cascade=orphan
# re-sync from ArgoCD to recreate the StatefulSet with the new template

--cascade=orphan deletes the StatefulSet object without deleting the Pods it owns or the PVCs created from its templates, letting ArgoCD's next sync recreate the StatefulSet (now with the corrected, non-stale template) and re-adopt the existing PVC.
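If you want reassurance that the claims really survived, the PVC list can be captured before and after the delete. This is only a sketch: orphan_delete_statefulset and pvc_names are hypothetical helpers, and the name and namespace in the usage comment are placeholders.

```shell
# Sketch: orphan-delete a StatefulSet and confirm the namespace's PVC list
# is unchanged afterwards.

# Print the sorted PVC names in a namespace.
pvc_names() {
  kubectl get pvc -n "$1" -o name | sort
}

orphan_delete_statefulset() {
  name=$1
  ns=$2
  before=$(pvc_names "$ns")
  # kubectl's own confirmation message goes to stderr to keep stdout clean.
  kubectl delete statefulset "$name" -n "$ns" --cascade=orphan >&2 || return 1
  after=$(pvc_names "$ns")
  if [ "$before" = "$after" ]; then
    echo "pvcs-intact"
  else
    echo "pvcs-changed"
    return 1
  fi
}

# Usage (hypothetical name and namespace):
#   orphan_delete_statefulset tempo monitoring
```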

A Second, Different-Looking Bug With a Related Cause: SharedResourceWarning

A related but distinct symptom: a resource flickers between two different specs, or gets pruned entirely, and .status.conditions on one of the Applications shows a SharedResourceWarning.

This isn't a cache problem. It's an ownership conflict. Two different ArgoCD Applications are both trying to manage a resource with the same name and namespace. In this case: a Helm chart's own ingressRoute.dashboard.enabled flag was creating a Traefik dashboard IngressRoute, while a separately, manually-defined IngressRoute with the same name existed in a different Application's manifest set, with both claiming ownership of the same object.

ArgoCD has no way to know which one is "correct". It just observes that the live object doesn't match what either Application individually expects, and flags the conflict rather than guessing.
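To find out which Applications carry the warning, the condition types can be read straight from the Application objects. A rough sketch using kubectl's jsonpath output plus awk; both function names are hypothetical, and the sketch assumes the Applications live in the argocd namespace.

```shell
# Sketch: list Applications whose status conditions include
# SharedResourceWarning.
apps_with_shared_resource_warning() {
  kubectl get applications.argoproj.io -n argocd \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[*].type}{"\n"}{end}' |
    awk -F '\t' '$2 ~ /SharedResourceWarning/ { print $1 }'
}

# Print the warning message(s) for one Application.
shared_resource_message() {
  kubectl get applications.argoproj.io "$1" -n argocd \
    -o jsonpath='{.status.conditions[?(@.type=="SharedResourceWarning")].message}'
}

# Usage (hypothetical application name):
#   apps_with_shared_resource_warning
#   shared_resource_message traefik
```

The message text should be enough to work out which object is contested and who the other claimant is.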

The fix is to pick exactly one owner and have the other stop claiming the resource:

# kubernetes/system/traefik/application.yml: Helm chart's own dashboard route, disabled
helm:
  values: |
    ingressRoute:
      dashboard:
        enabled: false  # the manual, Authelia-protected route below is canonical

# kubernetes/system/other-ingressroute.yml: the manually-defined route, kept
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: traefik-dashboard
  namespace: traefik
spec:
  # ... Authelia-protected route; this is the one that stays

Once only one Application's manifest set defines the object, recreate it: delete the live object so nothing created by the removed definition lingers, then let the remaining owner's next sync take over cleanly. After that, the warning clears.

Telling the Two Apart

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| A field reverts within seconds of a manual or git-pushed change; no error anywhere | Repo-server cache staleness | hard refresh annotation; restart argocd-repo-server if that's not enough |
| A field reverts but volumeClaimTemplates is involved on a StatefulSet | Cache staleness plus an immutable field that can't be patched in place | Same cache fix, plus delete-and-recreate the StatefulSet with --cascade=orphan |
| A resource flickers between two different specs, or gets pruned; SharedResourceWarning in .status.conditions | Two Applications both claim ownership of the same resource | Disable one owner's claim (Helm flag or manifest removal), keep the other |

The diagnostic tell: cache staleness is temporal. The same Application reverts a change made moments ago, and a refresh fixes it. Ownership conflict is structural. Check .status.conditions for SharedResourceWarning first; if it's there, refreshing the cache won't help, because there's nothing stale about either Application's understanding. They're both correctly rendering their own manifests, and the manifests themselves conflict.


The cache-staleness pattern is specific to ArgoCD's repo-server architecture, but the ownership-conflict pattern is universal to any GitOps tool managing Kubernetes resources. Flux has the same failure mode if two Kustomizations or HelmReleases both define a resource with the same identity. Checking .status.conditions before assuming a sync or cache problem saves a lot of time chasing the wrong fix.
