A hard-earned rule from incident retrospectives:

#devops #sre #kubernetes #terraform

LinkedIn Draft — Workflow (2026-04-21)

GitOps drift: the silent accumulation that makes clusters unmanageable

GitOps promises Git as the source of truth. The reality: every manual kubectl patch during an incident is a lie you told your cluster and forgot to retract.

GitOps truth gap over time:

Week 1:  Git ══════════ Cluster  (clean)
Week 4:  Git ══════╌╌╌╌ Cluster  (2 manual patches)
Week 12: Git ════╌╌╌╌╌╌╌╌╌╌╌╌╌  (drift accumulates)
                         Cluster  (unknown state)

Where it breaks:
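▸ Manual patches during incidents create cluster state Git doesn't know about. With automated reconciliation (Argo CD auto-sync with self-heal, Flux by default) the controller reverts them on the next reconcile with no warning; without it, they linger as untracked drift.
▸ Secrets whose real values live outside Git (Vault Agent injection, the Secrets a sealed-secrets controller decrypts in-cluster) change independently of the manifests and are largely invisible in sync status.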
▸ Multi-cluster setups multiply drift: each cluster diverges at its own pace once human intervention happens.

The rule I keep coming back to:
→ Treat every manual cluster change as a 5-minute loan. Commit it back to Git before the incident closes, or the next sync erases it.

How I sanity-check it:
▸ Argo CD sync status and app diff: surface OutOfSync resources before they contribute to an incident.
▸ Weekly diff job: live cluster state vs Git. Opens a PR for anything untracked. Makes drift visible before it's painful.
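
A minimal sketch of the comparison at the heart of that weekly diff job, assuming both sides are already loaded as parsed manifests (for example, rendered from the repo and exported from the cluster as JSON). The names find_drift, resource_key and declared_fields_match are hypothetical, and fetching live state or opening the PR is out of scope here. It only checks fields that Git declares, so server-side defaults don't count as drift, but a manually added annotation on a tracked resource would slip through.

```python
from typing import Any

Manifest = dict[str, Any]
Key = tuple[str, str, str]


def resource_key(manifest: Manifest) -> Key:
    """Identify a resource by (kind, namespace, name).

    Assumption: a missing namespace is treated as "default" on both sides,
    so a Git manifest that omits it still matches the live object.
    """
    meta = manifest.get("metadata", {})
    return (manifest.get("kind", ""), meta.get("namespace", "default"), meta.get("name", ""))


def declared_fields_match(desired: Any, live: Any) -> bool:
    """True if every field declared in Git has the same value in the cluster.

    Extra fields on the live side (status, defaults, managedFields) are ignored.
    Lists must match element by element and in length.
    """
    if isinstance(desired, dict):
        return isinstance(live, dict) and all(
            key in live and declared_fields_match(value, live[key])
            for key, value in desired.items()
        )
    if isinstance(desired, list):
        return (
            isinstance(live, list)
            and len(desired) == len(live)
            and all(declared_fields_match(d, l) for d, l in zip(desired, live))
        )
    return desired == live


def find_drift(git_manifests: list[Manifest], live_manifests: list[Manifest]) -> dict[str, list[Key]]:
    """Group drift into resources Git doesn't know, resources the cluster lacks,
    and resources whose declared fields differ."""
    desired = {resource_key(m): m for m in git_manifests}
    live = {resource_key(m): m for m in live_manifests}
    shared = desired.keys() & live.keys()
    return {
        "untracked": sorted(live.keys() - desired.keys()),
        "missing": sorted(desired.keys() - live.keys()),
        "changed": sorted(k for k in shared if not declared_fields_match(desired[k], live[k])),
    }
```

Anything under "untracked" or "changed" is a candidate for the PR; "missing" usually points at a failed or paused sync rather than a manual patch. Running the same function per cluster gives a rough view of how far each one has diverged.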

The best platform teams I've seen measure success by how rarely product teams have to think about infrastructure.

Deep dive: https://neeraja-portfolio-v1.vercel.app/workflows/gitops-drift-the-silent-accumulation-that-makes-clusters-unmanageable

Curious what guardrails you've built around this. Drop your pattern below.