The most useful insight in the Kubernetes-drift postmortem I want to walk through here is the one the team writes near the end, almost in passing: production didn't break at the moment of the deploy. It broke six months earlier, when somebody ran kubectl edit on a ConfigMap and didn't put the change in Git. The deploy was the moment that fact became visible.
That sentence reframes a whole class of cluster outage. It's worth taking seriously. The team I'm retelling here ran a perfectly normal release pipeline (GitLab CI; tests; image build; helm upgrade --atomic as the last step) against production one afternoon, watched the green checkmark appear, and then watched the alerts fire. New replicas couldn't reach the database. Rolling the image back to the previous tag, the one that had been running fine for months, did not help. Same CrashLoopBackOff. Same authentication failure against PostgreSQL. The version of the application that had been alive at lunchtime would not boot from clean state at 3 PM.
The reason was that the running pods at lunchtime had been initialised at some forgotten point in the past, when somebody had kubectl edit-ed the users-api-env ConfigMap to fix an unrelated PgBouncer issue. The fix had never made it into the chart's values-prod.yaml. The ConfigMap in the cluster diverged from the ConfigMap that Helm thought it was managing. The pods kept running, holding their open connections, perfectly content. The Helm release was only consulted when something rendered new pods. Then the chart's "correct" version of DB_HOST was the one that got applied — and it was the one that had been wrong for half a year.
This is the load-bearing observation. The release didn't break the service. It synchronised a divergence that had been sitting there for months. Reading it that way, the question stops being "what went wrong with this deploy" and becomes "why did the cluster have a state that didn't exist in Git in the first place, and how had it been allowed to live there for so long."
Why the rollback didn't help, and why it never could have
The reflex when a deploy goes red is to roll the image back. That works when the regression is in the image: the new code does something the old code didn't, the old code doesn't do that thing, the symptoms go away. It does not work when the regression is in the environment and the deploy was the trigger for the environment to be re-rendered.
In this incident the image was a red herring. Both the new image and the old image were starting up against the same chart-derived config, because both were starting from a clean pod, and the clean pod's config was the config in the chart, and the config in the chart was wrong. Image rollbacks are an answer for one shape of regression. They are silent on another shape, and the cluster's current state quietly determines which shape you are looking at.
This is also where the team realised --atomic had been doing less than they thought. The Helm --atomic flag rolls back the upgrade if the operation itself fails. It does not protect against an upgrade that succeeds operationally and breaks behaviourally a few minutes later. Kubernetes saw replicas come up, declared rollout done, and the pipeline turned green. The 5xx surge happened after that, on real traffic, against real bugs, in a cluster that thought everything had gone fine.
Five things the team had been carrying without seeing them
Re-reading the team's own list of weak points, what's striking is how much of it is invisible until something pulls on it. They wrote down five:
- CI had direct access to the production cluster. Kubeconfig in GitLab CI variables, protected and masked, but a credential whose presence was assumed and forgotten. An attacker who lands inside CI is one yaml away from production state.
- The runner had inbound access to the Kubernetes API server. The model required it; nobody loved it.
- Git was a description of what the team thought was running, not what was running. The two had been allowed to disagree.
- There was no continuous drift check. A divergence could persist indefinitely, and only the next pod scheduling event would surface it.
- Rollback covered the image, not the environment. When the regression lived in the environment, the rollback control was pointing at the wrong layer.
The combination is what made the incident long. Each item alone is fixable; the team was running on the assumption that the items were all small enough not to matter, and the cluster's response was the standard one: state you don't watch will drift.
Why a "no manual edits" rule would not have been enough
The reasonable response to an incident like this is to add a rule. Don't kubectl edit production. Run a deploy job. File a change request. Rules of this kind are useful and people sometimes follow them. But rules don't change the system's properties. They change how many incidents leave a paper trail. Somebody at three in the morning will still reach for the fastest available control surface, and the fastest available control surface will still be kubectl edit. The cluster will accept the edit. The cluster does not know the difference between an edit committed to Git and one performed in anger.
What the team wanted was a property of the system rather than a property of the operators: production should converge to the state in Git, automatically and continuously, as a structural fact. That is the substance of GitOps as the OpenGitOps project (overseen by CNCF TAG App Delivery's GitOps Working Group) defines it: declarative desired state, versioned and immutable, pulled automatically by an in-cluster agent, and continuously reconciled against the live system. Rules sit on top of operators. Reconciliation sits on top of the cluster. They are not the same engineering target.
The team picked Argo CD. They could have picked Flux, and the difference for what they needed came down largely to UI ergonomics — visible diffs, Synced / OutOfSync statuses readable by people who weren't platform engineers. What mattered was that the cluster now contained an agent whose entire job was to look at the manifests in a Git repository, look at the state of the cluster, and complain about the difference.
How they rolled it out, and why the rollout itself is the lesson
If you've been near GitOps adoption you've seen the failure mode: somebody enables auto-sync, prune, and self-heal on a long-lived production cluster all at once, and a controller starts deleting things the cluster needed but Git never knew about. The team avoided that by deliberately walking through the gradient.
They started with a pure observation mode. Argo CD pointed at the manifests, watched the cluster, and reported OutOfSync on every divergence. It did not act. The output was a list. The list was what mattered. Some of the differences were legitimate runtime fields (status, certain annotations added by admission controllers, replica counts under HPA). Some of them were real drift exactly of the kind that had caused the incident: forgotten ConfigMaps, ad-hoc Service objects, role bindings left over from experiments, Ingress annotations from a deprecated workaround. Each line in the list got classified before any control was turned on.
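As a rough sketch of what the observe-only phase could look like for the service in this story: an Application with no automated block under syncPolicy reports drift but never syncs on its own. The ignoreDifferences entry is an assumption about how one legitimate runtime field (replica counts owned by an HPA) might be kept out of the diff list. The names and paths mirror the ones used elsewhere in this post; this is not a manifest from the postmortem.

```yaml
# Observe-only sketch (assumed, not the team's actual manifest):
# no syncPolicy.automated, so Argo CD reports OutOfSync but never applies.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: users-api
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://git.example.com/platform/prod-manifests.git
    targetRevision: main
    path: services/users-api
    helm:
      valueFiles:
        - values-prod.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: users
  # Runtime field classified as "not drift": the HPA owns the replica count.
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas
```

Everything that still shows up as a diff after rules like this one is a candidate for the "real drift" pile.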
Only after the classification did selfHeal go on, app by app. Only after that did prune go on, project by project. The whole thing was a phased rollout, not a flag flip:
| Phase | Argo CD setting | What it gives you | What it can break |
|---|---|---|---|
| Observe-only | syncPolicy.automated absent (manual sync only) | A diff list and a classification exercise: drift, runtime field, or owned-elsewhere | Nothing (pure read) |
| Auto-sync (subset) | automated.prune: false, automated.selfHeal: false per app | Convergence to Git for the chosen apps; no retroactive cleanup yet | A wrong commit hits prod faster than a human would catch it |
| Self-heal | automated.selfHeal: true | A manual kubectl edit no longer survives reconciliation | Operator field-ownership fights surface (HPA vs Git on replicas, etc.) |
| Prune (per project) | automated.prune: true | Resources removed from Git actually leave the cluster | Anything you forgot to commit gets deleted |
| RBAC narrowing | Argo CD AppProject whitelists | Bounded blast radius per project; CD agent isn't cluster-admin | Initial whitelist mis-config breaks deploys until corrected |
The fix for the incident itself, before any of this, was a one-line MR. It's worth showing. This is the entirety of the change that brought production back:
```diff
 env:
-  DB_HOST: "pgbouncer.users.svc.cluster.local"
+  DB_HOST: "pgbouncer-primary.users.svc.cluster.local"
```
Eight characters of suffix, after a kubectl describe-and-verify on the live pgbouncer-primary Service to confirm there were healthy endpoints behind it. The team did the merge through the existing pipeline rather than re-issuing a kubectl edit on the ConfigMap, which would have been faster by perhaps two minutes and would have re-introduced the original mistake on the same afternoon they were trying to learn from it.
The Argo CD Application they ended up with for that service, after all phases were on, looks roughly like this:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: users-api
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://git.example.com/platform/prod-manifests.git
    targetRevision: main
    path: services/users-api
    helm:
      valueFiles:
        - values-prod.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: users
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=false
      - ApplyOutOfSyncOnly=true
```
What changed around the Application is at least as important as the Application itself. Four pieces:
The CI runner lost its kubeconfig. It builds images. It pushes images. It commits a tag bump into the infra repository through a bot account whose only right is to write to that repo. That is the entire deploy authority CI has. Argo CD does the rest, from inside the cluster, by pulling state. That inversion, from CI pushing into the cluster to an in-cluster agent pulling from Git, is the core of the GitOps model.
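A minimal sketch of what such a tag-bump job might look like in GitLab CI. Nothing here comes from the postmortem: the job name, the tools image, the INFRA_BOT_TOKEN variable, the bot identity, and the assumption that the chart reads its image tag from .image.tag in values-prod.yaml are all hypothetical.

```yaml
# .gitlab-ci.yml fragment (illustrative; every name below is an assumption)
bump-prod-tag:
  stage: deploy
  # Hypothetical internal image that ships git and mikefarah yq v4.
  image: registry.example.com/tools/git-yq:latest
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  script:
    # INFRA_BOT_TOKEN: hypothetical masked CI variable with write access to the infra repo only.
    - git clone "https://infra-bot:${INFRA_BOT_TOKEN}@git.example.com/platform/prod-manifests.git"
    - cd prod-manifests
    # Assumes the chart takes its image tag from .image.tag in values-prod.yaml.
    - TAG="$CI_COMMIT_SHORT_SHA" yq -i '.image.tag = strenv(TAG)' services/users-api/values-prod.yaml
    - git config user.name "infra-bot"
    - git config user.email "infra-bot@example.com"
    - git commit -am "users-api deploy ${CI_COMMIT_SHORT_SHA}"
    - git push origin HEAD:main
```

The job holds no kubeconfig and never talks to the API server. A team that wants human review on every bump would push a branch and open a merge request instead of pushing to main.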
Argo CD itself got reasonable RBAC. The team broke applications into Argo CD AppProjects (production, platform, system), and the production project has an empty clusterResourceWhitelist. Application repositories cannot create ClusterRole, cannot create MutatingWebhookConfiguration, cannot define new CRDs. Anything cluster-scoped is somebody else's repository and somebody else's review.
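For illustration, an AppProject along these lines might look like the sketch below. The repository URL and namespace reuse the examples from this post; the description and the exact list of destinations are assumptions, not the team's configuration.

```yaml
# AppProject sketch (assumed shape; not copied from the postmortem)
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production
  namespace: argocd
spec:
  description: Namespaced application workloads only
  # Applications in this project may only pull from this repository.
  sourceRepos:
    - https://git.example.com/platform/prod-manifests.git
  # ...and may only deploy into these cluster/namespace pairs.
  destinations:
    - server: https://kubernetes.default.svc
      namespace: users
  # Empty list: no cluster-scoped kinds (ClusterRole, CRDs, webhook configs) may be synced.
  clusterResourceWhitelist: []
```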
Secrets moved out of Git. The Kubernetes docs are explicit about why: "Kubernetes Secrets are, by default, stored unencrypted in the API server's underlying data store (etcd). Anyone with API access can retrieve or modify a Secret, and so can anyone with access to etcd." Base64 is encoding, not protection. The team adopted External Secrets Operator. Git holds an ExternalSecret resource that says "this app needs the secret at prod/users-api/db"; the actual value lives in the configured store (Vault, AWS Secrets Manager, Doppler, whatever). The split is the point: declaration in Git, value in a secrets system that knows how to do rotation and audit.
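A hedged sketch of the kind of ExternalSecret that would live in Git under this split. Only the remote path prod/users-api/db comes from the text above; the store name, the target Secret name, the DB_PASSWORD key, and the password property are hypothetical, and the apiVersion depends on which External Secrets Operator release is installed.

```yaml
# ExternalSecret sketch; names other than the remote key are hypothetical.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: users-api-db
  namespace: users
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: prod-secrets          # hypothetical store backed by Vault, AWS Secrets Manager, etc.
  target:
    name: users-api-db          # Kubernetes Secret the operator creates and keeps in sync
  data:
    - secretKey: DB_PASSWORD    # key inside the generated Secret
      remoteRef:
        key: prod/users-api/db  # path named in the text above
        property: password      # assumed field within that remote entry
```

Nothing in this file is sensitive, so it can be reviewed and reverted like any other manifest.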
There is an emergency path. Not a workaround, an explicit procedure: pause auto-sync for the affected Application, do the manual change, restore service, immediately commit the change, re-enable auto-sync. You don't pretend manual changes never happen in incidents. You make sure they don't survive into normality.
What stops being mysterious afterwards
The change that surprised the team, and the one I find most interesting to read about, is not in the controller. It's in the language people use during incidents.
Before, an outage would generate a familiar list of questions: what's actually in the cluster right now? Did this even apply? Who changed the ConfigMap? Why does Git say one thing and the namespace say another? Those questions weren't trivia. What hurt was the team's inability to answer them quickly. Now Synced means the live state matches Git within the configured comparison rules. OutOfSync means there's a diff and the diff is visible. The questions don't go away (Kubernetes is still Kubernetes), but the gap between asking and answering has collapsed into something the Argo CD UI can show in a tab.
Rollback works at the right layer. To roll a tag, revert the tag in Git. To roll a config change, revert the config commit. The control surface and the failure surface line up. This does not extend to data — GitOps does not roll back database migrations, queues, or external contracts, and pretending it does is a separate kind of mistake — but it does extend, cleanly, to everything Kubernetes considers its own.
Manual edits stop being invisible. This is the cultural change the technology forces. A six-month-old kubectl edit could survive every previous deploy because nothing was looking. Now it becomes OutOfSync within minutes, or, after self-heal goes on, gets quietly reverted. The team writes that some people didn't love this. Through a couple of incidents the consensus settled: GitOps doesn't stop you from fixing production, it stops you from forgetting what you fixed.
The CI attack surface contracts. The compromise of a CI runner used to mean potential write access to production. Now an attacker who lands in CI gets a registry credential and a write-to-infra-repo bot. To reach production they need to land a malicious change in an infra-repo branch, get it past review, and hope nobody catches it before reconciliation. That's not nothing, but it's a longer chain than "exec a kubectl with the credential we already have."
And there's a quieter outcome, which I think is the most honest one in the postmortem: the team got better at noticing where GitOps shouldn't be the only mechanism. Arbitrarily heavy schema migrations don't belong in Argo CD sync waves. Expand-and-contract migrations, feature flags, runtime guards, and proper observability sit in a different layer. GitOps applies declarative Kubernetes state well; it doesn't apply data-shape changes in PostgreSQL well. Knowing the boundary makes both layers cheaper.
Git as source of truth is a property, not a slogan
The line I want to leave here is the one the team uses to close their own piece. Git as source of truth has become unremarkable from repetition, and it's possible to deploy software for years without testing whether it's actually true in your environment. After this incident the team rephrases it as a set of falsifiable questions: can you take a namespace, delete the managed resources, and rebuild it from Git? Can you roll a change by reverting a commit, not by piecing together kubectl history from Slack? Can you remove CI's access to the cluster API and still ship? If any answer is no, Git isn't your source of truth yet — it's a partial description the cluster sometimes consults.
That's the framing I'm taking from the postmortem. The deploy didn't break the service. The team had been running on a production state that existed only because somebody had once fixed something quickly and forgotten. The release pulled the curtain back, and the work after that was to build a system whose normal operation makes that kind of forgetting structurally impossible.