Three months into a production migration, we discovered that 14 of our 47 deployments had quietly drifted from their declared state. Not in a dramatic, pager-firing way. In the slow, invisible way that turns a Tuesday afternoon into a Friday incident.
That's the thing about configuration drift. It doesn't announce itself. It accumulates.
Here's what happened, what we built to fix it, and why I think most teams are one bad deploy away from the same problem.
The Setup
We were running a mid-sized Kubernetes cluster across three environments: dev, staging, and production. Standard GitOps workflow. ArgoCD handling deployments. Helm charts checked into Git. Everything was "declarative." Everything was "source-of-truth."
Except it wasn't.
Engineers were patching things manually under pressure. kubectl edit became a habit. Resource limits got tweaked directly on pods. ConfigMaps were updated in-cluster without touching the repo. Nobody flagged it because nothing broke. The cluster kept humming. The dashboards stayed green.
Then we started seeing weird behavior. A service that should have been running with a 512Mi memory limit was sitting at 2Gi. Another deployment had two replicas when the Helm chart clearly declared three. A sidecar container version was six weeks behind what we'd intended to ship.
None of it was catastrophic. All of it was real. And we had no idea how long it had been that way.
Point 1: GitOps Sync Status Isn't the Same as Drift Detection
This is the part that trips people up. ArgoCD told us our apps were "Synced." And technically, they were, at the moment of the last sync. But sync status is a snapshot, not a continuous assertion. If someone runs kubectl edit afterwards, the app might eventually flip to OutOfSync on a refresh, but nothing alerts on it out of the box, and with self-heal off (the default) nothing reverts it. In practice, nobody is watching for that.
Drift detection means continuously comparing what's running in the cluster against what's declared in Git, and alerting when they diverge. That's a different problem than deployment sync. Most teams conflate them and pay for it later.
We built a reconciliation loop using a combination of ArgoCD's resource tracking and a custom controller that scraped live cluster state on a 5-minute interval, diffed it against our Helm-rendered manifests, and pushed the deltas into a monitoring pipeline. Nothing fancy. About 400 lines of Go and a Prometheus exporter.
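To give that a shape, here's a minimal sketch of the loop, not the controller itself. renderDeclared, fetchLive, and semanticDiff are hypothetical stand-ins for the Helm-render, cluster-read, and compare steps, and the metric name is invented; the structure (render, fetch, diff, export) is the part that matters.

```go
// drift_loop.go — a minimal sketch of the reconciliation loop described above.
// renderDeclared, fetchLive, and semanticDiff are hypothetical stand-ins for the
// Helm-render, cluster-read, and compare steps; the metric name is made up too.
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// objectKey identifies one managed resource.
type objectKey struct {
	Namespace, Kind, Name string
}

var driftGauge = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "config_drift_detected",
		Help: "1 if the live object diverges from its declared manifest, else 0.",
	},
	[]string{"namespace", "kind", "name"},
)

func main() {
	prometheus.MustRegister(driftGauge)

	// Expose the metrics endpoint for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	go func() { log.Fatal(http.ListenAndServe(":9100", nil)) }()

	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()

	for range ticker.C {
		declared, err := renderDeclared() // hypothetical: parse `helm template` output from Git
		if err != nil {
			log.Printf("render failed: %v", err)
			continue
		}
		for key, want := range declared {
			got, err := fetchLive(key) // hypothetical: read the live object from the API server
			if err != nil {
				log.Printf("fetch %v failed: %v", key, err)
				continue
			}
			v := 0.0
			if semanticDiff(want, got) { // hypothetical: compare only the fields Git declares
				v = 1.0
			}
			driftGauge.WithLabelValues(key.Namespace, key.Kind, key.Name).Set(v)
		}
	}
}

// Stubs so the sketch compiles; the real versions are where the 400 lines went.
func renderDeclared() (map[objectKey]map[string]interface{}, error) { return nil, nil }
func fetchLive(key objectKey) (map[string]interface{}, error)       { return nil, nil }
func semanticDiff(want, got map[string]interface{}) bool            { return false }
```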
The first run returned 14 drifted resources. Four of them in production.
Point 2: The Real Problem Is Toil and Pressure, Not Malicious Intent
Every one of those manual edits had a story.
A memory limit bumped because an OOMKill was happening at 2 AM and someone needed to stop the bleeding. A replica count changed because load spiked and autoscaling hadn't kicked in fast enough. A ConfigMap updated because a third-party API changed its endpoint and we needed 30 seconds to fix it, not 30 minutes to run a pipeline.
These aren't reckless engineers. These are engineers solving real problems with the tools in front of them.
The issue is the feedback loop, or the lack of one. Without drift detection, that 2 AM fix becomes permanent. Nobody goes back. The PR never gets opened. The Helm chart never gets updated. And six weeks later, someone deploys from Git and rolls back the fix that's been holding production together.
Sound familiar?
The fix isn't telling people to stop using kubectl edit. It's making the correct path faster than the escape hatch, and making drift visible so it can't quietly accumulate.
Point 3: Alerting on Drift Changes the Culture
Once engineers could see a drift dashboard broken down by namespace, team, and resource type, behavior shifted. Not because we mandated it. Because visibility creates accountability in a way that process documents never do.
We tagged each drift event with:
- The last-known modifier (pulled from audit logs via the Kubernetes API)
- Time since divergence
- Severity of the delta
A replica count change is low severity. A security context change is high. A resource limit change that's 4x the declared value gets a PagerDuty alert.
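As a rough sketch of how that bucketing could look in code (the driftEvent shape and the field-path matching here are illustrative, not our actual schema):

```go
// severity.go — a sketch of the severity bucketing described above. The
// driftEvent shape and field paths are illustrative, not our actual schema.
package drift

import "strings"

type severity int

const (
	sevLow  severity = iota // shows on the dashboard
	sevHigh                 // shows red, needs prompt attention
	sevPage                 // fires a PagerDuty alert
)

// driftEvent is an illustrative shape for one detected divergence.
type driftEvent struct {
	FieldPath     string  // e.g. "spec.replicas" or "spec.template.spec.securityContext"
	DeclaredValue float64 // numeric declared value, when the field is numeric
	LiveValue     float64 // numeric live value, when the field is numeric
	LastModifier  string  // last-known modifier, from the audit logs
}

func classify(e driftEvent) severity {
	switch {
	// Security-sensitive fields are always high severity.
	case strings.Contains(e.FieldPath, "securityContext"):
		return sevHigh
	// A resource limit at 4x (or more of) the declared value pages someone.
	case strings.Contains(e.FieldPath, "resources.limits") &&
		e.DeclaredValue > 0 && e.LiveValue >= 4*e.DeclaredValue:
		return sevPage
	// Replica count drift is annoying, not dangerous.
	case strings.Contains(e.FieldPath, "replicas"):
		return sevLow
	default:
		return sevLow
	}
}
```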
Within three weeks of launching the dashboard, the team had self-corrected 11 of the 14 original drifted resources without us asking. They just didn't want to see red in their namespace.
We also made it a blocker on our weekly architecture review. Any service with unresolved drift older than 72 hours got a 5-minute explanation from the owning team. Not punitive, just a forcing function for documentation and communication.
Point 4: Drift Detection Has to Be Cheap to Maintain
Here's where most homegrown solutions fall apart. You build the thing, it works, then it becomes another system someone has to babysit.
We kept ours deliberately simple:
- No custom UI. A Grafana dashboard pulling from Prometheus.
- The controller runs as a standard Kubernetes deployment with a ServiceAccount scoped to read-only cluster access.
- The diff logic uses server-side apply dry-runs, which Kubernetes gives you for free (there's a sketch of this right after the list).
- Total compute overhead is negligible.
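The dry-run piece is short enough to show. This is a hedged sketch rather than our controller verbatim: it assumes a recent client-go whose dynamic client supports Apply, the "drift-detector" field-manager name is arbitrary, and a production version would need to handle cluster-scoped resources and server-defaulted fields more carefully.

```go
// A sketch of the server-side apply dry-run diff, assuming a recent client-go
// whose dynamic client has Apply. The field-manager name, the metadata fields
// stripped before comparison, and the DeepEqual check are illustrative choices.
package drift

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/api/equality"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// hasDrift reports whether applying the Git-declared object (server-side,
// dry-run) would change what is currently live in the cluster.
func hasDrift(ctx context.Context, dyn dynamic.Interface, gvr schema.GroupVersionResource, declared *unstructured.Unstructured) (bool, error) {
	ns, name := declared.GetNamespace(), declared.GetName()

	// What is actually running right now.
	live, err := dyn.Resource(gvr).Namespace(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return false, fmt.Errorf("get live %s/%s: %w", ns, name, err)
	}

	// What the object would become if the declared manifest were applied.
	// DryRunAll asks the API server to compute the result without persisting it.
	applied, err := dyn.Resource(gvr).Namespace(ns).Apply(ctx, name, declared, metav1.ApplyOptions{
		FieldManager: "drift-detector",
		Force:        true,
		DryRun:       []string{metav1.DryRunAll},
	})
	if err != nil {
		return false, fmt.Errorf("dry-run apply %s/%s: %w", ns, name, err)
	}

	// Strip fields the server mutates on every write before comparing.
	for _, obj := range []*unstructured.Unstructured{live, applied} {
		unstructured.RemoveNestedField(obj.Object, "metadata", "resourceVersion")
		unstructured.RemoveNestedField(obj.Object, "metadata", "managedFields")
		unstructured.RemoveNestedField(obj.Object, "metadata", "generation")
	}
	return !equality.Semantic.DeepEqual(live.Object, applied.Object), nil
}
```

The nice property of this approach is that the API server does the merge semantics for you: you're diffing two objects it computed, not re-implementing strategic merge patch yourself.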
We've been running it for eight months. It's needed exactly two bug fixes and one config update when we migrated Helm chart versions. That's it.
Complexity is debt. Every additional feature you bolt on is another thing that can fail or get abandoned.
The Takeaway
Drift detection isn't a Kubernetes problem. It's a systems problem. The cluster just happens to be where the drift lives.
If you're running GitOps and you've never run a diff between your declared manifests and your live cluster state, you probably have drift. You just don't know what it looks like yet.
The question worth sitting with: what decisions are you currently making based on cluster state that you think matches Git, but doesn't?