It starts with a Slack message from the product manager. "Hey, the new checkout flow... is it on or off in staging? It seems to disappear like every few hours."
That's when your heart sinks. A flickering feature is so much worse than a broken one.
You check the site. They're right. The new feature is gone. But you swear it was there an hour ago.
First stop: git history. The team uses GitOps, so the deployment YAML in the repo is the source of truth. Right? Right? The feature flag, an environment variable called ENABLE_NEW_CHECKOUT_FLOW, is set to "true". No recent commits.
The GitOps dashboard is all green. The cluster is in sync - no recent actions recorded. As far as it knows, everything is perfect.
Fine. A couple of deep breaths, then exec into a pod and print the env vars. And there it is: ENABLE_NEW_CHECKOUT_FLOW="false".
How?
If the GitOps tool is your only means of managing the deployment, how can the live state differ from what's in Git? Git is supposed to be the truth. When reality disagrees and you don't know how it happened, you've lost control of your deployment, and you don't even know what took it from you.
It's probably not a person. It's something automated. A ghost in the machine.
The Ghost in the Machine
It happens all the time in clusters that have been around for a while. Your "Ol' Reliable" GitOps stack is doing its job perfectly, but a ghost of Christmas Past still lingers: an old, forgotten Jenkins job from before the great migration to GitOps, still running on a schedule. Some dumbass spun up an instance that was meant to be decommissioned, for a "quick peek" that should have ended a year ago. It applies its own manifest, one where the new feature flag is never declared, so inside the pod the flag falls back to its default: disabled.
Every three hours, the Jenkins job wakes up, connects to the cluster, and dutifully applies the old config. An hour later, your GitOps tool detects the drift and quietly reverts it; an auto-sync revert doesn't show up as a "user" action, so nothing looks suspicious. The feature flickers on and off. Damn it, Konrad, you never questioned auto-sync, because it just works.
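For illustration, the stale manifest might look something like this (the container name and image are made up; the deployment name is the one from later in this story). What matters is what's missing: the env list never declares the flag, so every apply from Jenkins wipes it from the live object and the app falls back to its default.

```yaml
# Hypothetical pre-migration manifest that the old Jenkins job still applies.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
spec:
  template:
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:stable
          env:
            - name: LOG_LEVEL
              value: "info"
            # No ENABLE_NEW_CHECKOUT_FLOW here -- applying this manifest
            # drops the flag that the GitOps repo set to "true".
```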
This is the kind of stuff that makes good engineers quit.
The secret to finding the ghost is buried in a Kubernetes feature called managedFields. It's supposed to be a tiny audit log for every field in your YAML, tracking which "manager" last touched it.
In theory, it's the answer. In practice, have you ever tried to read that thing? It's a solid wall of text, and it's useless in a firefight.
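For the curious, here's a minimal sketch (not vismo's actual code) of what decoding that wall of text involves. In managedFields, each entry's fieldsV1 stores ownership as nested keys prefixed with f: (field), k: (list key), or v: (list value), which have to be flattened into readable paths. The sample entry below is hypothetical but shaped like what the API returns.

```python
def flatten_fields(fields, prefix=""):
    """Recursively turn fieldsV1's f:/k:/v: keys into dotted paths."""
    paths = []
    for key, value in fields.items():
        if key == ".":  # marks ownership of the parent field itself
            continue
        name = key[2:] if key.startswith(("f:", "k:", "v:")) else key
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict) and value:
            paths.extend(flatten_fields(value, path))
        else:
            paths.append(path)
    return paths

# A hypothetical managedFields entry for our haunted deployment:
entry = {
    "manager": "system:serviceaccount:jenkins:default",
    "operation": "Apply",
    "time": "2026-03-13T12:00:00Z",
    "fieldsV1": {
        "f:spec": {
            "f:template": {
                "f:spec": {
                    "f:containers": {
                        'k:{"name":"checkout"}': {
                            "f:env": {
                                'k:{"name":"ENABLE_NEW_CHECKOUT_FLOW"}': {
                                    "f:value": {}
                                }
                            }
                        }
                    }
                }
            }
        }
    },
}

# Prints each owned path followed by who owns it.
for path in flatten_fields(entry["fieldsV1"]):
    print(f"{path}  <-  {entry['manager']} ({entry['operation']})")
```

That's the whole trick: the raw data is all there, it's just encoded for machines, not for a human mid-incident.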
I got tired of the guesswork. So I built a tool.
To be completely honest, I've never written about my code before. But this issue has cost me so many hours that I felt compelled to build something. If others have been through the same suffering, maybe, just possibly, this makes their lives a little easier.
I've been haunted by these automation ghosts. I've lost whole afternoons trying to trace which of a dozen different systems was responsible for a random change.
So I built vismo.
It's a simple tool that does one thing: it makes managedFields readable by actual humans.
The blame command, which I shamelessly named after git blame, shows you the owner of every single field.
So when the feature disappears, you run this:
$ vismo blame deployment/checkout-service -n staging

FIELD                                            OWNER                                  OPERATION  TIME
metadata.labels.app                              argo-cd                                Update     2026-03-13T10:05:00Z
spec.replicas                                    argo-cd                                Update     2026-03-13T10:05:00Z
...
spec.template.spec.containers.[0].env.[2].value  system:serviceaccount:jenkins:default  Apply      2026-03-13T12:00:00Z
And there's your ghost.
The value isn't owned by argo-cd. It's owned by the jenkins:default service account. That's it. That's the lead you needed. You know, in seconds, that a rogue pipeline is the culprit.
It turns the blame game into a 10-second diagnosis.
If you've ever had that feeling in your gut when brutal reality doesn't match the repo, give it a try. I hope it helps you hunt down some ghosts of your own.
