Three days I lost chasing a ghost that was already dead on disk

#python #discuss #devjournal #linux

I want to talk about the dumbest bug I've shipped to myself in months. Not because it was clever. Because it wasn't, and I burned three days on it anyway.

Mid-April. I'm doing what I called the Felix V3 graft - rewiring five components in the bot stack to share a single config accessor instead of each one reading env vars on its own. Boring plumbing work. The kind of refactor where you're not building anything new, you're just trying to stop the duplication that's been mocking you every time you open a file.

I got it done in an afternoon. Five components touched, accessor in place, fail-open default so if a key is missing the bot returns a safe value instead of crashing. I was proud of the fail-open part. Felt grown up.
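For anyone who hasn't met the pattern, here is a minimal sketch of what a fail-open accessor can look like, assuming the config lives in environment variables. It is not the Felix code: get_config, the cast parameter, and FELIX_TIMEOUT are made-up names for illustration.

```python
import logging
import os

log = logging.getLogger("config")


def get_config(key, default, cast=str, env=None):
    """Return env[key] converted with `cast`, or `default` if missing or invalid.

    Hypothetical helper: "fail-open" here means a bad or missing key
    produces a warning and a safe default instead of an exception.
    """
    env = os.environ if env is None else env
    raw = env.get(key)
    if raw is None:
        log.warning("config key %s missing, using default %r", key, default)
        return default
    try:
        return cast(raw)
    except (TypeError, ValueError):
        log.warning("config key %s has bad value %r, using default %r",
                    key, raw, default)
        return default


if __name__ == "__main__":
    # FELIX_TIMEOUT is an invented key name, used only for this example.
    print(get_config("FELIX_TIMEOUT", 30, cast=int))
```

The warning matters as much as the default. A fail-open accessor that stays silent will hide a renamed key just as well as it survives one.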

Then the tests went sideways.

The logs were printing variable names that didn't exist in the new code. I'd grep the repo - nothing. Open the file - the line wasn't there. Restart the test. Same log line. Same ghost variable. I started questioning whether I was reading the right repo. Whether I had two checkouts. Whether the disk had silently corrupted. I ran find / -name "felix*.py" 2>/dev/null like a paranoid person at 2am.

I had never opened a terminal until a couple of months ago, so my mental model of how a Linux service actually runs is, let's say, under construction. I knew the bot was a systemd service. I knew I'd been editing the files. I assumed - and this is the whole bug - that editing the file on disk meant the running process saw the new code. Because in my head, the file IS the program.

Day three, sitting there with coffee gone cold, I finally read the systemd docs for real instead of skimming. And there it was. When you change a unit file, or the code a daemon is executing, the running process keeps running the version it loaded into memory at startup. Python reads and compiles a module once, at import, and edits made to the file afterward never reach the process that already imported it. The disk has the new code. RAM has the old code. They are two different programs that happen to share a name.
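The disk-versus-memory split is easy to reproduce without systemd at all. The sketch below is an illustration, not anything from the bot: it writes a throwaway module called ghost_demo (an invented name) to a temp directory, imports it, rewrites the file, and then asks both the loaded module and the disk what the code says.

```python
import importlib
import pathlib
import sys
import tempfile


def disk_vs_memory():
    """Import a throwaway module, edit its file, and compare memory with disk."""
    with tempfile.TemporaryDirectory() as tmp:
        source = pathlib.Path(tmp) / "ghost_demo.py"  # hypothetical module name
        source.write_text("def greet():\n    return 'old code'\n")

        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            module = importlib.import_module("ghost_demo")
            # Edit the file on disk while the module is already loaded.
            source.write_text("def greet():\n    return 'new code'\n")
            in_memory = module.greet()     # what the running process executes
            on_disk = source.read_text()   # what an editor or grep would show
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("ghost_demo", None)
    return in_memory, on_disk


if __name__ == "__main__":
    in_memory, on_disk = disk_vs_memory()
    print("running process says:", in_memory)
    print("file on disk says:", on_disk.strip())
```

The loaded function keeps returning the old string even though the file now contains the new one. A long-running service is the same situation stretched over days.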

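sudo systemctl daemon-reload re-reads the unit definitions, which only matters if the unit file itself changed, and it does not restart anything. sudo systemctl restart felix is what actually stops the old process and loads the Python into a fresh one. I had been doing neither. I had been editing files and watching a corpse run.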
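One way to catch this sooner is to have the service notice when its own source files are newer than the process. This is a rough sketch under assumptions, not something the bot does: stale_modules and PROCESS_START are hypothetical names, and the start time is approximated by when this module is first imported.

```python
import os
import sys
import time

# Approximation of process start: the moment this module is first imported.
PROCESS_START = time.time()


def stale_modules(since=PROCESS_START, modules=None):
    """Return sorted names of loaded modules whose file changed after `since`."""
    modules = sys.modules if modules is None else modules
    stale = []
    for name, mod in list(modules.items()):
        path = getattr(mod, "__file__", None)
        if not path:
            continue  # built-in or namespace module, nothing on disk to check
        try:
            if os.path.getmtime(path) > since:
                stale.append(name)
        except OSError:
            continue  # file moved or deleted since import
    return sorted(stale)


if __name__ == "__main__":
    changed = stale_modules()
    if changed:
        print("source changed since startup, restart needed:", changed)
    else:
        print("loaded code matches what was on disk at startup")
```

Logging that list from a periodic health check would have turned a three-day ghost hunt into a one-line warning. It only compares modification times, so it flags a file that was touched without being changed, and it says nothing about files that were never imported.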

The moment I restarted the service, the ghost variable vanished. The new accessor took over. The fail-open default kicked in for one missing key I hadn't noticed and the bot kept serving instead of dying. The thing I built actually worked. It had been working for three days. I just couldn't see it because I was watching the old version perform.

I'm still learning what half of these commands actually do under the hood, and senior devs reading this are going to wince. That's fine. The lesson I want to write down before I forget it:

The file on disk is not the program. The program is what's in memory. When you refactor a long-running service, the refactor isn't done until the process is restarted. Until then you are reading one book and grading a different one.

The second lesson, smaller but I like it more: fail-open accessors save your weekend. I had one config key go missing during the restart because I'd renamed it and forgot to update the env file. Old me would have woken up to a dead bot and angry logs. New me woke up to a bot that returned a sensible default and kept earning. Another Safety Pack went out the door while I was asleep being wrong about how Linux works.

How long did it take you to internalize the difference between the code on disk and the code in memory? And did anyone tell you, or did you have to lose three days to a ghost first?
