DEV Community

Cover image for The Stale Feature Flag We Deleted That Turned a Feature Back On
Jake Lundberg
Jake Lundberg Subscriber

Posted on • Originally published at jakelundberg.dev

The Stale Feature Flag We Deleted That Turned a Feature Back On

Someone on our team was cleaning up feature flags. It was good instinct...we had a pile of them, and plenty were clearly dead. One in particular had been turned off for over a year. It looked about as safe to delete as anything could look. So they deleted it.

And it turned a feature back on.

Here's the part that still makes me put palm to forehead. Deleting the flag didn't delete the feature that depended on it. There was still code out there, half-forgotten, that checked that flag before doing its thing. While the flag existed and was off, that code stayed quiet. When the flag got deleted, the check didn't blow up...it fell back to the code's default value. And the default was true.

So a feature nobody had thought about in a year quietly switched itself on. "Cleanup" turned out to be a deploy.

On its own, that feature wasn't a big deal. The problem was it collided with another, more important feature. The two were never meant to be running at the same time, and when they were, the second one broke too. Now we had a real bug in production. And here's where I'll own a mistake...our observability wasn't as good as it should have been. We didn't catch it. We found out when a user told us.

Then it took us a long time to track down, for two reasons that both come straight back to the flag.

First, the flag didn't exist anymore. So when we went looking for what changed, there was nothing to find. No flag, no recent deploy of that feature, no changes to the code, no obvious culprit. The thing that caused the bug had been deleted, which is a special kind of frustrating...you're debugging the absence of something.

Second, the old configuration had targeted a subset of users. So the bug only showed up for that specific group. We couldn't reproduce it until we worked out which users were affected and what they had in common. A bug you can't reproduce is a bug you can't fix, and we burned real time just getting to where we could see it ourselves.

The lesson I took from it? Deleting a feature flag doesn't remove a decision. It hands the decision to whatever your code falls back to when the flag is gone. If you don't know what that fallback is, for every place the flag is read, you're not cleaning up...you're shipping a change and hoping.

The other way flags bite you is that they get tangled up with each other, and the tangle lives in nobody's head.

Years ago we had the same kind of feature split across two codebases...our API and our frontend. Same feature, basically, but each side had its own flag. They weren't nested. Neither one knew the other existed. But they were completely dependent on each other...the frontend side only worked if the API side was also on.

You can guess what happened. Someone turned on the frontend flag. The feature broke, because the API flag was still off. Two flags, two repos, one invisible dependency, and nothing anywhere told you they had to move together. The coupling was real...it just wasn't written down, because at the time, the feature flags didn't come with a place to write that down.

Put those two stories next to each other and it's the same pattern. The danger of a feature flag isn't the flag. It's that everything you need to know about it...what reads it, what it falls back to, what it's secretly coupled to...fades out of memory over time, while the flag itself sits there looking harmless.

A flag that's been off for a year isn't dead. It's dormant. And dormant is worse than dead, because dead implies someone confirmed it's safe to remove. Dormant just means everyone stopped looking. Unused is not the same as unimportant.

I don't have this fully solved, but a few things I'm a lot more careful about now.

Before you remove a flag, find every place it's read, AND know what the code does without it. Removing the flag is the easy part. Knowing your fallback is the whole job. If the answer is "it defaults to true," well...deleting the flag is how you turn the feature on.

Treat flag removal like a real change, not like tidying. It changes runtime behavior, so it deserves the same scrutiny as the PR that added the feature. Arguably more, because the people who understood it have most certainly moved on.

Make cross-system dependencies explicit somewhere a human will actually see them. If a frontend flag needs an API flag, that relationship has to live somewhere other than the memory of whoever built it. One idea I like...create the flag's removal task at the same time you create the flag, so cleanup is queued instead of remembered. It isn't free...a pile of "remove me later" tasks is its own kind of debt...but it beats discovering the dependency in production.

And have some way to notice when a dead feature comes back to life. We didn't. That's the gap that turned a small mistake into a user-reported incident. It doesn't take much...just something that would have raised a hand and said "this code path hasn't run in a year, and now it's running."

The reason this stuff is so easy to get wrong is that none of it looks risky. A flag that's been off for a year is the least threatening line in your config. No alarm on it, no recent change, no owner paying attention. It looks like nothing.

That's exactly what makes it dangerous. The risks that get you are rarely the ones that look risky. They're the ones that have sat quietly long enough that everyone forgot they were risks at all.

Top comments (0)