Feature Flags Are Not a Free Lunch

#featureflags #devops #softwareengineering #technicaldebt

Every article about trunk-based development eventually arrives at the same recommendation: use feature flags. Gate incomplete work behind a toggle. Merge to main freely. Turn features on when they are ready.

I have implemented this. And over the past few years, I have watched it create a category of technical debt that nobody warned me about.

This is not an argument against feature flags. They are a useful tool. But the advice to "just use feature flags" leaves out what happens six months later when nobody cleans them up.

The plan is always clean

You are building a new feature. It is not ready for all users yet. So you wrap it behind a feature flag. During development, the flag is off in production. Your code lives in main, safely dormant.
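As a rough illustration of that gate, here is a minimal sketch. The flag name, the in-memory `FLAGS` dictionary, and the report functions are all hypothetical; a real project would read the flag state from whatever flag platform or config service it uses.

```python
# Hypothetical in-memory flag store, standing in for a real flag platform.
FLAGS = {"new-report-engine": False}  # off in production during development


def is_enabled(flag_name: str) -> bool:
    """Return the flag's state, treating unknown flags as off."""
    return FLAGS.get(flag_name, False)


def legacy_report(rows: list[int]) -> int:
    # The existing behavior every user currently gets.
    return sum(rows)


def new_report(rows: list[int]) -> int:
    # Work in progress: merged to main but unreachable while the flag is off.
    return sum(r for r in rows if r >= 0)


def build_report(rows: list[int]) -> int:
    # The gate itself: one branch per code path.
    if is_enabled("new-report-engine"):
        return new_report(rows)
    return legacy_report(rows)
```

Note that both `legacy_report` and the `if` branch have to stay in the codebase for as long as the flag exists, which is where the cleanup cost discussed next comes from.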

When the feature is ready, you turn the flag on. Once all users have adopted it and the old code path is no longer needed, you remove the flag.

Development, release, adoption, cleanup. Four steps.

In practice, step four almost never happens.

Why flags become permanent

The first time a feature flag sticks around longer than planned, it feels harmless.

Client A adopted the new feature. Client B did not want to move yet. The flag stays on for A and off for B. No big deal. We will clean it up next quarter.

But next quarter, a new feature gets built on top of the flag-on code path. Now removing the flag means untangling the new feature from the old toggle. The effort is not justified, so it gets pushed again.

Meanwhile, Client C onboards and gets a third configuration. The flag is no longer a temporary gate. It is load-bearing infrastructure that was never designed to be permanent.

I have worked on codebases where feature flags that were supposed to last a sprint were still active two years later. Not because anyone decided to keep them. Just because removing them required more effort than anyone could justify in a sprint cycle.

One flag doing too much

Feature flags start simple: on or off.

Over time, they absorb responsibilities they were never meant to carry.

In one project, a single feature flag controlled two completely unrelated things: the data source for a report and the visibility of certain columns in the UI. The flag was originally created for the data source migration. The column visibility logic was added later because it was "related enough" and creating a separate flag felt like overkill.

When the team needed to decouple those behaviors for different clients, it turned into a multi-sprint effort. The simple toggle had become a load-bearing wall in the application logic.

This pattern repeats. Flags accumulate side responsibilities because it is faster to add a condition to an existing flag than to create and manage a new one. Each addition makes the eventual cleanup harder.

The client matrix problem

Feature flags become genuinely dangerous when combined with client-specific targeting.

In a multi-tenant application, it is common to enable features per client. Client A gets the new engine. Client B stays on the old one. Client C gets the new engine but not the new reporting module. Each combination is a distinct application state.

Now multiply that across several flags. Flag A controls data source. Flag B controls UI behavior. Flag C controls a calculation engine. Different clients have different combinations. With n independent on/off flags there are up to 2^n possible states: three flags give 8, ten give 1,024.
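To make the growth concrete, the sketch below enumerates every on/off combination for three flags. The flag names are illustrative stand-ins for Flag A, B, and C, and the code assumes each flag is a plain boolean with no percentage rollouts or multi-value variants.

```python
from itertools import product

# Illustrative names for the three flags described above.
FLAGS = ["data_source", "ui_behavior", "calc_engine"]


def all_states(flags: list[str]) -> list[dict[str, bool]]:
    """Enumerate every on/off combination of the given boolean flags."""
    return [
        dict(zip(flags, values))
        for values in product([False, True], repeat=len(flags))
    ]


states = all_states(FLAGS)
print(len(states))  # 8 combinations for three flags
print(2 ** 10)      # 1024 combinations for ten flags
```

Each entry in `states` is a distinct application state that some client could, in principle, be running.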

Nobody tests all of those combinations. The QA team tests the most common configurations. But an untested combination is a production incident waiting to happen.

I have spent hours debugging issues that turned out to be caused by an unexpected interaction between two flags that were never designed to work together. The code was correct for each flag individually. It just had not been tested in that particular combination.

Migrating flag platforms

Here is something that should set off alarm bells: if your team is migrating from one feature flag platform to another, the flag sprawl has already gotten out of hand.

I have been through one of these migrations. The stated goal was modernization and cost reduction. The unstated reason was that the old platform had accumulated so many stale, undocumented flags that nobody trusted it anymore.

What actually happened: every active flag had to be recreated in the new platform. Every client-specific targeting rule had to be rebuilt. Every code reference had to be updated. Tickets were created just to promote individual flags from dev to staging to production. The migration took multiple sprints and added zero user-facing value.

And the new platform? Within months, it started accumulating the same kind of sprawl. Because the problem was never the platform. It was the lack of lifecycle governance.

The naming problem

When multiple teams create feature flags independently, naming conventions diverge fast.

In one codebase, I saw flags named like team-feature-enable alongside flags named like project-component-updates. Different teams, different conventions. Some flags used feature names. Others used team abbreviations. There was no central registry and no way to look at a flag name and understand what it controlled without reading the code.

A new developer picking up a ticket that involved feature flags had to trace each flag through the codebase to understand what it did, which clients it applied to, and whether it was still active. That detective work added hours to what should have been straightforward tasks.

What I think about now

After living through flag chaos more than once, I have a short list of things I would set up before adopting feature flags at scale.

Every flag gets a creation date, an owner, and an expected removal date. If the flag is still active past its expected date, it shows up somewhere visible, such as a team dashboard or a failing build, and someone is accountable for it.
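One way this could look is a small registry checked on a schedule. This is a sketch under assumptions: the `FlagRecord` fields, the flag names, the team names, and the dates are all made up for illustration, and where the overdue list gets surfaced (CI job, dashboard, chat message) is left open.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FlagRecord:
    name: str
    owner: str
    created: date
    remove_by: date  # expected removal date agreed when the flag is created


def overdue_flags(registry: list[FlagRecord], today: date) -> list[FlagRecord]:
    """Return flags whose expected removal date has passed, oldest first."""
    late = [flag for flag in registry if flag.remove_by < today]
    return sorted(late, key=lambda flag: flag.remove_by)


# Hypothetical registry entries.
REGISTRY = [
    FlagRecord("reports-new-data-source", "team-reporting", date(2024, 1, 8), date(2024, 2, 5)),
    FlagRecord("billing-new-calc-engine", "team-billing", date(2024, 3, 1), date(2024, 6, 1)),
]

for flag in overdue_flags(REGISTRY, today=date(2024, 4, 1)):
    print(f"{flag.name} is overdue, owner: {flag.owner}")
```

Passing `today` in as a parameter, rather than calling `date.today()` inside the function, keeps the check easy to test.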

One flag, one responsibility. If you are tempted to add a second behavior to an existing flag, create a new flag instead. The cost of managing an extra flag is far lower than untangling an overloaded one later.
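Applied to the earlier report example, splitting the overloaded flag might look roughly like this. The flag names, column names, and `render_report` function are hypothetical; the point is only that the data source and the column visibility are toggled independently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportFlags:
    use_new_data_source: bool  # controls where the report data comes from
    show_extra_columns: bool   # controls UI column visibility only


def render_report(flags: ReportFlags) -> dict[str, object]:
    """Build a report description where each flag drives exactly one behavior."""
    source = "new_source" if flags.use_new_data_source else "legacy_source"
    columns = ["id", "total"]
    if flags.show_extra_columns:
        columns += ["region", "margin"]
    return {"source": source, "columns": columns}


# A client on the new data source that keeps the old column layout.
print(render_report(ReportFlags(use_new_data_source=True, show_extra_columns=False)))
```

With a single shared flag, that last configuration could not be expressed without changing code.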

A naming convention before the first flag exists. The format matters less than the consistency.
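A convention is easier to keep when something checks it. The sketch below assumes one possible format, `<team>-<feature>-<kind>` in lowercase kebab-case with `kind` limited to `release`, `experiment`, or `ops`. That format is an invented example, not a standard; the same idea works for whatever format a team agrees on.

```python
import re

# Assumed convention: <team>-<feature>-<kind>, lowercase kebab-case.
# The feature part may itself contain hyphens; kind comes from a fixed set.
FLAG_NAME = re.compile(
    r"[a-z][a-z0-9]*"          # team
    r"-[a-z][a-z0-9]*"         # first word of the feature
    r"(?:-[a-z0-9]+)*"         # optional further feature words
    r"-(release|experiment|ops)"  # kind
)


def is_valid_flag_name(name: str) -> bool:
    """Check a flag name against the assumed team-feature-kind convention."""
    return FLAG_NAME.fullmatch(name) is not None


for candidate in ["reporting-new-source-release", "team-feature-enable"]:
    print(candidate, is_valid_flag_name(candidate))
```

Run as a pre-merge check, a validator like this rejects a name before the flag is ever created, which is cheaper than renaming it afterwards.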

Dedicated time every quarter to audit active flags, remove stale ones, and consolidate duplicates. This never feels urgent, which is exactly why it needs to be scheduled.

And if you have client-specific targeting, maintain a test matrix of the most common flag combinations. At minimum, document which combinations exist in production and make sure QA covers them.
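Documenting the combinations that exist in production can be partly automated. The sketch below groups clients by their flag combination and reports any combination QA has not covered. The client names, flag names, and the `CLIENT_FLAGS` and `QA_TESTED` structures are assumptions for illustration; in practice the targeting data would come from an export of the flag platform.

```python
from collections import defaultdict

# Hypothetical per-client targeting; every client is assumed to list the same flags.
CLIENT_FLAGS = {
    "client_a": {"new_engine": True, "new_reporting": True},
    "client_b": {"new_engine": False, "new_reporting": False},
    "client_c": {"new_engine": True, "new_reporting": False},
    "client_d": {"new_engine": True, "new_reporting": True},
}

# Combinations QA is assumed to cover, each written as the set of flags that are on.
QA_TESTED = {frozenset(), frozenset({"new_engine", "new_reporting"})}


def combinations_in_production(client_flags):
    """Group clients by the set of flags they have switched on."""
    groups = defaultdict(list)
    for client, flags in client_flags.items():
        enabled = frozenset(name for name, on in flags.items() if on)
        groups[enabled].append(client)
    return dict(groups)


def untested_combinations(client_flags, tested):
    """Return production combinations that are missing from the tested set."""
    in_production = combinations_in_production(client_flags)
    return {combo: clients for combo, clients in in_production.items() if combo not in tested}


for combo, clients in untested_combinations(CLIENT_FLAGS, QA_TESTED).items():
    print(sorted(combo), "is live for", clients, "but not in the test matrix")
```

Here the only gap is the client running the new engine without the new reporting module, which is exactly the kind of combination that tends to go untested.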

Feature flags are a useful tool. But without lifecycle discipline, they create the same kind of mess they were supposed to prevent. The chaos just moves to a different layer.