
Claude Code for Feature Flags: How I Ship Risky Changes Without Losing Sleep

The riskiest deployment I have ever done was a payment processor migration. The old processor was being deprecated. The new one had better rates but a completely different API. The migration touched the most sensitive code path in the business. A bug in the new path would silently lose revenue or charge customers incorrectly. There was no acceptable amount of downtime.

I shipped that migration at three in the afternoon on a normal Tuesday. No, that is not a typo: three in the afternoon, not three in the morning, because feature flags let me. The new path was already in production behind a flag set to zero percent of traffic. I ran a script that increased the percentage gradually: one percent, then ten, then fifty, then one hundred. At each step I watched the metrics. If anything had looked wrong, I could have rolled back to zero with a single command. Nothing looked wrong. The migration completed in about ninety minutes and nobody on the team even knew it was happening except me.

That is what feature flags do when they work right. They turn a scary deployment into a routine one. They let you separate the act of shipping code from the act of activating it. They give you the ability to react to problems in seconds instead of minutes or hours. But they only work right when the infrastructure around them is solid. Building that infrastructure is what Claude Code helps with.

Here is the workflow I use to run feature flags across multiple products.


What Feature Flags Are Actually For

Feature flags get pitched as a way to do A/B testing. That is the marketing pitch and it sells products. The actual reason to have feature flags in your codebase is more boring and more important. They let you decouple deployment from release.

When the deployment ships, the new code is in production but the new behavior is not active. When you decide the behavior should be active, you flip the flag and the behavior turns on without redeploying. When you discover a problem, you flip the flag back and the behavior turns off without redeploying.

The value of feature flags is not in the experimentation they enable. The value is in the asymmetry they create between deployment risk and release risk. A bad deployment forces a rollback. A bad release flips a flag. The difference is the difference between a Sunday night incident and a Tuesday afternoon decision.

The challenge is that feature flags themselves become a source of complexity. Code that is gated behind flags is harder to reason about. Tests have to cover both branches. The flag state has to be consistent across requests. The flag has to be cheap to evaluate. The flag has to be possible to clean up when the experiment is done.

Most teams I have worked with started with a simple flag system and then watched it grow into something unmanageable. Hundreds of flags. Stale flags from features shipped years ago. Flags whose purpose nobody remembers. Flag configurations that disagree across environments. The complexity of the flag system eventually exceeds the complexity it was meant to manage.

The Claude Code workflow tackles this by automating the lifecycle of flags from creation through cleanup, and by enforcing the patterns that keep the system manageable.


The Flag Creation Skill

The flag creation skill takes a feature request and produces the flag-gated implementation. The skill handles the boilerplate that makes flags consistent and the discipline that prevents bad patterns.

The skill creates the flag definition in the central flag registry. The registry is the single source of truth for every flag, including its name, its description, its expected lifetime, its owner, and its allowed values. If a flag is not in the registry, it does not exist. New flags require an explicit registration step that captures the metadata.
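
Here is a minimal sketch of what that registration step might look like in a homegrown Python setup. The names (FlagDefinition, FLAGS, register) and field values are illustrative, not any particular platform's API:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class FlagDefinition:
    name: str             # unique key, e.g. "checkout.new_payment_processor"
    description: str      # what the flag gates, in plain language
    owner: str            # team or person accountable for cleanup
    expires: date         # expected end of life; stale past this date
    allowed_values: tuple = (False, True)

FLAGS: dict[str, FlagDefinition] = {}

def register(flag: FlagDefinition) -> FlagDefinition:
    """The explicit registration step: evaluation refuses unregistered names."""
    if flag.name in FLAGS:
        raise ValueError(f"flag {flag.name!r} is already registered")
    FLAGS[flag.name] = flag
    return flag

register(FlagDefinition(
    name="checkout.new_payment_processor",
    description="Route charges through the new payment processor",
    owner="payments-team",
    expires=date(2026, 1, 1),
))
```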

The skill instruments the new code with the flag check at the right level. The check is at the boundary where the new behavior diverges from the old. Putting the check too deep means you have to thread the flag value through many layers. Putting the check too shallow means the entire request path has to be duplicated. The right place is the smallest scope that contains all the diverging logic.
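
To make the placement concrete, here is a sketch using a hypothetical charge() boundary. The helper names are invented for illustration; the point is that the check sits exactly where the two behaviors diverge:

```python
def charge_via_old_processor(order: dict) -> str:
    return f"old:{order['id']}"   # existing behavior, preserved unchanged

def charge_via_new_processor(order: dict) -> str:
    return f"new:{order['id']}"   # new behavior under test

def charge(order: dict, user_id: str, is_enabled) -> str:
    # The flag check lives at the smallest scope that contains the divergence,
    # so callers above this function never need to know the flag exists.
    if is_enabled("checkout.new_payment_processor", user_id=user_id):
        return charge_via_new_processor(order)
    return charge_via_old_processor(order)
```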

The skill writes both branches of the code. The new branch is the new behavior. The old branch is the existing behavior, preserved unchanged. Both branches have tests. The tests run for both branches in CI, ensuring that turning the flag on or off does not break the build.

The skill also writes the migration plan. The plan documents how the flag will be rolled out: starting percentage, ramp schedule, success criteria, and rollback criteria. The plan goes into the PR description and gets reviewed alongside the code. Without a plan, the flag has no path to being fully on or fully off.


The Rollout Skill

Once a flag exists, it has to be rolled out. The rollout skill handles the progressive activation that turns a flag from zero percent to one hundred percent.

The skill operates on a rollout schedule. The schedule has stages, each with a target percentage and a duration. Stage one might be one percent for one hour. Stage two might be ten percent for one day. Stage three might be fifty percent for one day. Stage four is one hundred percent.
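
A schedule like that can be expressed as plain data. This is a sketch with a hypothetical Stage type, not the skill's actual output format:

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class Stage:
    percentage: int           # target share of traffic on the new branch
    min_duration: timedelta   # minimum dwell time before advancing

SCHEDULE = [
    Stage(percentage=1,   min_duration=timedelta(hours=1)),
    Stage(percentage=10,  min_duration=timedelta(days=1)),
    Stage(percentage=50,  min_duration=timedelta(days=1)),
    Stage(percentage=100, min_duration=timedelta(0)),   # terminal stage
]
```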

Between stages, the skill checks the health metrics for the rollout. The metrics include the error rate for the new behavior, the latency, the conversion rate if it is a user-facing change, and any custom metrics specified in the rollout plan. If the metrics are within the acceptable range, the rollout proceeds to the next stage. If they are outside the range, the rollout halts and a human is notified.

The skill never advances past a stage on a schedule alone. The schedule sets the maximum pace, but the metrics set the actual pace. A rollout that looks fine on the schedule but has degrading metrics will stop. A rollout that has clean metrics but is ahead of schedule will not skip the wait, because the wait is what lets long-tail issues surface.
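
Continuing the schedule sketch above, the advance decision might look like this. Upper-bound thresholds (error rate, latency) are shown for brevity; a rate-style metric like conversion would invert the comparison:

```python
def may_advance(stage, elapsed, metrics: dict, thresholds: dict) -> bool:
    """Advance only when BOTH the dwell time and the health checks pass."""
    if elapsed < stage.min_duration:
        return False   # never skip the wait; long-tail issues need time
    for name, limit in thresholds.items():
        if metrics.get(name, float("inf")) > limit:
            return False   # degraded metric halts the rollout for a human
    return True
```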

The skill also handles segment-based rollouts. Sometimes you want to roll out to specific users first: internal staff, then beta testers, then a geographic region, then everyone. The skill expresses these segments in the rollout plan and applies them in sequence. The segment-based rollout catches issues that percentage-based rollouts miss, because a one percent rollout might still miss specific cohorts where the bug manifests.


The Evaluation Skill

The flag has to be evaluated at runtime, and the evaluation has to be fast and consistent. The evaluation skill produces the runtime code that checks whether a flag is on for a given context.

The skill produces an evaluation function that takes a context (user ID, request attributes, environment) and returns the flag value. The function is deterministic. Given the same flag state and the same context, it always returns the same value. This consistency is important for testing and debugging. If you cannot reproduce a flag evaluation, you cannot debug the resulting behavior.
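
One common way to get that determinism is stable hashing: the user's bucket depends only on the flag name and the user ID. A sketch of the idea, not the skill's actual output:

```python
import hashlib

def bucket(flag_name: str, user_id: str) -> int:
    """Deterministic bucket in [0, 10000): same inputs, same bucket, always."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 10_000

def is_enabled(flag_name: str, user_id: str, rollout_percent: float) -> bool:
    # Buckets are stable across requests, so ramping 1% -> 10% only ever
    # adds users to the cohort; nobody flaps between branches mid-rollout.
    return bucket(flag_name, user_id) < rollout_percent * 100
```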

The skill caches flag values aggressively. Flag definitions change rarely. The flag values for a given context can be computed once per request and reused everywhere. The skill produces an in-request cache that avoids redundant evaluations. For longer-lived contexts like user sessions, there is also a session-level cache.
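
The in-request cache can be as simple as memoizing by flag name, assuming the evaluation context is fixed for the lifetime of the request. A minimal sketch:

```python
class RequestFlags:
    """Request-scoped cache: each flag is evaluated at most once per request."""

    def __init__(self, evaluate):
        self._evaluate = evaluate          # the real, possibly costly, check
        self._cache: dict[str, bool] = {}

    def is_enabled(self, flag_name: str, **context) -> bool:
        # Keyed by flag name only, on the assumption that the context
        # (user, request attributes) does not change within one request.
        if flag_name not in self._cache:
            self._cache[flag_name] = self._evaluate(flag_name, **context)
        return self._cache[flag_name]
```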

The skill also handles flag dependencies. Sometimes one flag depends on another. The dependent flag should not be evaluated if the parent flag is off. The skill produces evaluation code that respects the dependency graph and avoids spurious evaluations.

The evaluation has to be cheap. Every request might evaluate dozens of flags. If each evaluation takes a millisecond, the request latency adds up quickly. The skill produces evaluation code that completes in microseconds for the cached case and in milliseconds even for the cold case. The performance budget for flag evaluation is much tighter than most teams realize.


The Observability Skill

Flags need observability for the same reason any production code needs observability. You have to know what the flag is doing. The observability skill adds the instrumentation that makes flags debuggable.

Every flag evaluation produces a log line. The line includes the flag name, the context attributes that mattered, the value returned, and the reason the value was chosen. The reason is important. A flag returning true might be returning true because the user is in the rollout cohort, or because the user is in a force-on list, or because the flag is fully on. The reason tells you which path was taken.
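
A structured log line along these lines captures all four pieces. The field names are illustrative:

```python
import json
import logging

logger = logging.getLogger("flags")

def log_evaluation(flag_name: str, context: dict, value: bool, reason: str):
    # "reason" disambiguates WHY the value was chosen: rollout cohort,
    # force-on list, and fully-on all log value=True but different reasons.
    logger.info(json.dumps({
        "event": "flag_evaluated",
        "flag": flag_name,
        "context": context,   # only the attributes that mattered
        "value": value,
        "reason": reason,     # e.g. "rollout_cohort" | "force_on" | "fully_on"
    }))
```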

Every flag value gets a metric. The metric tracks the distribution of values for the flag over time. You can see at a glance whether a flag is at one percent, ten percent, or one hundred percent. You can see when a flag was changed by looking for the inflection point in the metric. You can see whether the rollout is consistent with the configuration by comparing the metric to the configured percentage.

The skill also produces trace spans that show which flags were evaluated during a request and what values they returned. The trace span is what lets you debug behavior that depends on flag values. When a user reports an issue, you can pull up their request trace and see exactly which flag values applied to their session.

The observability is what makes the flag system trustworthy. Without it, flag-related bugs are hard to diagnose because you cannot tell what the flag system was doing. With it, the flag system is transparent and bugs are quick to find.


The Cleanup Skill

The biggest source of flag debt is flags that should have been removed but were not. The cleanup skill is what keeps the flag system from growing unbounded.

The skill watches the flag registry for flags that are eligible for cleanup. A flag is eligible if it has been at one hundred percent or zero percent for a sufficient period, with no recent changes. The threshold is configurable, typically a few weeks for fully rolled out flags and a few days for fully rolled back ones.
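
The eligibility check reduces to a settled-state test. A sketch with made-up threshold values and a plain dict standing in for flag state:

```python
from datetime import datetime, timedelta

FULLY_ON_GRACE = timedelta(weeks=3)    # configurable; a few weeks at 100%
FULLY_OFF_GRACE = timedelta(days=3)    # configurable; a few days at 0%

def eligible_for_cleanup(flag_state: dict) -> bool:
    """Eligible once the flag has sat at 0% or 100% with no recent changes."""
    settled = datetime.now() - flag_state["last_changed"]
    if flag_state["percentage"] == 100:
        return settled > FULLY_ON_GRACE
    if flag_state["percentage"] == 0:
        return settled > FULLY_OFF_GRACE
    return False   # mid-rollout flags are never cleanup candidates
```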

When a flag is eligible, the skill produces a cleanup PR. The PR removes the flag check from the code, keeping only the active branch. The removed branch is the dead branch, the one not in use. The flag definition is also removed from the registry. The tests are updated to remove the branch coverage that no longer applies.

The cleanup PR goes through normal review. A human looks at it, confirms the removed branch is actually dead, and merges. The merge is what completes the flag lifecycle. The flag existed for as long as it needed to. Now it does not exist and the codebase is simpler.

The skill also surfaces flags that have not had any activity for a long time. These are zombie flags, where the rollout stalled and was never completed. The zombie flag should either be rolled out the rest of the way or rolled back. The skill produces a report that surfaces the zombies and asks for a decision.


The Coordination Skill

Multiple flags interact. A flag for a new payment processor might depend on a flag for the new checkout UI, which might depend on a flag for the new auth flow. The coordination skill manages these interactions.

The skill maintains a dependency graph of flags. The graph encodes which flags depend on which. When you change one flag, the skill checks whether the change is consistent with the dependent flags. Turning on a flag that depends on another flag that is off produces an error.
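
A sketch of that consistency check, assuming the graph is a simple child-to-parent mapping (a real implementation would also guard against cycles):

```python
# Hypothetical dependency graph: each flag maps to the parent it requires.
DEPENDS_ON = {
    "checkout.new_payment_processor": "checkout.new_ui",
    "checkout.new_ui": "auth.new_flow",
}

def check_consistency(flag_name: str, flag_states: dict[str, bool]) -> None:
    """Refuse to turn on a flag whose parent, transitively, is off."""
    parent = DEPENDS_ON.get(flag_name)
    while parent is not None:
        if not flag_states.get(parent, False):
            raise ValueError(
                f"cannot enable {flag_name!r}: dependency {parent!r} is off")
        parent = DEPENDS_ON.get(parent)
```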

The skill also handles experiment conflicts. Two experiments running at the same time might target overlapping segments. The skill detects the overlap and warns. Sometimes the overlap is fine because the experiments are independent. Sometimes the overlap is a problem because the experiments interfere with each other.

The skill produces a calendar of flag activity. Each flag rollout shows up on the calendar with its start date and end date. The calendar helps the team see when the system has many things in flight versus when it is quiet. Scheduling a risky rollout for a quiet period reduces the chance of conflicting changes.


The Testing Skill

Flag-gated code has to be tested for both branches. The testing skill ensures this happens automatically.

The skill produces a test matrix for each flag. The matrix enumerates the relevant flag values: off, on, partial. For each value, the existing test suite runs against that flag state. If any test fails for any flag value, the build fails. The matrix ensures that turning the flag on or off does not break the system.
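
With pytest, the matrix can be a parametrized fixture that pins the flag state for each run of the suite. The environment variable is hypothetical, standing in for however your platform pins flags under test:

```python
import pytest

FLAG_STATES = [0, 50, 100]   # off, partial, fully on

@pytest.fixture(params=FLAG_STATES, ids=lambda p: f"rollout={p}%")
def flag_percentage(request, monkeypatch):
    # Pin the flag to a fixed state for the duration of each test.
    monkeypatch.setenv("FLAG_NEW_PROCESSOR_PERCENT", str(request.param))
    return request.param

def test_charge_succeeds_under_any_flag_state(flag_percentage):
    # The same assertion must hold whether the flag is off, partial, or on;
    # a real test would exercise charge() here instead of this placeholder.
    assert flag_percentage in FLAG_STATES
```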

The skill also produces flag-specific tests. These tests target the boundary between the two branches and verify that the boundary works correctly. The transition tests are what catch bugs where the flag check is in the wrong place or where the two branches diverge in unexpected ways.

For more complex flags, the skill produces integration tests that exercise the full flow through both branches. The integration tests are slow but catch the kinds of bugs that unit tests miss. Integration tests run on a schedule rather than on every commit, so the slowness does not block the team.

The skill also handles snapshot testing for user-visible changes. When a flag is going to change a user interface, the skill produces snapshots for both states. The snapshots get reviewed before the flag rolls out, ensuring that the visual change is intended.


How the Skills Compose

The skills compose into a flag lifecycle. A new feature comes in. The creation skill produces the flag-gated implementation. The rollout skill activates the flag gradually. The evaluation skill makes the flag fast at runtime. The observability skill makes the flag visible. The coordination skill keeps the interactions clean. The testing skill catches the branch bugs. The cleanup skill removes the flag when it is done.

The team interacts with the skills through normal PR flow. The creation PR adds the flag and the gated code. The rollout PR sets the schedule. The cleanup PR removes the completed flag. Each PR is small and reviewable. The skills handle the boilerplate but the humans make the decisions.

The result is a flag system that scales to many flags without becoming a source of pain. Flags get created, rolled out, observed, and cleaned up on a regular cadence. The codebase stays manageable. The risk of changes goes down.


What This Costs

The skills took about a month to build. The evaluation skill was the most complex piece because the performance requirements are tight and the consistency requirements are strict. The cleanup skill required the most ongoing tuning to avoid false positives that would propose removing flags that were still in use.

The benefit is in the rate at which the team can ship. Before the flag system, every risky change required a deployment that activated the change immediately. The deployment had to be timed carefully and watched closely. With the flag system, the deployment is decoupled from the activation. Risky changes ship at any time and activate when the conditions are right.

The benefit also shows up in the rate of recovery from problems. A bug that gets discovered after a flag-gated rollout takes seconds to mitigate. The flag flips off and the bug stops happening. The fix can be deployed at a normal pace because the production impact is already neutralized.


What the Skills Do Not Do

The skills do not pick your flag platform. Whether you use a SaaS flag service, an open source flag system, or a homegrown one, the skills produce code that fits the platform's evaluation interface. The platform itself is your choice.

The skills also do not write your feature code. They handle the flag gating and the lifecycle, but the feature behavior is yours to design and implement. The flag is the wrapping around the feature, not the feature itself.

The skills also do not decide which features should be flag-gated. Not every change needs a flag. Small changes, low-risk changes, and changes that have no rollback path do not benefit from flagging. The decision of when to use flags is a judgment call that the skills support but do not make.


Setting Up Your Own Flag System

Start with the registry. The registry is the foundation. Every flag exists in the registry, with its metadata. Without the registry, the flag system is not manageable at scale.

Add the evaluation library next. The library is what application code uses to check flags. The library should be small, fast, and well-tested. It is the most performance-sensitive part of the system.

Add observability third. Get flag evaluations into your logs and metrics. This is what makes the flag system debuggable when it does not behave as expected.

Add the rollout tooling after that. The progressive rollout is what turns flags from a binary on-off switch into a graduated mechanism for managing risk.

Add cleanup last. The cleanup matters but it can wait until you have a few flags in flight and need to start removing them.


The Bigger Picture

Feature flags are one of those tools that pay off enormously when they work and create chaos when they do not. The difference is the infrastructure around them. A flag system without a registry is a mess of strings scattered through the codebase. A flag system without observability is a black box that nobody trusts. A flag system without cleanup is a graveyard of dead branches that accumulate over years.

The pattern in this workflow is the pattern I keep applying. The repetitive parts of flag management get automated. The judgment parts stay human. The infrastructure is what lets the team use flags confidently. Without the infrastructure, flags become risky. With it, flags become routine.

If you have a codebase that needs more sophisticated release management than your current deployment pipeline supports, the answer is probably not to invest in faster rollbacks. The answer is to invest in flags. The flags are what let you take the deployment pressure off, ship more often, and recover faster when things go wrong.


The first concrete step is the registry. Create a single source of truth for flags. Make it impossible to use a flag that is not registered. Once the registry exists, every other piece of the system has a foundation to build on. Without it, the flags drift and the system becomes ungovernable.

Build the registry. Add one flag. Watch it through its full lifecycle. Then add the next one. The discipline compounds quickly.


FAQ

Should every change go behind a flag? No. Small changes and changes that you can revert with a quick deploy do not benefit from the overhead. Use flags for changes where the risk of a bad release is high.

How long should a flag live? As short as possible. A flag that is fully rolled out should be cleaned up within weeks. A flag that is fully rolled back should be cleaned up within days. Flags that live for years are usually a sign of a stalled rollout.

What about flags for permissions? Permission flags are different from rollout flags. Permission flags live forever and are part of the application's permanent logic. The same registry can hold both, but they should be tagged differently and treated differently by the cleanup skill.

What about server-side versus client-side flags? The same patterns apply but the implementation differs. Server-side flags evaluate on every request. Client-side flags evaluate once per session. The evaluation skill handles both modes with the same interface.

What is the biggest mistake to avoid? Adding flags without a plan to remove them. Every flag should have an expected end state. Without a plan, the flag accumulates and the system gets messy.


If you found this useful, follow for more posts about practical Claude Code workflows. I write about how I run a multi-product business with AI agents handling most of the operational work.
