Samson Tanimawo

Posted on Apr 28

Feature Flags as a Reliability Tool, Not Just an A/B Platform

#featureflags #reliability #devops #deployments

Most Teams Use Feature Flags Wrong

They wire up LaunchDarkly or Unleash, use it for two A/B tests, then forget about it.

Meanwhile, their production codebase fills up with if (isNewCheckoutEnabled) blocks that nobody remembers how to toggle.

Feature flags are not primarily an experimentation tool. They're a reliability tool.

The Real Value

Feature flags let you separate deploy from release. You ship code to production dark (deployed but switched off), then turn it on gradually for real users.

When things break, you flip the switch back in 10 seconds. No rollback, no redeploy, no PR reverts.

The Four Reliability Patterns

1. Kill Switches

Every risky new feature ships behind a kill switch:

    if (featureFlags.isEnabled('new_payment_flow', userId)) {
      return newPaymentFlow();
    }
    return legacyPaymentFlow();

When the new flow has a bug, you don't roll back. You flip the flag.
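
A kill switch only helps if the flag lookup itself cannot take the request down. Here is a minimal sketch of that idea, assuming a hypothetical `FlagClient` interface and a hypothetical `isEnabledSafe` helper (neither is the LaunchDarkly or Unleash API): if evaluation throws, the code resolves to the safe default, which here is the legacy path.

```typescript
// Hypothetical flag client interface; real SDKs differ.
interface FlagClient {
  isEnabled(flagKey: string, userId: string): boolean;
}

// Evaluate a kill switch so that any failure inside the flag system
// resolves to the safe default instead of propagating to the caller.
function isEnabledSafe(
  client: FlagClient,
  flagKey: string,
  userId: string,
  safeDefault: boolean = false
): boolean {
  try {
    return client.isEnabled(flagKey, userId);
  } catch {
    return safeDefault;
  }
}

// Strings stand in for the real payment flows in this sketch.
function checkout(client: FlagClient, userId: string): string {
  if (isEnabledSafe(client, "new_payment_flow", userId)) {
    return "new payment flow";
  }
  return "legacy payment flow";
}

// Stand-in clients for illustration.
const flagOn: FlagClient = { isEnabled: () => true };
const flagOff: FlagClient = { isEnabled: () => false };
const flagServiceDown: FlagClient = {
  isEnabled: () => {
    throw new Error("flag service unreachable");
  },
};

console.log(checkout(flagOn, "user-1"));
console.log(checkout(flagOff, "user-1"));
console.log(checkout(flagServiceDown, "user-1"));
```

The design choice worth noting is that "off" and "flag system broken" lead to the same, well-tested legacy path.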

2. Gradual Rollouts

    new_search_algorithm:
      rollout_percentage: 1   # Start at 1% of users
      rules:
        - if: "user.tier == 'internal'"
          enabled: true       # Internal users always see it

Release to 1%, watch metrics, go to 5%, watch again, then 25%, 50%, 100%. A full rollout takes 2-4 hours, and it replaces a single risky all-at-once release.
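
One common way to implement a percentage rollout is to hash each user into a stable bucket from 0 to 99 and compare it with the current percentage. The sketch below assumes that approach; `hashString`, `bucketFor`, and `isInRollout` are hypothetical names, and the hash is deliberately simple. Because the bucket is deterministic, a user who sees the feature at 5% still sees it at 25%.

```typescript
interface User {
  id: string;
  tier: string;
}

// Simple polynomial string hash (illustrative only). It is deterministic,
// which is the property that matters here; a production system would
// probably want a better-distributed hash.
function hashString(input: string): number {
  let hash = 0;
  for (let i = 0; i < input.length; i++) {
    hash = (hash * 31 + input.charCodeAt(i)) >>> 0; // keep an unsigned 32-bit integer
  }
  return hash;
}

// Map (flag, user) to a stable bucket in the range 0..99.
// Including the flag key means different flags bucket users differently.
function bucketFor(flagKey: string, userId: string): number {
  return hashString(flagKey + ":" + userId) % 100;
}

function isInRollout(flagKey: string, user: User, rolloutPercentage: number): boolean {
  if (user.tier === "internal") {
    return true; // rule from the config: internal users always see it
  }
  return bucketFor(flagKey, user.id) < rolloutPercentage;
}

const customer: User = { id: "user-42", tier: "free" };
const employee: User = { id: "user-7", tier: "internal" };

for (const pct of [1, 5, 25, 50, 100]) {
  console.log(
    pct + "%",
    "customer:", isInRollout("new_search_algorithm", customer, pct),
    "employee:", isInRollout("new_search_algorithm", employee, pct)
  );
}
```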

3. Circuit Breakers

    external_recommendations_service:
      enabled: true
      automatic_disable_if:
        error_rate_above: 5%
        for_minutes: 5

If a downstream service starts failing, the flag auto-disables that feature. Your product degrades gracefully instead of crashing.
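
The auto-disable rule above can be sketched as a small state machine. This is an illustration under stated assumptions, not how any particular flag service implements it: the hypothetical `CircuitBreakerFlag` class counts requests in one-minute windows, and trips once the error rate has exceeded the threshold for the configured number of consecutive windows. Something outside the class (a timer, in practice) is assumed to call `closeWindow()` once per minute.

```typescript
// Hypothetical auto-disabling flag. Thresholds mirror the config above:
// error_rate_above: 5% -> 0.05, for_minutes: 5 -> 5 consecutive windows.
class CircuitBreakerFlag {
  private enabled = true;
  private requests = 0;
  private errors = 0;
  private consecutiveBadWindows = 0;
  private readonly errorRateAbove: number;
  private readonly forMinutes: number;

  constructor(errorRateAbove: number, forMinutes: number) {
    this.errorRateAbove = errorRateAbove;
    this.forMinutes = forMinutes;
  }

  isEnabled(): boolean {
    return this.enabled;
  }

  // Record the outcome of one call to the downstream service.
  record(success: boolean): void {
    this.requests += 1;
    if (!success) {
      this.errors += 1;
    }
  }

  // Close the current one-minute window and decide whether to trip.
  closeWindow(): void {
    const errorRate = this.requests === 0 ? 0 : this.errors / this.requests;
    if (errorRate > this.errorRateAbove) {
      this.consecutiveBadWindows += 1;
    } else {
      this.consecutiveBadWindows = 0; // one healthy minute resets the streak
    }
    if (this.consecutiveBadWindows >= this.forMinutes) {
      this.enabled = false;
    }
    this.requests = 0;
    this.errors = 0;
  }
}

// Simulate one minute of traffic with a given number of failed calls.
function simulateMinute(flag: CircuitBreakerFlag, total: number, failed: number): void {
  for (let i = 0; i < total; i++) {
    flag.record(i >= failed);
  }
  flag.closeWindow();
}

const recommendations = new CircuitBreakerFlag(0.05, 5);
for (let minute = 1; minute <= 5; minute++) {
  simulateMinute(recommendations, 100, 10); // 10% errors every minute
  console.log("minute " + minute + ": enabled=" + recommendations.isEnabled());
}
```

In this run the flag stays on for the first four minutes and turns itself off at the end of the fifth. Re-enabling is left as a deliberate human action in this sketch.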

4. Load Shedding

    expensive_realtime_dashboard:
      enabled_when:
        cpu_utilization_below: 70%
        active_users_below: 50000

Under load, disable non-critical features to preserve the critical path.
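
Evaluating a rule like this comes down to comparing a load snapshot against the thresholds. A minimal sketch follows, where `LoadSnapshot`, `LoadShedRule`, and `isFeatureEnabled` are hypothetical names. Treating missing metrics as "shed the feature" is an assumption of this example, chosen because the feature is non-critical.

```typescript
interface LoadSnapshot {
  cpuUtilization: number; // percent, 0-100
  activeUsers: number;
}

interface LoadShedRule {
  cpuUtilizationBelow: number;
  activeUsersBelow: number;
}

// Thresholds taken from the config above.
const dashboardRule: LoadShedRule = {
  cpuUtilizationBelow: 70,
  activeUsersBelow: 50000,
};

// The feature stays on only while every threshold is respected.
function isFeatureEnabled(rule: LoadShedRule, load: LoadSnapshot | null): boolean {
  if (load === null) {
    return false; // assumption: without metrics, shed the non-critical feature
  }
  return (
    load.cpuUtilization < rule.cpuUtilizationBelow &&
    load.activeUsers < rule.activeUsersBelow
  );
}

function renderDashboard(load: LoadSnapshot | null): string {
  return isFeatureEnabled(dashboardRule, load)
    ? "realtime dashboard"
    : "cached summary (realtime view paused under load)";
}

console.log(renderDashboard({ cpuUtilization: 45, activeUsers: 12000 }));
console.log(renderDashboard({ cpuUtilization: 85, activeUsers: 12000 }));
console.log(renderDashboard({ cpuUtilization: 45, activeUsers: 80000 }));
```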

The Anti-Pattern: Permanent Flags

After a feature is 100% rolled out, the flag should be deleted within 2 weeks. Every flag left in the codebase is technical debt.

Flag hygiene rules:

- Every flag has an expiration date (90 days max)
- Every flag has an owner in CODEOWNERS
- CI fails if a flag is older than 180 days
- Monthly flag cleanup is part of standard operations

We track "flag count" as a reliability metric. If it grows unbounded, we're doing it wrong.
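
The hygiene rules above lend themselves to an automated check. The sketch below assumes a hypothetical `FlagRecord` shape for flag metadata and reads "90 days max" as "the expiration date is at most 90 days after creation"; both are assumptions of this example. A CI wrapper would fail the build when `auditFlags` returns a non-empty list.

```typescript
// Hypothetical flag metadata; dates are ISO 8601 strings in UTC.
interface FlagRecord {
  key: string;
  owner: string | null;
  createdAt: string;
  expiresAt: string | null;
}

const DAY_MS = 24 * 60 * 60 * 1000;
const MAX_EXPIRY_DAYS = 90; // "every flag has an expiration date (90 days max)"
const MAX_AGE_DAYS = 180;   // "CI fails if a flag is older than 180 days"

// Return one human-readable message per rule violation.
function auditFlags(flags: FlagRecord[], now: Date): string[] {
  const violations: string[] = [];
  for (const flag of flags) {
    const createdMs = new Date(flag.createdAt).getTime();
    const ageDays = Math.floor((now.getTime() - createdMs) / DAY_MS);

    if (!flag.owner) {
      violations.push(flag.key + ": no owner");
    }
    if (flag.expiresAt === null) {
      violations.push(flag.key + ": no expiration date");
    } else {
      const lifetimeDays = (new Date(flag.expiresAt).getTime() - createdMs) / DAY_MS;
      if (lifetimeDays > MAX_EXPIRY_DAYS) {
        violations.push(flag.key + ": expiration is more than " + MAX_EXPIRY_DAYS + " days after creation");
      }
    }
    if (ageDays > MAX_AGE_DAYS) {
      violations.push(flag.key + ": " + ageDays + " days old (limit " + MAX_AGE_DAYS + ")");
    }
  }
  return violations;
}

// Made-up flags for illustration.
const sampleFlags: FlagRecord[] = [
  { key: "new_payment_flow", owner: "@payments", createdAt: "2025-03-01T00:00:00Z", expiresAt: "2025-05-01T00:00:00Z" },
  { key: "old_search_toggle", owner: "@search", createdAt: "2024-09-01T00:00:00Z", expiresAt: "2024-11-01T00:00:00Z" },
  { key: "orphaned_flag", owner: null, createdAt: "2025-03-15T00:00:00Z", expiresAt: null },
];

const problems = auditFlags(sampleFlags, new Date("2025-04-01T00:00:00Z"));
for (const problem of problems) {
  console.log(problem);
}
```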

The Architecture

A solid feature flag system has three parts:

1. Definition store

- Source of truth for all flags
- Versioned in Git or a managed service (LaunchDarkly, Unleash, GrowthBook)
- Audit log for every change

2. Client SDK

- In-app flag evaluation
- Falls back to defaults if the service is unreachable
- Caches decisions for 60 seconds
- Emits telemetry for flag usage

3. Admin interface

- Change flags without deploying code
- See current state across environments
- Role-based access (not everyone can flip prod flags)
- Approval workflow for high-risk flags
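
The client SDK responsibilities listed above (local evaluation, default fallback, a 60-second cache, usage telemetry) fit in a few dozen lines. What follows is a rough sketch and not any vendor's SDK: `CachingFlagClient`, `FetchFlag`, and the injected clock are hypothetical, and the fetch is synchronous to keep the example short.

```typescript
// Looks up a flag in the definition store; may throw if it is unreachable.
type FetchFlag = (flagKey: string) => boolean;
type Clock = () => number; // milliseconds since epoch

interface CacheEntry {
  value: boolean;
  fetchedAt: number;
}

class CachingFlagClient {
  // Telemetry: how many times each flag was evaluated.
  readonly evaluations: Record<string, number> = {};
  private readonly cache: Record<string, CacheEntry | undefined> = {};
  private readonly fetchFlag: FetchFlag;
  private readonly defaults: Record<string, boolean>;
  private readonly now: Clock;
  private readonly ttlMs: number;

  constructor(
    fetchFlag: FetchFlag,
    defaults: Record<string, boolean>,
    now: Clock = Date.now,
    ttlMs: number = 60000
  ) {
    this.fetchFlag = fetchFlag;
    this.defaults = defaults;
    this.now = now;
    this.ttlMs = ttlMs;
  }

  isEnabled(flagKey: string): boolean {
    this.evaluations[flagKey] = (this.evaluations[flagKey] || 0) + 1;

    const currentTime = this.now();
    const cached = this.cache[flagKey];
    if (cached && currentTime - cached.fetchedAt < this.ttlMs) {
      return cached.value; // decision is still fresh
    }

    try {
      const value = this.fetchFlag(flagKey);
      this.cache[flagKey] = { value: value, fetchedAt: currentTime };
      return value;
    } catch {
      // Service unreachable: use the declared default (off if none is declared).
      return this.defaults[flagKey] === true;
    }
  }
}

// Usage with a fake definition store and a controllable clock.
let fakeTime = 0;
let fetchCount = 0;
let storeReachable = true;
const client = new CachingFlagClient(
  (flagKey: string) => {
    fetchCount += 1;
    if (!storeReachable) {
      throw new Error("definition store unreachable");
    }
    return flagKey === "new_payment_flow";
  },
  { new_payment_flow: false },
  () => fakeTime
);

console.log(client.isEnabled("new_payment_flow"), "fetches:", fetchCount);
fakeTime += 30000; // 30 seconds later: served from cache
console.log(client.isEnabled("new_payment_flow"), "fetches:", fetchCount);
```

Whether an expired cache entry should win over the declared default during an outage is a judgment call; this sketch follows the list above and uses the default.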

Evaluating at the Right Layer

Flags can live at multiple layers:

- CDN edge: use for marketing experiments
- Load balancer: use for blue/green deploys
- App server: use for feature experiments
- Database: use for schema migrations

The deeper the layer, the slower the rollout. CDN flags flip in seconds. Database flags take minutes to propagate.

The Reliability Metric

Track: mean time to mitigate (MTTM).

If your team can mitigate an incident in under 30 seconds via a feature flag flip, that's a win. If you have to redeploy to mitigate, your reliability is bottlenecked by deploy time.

Good teams: MTTM under 60 seconds
Great teams: MTTM under 15 seconds
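
MTTM is just an average over incident records, so it is cheap to compute once the timestamps exist. In this sketch the `Incident` shape is hypothetical and the clock is assumed to start at detection; the numbers are made up to show how a single redeploy-based mitigation dominates the mean.

```typescript
// Hypothetical incident record; timestamps are ISO 8601 strings in UTC.
interface Incident {
  id: string;
  detectedAt: string;
  mitigatedAt: string;
}

// Mean time to mitigate in seconds, or null when there are no incidents.
function meanTimeToMitigateSeconds(incidents: Incident[]): number | null {
  if (incidents.length === 0) {
    return null;
  }
  const totalSeconds = incidents.reduce(
    (sum, incident) =>
      sum + (Date.parse(incident.mitigatedAt) - Date.parse(incident.detectedAt)) / 1000,
    0
  );
  return totalSeconds / incidents.length;
}

const incidents: Incident[] = [
  // Mitigated by flipping a flag: 12 seconds.
  { id: "inc-1", detectedAt: "2025-04-01T10:00:00Z", mitigatedAt: "2025-04-01T10:00:12Z" },
  // Mitigated by flipping a flag: 45 seconds.
  { id: "inc-2", detectedAt: "2025-04-03T14:30:00Z", mitigatedAt: "2025-04-03T14:30:45Z" },
  // Mitigated by a redeploy: 600 seconds.
  { id: "inc-3", detectedAt: "2025-04-07T09:00:00Z", mitigatedAt: "2025-04-07T09:10:00Z" },
];

console.log(meanTimeToMitigateSeconds(incidents)); // (12 + 45 + 600) / 3 = 219
```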

Common Gotchas

- Stale flags skew A/B results: clean them up after experiments
- Flags without defaults cause prod outages: every flag must have a safe fallback
- Flag flips mid-request cause weird bugs: evaluate at request start, cache for the request lifetime
- Nested flags (flags inside flags) are impossible to reason about: avoid them
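
The mid-request gotcha is the easiest one to guard against in code: take a snapshot of the flags when the request starts and read only from the snapshot afterwards. The `FlagSource` interface and `RequestFlags` class in this sketch are hypothetical names used for illustration.

```typescript
// Hypothetical live flag source whose values can change at any time.
interface FlagSource {
  isEnabled(flagKey: string): boolean;
}

// Per-request snapshot: flags are evaluated once, at construction time.
class RequestFlags {
  private readonly snapshot: Record<string, boolean> = {};

  constructor(source: FlagSource, flagKeys: string[]) {
    for (const key of flagKeys) {
      this.snapshot[key] = source.isEnabled(key);
    }
  }

  isEnabled(flagKey: string): boolean {
    return this.snapshot[flagKey] === true; // flags outside the snapshot read as off
  }
}

// A live source that an operator can flip while requests are in flight.
const liveState: Record<string, boolean> = { new_checkout: true };
const liveSource: FlagSource = {
  isEnabled: (flagKey: string) => liveState[flagKey] === true,
};

// Request starts: flags are frozen for its lifetime.
const requestFlags = new RequestFlags(liveSource, ["new_checkout"]);
const atStart = requestFlags.isEnabled("new_checkout");

// Someone flips the flag halfway through the request.
liveState["new_checkout"] = false;
const atEnd = requestFlags.isEnabled("new_checkout");

console.log("same answer throughout the request:", atStart === atEnd);
console.log("next request sees:", new RequestFlags(liveSource, ["new_checkout"]).isEnabled("new_checkout"));
```

The trade-off is that a flip only takes effect for requests that start after it, which is usually acceptable for anything shorter than a long-running job.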

A Reliability-First Flag Strategy

Start simple:

- Every new feature ships behind a kill switch
- Gradual rollouts for anything touching the critical path
- Circuit breakers for external dependencies
- Flag cleanup is a monthly ritual
- Track MTTM and optimize it

Feature flags are the most underrated reliability tool in modern engineering. Treat them that way.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com

DEV Community