Designing a Reliable Feature Flag Control Plane
Feature flags are not just booleans in a config file; at scale, they become a distributed control system for safely shipping code, limiting blast radius, and coordinating releases across many services and clients. This guide walks through a practical architecture for building a feature flag platform that is fast, safe, auditable, and easy to operate.
Why this architecture matters
A good feature flag system has to answer two questions quickly: “Is this feature on for this request?” and “Why was this decision made?” The first is about low latency and high availability; the second is about explainability, audits, and debugging. A weak design tends to create hidden complexity, stale flags, and inconsistent behavior between environments.
The core design challenge is that flag evaluation happens on the critical path of user requests, but flag management is a control-plane problem that changes constantly. That means you need to separate the write-heavy admin side from the read-heavy runtime side. In practice, the runtime should rely on local snapshots and fast refresh mechanisms instead of calling a remote database for every request.
Reference architecture
A solid feature flag platform usually has five layers: authoring, storage, distribution, evaluation, and observability. The authoring layer is your admin UI or API where teams define flags, targeting rules, expiration dates, and environments. The storage layer keeps canonical flag definitions and change history. The distribution layer pushes updates to clients or edge caches. The evaluation layer decides flag values for a specific request. The observability layer records decisions, changes, and anomalies.
A simple architecture looks like this:
Admin UI / API
↓
Flag Store (Postgres, DynamoDB, etc.)
↓
Change Stream / PubSub / Webhooks
↓
SDK Cache / Edge Cache / Relay Proxy
↓
App Request → Local Evaluation → Feature Decision
↓
Decision Logs / Metrics / Audit Trail
The important idea is that application servers should not depend on the admin database for every evaluation. They should evaluate from an in-memory snapshot that gets refreshed asynchronously. That keeps request latency predictable even when the flag control plane is under load or partially unavailable.
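A minimal sketch of that idea follows. The names `Snapshot`, `SnapshotStore`, and `applyUpdate` are hypothetical and chosen for illustration, and the snapshot is simplified to a map of flag keys to booleans. The request path only reads memory, while a background refresher (not shown) swaps in newer versions.

```typescript
// Hypothetical shapes for illustration; not taken from any specific SDK.
type Snapshot = {
  version: number;                    // increases with every published change
  flags: { [key: string]: boolean };  // simplified: flag key -> current value
};

class SnapshotStore {
  private current: Snapshot;

  constructor(initial: Snapshot) {
    this.current = initial;
  }

  // Hot path: a plain in-memory lookup, no network or database call.
  isEnabled(key: string, fallback: boolean): boolean {
    const flags = this.current.flags;
    return Object.prototype.hasOwnProperty.call(flags, key) ? flags[key] : fallback;
  }

  // Called by the background refresher. Older or duplicate versions are
  // ignored, so out-of-order deliveries cannot roll the runtime back.
  applyUpdate(next: Snapshot): boolean {
    if (next.version <= this.current.version) return false;
    this.current = next; // replace the whole snapshot reference at once
    return true;
  }

  version(): number {
    return this.current.version;
  }
}

const store = new SnapshotStore({ version: 1, flags: { "new-checkout": false } });
store.applyUpdate({ version: 2, flags: { "new-checkout": true } });
store.applyUpdate({ version: 1, flags: { "new-checkout": false } }); // stale, ignored
```

Replacing the whole snapshot reference, rather than mutating flags one by one, means a request never observes a half-applied update.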
Data model
At minimum, store four entities: flag, rule, segment, and audit event. A flag describes the feature and its default state. A rule describes conditions like user ID, plan, region, app version, or percentage rollout. A segment is a reusable audience definition such as “internal users” or “beta testers.” An audit event records who changed what and when.
Example schema (PostgreSQL syntax; the segment table is omitted for brevity):
CREATE TABLE feature_flags (
    id          BIGSERIAL PRIMARY KEY,
    key         TEXT NOT NULL,
    name        TEXT NOT NULL,
    description TEXT,
    enabled     BOOLEAN NOT NULL DEFAULT FALSE,
    environment TEXT NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    -- The same flag key may exist once per environment.
    UNIQUE (key, environment)
);

CREATE TABLE feature_flag_rules (
    id                 BIGSERIAL PRIMARY KEY,
    flag_id            BIGINT NOT NULL REFERENCES feature_flags(id),
    priority           INT NOT NULL,    -- lower values are evaluated first
    rule_type          TEXT NOT NULL,
    rule_json          JSONB NOT NULL,
    -- NULL for rules that are not percentage rollouts.
    percentage_rollout INT CHECK (percentage_rollout BETWEEN 0 AND 100)
);

-- Append-only: rows are inserted, never updated or deleted.
CREATE TABLE feature_flag_audit_log (
    id          BIGSERIAL PRIMARY KEY,
    flag_key    TEXT NOT NULL,
    environment TEXT NOT NULL,
    actor       TEXT NOT NULL,
    action      TEXT NOT NULL,
    diff_json   JSONB NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);
A practical design is to keep runtime evaluation data compact and to publish it as immutable, versioned snapshots that cache well, while the admin side keeps the rich metadata. The runtime does not need the entire editing history; it needs the latest effective config plus enough metadata to explain the decision.
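One way to picture that split is a build step that turns admin records into a runtime snapshot. This is a sketch under assumptions: `AdminFlag`, `RuntimeFlag`, `RuntimeSnapshot`, and `buildSnapshot` are hypothetical names, the `enabled` column is treated as the flag's default value, and rules arrive as JSON strings.

```typescript
// Hypothetical admin-side record: rich metadata the runtime does not need.
type AdminFlag = {
  key: string;
  name: string;
  description: string;
  owner: string;
  enabled: boolean;
  rules: { priority: number; ruleJson: string }[];
};

// Hypothetical runtime record: only what evaluation and explanation need.
type RuntimeFlag = {
  readonly key: string;
  readonly defaultValue: boolean;
  readonly rules: ReadonlyArray<{ readonly priority: number; readonly rule: unknown }>;
};

type RuntimeSnapshot = {
  readonly version: number;
  readonly flags: ReadonlyArray<RuntimeFlag>;
};

function buildSnapshot(version: number, adminFlags: AdminFlag[]): RuntimeSnapshot {
  const flags = adminFlags.map((f) => {
    const rules = f.rules
      .slice() // copy so the admin record is not reordered
      .sort((a, b) => a.priority - b.priority) // lowest priority number first
      .map((r) => Object.freeze({ priority: r.priority, rule: JSON.parse(r.ruleJson) as unknown }));
    // Name, description, and owner are deliberately dropped here.
    return Object.freeze({ key: f.key, defaultValue: f.enabled, rules: Object.freeze(rules) });
  });
  // Object.freeze is shallow; the parsed rule bodies themselves are not frozen.
  return Object.freeze({ version: version, flags: Object.freeze(flags) });
}

const snapshot = buildSnapshot(7, [
  {
    key: "new-checkout",
    name: "New checkout",
    description: "Redesigned checkout button",
    owner: "payments-team",
    enabled: false,
    rules: [
      { priority: 2, ruleJson: '{"type":"percentage","pct":10,"value":true}' },
      { priority: 1, ruleJson: '{"type":"user_in","userIds":["u_1"],"value":true}' },
    ],
  },
]);
```

Sorting rules once at build time means the evaluator can walk them in order without re-sorting on every request.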
Evaluation flow
Flag evaluation should be deterministic and cheap. The SDK receives a context object for the current request, such as user_id, org_id, region, device, and app_version. It then evaluates rules in priority order and falls back to the default state if nothing matches. Percentage rollouts should use a stable hash so a user stays in the same bucket between requests.
Example TypeScript evaluator:
// Request attributes available to targeting rules.
type Context = {
  userId?: string;
  orgId?: string;
  region?: string;
  appVersion?: string;
};

type Rule =
  | { type: "user_in"; userIds: string[]; value: boolean }
  | { type: "region_in"; regions: string[]; value: boolean }
  | { type: "percentage"; pct: number; value: boolean };

type Flag = {
  key: string;
  defaultValue: boolean;
  rules: Rule[]; // assumed to be pre-sorted by priority
};

// Maps a string to a stable bucket in [0, 100). This is a simple illustrative
// hash; a production SDK would likely use a well-tested one such as MurmurHash3.
function hashTo100(input: string): number {
  let hash = 0;
  for (let i = 0; i < input.length; i++) {
    // Hash every UTF-16 code unit and keep the value within unsigned 32 bits.
    hash = (hash * 31 + input.charCodeAt(i)) >>> 0;
  }
  return hash % 100;
}

// Returns the first matching rule's value, or the flag default if none match.
function evaluate(flag: Flag, ctx: Context): { value: boolean; reason: string } {
  for (const rule of flag.rules) {
    if (rule.type === "user_in" && ctx.userId && rule.userIds.includes(ctx.userId)) {
      return { value: rule.value, reason: "matched user rule" };
    }
    if (rule.type === "region_in" && ctx.region && rule.regions.includes(ctx.region)) {
      return { value: rule.value, reason: "matched region rule" };
    }
    if (rule.type === "percentage" && ctx.userId) {
      // Hashing flag key and user ID together buckets each flag independently.
      const bucket = hashTo100(`${flag.key}:${ctx.userId}`);
      if (bucket < rule.pct) {
        return { value: rule.value, reason: `matched ${rule.pct}% rollout` };
      }
      // Outside the rollout: fall through to lower-priority rules.
    }
  }
  return { value: flag.defaultValue, reason: "default" };
}
This model keeps the business logic straightforward and makes the result explainable to developers and support teams. If a user reports seeing a feature unexpectedly, the returned reason identifies the rule that produced the decision, and the bucket can be recomputed from the flag key and user ID.
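Two properties of this bucketing are worth checking explicitly: the same user always lands in the same bucket, and raising the percentage only adds users without removing anyone already enabled. The sketch below uses the same multiply-by-31 hashing scheme as the evaluator; `inRollout` and `countEnabled` are hypothetical helper names, and the `user-N` IDs are synthetic.

```typescript
// Stable bucket in [0, 100) using the same multiply-by-31 scheme as the evaluator.
function hashTo100(input: string): number {
  let hash = 0;
  for (let i = 0; i < input.length; i++) {
    hash = (hash * 31 + input.charCodeAt(i)) >>> 0;
  }
  return hash % 100;
}

// A user is inside the rollout when their bucket is below the percentage.
function inRollout(flagKey: string, userId: string, pct: number): boolean {
  return hashTo100(flagKey + ":" + userId) < pct;
}

// Counts how many of n synthetic users fall inside a rollout percentage.
function countEnabled(flagKey: string, n: number, pct: number): number {
  let count = 0;
  for (let i = 0; i < n; i++) {
    if (inRollout(flagKey, "user-" + i, pct)) count++;
  }
  return count;
}

// Anyone enabled at 10% is still enabled at 50%, because the bucket is fixed
// and only the threshold moves.
const enabledAt10 = countEnabled("new-checkout", 1000, 10);
const enabledAt50 = countEnabled("new-checkout", 1000, 50);
```

Comparing `enabledAt10` and `enabledAt50` against the expected share of users is also a quick way to spot a badly skewed hash before relying on it for a real rollout.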
Distribution strategy
The fastest systems evaluate flags locally and refresh data in the background. There are three common distribution patterns: polling, streaming, and a relay proxy. Polling is the simplest, but a change can take up to one polling interval to arrive. Streaming via SSE, WebSockets, or a pub/sub channel delivers updates faster. A relay proxy sits between app servers and the flag backend so SDKs talk to one local endpoint instead of many remote services.
A strong production design often combines them. For example, apps can boot from a snapshot file, then subscribe to a stream for updates, and fall back to polling if the stream breaks. This keeps the runtime resilient without sacrificing freshness, and it prevents a broken admin backend from taking down your product experience.
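The combined strategy can be sketched as a small state machine. Everything here is illustrative: the mode and event names are hypothetical, and the exponential backoff for stream reconnects (with its 1 second base and 60 second cap) is an added assumption rather than something the pattern requires.

```typescript
// Which source is currently responsible for delivering updates.
type Mode = "bootstrap" | "streaming" | "polling";
type SourceEvent = "stream_connected" | "stream_lost";
type SourceState = { mode: Mode; streamFailures: number };

const BASE_DELAY_MS = 1000;  // illustrative value
const MAX_DELAY_MS = 60000;  // illustrative value

function transition(state: SourceState, event: SourceEvent): SourceState {
  if (event === "stream_connected") {
    // A live stream makes polling redundant; reset the failure counter.
    return { mode: "streaming", streamFailures: 0 };
  }
  // Stream lost: keep serving the last snapshot and poll until it reconnects.
  return { mode: "polling", streamFailures: state.streamFailures + 1 };
}

// Delay before the next stream reconnect attempt (capped exponential backoff).
function reconnectDelayMs(state: SourceState): number {
  return Math.min(BASE_DELAY_MS * Math.pow(2, state.streamFailures), MAX_DELAY_MS);
}

// The app boots from a snapshot file, connects to the stream, then loses it.
let state: SourceState = { mode: "bootstrap", streamFailures: 0 };
state = transition(state, "stream_connected");
state = transition(state, "stream_lost");
```

In every mode the evaluator keeps reading the last snapshot it has, so a lost stream changes how quickly updates arrive, not whether requests can be served.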
Safety and rollout
Flags are most useful when they make risky changes reversible. Start with internal dogfooding, then target a small user segment, then expand gradually. For example, you might begin with 1% of traffic, move to 10%, then 50%, then 100% if error rates stay healthy. A feature flag platform should support kill switches so teams can disable a feature quickly during incidents.
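As a rough sketch, the staged expansion can be expressed as a function that picks the next percentage from the current one and an error-rate reading. `STAGES`, `nextRolloutPct`, and the threshold parameter are hypothetical, and dropping to 0% on an unhealthy reading is one possible policy (acting as the kill switch); holding at the current stage would be another.

```typescript
// Rollout stages taken from the example above: 1% -> 10% -> 50% -> 100%.
const STAGES: number[] = [1, 10, 50, 100];

function nextRolloutPct(currentPct: number, errorRate: number, maxErrorRate: number): number {
  // Unhealthy: act as a kill switch and disable the feature entirely.
  if (errorRate > maxErrorRate) return 0;
  // Healthy: advance to the first stage above the current percentage.
  for (let i = 0; i < STAGES.length; i++) {
    if (STAGES[i] > currentPct) return STAGES[i];
  }
  return 100; // already fully rolled out
}

// Healthy at 10% with a 1% error budget, so move on to the next stage.
const proposedPct = nextRolloutPct(10, 0.002, 0.01);
```

Keeping the stage list in data rather than in people's heads also makes each step easy to record in the audit log.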
Best practices:
- Always attach an owner, expiration date, and purpose to every flag.
- Use permanent flags sparingly; most flags should be retired after rollout.
- Make rollout rules explicit and auditable.
- Separate operational flags from product-experiment flags so they have different lifecycle rules.
That expiration-date discipline matters because stale flags create “flag graveyards” and make the system harder to reason about over time. A clean lifecycle is part of the architecture, not an afterthought.
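A cleanup job built on those conventions could start from something like the sketch below. The `FlagMeta` shape and `findExpired` are hypothetical; the fields simply mirror the owner, purpose, and expiration date recommended above, plus a marker for the few flags that are meant to be permanent.

```typescript
// Hypothetical lifecycle metadata attached to every flag.
type FlagMeta = {
  key: string;
  owner: string;
  purpose: string;
  expiresAt: string;  // ISO 8601 timestamp
  permanent: boolean; // long-lived operational flags are exempt from expiry
};

// Returns non-permanent flags whose expiration date has passed.
function findExpired(flags: FlagMeta[], now: Date): FlagMeta[] {
  return flags.filter(
    (f) => !f.permanent && new Date(f.expiresAt).getTime() <= now.getTime()
  );
}

const inventory: FlagMeta[] = [
  { key: "old-banner", owner: "growth", purpose: "rollout", expiresAt: "2026-01-15T00:00:00Z", permanent: false },
  { key: "new-checkout", owner: "payments", purpose: "rollout", expiresAt: "2027-01-01T00:00:00Z", permanent: false },
  { key: "disable-search", owner: "platform", purpose: "kill switch", expiresAt: "2025-01-01T00:00:00Z", permanent: true },
];

// Candidates for an owner notification or an automated cleanup ticket.
const expired = findExpired(inventory, new Date("2026-06-01T00:00:00Z"));
```

Because each flag carries an owner, the job can route every expired flag to someone who is able to remove it.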
Observability and audit
Every flag decision should be explainable. Log the flag key, evaluation result, matched rule, context hash, version, and timestamp. That gives you debugging power without storing excessive user data. For compliance and internal trust, audit logs should be append-only and tamper-evident where possible.
Useful metrics include:
- Evaluation latency.
- Cache hit rate.
- Update propagation delay.
- Number of stale snapshots served.
- Flag changes by team and environment.
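Two of these metrics, propagation delay and stale snapshots, can be derived from a few timestamps kept alongside each snapshot. The sketch below assumes hypothetical field names and millisecond timestamps. Propagation delay compares a control-plane clock with a client clock, so clock skew will affect it.

```typescript
// Hypothetical timestamps (milliseconds since epoch) tracked per snapshot.
type SnapshotTimings = {
  publishedAtMs: number;     // when the control plane published this version
  appliedAtMs: number;       // when this process started serving it
  lastRefreshOkAtMs: number; // last time a refresh attempt succeeded
};

// Update propagation delay: publish time to the moment the runtime applied it.
function propagationDelayMs(t: SnapshotTimings): number {
  return t.appliedAtMs - t.publishedAtMs;
}

// A snapshot counts as stale when no refresh has succeeded within maxAgeMs.
function isStale(t: SnapshotTimings, nowMs: number, maxAgeMs: number): boolean {
  return nowMs - t.lastRefreshOkAtMs > maxAgeMs;
}

const timings: SnapshotTimings = {
  publishedAtMs: 1000000,
  appliedAtMs: 1000450,
  lastRefreshOkAtMs: 1030000,
};
const delayMs = propagationDelayMs(timings);      // 450
const stale = isStale(timings, 1100000, 60000);   // true: 70s since last good refresh
```

Counting how often `isStale` is true at evaluation time gives the "stale snapshots served" metric directly.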
Here is a simple decision log format:
{
  "flag_key": "new-checkout",
  "value": true,
  "reason": "matched 10% rollout",
  "context_hash": "<hash of the evaluation context>",
  "snapshot_version": "2026-05-31T19:35:00Z",
  "source": "local-cache"
}
When incidents happen, this log becomes the backbone of your root-cause analysis. It helps distinguish “the flag was off” from “the wrong snapshot was loaded” or “the targeting rule was misconfigured.”
Failure modes
The biggest danger is letting the flag platform become a single point of failure. If the flag backend is unavailable, application traffic should keep flowing using the last known good snapshot. Another common issue is inconsistent evaluation between clients and servers, especially when each side implements its own logic. To prevent that, centralize the rule semantics in SDKs that share test fixtures and versioned behavior.
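Shared fixtures can be as simple as a list of inputs and expected decisions that every SDK must reproduce. The sketch below is hypothetical throughout: the flag model is reduced to a user allow-list, and `runFixtures`, `referenceEvaluate`, and `buggyEvaluate` exist only to show how a divergent implementation gets caught.

```typescript
type Ctx = { userId?: string };
type Decision = { value: boolean; reason: string };
type FlagDef = { key: string; defaultValue: boolean; allowUserIds: string[] };
type Fixture = { name: string; flag: FlagDef; ctx: Ctx; expected: Decision };
type Evaluator = (flag: FlagDef, ctx: Ctx) => Decision;

const checkoutFlag: FlagDef = { key: "new-checkout", defaultValue: false, allowUserIds: ["u_1"] };

// In practice this list would live in a versioned file shared by all SDKs.
const FIXTURES: Fixture[] = [
  { name: "listed user gets the feature", flag: checkoutFlag, ctx: { userId: "u_1" },
    expected: { value: true, reason: "matched user rule" } },
  { name: "unlisted user gets the default", flag: checkoutFlag, ctx: { userId: "u_2" },
    expected: { value: false, reason: "default" } },
  { name: "missing user id gets the default", flag: checkoutFlag, ctx: {},
    expected: { value: false, reason: "default" } },
];

// Returns the names of fixtures the evaluator fails; empty means conformant.
function runFixtures(evaluate: Evaluator, fixtures: Fixture[]): string[] {
  const failures: string[] = [];
  for (let i = 0; i < fixtures.length; i++) {
    const f = fixtures[i];
    const got = evaluate(f.flag, f.ctx);
    if (got.value !== f.expected.value || got.reason !== f.expected.reason) {
      failures.push(f.name);
    }
  }
  return failures;
}

function referenceEvaluate(flag: FlagDef, ctx: Ctx): Decision {
  if (ctx.userId !== undefined && flag.allowUserIds.indexOf(ctx.userId) !== -1) {
    return { value: true, reason: "matched user rule" };
  }
  return { value: flag.defaultValue, reason: "default" };
}

// A divergent implementation that forgets the allow-list.
function buggyEvaluate(flag: FlagDef, ctx: Ctx): Decision {
  return { value: flag.defaultValue, reason: "default" };
}

const referenceFailures = runFixtures(referenceEvaluate, FIXTURES);
const buggyFailures = runFixtures(buggyEvaluate, FIXTURES);
```

Running the same fixture file in each SDK's test suite turns "client and server disagree" from a production surprise into a failing build.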
Other failure modes include:
- Cache staleness that lasts too long.
- Non-deterministic percentage rollouts.
- Duplicate or conflicting rules.
- Missing fallback values.
- No lifecycle cleanup for retired flags.
A practical rule is to fail safe. For product features, “off” is often safer than “on.” For infrastructure kill switches, choose the default that reduces blast radius without breaking core request handling.
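In code, failing safe usually means the call site supplies the default and evaluation errors never reach the request handler. A minimal sketch, with `evaluateSafely` as a hypothetical wrapper name:

```typescript
type Decision = { value: boolean; reason: string };

// Runs an evaluation and falls back to the caller's safe default on any error.
function evaluateSafely(evaluate: () => Decision, safeDefault: boolean): Decision {
  try {
    return evaluate();
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    // The reason records why the fallback was used, for later debugging.
    return { value: safeDefault, reason: "fallback after error: " + message };
  }
}

// Normal case: the evaluator's own decision is returned unchanged.
const ok = evaluateSafely(() => ({ value: true, reason: "matched user rule" }), false);

// Failure case: a product feature falls back to "off".
const failed = evaluateSafely(() => {
  throw new Error("snapshot missing");
}, false);
```

Passing `safeDefault` per call lets a product feature default to off while a kill switch defaults to whichever state keeps core request handling working.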
Deployment checklist
Use this checklist before shipping a feature flag system:
- Define the flag ownership model and add expiration dates.
- Store canonical definitions separately from runtime snapshots.
- Implement local evaluation in SDKs or a relay proxy.
- Add stable hashing for percentage rollout.
- Emit audit logs for every change and every evaluation failure.
- Add fallback behavior for offline or stale control planes.
- Build cleanup automation for expired or unused flags.
A small launch example makes the workflow concrete: roll out a redesigned checkout button to internal users first, then 5% of logged-in customers, then 50%, while watching conversion and error metrics. If checkout errors rise, flip the flag off instantly while you investigate.
Closing model
Think of a feature flag platform as a distributed control plane with a local execution engine. The control plane stores truth, policy, and history; the runtime needs speed, stability, and a clear fallback path. That separation is what keeps flag systems useful instead of becoming another source of operational risk.
Rizwan Saleem | https://rizwansaleem.co