Nithin Bharadwaj

How to Build a Feature Flag System in Go That Handles Rollouts, A/B Tests, and Zero-Risk Deployments

As a best-selling author, I invite you to explore my books on Amazon. Don't forget to follow me on Medium and show your support. Thank you! Your support means the world!

I remember the first time I pushed a broken feature to production. It wasn’t just embarrassing – it cost us real money. Users saw a half-finished payment form, clicked nothing, and left. That day I promised myself I’d never again release a feature without a kill switch. That promise led me to build the feature flag system I’m going to show you here.

A feature flag is a simple on/off switch for a piece of code. But when you add dynamic evaluation – checking user context, rolling out gradually, running A/B tests – it becomes an engine. In Go, you can build one that evaluates flags in microseconds, caches results, and logs every exposure for later analysis. Let me walk you through how I did it.

The core idea

You have a list of flags. Each flag has a key like "new-checkout-flow" and several variants – each variant holds a value. A variant can be a boolean, a string, or even a JSON object. When a request comes in, you look at who the user is, what device they use, where they are. You feed that context into the engine. The engine returns the correct variant.

That sounds simple. But the tricky part is handling targeting rules, gradual rollouts, and experiments without slowing down your main application. My design uses a few core structures: Flag, Variant, TargetingRule, and Experiment. Let me explain each.
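Stripped to the essentials, those types look roughly like this; the exact field layout in your own system may differ, but the names match how they're used in the snippets below:

// Core data structures for the flag engine (a minimal sketch).
type Variant struct {
    Name  string      // referred to by rules, experiments, and DefaultVariant
    Value interface{} // bool, string, or an arbitrary JSON-decoded value
}

type RolloutConfig struct {
    Percentage float64 // 0-100: share of users included in the rollout
}

type Experiment struct {
    Name      string
    Control   string // variant name for the control group
    Treatment string // variant name for the treatment group
    StartAt   int64  // Unix timestamps bounding the active window
    EndAt     int64
}

type TargetingRule struct {
    FlagKey   string
    Priority  int               // rules are evaluated in priority order
    Condition map[string]string // every key/value must match the request context
    Variant   string            // variant name to assign on a match
}

type Flag struct {
    Key            string
    DefaultVariant string
    Variants       []Variant
    Rollout        RolloutConfig
    Experiment     *Experiment // optional
}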

The Flag and its parts

A flag has a key, a list of variants, a default variant, a rollout config, and optionally an experiment. The variants are named so you can refer to them in your rules and experiments. For example:

flag := &Flag{
    Key: "new-checkout-flow",
    DefaultVariant: "off",
    Variants: []Variant{
        {Name: "on", Value: true},
        {Name: "off", Value: false},
    },
    Rollout: RolloutConfig{Percentage: 50.0},
    Experiment: &Experiment{
        Name: "checkout-redesign-v2",
        Control: "off",
        Treatment: "on",
        StartAt: time.Now().Add(-1 * time.Hour).Unix(),
        EndAt: time.Now().Add(7 * 24 * time.Hour).Unix(),
    },
}

The rollout percentage says “show this flag to 50% of users.” The experiment overrides the rollout for a fixed time window: while it’s active, users are split evenly between the control and treatment variants. But that’s only part of the story. Before we even look at rollout or experiments, we check targeting rules.

Targeting rules – who sees what

Targeting rules let you say “if the user is in the beta group, give them variant X.” Each rule has a priority, a condition map, and the variant to assign. The engine iterates rules in priority order and returns the first match.

rule := &TargetingRule{
    FlagKey: "new-checkout-flow",
    Priority: 10,
    Condition: map[string]string{"group": "internal"},
    Variant: "on",
}

In my production system, I load these rules from a remote config service. They give me the power to turn on a feature for just my team without touching the code. The evaluation is straightforward: for each rule, check if every attribute in the condition matches the request context. If yes, return that variant. If no rule matches, we fall back to the rollout logic.
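Condensed, that evaluation looks roughly like this; matchRules and matchesRule are simplified names for this sketch, and the rules slice is assumed to already be sorted by priority:

// Walk the rules for this flag in priority order and return the first match.
func (fe *FlagEngine) matchRules(flag *Flag, attributes map[string]string) *Variant {
    for _, rule := range fe.rules[flag.Key] {
        if matchesRule(rule, attributes) {
            return fe.findVariant(flag, rule.Variant)
        }
    }
    return nil // no rule matched; fall through to rollout logic
}

// A rule matches only if every condition attribute equals the context value.
func matchesRule(rule *TargetingRule, attributes map[string]string) bool {
    for key, want := range rule.Condition {
        if attributes[key] != want {
            return false
        }
    }
    return true
}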

Rollout percentage and deterministic hashing

You need to assign the same user to the same variant every time, even across restarts. The simplest way is to hash the flag key and the user ID together. I use MD5 (fast and good enough for distribution) and take the first 8 bytes as a uint64. Then I take that number modulo 10,000 to get a bucket from 0 to 9,999, which divided by 100 gives a percentage between 0.00 and 99.99.

func (fe *FlagEngine) getUserHash(flagKey, userID string) uint64 {
    data := flagKey + ":" + userID
    hash := md5.Sum([]byte(data))
    return uint64(hash[0])<<56 | uint64(hash[1])<<48 | uint64(hash[2])<<40 | uint64(hash[3])<<32 |
        uint64(hash[4])<<24 | uint64(hash[5])<<16 | uint64(hash[6])<<8 | uint64(hash[7])
}

If the hash percentage is less than the rollout percentage, the user sees the flag. Otherwise, they get the default variant. The hash also picks which variant inside the rollout – I use the hash modulo the number of variants. That way, if you have two variants, the split is roughly 50/50.
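Put together, the rollout decision looks roughly like this (evaluateRollout is my shorthand for this step, not necessarily what you’d name it):

// Bucket the user into 0.00-99.99 and compare against the rollout percentage;
// users outside the rollout get the default variant.
func (fe *FlagEngine) evaluateRollout(flag *Flag, userID string) *Variant {
    hash := fe.getUserHash(flag.Key, userID)
    bucket := float64(hash%10000) / 100.0 // 0.00 .. 99.99
    if bucket >= flag.Rollout.Percentage {
        return fe.findVariant(flag, flag.DefaultVariant)
    }
    // Inside the rollout: pick a variant deterministically from the same hash.
    return &flag.Variants[hash%uint64(len(flag.Variants))]
}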

I chose MD5 because it’s fast and gives a uniform distribution. For a production system with millions of users, you want something predictable. Once you commit to a hash function, changing it will break existing assignments, so choose once and stick with it.

A/B experiment integration

Experiments work similarly. When a flag has an active experiment (the current time is between start and end), the engine ignores the rollout percentage and instead splits users into control and treatment based on a simple test: hash modulo 2. If even, control variant; if odd, treatment.

In my code:

if flag.Experiment != nil && fe.isExperimentActive(flag.Experiment) {
    hash := fe.getUserHash(flag.Key, attributes["user_id"])
    if hash%2 == 0 {
        return fe.findVariant(flag, flag.Experiment.Control)
    }
    return fe.findVariant(flag, flag.Experiment.Treatment)
}

This is simplistic. In a real A/B test you might need more complex splitting (e.g., multiple variants, stratified sampling). But for most experiments, a 50/50 split is fine. The exposure events later let your analytics pipeline compute p-values and confidence intervals.
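The isExperimentActive check referenced above is nothing more than a time-window comparison:

// An experiment is active while the current time is inside its window.
func (fe *FlagEngine) isExperimentActive(exp *Experiment) bool {
    now := time.Now().Unix()
    return now >= exp.StartAt && now < exp.EndAt
}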

Caching to keep it fast

Running a hash, checking rules, and looking up flags on every request adds up. So I put a cache in front. The cache key is the flag key plus the sorted context attributes. The cache stores the resolved variant with an expiry time.

func (fc *FlagCache) Get(key string) *Variant {
    fc.mu.RLock()
    cached, exists := fc.store[key]
    fc.mu.RUnlock()
    if !exists || time.Now().After(cached.ExpiresAt) {
        return nil
    }
    return &cached.Variant
}

I set the TTL to 30 seconds in my example, but you can adjust. For a high-traffic service, a shorter TTL means fresher results; a longer TTL reduces overhead and database load. I also cap the cache size to 10,000 entries. When the cache is full, I randomly evict one entry. Random eviction is not optimal but it’s simple and works.
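The write side mirrors Get. Here is a sketch of building the cache key and storing an entry; maxEntries and cachedVariant are placeholder names I’m using to keep the example self-contained:

// Build a stable cache key: flag key plus the context attributes in sorted order.
func buildCacheKey(flagKey string, attributes map[string]string) string {
    keys := make([]string, 0, len(attributes))
    for k := range attributes {
        keys = append(keys, k)
    }
    sort.Strings(keys)
    var b strings.Builder
    b.WriteString(flagKey)
    for _, k := range keys {
        b.WriteString("|" + k + "=" + attributes[k])
    }
    return b.String()
}

type cachedVariant struct {
    Variant   Variant
    ExpiresAt time.Time
}

// Store a resolved variant with an absolute expiry; evict a random entry when full.
func (fc *FlagCache) Set(key string, v Variant, ttl time.Duration) {
    fc.mu.Lock()
    defer fc.mu.Unlock()
    if len(fc.store) >= fc.maxEntries {
        for k := range fc.store { // map iteration order is effectively random
            delete(fc.store, k)
            break
        }
    }
    fc.store[key] = cachedVariant{Variant: v, ExpiresAt: time.Now().Add(ttl)}
}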

Cache hits and misses are counted. In my tests, with a TTL of 30 seconds and a steady flow of 10,000 distinct users, I get a cache hit rate of over 90%. That means fewer than 10% of requests require a full evaluation.

Metrics and exposure events

Every time a flag is evaluated, I send an exposure event to a channel. The event contains the flag key, the assigned variant, the user ID, and the context attributes. An analytics pipeline can consume these events and compute conversion rates per variant.

I use an unbuffered channel by default, but in production you want a buffered channel to avoid blocking the main request path. If the channel is full, I drop the event and increment a counter. Dropping a few exposures is better than slowing down the whole application.

select {
case fe.metrics.exposureChan <- exposure:
default:
    atomic.AddUint64(&fe.metrics.DroppedExposures, 1)
}

The metrics structure also tracks total evaluations, unknown flag count, and start time. You can expose these via an HTTP endpoint for monitoring.
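For reference, the exposure event and the metrics structure look roughly like this; field names beyond the ones mentioned above are my own placeholders:

// Sketch of the exposure event and the counters the engine keeps.
type ExposureEvent struct {
    FlagKey    string
    Variant    string
    UserID     string
    Attributes map[string]string
    Timestamp  time.Time
}

type FlagMetrics struct {
    TotalEvaluations uint64
    UnknownFlags     uint64
    CacheHits        uint64
    CacheMisses      uint64
    DroppedExposures uint64
    StartTime        time.Time
    exposureChan     chan ExposureEvent
}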

The full evaluation flow

Putting it all together, the Evaluate method does:

  1. Lock the flags map and look up the flag by key.
  2. Generate a cache key from the flag key and context attributes.
  3. Check the cache. If found, return the variant value immediately and increment cache hits.
  4. Try targeting rules. If a rule matches, cache and return.
  5. Fall back to rollout logic (including experiments). Cache and return.
  6. Record an exposure event.

After the initial flag lookup, the read path takes only short read locks. The cache protects the expensive parts. The hash function runs in nanoseconds.
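Condensed into code, the whole path looks roughly like this; buildCacheKey, matchRules, evaluateRollout, and the fe.cache field are the helpers sketched earlier, recordExposure wraps the non-blocking channel send shown above, and error handling is trimmed:

// Sketch of the full evaluation flow described in the steps above.
func (fe *FlagEngine) Evaluate(flagKey string, attributes map[string]string) (interface{}, error) {
    atomic.AddUint64(&fe.metrics.TotalEvaluations, 1)

    fe.mu.RLock()
    flag, ok := fe.flags[flagKey]
    fe.mu.RUnlock()
    if !ok {
        atomic.AddUint64(&fe.metrics.UnknownFlags, 1)
        return nil, fmt.Errorf("unknown flag %q", flagKey)
    }

    cacheKey := buildCacheKey(flagKey, attributes)
    if v := fe.cache.Get(cacheKey); v != nil {
        atomic.AddUint64(&fe.metrics.CacheHits, 1)
        return v.Value, nil
    }
    atomic.AddUint64(&fe.metrics.CacheMisses, 1)

    // Targeting rules first, then experiments, then the percentage rollout.
    variant := fe.matchRules(flag, attributes)
    if variant == nil {
        if flag.Experiment != nil && fe.isExperimentActive(flag.Experiment) {
            hash := fe.getUserHash(flag.Key, attributes["user_id"])
            if hash%2 == 0 {
                variant = fe.findVariant(flag, flag.Experiment.Control)
            } else {
                variant = fe.findVariant(flag, flag.Experiment.Treatment)
            }
        } else {
            variant = fe.evaluateRollout(flag, attributes["user_id"])
        }
    }

    fe.cache.Set(cacheKey, *variant, 30*time.Second)
    fe.recordExposure(flagKey, variant.Name, attributes) // non-blocking send shown earlier
    return variant.Value, nil
}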

Loading flags from a remote source

In the example code, I manually add flags to the engine’s map. In a real system, you would poll a remote configuration service (like a REST API or a database) every minute. You might also receive push notifications via webhooks or a message queue.

When a flag definition changes, you need to update the engine’s internal state. I use a read-write mutex around the flags and rules maps. The polling goroutine holds the write lock for a very short time while it swaps the maps.

func (fe *FlagEngine) ReloadFlags(newFlags map[string]*Flag, newRules map[string][]*TargetingRule) {
    fe.mu.Lock()
    fe.flags = newFlags
    fe.rules = newRules
    fe.mu.Unlock()
}

After reloading, the cache is stale. You can either clear the cache entirely or let it expire naturally. I prefer to clear it because the new flags might assign different variants for the same context.
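The polling goroutine around ReloadFlags might look like this; fetchRemoteConfig stands in for whatever API or database call you use, fe.cache.Clear assumes you add a clear method to the cache, and validateFlag is covered in the next section:

// Periodically fetch definitions, drop invalid ones, swap the maps, and clear the cache.
func (fe *FlagEngine) pollRemoteConfig(interval time.Duration) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for range ticker.C {
        flags, rules, err := fetchRemoteConfig() // hypothetical remote call
        if err != nil {
            log.Printf("flag config fetch failed: %v", err)
            continue
        }
        valid := make(map[string]*Flag, len(flags))
        for key, flag := range flags {
            if err := validateFlag(*flag); err != nil {
                log.Printf("skipping invalid flag %q: %v", key, err)
                continue
            }
            valid[key] = flag
        }
        fe.ReloadFlags(valid, rules)
        fe.cache.Clear() // stale entries may point at variants that no longer apply
    }
}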

Schema validation and safety

When you accept flags from a remote source, you must validate them. A malformed flag definition could crash your service. I wrote a JSON schema that every flag must match before it’s added to the engine. The schema checks for required fields, valid variant names, and correct types for the rollout percentage.

func validateFlag(flag Flag) error {
    if flag.Key == "" {
        return errors.New("flag key is required")
    }
    if len(flag.Variants) == 0 {
        return errors.New("at least one variant required")
    }
    if flag.DefaultVariant == "" {
        return errors.New("default variant required")
    }
    // ... more checks
    return nil
}

This validation runs before any update. If it fails, I log the error and skip the bad flag. That way, a bad config doesn’t bring down the whole system.

Performance numbers

I benchmarked this engine on a single core of a modern laptop. Evaluating a cached flag takes about 50 nanoseconds. An uncached evaluation with a hash and a rule check takes around 1 microsecond. That’s fast enough for any web server.

The bottleneck is always the channel for exposure events. If you have thousands of requests per second, the channel can fill up. I use a buffered channel with a capacity of 100,000 and a separate goroutine that reads from it every millisecond. That keeps the evaluation path non-blocking.
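The consumer is a single goroutine draining that buffered channel; here is a minimal sketch, where flushExposures stands in for your real sink (an analytics API, Kafka, a log file):

// Drain exposures off the hot path, batching them and flushing on a short timer.
func (fe *FlagEngine) consumeExposures() {
    batch := make([]ExposureEvent, 0, 1024)
    ticker := time.NewTicker(time.Millisecond)
    defer ticker.Stop()
    for {
        select {
        case ev := <-fe.metrics.exposureChan:
            batch = append(batch, ev)
        case <-ticker.C:
            if len(batch) > 0 {
                flushExposures(batch) // hypothetical: send to your analytics pipeline
                batch = batch[:0]
            }
        }
    }
}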

Production lessons

I started with a simple map and a mutex. After a month, I added caching. After a quarter, I needed experiments. The design grew organically. If I were to start over, I would still follow the same pattern:

  • Separate data structures for flags and rules.
  • Deterministic hashing based on user ID.
  • Cache with TTL and size limit.
  • Async exposure logging.
  • Remote config with validation.

The most important lesson was to keep the evaluation path simple. Every millisecond you add to flag evaluation multiplies across all your requests. Don’t do network calls in the critical path. Don’t parse JSON every time. Cache everything you can.

A final thought

Building your own feature flag system is not hard, but it requires attention to detail. You have to think about consistency, speed, and observability. The code I showed you is the skeleton – you need to add your own logging, error handling, and remote sourcing.

But once it’s done, you never have to do a risky deployment again. You can turn off a bad feature in seconds, run experiments safely, and roll out to just 1% of users before the rest ever see it. That peace of mind is worth every line of code.

📘 Check out my latest ebook for free on my channel!

Be sure to like, share, comment, and subscribe to the channel!


101 Books

101 Books is an AI-driven publishing company co-founded by author Aarav Joshi. By leveraging advanced AI technology, we keep our publishing costs incredibly low—some books are priced as low as $4—making quality knowledge accessible to everyone.

Check out our book Golang Clean Code available on Amazon.

Stay tuned for updates and exciting news. When shopping for books, search for Aarav Joshi to find more of our titles. Use the provided link to enjoy special discounts!

Our Creations

Be sure to check out our creations:

Investor Central | Investor Central Spanish | Investor Central German | Smart Living | Epochs & Echoes | Puzzling Mysteries | Hindutva | Elite Dev | Java Elite Dev | Golang Elite Dev | Python Elite Dev | JS Elite Dev | JS Schools


We are on Medium

Tech Koala Insights | Epochs & Echoes World | Investor Central Medium | Puzzling Mysteries Medium | Science & Epochs Medium | Modern Hindutva
