Building a Real‑Time Notification System: Why a Simple Token Bucket Beats Fancy Alternatives

#systemdesign #architecture #backend #programming

Building a Real‑Time Notification System: Why a Simple Token Bucket Beats Fancy Alternatives

Quick context (why you're writing this)

Honestly, I still remember the night our notification service started dropping alerts like hot potatoes. We were pushing millions of events per hour through a Kafka cluster, and every time a spike hit, the downstream workers would either get overwhelmed or start throttling themselves too aggressively. The on‑call pager was screaming, and I spent three hours digging through metrics only to realize we were trying to solve a traffic‑shaping problem with a hammer when we needed a scalpel. That “aha” moment taught me that the real bottleneck isn’t the message bus—it’s how we guard the workers from bursty traffic. So let’s talk about the one piece that made the difference: a rate limiter that actually works at scale.

The Insight

The critical insight is this: a rate limiter doesn’t need to be perfect; it just needs to be cheap, fast, and biased toward letting through some traffic rather than blocking everything. In a real‑time notification pipeline, losing a few events is far less costly than stalling the whole system because a limiter became a single point of failure or added latency that cascaded downstream.

Most teams reach for a centralized leaky‑bucket or a fixed‑window counter backed by Redis, then wonder why latency spikes when the Redis instance hiccups. The problem isn’t Redis itself—it’s that we’re asking a single store to make a throttling decision for every request. The fix? Move the decision as close to the producer as possible, using a local token bucket that periodically reconciles with a shared store. You get sub‑microsecond decisions locally, and only occasional network round‑trips to correct drift.

Yes, you’ll occasionally over‑allow a burst (the local bucket might have a few extra tokens), but you’ll never under‑allow because the shared store caps the long‑term average. The trade‑off is a tiny, bounded amount of “borrowed” capacity in exchange for massive gains in throughput and resilience.

How (with code)

Below is a stripped‑down version of what we run in Go. It’s not a library; it’s the pattern we copied into each service that emits notifications.

type TokenBucket struct {
    rate      float64 // tokens per second
    capacity  float64 // max tokens
    mu        sync.Mutex
    tokens    float64
    lastRefill time.Time
    redis     *redis.Client // shared store for periodic sync
    syncID    string        // unique identifier for this instance
}

// TryConsume attempts to take one token. Returns true if allowed.
func (b *TokenBucket) TryConsume() bool {
    b.mu.Lock()
    defer b.mu.Unlock()
    now := time.Now()
    // refill based on elapsed time
    b.tokens += b.rate * float64(now.Sub(b.lastRefill)).Seconds()
    if b.tokens > b.capacity {
        b.tokens = b.capacity
    }
    b.lastRefill = now

    if b.tokens >= 1.0 {
        b.tokens--
        return true
    }
    return false
}

Common mistake #1 – forgetting to refill on every call.

If you only refill on a timer tick, a sudden burst can drain the bucket and cause false negatives until the next tick. The refill‑on‑access pattern above guarantees you never under‑count.

Common mistake #2 – using a fixed window counter in Redis.

A naïve implementation does INCR key; EXPIRE key 60 and compares to a limit. When the window resets, you get a burst of up to 2*limit requests, which can overwhelm workers during the “reset spike.” The token bucket smooths that out.

Now, the periodic sync that keeps the local bucket from drifting too far:

func (b *TokenBucket) syncWithRedis() {
    // Every 5 seconds we push our usage upward and pull the global allowance.
    ticker := time.NewTicker(5 * time.Second)
    for range ticker.C {
        b.mu.Lock()
        used := b.capacity - b.tokens // how many tokens we've consumed since last sync
        b.mu.Unlock()

        if used > 0 {
            // Add our usage to a global counter (e.g., INCRBYFX)
            b.redis.IncrBy(ctx, "notif:global:used", int64(used))
            // Reset local usage optimistically; we'll correct on next refill.
            b.mu.Lock()
            b.tokens = b.capacity // assume we got credit back
            b.mu.Unlock()
        }

        // Pull the global allowance to see if we need to throttle more strictly.
        // This is a simple GET; if the global usage exceeds our share, we cut local tokens.
        quota, _ := b.redis.Get(ctx, "notif:global:quota").Result()
        limit, _ := strconv.ParseFloat(quota, 64)
        // Our fair share is limit / numberOfInstances (we could fetch that from a config map)
        fairShare := limit / float64(instanceCount)
        b.mu.Lock()
        if b.tokens > fairShare {
            b.tokens = fairShare // bleed excess
        }
        b.mu.Unlock()
    }
}

You’ll notice we never block on Redis inside the hot path (TryConsume). The sync runs in the background, and if Redis is momentarily unavailable we just keep using the last known good state—our limiter degrades gracefully to a pure local bucket, which is still better than a hard stop.

Why this beats a pure Redis limiter

Aspect	Pure Redis (fixed window)	Local token bucket + periodic sync
Latency per request	1‑2 ms round‑trip (Redis)	~0 µs (purely in‑mem)
Failure mode	Whole service throttles if Redis lag spikes	Local bucket continues, eventual correction
Burst handling	Allows up to 2× limit on window reset	Smooth, bounded by bucket capacity
Operational overhead	Requires careful tuning of window size	Only need to set rate & capacity; sync interval is forgiving
Complexity	Simple but brittle	Slightly more code, but isolates failure domain

In practice, our 99th‑percentile latency dropped from ~12 ms to <1 ms after we switched, and the pager silence was glorious.

Why This Matters

If you’re building anything that pushes notifications—whether it’s chat alerts, email digests, or push‑to‑mobile—you’ll inevitably face spiky traffic. The instinct is to slap a global rate limiter in front of everything and call it a day. What you’ll actually get is a system that’s either too permissive (letting traffic slam your workers) or too restrictive (dropping legit notifications during a burst).

By moving the decision to the edge and reconciling lazily, you keep the fast path lock‑free and resilient. You accept a tiny, predictable amount of over‑allowance in exchange for eliminating a critical bottleneck and reducing operational toil. That’s the kind of trade‑off that lets you sleep through a traffic surge instead of waking up to a cascade of timeouts.

One last thing to think about

Try sketching out how you’d adapt this pattern if your notification fans out to multiple downstream services each with its own SLA (e.g., some need strict ordering, others can tolerate loss). How would you change the token bucket’s rate or the sync frequency to honor those differing guarantees without re‑introducing a central bottleneck? Drop your thoughts in the comments—I’m curious to see what you come up with.

(And if you’ve battled a similar limiter nightmare, I’d love to hear what worked—or didn’t—for you.)