Timevolt

Posted on Jun 29

How I Built a Real-Time Notification System Like a Jedi Master

#systemdesign #architecture #backend #programming

The Quest Begins (The "Why")

Look, I’ll be honest—when our product started pushing notifications to millions of users every minute, things got… noisy. Our naive approach was simple: every time a client asked “do I have permission to send?” we hit a central Redis instance with an INCR command, checked the count against a limit, and either let the request through or returned a 429. It worked fine in staging, but in production the Redis latency spiked, we started seeing cascading timeouts, and our SLOs went from “sub‑100ms” to “wait, did the server just nap?”

I felt like Frodo staring at Mount Doom, clutching the One Ring (our notification traffic) and realizing the road was far longer than the map suggested. The problem wasn’t that we needed more Redis—it was that we were making every request talk to the same single point of truth. If we could make the bulk of the decision locally, we’d keep the central store happy and still enforce the limits we cared about.

The Revelation (The Insight)

Here’s the thing: a rate limiter doesn’t need to be perfectly accurate on a per‑request basis to be useful. What we really cared about was burst protection and average throughput over a short window (say, 1 second). If we allowed a tiny bit of slack—letting a client occasionally exceed the limit by a few tokens—we could dramatically reduce the chatter with Redis while still keeping the system under control.

The insight hit me while I was refilling my coffee (yes, the best ideas come during caffeine breaks): token bucket + local cache + occasional reconciliation. Each service instance would hold its own in-memory token bucket, refilled at a steady rate. No background goroutine is needed for that: the bucket tops itself up from the elapsed time whenever a request checks it. When a request arrived, we’d try to consume a token locally. If the bucket had at least one token, we’d grant the request instantly, with no network hop. If the bucket was empty, we’d fall back to a quick check against a centralized Redis counter to see if any tokens were truly available globally.

Why does this beat a pure‑central or pure‑local approach?

Approach	Pros	Cons
All‑requests to Redis	Strong consistency	High latency, Redis becomes bottleneck
Purely local token bucket	Zero latency, no external dependency	No global view → easy to burst over limit
Hybrid (local + occasional Redis check)	Low latency for the majority of requests, bounded global over‑consumption, Redis load drops by ~90%	Slightly more complex, need to handle occasional stale local state

The trade‑off is a controlled inaccuracy: a client might be allowed to send a few extra notifications before we notice and throttle them via Redis. In practice, that “few” is far smaller than the burst size we were trying to protect against, and the user impact is negligible.
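To put a rough number on that “few”, here is a back-of-the-envelope sketch. It assumes the fallback order described above (local bucket first, Redis second) and that every instance starts with a full bucket. The function name and parameters are made up for illustration.

```go
package main

// worstCaseBurst estimates the largest number of requests that could be
// admitted in a single burst: every instance first drains its full local
// bucket, and only then does Redis start counting toward the global limit.
// All names are illustrative and not part of the limiter shown later.
func worstCaseBurst(instances, localCapacity, globalLimit int) int {
	return instances*localCapacity + globalLimit
}
```

With 4 instances, a local capacity of 10, and a global limit of 100, the worst case is 140 admitted requests. The overshoot grows with the instance count, which is worth remembering when you autoscale.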

ASCII diagram of the flow

+-----------+      +-------------------+      +------------------------+
|  Client   | ---> |  API Gateway      | ---> |  Service Instance      |
+-----------+      +-------------------+      +------------------------+
                                              |  Local Token Bucket    |
                                              |  (in-memory)           |
                                              |  - consume token       |
                                              |  - if enough -> OK     |
                                              +------------------------+
                                                           |
                                                           | bucket empty
                                                           v
                                              +------------------------+
                                              |  Fallback Check        |
                                              |  (Redis Lua script)    |
                                              +------------------------+
                                                           |
                                                           v
                                              +------------------------+
                                              |  Central Redis         |
                                              |  (global count)        |
                                              +------------------------+

Most requests stop at the “Local Token Bucket” box and never touch Redis. Only when the bucket is empty do we hop to the fallback check, which is still a single Redis round trip but far less frequent.
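If you want to check the “most requests never touch Redis” claim against your own traffic, one option is to count which path each request takes. This is a minimal sketch rather than what we shipped: pathCounter and its methods are hypothetical names, and the local and global checks are passed in as plain functions so the snippet stands alone.

```go
package main

import "sync/atomic"

// pathCounter records how many requests were served by the local fast path
// and how many had to fall back to the central store.
type pathCounter struct {
	local    int64
	fallback int64
}

// allow runs the local check first and only calls the global check when the
// local one says no, counting which path was taken.
func (c *pathCounter) allow(localOK func() bool, globalOK func() bool) bool {
	if localOK() {
		atomic.AddInt64(&c.local, 1)
		return true
	}
	atomic.AddInt64(&c.fallback, 1)
	return globalOK()
}

// fallbackRatio returns the fraction of requests that reached the slow path.
func (c *pathCounter) fallbackRatio() float64 {
	l := atomic.LoadInt64(&c.local)
	f := atomic.LoadInt64(&c.fallback)
	if l+f == 0 {
		return 0
	}
	return float64(f) / float64(l+f)
}
```

Exporting fallbackRatio as a metric shows directly whether your bucket capacity is sized well: a ratio creeping toward 1 means the local bucket is barely helping.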

Wielding the Power (Code & Examples)

The Struggle (Naïve implementation)

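// naiveLimiter.go: every request hits Redis
// Assumes redis is an already-connected go-redis client whose methods take no
// context argument (newer go-redis versions take a context.Context first),
// and that limitPerSecond is an int64 defined elsewhere.
func Allow(userID string) bool {
    key := fmt.Sprintf("notif:%s", userID)
    cnt, err := redis.Incr(key).Result()
    if err != nil {
        // handle error: fail closed for safety
        return false
    }
    // set expiry on first hit; INCR and EXPIRE are separate commands here,
    // so a crash between them leaves a counter that never expires
    if cnt == 1 {
        redis.Expire(key, time.Second)
    }
    return cnt <= limitPerSecond
}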

Problems:

One INCR per request → Redis QPS blows up.
Network round‑trip adds ~1‑2 ms latency (more under load).
If Redis lags, we start rejecting legitimate traffic (fail‑open vs fail‑closed dilemma).

The Victory (Hybrid token bucket)

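// hybridLimiter.go
import (
    "fmt"
    "math"
    "sync"
    "time"
)

type TokenBucket struct {
    rate       float64 // tokens per second
    capacity   float64 // max tokens
    tokens     float64 // tokens currently available
    mu         sync.Mutex
    lastRefill time.Time // zero value: the first refill fills the bucket to capacity
}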

// refill adds tokens based on elapsed time
func (b *TokenBucket) refill() {
    now := time.Now()
    elapsed := now.Sub(b.lastRefill).Seconds()
    b.tokens = math.Min(b.capacity, b.tokens+elapsed*b.rate)
    b.lastRefill = now
}

// TryConsume attempts to take one token; returns true if allowed
func (b *TokenBucket) TryConsume() bool {
    b.mu.Lock()
    defer b.mu.Unlock()
    b.refill()
    if b.tokens >= 1 {
        b.tokens--
        return true
    }
    return false
}

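// Global limiter backed by Redis (used only on fallback).
// Assumes the same redis client as above, an int64 limitPerSecond, and
// refillSec holding the window length in seconds.
func globalAllow(userID string) bool {
    key := fmt.Sprintf("notif:%s", userID)
    // Lua script for atomic check-and-incr; returns 0 when the limit is reached
    lua := `
        local current = tonumber(redis.call('GET', KEYS[1]) or "0")
        if current >= tonumber(ARGV[1]) then
            return 0
        end
        local new = redis.call('INCRBY', KEYS[1], 1)
        if new == 1 then
            -- start the window on the first hit only, so later hits do not extend it
            redis.call('EXPIRE', KEYS[1], tonumber(ARGV[2]))
        end
        return new
    `
    res, err := redis.Eval(lua, []string{key}, limitPerSecond, refillSec).Result()
    if err != nil {
        // fail closed on Redis errors
        return false
    }
    // the script returns 0 for "denied", so any positive count means "allowed"
    n, ok := res.(int64)
    return ok && n > 0
}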

// Public API
func AllowHybrid(userID string, bucket *TokenBucket) bool {
    if bucket.TryConsume() {
        // fast path – local token available
        return true
    }
    // slow path – check Redis to see if we truly have capacity
    return globalAllow(userID)
}
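AllowHybrid takes a bucket as an argument, which leaves open where that bucket comes from. If you want the local fast path to be per user (matching the per-user Redis key), one option is a small registry that hands out one bucket per user ID. This is a sketch under that assumption: userBucket is a simplified stand-in for TokenBucket so the snippet compiles on its own, and bucketRegistry is a hypothetical name.

```go
package main

import (
	"sync"
	"time"
)

// userBucket is a simplified stand-in for the TokenBucket type above.
type userBucket struct {
	rate       float64
	capacity   float64
	tokens     float64
	lastRefill time.Time
}

// newUserBucket returns a bucket that starts full.
func newUserBucket(rate, capacity float64) *userBucket {
	return &userBucket{rate: rate, capacity: capacity, tokens: capacity, lastRefill: time.Now()}
}

// bucketRegistry lazily creates one bucket per user ID.
type bucketRegistry struct {
	buckets  sync.Map // userID (string) -> *userBucket
	rate     float64
	capacity float64
}

// get returns the existing bucket for userID or creates one. LoadOrStore
// ensures concurrent callers all end up with the same bucket.
func (r *bucketRegistry) get(userID string) *userBucket {
	if b, ok := r.buckets.Load(userID); ok {
		return b.(*userBucket)
	}
	b, _ := r.buckets.LoadOrStore(userID, newUserBucket(r.rate, r.capacity))
	return b.(*userBucket)
}
```

A registry like this grows with the number of distinct users, so a real service would also need some form of eviction for idle buckets.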

What changed?

Local bucket (TryConsume) handles the majority of checks in microseconds, no network.
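Lazy refill (refill() runs inside every TryConsume) keeps the bucket topped up at the desired rate without a background goroutine.
Fallback to Redis only when the local bucket is empty. Note what this bounds: tokens granted locally are never counted in Redis, so the global overshoot in a burst can be as large as one bucket’s capacity per instance, not a single token. Keeping that overshoot small means sizing each instance’s rate and capacity as a share of the global limit.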

Common traps to avoid

Forgetting to refill – If you don’t call refill() before checking tokens, the bucket will go stale and you’ll over-throttle, because tokens that should have accrued are never added. Make refill part of every TryConsume call (or run a ticker per bucket).
Setting capacity too low – The capacity determines the maximum burst you can absorb locally. If you set it to 1, almost every request in a burst falls back to Redis, defeating the purpose. Choose a capacity that matches your acceptable burst (e.g., 10-20 tokens).
Ignoring Redis errors – In the fallback path, treat a Redis error as “deny” (fail‑closed) to protect downstream services unless you have a clear fallback strategy.

Why This New Power Matters

With this hybrid limiter in place, our notification service went from ~15k Redis ops/second to ~1.5k ops/second—a 90% reduction. Latency for the fast path dropped from ~2 ms to < 200 µs, and our Redis instance lived happily under 30% CPU usage even during traffic spikes.

More importantly, the design gave us a mental model we could reuse elsewhere: local optimistic state + occasional authoritative reconciliation. It’s the same pattern behind read‑through caches, optimistic locking, and even some consensus protocols.

You now have a tool that lets you build systems that feel instantaneous to the user while still honoring global constraints. It’s like having a lightsaber that can deflect most blaster bolts on its own, only needing to call in the Force (the central store) for the really tough shots.

Your Turn

Give it a try! Take a service that’s currently hammering a central store for every request, slap a token bucket in front of it, and watch the load drop. Experiment with different bucket capacities and refill rates—see how the trade‑off shifts between latency and burst safety.
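Before touching production, you can get a feel for that trade-off with a tiny simulation. The sketch below uses a simplified traffic model that I’m assuming purely for illustration: fixed-size bursts separated by a fixed gap, replayed against a bucket that starts full. simulateBursts is a hypothetical helper, not part of the limiter.

```go
package main

import "math"

// simulateBursts replays `bursts` bursts of `burstSize` requests each,
// separated by gapSec seconds, against a token bucket that starts full.
// It returns how many requests were served locally and how many would
// have fallen back to the central store.
func simulateBursts(rate, capacity float64, bursts, burstSize int, gapSec float64) (local, fallback int) {
	tokens := capacity
	for b := 0; b < bursts; b++ {
		if b > 0 {
			// refill for the idle gap, capped at capacity
			tokens = math.Min(capacity, tokens+gapSec*rate)
		}
		for i := 0; i < burstSize; i++ {
			if tokens >= 1 {
				tokens--
				local++
			} else {
				fallback++
			}
		}
	}
	return local, fallback
}
```

For three bursts of 8 requests one second apart at a rate of 5 tokens per second, a capacity of 10 serves 20 requests locally and sends 4 to the fallback, while a capacity of 1 serves only 3 locally and sends 21 to the fallback.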

What’s the first place you’ll apply this pattern? Drop a comment below; I’d love to hear about your own quests and the dragons you’ve slain. Happy coding! 🚀
