Rate Limiting: The Microservice vs Monolith Showdown – A Jedi’s Choice

#systemdesign #architecture #backend #programming

The Quest Begins (The “Why”)

Picture this: I’m knee‑deep in a legacy e‑commerce app that started life as a tidy monolith. Users were happy, traffic was modest, and the rate limiter lived as a simple while loop that checked an in‑memory counter. Life was good… until Black Friday hit.

Suddenly, the app was running as dozens of API instances spread across multiple servers. Each node kept its own counter, so the global limit was a joke: clients could burst past the threshold simply by hitting different nodes. I felt like I’d stepped into a Star Wars trash compactor: walls closing in, and I had no idea which lever to pull.

That night, after a third cup of coffee and a frustrated mutter about “distributed systems being the dark side,” I asked myself: When does it make sense to pull the rate‑limiting logic out of the monolith and give it its own microservice? The answer wasn’t just about scaling; it was about consistency, observability, and the sanity of the team on call.

The Revelation (The Insight)

The critical insight hit me like a lightsaber duel: a rate limiter only needs to be a microservice when the cost of inconsistent limits outweighs the operational overhead of running a separate service.

In a monolith, an in‑process limiter is blisteringly fast: no network hop, no serialization. But it assumes a single source of truth for the counter, which falls apart the moment you run more than one instance. You could share the counter through an external store (Redis, Memcached, etc.), but then you’ve already introduced a distributed dependency, so it is often worth going one step further and making the limiter its own service to reap the side‑benefits:

Global visibility – metrics, logs, and alerts live in one place.
Independent scaling – you can spin up more limiter pods without touching the core business logic.
Policy evolution – change the algorithm (e.g., move from fixed‑window to sliding‑window) without redeploying every service that calls it.
Fault isolation – a buggy limiter is contained behind one API; as long as callers set a timeout and decide up front whether to fail open or fail closed, it won’t bring down the checkout flow.

If your traffic is low, your instance count is static, and you’re okay with “best‑effort” limits, the monolith wins. Simplicity is king. But once you cross the threshold where inconsistent limits could lead to revenue loss or abusive traffic, the microservice route becomes the Jedi’s path—elegant, powerful, and worth the training.

Wielding the Power (Code & Examples)

The Monolith Struggle

Here’s what the naïve in‑process limiter looked like in our Go monolith:

type simpleLimiter struct {
    limit     int
    interval  time.Duration
    mu        sync.Mutex
    count     int
    lastReset time.Time
}

func (l *simpleLimiter) Allow() bool {
    l.mu.Lock()
    defer l.mu.Unlock()

    now := time.Now()
    if now.Sub(l.lastReset) > l.interval {
        l.count = 0
        l.lastReset = now
    }

    if l.count >= l.limit {
        return false // reject
    }
    l.count++
    return true
}

The trap: each replica of the service owns its own simpleLimiter, so N replicas can each allow limit requests per window, for a total of N × limit, far above what we intended. I spent three hours debugging why our “100 req/s” ceiling was actually letting through roughly 500 req/s on a five‑node cluster. It felt like trying to defeat a boss with a wooden sword.
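To make the N × limit arithmetic concrete, here is a minimal simulation. It is a sketch, not our production code: replicaLimiter is a hypothetical stripped‑down stand‑in for simpleLimiter (one window, no clock, no mutex), and the round‑robin loop plays the role of a load balancer.

```go
package main

import "fmt"

// replicaLimiter is a hypothetical, simplified stand-in for simpleLimiter:
// it counts requests within a single window and ignores time and locking.
type replicaLimiter struct {
	limit int
	count int
}

func (l *replicaLimiter) Allow() bool {
	if l.count >= l.limit {
		return false
	}
	l.count++
	return true
}

// simulate spreads `requests` round-robin over `replicas` independent
// limiters and returns how many requests were admitted in total.
func simulate(replicas, limit, requests int) int {
	nodes := make([]*replicaLimiter, replicas)
	for i := range nodes {
		nodes[i] = &replicaLimiter{limit: limit}
	}
	allowed := 0
	for i := 0; i < requests; i++ {
		if nodes[i%replicas].Allow() {
			allowed++
		}
	}
	return allowed
}

func main() {
	// Five replicas, each configured for 100 requests per window.
	fmt.Println(simulate(5, 100, 1000)) // prints 500
}
```

Every replica faithfully enforces its own limit of 100, and the cluster as a whole still admits 500.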

The Microservice Victory

We extracted the limiter into its own service, backed by Redis for atomic fixed‑window counter operations. The service exposes a tiny gRPC endpoint:

service RateLimiter {
  rpc Allow (AllowRequest) returns (AllowResponse);
}

message AllowRequest {
  string key = 1;      // e.g., user_id or API endpoint
  int64  limit = 2;    // requests per window
  int64  windowSec = 3;
}

message AllowResponse {
  bool allowed = 1;
  int64  remaining = 2;
}

Implementation (Go, using go-redis/redis):

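func (s *server) Allow(ctx context.Context, req *AllowRequest) (*AllowResponse, error) {
    // EXPIRE with a non-positive TTL deletes the key, which would silently
    // disable the limit, so reject invalid input before touching Redis.
    if req.Limit <= 0 || req.WindowSec <= 0 {
        return nil, fmt.Errorf("limit and windowSec must be positive")
    }

    key := fmt.Sprintf("rl:%s:%d", req.Key, req.WindowSec)
    // Fixed-window counter: KEYS[1] = counter key, ARGV[1] = limit,
    // ARGV[2] = window length in seconds. Returns {allowed, remaining}.
    lua := `
        local current = redis.call("GET", KEYS[1])
        if current == false then
            -- First request in this window: start the counter with a TTL.
            redis.call("SET", KEYS[1], 1)
            redis.call("EXPIRE", KEYS[1], ARGV[2])
            return {1, tonumber(ARGV[1]) - 1}
        end
        if tonumber(current) >= tonumber(ARGV[1]) then
            return {0, 0}
        end
        local newval = redis.call("INCR", KEYS[1])
        return {1, tonumber(ARGV[1]) - newval}
    `
    res, err := s.rdb.Eval(ctx, lua, []string{key},
        strconv.FormatInt(req.Limit, 10),
        strconv.FormatInt(req.WindowSec, 10)).Result()
    if err != nil {
        return nil, err
    }

    // Checked assertions: a malformed reply becomes an error, not a panic.
    vals, ok := res.([]interface{})
    if !ok || len(vals) != 2 {
        return nil, fmt.Errorf("unexpected script result: %v", res)
    }
    allowed, okA := vals[0].(int64)
    remaining, okR := vals[1].(int64)
    if !okA || !okR {
        return nil, fmt.Errorf("unexpected script result: %v", res)
    }
    return &AllowResponse{
        Allowed:   allowed == 1,
        Remaining: remaining,
    }, nil
}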

Why this beats the monolith version:

The Lua script runs atomically inside Redis, guaranteeing a single source of truth across any number of limiter instances.
Network hop adds ~1‑2 ms latency—acceptable for most APIs, and we can mitigate it with local caching (e.g., a short‑lived in‑memory fallback) if needed.
Observability: we export Prometheus metrics directly from the limiter service (limiter_allowed_total, limiter_blocked_total). No need to instrument every client.
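The in‑memory fallback mentioned in the list above could look roughly like this. Treat it as a sketch under assumptions: Checker, CheckerFunc, and withFallback are hypothetical names I made up for illustration (Checker would wrap the gRPC client in real code), and the timeout only helps if the remote implementation actually honors context cancellation.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Checker is a hypothetical interface over anything that can answer
// "may this key proceed?", e.g. a thin wrapper around the gRPC client.
type Checker interface {
	Allow(ctx context.Context, key string) (bool, error)
}

// CheckerFunc adapts a plain function to the Checker interface.
type CheckerFunc func(ctx context.Context, key string) (bool, error)

func (f CheckerFunc) Allow(ctx context.Context, key string) (bool, error) {
	return f(ctx, key)
}

// withFallback asks the remote limiter first, bounded by timeout. If that
// call errors or times out, it uses the local checker's answer instead.
type withFallback struct {
	remote  Checker
	local   Checker
	timeout time.Duration
}

func (w withFallback) Allow(ctx context.Context, key string) (bool, error) {
	rctx, cancel := context.WithTimeout(ctx, w.timeout)
	defer cancel()
	if ok, err := w.remote.Allow(rctx, key); err == nil {
		return ok, nil
	}
	// Remote is unavailable: degrade to the best-effort local decision.
	return w.local.Allow(ctx, key)
}

// exampleOutage simulates an unreachable limiter service and a local
// checker that admits the request.
func exampleOutage() (bool, error) {
	down := CheckerFunc(func(context.Context, string) (bool, error) {
		return false, errors.New("limiter unreachable")
	})
	local := CheckerFunc(func(context.Context, string) (bool, error) {
		return true, nil
	})
	l := withFallback{remote: down, local: local, timeout: 5 * time.Millisecond}
	return l.Allow(context.Background(), "user-42")
}

func main() {
	fmt.Println(exampleOutage()) // prints: true <nil>
}
```

The local checker is where you encode the policy: an always‑true function fails open, an always‑false one fails closed, and a per‑process limiter gives you best‑effort limits during an outage.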

Common mistake to avoid: forgetting to set an expiration on the Redis key, causing stale counters to accumulate and eventually block legitimate traffic. The EXPIRE call in the script prevents that zombie‑state.
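If you want to convince yourself of both points (one shared counter, and a TTL that resets it) without a Redis server, here is a minimal in‑memory model. Everything in it is an assumption for illustration: sharedStore is a hypothetical stand‑in for Redis, its mutex plays the role of atomic script execution, and instance is a pretend limiter pod.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type entry struct {
	count     int64
	expiresAt time.Time
}

// sharedStore is a hypothetical in-memory stand-in for Redis. The mutex
// plays the role of Redis executing the Lua script atomically.
type sharedStore struct {
	mu   sync.Mutex
	data map[string]entry
}

// allow mirrors the script: start a counter with a TTL on first use,
// reject once the counter reaches limit, otherwise increment.
func (s *sharedStore) allow(key string, limit int64, window time.Duration, now time.Time) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	e, ok := s.data[key]
	if !ok || !now.Before(e.expiresAt) { // missing or expired key
		s.data[key] = entry{count: 1, expiresAt: now.Add(window)}
		return true
	}
	if e.count >= limit {
		return false
	}
	e.count++
	s.data[key] = e
	return true
}

// instance is a pretend limiter pod; all pods point at the same store.
type instance struct{ store *sharedStore }

func (in instance) Allow(key string, now time.Time) bool {
	return in.store.allow(key, 100, time.Second, now)
}

// sharedLimit sends 1000 requests round-robin through five instances within
// one window, then one more request after the window has expired.
func sharedLimit() (int, bool) {
	store := &sharedStore{data: map[string]entry{}}
	nodes := make([]instance, 5)
	for i := range nodes {
		nodes[i] = instance{store: store}
	}
	t0 := time.Unix(0, 0)
	allowed := 0
	for i := 0; i < 1000; i++ {
		if nodes[i%len(nodes)].Allow("user-42", t0) {
			allowed++
		}
	}
	afterTTL := nodes[0].Allow("user-42", t0.Add(time.Second))
	return allowed, afterTTL
}

func main() {
	fmt.Println(sharedLimit()) // prints: 100 true
}
```

Five instances, one counter, 100 admitted. Drop the expiry check from allow and that last request would be rejected forever, which is exactly the zombie state described above.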

Why This New Power Matters

Now that the limiter lives in its own microservice, our system behaves like a well‑tuned starfighter squadron: each ship (service) can focus on its mission while the carrier (limiter service) handles traffic policing centrally.

Scalability: we can autoscale the limiter pods based on Redis CPU or request latency without touching the checkout, recommendation, or payment services.
Safety: a bug in the limiter algorithm only affects the limiter service; we can roll it out with a canary deployment and watch the metrics in real time.
Flexibility: swapping from a fixed‑window to a sliding‑window or leaky‑bucket algorithm is a single service deploy—no need to coordinate a rolling update across dozens of services.
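For a feel of what the sliding‑window swap mentioned above involves, here is a minimal in‑process sketch of one variant, the sliding‑window log. It is not what we run in production; slidingLog and boundaryBurst are hypothetical names, and the clock is passed in so the behavior is deterministic.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// slidingLog is a sliding-window-log limiter: it remembers when each
// admitted request happened and admits a new one only if fewer than
// `limit` admitted requests fall within the last `window`.
type slidingLog struct {
	mu     sync.Mutex
	limit  int
	window time.Duration
	hits   []time.Time // admitted request times, oldest first
}

// AllowAt takes the current time as an argument so it can be tested
// deterministically; real code would pass time.Now().
func (l *slidingLog) AllowAt(now time.Time) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	// Drop timestamps that have slid out of the window.
	cutoff := now.Add(-l.window)
	i := 0
	for i < len(l.hits) && !l.hits[i].After(cutoff) {
		i++
	}
	l.hits = l.hits[i:]

	if len(l.hits) >= l.limit {
		return false
	}
	l.hits = append(l.hits, now)
	return true
}

// boundaryBurst sends requests on both sides of a one-second boundary
// against a limit of 2 per second.
func boundaryBurst() []bool {
	l := &slidingLog{limit: 2, window: time.Second}
	base := time.Unix(0, 0)
	var out []bool
	for _, ms := range []int{900, 950, 1050, 1100, 1901} {
		out = append(out, l.AllowAt(base.Add(time.Duration(ms)*time.Millisecond)))
	}
	return out
}

func main() {
	fmt.Println(boundaryBurst()) // prints: [true true false false true]
}
```

A fixed window that resets at the 1000 ms mark would admit the first four requests, which is four in 200 ms against a limit of two per second. The sliding log rejects the third and fourth, at the cost of storing one timestamp per admitted request.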

If you’re still on the fence, ask yourself: How costly would it be if my rate limit were off by a factor of two? If the answer is “a lot,” give the microservice path a try. If you’re running a small internal tool with a single instance, keep it simple—let the monolith shine.

Your Turn

Grab a service that’s currently using an in‑process counter or a naive distributed lock, sketch out the token‑bucket logic in Redis, and expose it through a tiny gRPC or HTTP API. Test it under load with hey or wrk, watch the metrics roll in, and feel that quiet satisfaction of a limiter that actually limits.
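If you’d like a starting point for the token‑bucket logic before porting it to Redis, here is a minimal in‑process sketch. The names (tokenBucket, newTokenBucket, burstThenRefill) are hypothetical, the clock is injected for testability, and the type is deliberately not safe for concurrent use.

```go
package main

import (
	"fmt"
	"time"
)

// tokenBucket holds up to `capacity` tokens and refills at `rate` tokens
// per second. Not safe for concurrent use: add a mutex, or move the state
// into a shared store, before using it across goroutines or instances.
type tokenBucket struct {
	capacity float64
	rate     float64   // tokens added per second
	tokens   float64   // tokens currently available
	last     time.Time // time of the previous refill
}

func newTokenBucket(capacity, rate float64, now time.Time) *tokenBucket {
	return &tokenBucket{capacity: capacity, rate: rate, tokens: capacity, last: now}
}

// AllowAt refills based on elapsed time, then spends one token if one
// is available.
func (b *tokenBucket) AllowAt(now time.Time) bool {
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens += elapsed * b.rate
		if b.tokens > b.capacity {
			b.tokens = b.capacity
		}
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// burstThenRefill drains a bucket of 2 tokens, then waits one second for
// a single token to come back at 1 token/s.
func burstThenRefill() []bool {
	start := time.Unix(0, 0)
	later := start.Add(time.Second)
	b := newTokenBucket(2, 1, start)
	return []bool{
		b.AllowAt(start), // 2 tokens -> 1
		b.AllowAt(start), // 1 token -> 0
		b.AllowAt(start), // bucket empty
		b.AllowAt(later), // one token refilled, spent here
		b.AllowAt(later), // empty again
	}
}

func main() {
	fmt.Println(burstThenRefill()) // prints: [true true false true false]
}
```

The state you would need to keep per key in Redis is just the two fields tokens and last, updated together in one atomic script.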

What system will you extract next? Share your war stories in the comments—I’d love to hear how you turned a chaotic free‑for‑all into a disciplined Jedi‑approved defense line. May the force be with your rate limits!