Building a Rate Limiter That Actually Works

#systemdesign #architecture #backend #programming

Quick context

I was knee‑deep in a side‑project API last month when the service started returning 429s out of nowhere. The client library I was using had a naïve “max requests per second” check that would burst all at once, hammer downstream services, and then go silent for a full second. I spent three hours staring at logs, wondering why my “simple” limiter was either too lax or too brutal. It hit me: the problem wasn’t the limit itself—it was how we were counting time.

If you’ve ever built an endpoint that needs to protect a downstream DB, a third‑party API, or just keep your own service from melting down, you’ve probably run into the same thing. The usual “reset every second” approach feels intuitive, but it hides a subtle flaw that shows up under real traffic.

The Insight

The key insight is this: a rate limiter should smooth request arrival over time, not just count them in fixed windows.

Think of a bucket that fills with tokens at a steady rate, up to a fixed capacity. Each request takes one token: if a token is available, the request goes through; if the bucket is empty, the request is rejected or queued. This is the token bucket algorithm. It gives you two valuable properties that a simple fixed‑window counter lacks:

Burst tolerance – you can save up tokens during idle periods and spend them in a short spike without immediately exceeding the limit.
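Bounded throughput – over any interval, the number of allowed requests is at most the bucket capacity plus the refill rate times the elapsed time, so the sustained rate never exceeds the refill rate and downstream systems are protected from unbounded spikes.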

The trade‑off is a tiny bit of state (the current token count and the time of the last refill) and the need to top up tokens, which can be done lazily on each request instead of on a timer. Compared to a fixed window, you avoid the “all‑or‑nothing” reset that can cause a surge of requests right after the window resets. Compared to a leaky bucket (which processes requests at a constant rate), the token bucket lets you keep bursts when they’re actually useful, like handling a batch of webhook retries.

How (with code)

Below is a compact token bucket in Go that is safe for concurrent use. I’ve kept the dependencies to the standard library, so you can drop it into any service without pulling in Redis or another external store, which suits a single‑instance limiter.

package ratelimiter

import (
    "sync"
    "time"
)

// TokenBucket implements a classic token bucket limiter.
type TokenBucket struct {
    rate       float64 // tokens added per second
    capacity   float64 // max tokens the bucket can hold
    tokens     float64 // current tokens
    lastRefill time.Time
    mu         sync.Mutex // protects tokens and lastRefill
}

// NewTokenBucket creates a bucket that refills at `rate` tokens/sec,
// with a maximum burst of `capacity`.
func NewTokenBucket(rate float64, capacity float64) *TokenBucket {
    return &TokenBucket{
        rate:       rate,
        capacity:   capacity,
        tokens:     capacity, // start full so we can burst initially
        lastRefill: time.Now(),
    }
}

// Allow attempts to consume one token. It returns true if the request
// is allowed, false otherwise.
func (b *TokenBucket) Allow() bool {
    b.mu.Lock()
    defer b.mu.Unlock()
    now := time.Now()
    // Add tokens based on elapsed time since last refill.
    elapsed := now.Sub(b.lastRefill).Seconds()
    b.tokens += elapsed * b.rate
    if b.tokens > b.capacity {
        b.tokens = b.capacity
    }
    b.lastRefill = now

    if b.tokens < 1.0 {
        // Not enough tokens – reject.
        return false
    }
    // Consume one token and allow the request.
    b.tokens--
    return true
}
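
To show where Allow sits in a request path, here is a minimal sketch of HTTP middleware around the limiter. The names Limiter, LimiterFunc, Limit, hello, and statusOf are ones I’m introducing for illustration; they are not part of the snippet above, although *TokenBucket satisfies Limiter as written. The Retry-After value is a fixed placeholder, not a computed wait.

```go
package main

import (
    "net/http"
    "net/http/httptest"
)

// Limiter is a hypothetical interface for this sketch. Any type with an
// Allow() bool method, such as *TokenBucket above, satisfies it.
type Limiter interface {
    Allow() bool
}

// LimiterFunc adapts a plain function to Limiter, which is handy for stubs.
type LimiterFunc func() bool

func (f LimiterFunc) Allow() bool { return f() }

// Limit wraps next and answers 429 when the limiter has no token left.
func Limit(l Limiter, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if !l.Allow() {
            // Assumption: a fixed one-second hint. A fuller version could
            // derive the wait from the bucket's refill rate.
            w.Header().Set("Retry-After", "1")
            http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
            return
        }
        next.ServeHTTP(w, r)
    })
}

// hello stands in for the real handler being protected.
var hello = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("ok\n"))
})

// statusOf sends one GET through h in memory and returns the status code,
// so the middleware can be tried without starting a server.
func statusOf(h http.Handler) int {
    rec := httptest.NewRecorder()
    h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
    return rec.Code
}
```

In a real service you would register it with something like http.Handle("/api", Limit(NewTokenBucket(10, 20), hello)), where the path, rate, and capacity are example values.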

What makes this different from the usual “reset every second” approach?

A common mistake is to store a counter and a timestamp, then reset the counter when a second has passed:

// ❌ Bad: fixed‑window counter (prone to bursts, and not goroutine‑safe)
const limit = 100 // requests per second (example value)

var (
    count   int
    lastSec time.Time
)

func AllowBad() bool {
    now := time.Now()
    if now.Sub(lastSec) > time.Second {
        count = 0
        lastSec = now
    }
    if count >= limit { // limit is requests per second
        return false
    }
    count++
    return true
}

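The problem? Say the limit is 100 requests per second. A window opens at t = 0 and stays quiet until 100 requests arrive at 0.9 s; all of them are allowed. Just after 1.0 s the counter resets and another 100 go straight through, so 200 requests pass in a little over 0.1 s, double the configured rate. The token bucket smooths that out because tokens only accumulate at the refill rate: once a full bucket is spent, you get back only the refill rate times the elapsed time, and you can’t have a surplus of 100 tokens again unless you’ve been idle for a while.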
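
To make the boundary burst concrete, here is a small deterministic replay, sketched under a few assumptions: both limiters take the current time as an argument (in milliseconds) instead of calling time.Now(), the limit is 100 per second, and the type and function names are hypothetical. It sends 100 requests at 0.9 s and 100 more at 1.1 s and counts how many each limiter lets through.

```go
package main

// fixedWindow mirrors the counter above, with the clock passed in so the
// scenario can be replayed exactly.
type fixedWindow struct {
    limit     int
    count     int
    startedMs int64
}

func (f *fixedWindow) allow(nowMs int64) bool {
    if nowMs-f.startedMs > 1000 {
        f.count = 0
        f.startedMs = nowMs
    }
    if f.count >= f.limit {
        return false
    }
    f.count++
    return true
}

// tokenBucket is the same algorithm as TokenBucket, minus the mutex and
// with the clock passed in.
type tokenBucket struct {
    rate     float64 // tokens per second
    capacity float64
    tokens   float64
    lastMs   int64
}

func (b *tokenBucket) allow(nowMs int64) bool {
    b.tokens += float64(nowMs-b.lastMs) * b.rate / 1000
    if b.tokens > b.capacity {
        b.tokens = b.capacity
    }
    b.lastMs = nowMs
    if b.tokens < 1 {
        return false
    }
    b.tokens--
    return true
}

// replay sends 100 requests at t=900ms and 100 more at t=1100ms and
// returns how many were allowed in total.
func replay(allow func(nowMs int64) bool) int {
    allowed := 0
    for _, t := range []int64{900, 1100} {
        for i := 0; i < 100; i++ {
            if allow(t) {
                allowed++
            }
        }
    }
    return allowed
}
```

With fixedWindow{limit: 100}, replay returns 200: both bursts pass in full. With tokenBucket{rate: 100, capacity: 100, tokens: 100}, it returns 120: the first burst drains the bucket, and only the 20 tokens refilled over the next 200 ms are available for the second.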

Another pitfall is trying to refill with a time.Ticker inside a goroutine and updating the token count without proper synchronization. If you forget the mutex (or use an atomic incorrectly), you’ll get race conditions where two goroutines both think they have a token and both decrement the count, leading to over‑allowance. The mutex in the snippet above guarantees that the read‑modify‑write of tokens is atomic.

If you need a distributed limiter (multiple service instances), you can replace the in‑memory state with a Redis-backed script that does the same math atomically—EVAL a Lua script that reads the key, updates tokens, and writes back. The core algorithm stays identical; only the storage changes.

Why This Matters

Using a token bucket gives you predictable traffic shaping without sacrificing the ability to handle legitimate spikes. In my side‑project, after swapping the naïve fixed‑window limiter for the token bucket above, the downstream DB saw a steady query rate, latency stopped spiking, and the 429s disappeared. The service could still absorb a short burst of webhook retries because the bucket had saved tokens during quieter periods.

The trade‑off is modest: you keep a float for the token count, a timestamp, and a mutex (or a lock‑free atomic if you’re comfortable with the extra complexity). For most services that run on a single node or a few behind a load balancer, that cost is negligible; just remember that each instance keeps its own bucket, so N instances can admit up to N times the configured rate in aggregate. If you’re operating at massive scale and need to share state across hundreds of nodes, you’ll inevitably reach for a centralized store like Redis or DynamoDB, but the underlying idea stays the same: smooth, not slice.

If you’re still using a “reset every N seconds” counter in production, I’d challenge you to replace it with a token bucket (or at least a sliding‑window log) and watch the difference in your metrics.

Something to think about

How would you adapt this token bucket to work with a dynamic rate that changes based on load‑shedding signals (e.g., back‑pressure from a downstream service)? Sketch out the changes you’d need, and consider whether you’d keep the same bucket or spin up a new one each time the rate updates.
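
If you want a starting point, here is one possible sketch, not the only answer. It keeps the same bucket and adds a SetRate method that first credits the tokens earned at the old rate, then switches, so a rate change never rewrites history. AdjustableBucket, SetRate, and AllowAt are hypothetical names, and passing the time in explicitly is my own choice to make the behaviour easy to check.

```go
package main

import (
    "sync"
    "time"
)

// AdjustableBucket is a hypothetical variant of TokenBucket whose refill
// rate can change at runtime.
type AdjustableBucket struct {
    mu         sync.Mutex
    rate       float64 // tokens added per second
    capacity   float64
    tokens     float64
    lastRefill time.Time
}

func NewAdjustableBucket(rate, capacity float64) *AdjustableBucket {
    return &AdjustableBucket{
        rate:       rate,
        capacity:   capacity,
        tokens:     capacity, // start full, as in the original
        lastRefill: time.Now(),
    }
}

// refill credits tokens earned since lastRefill at the current rate.
// Callers must hold mu.
func (b *AdjustableBucket) refill(now time.Time) {
    elapsed := now.Sub(b.lastRefill).Seconds()
    if elapsed <= 0 {
        return
    }
    b.tokens += elapsed * b.rate
    if b.tokens > b.capacity {
        b.tokens = b.capacity
    }
    b.lastRefill = now
}

// SetRate settles the bucket at the old rate up to now, then applies the
// new rate to everything after. Negative rates are treated as zero.
func (b *AdjustableBucket) SetRate(rate float64, now time.Time) {
    if rate < 0 {
        rate = 0
    }
    b.mu.Lock()
    defer b.mu.Unlock()
    b.refill(now)
    b.rate = rate
}

// AllowAt consumes one token if available, using now as the current time.
func (b *AdjustableBucket) AllowAt(now time.Time) bool {
    b.mu.Lock()
    defer b.mu.Unlock()
    b.refill(now)
    if b.tokens < 1 {
        return false
    }
    b.tokens--
    return true
}

// Allow is the drop-in equivalent of TokenBucket.Allow.
func (b *AdjustableBucket) Allow() bool { return b.AllowAt(time.Now()) }
```

Keeping one bucket preserves tokens already earned; spinning up a fresh bucket on every update would either discard them or hand out a full new burst each time the rate changes.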

Give it a try, and let me know what you discover. Happy limiting!