Athreya aka Maneshwar

Posted on May 20

Rate Limiting Strategies in Go: Token Bucket, Leaky Bucket, and Sliding Window

#webdev #go #beginners #programming

Hello, I'm Maneshwar. I'm building git-lrc, a Micro AI code reviewer that runs on every commit. It is free and source-available on GitHub. Star git-lrc to help devs discover the project. Do give it a try and share your feedback for improving the project.

Every backend service eventually meets a client that does not know when to stop. Sometimes it is a buggy retry loop, sometimes it is a scraper, sometimes it is your own well-meaning cron job firing a thousand requests in the same second.

The fix is the same: rate limiting.

In this post we will walk through the three algorithms you will encounter again and again (token bucket, leaky bucket, and sliding window) and how to actually use them in Go.

Go has solid, well-tested libraries for token and leaky buckets, and unless you have a very good reason, you should use them. For the sliding window we will write a small implementation ourselves.

We will keep this single-node.

Distributed rate limiting with Redis is its own rabbit hole, and I will save that for a follow-up post.

The three algorithms in one minute

Before touching code, here is the mental model you actually need.

Token bucket has a bucket that holds tokens.
Tokens are added at a steady rate up to some capacity.
Every request takes a token. No token, no request.
Because the bucket can fill up, it allows short bursts — useful when you want to be strict on average but tolerant of brief spikes.

Leaky bucket flips the perspective.
Requests go into a bucket; the bucket drains (leaks) at a steady rate.
If the bucket is full, new requests are dropped or made to wait.
It enforces a strict, smooth output rate. No bursts.

Sliding window counts requests inside a moving time window.
If you say "100 requests per minute," it really means "no more than 100 in any 60-second span."
It is the most accurate of the three but also the most expensive to compute precisely.

A quick comparison:

| Algorithm | Allows bursts? | Output shape | Typical use |
|---|---|---|---|
| Token bucket | Yes | Bursty within limits | API endpoints, user quotas |
| Leaky bucket | No | Perfectly smooth | Outbound calls to a strict upstream |
| Sliding window | Up to N per window | Accurate over time | "N per minute" billing-style limits |

Now let us actually use them.

1. Token bucket with `golang.org/x/time/rate`

The golang.org/x/time/rate package is the de facto standard for token bucket rate limiting in Go.

It is an official Go subrepository — same maintainers as the standard library, just versioned separately so it can evolve outside the standard library's compatibility lockstep.

You still have to go get it.

```bash
go get golang.org/x/time/rate
```

The core type is rate.Limiter.

You create one with NewLimiter(r, b) where r is the refill rate (tokens per second) and b is the burst size (bucket capacity).

```go
package main

import (
    "context"
    "fmt"
    "time"

    "golang.org/x/time/rate"
)

func main() {
    // 5 tokens per second, bucket holds up to 10
    limiter := rate.NewLimiter(5, 10)

    ctx := context.Background()
    for i := 0; i < 20; i++ {
        // Wait blocks until a token is available (or ctx is cancelled)
        if err := limiter.Wait(ctx); err != nil {
            fmt.Println("error:", err)
            return
        }
        fmt.Printf("request %d at %s\n", i, time.Now().Format("15:04:05.000"))
    }
}
```

Limiter gives you three main methods (each also has an N-suffixed variant, such as AllowN, that takes several tokens at once), and the difference between them is the whole point of the package:

Allow() — returns true if a token is available right now, otherwise false. Non-blocking. Use this when you want to drop excess requests (e.g. return 429 Too Many Requests).
Wait(ctx) — blocks until a token is available. Respects context cancellation. Use this for background workers that should slow down, not fail.
Reserve() — returns a Reservation telling you how long to wait. Use this when you want to make the decision yourself — for example, fail fast if the delay exceeds some threshold.

Drop-style: HTTP middleware with `Allow()`

The most common use case is "limit incoming HTTP requests and reject the overflow." Here is the middleware:

```go
package main

import (
    "encoding/json"
    "net/http"

    "golang.org/x/time/rate"
)

func rateLimitMiddleware(next http.Handler) http.Handler {
    // 10 req/sec sustained, 20 burst
    limiter := rate.NewLimiter(10, 20)

    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if !limiter.Allow() {
            w.Header().Set("Content-Type", "application/json")
            w.WriteHeader(http.StatusTooManyRequests)
            json.NewEncoder(w).Encode(map[string]string{
                "error": "rate limit exceeded, please retry later",
            })
            return
        }
        next.ServeHTTP(w, r)
    })
}

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("hello\n"))
    })

    http.ListenAndServe(":8080", rateLimitMiddleware(mux))
}
```

That is it. One limiter, shared across every request: up to 20 requests can get through at once, and beyond that anything over 10/sec is rejected.

Per-client limiting

A single global limiter is rarely what you want.

Usually you want each client (IP, API key, user ID) to have its own bucket so one noisy client cannot starve everyone else.

The pattern is a map of limiters keyed by client identifier.

```go
package main

import (
    "net"
    "net/http"
    "sync"
    "time"

    "golang.org/x/time/rate"
)

type clientLimiter struct {
    limiter  *rate.Limiter
    lastSeen time.Time
}

type IPRateLimiter struct {
    clients map[string]*clientLimiter
    mu      sync.Mutex
    rate    rate.Limit
    burst   int
}

func NewIPRateLimiter(r rate.Limit, b int) *IPRateLimiter {
    rl := &IPRateLimiter{
        clients: make(map[string]*clientLimiter),
        rate:    r,
        burst:   b,
    }
    // Janitor: evict idle clients every minute
    go rl.cleanup()
    return rl
}

func (rl *IPRateLimiter) getLimiter(ip string) *rate.Limiter {
    rl.mu.Lock()
    defer rl.mu.Unlock()

    c, ok := rl.clients[ip]
    if !ok {
        lim := rate.NewLimiter(rl.rate, rl.burst)
        rl.clients[ip] = &clientLimiter{limiter: lim, lastSeen: time.Now()}
        return lim
    }
    c.lastSeen = time.Now()
    return c.limiter
}

func (rl *IPRateLimiter) cleanup() {
    for {
        time.Sleep(time.Minute)
        rl.mu.Lock()
        for ip, c := range rl.clients {
            if time.Since(c.lastSeen) > 3*time.Minute {
                delete(rl.clients, ip)
            }
        }
        rl.mu.Unlock()
    }
}

func (rl *IPRateLimiter) Middleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // RemoteAddr is host:port; we want the host as the key
        host, _, err := net.SplitHostPort(r.RemoteAddr)
        if err != nil {
            host = r.RemoteAddr
        }
        if !rl.getLimiter(host).Allow() {
            http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
            return
        }
        next.ServeHTTP(w, r)
    })
}
```

A few things to note. First, the janitor goroutine: without it the map grows forever as new IPs keep showing up.

Second, r.RemoteAddr is host:port, so if you use it as the map key directly, the same client on a different ephemeral port gets a fresh bucket — which is not what you want. net.SplitHostPort fixes that.

Third, in any real deployment behind a load balancer or CDN, even the host portion of RemoteAddr is your proxy's IP — you need to extract the client IP from X-Forwarded-For or X-Real-IP (and verify the proxy is trusted, or you have just made spoofing trivial).

Wait-style: throttling outbound calls

When you are the client hitting some upstream API with a 100 req/sec ceiling, you want to slow down, not error out.

Wait is the right tool here.

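```go
import (
    "context"
    "net/http"
    "sync"
    "time"

    "golang.org/x/time/rate"
)

// fetch is a stand-in for your real outbound call.
func fetch(u string) {
    if resp, err := http.Get(u); err == nil {
        resp.Body.Close()
    }
}

func fetchAll(ctx context.Context, urls []string) {
    // ~100/sec, burst of 1 (no headroom beyond the steady rate)
    limiter := rate.NewLimiter(rate.Every(10*time.Millisecond), 1)

    var wg sync.WaitGroup
    for _, u := range urls {
        if err := limiter.Wait(ctx); err != nil {
            break // context cancelled
        }
        wg.Add(1)
        go func(u string) {
            defer wg.Done()
            fetch(u)
        }(u)
    }
    wg.Wait()
}
```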

Two things worth flagging in this snippet.

First, burst of 1 means a single token sits in the bucket, so the first call returns immediately and the rest are spaced ~10ms apart.

A burst of 0 would make every Wait call return an error straight away (one token is more than the bucket can ever hold), and Allow would always return false, so 1 is the minimum useful value when you want "no real burst headroom."

Second, the Wait happens in the calling loop, before the goroutine spawns. Limiter is safe for concurrent use, so calling Wait inside each goroutine would still space out when the calls start, but it would launch one goroutine per URL up front (a thousand goroutines in a microsecond), each parked on its own reservation, and a cancelled ctx would no longer stop the loop from spawning them.

Doing the wait in the calling loop means goroutines are created only as fast as the requests they make. Note that this caps how often calls start, not how many are in flight at once: if the upstream is slow to respond, you may also want a concurrency limit.

rate.Every(d) is a small helper that converts a desired interval into a rate.Limit. rate.Every(10*time.Millisecond) is the same as rate.Limit(100) but reads more naturally when you are thinking in terms of "one request every N milliseconds."

2. Leaky bucket with `go.uber.org/ratelimit`

Token bucket allows bursts.

Sometimes you do not want that.

If you are calling an upstream that absolutely cannot tolerate spikes — say, a partner API that throttles you on the millisecond, not the second — you want a leaky bucket: smooth, evenly-spaced output.

Uber's go.uber.org/ratelimit is the simplest leaky-bucket implementation in the Go ecosystem. The whole API is one method: Take().

```bash
go get go.uber.org/ratelimit
```

```go
package main

import (
    "fmt"
    "time"

    "go.uber.org/ratelimit"
)

func main() {
    rl := ratelimit.New(100) // 100 ops/sec, evenly spaced

    prev := time.Now()
    for i := 0; i < 10; i++ {
        now := rl.Take()
        fmt.Println(i, now.Sub(prev))
        prev = now
    }
}
```

If you run this, every iteration after the first will print roughly 10ms.

The limiter does not give you 100 in a burst and then make you wait — it spaces them out exactly.

That is the leaky bucket guarantee.

Slack: a controlled amount of burstiness

Pure leaky bucket can be too strict.

If your producer is slightly bursty by nature, you may not want every single hiccup to cause queueing.

Uber's library has a "slack" knob for this.

With slack, the limiter can accumulate a small number of unspent requests during idle periods and let you burn through them in a burst later.

Worth being precise about what slack is, because it is easy to confuse with a token bucket's burst capacity.

Like a token bucket, slack only builds up while calls arrive more slowly than the configured rate, and a quiet period banks at most slack requests' worth of time. If you are calling Take() back to back, slack does nothing.

The differences are where you start and what happens when the credit runs out. A new x/time/rate limiter starts with a full bucket and can reject requests; a new Uber limiter starts with no credit, and Take() never rejects, it just sleeps.

So slack is a "you went quiet, so we will let you catch up" allowance, not headroom you get up front.

```go
// Default: allows up to 10 slack tokens (small built-in burst tolerance after idle)
defaultRL := ratelimit.New(100)

// Strict mode: zero slack, perfectly even spacing
strictRL := ratelimit.New(100, ratelimit.WithoutSlack)

// Custom slack
customRL := ratelimit.New(100, ratelimit.WithSlack(50))
```

Note that WithoutSlack is a variable, not a function — no parentheses.

Per-minute and other windows

By default, New(n) means "n per second." If you want per-minute or per-hour, use Per:

```go
rl := ratelimit.New(5, ratelimit.Per(time.Minute)) // 5 per minute
```

When to pick Uber's library over `x/time/rate`

The honest answer is: only when you specifically need leaky-bucket semantics. x/time/rate can do almost everything Uber's library does and more (context support, Allow/Reserve semantics, dynamic rate changes).

But if your requirement is "evenly spaced output, no bursts," ratelimit.New(n) is one line and you are done.

One caveat: the library is stable and widely used, but not actively iterated on.

That is usually fine for something this small and well-defined — but if you want a library with more momentum behind it, x/time/rate is the safer pick.

3. Sliding window: when "N per minute" really means N

Token bucket and leaky bucket reason about rate.

They give you very good average-rate enforcement, but they do not give you an exact "no more than N requests in any rolling 60-second window" guarantee.

Sometimes that exact-count semantics is what you actually need — billing limits, login attempt windows, fairness across tenants.

The clean way to do this single-node is to keep a sorted slice of request timestamps per key, and on each request: evict timestamps older than the window, then check if the count is under the limit. Here is a self-contained implementation:

```go
package ratelimit

import (
    "sort"
    "sync"
    "time"
)

type SlidingWindow struct {
    limit    int
    window   time.Duration
    requests map[string][]time.Time
    mu       sync.Mutex
}

func NewSlidingWindow(limit int, window time.Duration) *SlidingWindow {
    return &SlidingWindow{
        limit:    limit,
        window:   window,
        requests: make(map[string][]time.Time),
    }
}

func (sw *SlidingWindow) Allow(key string) bool {
    sw.mu.Lock()
    defer sw.mu.Unlock()

    now := time.Now()
    cutoff := now.Add(-sw.window)

    // Timestamps are appended in order, so we can binary-search the cutoff.
    timestamps := sw.requests[key]
    i := sort.Search(len(timestamps), func(i int) bool {
        return timestamps[i].After(cutoff)
    })
    timestamps = timestamps[i:]

    if len(timestamps) >= sw.limit {
        sw.requests[key] = timestamps
        return false
    }

    sw.requests[key] = append(timestamps, now)
    return true
}
```

Usage:

```go
sw := NewSlidingWindow(100, time.Minute) // 100 requests per rolling minute

if !sw.Allow("user:42") {
    // reject
}
```

This is the "sliding log" variant — exact, but memory is O(limit × active_keys).

For high limits (say, 10k req/min across many keys), that becomes real memory.

The standard fix is the sliding window counter variant: keep two adjacent fixed windows (the current and previous minute), each with a single integer count, and estimate the rolling count as count_current + count_previous × (1 - elapsed_fraction_of_current_window).

You lose a bit of precision near window boundaries but drop from O(limit) per key to O(1).

For most real workloads, the simple log version above is fine; reach for the counter version when memory becomes a problem.

You also want a janitor here, same as the IP limiter, to evict keys nobody has hit in a while.

A word on distributed rate limiting

Everything above runs in a single process.

The moment you scale to two instances of your service behind a load balancer, your limits are effectively doubled — each instance has its own bucket.

For a single-instance side project this does not matter.

For anything serious, you need a shared store.

The standard answer is Redis.

The github.com/go-redis/redis_rate package implements GCRA (a leaky-bucket variant) on top of Redis with a single Lua script per check, which keeps it atomic and fast. Roughly:

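```go
limiter := redis_rate.NewLimiter(rdb)
res, err := limiter.Allow(ctx, "user:42", redis_rate.PerSecond(10))
switch {
case err != nil:
    // Redis is unreachable: decide whether to fail open (allow) or closed (reject)
case res.Allowed == 0:
    // reject
}
```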

I will cover the Redis side of this — including the Lua scripts, why GCRA wins over naive sliding windows at scale, and how to handle Redis going down — in a follow-up post.

For now, just know that when you outgrow single-node limiting, this is the next stop.

Which one should you use?

If you remember one thing from this post, remember this:

Default to golang.org/x/time/rate. It is well-tested, context-aware, and covers 90% of cases. Use Allow for HTTP servers, Wait for outbound clients.
Reach for go.uber.org/ratelimit when you specifically need evenly spaced output and no bursts.
Roll a sliding window when your requirement is genuinely "no more than N in any rolling window of duration W" and average-rate enforcement is not enough.

The biggest mistake I see is people writing their own token bucket from scratch and getting the math subtly wrong — off-by-one on the burst size, races on the refill, drift over long runs. The libraries exist. Use them.

Now go put a 429 in front of whatever is currently melting your server.

AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*

Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.

⭐ Star it on GitHub:

HexmosTech / git-lrc

Free, Micro AI Code Reviews That Run on Commit


DEV Community
