DEV Community

Mohak Rathod

My Curiosity Got Out of Hand - So I Built a Rate Limiter in Go 🔧

Hey everyone! I'm Mohak 🙋🏼‍♂️, an MSCS student who apparently likes distributed systems. So I decided to build one, and this is my very first blog post ✍🏼 about it 🙂. Please be gentle 🙏🏼.

As a student, I've spent a lot of time reading about distributed systems, whether in textbooks, papers, or YouTube videos. The CAP theorem, consistency models, fault tolerance: it all looks fascinating on paper. But there's a gap, one I feel is always present, between understanding a concept and actually building something with it.

So I thought: why not try to close that gap a little? I went with rate limiting, something every real-world API needs, and decided to build it from scratch. No libraries doing the hard parts for me. Just Go, Redis, and a lot of debugging sessions.

So here I am, documenting what I learned: the decisions, the gotchas, and the moments where the textbook didn't quite prepare me for reality.


🤔 What even is rate limiting in the first place?

Before we go any further, let's make sure we all are on the same page.
Imagine you run an API (Application Programming Interface). One of your users writes a buggy script that accidentally sends 10,000 requests per second (P.S. when I tried it, my VS Code lagged 😅). Without rate limiting, this can flood your servers and bring down the service for everyone (a classic DoS attack).

Rate limiting says: "You get 100 requests per minute. After that, slow down 🟡, relax 💆🏼‍♂️, take a chill pill 🥶!"

Simple enough in concept, right? But what happens when your API runs on multiple servers? Each server has its own memory. A user could send 100 requests to each server and bypass your limits entirely. That's where the problem becomes distributed, and also where things start to get interesting.


🛠️ Here's what I built

My full feature list at a glance:

  • Token Bucket algorithm - the same approach used by AWS and Stripe
  • Per-client isolation - every user/API gets their own independent limit
  • Redis-backed shared state - consistent limits across multiple servers
  • gRPC API - fast binary protocol, two endpoints: CheckLimit and GetStatus
  • Live monitoring - Prometheus metrics with Grafana dashboards
  • One-command local setup - Docker Compose
  • Kubernetes deployment - with autoscaling that handles traffic spikes automatically

The numbers speak for themselves

  • 3,000 requests/second sustained throughput

  • 1.57ms average latency

  • Zero errors across 110,000 test requests


🏗️ The Architecture

Let's see how the components combine to form the megazord:

System Architecture
One design choice I'm truly proud of: the Redis manager and in-memory manager are fully interchangeable; they can be swapped via a config flag. During local development I used in-memory (fast, no setup required). For production I switched to Redis. Same code, different backend.
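To make that swap concrete, here's a minimal sketch of the pattern, a shared interface plus a config-driven factory. The interface and type names here are my own illustrations, not the repo's actual code:

```go
package main

import "fmt"

// Limiter is a sketch of the shared interface both backends satisfy;
// the repo's actual interface name and methods may differ.
type Limiter interface {
	Allow(clientID string) bool
}

type inMemoryLimiter struct{}

func (inMemoryLimiter) Allow(string) bool { return true }

type redisLimiter struct{}

func (redisLimiter) Allow(string) bool { return true }

// newLimiter picks the backend from config, so the rest of the
// code never knows (or cares) which one it's talking to.
func newLimiter(backend string) Limiter {
	if backend == "redis" {
		return redisLimiter{}
	}
	return inMemoryLimiter{}
}

func main() {
	l := newLimiter("memory") // flip to "redis" in production config
	fmt.Println(l.Allow("client-A"))
}
```

Because callers only ever see the `Limiter` interface, switching backends is a one-line config change, no code edits required.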


1️⃣ The Token Bucket Algorithm:

The core of everything.

You can think of it like a physical bucket (except instead of water, it holds tokens). Every request consumes one token. The bucket slowly refills over time. If the bucket is empty when we ask, we get the same response we see when we exhaust a free-tier API.
Bingo, you guessed it: "Sorry, due to rate limits you need to wait."

Now, why did I choose Token Bucket? I could have gone with Sliding Window (LeetCode grind 💀) or other approaches.

The key advantage is that it allows short bursts, not full explosions 💥. If you stay quiet for a while, say 10 seconds, you build up tokens and can legitimately fire off several requests quickly ⚡. This feels natural to users. AWS and Stripe both use token bucket for this very reason.

Here's my actual implementation (internal/ratelimiter/limiter.go):

// TokenBucket implements a token bucket for a single client.
type TokenBucket struct {
	mu         sync.Mutex
	tokens     float64
	capacity   int32
	refillRate float64 // tokens added per second
	lastRefill time.Time
}

// NewTokenBucket creates a bucket that starts full.
func NewTokenBucket(capacity int32, refillRate float64) *TokenBucket {
	return &TokenBucket{
		tokens:     float64(capacity),
		capacity:   capacity,
		refillRate: refillRate,
		lastRefill: time.Now(),
	}
}

// refill adds tokens based on the time elapsed since the last refill.
func (tb *TokenBucket) refill() {
	now := time.Now()
	elapsed := now.Sub(tb.lastRefill).Seconds()

	// Calculate tokens to add
	tokensToAdd := elapsed * tb.refillRate

	// Add tokens, capped at capacity
	tb.tokens = min(tb.tokens+tokensToAdd, float64(tb.capacity))
	tb.lastRefill = now
}

// Allow reports whether a request may proceed, along with the
// remaining tokens and how long to wait (in ms) if it may not.
func (tb *TokenBucket) Allow(tokensRequested int32) (bool, int32, int64) {
	tb.mu.Lock()
	defer tb.mu.Unlock()

	// Refill tokens first
	tb.refill()

	// Enough tokens? Consume them and allow the request.
	if tb.tokens >= float64(tokensRequested) {
		tb.tokens -= float64(tokensRequested)
		return true, int32(tb.tokens), 0
	}

	// Not enough tokens: compute how long until there are.
	tokensNeeded := float64(tokensRequested) - tb.tokens
	retryAfterSeconds := tokensNeeded / tb.refillRate
	retryAfterMS := int64(retryAfterSeconds * 1000)

	return false, int32(tb.tokens), retryAfterMS
}
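To see why staying quiet earns you a burst, here's the refill arithmetic from `refill()` in isolation. `tokensAfter` is a hypothetical helper for illustration, not part of the repo:

```go
package main

import (
	"fmt"
	"math"
)

// tokensAfter applies the same formula as refill():
// current tokens + elapsed*rate, capped at capacity.
// (Hypothetical helper, not in the repo.)
func tokensAfter(current, elapsedSec, refillRate, capacity float64) float64 {
	return math.Min(current+elapsedSec*refillRate, capacity)
}

func main() {
	// Capacity 10, refilling 1 token/sec, bucket drained to 0.
	fmt.Println(tokensAfter(0, 5, 1, 10))  // 5 tokens after staying quiet for 5s
	fmt.Println(tokensAfter(0, 60, 1, 10)) // capped at capacity: 10
}
```

So a client who has been idle for 10 seconds can legitimately burst 10 requests at once, while a client hammering the API continuously gets throttled to the steady refill rate.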

2️⃣ Managing Multiple Clients

Implementing a single bucket is basically child's play. But in a real product there are thousands of independent buckets, and that's where concurrency comes into the picture.
Each client (identified by some UID) needs their own completely isolated bucket. Client A blowing through their quota should have zero effect on Client B.

The concurrency challenge: multiple requests arrive simultaneously for different clients. If two goroutines try to create a bucket for the same new client at the same time, you can end up with duplicates or lost state.

To solve this, Go has sync.RWMutex, a read-write lock: many goroutines can read concurrently (checking whether a bucket exists), but only one can write (creating a new bucket).
Here's my implementation of this part:

// getOrCreateBucket returns the client's bucket, creating it on first use.
func (m *Manager) getOrCreateBucket(clientID string) *TokenBucket {
	// Fast path: read lock only
	m.mu.RLock()
	bucket, exists := m.buckets[clientID]
	m.mu.RUnlock()

	if exists {
		return bucket
	}

	m.mu.Lock()
	defer m.mu.Unlock()

	// Double-check: another goroutine may have created the bucket
	// between RUnlock above and Lock here
	if bucket, exists := m.buckets[clientID]; exists {
		return bucket
	}

	bucket = NewTokenBucket(m.config.Capacity, m.config.RefillRate)
	m.buckets[clientID] = bucket
	return bucket
}

The interesting part: after acquiring the write lock, I double-check the map. Another goroutine might have created the bucket in the gap between releasing the read lock and acquiring the write lock. Skipping this check would overwrite their bucket and reset the client's token count. It may seem like a subtle detail, but it's a bug with nasty consequences.
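Here's the same double-checked locking pattern in isolation, with a plain counter standing in for the bucket (all names here are mine, not the repo's). Even with 100 goroutines racing on the same key, exactly one allocation happens:

```go
package main

import (
	"fmt"
	"sync"
)

// entry stands in for a TokenBucket; the locking pattern is what matters.
type entry struct{ id int }

type registry struct {
	mu      sync.RWMutex
	entries map[string]*entry
	creates int // how many times we actually allocated
}

func (r *registry) getOrCreate(key string) *entry {
	// Fast path: shared read lock, many goroutines at once.
	r.mu.RLock()
	e, ok := r.entries[key]
	r.mu.RUnlock()
	if ok {
		return e
	}

	// Slow path: exclusive write lock.
	r.mu.Lock()
	defer r.mu.Unlock()

	// Double-check: another goroutine may have won the race
	// between RUnlock above and Lock here.
	if e, ok := r.entries[key]; ok {
		return e
	}

	r.creates++
	e = &entry{id: r.creates}
	r.entries[key] = e
	return e
}

func main() {
	r := &registry{entries: make(map[string]*entry)}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.getOrCreate("client-A")
		}()
	}
	wg.Wait()
	fmt.Println(r.creates) // exactly one allocation despite 100 racing goroutines
}
```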


3️⃣The Distributed Problem ( Why I Needed Redis)

This is the part that makes distributed rate limiting genuinely hard.
Consider a scenario with two servers, each running rate limiting in memory:

Server 1 memory: client has 100 tokens remaining
Server 2 memory: client has 100 tokens remaining

Both have independent memory and a full bucket at this moment.

Client sends 100 requests -> Server 1: Allowed ✅ (100 tokens consumed)
Client sends 100 requests -> Server 2: Allowed ✅ (100 tokens consumed)

Total requests: 200. The limit was: 100 🚨
As we can see, the in-memory approach fails completely at scale. Each server lives in its own bubble.
The solution:
Every CheckLimit call fetches the client's token state directly from Redis, calculates the refill based on the elapsed time, then writes the updated count back. Every server does this against the same Redis database, meaning there's only one shared source of truth regardless of how many instances are running.
One deliberate decision: if Redis goes down, requests are allowed through rather than denied. This is the CAP theorem in action, choosing availability: a Redis outage should not take down the entire service, which would be worse than temporarily bypassing rate limits.
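The fail-open policy boils down to one error-handling decision. Here's a minimal sketch of it; the `Store` interface and method names are my illustrations, not the repo's actual Redis manager API:

```go
package main

import (
	"errors"
	"fmt"
)

// Store abstracts the Redis-backed token state; names here are
// illustrative, not the repo's actual interface.
type Store interface {
	TakeToken(clientID string) (allowed bool, err error)
}

// checkLimit applies the fail-open policy: if the backing store is
// unreachable, allow the request rather than deny it.
func checkLimit(s Store, clientID string) bool {
	allowed, err := s.TakeToken(clientID)
	if err != nil {
		// Redis is down: choose availability over strict limiting.
		return true
	}
	return allowed
}

// downStore simulates an unreachable Redis.
type downStore struct{}

func (downStore) TakeToken(string) (bool, error) {
	return false, errors.New("connection refused")
}

func main() {
	fmt.Println(checkLimit(downStore{}, "client-A")) // true: fail open
}
```

The opposite policy (fail closed, denying everything when Redis is unreachable) would be the right call for something security-critical like login throttling, but for general API quotas availability wins.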


4️⃣ Why gRPC instead of Regular REST API

Most developers default to REST + JSON. I chose gRPC for this service for the following reasons:

  • Speed - Protocol Buffers (binary encoding) parses much faster than JSON text. For a service called on every single API request, this really matters.
  • Type safety - the .proto file defines exactly what fields exist and their types. You cannot accidentally send a string where an integer is expected.
  • Auto-generated code - run one command and you get client/server code in any language. No manual JSON marshaling.

The service definition is also quite simple to write:

service RateLimiter {
    rpc CheckLimit(CheckLimitRequest) returns (CheckLimitResponse);
    rpc GetStatus(GetStatusRequest) returns (GetStatusResponse);
}
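The post doesn't show the message definitions, but a plausible sketch, mirroring what Allow returns in the Go code, might look like this (field names and numbering are my guesses, not the repo's actual schema):

```protobuf
message CheckLimitRequest {
  string client_id = 1;
  int32 tokens_requested = 2;
}

message CheckLimitResponse {
  bool allowed = 1;
  int32 tokens_remaining = 2;
  int64 retry_after_ms = 3;
}
```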

5️⃣ Monitoring: Because Flying Blind Is Not an Option

Tbh, this was my favorite part to build, for some reason.
A distributed system without monitoring is just a Schrödinger's box. So I set up a full observability stack: Prometheus for metrics collection, Grafana for visualization.
Here's the full Grafana dashboard during a live load test:

Grafana Dashboard

Grafana Dashboard- 4 panels:

  1. Requests per Second- sum by (allowed) (rate(rate_limiter_request_total[1m])) — traffic split by allowed vs. denied
  2. Rate Limit Hits- rate_limiter_hits_total — how often clients are actually getting blocked
  3. Active Clients- rate_limiter_active_clients — unique clients being tracked (updated every minute)
  4. Request Duration p95 - histogram_quantile(0.95, ...) — 95th percentile latency

Additional Prometheus metrics:

  • rate_limiter_token_bucket_size — distribution of remaining tokens across clients, great for debugging

  • rate_limiter_redis_operations_total — Redis operation counts by operation type and status

Request per Second

Token Bucket Levels

The p95 latency graph was the most useful during load testing. When it starts climbing, it's a signal that requests are stuck waiting for locks, which tells us exactly where to look for contention.
Grafana dashboards auto-refresh every 5 seconds, so you can watch traffic patterns in real time.


6️⃣Deployment: From Laptop to Production

Local deployment - Docker Compose
One command spins up the entire stack: Redis, the rate limiter, Prometheus, and Grafana.

docker-compose up -d

The Docker image came out to ~55MB using a multi-stage build. The first stage compiles the Go binary. The second stage is a bare Alpine Linux image with just that binary. No Go compiler, no source code, no bloat in production.
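A multi-stage Dockerfile along these lines produces that kind of image. The paths, binary name, and versions here are illustrative, not the repo's actual file:

```dockerfile
# Stage 1: build the binary with the full Go toolchain
FROM golang:1.22-alpine AS builder
WORKDIR /app
COPY . .
RUN CGO_ENABLED=0 go build -o ratelimiter ./cmd/server

# Stage 2: ship only the static binary on a bare Alpine base
FROM alpine:3.19
COPY --from=builder /app/ratelimiter /ratelimiter
ENTRYPOINT ["/ratelimiter"]
```

`CGO_ENABLED=0` keeps the binary statically linked so it runs on the bare base image without needing glibc.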

Production - Kubernetes
This is where the system gets serious. Here are the actual pods running in my minikube cluster:

K9s

The Kubernetes setup includes:

  • 3 replicas by default - if one pod crashes, two others keep serving traffic
  • ConfigMap for configuration - change rate limit settings without rebuilding the Docker image
  • Readiness probes - Kubernetes won't send traffic to a pod until it is actually ready
  • Horizontal Pod Autoscaler - the system scales itself under load
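For reference, an autoscaling/v2 HPA manifest for a setup like this might look as follows. The resource names, replica counts, and CPU threshold are my assumptions, not the repo's actual manifest:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rate-limiter
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rate-limiter
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```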

📊 Load Testing Results:

Requests per second with traffic drop

Request Latency p50/p95/p99

Prometheus

🔮 Future Directions:

This project can be extended in the following ways:

  1. Sliding Window algorithm- More mathematically precise than token bucket for "exactly N requests per minute" guarantees, but significantly more complex to implement.
  2. Admin API - A way to change limits for specific clients at runtime without deploying.
  3. Multi-region Redis- The current setup has one Redis instance; a globally distributed system would need Redis clusters across regions

🔗 Source Code

Distributed-rate-Limiter
The README has full setup instructions, API reference and step-by-step Kubernetes deployment guide.


Wow, if you've honestly made it this far, thank you for bearing with me 😅. This was genuinely fun to build and I learned a lot. The gap between reading about distributed systems and actually wrestling with race conditions is real, and I think this project gave me a better grasp of these concepts.
If you have any suggestions, something new I should explore, or any critiques (which are welcome btw, since I'm still a frog trying to escape the well 🐸), drop a comment, connect with me, or just say hi. I'm still learning and always happy to chat 👇
