I have been working on two distinct web applications over the past few weeks. The first is a voice journaling application where the core feature relies on a third-party AI service for transcription. The second is a platform for managing corporate petty cash. Both require rate limiting, but their threat models are quite different.
For the voice journal, the primary risk is service degradation and cost control. An uncontrolled burst of requests from many concurrent users could flood the AI provider's API. That would violate their terms of service, risk my API key being revoked, and incur unpredictable costs. I needed a strategy to smooth out traffic into a predictable, steady stream.
For the petty cash app, the concern was security. The login endpoint was a prime target for brute-force and credential-stuffing attacks. I needed an inflexible, unforgiving defense to make such attacks impractical.
These issues forced me to move beyond a superficial understanding of rate limiting. A generic "100 requests per minute" limit is naive. An effective strategy requires tailoring the algorithm and identification method to the specific risk you are mitigating. The why dictates the how.
What is Rate Limiting? A Traffic Management Analogy
From a technical standpoint, rate limiting is a control mechanism to regulate the frequency of requests a client can make to a service. When a defined threshold is exceeded, the system rejects subsequent requests, typically with an HTTP 429 Too Many Requests status code, to ensure the service's availability, security, and fairness.
To make this tangible, imagine your API is a high-end supermarket and your server is the cashier. Rate limiting is the manager's job.
For Overuse (The Voice Journal Problem): A tour bus pulls up, and 50 shoppers rush into the store all at once, each trying to check out immediately. A naive manager lets them all line up at the same checkout, overwhelming the cashier (your server), slowing down everyone, and possibly crashing the system. A smart manager (rate limiter) steps in and says:
"Please queue up, only 1 customer per register at a time."
Now, people check out in an orderly fashion. The service remains stable, and each shopper eventually gets their turn. This is throttling traffic to ensure a smooth customer experience.
For Security (The Petty Cash Problem): One shady customer comes in and tries to pay using a different fake credit card every 10 seconds. Instead of letting them keep trying until they hit something valid, the manager notices the pattern. After 3 suspicious attempts, the customer is kicked out and blacklisted. This is rate limiting for security purposes—detecting abusive patterns and cutting them off early to protect the system and honest shoppers.
The goal is not just to prevent failure but to enforce a predictable and safe operational envelope.
The "Who": Why IP-Based Limiting is an Anti-Pattern
When I first started, my instinct was to limit by IP address. It's easy; it's right there in the request. But this is a rookie mistake.
The Problem with IP Limiting: Imagine a whole office building or a university campus sharing a single public IP address. If you block that IP because of one person's misbehaviour, you've just blocked hundreds of legitimate users. You penalize the many for the sins of the one.
IPs are also easy to change. Attackers know this, and they exploit it. They use botnets with thousands of different IPs to fly under the radar. An IP-based limit won't stop them.
So, we need better, more precise identifiers:
User ID: This was the perfect identifier for my voice journaling app. Since users have to be logged in, I can tie the rate limit directly to their account. This is the fairest method because the limit follows the user, not their device or network. It ensures every user gets the same fair usage quota.
API Key: This is ideal for B2B or multi-tenant services. Each customer gets a key. If they abuse the service, I can limit that specific key without affecting anyone else.
Device Fingerprint: This is the advanced option. By analyzing a device's unique combination of browser, OS, and hardware signals, you can identify the machine itself. This is incredibly powerful for stopping sophisticated attackers who rotate IPs and user accounts.
The Lesson: The identifier should be as close to the end-user entity as possible.
| Identifier | Technical Use Case | Simple Reason |
|---|---|---|
| IP Address | Unauthenticated, low-security endpoints | Unreliable and unfair. A weak first line of defense. |
| User ID | Authenticated, user-facing applications | The fairest method. The limit is tied to the person. |
| API Key | Multi-tenant SaaS, developer APIs | Finer control per customer; enables monetization. |
| Device Fingerprint | High-security endpoints (login, payments) | Targets the attacker's machine, not just their credentials. |
The "How": Matching the Algorithm to the Threat
Once you know who to limit, you must decide how. Different algorithms offer different trade-offs in flexibility, precision, and performance.
Strategy 1: The Token Bucket (For Flexible Throttling)
This was the perfect fit for my voice journaling app. My goal was to allow about 10 requests per minute per user, ensuring there was a natural pause between calls to avoid hammering the AI service.
How it works:
- Every user gets a "bucket" with a small capacity (e.g., 3 tokens)
- To make an API call, they must "spend" a token from their bucket
- The bucket gets refilled at a slow, constant rate (e.g., 1 token every 6 seconds)
- If the bucket is empty, the user must wait for more tokens to be added
This is perfect because it allows a small initial burst (three quick API calls) but then forces a slower, steady pace. It naturally creates the behavior I want.
Let's visualize how this works for a single user:
- The Starting Line: Our user starts with a full bucket containing 3 tokens. The bucket can't hold more than 3.
- Making a Burst of Requests: The user quickly uploads a journal entry. Each upload makes 3 API calls, and each call costs a token, emptying the bucket.
- The Limit is Reached: The bucket is now empty. The user tries to make a 4th request immediately, but the system denies it.
- The Slow Refill: The system is fair. After a 6-second pause, a single token is dripped back into the bucket, allowing the user to make another call.
This cycle ensures that over a minute, the user can make about 10 requests (60 seconds ÷ 6 seconds per token = 10 tokens), but they can't make them all at once. This approach perfectly protects my upstream AI provider while ensuring a fair and predictable experience for my users.
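The bucket behavior described above can be sketched as a minimal in-memory class. This is my own illustration, not a production implementation: the class and parameter names are invented, and a real deployment would keep the state in a shared store like Redis rather than in process memory.

```python
import time

class TokenBucket:
    """Per-user token bucket: capacity 3, refilling 1 token every 6 seconds (~10/min)."""

    def __init__(self, capacity=3, refill_seconds=6.0):
        self.capacity = capacity
        self.refill_seconds = refill_seconds
        self.tokens = float(capacity)       # start with a full bucket
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Drip tokens back in proportion to elapsed time, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed / self.refill_seconds)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket()
burst = [bucket.allow() for _ in range(4)]  # burst of 4: 3 allowed, 4th denied
```

Note that the refill is computed lazily from elapsed time on each call, so no background timer is needed; this is a common way to implement token buckets.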
Strategy 2: Sliding Window Counter (For Strict Security)
This was the clear winner for my petty cash app's login endpoint. Here, I don't care about flexibility. I care about security. I need a hard, unforgiving limit on failed login attempts.
How it works:
- Track the timestamps of requests from a source over a rolling window (e.g., the last 15 minutes)
- Count only the requests that fall within that trailing window
- If the count exceeds the limit (e.g., 5 failed logins), block all further requests until enough old attempts age out of the window
Rolling Window: last 15 minutes
Failed Login Attempts: 1, 2, 3, 4, 5 ❌ BLOCKED
Next attempt allowed once the oldest failure ages out of the window
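One common way to implement a sliding window is to keep a log of recent timestamps and evict the ones that have aged out. The sketch below is my own minimal version (names are illustrative); at scale you would use approximate counters in Redis instead of an in-memory deque.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `limit` attempts within the trailing `window_seconds`."""

    def __init__(self, limit=5, window_seconds=900):  # 5 attempts per 15 minutes
        self.limit = limit
        self.window = window_seconds
        self.events = deque()  # timestamps of recent attempts

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Evict attempts that have slid out of the trailing window.
        while self.events and now - self.events[0] > self.window:
            self.events.popleft()
        if len(self.events) >= self.limit:
            return False
        self.events.append(now)
        return True

limiter = SlidingWindowLimiter()
results = [limiter.allow(now=t) for t in range(6)]  # 6 rapid attempts
```

Because old attempts age out continuously rather than resetting all at once, an attacker cannot get a fresh quota simply by waiting for a window boundary.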
For brute-force protection, I combined two identifiers: the User ID and the IP Address.
- Limit per User ID: 5 failed attempts per 15 minutes. This stops an attacker from hammering a single known account.
- Limit per IP Address: 20 failed attempts per 15 minutes. This stops an attacker from trying thousands of different usernames from a single machine.
This dual-key strategy is incredibly effective. It's strict, simple, and directly addresses the threat.
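The dual-key idea can be sketched as two independent counters consulted on every failed login. This is a simplified illustration (class and threshold names are mine, and it ignores window expiry for brevity); the thresholds follow the text above.

```python
from collections import defaultdict

class DualKeyLoginLimiter:
    """Blocks when EITHER the per-user or the per-IP failure count trips."""

    def __init__(self, per_user=5, per_ip=20):
        self.per_user = per_user
        self.per_ip = per_ip
        self.user_fails = defaultdict(int)
        self.ip_fails = defaultdict(int)

    def is_blocked(self, user_id, ip):
        return (self.user_fails[user_id] >= self.per_user
                or self.ip_fails[ip] >= self.per_ip)

    def record_failure(self, user_id, ip):
        self.user_fails[user_id] += 1
        self.ip_fails[ip] += 1

limiter = DualKeyLoginLimiter()
for _ in range(5):
    limiter.record_failure("alice", "203.0.113.7")

alice_blocked = limiter.is_blocked("alice", "203.0.113.7")  # per-user limit tripped
bob_blocked = limiter.is_blocked("bob", "203.0.113.7")      # same IP, only 5 of 20
```

The per-user key stops a focused attack on one account; the per-IP key stops one machine from spraying attempts across many accounts.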
Architecting for Defense in Depth
┌──────────────────────────────┐
│ The Internet │
└────────────┬─────────────────┘
│
┌──────────────▼──────────────┐
│ Edge (CDN / WAF) │
│ ─ Broad IP-based limiting │
│ ─ DDoS protection │
│ ─ Services: Cloudflare, AWS │
└──────────────┬──────────────┘
│
┌──────────────▼──────────────┐
│ API Gateway │
│ ─ Enforces most limits │
│ ─ Token bucket per API key │
│ ─ JWT / User-based logic │
│ ─ Services: Kong, Nginx │
└──────────────┬──────────────┘
│
┌──────────────▼──────────────┐
│ Application Layer │
│ ─ Security-critical limits │
│ ─ Brute-force protection │
│ ─ Uses full app context │
│ ─ Sliding window counters │
└─────────────────────────────┘
Rate limiting shouldn't be a single function call in your code. It should be a layered defense throughout your architecture.
The Edge (CDN/WAF): This is your outermost layer. Use services like Cloudflare or AWS WAF for broad, IP-based limiting. Their job is to absorb large-scale DDoS attacks before they ever reach your application infrastructure.
The API Gateway (e.g., Kong, Nginx): This is the central enforcement point for most of your business logic. The gateway is the perfect place to implement your Token Bucket strategy based on API keys or user tokens (like JWTs). It protects all your backend services.
The Application: This is the deepest layer. Implement highly specific, security-critical limits directly in your application code. Your brute-force protection on the login controller, using the Sliding Window algorithm, belongs here. This layer has the full application context to make the most precise decisions.
Monitoring and Alerting: Your Rate Limiting Radar
A rate limiting system without monitoring is like driving blindfolded. You need visibility into what's happening and automated alerts when things go wrong.
Key Metrics to Track
Rate Limit Hit Rate: What percentage of requests are being blocked? A sudden spike might indicate an attack or a misconfigured client.
Top Rate-Limited Sources: Which users, IPs, or API keys are hitting limits most frequently? This helps identify problem patterns.
False Positive Rate: Are legitimate users getting blocked? Monitor support tickets and user complaints that correlate with rate limiting events.
Resource Consumption: How much CPU and memory is your rate limiting consuming? Complex algorithms can become bottlenecks themselves.
Essential Alerts
Set up alerts for:
- Rate limit hit rate exceeding 10% (possible attack)
- Single source hitting limits repeatedly (targeted investigation needed)
- Rate limiting service failures (your protection is down)
- Unusual patterns in blocked requests (new attack vectors)
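The first of these alerts reduces to a simple ratio check. A hypothetical sketch (the function name and threshold parameter are my own; in practice this would run inside your metrics pipeline):

```python
def attack_suspected(total_requests, blocked_requests, hit_rate_threshold=0.10):
    """Flag a possible attack when the blocked share of traffic crosses the threshold."""
    if total_requests == 0:
        return False
    hit_rate = blocked_requests / total_requests
    return hit_rate > hit_rate_threshold

alert = attack_suspected(total_requests=1000, blocked_requests=150)  # 15% blocked
```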
Implementation Tip
Log every rate limit decision with structured data:
{
  "timestamp": "2025-06-20T10:30:00Z",
  "identifier": "user:12345",
  "endpoint": "/api/transcribe",
  "action": "blocked",
  "limit": 10,
  "current_count": 11,
  "reset_time": "2025-06-20T10:31:00Z"
}
This gives you the data you need for both real-time alerting and post-incident analysis.
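Emitting that record is a one-liner with standard library logging. A minimal sketch, assuming you pass the fields in from your limiter (the function name is my own):

```python
import json
import logging

logger = logging.getLogger("ratelimit")

def log_decision(identifier, endpoint, action, limit, current_count,
                 reset_time, timestamp):
    """Emit one structured JSON line per rate-limit decision."""
    record = {
        "timestamp": timestamp,
        "identifier": identifier,
        "endpoint": endpoint,
        "action": action,
        "limit": limit,
        "current_count": current_count,
        "reset_time": reset_time,
    }
    logger.info(json.dumps(record))
    return record

entry = log_decision("user:12345", "/api/transcribe", "blocked",
                     10, 11, "2025-06-20T10:31:00Z", "2025-06-20T10:30:00Z")
```

One JSON object per line keeps the log trivially parseable by whatever alerting tool you feed it into.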
Don't Be Rude: Communicate Your Limits!
Finally, when you do have to rate limit someone, don't leave them guessing: communicate!
Return an HTTP 429 Too Many Requests status code. Don't just send a generic 400 or 500.
Use response headers to tell the developer (or your own front-end) what's happening:
- `X-RateLimit-Limit`: The total number of requests they can make
- `X-RateLimit-Remaining`: How many requests they have left
- `X-RateLimit-Reset`: When their window resets (as a timestamp)
- `Retry-After`: How many seconds they should wait before trying again
This turns a frustrating error into an actionable, predictable experience.
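Putting the status code and headers together looks roughly like this, independent of any particular web framework (the helper name is my own invention):

```python
import time

def rate_limit_response(limit, remaining, reset_epoch):
    """Build a 429 status, the standard rate-limit headers, and a small body."""
    retry_after = max(0, int(reset_epoch - time.time()))
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(int(reset_epoch)),
        "Retry-After": str(retry_after),
    }
    return 429, headers, {"error": "Too Many Requests"}

status, headers, body = rate_limit_response(limit=10, remaining=0,
                                            reset_epoch=time.time() + 30)
```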
The Takeaway
Effective rate limiting is a mark of mature system design. It's a refined discipline that requires moving beyond simple approaches. My experience with two contrasting applications taught me that the key principles are:
Identify Wisely: Base your limits on an identifier as close to the user entity as possible (User ID, API Key). Avoid IP addresses for application logic.
Match the Algorithm to the Risk: Use flexible algorithms like Token Bucket for throttling and fair use. Use strict, precise algorithms like Sliding Window Counter for security-critical endpoints.
Build in Layers: Implement defense in depth, enforcing different policies at the edge, the gateway, and within the application itself.
Monitor Everything: Rate limiting without visibility is ineffective. Track metrics, set up alerts, and learn from the data.
By treating rate limiting as a core architectural component, you can build systems that are not only secure and resilient but also fair and predictable for your users.