## Three Systems, Not One
"Rate limiting" gets used as a catch-all for anything that rejects or slows down requests. But there are actually three distinct mechanisms, each protecting against a different failure mode, each asking a different question:
| Mechanism | Question it asks | What it protects |
|---|---|---|
| Load shedding | "Is this server healthy enough to handle ANY request?" | The server from itself |
| Rate limiting | "Is THIS CALLER sending too many requests?" | The system from abusive callers |
| Adaptive throttling | "Is the DOWNSTREAM struggling right now?" | Downstream services from this server |
A rate limiter won't save you when your server is OOM-ing — every user is within their quota, the server is just dying. Load shedding won't stop one customer from consuming 80% of your capacity — total concurrency is fine, the distribution is unfair. And neither will prevent you from hammering a downstream service that's already struggling.
These are complementary systems. Treating them as one thing — or building only one of the three — leaves gaps that show up exactly when you need protection most.
## The Three Layers
Each layer asks a different question. Each protects a different thing.
Layer 1 — Load Shedding protects this server from itself. Is memory pressure too high? Are there too many concurrent requests? Did a downstream just return RESOURCE_EXHAUSTED? If any of these are true, reject immediately — doesn't matter who the user is, doesn't matter what the request is. The building is at capacity.
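A minimal sketch of such a gate, using just a concurrency counter (the class name, the method names, and the 1000-request ceiling are illustrative; a real shedder would also watch memory pressure and GC flags):

```python
import threading

class LoadShedder:
    """Layer 1 sketch: reject everything once this server is past capacity."""

    def __init__(self, max_concurrent: int = 1000):
        self.max_concurrent = max_concurrent
        self.in_flight = 0
        self.lock = threading.Lock()

    def try_enter(self) -> bool:
        # One counter check -- no per-user lookup, no network round-trip.
        with self.lock:
            if self.in_flight >= self.max_concurrent:
                return False  # shed: the building is at capacity
            self.in_flight += 1
            return True

    def leave(self) -> None:
        # Called when the request finishes, success or failure.
        with self.lock:
            self.in_flight -= 1
```

Note that the check never asks who the caller is; that is exactly what makes it cheap enough to run first.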
Layer 2 — Rate Limiting protects the system from abusive users. Is this specific user, API key, or IP address sending more than their allowed share? This is the classic rate limiter — per-user counters, sliding windows, token buckets.
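The token bucket mentioned above fits in a few lines. Everything here is illustrative (class name, the 10 req/s rate, the burst of 5, keying by API key), and it assumes a single process; a multi-server deployment would keep these counters somewhere shared, such as Redis:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Layer 2 sketch: one bucket per caller, refilled continuously."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate    # tokens added per second
        self.burst = burst  # maximum bucket size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per API key, created lazily on first use.
buckets = defaultdict(lambda: TokenBucket(rate=10, burst=5))
```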
Layer 3 — Adaptive Throttling protects downstream services from this server. The server tracks its success rate when calling each downstream. If 20% of calls to the payment service are failing, it starts probabilistically dropping 20% of outbound calls — giving the payment service breathing room to recover.
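One common way to implement this is client-side adaptive throttling in the style popularized by Google's SRE book: track requests and accepts per downstream, and reject outbound calls with probability `max(0, (requests − K·accepts) / (requests + 1))`. The sketch below assumes K = 2 and simple lifetime counters; a production version would use a sliding window:

```python
import random

class AdaptiveThrottle:
    """Layer 3 sketch: probabilistically drop outbound calls as the
    downstream's accept rate falls. K and the counters are assumptions."""

    def __init__(self, k: float = 2.0):
        self.k = k
        self.requests = 0  # outbound calls attempted
        self.accepts = 0   # calls the downstream accepted

    def should_drop(self) -> bool:
        # Zero while the downstream accepts most traffic; climbs toward 1
        # as failures mount, shedding load before it leaves this server.
        p = max(0.0, (self.requests - self.k * self.accepts) / (self.requests + 1))
        return random.random() < p

    def record(self, accepted: bool) -> None:
        self.requests += 1
        if accepted:
            self.accepts += 1
```

With K = 2, no calls are dropped until the downstream rejects more than half of what it receives, which avoids throttling on harmless transient blips.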
## Why the Order Matters
Layer 1 runs at the highest priority — before authentication, before request parsing, before anything. Here's why:
If rate limiting (Layer 2) runs first, the server spends CPU checking Redis counters, computing sliding window math, and doing per-user lookups. Then it reaches Layer 1, which says "actually, the server is dying, reject everything." All that rate-limit computation was wasted on a request you were going to reject anyway.
Load shedding is cheap — one atomic counter check or one GC flag read. It takes microseconds. Rate limiting might require a Redis round-trip. Run the cheap check first.
Think of it like a nightclub. The fire marshal at the door (load shedding) doesn't check your ID. "Building is at capacity. Nobody gets in." Only if the building isn't full does the bouncer (rate limiter) check your guest list.
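The ordering can be sketched as a single entry point. The two check functions are hypothetical stand-ins: `server_is_overloaded` is the cheap Layer 1 check (an atomic counter or GC flag), and `caller_over_limit` is the expensive Layer 2 check (per-user counters, possibly a Redis round-trip):

```python
def handle_request(server_is_overloaded, caller_over_limit, request):
    # Fire marshal first: no per-user work if the building is full.
    if server_is_overloaded():
        return 503
    # Only then does the bouncer check the guest list.
    if caller_over_limit(request["api_key"]):
        return 429
    return 200  # hand off to the real handler
```

When the server is overloaded, the per-caller check never runs at all, so the microseconds-cheap path is the only one exercised under duress.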
## What Each Layer Catches That The Others Miss
- A bad deployment causes OOM. Your new ML model eats 3x the expected memory. Layer 1 sees GC pressure spike and starts shedding. Layer 2 is blind — every user is within their rate limit. Layer 3 is blind — the downstream is fine. Without load shedding, you're relying on Kubernetes to OOM-kill the pod and restart it, which takes 30-60 seconds of full outage.
- One customer sends 10x their normal traffic. A migration script gone wrong. Layer 2 catches it immediately — their per-user counter crosses the threshold. Layer 1 might eventually catch it (if the extra traffic pushes overall concurrency past the limit), but it can't distinguish "one bad user" from "legitimate traffic spike." Layer 3 is blind — the downstream doesn't know which user caused the load.
- A downstream payment service enters degraded state. It accepts 60% of requests, returns RESOURCE_EXHAUSTED on the rest. Layer 3 sees the failure rate climb and starts probabilistically dropping outbound calls — giving the payment service room to breathe. Layer 1 catches the RESOURCE_EXHAUSTED responses and triggers a reactive backoff. Layer 2 is completely blind — users are within their limits, the problem is downstream.
- A DDoS hits your API. Thousands of IPs, each sending moderate traffic. Layer 1 catches it (total concurrency spikes). Layer 2 catches it (per-IP limits hit). Layer 3 is blind — this is an inbound problem, not outbound. Both layers contribute, but neither alone is sufficient — the DDoS might stay under per-IP limits while overwhelming total capacity, or it might come from one IP but stay under concurrency limits.
- A slow dependency causes thread pool exhaustion. A database query that usually takes 5ms starts taking 2 seconds. Threads pile up waiting. Layer 1 sees concurrent request count spike toward the limit. Layer 3 would catch it if the dependency returned errors, but slow responses aren't errors — the threads just accumulate silently. Layer 2 is blind. This is the scenario where load shedding saves you — it's the only layer watching the server's actual resource consumption.
No single layer handles everything. That's the point. They're complementary, not redundant.
If Layer 2 has a bug or Redis goes down, Layer 1 still protects the server from overload. If Layer 1's threshold is set too high, Layer 2 still limits abusive users. If both fail, Layer 3 at least prevents a cascade into downstream services.
Defense in depth. Not defense in one.
## Layer 2 Has Two Personalities: Reject or Delay
Rate limiting (Layer 2) isn't one tool — it's two, with opposite behaviors.
Rejection says "no." The request is over the limit. Return 429. The caller deals with it.
Delay says "wait." The request is over the limit, but instead of rejecting it, hold it in a queue and release it when the rate allows. The caller doesn't even know it was throttled — just that the response was a bit slow.
Same goal (enforce a rate), completely different experience for the caller.
The question is: when do you reject, and when do you delay?
- Reject when someone external is holding a connection. A user called your API. Their HTTP connection is open. If you delay them, you're holding that connection — which means a thread, a socket, memory. Delay 500 users and you've exhausted your connection pool. Now legitimate users who are under the limit can't get a connection. Your rate limiter just caused an outage for good users by being too nice to bad ones. Reject fast. Free the connection. Let the client's retry logic handle it.
- Delay when your own system needs the request to succeed. You're calling Stripe's payment API. You know their limit: 100 requests per second. The 101st request doesn't need to fail — it just needs to wait 10 milliseconds for the next second's budget. If you reject it instead, you need retry logic, backoff timers, dead letter queues, monitoring for the retries — an entire infrastructure to handle a problem that "just wait" solves.
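A delay-style limiter is just a blocking acquire that paces callers to the budget. This is a minimal single-threaded sketch (class and method names are hypothetical; a concurrent client would need a lock around the slot bookkeeping):

```python
import time

class PacedClient:
    """Delay sketch: block until the next slot in the rate budget opens,
    instead of rejecting the call."""

    def __init__(self, rate_per_sec: float):
        self.interval = 1.0 / rate_per_sec
        self.next_slot = time.monotonic()

    def acquire(self) -> None:
        # Sleep (possibly for 0s) until our reserved slot, then reserve
        # the next one. At 100 req/s the wait is at most ~10ms per call.
        now = time.monotonic()
        if self.next_slot > now:
            time.sleep(self.next_slot - now)
        self.next_slot = max(self.next_slot, now) + self.interval
```

Call `acquire()` before each outbound request and every call goes through, just slightly later; the caller never sees an error.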
Five scenarios to build the intuition:

- Your public API gets a burst from a customer. Reject. Return 429 instantly. The customer's SDK has built-in retry with exponential backoff. Your server processed the rejection in microseconds and moved on. If you delayed instead, 500 connections held open, connection pool starved, outage for everyone.
- You're sending 50,000 marketing emails through SendGrid. Delay. SendGrid allows 500/sec. Queue all 50,000, drip them at 500/sec. Takes 100 seconds, every email delivered. If you rejected instead, 49,500 emails bounced in the first second. Now you need a dead letter queue and retry scheduling for a problem that "wait your turn" solves completely.
- Your gRPC server receives internal traffic from an upstream service. Reject. Return RESOURCE_EXHAUSTED. The upstream's adaptive throttler (Layer 3 on their side) sees the error and automatically backs off. The system self-heals. If you delayed instead, the upstream's gRPC deadline expires while its request sits in your queue. Timeout errors are worse than clean rejections — the upstream can't tell "server is slow" from "I'm being rate limited."
- A batch job scrapes 10,000 records from a partner API nightly. Delay. Partner allows 50 req/sec. Pace it perfectly — 3.3 minutes, all requests succeed, partner never sees a spike. If you rejected instead, 9,950 requests fail immediately, retry logic fires, you hammer the partner for 20 minutes instead of a clean 3-minute crawl.
- A user calls your payment endpoint during checkout. Reject. The user is staring at a button that says "Pay Now." A 200ms rejection with a "please try again" message is infinitely better than a 5-second delay where they think the page froze, hit refresh, and trigger a duplicate payment.
The rule is simple: reject when someone is waiting for the connection. Delay when you can afford to wait.