The “PLEASE CHILL” Pattern Your Services Desperately Need
Imagine your service is a tiny café.
Most days it’s fine. A few customers, some coffee orders, a little latency but nothing dramatic.
Then one day you get featured on Hacker News. Suddenly 10,000 people show up, all yelling GET /coffee at the same time.
Options:
- You try to serve everyone → kitchen melts, nobody gets coffee.
- You shut the door and deny everyone → users rage, business dies.
- You let people in at a controlled rate → some wait, some get “come back later,” the kitchen keeps working.
That third one is throttling.
In distributed systems, throttling is how we tell clients:
“You’re not wrong, you’re just early.”
Let’s unpack what throttling really is, how it differs from plain rate limiting, and how to design it cleanly in large systems.
Throttling vs Rate Limiting vs “Just Autoscale It”
These terms get mixed a lot, so let’s carve out some boundaries.
Rate limiting (from the client’s perspective)
Rate limiting is usually about enforcing a policy on how many requests a client is allowed to make over some time window:
“User A can hit /search at most 100 requests per minute.”
If the client exceeds that, we reject the extra requests (often with HTTP 429 Too Many Requests). Rate limiting is often part of API gateways and public-facing APIs.
Throttling (from the system’s perspective)
Throttling is the system saying:
“Given my current resources, I’ll only process this many things right now.”
It’s not only about fairness across clients, but also about protecting dependencies, keeping latency under control, and staying alive under chaos.
Throttling might:
- Slow you down (queue or delay requests),
- Reject you outright,
- Or downgrade what you get (fallbacks, cached/stale responses).

Rate limiting is often policy-first (“free users get 10 requests/sec”). Throttling is often health-first (“the DB is unhappy, so we’ll aggressively shed load until it recovers”).
Why “just autoscale” isn’t enough
Autoscaling is great, but:
- It’s slow compared to traffic spikes.
- Some resources don’t scale linearly (databases, legacy systems, third-party APIs).
- You pay for overprovisioning.
- There’s always a ceiling where more machines don’t help.
Throttling is your first line of defense even in a fully auto-scaled world. Azure’s architecture docs explicitly recommend throttling as a complement/alternative to scaling when resources are finite.
What Exactly Are We Throttling?
“Throttling” isn’t just “requests per second.” You can throttle pretty much any scarce thing:
- RPS per client: 100 req/min per API key, IP, user, tenant.
- Global RPS: 50k req/sec across the service.
- Concurrent work: max 500 in-flight queries to a DB, max 200 open HTTP connections to a dependency.
- Resource usage: CPU, memory, I/O bandwidth, number of Kafka partitions you read from at once, etc.
- Per-resource quotas: particular endpoints, particular queues, particular features.
You also have to decide:
- Who gets limited? Per-API key, per-user, per-tenant, per-region, per-service, per-IP…
- Where does it happen? Client SDK, API gateway, service layer, database gatekeeper, background worker pool.
- What happens to excess? Drop immediately, queue, delay, or degrade.
Throttling is half policy, half plumbing.
Core Throttling Algorithms (Without Hand-Wavy Math)
Let’s go through the usual suspects in human terms.
1. Fixed Window Counter
Policy: N requests per time window (say 100 req/min).
Implementation idea:
- Maintain a counter per key (like user_id).
- Each minute, reset the counter.
- If the count for this minute > 100 → reject.
It’s simple and fast but has a nasty edge case:
- User sends 100 requests at 12:00:59 and 100 at 12:01:01 → effectively 200 in ~2 seconds.
Works fine for many systems, but not ideal if you care about burst control.
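As a sketch, a single-process fixed window counter is just a dict of counters keyed by (client, window id). The class and parameter names here are illustrative, not from any particular library:

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Allow at most `limit` requests per key per `window`-second window."""

    def __init__(self, limit=100, window=60):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)  # (key, window_id) -> request count

    def allow(self, key):
        # The window id changes every `window` seconds, which is what
        # "resets" the counter: a new window means a fresh dict entry.
        window_id = int(time.time() // self.window)
        bucket = (key, window_id)
        if self.counts[bucket] >= self.limit:
            return False
        self.counts[bucket] += 1
        return True
```

In a distributed setup, the same idea is commonly implemented with a shared store such as Redis (an atomic increment plus an expiry per window) rather than in-process memory.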
2. Sliding Window (More Fair, Slightly More Work)
Same rule: 100 requests per minute, but we treat it as “in the last 60 seconds” instead of “in this calendar minute.”
Implementation variants:
- Sliding log: store timestamps for each request, prune anything older than 60s, count the rest.
- Sliding window with buckets: split the minute into smaller buckets (e.g., 10 x 6-second buckets) and sum them.
Both are smoother and safer than a fixed window, but you trade off memory (for the log) or precision (for the buckets).
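A minimal sliding-log sketch, with illustrative names (`now` is injectable here only to make the behavior easy to demonstrate deterministically):

```python
import time
from collections import defaultdict, deque

class SlidingLogLimiter:
    """Allow at most `limit` requests per key in the trailing `window` seconds."""

    def __init__(self, limit=100, window=60.0):
        self.limit = limit
        self.window = window
        self.logs = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        log = self.logs[key]
        # Prune timestamps that have fallen out of the trailing window.
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True
```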
3. Token Bucket (let bursts through, but only occasionally)
Token Bucket is the workhorse of modern rate limiting and throttling.
Think of it as:
- A bucket that holds at most capacity tokens.
- Tokens drip in at some rate (r tokens/second).
- Each request consumes one token.
- If there are no tokens, you reject or delay the request.
Properties:
- Allows short bursts (up to capacity) if the client has been idle.
- Enforces a long-term average rate (r).
- Easy to implement on top of shared stores like Redis or DynamoDB, since the state per key is tiny.
Azure’s ARM throttling uses a token bucket model at regional level to enforce limits while allowing some burstiness.
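A lazy-refill token bucket needs only two pieces of state per key: the current token count and the time of the last refill. A minimal single-process sketch (illustrative names; `now` is injectable for demonstration):

```python
import time

class TokenBucket:
    """Refill `rate` tokens/sec up to `capacity`; each request spends one token."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full: an idle client may burst
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Lazily add the tokens that dripped in since the last call,
        # capped at capacity so idle time can't bank unlimited burst.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The refill-on-read trick is why this maps so well to a shared store: there is no background timer, just a small read-modify-write per request.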
4. Leaky Bucket (smooths spikes aggressively)
Leaky Bucket is like a queue with a fixed drain rate:
- Requests enter a queue (the bucket).
- The system processes them at a constant rate (the leak).
- If the bucket fills up → drop or reject new arrivals.
This is great for protecting a fragile downstream:
“We’ll never send more than 200 writes/sec to this database, full stop.”
It’s more about smoothing than fairness.
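Treated as a meter rather than a literal queue, a leaky bucket tracks a “water level” that drains at a fixed rate; arrivals that would overflow the bucket are rejected. A sketch with illustrative names:

```python
class LeakyBucket:
    """Hold up to `capacity` pending items; drain at a fixed `rate` items/sec."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate          # constant drain rate (items/sec)
        self.capacity = capacity  # max pending items before we drop
        self.level = 0.0          # current "water level"
        self.last = now

    def offer(self, now):
        # Drain whatever leaked out since the last arrival.
        self.level = max(0.0, self.level - (now - self.last) * self.rate)
        self.last = now
        if self.level + 1 > self.capacity:
            return False  # bucket full: drop or reject the new arrival
        self.level += 1
        return True
```

Note the contrast with the token bucket: here a full bucket means “too much pending work, reject,” so the downstream never sees more than the drain rate on average, and never a burst larger than `capacity`.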
5. Concurrency Limits (semaphores in a trench coat)
Sometimes RPS isn’t the right lever. You care about how many operations are currently in flight.
Classic pattern:
- Wrap access to a resource with a semaphore of size N.
- Each request acquire()s a slot before calling the dependency.
- When done, release().
If no slot is available:
- Either queue until one frees up, or
- Fail fast and tell the caller to back off.
This is common for DB pools, file I/O, CPU-heavy tasks, and in thread-pool-based throttling patterns.
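With Python's threading primitives, the fail-fast variant is only a few lines. The `ConcurrencyLimiter` wrapper below is an illustrative sketch, not a library API:

```python
import threading

class ConcurrencyLimiter:
    """Cap the number of in-flight calls to a dependency at `max_in_flight`."""

    def __init__(self, max_in_flight=500):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def run(self, fn, *args, timeout=0.0):
        # timeout=0.0 means fail fast; a positive timeout queues briefly.
        if not self._slots.acquire(timeout=timeout):
            raise RuntimeError("throttled: too many in-flight calls")
        try:
            return fn(*args)
        finally:
            self._slots.release()
```

Usage: `limiter.run(db.query, sql)` instead of `db.query(sql)`. The caller catches the throttle error and backs off, exactly the "fail fast" branch above.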
Coming up in Part 2
This wraps up the core theory: what throttling is, how it differs from rate limiting, and the main algorithms (fixed/sliding windows, token bucket, leaky bucket, concurrency limits) that power it.
In Part 2, we’ll plug this into real-world architecture: where to place throttling in a distributed system, how to combine it with circuit breakers and load shedding, what actually happens at runtime (429s, backoff, queues), and a concrete design for a distributed throttling service.