Who am I?
Hi, I'm ttatsato. I'm currently developing an open-source API Gateway that sits in front of an API server and lets developers easily manage plan-based rate limiting, API Key issuance, authentication, and usage tracking — all by simply "plugging it in" to an existing server.
I believe public APIs will become even more essential as AI agents become the primary way we interact with SaaS, so I want this Gateway to be both robust and highly flexible. In this article, I explore which rate-limiting algorithm best fits that goal, alongside a review of the fundamental concepts.
Types of Rate Limiting Algorithms
Fixed Window
The fixed window algorithm resets its counter at specific time intervals.
Suppose we establish a rule of 100 requests per second. The system counts all incoming requests within the one-second window from 10:00:00.000 to 10:00:00.999, then resets at 10:00:01.000.
Pros
- The concept and implementation are exceptionally straightforward — just count requests per interval.
- Very low memory footprint.
Cons
- The primary drawback is the boundary spike problem.
- For example, if 100 requests arrive at `10:00:00.999` and another 100 at `10:00:01.000`, the system effectively serves 200 requests within a single second, doubling the intended limit.
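A minimal in-memory sketch (plain Python with illustrative names; a production gateway would keep this counter in Redis) makes the boundary spike concrete:

```python
import math

class FixedWindowLimiter:
    """Allow at most `limit` requests per fixed `window`-second interval."""

    def __init__(self, limit: int, window: float = 1.0):
        self.limit = limit
        self.window = window
        self.counts: dict[int, int] = {}  # window index -> request count

    def allow(self, now: float) -> bool:
        idx = math.floor(now / self.window)  # which fixed window `now` falls in
        self.counts[idx] = self.counts.get(idx, 0) + 1
        return self.counts[idx] <= self.limit

# Boundary spike: 100 requests at t=0.999 and 100 more at t=1.000 are all
# accepted because they land in two different windows, so 200 requests
# pass inside one real-time second.
limiter = FixedWindowLimiter(limit=100)
accepted = sum(limiter.allow(0.999) for _ in range(100))
accepted += sum(limiter.allow(1.000) for _ in range(100))
# accepted == 200
```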
Sliding Window Log
Sliding Window Log records the precise timestamp of every incoming request.
- Store a record each time a request is received.
- Retain only the logs within the specified timeframe from the current moment, and purge older entries.
Pros
- Exceptional precision. Unlike Fixed Window, it completely eliminates the boundary problem.
Cons
- Storing every timestamp puts significant pressure on memory. At 1,000 req/s per user with a 1-minute window, each user costs ~480 KB just for raw timestamps (8 bytes × 60,000). In practice the real footprint is several times larger: a Redis Sorted Set adds 40–80 bytes of skiplist and hash overhead per element, easily pushing a single user past 1 MB. Multiply by user count for the system-level impact.
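For comparison, here is a sketch of the log-based variant (again in-memory Python, illustrative names only). Note how the boundary spike from Fixed Window disappears:

```python
from collections import deque

class SlidingWindowLog:
    """Allow at most `limit` requests in any trailing `window` seconds."""

    def __init__(self, limit: int, window: float = 1.0):
        self.limit = limit
        self.window = window
        self.log: deque[float] = deque()  # timestamps of accepted requests

    def allow(self, now: float) -> bool:
        # Purge entries that have fallen out of the trailing window.
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False

# The 100 requests logged at t=0.999 are still inside the trailing window
# at t=1.000, so the second burst is rejected entirely.
log = SlidingWindowLog(limit=100)
first = sum(log.allow(0.999) for _ in range(100))   # 100 accepted
second = sum(log.allow(1.000) for _ in range(100))  # 0 accepted
```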
Sliding Window Counter
Sliding Window Counter keeps two adjacent fixed-window counters — the current window and the immediately preceding one — and approximates the true sliding-window count by weighting the previous window by how much of it still overlaps the now-shifted view:
estimate = previousCount * (1 - elapsedInCurrent / windowSize) + currentCount
So the storage cost is just two integers per key, not a list of timestamps.
Pros
- Highly memory-efficient — only two counter values per key.
- Effectively eliminates the boundary problem.
Cons
- Since it relies on an approximation, slight inaccuracies may occur if request patterns are heavily skewed within a window.
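The weighted estimate can be sketched as follows (single-process Python, illustrative names, not the Gateway's actual code):

```python
import math

class SlidingWindowCounter:
    """Approximate a sliding window with two adjacent fixed-window counters."""

    def __init__(self, limit: int, window: float = 1.0):
        self.limit = limit
        self.window = window
        self.curr_idx = 0   # index of the current fixed window
        self.curr = 0       # count in the current window
        self.prev = 0       # count in the immediately preceding window

    def allow(self, now: float) -> bool:
        idx = math.floor(now / self.window)
        if idx != self.curr_idx:
            # Roll forward; if more than one window passed, prev is empty.
            self.prev = self.curr if idx == self.curr_idx + 1 else 0
            self.curr = 0
            self.curr_idx = idx
        elapsed = now - idx * self.window
        # Weight the previous window by how much of it still overlaps the view.
        estimate = self.prev * (1 - elapsed / self.window) + self.curr
        if estimate < self.limit:
            self.curr += 1
            return True
        return False
```

With a limit of 100/s, a window filled at t=0.5 still blocks entirely at t=1.0 (the previous window carries full weight) and admits roughly 50 more at t=1.5 (half weight), which is where the approximation error the Cons bullet mentions comes from.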
Token Bucket
Token Bucket is a widely used method that allows for a certain degree of burstiness while maintaining a steady average rate.
It deposits "tokens" into a bucket at a fixed rate. Every incoming request must consume a token to be processed; if the bucket is empty, the request is rejected or delayed.
Pros
- Accommodates bursty traffic while maintaining a steady average rate.
- Memory-efficient — only stores the current token count and the last refill timestamp.
Cons
- Parameter tuning (refill rate vs. bucket capacity) requires care.
- Slightly more complex than Fixed Window: a naive implementation must coordinate two values (token count + last refill) atomically. In a centralized Redis setup this is trivially handled by a short Lua script (see Criterion 4), but a hand-rolled, lock-based implementation can become race-prone.
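A lazy-refill sketch in Python (illustrative and single-process; the real coordination story, moving this logic into a Redis Lua script, is covered under Criterion 4):

```python
class TokenBucket:
    """Refill `rate` tokens per second up to `capacity`; one token per request."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full, so an initial burst is allowed
        self.last = 0.0         # timestamp of the last refill

    def allow(self, now: float) -> bool:
        # Lazy refill: credit tokens for the time elapsed since the last call.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# rate=10/s with capacity=20: a burst of 20 is served instantly, then the
# bucket throttles to the steady refill rate.
bucket = TokenBucket(rate=10, capacity=20)
burst = sum(bucket.allow(0.0) for _ in range(25))  # 20 accepted, 5 rejected
later = sum(bucket.allow(0.5) for _ in range(10))  # 5 accepted (0.5 s of refill)
```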
Leaky Bucket
Leaky Bucket models requests as water entering a fixed-capacity bucket that "leaks" at a constant rate. Excess requests overflow and are rejected (or queued, depending on the variant).
Pros
- Smooths bursty input into a strictly steady output rate — useful when downstream services require a flat traffic profile.
Cons
- Cannot accommodate bursts beyond bucket capacity.
- Queue-based variants add latency and do not naturally support cumulative usage tracking.
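A sketch of the reject ("meter") variant, which tracks a water level rather than an actual queue (illustrative Python; the queue-based variant would instead hold requests and release them at the leak rate):

```python
class LeakyBucket:
    """Meter variant: the level rises by 1 per request and drains at `rate`
    per second; a request that would overflow `capacity` is rejected."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.level = 0.0
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Drain for the elapsed time, then try to pour this request in.
        self.level = max(0.0, self.level - (now - self.last) * self.rate)
        self.last = now
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False

bucket = LeakyBucket(rate=10, capacity=5)
spike = sum(bucket.allow(0.0) for _ in range(7))    # 5 accepted, 2 overflow
drained = sum(bucket.allow(0.2) for _ in range(3))  # 2 more after 0.2 s of draining
```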
GCRA (Generic Cell Rate Algorithm)
GCRA, originally designed for ATM networks, determines whether a sequence of events conforms to a specific traffic contract.
Unlike counter-based methods, it uses the Theoretical Arrival Time (TAT) to decide whether a request is arriving too early relative to the allowed rate. Two parameters control its behavior:
- T (emission interval): the inverse of the average rate (e.g., 100 req/s → T = 10 ms).
- τ (tolerance / burst): how much earlier than TAT a request may arrive without being rejected. Functionally equivalent to Token Bucket's bucket capacity.
On each request: if now ≥ TAT − τ, accept and update TAT = max(TAT, now) + T; otherwise reject.
Pros
- Exceptionally space-efficient — only a single timestamp (TAT) per key (~16 bytes vs. ~32 bytes for Token Bucket's count + last-refill pair).
- The single-value state makes atomic updates in Redis simpler than counter-pair models: the entire read-calculate-write cycle in a Lua script touches only one field.
Cons
- Conceptually harder to explain — "Theoretical Arrival Time" is less intuitive than a request counter.
- Stored values are timestamps rather than human-readable counts, making manual inspection and debugging harder.
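The entire algorithm fits in a few lines. This sketch implements the update rule stated above (illustrative, single-process Python):

```python
class GCRA:
    """T: emission interval (1 / average rate). tau: burst tolerance."""

    def __init__(self, T: float, tau: float):
        self.T = T
        self.tau = tau
        self.tat = 0.0  # Theoretical Arrival Time of the next conforming request

    def allow(self, now: float) -> bool:
        # Accept if the request is not arriving more than tau early.
        if now >= self.tat - self.tau:
            self.tat = max(self.tat, now) + self.T
            return True
        return False

# 1 req/s average (T = 1 s) with tau = 5 s permits a burst of
# tau/T + 1 = 6 back-to-back requests, then settles to one per second.
gcra = GCRA(T=1.0, tau=5.0)
burst = sum(gcra.allow(0.0) for _ in range(10))  # 6 accepted
```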
Requirements
The Gateway must satisfy:
- Hierarchical Quota Management: Multi-layered limits, combining monthly aggregate quotas (Fixed Window) with per-second flow control (Token Bucket / GCRA).
- Flexible Plan-Based Limiting: Dynamic enforcement of differentiated quotas and rates mapped to user tiers (e.g., Free vs. Pro).
- Burst Tolerance: Accommodate transient spikes typical of AI agents using a token-based or interval-based credit system.
- Strict Distributed Consistency: Guaranteed counter accuracy across a multi-node cluster via a centralized Redis store with atomic Lua operations.
- Low-Latency & Space Efficiency: Minimal processing overhead and `O(1)` storage complexity per key.
Criterion 1: Quota Suitability
- Fixed Window: Best fit. Directly tracks usage per calendar interval (Month / Day). Simple and effective.
- Sliding Window: Poor fit. Designed for "moving" intervals; calendar resets are non-trivial to implement.
- Token Bucket / GCRA: Indirect. Optimized for traffic flow, not for tracking a "monthly remaining balance."
- Leaky Bucket: Incompatible. Purpose-built for smoothing output rate, not for tracking a "remaining balance" over a calendar window. Even the reject-variant is a poor semantic match — its state describes "how full is the leak buffer right now," not "how much of this month's quota is left."
→ Fixed Window for monthly quotas, plus another algorithm for per-second flow.
Criterion 2: Per-Second Throttling Performance
- Fixed Window: Weak. Susceptible to boundary spikes; unreliable for server protection.
- Sliding Window Log: High precision / high cost. Perfect accuracy, but storing every timestamp is prohibitively expensive for a high-traffic gateway.
- Sliding Window Counter: Middle ground. Smoother than Fixed Window and more memory-efficient than the Log version, but still an approximation and lacks the explicit burst semantics found in Token Bucket / GCRA.
- Token Bucket: Best. Naturally handles bursts while protecting the backend, with battle-tested implementations across the industry (Stripe, GitHub, AWS API Gateway). Burst capacity and refill rate map directly to product-level dials.
- GCRA: Excellent. Mathematically precise interval management; τ plays the same role as Token Bucket's bucket capacity. The added precision rarely translates into observable behavioral differences at API-gateway scale.
Criterion 3: Observability and User Feedback
How well can the algorithm expose its internal state via standard headers like X-RateLimit-Remaining and X-RateLimit-Reset?
- Fixed Window: Best. Remaining quota is a simple subtraction; highly predictable for users.
- Token Bucket: Best. The current token count is "Remaining" — no conversion layer, no derived formula, no rounding decisions to defend in a bug report.
- GCRA: Good. A conversion is required: `remaining = max(0, τ − max(0, TAT − now)) / T`. The math is straightforward, but every conversion is a place where an off-by-one can sneak in and where a contributor has to re-derive the formula to reason about the response.
- Sliding Window Counter: Poor. The deeper issue is that "Remaining" carries ambiguous burst semantics — the value depends on a weighted blend of the previous and current windows, so a single integer cannot honestly answer "how many requests can I send right now." `floor()` smooths over the fractional output, but the underlying mismatch between the exposed number and the actual admission decision still surfaces as "I had 1 left and got blocked" reports.
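For illustration, the GCRA conversion above written as a function (a hypothetical helper, not the Gateway's code). Note that even the choice of `floor()` here is exactly the kind of rounding decision this conversion layer forces on the implementer:

```python
import math

def gcra_remaining(tat: float, now: float, T: float, tau: float) -> int:
    """Derive an X-RateLimit-Remaining value from GCRA's single stored
    timestamp: unused burst headroom, expressed in whole emission intervals."""
    return math.floor(max(0.0, tau - max(0.0, tat - now)) / T)

# A fresh key (tat = 0) with T = 1 s, tau = 5 s reports 5 remaining;
# a fully bursted key (tat well ahead of now) reports 0.
```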
Criterion 4: Strict Distributed Consistency and Performance
How well can the algorithm maintain a synchronized state across a distributed cluster using a centralized store (Redis) without sacrificing latency?
- Fixed Window: Best. Leverages Redis's atomic `INCR`, requiring minimal communication and no complex logic to prevent race conditions.
- Token Bucket: Excellent. Two fields (token count + last refill) sound heavier than one, but in practice both live in a single Redis hash and are mutated by a ~15-line Lua script that runs atomically on the server. This is what dissolves the "distributed complexity" caveat from the Cons section: the hard part of Token Bucket is racing two writes from multiple nodes, and Lua eliminates that race by collapsing the read-calculate-write cycle into a single server-side operation.
- GCRA: Excellent. Single-timestamp state shrinks the Lua payload by a few bytes and removes one field from the script. A genuine micro-optimization, but one whose practical impact is dwarfed by network round-trip time.
- Sliding Window Log: Poor. Transmitting and processing entire timestamp sets per request bottlenecks high-throughput environments.
Final Decision: A Tiered Architecture with Fixed Window and Token Bucket
We have opted for a hybrid approach to balance business requirements with technical performance:
- Fixed Window for monthly quotas. Aligns with calendar-based billing cycles and gives the most intuitive experience for tracking long-term usage. Its simplicity in Redis (`INCR`) ensures maximum throughput for high-level accounting.
- Token Bucket for per-second throttling. Protects backend resources from transient spikes and AI-driven bursts.
Token Bucket was chosen over GCRA for three reasons:
- Industry-standard semantics for API consumers. Stripe, GitHub, and AWS API Gateway have collectively trained developers to think in terms of buckets and tokens. When a customer sees `X-RateLimit-Remaining: 47` and consults their existing mental model, the Gateway's behavior matches their expectation immediately. Aligning with the dominant convention reduces integration friction — a direct product win.
- Operational debuggability. Inspecting Redis during an incident and seeing `tokens=47, lastRefill=...` immediately tells an operator the current state. A raw TAT timestamp requires a side calculation to interpret — a friction tax paid every time the system misbehaves in production, where minutes of confusion translate directly into customer impact.
- Headers map directly to internal state. `X-RateLimit-Remaining` is the token count itself, with no conversion layer between storage and response. One fewer formula means one fewer place a bug can reach an end user.
The advantages GCRA offers — ~16 bytes saved per key, single-field atomic updates — are real but marginal at the scales this Gateway is designed for. The product-level wins of Token Bucket, by contrast, are paid out on every single request and every single incident.
This loops back to the opening premise. As AI agents become the dominant API consumer, traffic patterns get higher in volume, less predictable, and harder to reason about by inspection. The right response to more chaos in the data plane is not a more clever algorithm underneath — it is a control plane that any operator can read at a glance. Token Bucket buys exactly that: when an AI agent goes haywire at 3 a.m., the on-call engineer reads tokens=0, lastRefill=... and immediately knows the story. That legibility is the lifeline.
Request flow
Each request must pass both layers to be served:
1. Evaluate the per-second Token Bucket layer (single-key Lua: refill, then decrement).
If tokens < 1 → reject with 429, set Retry-After = ceil((1 − tokens) / refillRate).
2. Atomically INCR the monthly Fixed Window counter.
If above quota → reject with 429, set X-RateLimit-Reset = next month boundary.
3. Forward the request and emit X-RateLimit-* headers from both layers,
prefixed (e.g., X-RateLimit-Remaining-Second / -Month) so clients can
distinguish which limit they are approaching.
A note on Retry-After: RFC 9110 defines its delay-seconds form as a non-negative integer, so we round up the computed wait. For sub-second precision (relevant when refillRate is high), we additionally emit a non-standard X-RateLimit-Retry-After-Ms header — clients that understand it can back off more tightly, while standard clients fall back to the conservative integer-second value.
The per-second check runs first because it is the cheaper, more frequently failing path — failing fast avoids unnecessary work on the monthly counter.
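The two-layer flow can be sketched end to end (in-memory Python with illustrative names; the real gateway would back both layers with Redis, as described above):

```python
import math

class Gateway:
    """Two-layer check: per-second Token Bucket, then monthly Fixed Window."""

    def __init__(self, rate: float, capacity: float, monthly_quota: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0.0
        self.monthly_quota = monthly_quota
        self.monthly_used = 0  # in Redis this would be an INCR-ed counter

    def handle(self, now: float) -> tuple[int, dict]:
        # Layer 1: per-second token bucket (cheap, fails most often, runs first).
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            retry = math.ceil((1 - self.tokens) / self.rate)
            return 429, {"Retry-After": str(retry)}
        self.tokens -= 1
        # Layer 2: monthly quota. The token consumed above is deliberately
        # not refunded if this layer rejects.
        self.monthly_used += 1
        if self.monthly_used > self.monthly_quota:
            return 429, {"X-RateLimit-Remaining-Month": "0"}
        return 200, {
            "X-RateLimit-Remaining-Second": str(math.floor(self.tokens)),
            "X-RateLimit-Remaining-Month": str(self.monthly_quota - self.monthly_used),
        }
```

With a tiny quota of 3/month and a bucket of 5 tokens, the first three requests succeed, the next two fail at the monthly layer (still consuming tokens), and once the bucket is empty the per-second layer rejects first with a `Retry-After` hint.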
What about the consumed token when the monthly quota rejects?
A subtle question follows from the ordering: if step 1 consumes a token but step 2 rejects on the monthly cap, do we refund the token? We deliberately do not. Three reasons:
- The monthly rejection is an edge case. The vast majority of rejections happen at the per-second layer — that is the layer designed to fire constantly. Hitting the monthly cap is rare by definition (it happens at most once per user per month), so the small over-charge is bounded and infrequent.
- A refund costs another Redis round-trip. Compensating the token would require a second Lua call on the rejection path, doubling the latency cost of an already-failing request just to grant back a credit the client cannot use anyway (the next request will fail at the monthly layer too).
- Tokens self-heal in seconds. At any sane refill rate, a single un-refunded token is replenished within a fraction of a second — long before the user can retry. The "lost" token is invisible in practice.
This is the kind of trade-off where strict accounting purity loses to pragmatism: the cost of correctness is real, and the user-visible benefit is zero.
Thank you for reading!
This project is open-source and contributions of any size — bug fixes, architectural discussions, or feedback on this very algorithm choice — are always welcome. Let's build the future of API gateways together!