DEV Community

Cover image for Fixed Window, Leaky Bucket, Sliding Window: I Used All Three in a Production WAF. Here's Where Each One Broke.
Bhavy Yadav
Bhavy Yadav

Posted on

Fixed Window, Leaky Bucket, Sliding Window: I Used All Three in a Production WAF. Here's Where Each One Broke.

I was scrolling through access logs at 2 AM — not because an alert fired, but because something about the traffic shape I'd glanced at before bed felt wrong. The WAF was running. All counters looked healthy. Then I filtered by endpoint, and the login route told a different story: 4,300 credential stuffing attempts in the past hour, spread across 58 source IPs, every single one of them under my per-IP fixed window threshold. The attack had been running for hours. My rate limiter had never once fired.

That was the night I stopped thinking about rate limiting as a solved problem.

I built this WAF from scratch — started with NGINX and ModSecurity as the backbone, then migrated the core request pipeline to Go when the rule evaluation latency from ModSecurity became unacceptable. At peak, the system handles around 8,000 requests per second across the sites it protects. Not hyperscale, but enough that naive algorithm choices will hurt you in very specific and instructive ways.

I've run fixed window, sliding window, and leaky bucket in production. Each one failed differently. None of them failed the way I expected from reading about them.


Rate Limiting Is Not the Problem It Looks Like

The interview version: "Increment a counter. If it exceeds N, return 429." Two Redis calls. Done.

The production version: your WAF runs on multiple nodes behind a load balancer. Rate limit state is either local — in which case your effective limit per identity is N × node_count, and an attacker who hits all nodes gets 3x or 5x the throttle you intended — or it's centralized in Redis, in which case you've just made Redis a mandatory dependency on every single request. Under active attack, when you need rate limiting most, Redis is also under the highest load it will ever see.

The latency constraint bites immediately. My self-imposed budget for the entire WAF pipeline — IP reputation check, rate limit, header inspection, rule engine — was 5ms added latency. Redis round trips run 0.4-0.8ms under normal load. Under attack, with key contention, I've seen them hit 6ms on their own. The rate limiter alone can blow the entire latency budget the moment traffic spikes.

Then there's the false positive problem, which nobody talks about enough because it's not exciting. A false positive in a WAF doesn't generate a security incident. It generates a legitimate user getting a 429 they don't understand, giving up, and not coming back. False negatives get you breached. False positives just quietly erode whatever you're protecting. You will have both, and the people who care about each one have zero overlap.

That's the context. Now here's what actually broke.


Fixed Window: Fast, Debuggable, and Exploitable by Anyone Who Can Read a Clock

I started with fixed window because the implementation is essentially free:

func (fw *FixedWindowLimiter) Allow(ctx context.Context, key string, limit int64, window time.Duration) (bool, error) {
    windowSize := int64(window.Seconds())
    bucketKey := fmt.Sprintf("%s:%d", key, time.Now().Unix()/windowSize)

    pipe := fw.redis.Pipeline()
    incr := pipe.Incr(ctx, bucketKey)
    pipe.Expire(ctx, bucketKey, window+time.Second)
    if _, err := pipe.Exec(ctx); err != nil {
        return true, err // fail open — debatable, but my threat model allowed it
    }

    return incr.Val() <= limit, nil
}
Enter fullscreen mode Exit fullscreen mode

Two pipelined Redis calls. O(1) storage per identity per window. The counter is a plain integer — you can inspect it with redis-cli GET at any time without understanding anything about the system. I liked how debuggable it was.

The problem is the boundary. A fixed window resets on a schedule — top of the minute, top of the hour, whatever you pick. That reset time is observable. Any attacker watching X-RateLimit-Reset headers knows exactly when it happens.

Here's the 58-IP attack that ran through my WAF undetected:

Window: 60 seconds
Limit:  100 req/min per IP

T=58.2s  →  35 requests from IP A  (window 1 total: 35)
T=59.1s  →  65 requests from IP A  (window 1 total: 100) ← at limit
             ────────── WINDOW RESETS ──────────
T=60.0s  →  80 requests from IP A  (window 2 total: 80)  ← fresh counter
T=60.4s  →  20 requests from IP A  (window 2 total: 100) ← at limit again

200 requests in 2.2 seconds.
Intended limit: 100 per 60 seconds.
Enter fullscreen mode Exit fullscreen mode

Each window looks correct in isolation. 100 in window 1, 100 in window 2. My monitoring was alerting on sustained violations — counters that stayed pegged at the limit across multiple consecutive windows. This attack never triggered that alert because each window had a gap. The burst was entirely invisible.

Across 58 IPs all doing the same thing, that's 11,600 credential stuffing attempts in a 2-second window. The backend database handled 200 bcrypt comparisons that second. That's what woke it up in the logs.

I added a parallel 5-second burst counter after this. <your_ip>:burst:<epoch/5>. It fires a separate alert if any IP exceeds 30 requests in a 5-second window, regardless of what the per-minute window says. The fix was not replacing fixed window — it was accepting that fixed window alone is incomplete and adding what it's missing.

Fixed window is fine for coarse limits where boundary bursts don't create exploitable gaps: ingress byte caps, background job throttling, anything where 2x the intended rate for 2 seconds doesn't matter. On an authentication endpoint, it's the wrong tool.


Sliding Window: The Right Algorithm With a Memory Problem I Should Have Calculated First

After the fixed window incident, I switched login and payment endpoints to sliding window log. Store every request timestamp in a Redis sorted set, prune entries older than the window on each check, count what's left.

func (sw *SlidingWindowLog) Allow(ctx context.Context, key string, limit int64, window time.Duration) (bool, error) {
    now := time.Now()
    cutoff := strconv.FormatInt(now.Add(-window).UnixMicro(), 10)
    nowScore := float64(now.UnixMicro())

    pipe := sw.redis.Pipeline()
    pipe.ZRemRangeByScore(ctx, key, "0", cutoff)
    count := pipe.ZCard(ctx, key)
    pipe.ZAdd(ctx, key, redis.Z{Score: nowScore, Member: now.UnixNano()})
    pipe.Expire(ctx, key, window+time.Second)

    if _, err := pipe.Exec(ctx); err != nil {
        return true, err
    }
    return count.Val() < limit, nil
}
Enter fullscreen mode Exit fullscreen mode

No boundary exploits. Accurate count at any point in time. I felt good about this.

Then a scrapers' botnet hit one of the news sites I was protecting, and I did the memory math I should have done before deploying:

Active unique IPs during the attack: ~180,000
Max timestamps stored per IP:        600  (the req/min limit)
Redis sorted set overhead per entry: ~128 bytes

180,000 × 600 × 128 bytes = ~13.8 GB
Enter fullscreen mode Exit fullscreen mode

My Redis instance had 4GB. At around 60% capacity Redis started making eviction decisions I had not planned for. When a rate limit key gets evicted, the counter resets. An attacker whose key gets evicted effectively gets a clean slate. The scraper was generating enough distinct source IPs that key eviction was happening continuously — their effective rate limit was functionally nonexistent during peak attack traffic, which was the exact moment I most needed it to work.

ZREMRANGEBYSCORE + ZCARD + ZADD is also three operations per request, versus one INCR for fixed window. Under attack, three operations per request across hundreds of thousands of concurrent IPs is a non-trivial Redis throughput multiplier.

The version I ended up running is the sliding window counter — an approximation that uses two adjacent fixed window buckets:

func (sw *SlidingWindowCounter) Allow(ctx context.Context, key string, limit int64, window time.Duration) (bool, error) {
    windowSecs := int64(window.Seconds())
    now := time.Now().Unix()
    currentSlot := now / windowSecs
    prevSlot := currentSlot - 1

    // How far through the current window are we? 0.0 at start, 1.0 at end
    position := float64(now%windowSecs) / float64(windowSecs)

    currentKey := fmt.Sprintf("%s:%d", key, currentSlot)
    prevKey := fmt.Sprintf("%s:%d", key, prevSlot)

    pipe := sw.redis.Pipeline()
    getCurrent := pipe.Get(ctx, currentKey)
    getPrev := pipe.Get(ctx, prevKey)
    pipe.Exec(ctx)

    current, _ := getCurrent.Int64()
    prev, _ := getPrev.Int64()

    // Weight the previous window based on how much of it falls in our sliding window
    weighted := float64(prev)*(1.0-position) + float64(current)
    if weighted >= float64(limit) {
        return false, nil
    }

    pipe2 := sw.redis.Pipeline()
    pipe2.Incr(ctx, currentKey)
    pipe2.Expire(ctx, currentKey, window*2)
    _, err := pipe2.Exec(ctx)
    return err == nil, err
}
Enter fullscreen mode Exit fullscreen mode

Two integer keys per identity, regardless of traffic volume. Error rate is around 0.4% at window boundaries — there's a small zone where the interpolation overestimates or underestimates the true count. For a security rate limiter, I accepted that tradeoff. For something financial, you probably can't.

The sliding window log is the right algorithm when you have a small number of identities and need exact counts. If you're running it on raw source IPs during any kind of bot traffic, do the memory math before you deploy it.


Leaky Bucket: Where I Blocked Legitimate Users and Didn't Immediately Know Why

Before I rewrote the WAF in Go, I was running NGINX with ngx_limit_req. NGINX's rate limiting is leaky bucket. I didn't choose it — it was the only option the module gave me.

Leaky bucket models a queue with a drain rate. Traffic enters at any rate, leaves at a constant rate. If the queue fills, excess is either delayed or dropped. For protecting a fragile backend from traffic spikes, it genuinely works well.

The failure case is what legitimate traffic looks like.

One of the sites I was protecting had a mobile app with decent usage — maybe 30,000 active users. Mobile users lose connectivity constantly. Subway, elevator, spotty 4G in a rural area. When they reconnect, the app retries whatever it was trying to send. If they lost connectivity for 30 seconds while the app was making periodic API calls, reconnection triggers three or four requests in under a second.

From the leaky bucket's perspective: burst of 4 requests with 200ms spacing from an IP that was quiet for 30 seconds. No burst credit for the quiet period. If the queue is already at capacity from other traffic, these 4 requests get dropped.

The users weren't sending error reports because the app silently retried. I noticed it because app engagement metrics were lower than expected for certain times of day — specifically peak commute hours, when users were moving in and out of underground transit. The "drop rate" I was seeing was legitimate mobile reconnects getting shed by the queue.

Leaky bucket optimizes for smooth output at the cost of bursty-but-legitimate input. It doesn't model how real users behave — people don't make API calls at a metronomic rate. A well-tuned attacker running a slow crawl has smoother traffic than a real mobile user. The algorithm punishes the user and rewards the patient attacker.

I kept NGINX's ngx_limit_req for one specific purpose: coarse traffic shaping before requests reach the Go process. A 2,000 req/s per-IP hard cap with a generous burst parameter (burst=500, nodelay). This protects the Go process from single-source floods without doing any policy enforcement. The policy — who actually gets blocked and why — lives in the Go middleware with sliding window counter.


Searching for the Best Algorithm Is the Wrong Question

Every few months I see a blog post that compares these algorithms and concludes one of them is best. The conclusion is always wrong, because the algorithm that's right depends on what you're trying to stop, at what traffic volume, with what latency budget, and what your false positive tolerance is.

Credential stuffing is patient and distributed. Attackers stay well under per-IP limits intentionally. No single-identity rate limiter catches it — you need cross-IP correlation, behavioral signals, per-endpoint limits that make even sub-limit sustained attempts expensive. A leaky bucket set up to smooth traffic does nothing against this.

A volumetric DDoS doesn't care about your per-minute limits. You need to shed traffic at the kernel level before it touches application code. eBPF at the XDP hook drops packets at line rate before the TCP stack processes them. Sliding window counter accuracy is irrelevant here — you need throughput, not precision.

API abuse from residential proxies — a competitor scraping your pricing, a research firm harvesting your public data — operates at a few requests per minute per IP. Per-identity rate limiting sees nothing. You need aggregate behavior analysis: same user-agent across thousands of IPs, consistent URL patterns, request timing that's too regular to be human.

Algorithm Boundary Safety Memory Redis Ops/req Best use Breaks on
Fixed Window ✗ exploitable O(1) 1 Coarse ingress caps Auth/payment endpoints
Sliding Window Log ✓ exact O(limit × IDs) 3 Low-volume sensitive endpoints Bot traffic with many IPs
Sliding Window Counter ✓ ~99.6% O(IDs) 2 + 2 General API rate limiting Hard financial accuracy
Leaky Bucket N/A (shaping) O(queue) N/A Backend protection Bursty legitimate users
Token Bucket O(IDs) 2 APIs with burst allowance DDoS mitigation

The mistake is picking one and calling it done.


What I Actually Run

Four layers. Each one independently simple. Complexity in the composition.

┌─────────────────────────────────────────────┐
│               INTERNET                      │
└──────────────────┬──────────────────────────┘
                   │
┌──────────────────▼──────────────────────────┐
│  LAYER 1 — eBPF/XDP                         │
│  Fixed window: 10K req/s hard cap per IP    │
│  Drops at NIC before kernel TCP stack       │
│  Blocklist from Kafka consumer (30s lag)    │
└──────────────────┬──────────────────────────┘
                   │
┌──────────────────▼──────────────────────────┐
│  LAYER 2 — NGINX ngx_limit_req              │
│  Leaky bucket: 500 req/s per IP, burst=200  │
│  Traffic shaping only — no policy           │
│  Node-local state, zero Redis dependency    │
└──────────────────┬──────────────────────────┘
                   │
┌──────────────────▼──────────────────────────┐
│  LAYER 3 — Go WAF middleware                │
│  Sliding window counter (Redis-backed)      │
│  Per-IP + per-session + per-API-key         │
│  Auth endpoints: 5 req/min, no exceptions   │
│  In-process cache for hot keys, 100ms sync  │
└──────────────────┬──────────────────────────┘
                   │
┌──────────────────▼──────────────────────────┐
│  LAYER 4 — Kafka telemetry (off hot path)   │
│  Every decision emitted as event            │
│  Consumer correlates across IPs/sessions    │
│  Pushes back to Layer 1 blocklist           │
└─────────────────────────────────────────────┘
Enter fullscreen mode Exit fullscreen mode

Layer 1 handles volumetric attacks. eBPF at XDP processes packets before the kernel network stack — a tight fixed window at 10,000 req/s per IP, which no legitimate source ever hits, implemented as a BPF hash map. Packets matching the blocklist or exceeding the cap get XDP_DROP returned before a single socket buffer is allocated. The performance difference versus application-layer dropping is significant: XDP runs at ~2M pps on modest hardware; getting the same traffic to the Go process would saturate it.

Layer 2 is NGINX doing traffic shaping. I'm not trying to make policy decisions here — just preventing a single source from landing 8,000 req/s on the Go process. The burst parameter is generous (200) specifically because I burned legitimate mobile users with a tight leaky bucket before. NGINX handles this without touching Redis.

Layer 3 is where rate limit policy actually happens. Sliding window counter in Go, Redis-backed. Every identity type gets its own key namespace: rl:ip:, rl:user:, rl:session:, rl:apikey:. Different limits per endpoint class — a public API endpoint gets 600 req/min, a login endpoint gets 5 req/min with no burst, a password reset endpoint gets 3 per hour per user. The tiering exists because "rate limiting" is not one policy, it's many policies layered on the same infrastructure.

One practical problem: some API keys are legitimate high-volume clients generating thousands of requests per minute. All their requests map to the same Redis key. At volume, that single key can saturate the Redis CPU for its shard — Redis is single-threaded per core per shard, and a hot key is a ceiling you'll hit before memory becomes an issue. My fix was local counter caching in the Go process: maintain an in-memory counter, sync to Redis every 100ms. This reduces Redis ops by roughly 50-100x for hot keys. The accuracy cost is ±100ms on the limit enforcement, which is acceptable for API key throttling and completely unacceptable for authentication endpoints where I need exact counts.

Layer 4 is Kafka. Every rate limit decision — allowed, warned, blocked — gets emitted as a structured event asynchronously. It does not touch the request path. A Go consumer reads the stream and looks for patterns individual layers can't see: IPs clustering near the limit without hitting it (scanning), identical user-agents across thousands of distinct source IPs (coordinated botnet), session tokens that seem to travel geographically faster than physics allows. When confidence is high enough, it pushes the offending identities to a Redis set that Layer 1 reads every 30 seconds and translates into XDP blocklist entries.

The 30-second lag between detection and enforcement is real and intentional. I'd rather have 30 seconds of exposure with high-confidence blocks than 0 seconds of exposure with an aggressive heuristic that starts blocking legitimate traffic. The eBPF layer catches volumetric attacks immediately. The Kafka consumer catches the slow, distributed, low-signal attacks. They cover different threat models.

This architecture evolved over about a year and a half of the WAF running against real traffic. It was not designed this way. I started with just fixed window in Go, added sliding window counter after the credential stuffing incident, added the eBPF layer after a volumetric attack saturated the Go process, added Kafka telemetry after noticing slow distributed scraping that none of the per-identity limiters caught.

Each layer was added reactively, after something broke. I'd design it this way from the start now, but I wouldn't have known to without having run the earlier, simpler versions until they failed.


What to Actually Build and When

If you're starting from zero: sliding window counter in Redis, applied to authentication endpoints with a tight limit and everything else with a loose one. Do not start with leaky bucket unless traffic shaping is your actual goal. Do not start with fixed window on anything where a boundary burst creates a security gap. The sliding window counter implementation above is around 50 lines of Go and handles most cases correctly.

When Redis starts hurting: the signal is rate limit check latency climbing at p99, or Redis CPU staying above 60% under normal load. Before redesigning, add local counter caching with Redis sync. Cheap to implement, handles hot key scenarios, keeps the rest of the architecture unchanged. This buys significant headroom.

On eBPF: don't reach for XDP until you have a clear volumetric problem and you're comfortable debugging kernel-space code. An eBPF bug that passes the verifier but has a logic error can do things you cannot recover from without a reboot. The performance gain is real — moving packet drops from the Go process to the NIC makes a measurable difference once you're getting hit with hundreds of thousands of packets per second. The operational cost is also real.

On authentication endpoints: they deserve their own rate limit config, separate from everything else, with limits 10x tighter than your general API. A 600 req/min API limit that you accidentally apply to your login endpoint lets an attacker run 36,000 credential stuffing attempts per hour per IP. Five requests per minute on login makes credential stuffing economically unviable. Three requests per minute is better. The number of legitimate users who need to attempt login more than 5 times per minute is zero.

On 429 responses: they need a Retry-After header, a request ID the user can reference, and a message that doesn't say "rate limit exceeded" to an end user who has no idea what that means. This sounds like product work and not worth engineering time. It is worth engineering time. Unexplained 429s generate support tickets that take longer to resolve than writing the Retry-After header would have taken.


What Still Breaks

The hybrid setup catches the attacks it's been designed to catch. It does not catch everything.

Six months ago, an actor was rotating API session tokens every 300-400 requests — specifically tuned to stay under per-session limits while exceeding the per-endpoint total rate I actually cared about. The rotation was fast enough that no individual token triggered a block. The Kafka consumer eventually flagged the pattern: thousands of short-lived tokens all following the same URL sequence, all from the same three ASNs. By the time the block deployed, the actor had been running for 11 days.

The lesson from that isn't "build a better heuristic." It's that any fixed rule set has a bypass if the attacker has enough signal about your rules. The adaptive detection layer — Kafka consumer plus behavioral analysis — is the part that closes unknown-unknown attack patterns. It needs to be running before you need it.

Rate limiting is not a problem you solve. It's a system you tune against an adversary who is also tuning against you. Every algorithm I've described has a bypass. The goal is making the bypass expensive enough that the attacker moves on.


Backend engineer. Go, security infrastructure, distributed systems. WAF project on GitHub: bhavyyadav25. Portfolio: bhavyyadav25.github.io. Email: yadavbhavy25@gmail.com

Top comments (2)

Collapse
 
uzoma_uche_3ec83974b4a8a5 profile image
Echo

That "filter by endpoint" moment is the unlock for rate-limit work — global counters always look healthy, the attack hides in a slice. Three things I'd add from running WAFs in production:

1) The "spread across 58 source IPs, every one under threshold" attack pattern is what kills the per-IP fixed window. The right defense is per-(IP, route) or per-(IP, route, user-agent) as a hierarchical key, but that explodes the counter space. The real fix is usually a behavioral signal on top of the count: same JA3 hash, same TLS fingerprint, same path enumeration order, same timing. Once you have one of those, the rate limit can be a single threshold. Behavioral detection is more code than a counter, but the counter-only world is what the attacker is winning.

2) "All counters looked healthy" is a dashboard problem, not a detection problem. A WAF dashboard should show tail-end statistics (p99, 95th percentile of request rate per route) not just sums. Sums hide the attack because the attack is in the shape, not the total. Plot the per-minute count over the last 4 hours and the spike is obvious; the rolling 1-hour sum is fine.

3) Migrating the rule pipeline from ModSecurity to Go is the right move for latency but the wrong move for rule expressiveness. The 80% of rules you want to write are "if request matches pattern X AND source matches Y" — that's a small DSL, not a Go function. Worth writing a tiny rule DSL (~50 lines) that compiles to a Go function. Otherwise every new rule is a deploy, and the team that owns the WAF ends up gating every change.

The 2 AM framing is the right one for the post too. Most rate-limit blog posts are written from the safety of a daylight office. The 2 AM version is the one that actually teaches.

Collapse
 
bhavyyadav25 profile image
Bhavy Yadav • Edited

JA3 is where I want to take the next iteration, but TLS termination makes it annoying in practice. The WAF middleware runs at the application layer — by the time a request arrives, the handshake is done. Getting JA3 means either nginx's ssl_fingerprint patch (didn't want to maintain a custom nginx build) or eBPF at the socket layer before the kernel hands off to TLS. The behavioral signals I have running are Kafka-side: same user-agent prefix across N distinct source IPs within a time window, consistent path enumeration order, request timing with CV too low to be human. Less precise than a TLS fingerprint, but no dependency on capturing the handshake.
I tried the (IP, route, UA) composite key — counter space explosion is manageable with a sliding window counter instead of log, but composite keys make false positive debugging harder at 2AM. A single-dimension key at least tells you what you're looking at.
Point 2 is exactly the fix. The Kafka consumer now emits per-route p95/p99 as a time series. Rolling sum looked fine the entire time the attack was running — 5-minute bucketed per-route chart showed the spike in the first frame. Dashboard problem, not a detection problem.
On the DSL: I'd push back on "50 lines." The rules I want to write look like if url.path matches /api/login AND body @detectSQLi AND ip.reputation >= 70 THEN block — field selectors, transformation chains, multi-condition boolean logic, custom operators, action specification. None of that is free. ModSecurity's rule language is 15 years of accumulated production edge cases. A WAF DSL that actually covers what you need is closer to 500 lines, and at that point you're maintaining a parser. I solved the deploy-per-rule problem differently: rules load from YAML at startup, hot-reload on SIGHUP, no restart. New rule is a config push, not a go build. Not as expressive, but I'll take that over owning a parser.