Macaulay Praise

Rate Limiting Wasn't Enough — So I Built an API Gateway with Behavioral Abuse Detection

Bloom filters and DIY vs WAF trade-offs

Real rate limiting, Bloom filters, credential stuffing detection, and the bugs that almost broke everything. Live demo included.

GitHub: macaulaypraise/api-gateway-with-abuse-detection
Live demo: api-gateway-with-abuse-detection.onrender.com/docs


As someone transitioning into backend engineering, I wanted to build something that went beyond tutorials. I didn't want a CRUD app. I wanted something that would teach me how real systems defend themselves — something I could point to in an interview and say: "I built this from scratch and I know exactly why every line exists."

That project became an API Gateway with Abuse Detection — a FastAPI service that sits in front of upstream backends and actively detects credential stuffing, scraping bots, and known-bad actors. Here's a technical breakdown of how it works, the decisions behind it, and the real bugs that nearly cost me my sanity.


What the System Does

Every request passes through a six-step middleware chain in this exact order:

```
1. RequestID      → UUID trace ID attached to every request
2. Auth           → JWT validation, client_id + role extracted
3. BloomFilter    → O(1) bad IP + bad user-agent check
4. RateLimit      → sliding window per authenticated client
5. AbuseDetector  → graduated response (throttle/block)
6. ShadowMode     → log would-be blocks before enforcement
```

Each middleware depends on the one before it. If the Bloom filter flags you, the rate limiter never runs. Fail fast, fail cheap.
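
One wiring detail worth calling out (a general Starlette behavior, not something unique to this project): `add_middleware` wraps the app, so the middleware registered last runs first. A minimal sketch with two illustrative middleware classes shows the ordering:

```python
import uuid

from fastapi import FastAPI
from starlette.middleware.base import BaseHTTPMiddleware


class RequestIDMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        # runs first: every downstream middleware and handler sees the trace ID
        request.state.request_id = str(uuid.uuid4())
        response = await call_next(request)
        response.headers["X-Request-ID"] = request.state.request_id
        return response


class AuthMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        # runs second: request.state.request_id already exists here
        request.state.client_id = "demo"  # the real version validates the JWT
        return await call_next(request)


app = FastAPI()
# Starlette wraps middleware, so the one added LAST is the OUTERMOST:
# register in reverse of the execution order listed above.
app.add_middleware(AuthMiddleware)       # executes second
app.add_middleware(RequestIDMiddleware)  # executes first
```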


The Core Components (And Why Each One Exists)

1. Sliding Window Rate Limiter

Fixed-window rate limiting has a well-known flaw: a client can send N requests at the very end of window 1 and N more at the very start of window 2 — 2N requests in a burst of a few seconds straddling the boundary, while technically never violating the per-window rule.

The sliding window eliminates this. Every request gets timestamped and stored in a Redis sorted set. On each new request:

  1. Delete all entries older than the window
  2. Count what remains
  3. Allow or deny

The key word is atomic. If steps 1–3 aren't wrapped in a Lua script, a concurrent request can slip between the remove and the count, creating a race condition that lets clients exceed their limit.

```lua
-- Executed atomically on the Redis server
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])

redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
local count = redis.call('ZCARD', key)

if count < limit then
    redis.call('ZADD', key, now, now)
    return 1  -- allowed
end
return 0  -- blocked
```
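
For reference, here's a minimal sketch of how that script might be wired up with redis-py's asyncio client. The helper name `is_allowed` and the key scheme are illustrative, not necessarily what the repo uses:

```python
import time

import redis.asyncio as redis

SLIDING_WINDOW_LUA = """
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
if redis.call('ZCARD', key) < limit then
    redis.call('ZADD', key, now, now)
    return 1
end
return 0
"""

r = redis.Redis()
limiter = r.register_script(SLIDING_WINDOW_LUA)  # uploaded once, reused via EVALSHA


async def is_allowed(client_id: str, limit: int = 100, window: int = 60) -> bool:
    # one atomic round-trip: prune the window, count, and conditionally record
    result = await limiter(
        keys=[f"ratelimit:{client_id}"],
        args=[time.time(), window, limit],
    )
    return result == 1
```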

Production verification: 150 parallel requests against the live Render deployment confirmed the enforcer is exact:

```
100 × 200 OK  ← exactly the rate limit
 50 × 429     ← every request over the limit rejected
```

Prometheus confirmed rate_limit_rejections_total{client_id="demo"} 200.0 after two parallel test runs. The client_id label proves the JWT identity is tracked, not the IP address — a crucial distinction for shared NATs and corporate networks.
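
The metric itself is a plain labelled counter. A sketch with prometheus_client, following the metric name shown above (the wiring is illustrative):

```python
from prometheus_client import Counter

# exposed as rate_limit_rejections_total once scraped
rate_limit_rejections = Counter(
    "rate_limit_rejections",
    "Requests rejected by the sliding-window rate limiter",
    labelnames=["client_id"],
)

# inside the rate-limit middleware, whenever a 429 is returned:
rate_limit_rejections.labels(client_id="demo").inc()
```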

2. Two-Dimensional Auth Failure Tracking

Credential stuffing is tracked on two axes simultaneously:

  • By IP: failed_auth:{ip} — one IP failing across many accounts
  • By username: failed_auth:{username} — many IPs targeting the same account

These are separate Redis keys with independent TTLs, configurable via environment variables:

```
AUTH_FAILURE_IP_THRESHOLD=10       # failures before IP soft-block
AUTH_FAILURE_USER_THRESHOLD=20     # failures before username soft-block
AUTH_FAILURE_WINDOW_SECONDS=300    # counter TTL
```

Keeping these counters independent means you can block a specific IP without penalizing every other IP targeting that same user, and flag a username as under attack without affecting unrelated clients.
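
A minimal sketch of how those two counters might be bumped on a failed login, assuming redis-py's asyncio client (the helper name and return shape are illustrative):

```python
import redis.asyncio as redis

r = redis.Redis()

IP_THRESHOLD = 10     # AUTH_FAILURE_IP_THRESHOLD
USER_THRESHOLD = 20   # AUTH_FAILURE_USER_THRESHOLD
WINDOW = 300          # AUTH_FAILURE_WINDOW_SECONDS


async def record_auth_failure(ip: str, username: str) -> dict:
    ip_key, user_key = f"failed_auth:{ip}", f"failed_auth:{username}"
    async with r.pipeline(transaction=True) as pipe:
        pipe.incr(ip_key).expire(ip_key, WINDOW)
        pipe.incr(user_key).expire(user_key, WINDOW)
        ip_count, _, user_count, _ = await pipe.execute()
    return {
        "block_ip": ip_count >= IP_THRESHOLD,       # one IP spraying many accounts
        "flag_user": user_count >= USER_THRESHOLD,  # many IPs hammering one account
    }
```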

3. Scraping Detection via Request Timing Entropy

Humans generate requests with high temporal variance. Bots generate requests with suspiciously regular inter-request timing.

For each client, I maintain a sliding window of the last N timestamps in a Redis sorted set and compute the standard deviation of the inter-arrival gaps. A standard deviation below SCRAPING_ENTROPY_THRESHOLD (default 0.5) triggers a bot flag.

The elegant part: this doesn't care about request volume. A sophisticated bot that rate-limits itself to human speeds will still be caught if it's too regular. This pairs with user-agent fingerprinting (the second Bloom filter) to create a multi-signal detection approach.
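
The core of the check is just the standard deviation of consecutive gaps. A minimal sketch (the function name is illustrative; the real middleware pulls the timestamps from the Redis sorted set first):

```python
from statistics import pstdev

SCRAPING_ENTROPY_THRESHOLD = 0.5  # seconds of std dev below which timing looks robotic


def looks_like_a_bot(timestamps: list[float]) -> bool:
    if len(timestamps) < 3:
        return False  # not enough data to judge regularity
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return pstdev(gaps) < SCRAPING_ENTROPY_THRESHOLD


# A scripted client firing every 2.0s is flagged; a human-ish pattern is not.
print(looks_like_a_bot([0.0, 2.0, 4.0, 6.0, 8.0]))   # True
print(looks_like_a_bot([0.0, 1.2, 4.9, 5.3, 9.8]))   # False
```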

4. Dual Bloom Filters

Two in-memory Bloom filters, both synced from Redis every 60 seconds by a background worker:

  • known_bad_ips — screens every incoming IP at O(1) with no Redis round-trip
  • abusive_agents — user-agent fingerprinting for known scraper signatures

Configuration:

```
BLOOM_FILTER_CAPACITY=1000000  # expected entries
BLOOM_FILTER_ERROR_RATE=0.001  # 0.1% false positive rate
```

At a 0.1% false positive rate across 1 million IPs, the filter requires roughly 1.8 MB of memory. The worst case is a legitimate IP being flagged — which shadow mode surfaces before enforcement is ever enabled.

Critical implementation detail: the filter must live on app.state.bloom and be shared across all requests. Per-request instantiation gives you a fresh empty filter on every call — zero enforcement, zero errors, 100% invisible failure. More on this in the bugs section.
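
A minimal sketch of that wiring, assuming pybloom-live and FastAPI's lifespan hook (the attribute name matches the text above; the rest is illustrative):

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI
from pybloom_live import BloomFilter


@asynccontextmanager
async def lifespan(app: FastAPI):
    # built once per process and shared by the middleware, the admin routes,
    # and the background sync worker; never instantiated per request
    app.state.bloom = BloomFilter(capacity=1_000_000, error_rate=0.001)
    yield


app = FastAPI(lifespan=lifespan)
```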

5. Graduated Response System

Three states instead of a binary allow/block:

| State | Behavior |
| --- | --- |
| ALLOWED | Request passes through normally |
| THROTTLED | Response delayed via asyncio.sleep, served with Retry-After |
| SOFT_BLOCK | Immediate 429 — Redis TTL, temporary, self-expiring |

This matters because going straight to hard block means a legitimate client that briefly triggered a rule is permanently punished. The graduated approach lets real users recover automatically while truly malicious clients face escalating consequences.
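
A sketch of what that branch can look like inside the abuse-detection middleware. The state names mirror the table; the delay value and helper shape are illustrative, not the project's exact code:

```python
import asyncio
from enum import Enum

from fastapi.responses import JSONResponse

THROTTLE_DELAY_SECONDS = 2


class AbuseState(str, Enum):
    ALLOWED = "allowed"
    THROTTLED = "throttled"
    SOFT_BLOCK = "soft_block"


async def respond(state: AbuseState, request, call_next):
    if state is AbuseState.SOFT_BLOCK:
        # temporary by construction: the Redis flag behind this carries a TTL
        return JSONResponse(
            {"detail": "temporarily blocked"},
            status_code=429,
            headers={"Retry-After": "60"},
        )
    if state is AbuseState.THROTTLED:
        await asyncio.sleep(THROTTLE_DELAY_SECONDS)  # slow down instead of rejecting
        response = await call_next(request)
        response.headers["Retry-After"] = str(THROTTLE_DELAY_SECONDS)
        return response
    return await call_next(request)  # ALLOWED: pass through untouched
```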

6. Shadow Mode — The Safety Net

Shadow mode is how you deploy new detection rules without blocking real users. When a request would trigger a rule, shadow mode logs the event to Redis with a 24-hour TTL instead of blocking. The request passes through normally.

What makes this interesting is the implementation: shadow mode is a runtime toggle, not a deploy-time config. It's controlled via a Redis key:

```bash
# Enable — observe but don't block
curl -X POST "$BASE/admin/shadow-mode?enabled=true" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# Disable — start enforcing
curl -X POST "$BASE/admin/shadow-mode?enabled=false" \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

The middleware reads config:shadow_mode_enabled from Redis on every request, falling back to the SHADOW_MODE_ENABLED environment variable if the key is absent. Toggle takes effect on the next request — no redeployment, no restart.
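
A sketch of that per-request read, assuming redis-py's asyncio client. The key and env var names come from the text above; the helper itself is illustrative:

```python
import os

import redis.asyncio as redis

r = redis.Redis(decode_responses=True)


async def shadow_mode_enabled() -> bool:
    value = await r.get("config:shadow_mode_enabled")
    if value is None:
        # key absent: fall back to the deploy-time default
        return os.getenv("SHADOW_MODE_ENABLED", "false").lower() == "true"
    return value == "true"
```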


Database-Backed RBAC

The admin role system started as a simple ADMIN_USERNAMES environment variable. That approach has an obvious flaw: anyone who registers with one of those exact usernames instantly passes every admin check.

The replacement: a UserRole enum (USER, ADMIN) stored in the users table, embedded in the JWT at login time.

```python
# JWT payload at login
{"sub": username, "role": user.role}
```

The require_admin dependency reads the JWT role claim directly — no database query per request. To promote a user:

```sql
UPDATE users SET role = 'admin' WHERE username = 'target';
```

The user logs in again, receives a JWT with "role": "admin", and admin endpoints immediately become accessible. Their previous token expires in 30 minutes. No server restart required.
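
A sketch of what that dependency might look like with python-jose (the secret handling, token URL, and exact claim names are illustrative, not copied from the repo):

```python
from fastapi import Depends, HTTPException, status
from fastapi.security import OAuth2PasswordBearer
from jose import JWTError, jwt

SECRET_KEY = "change-me"  # loaded from settings in the real app
ALGORITHM = "HS256"
oauth2_scheme = OAuth2PasswordBearer(tokenUrl="/auth/login")


async def require_admin(token: str = Depends(oauth2_scheme)) -> str:
    try:
        payload = jwt.decode(token, SECRET_KEY, algorithms=[ALGORITHM])
    except JWTError:
        raise HTTPException(status.HTTP_401_UNAUTHORIZED, "Invalid token")
    if payload.get("role") != "admin":
        # the role comes straight from the JWT claim, so there is no DB query here
        raise HTTPException(status.HTTP_403_FORBIDDEN, "Admin role required")
    return payload["sub"]
```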


The Bugs That Actually Hurt

Bug 1: The Async Password Verification Trap

This one was subtle and genuinely dangerous. I had refactored verify_password to be an async function wrapping bcrypt's blocking checkpw in asyncio.to_thread() — which was correct. But I forgot to await it at the call site:

```python
# 🚨 WRONG — coroutine object is always truthy
if verify_password(plain, hashed):
    # This branch ALWAYS executes
    ...

# ✅ CORRECT
if await verify_password(plain, hashed):
    ...
```

A coroutine object that's never awaited evaluates as truthy. Every password check passed, regardless of input. All authentication was silently bypassed. The auth endpoint returned a valid JWT for any password entered against any account.

There were no exceptions and no test failures, so long as no test specifically checked that a wrong password gets rejected; the only trace was an easy-to-miss "coroutine was never awaited" RuntimeWarning buried in the logs. The fix is trivial once you find it — finding it is the hard part.

Bug 2: Bloom Filter Instantiated Per-Request

The block-ip admin route was creating a new BloomFilterService() inside the route handler, adding the IP to that instance, and returning. Meanwhile, the middleware's shared in-memory filter (on app.state.bloom) was never updated — until the 60-second background sync ran.

The result: a hard-blocked IP could keep sending requests for up to 60 seconds before the block actually took effect. The fix was making admin routes update request.app.state.bloom directly:

```python
# 🚨 WRONG — local instance, never seen by middleware
bloom = BloomFilterService()
bloom.add(ip)

# ✅ CORRECT — updates the shared middleware instance immediately
request.app.state.bloom.add(ip)
```

Bug 3: Static Admin Username Bypassed by Registration

The original ADMIN_USERNAMES config approach had a security hole: if the env var was set to "admin", anyone could register with username admin and gain admin access. Replaced entirely with the database-backed UserRole enum. The setting and its associated property were deleted from config.py.

Bug 4: Duplicate Alembic Migration Head

Running make makemigration twice without migrating in between creates two heads in the Alembic migration graph. The fix:

```bash
alembic merge heads -m "merge heads"
alembic stamp head
alembic upgrade head
```

Not a show-stopper, but something that will confuse you the first time you hit it.

Bug 5: Sequential curl Doesn't Test Rate Limiting

This one isn't a code bug — it's a test methodology bug that looks exactly like a code bug.

A rate limit of 100 requests per 60-second window means requests must arrive within the same 60-second window to count against each other. Over a network connection (the Render free tier pushes each round-trip to roughly a second), 300 sequential calls take about five minutes, and at any moment only ~60 requests sit inside the window — well under the limit. The limiter appears broken when it's working correctly.

```bash
# This will NOT trigger rate limiting against a remote host
for i in $(seq 1 300); do curl $BASE/gateway/proxy; done

# This will — all requests fire within the same window
for i in $(seq 1 150); do
  curl -s -o /dev/null -w "%{http_code}\n" \
    $BASE/gateway/proxy \
    -H "Authorization: Bearer $TOKEN" &
done | sort | uniq -c
# Output: 100 × 200, 50 × 429
```

Always use parallel requests when testing rate limiting against any remote deployment.


Performance Numbers

From a 60-second Locust load test, 20 concurrent users (legitimate users, credential stuffers, and scrapers running simultaneously):

| Metric | Result |
| --- | --- |
| Throughput | 59 req/s sustained |
| Legitimate user failure rate | 0% |
| Credential stuffing detection | Blocked within 10 attempts |
| P50 gateway latency | 10ms |
| P99 gateway latency | 440ms (includes throttle delay) |
| Shadow events logged in 60s | 740 |

The P99 spike is intentional — throttled clients hit asyncio.sleep, which is where the latency comes from. Legitimate users sit at the P50 line throughout.


Test Coverage

67 tests, 93% coverage. The most important tests to get right:

  • test_sliding_window_blocks_boundary_spike — send N requests at end of window 1, N at start of window 2, assert total allowed is N not 2N
  • test_concurrent_duplicate_requests — asyncio.gather firing the same endpoint 5 times simultaneously, assert no race condition in the counter
  • test_shadow_mode_does_not_block — enable shadow mode, send a would-be-blocked request, assert 200 returned and shadow log has an entry
  • test_credential_stuffing_detected — fail auth 10 times from same IP, assert 11th is blocked
  • test_require_admin_valid_admin and test_non_admin_cannot_access_admin_routes — RBAC enforcement

Integration tests run against real Redis and PostgreSQL via a separate docker-compose.test.yml. Test isolation uses TRUNCATE TABLE ... RESTART IDENTITY CASCADE per test, not drop_all/create_all — same isolation, far lower overhead.
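
A sketch of what that per-test cleanup can look like with pytest-asyncio and an async SQLAlchemy session (the `db_session` fixture and table names are illustrative):

```python
import pytest_asyncio
from sqlalchemy import text


@pytest_asyncio.fixture(autouse=True)
async def clean_tables(db_session):
    yield  # run the test first, then wipe its data
    # resets sequences and cascades to dependent rows without rebuilding the schema
    await db_session.execute(text("TRUNCATE TABLE users RESTART IDENTITY CASCADE"))
    await db_session.commit()
```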


Production Stack

| Component | Technology |
| --- | --- |
| Web framework | FastAPI + Uvicorn |
| Rate limit state | Redis 7 (sorted sets + Lua scripts) |
| IP/agent filtering | Bloom filter (pybloom-live) |
| Auth | JWT (python-jose) + bcrypt (asyncio.to_thread) |
| Database | PostgreSQL 15 + SQLAlchemy async |
| Migrations | Alembic |
| Metrics | Prometheus |
| Logging | structlog (JSON output with request_id on every line) |
| Testing | pytest + pytest-asyncio + Locust |
| CI | GitHub Actions |
| Hosting | Render (app) + Upstash (Redis) + Supabase (PostgreSQL) |

Interview Talking Points Worth Owning

"Why Lua scripts in Redis?" — MULTI/EXEC alone can't branch on a value it just read: the count has to come back to the client first, and another client can slip in between that read and the EXEC unless you layer on WATCH-based optimistic locking and retries. A Lua script runs the whole read-count-write cycle atomically on the Redis server, so no intermediate state is ever observable under concurrent load.

"How do you handle a Redis outage?" — Fail open vs. fail closed is a business decision. A bank fails closed — block everything if rate limit state is unavailable. A media site fails open — serve traffic and accept the abuse risk. Expose it as a config flag.

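A sketch of how such a flag could wrap the limiter call (the env var name and helper are illustrative, not the project's actual config; `is_allowed` is the Lua-backed check sketched earlier):

```python
import os

from redis.exceptions import RedisError

FAIL_OPEN = os.getenv("RATE_LIMIT_FAIL_OPEN", "true").lower() == "true"


async def check_rate_limit(client_id: str) -> bool:
    try:
        return await is_allowed(client_id)
    except RedisError:
        # Redis unreachable: a bank returns False here, a media site returns True
        return FAIL_OPEN
```
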
"What about shared IPs and NATs?" — IP alone is a weak identifier. The system layers it with JWT client_id. IP rate limiting catches unauthenticated abuse; user-level limiting catches authenticated abuse. Both are needed, neither is sufficient alone.

"How does the Bloom filter help performance?" — Without it, every request does a Redis SISMEMBER call — a network round-trip. The Bloom filter checks the same list from process memory in microseconds. At 0.1% false positive rate, 1 in 1000 legitimate IPs might be flagged — which shadow mode surfaces before enforcement is enabled.

"What would you change at 10x scale?" — Move to Redis Cluster to eliminate the single point of failure. Load detection rules from Redis at runtime instead of config at deploy time. Add ML anomaly detection as a second signal layer. Per-datacenter rate limiting with global sync.


What I'd Do Differently

The most valuable lesson wasn't any individual component — it was build order. The pattern that worked: environment → infrastructure → config → database models → core clients → services → API layer → workers. Never jumping a stage. A broken Redis client makes every rate limiter test confusing. A broken DB session makes every auth test unreliable.

The second lesson: cross-check against your spec after you think you're done. The graduated response system, user-agent fingerprinting, and several Prometheus metrics were all missing from my "complete" implementation until I ran a systematic audit.


Try It

The live demo is running at api-gateway-with-abuse-detection.onrender.com/docs. Register a user, grab a JWT, hit the gateway endpoint 110 times in parallel, and watch the 429s start. Shadow stats accumulate at /admin/shadow-stats if you have an admin token.

Source, DESIGN.md, and load test scenarios: github.com/macaulaypraise/api-gateway-with-abuse-detection


Tags: python fastapi redis security webdev

Top comments (13)

ArkForge

The entropy-based scraping detection is a solid approach, but Poisson jitter is a known countermeasure - modern credential stuffing frameworks already add randomized delays specifically to defeat low-entropy detection. A more resilient signal: correlate entropy with the 4xx ratio for the same client_id. Your two-dimensional auth failure tracking already collects that data. A legitimate API poller will have low inter-request entropy but near-zero auth failures; a credential stuffer with Poisson jitter will still surface as anomalous once you cross-reference against its failure rate. That combination is significantly harder to defeat without slowing the attack to the point of impracticality.

Rahul S

Worth noting there's an even nastier evasion than Poisson jitter — replaying captured human session timings. If an attacker records real user inter-request gaps and uses those as their delay distribution, entropy-based detection is basically blind to it. At that point you're down to correlating across signals the attacker can't easily spoof: does the TLS fingerprint match the claimed user-agent, is the source IP geographically consistent across the session, does the request path diversity look like an actual user navigating vs. a script hitting the same endpoint. Single-signal detection is always going to be an arms race, the real leverage is making them defeat multiple orthogonal checks simultaneously.

Macaulay Praise

That's the honest ceiling of pure behavioral heuristics — once an attacker is replaying real session distributions, you've lost the timing signal entirely. The multi-signal point is exactly right though; the value isn't any single check but making them defeat orthogonal ones simultaneously. TLS fingerprinting and path diversity are harder to spoof consistently at scale in a way that also passes auth failure thresholds and Bloom checks at the same time. Each layer you add raises the attacker's cost. Appreciate you extending ArkForge's point — this thread has mapped out a real extension roadmap.

Macaulay Praise

Really sharp catch — Poisson jitter is a known blind spot for pure entropy detection and I didn't address it. The cross-correlation idea is clean though, and the data's already there: auth failure counts per client_id exist in Redis from the 2D tracking, so it'd be an additional scoring step rather than new infrastructure. A jittered stuffer still fails logins — combining that failure rate against the entropy score makes it significantly harder to defeat both signals at once without slowing the attack to uselessness. Flagging this as a documented extension, appreciate the depth here.

BridgeXAPI

Really interesting 😁😁 I’ve seen rate limiting fall apart pretty fast once you deal with more distributed or “low and slow” abuse. Ran into something similar with OTP flows, retries can look totally legit but still mess up delivery at scale. Behavior based detection makes a lot of sense here. What kind of signals worked best for you?

Macaulay Praise

The timing entropy signal handles "low and slow" pretty well — bots that throttle themselves still get caught if their inter-request gaps are too regular (std dev below threshold). The two-dimensional auth failure tracking ended up being the most reliable signal overall though — tracking by IP and by username independently avoids the corporate NAT false positive trap.
Your OTP case is a good edge — retries that look legit individually but create delivery pressure at scale would probably need a third axis tracking failure rate per action type. Hadn't thought about it that way before.

BridgeXAPI

ah that’s actually really interesting, especially the timing entropy part. makes sense that even “slow” bots still look too consistent over time. the IP + username split is smart too, that NAT issue is always annoying 😅

yeah with OTP it gets weird because retries aren’t always failures, but they still put pressure on delivery. tracking per action type sounds like a solid way to catch that without breaking legit flows

Macaulay Praise

Yeah exactly — retries that aren't failures are the tricky part, it's pressure without a clear signal. Tracking delivery rate or time-to-success per action type might be the cleaner handle there rather than failure counts alone. Might revisit that as an extension!

Mykola Kondratiuk

good for learning internals - bloom filters + behavioral detection is useful to understand deeply. but in production: you probably shouldn’t own this surface. Cloudflare/AWS WAF already handle this. building it custom means maintaining a security-critical component indefinitely.

Macaulay Praise

Totally fair — and I'd agree in most production contexts. The goal here was never to compete with WAF products but to understand what they're actually doing under the hood. Cloudflare/AWS WAF are the right answer operationally; being able to reason about why — sliding windows, probabilistic filtering, behavioral signals — is what makes you useful when those tools need tuning, custom rules, or you're working somewhere that can't just throw a WAF in front.
The maintenance burden point is real though, it's a genuine cost most teams shouldn't take on.

Mykola Kondratiuk

yeah fair point - honestly the real payoff is when you're debugging why cloudflare's rules are misfiring in prod. knowing the internals means you're not just clicking through the WAF console hoping something helps

Macaulay Praise

Exactly that — knowing what's underneath means you're debugging with a model, not just guessing at knobs. Appreciate you engaging on it.

Mykola Kondratiuk

Right — that intuition compounds fast. When the decision tree is actually in your head, you stop chasing symptoms and start eliminating causes. The "just adjust the knobs" approach works fine until you hit a production edge case at 2am — then the difference between a 20-minute fix and a 3-hour postmortem is exactly whether you know why the gateway decided what it did.