No Fail2Ban. No rate-limiting libraries. No shortcuts. Just Python, a deque, and statistics.
There is a moment every engineer dreads.
You are staring at your monitoring dashboard. The request graph is vertical. Your server is on its knees. Legitimate users are getting timeouts. And somewhere out there, an attacker is running a script they downloaded in five minutes - while your defence took you zero minutes to build, because you had none.
That was the situation at cloud.ng, a rapidly growing cloud storage platform powered by Nextcloud. After a wave of suspicious activity, I was handed a mandate: build an anomaly detection engine that watches all incoming HTTP traffic in real time, learns what normal looks like, and automatically responds when something deviates.
No off-the-shelf tools. No Fail2Ban. Build it from scratch.
This is the full story - every decision, every line of reasoning, every tradeoff.
First, Understand the Enemy
A DDoS attack - Distributed Denial of Service - works by exhausting your server's resources. Your server has a finite ability to handle connections. When an attacker sends 10,000 requests per second, your server spends all its time handling garbage traffic and has nothing left for real users.
But here is the subtler problem that most tutorials gloss over: you cannot just block "high traffic" IPs.
Traffic is not constant. A cloud storage platform gets hammered at 9 AM when everyone starts their workday. It goes quiet at 3 AM. If you set a fixed threshold - say, "block any IP sending more than 50 requests per minute" - you will:
- Miss attacks during busy hours when 50 req/min is perfectly normal
- Flag innocent users during quiet hours when 50 req/min is genuinely suspicious
What you need is a system that learns what normal looks like for right now - and flags deviations from that learned baseline. That is a statistics problem, not a configuration problem.
This insight shaped every architectural decision I made.
The Architecture
Before writing a single line of code, I mapped out the full system:
┌─────────────┐     HTTP      ┌─────────────────┐     proxy     ┌──────────────┐
│  Internet   │ ────────────► │      Nginx      │ ────────────► │  Nextcloud   │
│  (clients)  │               │ (reverse proxy) │               │    (app)     │
└─────────────┘               └─────────────────┘               └──────────────┘
                                       │
                                       │ JSON access logs
                                       ▼
                              ┌─────────────────┐   (named Docker volume)
                              │ hng-access.log  │
                              └─────────────────┘
                                       │
                                       │ tail -f (continuous)
                                       ▼
                              ┌─────────────────┐
                              │ Detector Daemon │
                              │                 │
                              │  monitor.py     │ ◄── reads log
                              │  baseline.py    │ ◄── learns normal
                              │  detector.py    │ ◄── spots anomalies
                              │  blocker.py     │ ◄── calls iptables
                              │  unbanner.py    │ ◄── releases bans
                              │  notifier.py    │ ◄── pings Slack
                              │  dashboard.py   │ ◄── live metrics UI
                              └─────────────────┘
                                  │         │
                                  ▼         ▼
                              iptables    Slack
                              DROP rule   #hng-alerts
Nginx sits in front of Nextcloud and writes every HTTP request as a JSON log line. The detector daemon reads that file continuously, processes each entry, and takes action when needed.
The key design principle: the daemon runs forever as a systemd service. It is not a cron job. It is not a script you run manually. It is always watching.
Part 1: The Foundation - JSON Logs with All the Right Fields
Everything starts with Nginx writing structured logs. Without good data, nothing else works.
Here is the log format I configured:
log_format json_logs escape=json
    '{'
        '"source_ip":"$remote_addr",'
        '"timestamp":"$time_iso8601",'
        '"method":"$request_method",'
        '"path":"$request_uri",'
        '"status":$status,'
        '"response_size":$body_bytes_sent,'
        '"http_host":"$host",'
        '"user_agent":"$http_user_agent"'
    '}';

access_log /var/log/nginx/hng-access.log json_logs;
Every log line looks like this:
{
  "source_ip": "1.2.3.4",
  "timestamp": "2026-04-29T21:48:35+00:00",
  "method": "GET",
  "path": "/",
  "status": 200,
  "response_size": 6674,
  "http_host": "cloud.ng",
  "user_agent": "Mozilla/5.0..."
}
One critical detail: I configured Nginx to extract the real client IP from the X-Forwarded-For header. Since Nginx is a Docker container proxying to another Docker container, without this, every request would appear to come from the Docker gateway IP - completely useless for per-IP detection.
set_real_ip_from 172.16.0.0/12;
real_ip_header X-Forwarded-For;
real_ip_recursive on;
The logs are written to a named Docker volume called HNG-nginx-logs. Both Nginx (read-write) and Nextcloud (read-only) mount this volume. The detector daemon reads from the host path of the same volume.
Part 2: The Log Tailer - Watching the File Live
The first module, monitor.py, does one thing: follow the log file and yield parsed entries.
import os
import time

def tail_log(log_path: str):
    """
    Generator that yields one parsed dict per HTTP request.
    Works like `tail -f` - blocks waiting for new lines.
    """
    # Wait for the file to exist (Nginx may not have written anything yet)
    while not os.path.exists(log_path):
        time.sleep(2)

    with open(log_path, 'r', encoding='utf-8') as f:
        # Seek to END — we only care about new requests, not history
        f.seek(0, 2)
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.05)  # 50ms poll — barely any CPU cost
                continue
            entry = _parse(line.strip())
            if entry:
                yield entry
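The _parse helper is not shown above. A minimal sketch of what it could look like, assuming each line is one JSON object in the format Nginx writes:

import json

def _parse(line: str):
    """Parse one JSON log line; return None for blank or malformed lines."""
    if not line:
        return None
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        # Nginx can write a partial line during rotation - just skip it
        return None
    # Only keep entries that carry the fields the detector needs
    if 'source_ip' not in entry or 'status' not in entry:
        return None
    return entry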
The f.seek(0, 2) is important. Without it, on every restart, the daemon would replay the entire log history and potentially re-ban IPs that have already served their time.
The 0.05 second sleep between empty reads means the daemon uses almost zero CPU when traffic is quiet, but responds to new log entries within 50 milliseconds.
Part 3: The Sliding Window - The Heart of Rate Detection
Here is where most tutorials get it wrong.
The wrong way: Count requests per minute. Reset the counter every 60 seconds.
The problem: this creates "bucket boundaries". An attacker can send 59 requests at second 59, wait 2 seconds, send 59 more at second 61, and your per-minute counter never sees more than 59. They are attacking continuously but your counter says they are fine.
The right way: A true rolling window using collections.deque.
from collections import deque
import time


class SlidingWindowCounter:
    """
    Each entry in the deque is a float timestamp of one request.
    The deque always contains only the requests from the last window_seconds.
    """

    def __init__(self, window_seconds: int = 60):
        self.window_seconds = window_seconds
        self._timestamps = deque()      # timestamps of individual requests
        self._err_timestamps = deque()  # timestamps of error requests (4xx/5xx)

    def add(self, ts: float, is_error: bool = False):
        # New request joins from the RIGHT
        self._timestamps.append(ts)
        if is_error:
            self._err_timestamps.append(ts)
        self._evict(ts)

    def _evict(self, now: float):
        cutoff = now - self.window_seconds
        # Stale requests fall off the LEFT
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()  # O(1) — this is why deque, not list
        while self._err_timestamps and self._err_timestamps[0] < cutoff:
            self._err_timestamps.popleft()

    def rate(self) -> float:
        # Exact rolling rate — no bucketing, no approximation
        return len(self._timestamps) / self.window_seconds

    def error_rate(self) -> float:
        return len(self._err_timestamps) / self.window_seconds
Picture the deque as a conveyor belt. Requests board at the right. Requests older than 60 seconds fall off the left automatically. At any moment, dividing the belt's occupancy by 60 gives the exact requests-per-second for the last minute.
Why O(1) matters: list.pop(0) shifts every element — O(n). deque.popleft() is O(1). At 10,000 requests per second across thousands of IPs, the difference between O(1) and O(n) is the difference between a daemon that keeps up and one that falls behind and misses attacks.
Every IP gets its own SlidingWindowCounter. There is also one global counter tracking all traffic combined - this catches distributed attacks where no single IP is hitting hard, but collectively they are overwhelming the server.
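In the daemon's main loop, that wiring might look roughly like this - a sketch that only assembles the pieces shown above; the dict-of-counters is my illustration, not the project's exact code:

import time
from collections import defaultdict

# SlidingWindowCounter and tail_log are defined earlier in this article
per_ip = defaultdict(SlidingWindowCounter)   # one rolling window per source IP
global_counter = SlidingWindowCounter()      # all traffic combined, catches distributed floods

for entry in tail_log('/var/log/nginx/hng-access.log'):
    now = time.time()
    is_error = entry['status'] >= 400
    per_ip[entry['source_ip']].add(now, is_error)
    global_counter.add(now, is_error)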
Part 4: The Baseline - Teaching the System What Normal Looks Like
The sliding window answers what is happening now. The baseline answers what normally happens.
I implemented a rolling 30-minute baseline of per-second request counts.
Timeline ──────────────────────────────────────────────────────►
│ 1req │ 0req │ 3req │ 2req │ ... │ 5req │ 2req │ NOW │
└──────────────────────────────────────────────────────┘
◄────────── 30 minutes = 1800 seconds ──────────►
Each slot represents one completed second. Every 60 seconds, the system recomputes:
def _recalculate(self):
    samples = [count for _, count in self._window]
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n
    stddev = math.sqrt(variance)

    # Floor values prevent division-by-zero at startup
    self.effective_mean = max(mean, self.floor_mean)        # config: 0.1
    self.effective_stddev = max(stddev, self.floor_stddev)  # config: 0.1
The Hour-Slot System
Here is the part that makes the baseline actually smart.
Traffic at 9 AM is genuinely different from traffic at 3 AM. If you use a single rolling window across all time, the baseline at 3 AM will reflect the busy afternoon - making it too permissive, missing real attacks.
I maintain per-hour slots:
hour = datetime.fromtimestamp(ts).hour
self._hour_slots[hour].append(count) # {0: [...], 1: [...], ..., 23: [...]}
During recalculation:
hour_data = self._hour_slots.get(current_hour, [])
if len(hour_data) >= 60:
    # Enough data for this hour — use it (more accurate)
    samples = hour_data
else:
    # Fall back to the full rolling window
    samples = [count for _, count in self._window]
Once the system has been running for an hour, the 3 AM baseline reflects actual 3 AM traffic. The 9 AM baseline reflects actual 9 AM traffic. The system adapts to time-of-day patterns automatically, with zero configuration.
This is also why effective_mean is never, ever hardcoded. It is always computed from real traffic data.
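Tying Part 4 together, the baseline's ingestion side might look roughly like this - a sketch in which the attribute names follow the fragments above, but the constructor and record_second method are my illustration:

from collections import defaultdict, deque
from datetime import datetime

class RollingBaseline:
    def __init__(self, window_seconds: int = 1800,
                 floor_mean: float = 0.1, floor_stddev: float = 0.1):
        self._window = deque(maxlen=window_seconds)  # (timestamp, count), one per completed second
        self._hour_slots = defaultdict(list)         # hour of day -> list of per-second counts
        self.floor_mean = floor_mean
        self.floor_stddev = floor_stddev
        self.effective_mean = floor_mean              # safe defaults until the first recalculation
        self.effective_stddev = floor_stddev

    def record_second(self, ts: float, count: int):
        """Feed one completed second's request count into both structures."""
        self._window.append((ts, count))
        hour = datetime.fromtimestamp(ts).hour
        self._hour_slots[hour].append(count)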
Part 5: The Detection Logic - Making the Call
Now the payoff. We have:
- current_rate — from the IP's sliding window
- effective_mean — what is normal right now
- effective_stddev — how much variation is normal
We compute a z-score:
z = (current_rate − effective_mean) / effective_stddev
The z-score is the number of standard deviations the current rate sits above the mean. In a normal distribution:
- z = 1.0 → slightly above average (84th percentile)
- z = 2.0 → notably above average (97th percentile)
- z = 3.0 → very far above average (99.87th percentile)
At z ≥ 3.0, something is almost certainly not normal.
But z-score alone has a weakness: if traffic is so explosive that the baseline itself has not had time to update yet, the stddev might be large and the z-score misleadingly small. So I add a second condition:
# Flag if EITHER condition fires — whichever comes first
is_anomalous = (zscore >= 3.0) or (rate >= 5 * effective_mean)
The 5× multiplier catches the "traffic just went vertical" scenario that z-score might miss.
The Error Surge Tightener
Sometimes attackers probe rather than flood. They scan your endpoints looking for vulnerabilities, generating lots of 4xx and 5xx errors without necessarily hitting the raw request rate threshold.
The system detects this:
if ip_error_rate >= 3 * baseline_error_mean:
    # This IP is generating errors at 3× the normal rate
    # Tighten the detection threshold — flag it sooner
    z_threshold = max(z_threshold - 1.5, 1.0)
An IP generating suspicious error patterns gets caught at a lower threshold than regular traffic. This fires independently of the raw rate - it catches the probers.
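Pulled together, the per-IP decision reads roughly like this - a sketch in which the function name and the baseline attribute names (error_mean in particular) are illustrative rather than the project's actual API:

def is_anomalous(ip_counter: SlidingWindowCounter, baseline) -> bool:
    """Decide whether one IP's current behaviour deviates from the learned baseline."""
    rate = ip_counter.rate()
    z_threshold = 3.0

    # Error-surge tightener: probers get flagged at a lower z-score
    if ip_counter.error_rate() >= 3 * baseline.error_mean:
        z_threshold = max(z_threshold - 1.5, 1.0)

    zscore = (rate - baseline.effective_mean) / baseline.effective_stddev

    # Either condition fires: statistical outlier OR a raw 5x surge the baseline has not caught up with
    return zscore >= z_threshold or rate >= 5 * baseline.effective_mean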
Part 6: Blocking with iptables - The Linux Firewall
When an IP is flagged, the daemon calls the Linux kernel's firewall directly:
subprocess.run(
    ['iptables', '-I', 'INPUT', '-s', offending_ip, '-j', 'DROP'],
    check=True
)
-I INPUT inserts the rule at the top of the INPUT chain — it gets evaluated first, before anything else.
-j DROP silently discards the packet. No TCP RST. No HTTP response. The attacker's requests just vanish into nothing.
This is more effective than returning a 403 for two reasons:
- It consumes zero application resources - the kernel drops the packet before it ever reaches Nginx
- The attacker gets no feedback - they cannot even confirm the server is alive
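Unbanning is the mirror image: the same rule spec is deleted with -D. A minimal sketch of the blocker, assuming this class shape (the real blocker.py may well differ):

import subprocess
import time

class Blocker:
    def __init__(self):
        self._bans = {}  # ip -> {'ip': ..., 'banned_at': ..., 'duration': ...}

    def ban(self, ip: str, duration: int):
        # -I INPUT: insert at the top of the chain so it is evaluated first
        subprocess.run(['iptables', '-I', 'INPUT', '-s', ip, '-j', 'DROP'], check=True)
        self._bans[ip] = {'ip': ip, 'banned_at': time.time(), 'duration': duration}

    def unban(self, ip: str):
        # -D removes the exact rule that -I inserted
        subprocess.run(['iptables', '-D', 'INPUT', '-s', ip, '-j', 'DROP'], check=True)
        self._bans.pop(ip, None)

    def get_banned_ips(self):
        return list(self._bans.values())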
The Backoff Schedule
Permanent bans on first offence are both unfair and operationally risky. You might ban a legitimate user with a buggy client, or a friendly scraper, and then spend hours investigating why a partner's integration broke.
The system uses a progressive backoff:
| Offence | Duration | Reasoning |
|---|---|---|
| 1st | 10 minutes | Could be a misconfigured client |
| 2nd | 30 minutes | Likely intentional, more time needed |
| 3rd | 2 hours | Persistent offender |
| 4th+ | Permanent | They know what they are doing |
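In code, the schedule reduces to a small lookup keyed on how many times an IP has offended (a sketch; -1 means permanent, matching the unbanner check shown next):

BACKOFF = {1: 600, 2: 1800, 3: 7200}  # seconds: 10 minutes, 30 minutes, 2 hours

def ban_duration(offence_count: int) -> int:
    # Fourth offence and beyond: permanent (-1 is treated as "never release")
    return BACKOFF.get(offence_count, -1)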
A background Unbanner thread wakes every 10 seconds:
def _check_expiry(self):
    now = time.time()
    for ban in self.blocker.get_banned_ips():
        if ban['duration'] == -1:
            continue  # Permanent — never release
        if now - ban['banned_at'] >= ban['duration']:
            self.blocker.unban(ban['ip'])        # removes iptables rule
            self.notifier.send_unban(ban['ip'])  # notifies Slack
The _ban_history dict persists across unbans — so if an IP gets a 10-minute ban, behaves, gets released, then attacks again, it goes straight to the 30-minute tier. The system remembers repeat offenders even after their ban expires.
Part 7: Slack Alerts - Real-Time Visibility
Every significant event fires a Slack message to #hng-alerts:
Ban alert:
🚨 IP BANNED
• Condition: ip_rate_anomaly
• IP address: 1.2.3.4
• Current rate: 12.4000 req/s
• Baseline mean: 0.4200 req/s
• Z-score: 28.57
• Ban duration: 600s
• Timestamp: 2026-04-29T21:48:35
Global anomaly alert (no block - just awareness):
⚠ GLOBAL TRAFFIC ANOMALY
• Condition: global_rate_anomaly
• Current global rate: 48.3000 req/s
• Baseline mean: 3.2000 req/s
• Z-score: 14.09
• Action: Slack alert only (no IP block)
All Slack calls run in daemon threads - the detection loop never waits for a network call.
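A minimal sketch of that fire-and-forget pattern using only the standard library - the webhook URL is a placeholder, and the real notifier.py may format its messages differently:

import json
import threading
import urllib.request

SLACK_WEBHOOK_URL = 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # placeholder

def send_slack(text: str):
    def _post():
        payload = json.dumps({'text': text}).encode('utf-8')
        req = urllib.request.Request(SLACK_WEBHOOK_URL, data=payload,
                                     headers={'Content-Type': 'application/json'})
        try:
            urllib.request.urlopen(req, timeout=5)
        except Exception:
            pass  # never let a Slack hiccup stall the detection loop
    # daemon=True: the thread never blocks detection or process shutdown
    threading.Thread(target=_post, daemon=True).start()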
Part 8: The Live Dashboard
The system serves a live metrics dashboard at http://dashboard.gbadedata.com. It auto-refreshes every 3 seconds via a simple JavaScript setInterval call hitting a Flask /api/metrics endpoint.
The dashboard shows:
- Global requests per second (live)
- Baseline effective_mean and stddev
- CPU and memory usage
- All currently banned IPs with conditions, durations, and ban times
- Top 10 source IPs ranked by current request rate
- System uptime
No frontend framework. No build step. Flask serves an HTML string with embedded JavaScript. The entire dashboard is ~80 lines of code.
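A stripped-down sketch of that shape - not the real dashboard.py, and the metric values here are placeholders standing in for the live counters:

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/api/metrics')
def metrics():
    # In the real daemon these values come from the live counters and baseline
    return jsonify({
        'global_rate': 0.0,
        'effective_mean': 0.1,
        'effective_stddev': 0.1,
        'banned_ips': [],
    })

@app.route('/')
def index():
    # One HTML string with a setInterval() that re-fetches /api/metrics every 3 seconds
    return """<html><body><pre id="out"></pre>
    <script>
      setInterval(async () => {
        const r = await fetch('/api/metrics');
        document.getElementById('out').textContent = JSON.stringify(await r.json(), null, 2);
      }, 3000);
    </script></body></html>"""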
Part 9: The Audit Log
Every ban, unban, and baseline recalculation writes a structured entry to /var/log/detector-audit.log:
[2026-04-29T21:48:35.998876] BAN 172.18.0.1 | ip_rate_anomaly | rate=0.4000 | baseline=0.1000 | duration=600
[2026-04-29T21:58:38.252357] UNBAN 172.18.0.1 | ban_expired | rate=0.0000 | baseline=0.0000 | duration=600
[2026-04-30T12:20:17.038296] BASELINE_RECALC | source=rolling_window (2 samples) | rate=10.5000 | baseline=10.5000 | duration=0
This is your paper trail. If someone asks "why was our partner's IP blocked at 3 AM on Tuesday?" - the answer is in the audit log, timestamped to the millisecond.
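Producing those lines takes very little code; a sketch assuming a helper along these lines:

from datetime import datetime

AUDIT_LOG = '/var/log/detector-audit.log'

def audit(event: str, subject: str, condition: str, rate: float, baseline: float, duration: int):
    # Matches the format above: [ISO timestamp] EVENT subject | condition | rate | baseline | duration
    line = (f"[{datetime.now().isoformat()}] {event} {subject} | {condition} "
            f"| rate={rate:.4f} | baseline={baseline:.4f} | duration={duration}\n")
    with open(AUDIT_LOG, 'a', encoding='utf-8') as f:
        f.write(line)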
What I Would Do Differently
1. Use a proper WSGI server for the dashboard. Flask's development server works, but in production I would swap it for Gunicorn behind the existing Nginx setup.
2. Persist ban history across restarts. Currently, _ban_history lives in memory. If the daemon restarts, it forgets who was a repeat offender. A small SQLite database would fix this with minimal complexity (a sketch follows this list).
3. Add IP allowlisting. Right now, any IP can be banned - including monitoring services, CDN health checkers, and your own office's IP. A config-driven allowlist would prevent embarrassing self-bans.
4. Ship metrics to Prometheus. The dashboard is useful for humans. For alerting, on-call tools, and long-term trend analysis, exposing a /metrics endpoint would integrate with the rest of a modern observability stack.
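For item 2, a minimal sketch of what that SQLite persistence could look like - the table name, schema, and file path are my assumptions, not part of the current codebase:

import sqlite3

DB_PATH = '/var/lib/detector/bans.db'  # hypothetical location

def init_db():
    con = sqlite3.connect(DB_PATH)
    con.execute("""CREATE TABLE IF NOT EXISTS ban_history (
                       ip TEXT PRIMARY KEY,
                       offences INTEGER NOT NULL DEFAULT 0,
                       last_banned_at REAL
                   )""")
    con.commit()
    return con

def record_offence(con, ip: str, ts: float) -> int:
    """Increment and return the offence count for this IP, surviving daemon restarts."""
    con.execute("""INSERT INTO ban_history (ip, offences, last_banned_at)
                   VALUES (?, 1, ?)
                   ON CONFLICT(ip) DO UPDATE SET
                       offences = offences + 1,
                       last_banned_at = excluded.last_banned_at""",
                (ip, ts))
    con.commit()
    return con.execute("SELECT offences FROM ban_history WHERE ip = ?", (ip,)).fetchone()[0]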
The Numbers
After running for 24 hours on the live server:
- Processed thousands of HTTP requests continuously
- Detected and blocked anomalous IPs within 10 seconds of the first suspicious request
- Auto-released bans correctly on the configured schedule
- Dashboard refreshed without interruption
- Zero false positives on legitimate traffic once the baseline had enough data
Final Thoughts
The most important lesson from this project is not about Python or iptables or deques.
It is about measurement before action.
Every security tool that uses hardcoded thresholds is making a guess. This system does not guess - it measures, it learns, and it acts on what it has observed. The z-score is not magic. It is just a way of asking: "is what I am seeing now consistent with what I have seen in the past?"
When the answer is no - decisively, statistically, not-even-close no - you block it.
That is the whole idea.
Full source code: https://github.com/gbadedata/hng14-stage3
Live dashboard: http://dashboard.gbadedata.com