No Fail2Ban. No rate-limiting libraries. No shortcuts. Just Python, a deque, and statistics.
There is a moment every engineer dreads.
You are staring at your monitoring dashboard. The request graph is vertical. Your server is on its knees. Legitimate users are getting timeouts. And somewhere out there, an attacker is running a script they downloaded in five minutes - while your defence took you zero minutes to build, because you had none.
That was the situation at cloud.ng, a rapidly growing cloud storage platform powered by Nextcloud. After a wave of suspicious activity, I was handed a mandate: build an anomaly detection engine that watches all incoming HTTP traffic in real time, learns what normal looks like, and automatically responds when something deviates.
No off-the-shelf tools. No Fail2Ban. Build it from scratch.
This is the full story - every decision, every line of reasoning, every tradeoff.
First, Understand the Enemy
A DDoS attack - Distributed Denial of Service - works by exhausting your server's resources. Your server has a finite ability to handle connections. When an attacker sends 10,000 requests per second, your server spends all its time handling garbage traffic and has nothing left for real users.
But here is the subtler problem that most tutorials gloss over: you cannot just block "high traffic" IPs.
Traffic is not constant. A cloud storage platform gets hammered at 9 AM when everyone starts their workday. It goes quiet at 3 AM. If you set a fixed threshold - say, "block any IP sending more than 50 requests per minute" - you will:
- Miss attacks during busy hours when 50 req/min is perfectly normal
- Flag innocent users during quiet hours when 50 req/min is genuinely suspicious
What you need is a system that learns what normal looks like for right now - and flags deviations from that learned baseline. That is a statistics problem, not a configuration problem.
This insight shaped every architectural decision I made.
The Architecture
Before writing a single line of code, I mapped out the full system:
┌─────────────┐     HTTP      ┌─────────────────┐     proxy     ┌──────────────┐
│  Internet   │ ────────────► │      Nginx      │ ────────────► │  Nextcloud   │
│  (clients)  │               │ (reverse proxy) │               │    (app)     │
└─────────────┘               └─────────────────┘               └──────────────┘
                                       │
                                       │ JSON access logs
                                       ▼
                              ┌─────────────────┐   (named Docker volume)
                              │ hng-access.log  │
                              └─────────────────┘
                                       │
                                       │ tail -f (continuous)
                                       ▼
                              ┌─────────────────┐
                              │ Detector Daemon │
                              │                 │
                              │  monitor.py     │ ◄── reads log
                              │  baseline.py    │ ◄── learns normal
                              │  detector.py    │ ◄── spots anomalies
                              │  blocker.py     │ ◄── calls iptables
                              │  unbanner.py    │ ◄── releases bans
                              │  notifier.py    │ ◄── pings Slack
                              │  dashboard.py   │ ◄── live metrics UI
                              └─────────────────┘
                                  │         │
                                  ▼         ▼
                              iptables    Slack
                              DROP rule   #hng-alerts
Nginx sits in front of Nextcloud and writes every HTTP request as a JSON log line. The detector daemon reads that file continuously, processes each entry, and takes action when needed.
The key design principle: the daemon runs forever as a systemd service. It is not a cron job. It is not a script you run manually. It is always watching.
Part 1: The Foundation - JSON Logs with All the Right Fields
Everything starts with Nginx writing structured logs. Without good data, nothing else works.
Here is the log format I configured:
log_format json_logs escape=json
    '{'
        '"source_ip":"$remote_addr",'
        '"timestamp":"$time_iso8601",'
        '"method":"$request_method",'
        '"path":"$request_uri",'
        '"status":$status,'
        '"response_size":$body_bytes_sent,'
        '"http_host":"$host",'
        '"user_agent":"$http_user_agent"'
    '}';

access_log /var/log/nginx/hng-access.log json_logs;
Every log line looks like this:
{
  "source_ip": "1.2.3.4",
  "timestamp": "2026-04-29T21:48:35+00:00",
  "method": "GET",
  "path": "/",
  "status": 200,
  "response_size": 6674,
  "http_host": "cloud.ng",
  "user_agent": "Mozilla/5.0..."
}
One critical detail: I configured Nginx to extract the real client IP from the X-Forwarded-For header. Since Nginx is a Docker container proxying to another Docker container, without this, every request would appear to come from the Docker gateway IP - completely useless for per-IP detection.
set_real_ip_from 172.16.0.0/12;
real_ip_header X-Forwarded-For;
real_ip_recursive on;
The logs are written to a named Docker volume called HNG-nginx-logs. Both Nginx (read-write) and Nextcloud (read-only) mount this volume. The detector daemon reads from the host path of the same volume.
Part 2: The Log Tailer - Watching the File Live
The first module, monitor.py, does one thing: follow the log file and yield parsed entries.
import os
import time

def tail_log(log_path: str):
    """
    Generator that yields one parsed dict per HTTP request.
    Works like `tail -f` - blocks waiting for new lines.
    """
    # Wait for the file to exist (Nginx may not have written anything yet)
    while not os.path.exists(log_path):
        time.sleep(2)

    with open(log_path, 'r', encoding='utf-8') as f:
        # Seek to END — we only care about new requests, not history
        f.seek(0, 2)
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.05)  # 50ms poll — barely any CPU cost
                continue
            entry = _parse(line.strip())
            if entry:
                yield entry
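The _parse helper is not shown above. A minimal sketch of what it could look like, assuming each line is one JSON object in the format Nginx writes:

import json

def _parse(line: str):
    """Parse one JSON log line; return None for blank or malformed lines."""
    if not line:
        return None
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        # Nginx can write a partial line during rotation - just skip it
        return None
    # Only keep entries that carry the fields the detector needs
    if 'source_ip' not in entry or 'status' not in entry:
        return None
    return entry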
The f.seek(0, 2) is important. Without it, on every restart, the daemon would replay the entire log history and potentially re-ban IPs that have already served their time.
The 0.05 second sleep between empty reads means the daemon uses almost zero CPU when traffic is quiet, but responds to new log entries within 50 milliseconds.
Part 3: The Sliding Window - The Heart of Rate Detection
Here is where most tutorials get it wrong.
The wrong way: Count requests per minute. Reset the counter every 60 seconds.
The problem: this creates "bucket boundaries". An attacker can send 59 requests at second 59, wait 2 seconds, send 59 more at second 61, and your per-minute counter never sees more than 59. They are attacking continuously but your counter says they are fine.
The right way: A true rolling window using collections.deque.
from collections import deque
import time


class SlidingWindowCounter:
    """
    Each entry in the deque is a float timestamp of one request.
    The deque always contains only the requests from the last window_seconds.
    """

    def __init__(self, window_seconds: int = 60):
        self.window_seconds = window_seconds
        self._timestamps = deque()      # timestamps of individual requests
        self._err_timestamps = deque()  # timestamps of error requests (4xx/5xx)

    def add(self, ts: float, is_error: bool = False):
        # New request joins from the RIGHT
        self._timestamps.append(ts)
        if is_error:
            self._err_timestamps.append(ts)
        self._evict(ts)

    def _evict(self, now: float):
        cutoff = now - self.window_seconds
        # Stale requests fall off the LEFT
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()  # O(1) — this is why deque, not list
        while self._err_timestamps and self._err_timestamps[0] < cutoff:
            self._err_timestamps.popleft()

    def rate(self) -> float:
        # Exact rolling rate — no bucketing, no approximation
        return len(self._timestamps) / self.window_seconds

    def error_rate(self) -> float:
        return len(self._err_timestamps) / self.window_seconds
Picture the deque as a conveyor belt. Requests board at the right. Requests older than 60 seconds fall off the left automatically. At any moment, dividing the belt's occupancy by 60 gives the exact requests-per-second for the last minute.
Why O(1) matters: list.pop(0) shifts every element — O(n). deque.popleft() is O(1). At 10,000 requests per second across thousands of IPs, the difference between O(1) and O(n) is the difference between a daemon that keeps up and one that falls behind and misses attacks.
Every IP gets its own SlidingWindowCounter. There is also one global counter tracking all traffic combined - this catches distributed attacks where no single IP is hitting hard, but collectively they are overwhelming the server.
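In the daemon's main loop, that wiring might look roughly like this - a sketch that only assembles the pieces shown above; the dict-of-counters is my illustration, not the project's exact code:

import time
from collections import defaultdict

# SlidingWindowCounter and tail_log are defined earlier in this article
per_ip = defaultdict(SlidingWindowCounter)   # one rolling window per source IP
global_counter = SlidingWindowCounter()      # all traffic combined, catches distributed floods

for entry in tail_log('/var/log/nginx/hng-access.log'):
    now = time.time()
    is_error = entry['status'] >= 400
    per_ip[entry['source_ip']].add(now, is_error)
    global_counter.add(now, is_error)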
Part 4: The Baseline - Teaching the System What Normal Looks Like
The sliding window answers what is happening now. The baseline answers what normally happens.
I implemented a rolling 30-minute baseline of per-second request counts.
Timeline ──────────────────────────────────────────────────────►
│ 1req │ 0req │ 3req │ 2req │ ... │ 5req │ 2req │ NOW │
└──────────────────────────────────────────────────────┘
◄────────── 30 minutes = 1800 seconds ──────────►
Each slot represents one completed second. Every 60 seconds, the system recomputes:
def _recalculate(self):
    samples = [count for _, count in self._window]
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n
    stddev = math.sqrt(variance)

    # Floor values prevent division-by-zero at startup
    self.effective_mean = max(mean, self.floor_mean)        # config: 0.1
    self.effective_stddev = max(stddev, self.floor_stddev)  # config: 0.1
The Hour-Slot System
Here is the part that makes the baseline actually smart.
Traffic at 9 AM is genuinely different from traffic at 3 AM. If you use a single rolling window across all time, the baseline at 3 AM will reflect the busy afternoon - making it too permissive, missing real attacks.
I maintain per-hour slots:
hour = datetime.fromtimestamp(ts).hour
self._hour_slots[hour].append(count) # {0: [...], 1: [...], ..., 23: [...]}
During recalculation:
hour_data = self._hour_slots.get(current_hour, [])
if len(hour_data) >= 60:
    # Enough data for this hour — use it (more accurate)
    samples = hour_data
else:
    # Fall back to the full rolling window
    samples = [count for _, count in self._window]
Once the system has been running for an hour, the 3 AM baseline reflects actual 3 AM traffic. The 9 AM baseline reflects actual 9 AM traffic. The system adapts to time-of-day patterns automatically, with zero configuration.
This is also why effective_mean is never, ever hardcoded. It is always computed from real traffic data.
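Tying Part 4 together, the baseline's ingestion side might look roughly like this - a sketch in which the attribute names follow the fragments above, but the constructor and record_second method are my illustration:

from collections import defaultdict, deque
from datetime import datetime

class RollingBaseline:
    def __init__(self, window_seconds: int = 1800,
                 floor_mean: float = 0.1, floor_stddev: float = 0.1):
        self._window = deque(maxlen=window_seconds)  # (timestamp, count), one per completed second
        self._hour_slots = defaultdict(list)         # hour of day -> list of per-second counts
        self.floor_mean = floor_mean
        self.floor_stddev = floor_stddev
        self.effective_mean = floor_mean              # safe defaults until the first recalculation
        self.effective_stddev = floor_stddev

    def record_second(self, ts: float, count: int):
        """Feed one completed second's request count into both structures."""
        self._window.append((ts, count))
        hour = datetime.fromtimestamp(ts).hour
        self._hour_slots[hour].append(count)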
Part 5: The Detection Logic - Making the Call
Now the payoff. We have:
- current_rate — from the IP's sliding window
- effective_mean — what is normal right now
- effective_stddev — how much variation is normal
We compute a z-score:
z = (current_rate − effective_mean) / effective_stddev
The z-score is the number of standard deviations the current rate sits above the mean. In a normal distribution:
- z = 1.0 → slightly above average (84th percentile)
- z = 2.0 → notably above average (97th percentile)
- z = 3.0 → very far above average (99.87th percentile)
At z ≥ 3.0, something is almost certainly not normal.
But z-score alone has a weakness: if traffic is so explosive that the baseline itself has not had time to update yet, the stddev might be large and the z-score misleadingly small. So I add a second condition:
# Flag if EITHER condition fires — whichever comes first
is_anomalous = (zscore >= 3.0) or (rate >= 5 * effective_mean)
The 5× multiplier catches the "traffic just went vertical" scenario that z-score might miss.
The Error Surge Tightener
Sometimes attackers probe rather than flood. They scan your endpoints looking for vulnerabilities, generating lots of 4xx and 5xx errors without necessarily hitting the raw request rate threshold.
The system detects this:
if ip_error_rate >= 3 * baseline_error_mean:
    # This IP is generating errors at 3× the normal rate
    # Tighten the detection threshold — flag it sooner
    z_threshold = max(z_threshold - 1.5, 1.0)
An IP generating suspicious error patterns gets caught at a lower threshold than regular traffic. This fires independently of the raw rate - it catches the probers.
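Pulled together, the per-IP decision reads roughly like this - a sketch in which the function name and the baseline attribute names (error_mean in particular) are illustrative rather than the project's actual API:

def is_anomalous(ip_counter: SlidingWindowCounter, baseline) -> bool:
    """Decide whether one IP's current behaviour deviates from the learned baseline."""
    rate = ip_counter.rate()
    z_threshold = 3.0

    # Error-surge tightener: probers get flagged at a lower z-score
    if ip_counter.error_rate() >= 3 * baseline.error_mean:
        z_threshold = max(z_threshold - 1.5, 1.0)

    zscore = (rate - baseline.effective_mean) / baseline.effective_stddev

    # Either condition fires: statistical outlier OR a raw 5x surge the baseline has not caught up with
    return zscore >= z_threshold or rate >= 5 * baseline.effective_mean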
Part 6: Blocking with iptables - The Linux Firewall
When an IP is flagged, the daemon calls the Linux kernel's firewall directly:
subprocess.run(
    ['iptables', '-I', 'INPUT', '-s', offending_ip, '-j', 'DROP'],
    check=True
)
-I INPUT inserts the rule at the top of the INPUT chain — it gets evaluated first, before anything else.
-j DROP silently discards the packet. No TCP RST. No HTTP response. The attacker's requests just vanish into nothing.
This is more effective than returning a 403 for two reasons:
- It consumes zero application resources - the kernel drops the packet before it ever reaches Nginx
- The attacker gets no feedback - they cannot even confirm the server is alive
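Unbanning is the mirror image: the same rule spec is deleted with -D. A minimal sketch of the blocker, assuming this class shape (the real blocker.py may well differ):

import subprocess
import time

class Blocker:
    def __init__(self):
        self._bans = {}  # ip -> {'ip': ..., 'banned_at': ..., 'duration': ...}

    def ban(self, ip: str, duration: int):
        # -I INPUT: insert at the top of the chain so it is evaluated first
        subprocess.run(['iptables', '-I', 'INPUT', '-s', ip, '-j', 'DROP'], check=True)
        self._bans[ip] = {'ip': ip, 'banned_at': time.time(), 'duration': duration}

    def unban(self, ip: str):
        # -D removes the exact rule that -I inserted
        subprocess.run(['iptables', '-D', 'INPUT', '-s', ip, '-j', 'DROP'], check=True)
        self._bans.pop(ip, None)

    def get_banned_ips(self):
        return list(self._bans.values())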
The Backoff Schedule
Permanent bans on first offence are both unfair and operationally risky. You might ban a legitimate user with a buggy client, or a friendly scraper, and then spend hours investigating why a partner's integration broke.
The system uses a progressive backoff:
| Offence | Duration | Reasoning |
|---|---|---|
| 1st | 10 minutes | Could be a misconfigured client |
| 2nd | 30 minutes | Likely intentional, more time needed |
| 3rd | 2 hours | Persistent offender |
| 4th+ | Permanent | They know what they are doing |
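In code, the schedule reduces to a small lookup keyed on how many times an IP has offended (a sketch; -1 means permanent, matching the unbanner check shown next):

BACKOFF = {1: 600, 2: 1800, 3: 7200}  # seconds: 10 minutes, 30 minutes, 2 hours

def ban_duration(offence_count: int) -> int:
    # Fourth offence and beyond: permanent (-1 is treated as "never release")
    return BACKOFF.get(offence_count, -1)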
A background Unbanner thread wakes every 10 seconds:
def _check_expiry(self):
    now = time.time()
    for ban in self.blocker.get_banned_ips():
        if ban['duration'] == -1:
            continue  # Permanent — never release
        if now - ban['banned_at'] >= ban['duration']:
            self.blocker.unban(ban['ip'])        # removes iptables rule
            self.notifier.send_unban(ban['ip'])  # notifies Slack
The _ban_history dict persists across unbans — so if an IP gets a 10-minute ban, behaves, gets released, then attacks again, it goes straight to the 30-minute tier. The system remembers repeat offenders even after their ban expires.
Part 7: Slack Alerts - Real-Time Visibility
Every significant event fires a Slack message to #hng-alerts:
Ban alert:
🚨 IP BANNED
• Condition: ip_rate_anomaly
• IP address: 1.2.3.4
• Current rate: 12.4000 req/s
• Baseline mean: 0.4200 req/s
• Z-score: 28.57
• Ban duration: 600s
• Timestamp: 2026-04-29T21:48:35
Global anomaly alert (no block - just awareness):
⚠ GLOBAL TRAFFIC ANOMALY
• Condition: global_rate_anomaly
• Current global rate: 48.3000 req/s
• Baseline mean: 3.2000 req/s
• Z-score: 14.09
• Action: Slack alert only (no IP block)
All Slack calls run in daemon threads - the detection loop never waits for a network call.
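A minimal sketch of that fire-and-forget pattern using only the standard library - the webhook URL is a placeholder, and the real notifier.py may format its messages differently:

import json
import threading
import urllib.request

SLACK_WEBHOOK_URL = 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # placeholder

def send_slack(text: str):
    def _post():
        payload = json.dumps({'text': text}).encode('utf-8')
        req = urllib.request.Request(SLACK_WEBHOOK_URL, data=payload,
                                     headers={'Content-Type': 'application/json'})
        try:
            urllib.request.urlopen(req, timeout=5)
        except Exception:
            pass  # never let a Slack hiccup stall the detection loop
    # daemon=True: the thread never blocks detection or process shutdown
    threading.Thread(target=_post, daemon=True).start()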
Part 8: The Live Dashboard
The system serves a live metrics dashboard at http://dashboard.gbadedata.com. It auto-refreshes every 3 seconds via a simple JavaScript setInterval call hitting a Flask /api/metrics endpoint.
The dashboard shows:
- Global requests per second (live)
- Baseline effective_mean and stddev
- CPU and memory usage
- All currently banned IPs with conditions, durations, and ban times
- Top 10 source IPs ranked by current request rate
- System uptime
No frontend framework. No build step. Flask serves an HTML string with embedded JavaScript. The entire dashboard is ~80 lines of code.
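A stripped-down sketch of that shape - not the real dashboard.py, and the metric values here are placeholders standing in for the live counters:

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/api/metrics')
def metrics():
    # In the real daemon these values come from the live counters and baseline
    return jsonify({
        'global_rate': 0.0,
        'effective_mean': 0.1,
        'effective_stddev': 0.1,
        'banned_ips': [],
    })

@app.route('/')
def index():
    # One HTML string with a setInterval() that re-fetches /api/metrics every 3 seconds
    return """<html><body><pre id="out"></pre>
    <script>
      setInterval(async () => {
        const r = await fetch('/api/metrics');
        document.getElementById('out').textContent = JSON.stringify(await r.json(), null, 2);
      }, 3000);
    </script></body></html>"""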
Part 9: The Audit Log
Every ban, unban, and baseline recalculation writes a structured entry to /var/log/detector-audit.log:
[2026-04-29T21:48:35.998876] BAN 172.18.0.1 | ip_rate_anomaly | rate=0.4000 | baseline=0.1000 | duration=600
[2026-04-29T21:58:38.252357] UNBAN 172.18.0.1 | ban_expired | rate=0.0000 | baseline=0.0000 | duration=600
[2026-04-30T12:20:17.038296] BASELINE_RECALC | source=rolling_window (2 samples) | rate=10.5000 | baseline=10.5000 | duration=0
This is your paper trail. If someone asks "why was our partner's IP blocked at 3 AM on Tuesday?" - the answer is in the audit log, timestamped to the millisecond.
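Producing those lines takes very little code; a sketch assuming a helper along these lines:

from datetime import datetime

AUDIT_LOG = '/var/log/detector-audit.log'

def audit(event: str, subject: str, condition: str, rate: float, baseline: float, duration: int):
    # Matches the format above: [ISO timestamp] EVENT subject | condition | rate | baseline | duration
    line = (f"[{datetime.now().isoformat()}] {event} {subject} | {condition} "
            f"| rate={rate:.4f} | baseline={baseline:.4f} | duration={duration}\n")
    with open(AUDIT_LOG, 'a', encoding='utf-8') as f:
        f.write(line)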
What I Would Do Differently
1. Use a proper WSGI server for the dashboard. Flask's development server works, but in production I would swap it for Gunicorn behind the existing Nginx setup.
2. Persist ban history across restarts. Currently, _ban_history lives in memory. If the daemon restarts, it forgets who was a repeat offender. A small SQLite database would fix this with minimal complexity (a sketch follows this list).
3. Add IP allowlisting. Right now, any IP can be banned - including monitoring services, CDN health checkers, and your own office's IP. A config-driven allowlist would prevent embarrassing self-bans.
4. Ship metrics to Prometheus. The dashboard is useful for humans. For alerting, on-call tools, and long-term trend analysis, exposing a /metrics endpoint would integrate with the rest of a modern observability stack.
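For item 2, a minimal sketch of what that SQLite persistence could look like - the table name, schema, and file path are my assumptions, not part of the current codebase:

import sqlite3

DB_PATH = '/var/lib/detector/bans.db'  # hypothetical location

def init_db():
    con = sqlite3.connect(DB_PATH)
    con.execute("""CREATE TABLE IF NOT EXISTS ban_history (
                       ip TEXT PRIMARY KEY,
                       offences INTEGER NOT NULL DEFAULT 0,
                       last_banned_at REAL
                   )""")
    con.commit()
    return con

def record_offence(con, ip: str, ts: float) -> int:
    """Increment and return the offence count for this IP, surviving daemon restarts."""
    con.execute("""INSERT INTO ban_history (ip, offences, last_banned_at)
                   VALUES (?, 1, ?)
                   ON CONFLICT(ip) DO UPDATE SET
                       offences = offences + 1,
                       last_banned_at = excluded.last_banned_at""",
                (ip, ts))
    con.commit()
    return con.execute("SELECT offences FROM ban_history WHERE ip = ?", (ip,)).fetchone()[0]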
The Numbers
After running for 24 hours on the live server:
- Processed thousands of HTTP requests continuously
- Detected and blocked anomalous IPs within 10 seconds of the first suspicious request
- Auto-released bans correctly on the configured schedule
- Dashboard refreshed without interruption
- Zero false positives on legitimate traffic once the baseline had enough data
Final Thoughts
The most important lesson from this project is not about Python or iptables or deques.
It is about measurement before action.
Every security tool that uses hardcoded thresholds is making a guess. This system does not guess - it measures, it learns, and it acts on what it has observed. The z-score is not magic. It is just a way of asking: "is what I am seeing now consistent with what I have seen in the past?"
When the answer is no - decisively, statistically, not-even-close no - you block it.
That is the whole idea.
Full source code: https://github.com/gbadedata/hng14-stage3
Live dashboard: http://dashboard.gbadedata.com