DevOps Journey

How I Built a Real-Time DDoS Detection Engine from Scratch (No Fail2Ban Allowed)

When my boss said "build something that watches all incoming traffic, learns what normal looks like, and automatically blocks attackers" — I had no idea where to start. No Fail2Ban. No rate-limiting libraries. Just Python, math, and Linux firewall rules.

This is the story of how I built it, and how you can understand every piece of it — even if you've never worked on security tooling before.


What Does This Project Do?

Imagine a security guard standing at the entrance of a building. Their job is to:

  1. Watch everyone who comes in
  2. Learn what a normal busy day looks like
  3. Notice when something unusual happens — like 200 people trying to enter at once
  4. Act — block the suspicious person and alert the team

That's exactly what this tool does, but for HTTP traffic hitting a web server.

The system runs alongside a Nextcloud instance and continuously monitors Nginx access logs. When it detects abnormal traffic — either from a single aggressive IP or a global traffic spike — it automatically blocks the attacker using Linux firewall rules and sends a Slack alert within 10 seconds.


The Architecture

Internet Traffic
       ↓
Nginx (writes JSON logs) ──→ Nextcloud
       ↓
[shared log volume]
       ↓
Detector Daemon (Python)
├── Monitor → reads log line by line
├── Baseline → learns normal traffic patterns
├── Detector → spots anomalies using math
├── Blocker → adds iptables firewall rules
├── Unbanner → releases bans on a schedule
├── Notifier → sends Slack alerts
└── Dashboard → live web UI

Everything runs in Docker containers. The detector mounts the Nginx log volume read-only so it can watch logs without touching the web server.


How the Sliding Window Works

The first problem to solve: how do you measure "how many requests per second is this IP sending right now?"

The answer is a sliding window — a moving view of the last 60 seconds of traffic.

Think of it like a 60-second conveyor belt. Every time a request arrives, we add a timestamp to the belt. Every time we want to know the current rate, we count how many timestamps are still on the belt (within the last 60 seconds). Old timestamps fall off the end automatically.

In Python, we use collections.deque — a double-ended queue that lets us add to the right and remove from the left efficiently:

from collections import deque
import time

# One deque per IP
ip_window = deque()

def add_request(ip_window):
    now = time.time()
    cutoff = now - 60  # 60-second window

    # Add new timestamp
    ip_window.append(now)

    # Evict expired timestamps from the left
    while ip_window and ip_window[0] < cutoff:
        ip_window.popleft()

def get_rate(ip_window):
    # Evict stale timestamps first, so an idle IP doesn't
    # keep an inflated count from requests it sent minutes ago
    cutoff = time.time() - 60
    while ip_window and ip_window[0] < cutoff:
        ip_window.popleft()

    # Rate = number of timestamps left in the window
    return len(ip_window)

We maintain two windows:

  • Per-IP window: tracks requests from a single IP address
  • Global window: tracks all requests across all IPs

This is the core of real-time rate measurement. No databases, no counters that reset every minute — just timestamps in a deque.
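Combining the two, the update loop might look something like this — a sketch in which `record` and `rates` are hypothetical helpers, with one deque per IP plus a shared global deque:

```python
from collections import defaultdict, deque

WINDOW = 60  # seconds

per_ip = defaultdict(deque)   # one deque of timestamps per IP
global_window = deque()       # every request, regardless of source IP

def record(ip, now):
    """Register one request at time `now` in both windows."""
    cutoff = now - WINDOW
    for window in (per_ip[ip], global_window):
        window.append(now)
        # Evict anything that has slid out of the 60-second view
        while window and window[0] < cutoff:
            window.popleft()

def rates(ip):
    """Return (per-IP rate, global rate) for the current window."""
    return len(per_ip[ip]), len(global_window)
```

Each log line costs one append plus however many evictions are due, so the work stays proportional to traffic rather than to window size.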


How the Baseline Learns From Traffic

Detecting "too many requests" only makes sense if you know what "normal" looks like. That's where the baseline comes in.

Every second, we record how many total requests arrived. We store these per-second counts in a rolling 30-minute window — again using a deque:

baseline_window = deque()  # stores (timestamp, count) tuples

def add_second_count(count):
    now = time.time()
    cutoff = now - (30 * 60)  # 30 minutes

    baseline_window.append((now, count))

    # Evict data older than 30 minutes
    while baseline_window and baseline_window[0][0] < cutoff:
        baseline_window.popleft()

Every 60 seconds, we recalculate the mean (average) and standard deviation from all the counts in the window:

import math

def compute_stats(values):
    n = len(values)
    if n < 2:
        # Not enough data for a sample variance; fall back to floor values
        return 1.0, 1.0
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    stddev = math.sqrt(variance)
    return mean, stddev

The mean tells us "on average, how many requests per second do we get?" The standard deviation tells us "how much does it vary?"

We also store results in hourly slots — so if your server gets more traffic in the afternoon than at night, the baseline adapts to each hour's pattern rather than using one global average.

Floor values: When the system first starts, there isn't enough data yet. We use floor values (mean=1.0, stddev=1.0) until at least 10 samples are collected. This prevents false positives during startup.
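Putting the hourly slots and floor values together, the bookkeeping could be sketched like this (the slot layout and helper names are my assumptions, not the project's exact code):

```python
import time

MIN_SAMPLES = 10
FLOOR = (1.0, 1.0)  # (mean, stddev) used until enough data exists

# 24 slots, one per hour of the day; None = "not learned yet"
hourly_stats = {hour: None for hour in range(24)}

def store_stats(hour, mean, stddev, sample_count):
    # Only trust the stats once we have enough samples behind them
    if sample_count >= MIN_SAMPLES:
        hourly_stats[hour] = (mean, stddev)

def effective_baseline(hour=None):
    # Fall back to floor values for hours we haven't learned yet
    hour = time.localtime().tm_hour if hour is None else hour
    return hourly_stats[hour] or FLOOR
```

So 3 a.m. traffic is judged against the 3 a.m. baseline, and an unlearned hour never produces a divide-by-zero or a hair-trigger threshold.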


How the Detection Logic Makes a Decision

Once we have a baseline, we can detect anomalies. We use two conditions — whichever fires first triggers a block:

Condition 1: Z-Score > 3.0

The z-score measures how many standard deviations away from the mean the current rate is:

def zscore(value, mean, stddev):
    if stddev == 0:
        return 0
    return (value - mean) / stddev

A z-score of 3.0 means "this value is 3 standard deviations above the mean." For normally distributed traffic, a value that extreme shows up only about 0.1% of the time. If we see it, something unusual is happening.
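A quick worked example with assumed numbers: if the baseline mean is 10 req/s with a standard deviation of 2, an IP suddenly sending 25 req/s is nowhere near normal:

```python
def zscore(value, mean, stddev):
    if stddev == 0:
        return 0
    return (value - mean) / stddev

# Baseline: mean 10 req/s, stddev 2. An IP jumps to 25 req/s:
z = zscore(25, 10, 2)  # 7.5 standard deviations above the mean
```

At z = 7.5 the 3.0 threshold fires with room to spare.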

Condition 2: Rate > 5x the Mean

Even if the standard deviation is small, a rate 5 times higher than normal is suspicious:

is_anomalous = (
    zscore(ip_rate, mean, stddev) > 3.0 or
    ip_rate > 5.0 * mean
)

Error Surge Tightening

If an IP is also generating lots of 4xx/5xx errors (like hammering login endpoints), we tighten the thresholds automatically — the z-score threshold drops from 3.0 to 1.5 and the rate multiplier from 5x to 2.5x. A misbehaving IP gets less tolerance.
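The full decision, including the tightening, could look like the sketch below. The threshold pairs come from the article; the 50% error-share trigger is an assumption I've made for illustration:

```python
Z_BASE, MULT_BASE = 3.0, 5.0      # normal thresholds
Z_TIGHT, MULT_TIGHT = 1.5, 2.5    # tightened thresholds for error surges
ERROR_SHARE_LIMIT = 0.5           # assumed trigger: >50% of requests are 4xx/5xx

def thresholds(error_share):
    if error_share > ERROR_SHARE_LIMIT:
        return Z_TIGHT, MULT_TIGHT
    return Z_BASE, MULT_BASE

def is_anomalous(rate, mean, stddev, error_share=0.0):
    z_threshold, multiplier = thresholds(error_share)
    z = 0.0 if stddev == 0 else (rate - mean) / stddev
    return z > z_threshold or rate > multiplier * mean
```

The same traffic rate can pass under normal thresholds yet trip the tightened ones once the IP starts throwing errors.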


How iptables Blocks an IP

When an anomaly is detected, we need to actually stop the traffic. Linux has a built-in firewall called iptables that operates at the kernel level — before the traffic even reaches Nginx or our application.

We add a DROP rule using Python's subprocess:

import subprocess

def ban_ip(ip):
    subprocess.run([
        "iptables",
        "-A", "INPUT",   # append to the INPUT chain
        "-s", ip,        # source IP to match
        "-j", "DROP"     # action: drop the packet silently
    ], check=True)       # raise if iptables fails (e.g. not running as root)

When a packet is dropped, the sender gets no response — it's like the server doesn't exist. This is more effective than sending a "rejected" response because it wastes the attacker's time waiting for a reply that never comes.

To unban:

def unban_ip(ip):
    subprocess.run([
        "iptables",
        "-D", "INPUT",   # delete the matching rule from the INPUT chain
        "-s", ip,
        "-j", "DROP"
    ], check=True)

Auto-Unban with Backoff

Bans aren't permanent by default. We use a backoff schedule:

  • First ban: 10 minutes
  • Second ban: 30 minutes
  • Third ban: 2 hours
  • Fourth ban onwards: permanent

This is fair — if an IP triggers once, it might be a misconfigured client. If it keeps coming back, it gets progressively longer bans.
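The schedule itself reduces to a tiny lookup — a sketch, where `None` stands in for "permanent":

```python
# Backoff schedule from the list above; None means permanent
SCHEDULE = [10 * 60, 30 * 60, 2 * 60 * 60]  # 10 min, 30 min, 2 h

def ban_duration(ban_count):
    """Seconds for an IP's Nth ban (1-indexed); None means permanent."""
    if ban_count <= len(SCHEDULE):
        return SCHEDULE[ban_count - 1]
    return None
```

The unbanner just skips any IP whose duration is `None`.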


The Dashboard

A live web dashboard built with Flask shows:

  • Current global request rate
  • Effective mean and standard deviation
  • All currently banned IPs with their ban conditions
  • Top 10 source IPs by request count
  • CPU and memory usage
  • System uptime

It refreshes every 3 seconds automatically so you can watch the system react in real time.


Slack Alerts

Every significant event sends a Slack message:

  • 🚨 IP Ban: IP address, condition that fired, current rate, baseline mean, timestamp
  • IP Unban: same info plus next ban duration
  • ⚠️ Global Anomaly: when total traffic spikes (no single IP to block)
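Slack's incoming webhooks accept a JSON body with a `"text"` field, so the alert path needs nothing beyond the standard library. A sketch — the message wording and helper names here are illustrative, not the project's exact code:

```python
import json
import urllib.request

def build_ban_alert(ip, condition, rate, mean):
    # Illustrative message shape for an IP-ban alert
    text = (f":rotating_light: Banned {ip}: {condition} "
            f"({rate:.1f} req/s vs baseline mean {mean:.1f})")
    return {"text": text}

def send_alert(webhook_url, payload):
    # POST the JSON payload to the Slack incoming webhook
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```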

What I Learned

Building this from scratch taught me things no tutorial covers:

  • Math matters in production: z-scores aren't just textbook concepts — they're genuinely useful for anomaly detection
  • Deques are underrated: Python's collections.deque is perfect for sliding windows — O(1) append and popleft
  • Baselines must be dynamic: hardcoding thresholds fails in production because traffic patterns change by hour, day, and season
  • iptables is powerful: blocking at the kernel level is far more effective than application-level rate limiting

Try It Yourself

The full source code is available at:
https://github.com/travispocr/hng-stage3-devops

The live dashboard is running at:
http://psitdev.duckdns.org:8080

To run it on your own server:

git clone https://github.com/travispocr/hng-stage3-devops.git
cd hng-stage3-devops
cp .env.example .env
# Edit .env with your Slack webhook URL
docker compose up -d

Built as part of the HNG Internship DevOps Track — Stage 3.
