Gideon Bature

Building a Real-Time DDoS Detection Engine from Scratch: HNG DevOps Stage 3

This is part of my HNG DevOps internship series. Follow along as I document every stage.
Previous articles:
Stage 0: How I Secured a Linux Server from Scratch
Stage 1: Build, Deploy and Reverse Proxy a Rust API
Stage 2: Containerizing a Microservices App with Docker and CI/CD


A Quick Recap

Stage 0 was server hardening. Stage 1 was deploying an API. Stage 2 was containerization and CI/CD. Stage 3 is something different entirely. This time the task was to build a security tool from scratch.

The scenario: I have been hired as a DevSecOps engineer at a cloud storage company running Nextcloud. After a wave of suspicious traffic, my job is to build a daemon that watches every HTTP request in real time, learns what normal looks like, and automatically blocks attackers when traffic goes abnormal.

No Fail2Ban. No rate-limiting libraries. Build it yourself.

The repository is here: https://github.com/GideonBature/hng-stage3


The Task

Here is a summary of what needed to be built:

  • Deploy Nextcloud behind Nginx using Docker Compose
  • Nginx must write JSON access logs to a named Docker volume called HNG-nginx-logs
  • Build a Python daemon that tails the log, tracks request rates using deque-based sliding windows, computes a rolling baseline, and blocks anomalous IPs using iptables
  • The daemon must send Slack alerts on every ban and unban
  • A live metrics dashboard must be served at a domain or subdomain
  • Everything must auto-recover: bans release on a backoff schedule of 10 minutes, 30 minutes, 2 hours, then permanent

What the Project Does and Why It Matters

Every public web server on the internet gets attacked. Sometimes it is a single IP sending thousands of requests per second trying to overwhelm your server (a Denial of Service attack). Sometimes it is a distributed flood from many IPs at once (a Distributed Denial of Service, or DDoS). Either kind can take a real service offline completely.

Traditional tools scan log files periodically, which introduces a delay. What I built here reads the log file line by line in real time as Nginx writes it, makes a decision within one second, and blocks the offending IP at the firewall level before the attack can do serious damage.

The full data flow looks like this:

Every HTTP request
        |
        v
Nginx serves the response, writes one JSON log line
        |
        v
Daemon reads that line immediately
        |
        v
Updates sliding windows (per-IP and global)
        |
        v
Every second: compares current rate against the rolling baseline
        |
        v
If anomalous: blocks IP with iptables + sends Slack alert
        |
        v
After ban duration: unblocks and sends Slack unban alert
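The "reads that line immediately" step is a plain file tail. Here is a minimal sketch of how a daemon can follow a growing log file; the JSON field names are assumptions for illustration, not the project's exact schema:

```python
import json
import time

def follow(path):
    """Yield lines appended to a file, like `tail -f` (rotation not handled)."""
    with open(path, "r") as f:
        f.seek(0, 2)               # start at the end of the file; only new lines matter
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.1)    # nothing new yet, poll again shortly
                continue
            yield line

# Usage sketch (field names assumed):
# for line in follow("/var/log/nginx/hng-access.log"):
#     event = json.loads(line)    # one JSON object per request
#     handle(event["remote_addr"], event["status"])
```

Because the reader sits in a tight loop on the file descriptor rather than re-scanning the file on a timer, each new request is seen within a fraction of a second of Nginx writing it.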

Step 1: Setting Up the Server

I reused the same Oracle Cloud server from Stages 0, 1, and 2. It runs Ubuntu 24.04 on ARM64 (Ampere A1) with 4 OCPUs and 23GB RAM, well above the 2 vCPU and 2GB minimum required.

Docker was already installed from Stage 2. I confirmed it was working:

docker --version
docker compose version

I also needed to open port 8080 for the detector dashboard since the previous stages only had 80 and 443 open:

sudo iptables -I INPUT -p tcp --dport 8080 -j ACCEPT
sudo netfilter-persistent save

I also added port 8080 in Oracle Cloud's Security List the same way I added ports 80 and 443 back in Stage 0.


Step 2: Setting Up the DuckDNS Subdomain

The task required the dashboard to be served at a domain or subdomain. I already had gideonbature.duckdns.org from Stage 1, pointing to my server IP 92.5.80.18.

I went to duckdns.org, logged in, and created a second entry:

  • Subdomain: detector-gideonbature
  • IP: 92.5.80.18

After saving, I verified the DNS was resolving correctly:

ping detector-gideonbature.duckdns.org
# Should return 92.5.80.18

Step 3: Writing the Code

I built the detector locally on my Mac and organised it into seven separate modules, each with a single responsibility:

detector/
  main.py        # Entry point, wires all components together
  monitor.py     # Tails and parses Nginx JSON access log
  baseline.py    # Rolling baseline tracker with per-hour slots
  detector.py    # Z-score anomaly detection logic
  blocker.py     # iptables ban/unban management
  unbanner.py    # Background thread for auto-unban backoff
  notifier.py    # Slack alert sender
  dashboard.py   # Flask live metrics web UI
  config.yaml    # All thresholds and configuration
  requirements.txt
  Dockerfile

I also wrote the Nginx configuration and Docker Compose file to wire everything together. Once everything was ready, I pushed it to GitHub:

git init
git add .
git commit -m "Initial Stage 3 anomaly detection engine"
git remote add origin git@github.com:GideonBature/hng-stage3.git
git push -u origin main

One issue came up immediately: GitHub's secret scanning blocked the push because my Slack webhook URL was hardcoded in config.yaml. The fix was to use an environment variable placeholder instead:

slack:
  webhook_url: "${SLACK_WEBHOOK_URL}"

And resolve it at runtime in main.py:

import os, re, yaml

def load_config(path="config.yaml"):
    with open(path, "r") as f:
        content = f.read()
    def replace_env(match):
        return os.environ.get(match.group(1), match.group(0))
    content = re.sub(r'\$\{(\w+)\}', replace_env, content)
    return yaml.safe_load(content)

After removing the hardcoded secret, the push went through.


Step 4: Deploying on the Server

I cloned the repository on the server:

git clone https://github.com/GideonBature/hng-stage3.git
cd hng-stage3

Then created the .env file with my real values:

cp .env.example .env
nano .env
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
SERVER_IP=92.5.80.18

Then brought the stack up:

docker compose up -d --build

Step 5: The First Problem — psutil Failed to Build

The first build attempt failed with:

psutil could not be installed from sources because gcc is not installed.
error: command 'gcc' failed: No such file or directory

psutil is a Python library for reading CPU and memory usage. It contains native C code that needs to be compiled. My Dockerfile was using python:3.11-alpine, and Alpine is a minimal Linux image that does not include build tools by default.

The fix was to add the required build dependencies to the Dockerfile:

FROM python:3.11-alpine

RUN apk add --no-cache \
    iptables \
    gcc \
    musl-dev \
    python3-dev \
    linux-headers

After adding the four build packages (iptables is in the list for the blocker, not for the build), the build succeeded.


Step 6: The Second Problem — Nextcloud Architecture Mismatch

After the build succeeded, I ran docker compose up -d and saw this warning:

nextcloud The requested image's platform (linux/amd64) does not match
the detected host platform (linux/arm64/v8)

The task specified using the image kefaslungu/hng-nextcloud, but that image was built only for AMD64. My Oracle Cloud server runs on ARM64. Docker warned about the mismatch and then Nextcloud started crashing repeatedly with:

exec /inuxius-entrypoint.sh: exec format error
nextcloud exited with code 255 (restarting)

This is a binary incompatibility. The AMD64 binary simply cannot execute on an ARM64 processor. Because Nextcloud was crashing, Nginx could not resolve the nextcloud hostname in its config and also crashed:

nginx: [emerg] host not found in upstream "nextcloud"

And because Nginx was down, the detector dashboard was also unreachable even though the detector itself was running fine.

The fix was to replace the image with the official Nextcloud image which supports multiple architectures including ARM64:

# docker-compose.yml
nextcloud:
  image: nextcloud:apache  # replaced kefaslungu/hng-nextcloud

After this change, Nextcloud started correctly, Nginx resolved nextcloud successfully, and the dashboard became accessible.


Step 7: Verifying the Stack

With all three containers running, I verified each piece:

# Check all containers are up
docker compose ps

# Check the named volume exists
docker volume ls | grep HNG-nginx-logs

# Confirm Nginx is writing JSON logs
docker compose exec nginx tail -5 /var/log/nginx/hng-access.log

# Confirm detector is tailing the log
docker compose logs detector | head -20

# Check Nextcloud is accessible by IP
curl -I http://92.5.80.18

# Check dashboard is accessible by subdomain
curl -I http://detector-gideonbature.duckdns.org

The Nextcloud check returned a 200 OK with Nextcloud headers. The dashboard check returned 200 OK showing the live metrics page.


Step 8: The Third Problem — Slack Webhook Returning 404

I triggered a test flood to confirm bans were working:

# Install hey on Mac (HTTP load testing tool)
brew install hey

# Send 500 rapid requests
hey -n 500 -c 50 http://92.5.80.18/

The dashboard showed a ban fired correctly. But I never received a Slack notification. Checking the detector logs revealed:

[ERROR] notifier: Slack webhook error: 404 no_service

Slack was rejecting the webhook URL. The cause: I had initially hardcoded the URL in config.yaml, then generated a new webhook URL after GitHub blocked the push, but only updated the .env file. The config.yaml baked into the container still had the old, expired URL hardcoded.

The fix was to update config.yaml to use the environment variable placeholder:

slack:
  webhook_url: "${SLACK_WEBHOOK_URL}"

And rebuild the detector container:

docker compose up -d --build --force-recreate detector

After this, the next ban produced a proper Slack notification immediately.


Step 9: The Fourth Problem — iptables Inside Docker

I ran another flood and saw the ban fire in the logs:

[INFO] blocker: iptables DROP rule added for 105.112.17.175
[WARNING] blocker: BANNED 105.112.17.175: duration=600s

But when I checked the host iptables:

sudo iptables -L INPUT -n | grep DROP

Nothing appeared. The DROP rule was being added inside the Docker container's network namespace, not the host machine's. This means the attacker's traffic was still reaching the server untouched.

The fix was to run the detector with network_mode: host, which makes the container share the host's network stack directly:

detector:
  network_mode: host
  cap_add:
    - NET_ADMIN
    - NET_RAW

When using network_mode: host, the container cannot be on a named network, so I also removed the networks and ports entries from the detector service. This introduced a new problem: Nginx could no longer reach the detector at detector:8080 since the detector was no longer on the Docker bridge network.

The fix for the dashboard proxy was to use the Docker bridge gateway IP, which I found with:

docker compose exec nginx ip route
# default via 172.19.0.1 dev eth0

I updated nginx.conf to proxy the dashboard to 172.19.0.1:8080 instead of detector:8080:

server {
    listen 80;
    server_name detector-gideonbature.duckdns.org;

    location / {
        proxy_pass http://172.19.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

After restarting nginx the dashboard came back up and iptables DROP rules now appeared on the host.


Step 10: Testing Everything End to End

With all fixes applied, I ran a final full test. I opened three terminals:

Terminal 1 (server) — watch iptables live:

watch -n 1 "sudo iptables -L INPUT -n --line-numbers | grep -E 'num|DROP'"

Terminal 2 (server) — watch detector logs live:

docker compose logs -f detector | grep -i "anomaly\|ban\|drop"

Terminal 3 (Mac) — run the flood:

hey -n 2000 -c 100 http://92.5.80.18/

Within about 10 seconds of the flood starting, the detector fired:

[WARNING] detector: IP anomaly: 105.112.17.175 rate=4.82 mean=1.00 z=7.63
[INFO] blocker: iptables DROP rule added for 105.112.17.175
[WARNING] blocker: BANNED 105.112.17.175: duration=600s
[WARNING] detector: GLOBAL anomaly: rate=4.82 mean=1.00 z=7.63

Terminal 1 immediately showed:

1    DROP       0    --  105.112.17.175       0.0.0.0/0

Slack received both the ban notification and the global anomaly alert. Ten minutes later, the unbanner thread fired and Slack received the unban notification.


How the Detection Works (For Beginners)

Now that you have seen the process, let me explain the key concepts clearly.

The Sliding Window

Here is the core concept: to know whether traffic is abnormal, you first need to know the current rate of traffic. A sliding window is how you track that efficiently.

Think of it like a conveyor belt. Each request puts a timestamp on the belt. The belt is exactly 60 seconds long. Old timestamps fall off the left end automatically. At any moment, the number of timestamps on the belt divided by 60 gives you the current requests per second.

In Python, a collections.deque is the perfect data structure for this because removing items from the left (eviction) is O(1), meaning it takes the same time regardless of how many items are in the queue.

from collections import defaultdict, deque
import time

# One deque per source IP, plus one global deque
ip_windows = defaultdict(deque)
global_window = deque()

def record(source_ip: str):
    now = time.time()
    ip_windows[source_ip].append(now)
    global_window.append(now)

def get_rate(window: deque, window_seconds: int = 60) -> float:
    cutoff = time.time() - window_seconds

    # Evict timestamps older than 60 seconds from the left
    while window and window[0] < cutoff:
        window.popleft()

    # Count remaining entries and convert to per-second rate
    return len(window) / window_seconds

Every time a new log entry arrives, its timestamp goes into both the per-IP deque and the global deque. Every second, the evaluator evicts old timestamps and counts what remains. If the count is too high, it fires an alert.

The key thing to understand: this window is exact. It is not an approximation or a counter that resets every minute. It literally counts every request that arrived in the last 60 seconds.
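A quick self-contained demonstration of that exactness, using simulated timestamps instead of time.time():

```python
from collections import deque

window = deque()
now = 1_000_000.0

# Simulate one request every second for the past 120 seconds
for t in range(120):
    window.append(now - 119.5 + t)   # oldest timestamp first

# Evict everything older than 60 seconds, exactly as get_rate() does
cutoff = now - 60
while window and window[0] < cutoff:
    window.popleft()

print(len(window))       # 60 requests remain
print(len(window) / 60)  # 1.0 req/s, the exact rate over the last minute
```

No bucketing error, no reset boundary: the count is precisely the requests whose timestamps fall inside the window.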

The Rolling Baseline

Knowing the current rate is only half the problem. You also need to know what normal looks like. Is 10 requests per second high? It depends. At 3am with one user, yes. At 2pm with many users, maybe not.

This is where the rolling baseline comes in. Instead of hardcoding a threshold like "block anything above 100 req/s", the daemon learns from actual traffic.

The baseline works like this:

Step 1: Count requests per second in buckets

now_bucket = int(time.time())
self._global_buckets[now_bucket] = self._global_buckets.get(now_bucket, 0) + 1

Each second gets its own bucket with a count. These buckets cover the last 30 minutes.
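The bucket dictionary also needs pruning so it does not grow forever. A hedged sketch of that housekeeping (prune_buckets is an illustrative name, not the daemon's actual method):

```python
import time

def prune_buckets(buckets: dict, window_seconds: int = 1800) -> dict:
    """Drop per-second buckets that have aged out of the 30-minute baseline window."""
    cutoff = int(time.time()) - window_seconds
    return {second: count for second, count in buckets.items() if second >= cutoff}
```

Running this on each recalculation keeps memory bounded at roughly 1,800 integer buckets regardless of traffic volume.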

Step 2: Every 60 seconds, recalculate mean and standard deviation

counts = list(self._global_buckets.values())
mean = sum(counts) / len(counts)
variance = sum((x - mean) ** 2 for x in counts) / len(counts)
stddev = math.sqrt(variance)

The mean tells you the average rate. The standard deviation tells you how much the rate varies normally.

Step 3: Per-hour slots

Traffic at 3am is different from traffic at 3pm. The baseline maintains a separate record for each clock hour. When the current hour has enough data, it is preferred over the global average. This means the detector adapts to time-of-day patterns automatically.
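A sketch of how that preference might look; the slot structure and the min_samples value are assumptions for illustration, not the daemon's exact code:

```python
import time

def pick_baseline(hour_slots: dict, global_stats: dict, min_samples: int = 300):
    """Prefer the current clock hour's baseline when it has enough samples."""
    hour = time.localtime().tm_hour
    slot = hour_slots.get(hour)
    if slot and slot["samples"] >= min_samples:
        return slot["mean"], slot["stddev"]
    # Fall back to the global 30-minute baseline until this hour has history
    return global_stats["mean"], global_stats["stddev"]
```

The fallback matters on day one: until an hour slot has accumulated enough samples, the detector still has a usable baseline instead of a cold start.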

Step 4: Floor values

During very quiet periods, the computed mean might be nearly zero. If the mean is 0.001 req/s and one request arrives, the rate is suddenly thousands of times the mean, which would trigger a false alarm. To prevent this, a minimum floor is enforced:

min_rps_floor: 1.0  # never let mean drop below 1 req/s
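Applying the floor is a one-line clamp before the z-score is computed. In this sketch, MIN_STDDEV is an assumed companion floor that keeps the z-score denominator from hitting zero:

```python
MIN_RPS_FLOOR = 1.0   # mirrors min_rps_floor in config.yaml
MIN_STDDEV = 0.1      # assumption: avoids division by zero in the z-score

def effective_baseline(mean: float, stddev: float):
    """Clamp a quiet-period baseline so one stray request cannot look like a flood."""
    return max(mean, MIN_RPS_FLOOR), max(stddev, MIN_STDDEV)

print(effective_baseline(0.001, 0.0))  # (1.0, 0.1)
```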

How Detection Makes a Decision

With the current rate and the baseline established, detection is a single calculation called the z-score:

z-score = (current_rate - baseline_mean) / baseline_stddev

The z-score measures how many standard deviations the current rate is above the mean. In a normal distribution:

  • A z-score of 1.0 means the rate is slightly elevated but normal
  • A z-score of 2.0 means it is moderately elevated
  • A z-score of 3.0 means it is very unlikely to be normal traffic

The detector flags an anomaly if either of two conditions is met:
zscore = (ip_rate - mean) / stddev

rate_breach = ip_rate >= 5.0 * mean  # 5x the average
zscore_breach = zscore > 3.0         # statistically very unlikely

if zscore_breach or rate_breach:
    # This IP is attacking
    fire_anomaly_event(ip)

The 5x multiplier catches sudden bursts even when the stddev is large. The z-score threshold catches sustained elevated rates even when the burst is not huge but is statistically abnormal.
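Plugging in the numbers from the test flood earlier in this post makes the two conditions concrete. The stddev of 0.50 is illustrative (the logs only report the mean); it gives z = 7.64, close to the 7.63 the daemon logged:

```python
mean, stddev = 1.00, 0.50   # baseline learned from quiet traffic (stddev assumed)
ip_rate = 4.82              # current rate during the flood

zscore = (ip_rate - mean) / stddev    # (4.82 - 1.00) / 0.50 = 7.64
rate_breach = ip_rate >= 5.0 * mean   # 4.82 >= 5.0 -> False
zscore_breach = zscore > 3.0          # 7.64 > 3.0  -> True

print(zscore_breach or rate_breach)   # True: the z-score condition alone fires the ban
```

Note that the 5x rate check alone would have missed this flood; having both conditions is what makes the detector robust.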

Error surge detection adds another layer. If an IP is sending a lot of 4xx or 5xx errors (typical of scanners probing for vulnerabilities), the thresholds are automatically tightened by 50%, making it easier to ban that IP:

if ip_error_rate >= 3.0 * baseline_error_rate:
    effective_zscore_threshold = 3.0 * 0.5   # tightened to 1.5
    effective_rate_multiplier = 5.0 * 0.5    # tightened to 2.5x

iptables Blocking

When an IP is flagged as anomalous, the blocker adds a DROP rule to the Linux firewall. iptables is the kernel-level packet filter in Linux. Adding a DROP rule means the kernel silently discards all packets from that IP before they even reach Nginx. The attacker gets no response at all.

import subprocess

def ban(ip: str):
    subprocess.run(
        ["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"],
        check=True,
    )

The -I INPUT flag inserts the rule at position 1, which means it is evaluated before any other rules. This is important because iptables processes rules in order and stops at the first match.
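One practical refinement, sketched here as an assumption rather than the daemon's exact code: iptables happily inserts duplicate rules, so checking with -C before inserting keeps the chain clean.

```python
import subprocess

def is_banned(ip: str) -> bool:
    """iptables -C exits with code 0 only if the exact rule already exists."""
    result = subprocess.run(
        ["iptables", "-C", "INPUT", "-s", ip, "-j", "DROP"],
        capture_output=True,
    )
    return result.returncode == 0

def ban_once(ip: str) -> None:
    if not is_banned(ip):   # avoid stacking duplicate DROP rules for the same IP
        subprocess.run(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"], check=True)
```

Without this guard, an IP flagged twice would need two -D calls to unban, and the rule list would silently bloat.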

The backoff schedule means repeat offenders get banned for longer:

| Offense | Duration |
|---------|----------|
| 1st | 10 minutes |
| 2nd | 30 minutes |
| 3rd | 2 hours |
| 4th+ | Permanent |
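The schedule above reduces to a small lookup. This is a sketch; using None to stand for "permanent" is my convention here, not necessarily the daemon's:

```python
BACKOFF_SECONDS = [600, 1800, 7200]   # 10 minutes, 30 minutes, 2 hours

def ban_duration(offense_count: int):
    """Return the ban length in seconds, or None for a permanent ban."""
    if offense_count > len(BACKOFF_SECONDS):
        return None                   # 4th offense and beyond: never auto-unban
    return BACKOFF_SECONDS[offense_count - 1]

print([ban_duration(n) for n in (1, 2, 3, 4)])  # [600, 1800, 7200, None]
```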

When the ban expires, the unbanner thread removes the rule:

subprocess.run(
    ["iptables", "-D", "INPUT", "-s", ip, "-j", "DROP"],
)

And sends a Slack notification so the operator knows the IP has been released.

The Live Dashboard

The dashboard is a Flask web application that serves a single HTML page. It auto-refreshes every 3 seconds using an HTML meta refresh tag and shows:


  • Currently banned IPs with ban time, duration, and condition
  • Global requests per second
  • Top 10 source IPs in the last 60 seconds
  • CPU and memory usage of the server
  • Current baseline mean and stddev
  • Uptime of the daemon

There is also a /api/metrics JSON endpoint so the data can be consumed programmatically:

curl http://detector-gideonbature.duckdns.org/api/metrics

Final Verification

# Audit log showing bans, unbans, and baseline recalculations
docker compose exec detector sh -c "grep -v BASELINE_RECALC /var/log/detector/audit.log | tail -10"

# Named volume confirmed
docker volume ls | grep HNG-nginx-logs

# Dashboard live
curl -I http://detector-gideonbature.duckdns.org

# Nextcloud accessible by IP
curl -I http://92.5.80.18

Everything passed. The daemon was left running for the required 12 continuous hours and responded correctly to the test attack traffic sent by the graders.


The Big Picture

Stage 3 introduced concepts that real security engineers work with every day:

| What we built | Why it matters |
|---------------|----------------|
| Real-time log tailing | Detect attacks as they happen, not after |
| Deque-based sliding window | Exact per-second rate tracking with O(1) eviction |
| Rolling baseline with hour slots | Adapts to time-of-day traffic patterns automatically |
| Z-score detection | Statistically sound threshold that does not require hand tuning |
| iptables DROP rules | Kernel-level blocking before traffic reaches the app |
| Backoff unban schedule | Repeat offenders face increasingly longer bans |
| network_mode: host | Required for the container to modify the host firewall |

The hardest bugs were not the obvious ones. The architecture mismatch, the iptables namespace issue, and the stale webhook URL were all invisible until the system was running under real conditions. That is what makes security tooling hard and interesting.


Stage 4 is next. Follow along as I keep documenting the journey.

Find me on Dev.to | GitHub
