Wilfrid Okorie

How I Created a DDoS Protection Engine

As part of my tasks in HNG14 (DevOps track, stage 3), I had to build an engine to protect a live Nextcloud server from DDoS attacks, without using any existing security tools like Fail2Ban.

In summary, this means writing a program that watches traffic in real time, learns what regular traffic looks like, and automatically locks out attackers the moment something goes wrong, i.e. when there is a suspicious traffic spike.

This post explains exactly how I did it: in plain English, no security background required.

What Is a DDoS Attack?

DDoS stands for Distributed Denial of Service. Imagine a club with regular traffic of, say, 10 people entering every minute on average. In a DDoS attack, an attacker sends 200 random people to stand in line without ever entering, so that honest people who want to get into the club are denied the service.
On a server, instead of fake customers it is fake HTTP requests, thousands per second, flooding your server until it cannot serve real requests anymore.
The protection engine is the bouncer in this scenario.

Architecture: The Pieces that Work Together

In this setup, there are three pieces that work together. Here is the flow:

- Internet traffic
- Nginx -> reverse proxy, logs everything to JSON
- Nextcloud -> the actual app, don't touch this
- The detector daemon -> reads the Nginx logs continuously

The detector:
- Detects anomalies
- Blocks IPs via iptables
- Sends Slack alerts
- Serves a live dashboard
- Auto-unbans on schedule

The idea is that Nextcloud runs behind Nginx (a web server acting as a gatekeeper).
Nginx logs every incoming request in real time. The detector is an adaptive tool that reads those logs as they arrive, calculates the average request rate over a given period to learn what normal looks like, and then reacts to suspicious spikes in that rate. The reaction is automatically blocking the attackers.
In addition, there is a live web dashboard showing what is happening.

Here are the steps to building such a tool:


1: Set Up Your VPS

Go to your cloud provider (I used AWS). Create an EC2 instance (or a "Virtual machine", or whatever it is called with your cloud provider) with at least 2 vCPUs and 2GB RAM. Start the instance, and copy the Public IPv4.

With your instance running, SSH into it and install the tools you need:

```bash
sudo apt update && sudo apt upgrade -y
sudo apt install -y docker.io docker-compose-v2 git
sudo systemctl enable docker && sudo systemctl start docker
```

Also open these ports in your cloud firewall (AWS calls it a Security Group):

  • 22 - SSH
  • 80 - HTTP (Nginx/Nextcloud)
  • 443 - HTTPS
  • 5000 - Your detector dashboard

Point a domain or subdomain at your server's public IP - you'll need this for the dashboard URL. If your IP changes after restarting the instance (it will on AWS unless you use an Elastic IP), update your DNS A record.

Problems I faced at this step:

  • The default Ubuntu image comes with containerd already installed, which conflicts with docker.io. Remove it first, then install Docker.
  • I initially created one with less RAM and a tiny 8GB disk, and spent a lot of time debugging crashes that were simply caused by running out of memory and disk space. Save yourself the pain: start with a t3.small (2 vCPU, 2GB RAM) and a 16GB disk minimum if you are using AWS.
  • Don't forget to also allow port 5000 in UFW (Ubuntu's local firewall) - I missed this and spent time confused about why the dashboard wasn't accessible even though the Security Group was correct:
```bash
sudo ufw allow 5000
```

2: Set Up the Docker Compose Stack

I did all of this in my Python codebase and pulled it from GitHub from within my EC2 instance, but besides the code, here is what you need:

Create your project folder:

```bash
mkdir -p ~/hng_devops_stage_3/nginx
cd ~/hng_devops_stage_3
```

Create docker-compose.yml:

```yaml
version: "3.9"

volumes:
  HNG-nginx-logs:      # nginx writes here, detector reads here
  detector-audit:      # persists audit logs across restarts

services:

  nginx:
    image: nginx:alpine
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - HNG-nginx-logs:/var/log/nginx
    depends_on:
      - nextcloud

  nextcloud:
    image: kefaslungu/hng-nextcloud
    restart: unless-stopped
    volumes:
      - HNG-nginx-logs:/var/log/nginx:ro

  detector:
    build:
      context: ./detector
      dockerfile: Dockerfile
    restart: unless-stopped
    network_mode: host        # required for iptables to affect host firewall
    env_file: .env            # contains your Slack webhook URL
    volumes:
      - HNG-nginx-logs:/var/log/nginx:ro
      - ./config.yaml:/app/config.yaml:ro
      - detector-audit:/var/log/detector
    cap_add:
      - NET_ADMIN             # required to run iptables commands
    depends_on:
      - nginx
    environment:
      - DETECTOR_CONFIG=/app/config.yaml
```

Here are some things to note:

  • Why network_mode: host? Docker containers normally have their own isolated network. But iptables rules you add inside a container only affect that container, not the actual host machine. With host networking, the container shares the host's network stack, so iptables rules you add actually block traffic at the server level. Without this, your bans do nothing.

  • Why cap_add: NET_ADMIN? By default, containers can't modify firewall rules, since that is a privileged operation. This capability grants exactly the permission needed, and nothing more.

  • Why a named volume HNG-nginx-logs? This is the shared pipe between Nginx and your detector. Nginx writes logs into it. Your detector reads from it. The name must be exactly HNG-nginx-logs, since the task requires it.

When your disk fills up (and it will if you're not careful - the Nextcloud image alone is over 1GB), clean unused Docker data:

```bash
sudo docker system prune -f
```

If you resize your cloud disk, remember to extend the filesystem too:

```bash
sudo growpart /dev/nvme0n1 1
sudo resize2fs /dev/nvme0n1p1
```

3: Configure Nginx

I also did this in my IDE so it would be there when I pulled from GitHub.

Create nginx/nginx.conf:

```nginx
user  nginx;
worker_processes  auto;
error_log  /var/log/nginx/error.log warn;
pid        /var/run/nginx.pid;

events {
    worker_connections  1024;
}

http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    # JSON log format: every field the detector needs
    log_format json_log escape=json
        '{'
            '"source_ip":"$remote_addr",'
            '"timestamp":"$time_iso8601",'
            '"method":"$request_method",'
            '"path":"$request_uri",'
            '"status":$status,'
            '"response_size":$body_bytes_sent'
        '}';

    access_log /var/log/nginx/hng-access.log json_log;

    # Trust X-Forwarded-For so real client IPs are logged.
    # NOTE: 0.0.0.0/0 trusts every client, which lets attackers spoof the
    # header; restrict this to your proxy/CDN ranges in production.
    real_ip_header    X-Forwarded-For;
    set_real_ip_from  0.0.0.0/0;

    sendfile       on;
    keepalive_timeout  65;

    upstream nextcloud {
        server nextcloud:80;
    }

    server {
        listen 80;
        server_name _;

        location / {
            proxy_pass         http://nextcloud;
            proxy_set_header   Host              $host;
            proxy_set_header   X-Real-IP         $remote_addr;
            proxy_set_header   X-Forwarded-For   $proxy_add_x_forwarded_for;
            proxy_set_header   X-Forwarded-Proto $scheme;

            client_max_body_size    10G;    # allow large file uploads
            proxy_request_buffering off;
        }
    }
}
```

Two important things here:

JSON logs: the detector parses these logs line by line. They must be valid JSON. The escape=json directive ensures special characters in URLs don't break the JSON structure.

Real IP forwarding: without real_ip_header X-Forwarded-For, every log entry shows Nginx's internal Docker IP instead of the actual visitor's IP. Your detector would see every request coming from the same internal address and never identify real attackers.
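
For illustration, one request logged with the json_log format above looks roughly like this (the values are made up):

```json
{"source_ip":"203.0.113.7","timestamp":"2026-04-28T08:57:09+00:00","method":"GET","path":"/login","status":200,"response_size":1342}
```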

Test that Nginx is working before moving on:

```bash
sudo docker compose up -d nginx nextcloud
curl http://YOUR_SERVER_IP
```

You should see the Nextcloud setup page. If you see a 502 error, Nextcloud is still starting up; wait 30 seconds and try again. If you get a port conflict error, something else is using port 80:

```bash
sudo systemctl stop nginx    # stop any system nginx
sudo systemctl disable nginx
```

4: Build the Detector App

Your detector lives in a detector/ folder and is made up of several Python files, each with a single responsibility. Here's what each one does and why it exists:

config.py: loads config.yaml and environment variables. All thresholds live here. Nothing is hardcoded anywhere else in the codebase.

monitor.py: tails the Nginx log file line by line, exactly like tail -f in your terminal. Every new line gets parsed from JSON and fed into the sliding windows. This runs continuously in its own thread (a sketch of this loop follows the file descriptions below).

baseline.py: keeps a 30-minute rolling history of per-second request counts. Every 60 seconds it recalculates the mean and standard deviation. Maintains per-hour slots so peak-hour traffic doesn't distort off-peak baselines.

detector.py: evaluates current request rates against the baseline. Fires if z-score exceeds 3.0 or rate exceeds 5x the mean. Tightens thresholds for IPs with high error rates.

blocker.py: executes iptables to block flagged IPs and records the ban.

unbanner.py: runs on a schedule, checks expired bans, removes iptables rules, and escalates the backoff level for repeat offenders.

notifier.py: sends HTTP POST requests to your Slack webhook with ban/unban/global alert details.

dashboard.py: a Flask web server serving a live metrics page that refreshes every 3 seconds.

main.py: the entry point. Starts all threads and keeps the daemon running.
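
To make the tailing concrete, here is a minimal sketch of the kind of loop monitor.py runs. This is illustrative rather than the exact code from the repo; the log path matches the config below, and handle_event stands in for whatever feeds the sliding windows:

```python
import json
import time

LOG_PATH = "/var/log/nginx/hng-access.log"

def follow(path):
    """Yield new lines appended to the file, like `tail -f`."""
    with open(path, "r") as f:
        f.seek(0, 2)  # jump to the end of the file; only new lines matter
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.1)  # nothing new yet, poll again shortly
                continue
            yield line

def run(handle_event):
    for raw in follow(LOG_PATH):
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip partial or malformed lines
        handle_event(event)  # feed the parsed request into the sliding windows
```

A production tailer also has to cope with log rotation, which this sketch ignores.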

Your config.yaml holds all the tunable values:

```yaml
slack_webhook_url: "${SLACK_WEBHOOK_URL}"   # loaded from .env at runtime

zscore_threshold: 3.0
rate_multiplier: 5.0
error_rate_multiplier: 3.0

sliding_window_seconds: 60
baseline_window_minutes: 30
baseline_recalc_interval_seconds: 60

dashboard_port: 5000
log_file_path: "/var/log/nginx/hng-access.log"
audit_log_path: "/var/log/detector/audit.log"
```

For the Slack webhook URL, never put the real URL in your config file if your repo is public. Instead, I created a .env file on my server (which I added to .gitignore) and let Docker inject it as an environment variable:

.env (on your server only, never committed):

```bash
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/REAL/WEBHOOK
```
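
YAML does not expand ${...} placeholders by itself, so config.py has to substitute them after loading. Here is a minimal sketch of how that might look; the expand helper is my illustration, not necessarily the repo's exact approach:

```python
import os
import re
import yaml

_ENV_PATTERN = re.compile(r"\$\{([A-Za-z0-9_]+)\}")

def expand(value):
    """Recursively replace ${VAR} placeholders with environment values."""
    if isinstance(value, str):
        return _ENV_PATTERN.sub(lambda m: os.environ.get(m.group(1), ""), value)
    if isinstance(value, dict):
        return {k: expand(v) for k, v in value.items()}
    if isinstance(value, list):
        return [expand(v) for v in value]
    return value

def load_config(path):
    with open(path) as f:
        return expand(yaml.safe_load(f))
```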

Your Dockerfile for the detector:

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the source into a detector/ package so `python -m detector.main` resolves
COPY . ./detector/
CMD ["python", "-m", "detector.main"]
```

To bring the full stack up:

```bash
sudo docker compose up -d --build
sudo docker compose logs -f detector
```

You should see the Flask dashboard starting and log lines being processed. If you see No space left on device, clean up Docker and resize your disk as described in section 2.


5: Audit Log

Every significant action the detector takes gets written to a structured audit log at /var/log/detector/audit.log. The format is:

```
[timestamp] ACTION ip | condition | rate | baseline | duration
```

Real examples from my running system:

```
[2026-04-28T08:57:09Z] BAN ip=102.90.99.58 | condition=zscore | rate=1.2/s | baseline=1.0/s | duration=600s
[2026-04-28T09:07:28Z] UNBAN ip=102.90.99.58 | condition=backoff-0 | rate=N/A | baseline=1.0/s | duration=1800s
[2026-04-28T09:00:00Z] BASELINE_RECALC ip=global | mean=1.0 | stddev=0.06
```

To read it live:

```bash
sudo docker exec $(sudo docker ps -qf "name=detector") tail -f /var/log/detector/audit.log
```

The detector-audit Docker volume means this log survives container restarts: if your detector crashes and restarts, the full ban history is still there. This matters because the unbanner needs ban history to know which backoff level to apply next.
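
For completeness, here is a sketch of the kind of helper that writes these lines; the function name and signature are my illustration, not the repo's exact code:

```python
from datetime import datetime, timezone

AUDIT_LOG = "/var/log/detector/audit.log"

def audit(action, ip, condition, rate, baseline, duration):
    """Append one structured line to the audit log."""
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    line = (f"[{ts}] {action} ip={ip} | condition={condition} "
            f"| rate={rate} | baseline={baseline} | duration={duration}")
    with open(AUDIT_LOG, "a") as f:
        f.write(line + "\n")
```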


6: Set Up Slack

  1. Go to api.slack.com/apps and Create New App
  2. Give it a name (mine was "HNG DDoS Protection Engine") and pick your workspace
  3. In the left sidebar, click Incoming Webhooks and toggle it On
  4. Click Add New Webhook to Workspace → pick the channel you want alerts in, and Allow
  5. Copy the webhook URL that looks like https://hooks.slack.com/services/T.../B.../...
  6. On your server, add it to your .env file:

```bash
echo "SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK" > ~/hng_devops_stage_3/.env
```

Restart the detector to pick it up:

```bash
sudo docker compose restart detector
```

Test that it's working by sending a flood of requests to trigger a ban:

```bash
for i in {1..300}; do curl -s http://YOUR_SERVER_IP/ > /dev/null; done
```

Within 10 seconds you should see a Slack message like this:

```
🚨 IP Banned
IP: YOUR_IP
Condition: zscore
Current Rate: 4.8 req/s
Baseline Mean: 1.0 req/s
Ban Duration: 600s
```

Wait 10 minutes and you'll get the unban notification automatically. That confirms the full cycle (detection, blocking, alerting, and auto-unban) is working end to end.
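
Under the hood, the notifier is just an HTTP POST to the webhook. A minimal sketch of what notifier.py might send, matching the alert fields above (requests is assumed to be in requirements.txt):

```python
import os
import requests

WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]

def notify_ban(ip, condition, rate, baseline_mean, duration_s):
    """Post a ban alert to the configured Slack webhook."""
    text = (
        ":rotating_light: IP Banned\n"
        f"IP: {ip}\n"
        f"Condition: {condition}\n"
        f"Current Rate: {rate:.1f} req/s\n"
        f"Baseline Mean: {baseline_mean:.1f} req/s\n"
        f"Ban Duration: {duration_s}s"
    )
    # Slack incoming webhooks accept a simple {"text": ...} JSON payload
    requests.post(WEBHOOK_URL, json={"text": text}, timeout=5)
```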

How Some Components Work

  • Sliding Window: The problem this solves is: how do you measure requests per second in real time? A naive solution would be to count requests per minute, but that is too static compared to how fast an attack unfolds; the attacker could be done before your system finishes counting a minute.
    Instead, imagine a stick with a number of spots where items can sit. Items placed on one end travel to the other end over 60 seconds. At any moment, the number of items on the stick gives you your current rate. Items that reach the far end fall off while new items (an analogy for requests) keep coming in.
    During an attack, the number of items on the 60-second stick at the same time gets abnormally high, and that is how you know there is an attack.
    This is implemented in Python with a double-ended queue (see the sketch after this list).

  • Baseline Mean: The function of this is to learn from traffic. The sliding window is very good, but it would give false positives if you didn't know in the first place how many requests count as "too many". For a personal blog, 50 requests per second is a massive spike; for a large cloud platform or a social media app, it is normal during peak hours. So you can't hardcode a number for this; the value has to come from your actual traffic patterns.
    Every second, the detector records how many requests came in, keeping a rolling 30-minute history of these per-second counts. Every 60 seconds, it recalculates two things:

    • Mean: the average requests per second over the last 30 minutes
    • Standard deviation: how much the rate typically varies from the mean

    It also maintains per-hour slots, so that peak times are kept separate from quiet times, and the baseline mean never drops below 1.0 to prevent division by zero. The baseline needs time to warm up, but after that it knows what normal looks like for your server.
  • Detection Logic: From the above two, we have a current rate, a mean, and a standard deviation. The detector calculates a z-score:

z = (current_rate - mean) / standard_deviation

The z-score answers how many standard deviations above normal a particular rate is. A z-score of 1.0 means slightly above average; 3.0 means something that happens by random chance less than 0.3% of the time; 10.0 means something is very wrong.
When something is wrong, the detector takes action by banning the source IP, and a notification is sent through the alerting system (Slack in this case).
There is also an error-surge detector: if an IP is generating errors at a much higher rate than normal, its detection thresholds tighten automatically. This catches attackers who don't necessarily send high volumes of requests.
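
Putting these three pieces together, here is a minimal, self-contained sketch of the sliding window, the baseline, and the z-score check, using the thresholds from config.yaml. Names like RequestWindow and is_attack are my illustrations, not the repo's actual classes:

```python
import statistics
import time
from collections import deque

ZSCORE_THRESHOLD = 3.0   # from config.yaml
RATE_MULTIPLIER = 5.0    # from config.yaml

class RequestWindow:
    """60-second sliding window of request timestamps (the 'stick')."""
    def __init__(self, seconds=60):
        self.seconds = seconds
        self.events = deque()

    def add(self, ts=None):
        self.events.append(ts if ts is not None else time.time())

    def rate(self):
        """Requests per second over the window, dropping expired items."""
        cutoff = time.time() - self.seconds
        while self.events and self.events[0] < cutoff:
            self.events.popleft()  # items fall off the far end of the stick
        return len(self.events) / self.seconds

class Baseline:
    """Rolling 30-minute history of per-second request counts."""
    def __init__(self, minutes=30):
        self.counts = deque(maxlen=minutes * 60)  # one entry per second

    def record(self, count_this_second):
        self.counts.append(count_this_second)

    def stats(self):
        mean = max(statistics.mean(self.counts), 1.0)  # floor prevents div-by-zero
        stddev = statistics.pstdev(self.counts) or 0.1  # avoid zero stddev
        return mean, stddev

def is_attack(current_rate, mean, stddev):
    """Flag if the z-score or the raw rate-multiplier threshold is exceeded."""
    z = (current_rate - mean) / stddev
    return z > ZSCORE_THRESHOLD or current_rate > RATE_MULTIPLIER * mean
```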

How iptables Blocks an IP

Whenever an IP is flagged, the detector runs the command:

```bash
iptables -A INPUT -s 1.2.3.4 -j DROP
```

- iptables is Linux's built-in firewall
- -A INPUT appends a rule to the INPUT chain, i.e. incoming traffic
- -s 1.2.3.4 matches traffic coming from that source IP
- -j DROP silently discards the packet

With this, the attacker's requests never reach Nginx; Linux drops them at the lowest level, before the application code runs.
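
In Python, blocker.py can shell out to iptables with subprocess; this works because the container runs with NET_ADMIN and host networking as configured earlier. A minimal sketch (ban_ip and unban_ip are my names, not necessarily the repo's):

```python
import subprocess

def ban_ip(ip):
    """Append a DROP rule for this source IP at the host firewall."""
    subprocess.run(["iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"], check=True)

def unban_ip(ip):
    """Delete the matching DROP rule when the ban expires."""
    subprocess.run(["iptables", "-D", "INPUT", "-s", ip, "-j", "DROP"], check=True)
```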

Bans are not permanent by default. The unbanner follows a backoff schedule: the first ban lasts 10 minutes, the second 30 minutes, the third two hours, and the fourth is permanent. This means repeat offenders end up permanently blocked.
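
That schedule is easy to express as a lookup keyed by how many times an IP has been banned before. A sketch using the durations above (None standing in for "permanent" is my convention):

```python
# Ban durations in seconds, indexed by prior offense count.
# None means the ban is permanent.
BACKOFF_SCHEDULE = [600, 1800, 7200, None]  # 10 min, 30 min, 2 h, forever

def ban_duration(previous_bans):
    """Return how long the next ban should last for this IP."""
    index = min(previous_bans, len(BACKOFF_SCHEDULE) - 1)
    return BACKOFF_SCHEDULE[index]
```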

## Personal Takeaway:
My personal takeaway from this project was the logic used to build adaptive thresholds. I love the math in that, and how it makes sure - to a good extent - that the thresholds adjust very well to different traffic patterns. It adds the time dimension to the volume of requests, which is what accurate systems need: **CONTEXT**.

Here is the GitHub repository: **https://github.com/OWK50GA/ddos-attack-protection-engine**
