How I Built a Real-Time DDoS Detection Engine for Nextcloud from Scratch

Introduction

Imagine you're running a cloud storage platform used by thousands of people around the world. One day, your boss walks in and says: "We've been seeing suspicious traffic. Build something that detects and blocks attacks automatically."

That's exactly the challenge I faced. In this post, I'll walk you through how I built a real-time anomaly detection engine that watches HTTP traffic, learns what normal looks like, and automatically blocks attackers — all without using any third-party rate-limiting libraries.

By the end of this post, you'll understand:

  • How sliding windows track request rates in real time
  • How a baseline learns from your own traffic patterns
  • How z-score math decides if something is an attack
  • How iptables drops malicious IPs at the kernel level

What Does the Project Do?

The system sits alongside a Nextcloud instance and does five things continuously:

  1. Reads Nginx access logs in real time, line by line
  2. Tracks request rates using sliding windows (per IP and globally)
  3. Learns what normal traffic looks like using a rolling baseline
  4. Detects anomalies when traffic deviates significantly from normal
  5. Blocks suspicious IPs automatically using iptables

Here's the overall architecture:
Internet → Nginx (JSON logs) → Nextcloud
                │
        Log volume (shared)
                │
        Python daemon
        ├── Monitor (reads logs)
        ├── Baseline (learns traffic)
        ├── Detector (flags anomalies)
        ├── Blocker (iptables rules)
        ├── Unbanner (auto-release)
        ├── Notifier (Slack alerts)
        └── Dashboard (live metrics)
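The first component, the monitor, tails the access log much like tail -f does. Here's a minimal sketch of that loop, assuming Nginx writes one JSON object per line (the exact field names depend on your log_format directive):

import json
import time

def follow(path):
    # Yield parsed log entries as Nginx appends them to the file
    with open(path) as f:
        f.seek(0, 2)  # start at the end of the file, like tail -f
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.1)  # nothing new yet; poll again shortly
                continue
            yield json.loads(line)

A production version would also need to handle log rotation, but this captures the core idea: the daemon sees each request within fractions of a second of Nginx logging it.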

How the Sliding Window Works

A sliding window is a way of asking: "How many requests happened in the last 60 seconds?"

The naive approach would be to count all requests every minute — but that gives you a stale snapshot, not a real-time view.

Instead, I used Python's collections.deque — a double-ended queue. Here's the idea:

from collections import defaultdict, deque
import time

# One deque of timestamps per IP
ip_windows = defaultdict(deque)

def record_request(ip):
    now = time.time()
    cutoff = now - 60  # 60-second window
    window = ip_windows[ip]

    # Add current timestamp to the right
    window.append(now)

    # Evict timestamps older than the cutoff from the left
    while window and window[0] < cutoff:
        window.popleft()

    # Current rate = requests in last 60 seconds
    return len(window) / 60

Every time a request comes in, its timestamp is appended to the right of that IP's deque. Timestamps older than 60 seconds are evicted from the left using popleft(). The current rate is simply the length of the deque divided by 60.

This gives us a perfectly accurate rolling count with O(1) insertions and amortized O(1) evictions. No databases, no counters that reset at fixed intervals.

I maintain two windows (a combined sketch follows the list):

  • Per-IP window: one deque per IP address
  • Global window: one deque for all traffic combined
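As a rough sketch (not the project's exact code), both windows can live in one tracker:

from collections import defaultdict, deque
import time

class RateTracker:
    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        self.per_ip = defaultdict(deque)  # one deque of timestamps per IP
        self.global_window = deque()      # one deque for all traffic

    def record(self, ip):
        now = time.time()
        self.per_ip[ip].append(now)
        self.global_window.append(now)

    def rate(self, window):
        # Evict stale timestamps, then compute requests per second
        now = time.time()
        while window and window[0] < now - self.window_seconds:
            window.popleft()
        return len(window) / self.window_seconds

Calling tracker.rate(tracker.per_ip[ip]) gives a single IP's rate, while tracker.rate(tracker.global_window) gives the server-wide rate.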

How the Baseline Learns from Traffic

The baseline answers the question: "What does normal traffic look like on this server?"

Instead of hardcoding a threshold like "flag anything above 10 req/s", I compute the mean and standard deviation from actual recent traffic.

Here's how it works:

Rolling 30-minute window:
Every second, I record how many requests arrived that second. I keep a rolling window of the last 30 minutes (1800 seconds) of these per-second counts.
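Here's a sketch of how those per-second counts might be recorded (record_second is a hypothetical helper name; the full class isn't shown in this post):

import time

WINDOW_SECONDS = 1800  # 30 minutes

def record_second(self, count):
    now = time.time()
    self.window.append((now, count))  # (timestamp, requests in that second)
    # Drop entries that have fallen out of the 30-minute window
    while self.window and self.window[0][0] < now - WINDOW_SECONDS:
        self.window.popleft()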

Recalculation every 60 seconds:

# Pull the per-second counts out of the (timestamp, count) window
counts = [c for _, c in self.window]
mean = sum(counts) / len(counts)
# Population variance of the per-second counts
variance = sum((c - mean) ** 2 for c in counts) / len(counts)
stddev = math.sqrt(variance)

Per-hour slots:
The baseline stores separate mean/stddev values per hour. This means if your server is busier at 9am than at 3am, the baseline adapts automatically.

Floor values:
To avoid division by zero on idle servers, I set minimum values (see the sketch after this list):

  • floor_mean = 0.1 req/s
  • floor_stddev = 0.1
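Putting the hourly slots and the floors together, a baseline lookup might look like this (self.hourly is an illustrative dict mapping hour to (mean, stddev); the project's actual attribute names may differ):

from datetime import datetime

FLOOR_MEAN = 0.1    # req/s
FLOOR_STDDEV = 0.1

def get_baseline(self):
    hour = datetime.now().hour  # 0-23 slot for the current hour
    mean, stddev = self.hourly.get(hour, (FLOOR_MEAN, FLOOR_STDDEV))
    # Clamp to the floors so idle hours never produce a zero stddev
    return max(mean, FLOOR_MEAN), max(stddev, FLOOR_STDDEV)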

How the Detection Logic Makes a Decision

Once I have the baseline mean and stddev, I use z-score to decide if a rate is anomalous.

The z-score measures how many standard deviations a value is from the mean:
z = (current_rate - mean) / stddev

If z > 3.0, the current rate is more than 3 standard deviations above normal, which is statistically very unlikely under normal conditions.

I also add a rate multiplier check as a backup:

def check_ip(self, ip):
    mean, stddev = self.baseline.get_baseline()
    rate = self.get_ip_rate(ip)

    # Z-score check
    if stddev > 0:
        zscore = (rate - mean) / stddev
        if zscore > 3.0:
            return True, f"zscore={zscore:.2f}"

    # Rate multiplier check
    if rate > 5 * mean:
        return True, f"rate={rate:.2f} > 5x mean"

    return False, None

Error surge detection:
If an IP's 4xx/5xx error rate is 3x higher than the baseline error rate, the thresholds are automatically tightened by 30%. This catches attackers who probe for vulnerabilities before launching a full attack.
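Here's a sketch of how that tightening could plug into the detector (error_rate and baseline_error_rate are illustrative names, not taken from the project's code):

BASE_ZSCORE = 3.0

def zscore_threshold(self, ip):
    # If this IP's 4xx/5xx rate is 3x the baseline error rate,
    # tighten the z-score threshold by 30%
    if self.error_rate(ip) > 3 * self.baseline_error_rate():
        return BASE_ZSCORE * 0.7
    return BASE_ZSCORE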


How iptables Blocks an IP

When an IP is flagged as anomalous, blocking happens at the kernel level using iptables — before traffic even reaches Nginx or Nextcloud.

import subprocess

def ban(self, ip):
    # Requires root (or CAP_NET_ADMIN) to modify the firewall
    subprocess.run([
        'iptables', '-I', 'INPUT',
        '-s', ip, '-j', 'DROP'
    ], check=True)  # raise if iptables fails

The -I INPUT inserts the rule at the top of the INPUT chain. -s specifies the source IP. -j DROP silently drops all packets from that IP.

This is extremely efficient — the kernel drops packets without any application-level processing.

Auto-unban with backoff schedule:
Bans don't last forever. The unbanner releases IPs on a backoff schedule:

  • 1st ban: 10 minutes
  • 2nd ban: 30 minutes
  • 3rd ban: 2 hours
  • 4th ban+: permanent

unban_schedule = [600, 1800, 7200, -1]  # seconds, -1 = permanent
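The unbanner itself can be a simple periodic sweep. A rough sketch, assuming bans are tracked in a dict of ip → (banned_at, ban_count):

import subprocess
import time

def check_unbans(self):
    now = time.time()
    for ip, (banned_at, ban_count) in list(self.bans.items()):
        # Cap the index at the last schedule entry (permanent)
        idx = min(ban_count - 1, len(self.unban_schedule) - 1)
        duration = self.unban_schedule[idx]
        if duration != -1 and now - banned_at >= duration:
            # -D deletes the matching rule from the INPUT chain
            subprocess.run(['iptables', '-D', 'INPUT', '-s', ip, '-j', 'DROP'])
            del self.bans[ip]

A real implementation would also keep the ban count around after release, so a repeat offender moves up the backoff schedule.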

The Live Dashboard

The system serves a live dashboard at a public domain showing:

  • Global request rate vs baseline
  • Banned IPs and their ban counts
  • Top 10 source IPs by request rate
  • CPU and memory usage
  • Uptime

It's built with Flask and auto-refreshes every 3 seconds.
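In Flask terms, the core of such a dashboard is just an endpoint that serializes the engine's current state. A minimal sketch (engine is a placeholder handle, not the project's actual object):

from flask import Flask, jsonify

app = Flask(__name__)
engine = ...  # placeholder: handle to the running detection engine

@app.route('/metrics')
def metrics():
    return jsonify({
        'global_rate': engine.global_rate(),
        'baseline': engine.baseline.get_baseline(),
        'banned_ips': engine.blocker.active_bans(),
    })

A small HTML page polling this endpoint every 3 seconds is all the "live" dashboard needs.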


Slack Alerts

Every ban, unban, and global anomaly sends a Slack notification with:

  • The IP address
  • The condition that fired (z-score or rate multiplier)
  • Current rate vs baseline
  • Ban duration
  • Timestamp
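Slack's incoming webhooks make this a single HTTP POST with a JSON payload. A minimal sketch (the message format here is illustrative, not the project's exact wording):

import requests

def notify_ban(webhook_url, ip, reason, rate, baseline_mean, duration):
    text = (
        f":no_entry: Banned {ip}\n"
        f"Trigger: {reason}\n"
        f"Rate: {rate:.2f} req/s vs baseline {baseline_mean:.2f} req/s\n"
        f"Duration: {duration}s"
    )
    # Slack incoming webhooks accept a JSON body with a "text" field
    requests.post(webhook_url, json={'text': text}, timeout=5)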

What I Learned

Building this from scratch taught me:

  1. Deques are powerful — simple data structures can solve complex real-time problems elegantly
  2. Statistics beat hardcoded thresholds — a z-score adapts to your actual traffic patterns
  3. Kernel-level blocking is fast — iptables drops packets before your app even sees them
  4. Baselines need floors — always handle the idle server case

Source Code

The full source code is available on GitHub:
👉 https://github.com/IamHendy/hng-anomaly-detector

The live dashboard is running at:
👉 http://hendyogema.mooo.com


Built as part of the HNG DevSecOps track — Stage 3 challenge.
