TL;DR — I built a Python daemon that watches every HTTP request hitting an Nginx reverse proxy, learns what normal traffic looks like in real time, and automatically bans IPs that misbehave by inserting iptables rules at the kernel level. It alerts to Slack and ships its live metrics on a public dashboard. This post explains every piece in beginner-friendly terms.
Live demo: http://cloud-ng-anomaly.duckdns.org
Source code: https://github.com/ibraheembello/hng-stage3-anomaly-detector
What the project does and why it matters
Imagine you run a public website — say, a self-hosted Google-Drive-style cloud storage. Random people visit it all day. Most are real users. Some are bots. A few are attackers trying to flood your server with fake requests until it falls over. That last category is called a DDoS attack — Distributed Denial of Service.
The traditional way to deal with this is reactive: when something breaks, an engineer wakes up at 3 a.m., diagnoses the issue, and manually blocks the offender. That's slow and painful.
The better way is proactive detection: watch the traffic continuously, learn what "normal" looks like, and automatically block the abnormal stuff before it brings the site down.
That's what this project does — for a Nextcloud cluster running behind Nginx.
It builds its own definition of "normal" from the last 30 minutes of real traffic, not a hardcoded number. It works whether you get 2 requests per second at 3 a.m. or 200 at peak. And when an attacker sends a burst, the detector reacts within a couple of seconds — adds a kernel-level firewall rule, pings me on Slack, and shows the ban on a dashboard.
The architecture in one picture
Four moving parts:
| Component | What it does |
|---|---|
| Nginx | The doorman. Every request hits Nginx first; Nginx forwards it to Nextcloud. Crucially, it also writes a JSON line per request into a log file. |
| Nextcloud | The actual application. Doesn't know any of this is happening. |
| The detector daemon | A Python program that reads the log file as it grows. The brain of the operation. |
| iptables | The Linux kernel's firewall. The detector tells it "drop every packet from this IP on ports 80 and 443" and the kernel does the rest. |
Everything runs in Docker containers, orchestrated by Docker Compose, on one Ubuntu VPS.
Step 1 — How the sliding window works
This is the heart of the detector. Skip to step 2 if data structures bore you, but trust me — this part is clever and simple.
A "rate" is a question: "how many requests has this IP made in the last 60 seconds?" If you naively count requests per minute by tallying them in 60-second buckets, you get the wrong answer when traffic falls right between two buckets — and an attacker can exploit that gap.
A sliding window answers the same question, but the 60-second period slides forward continuously with the clock. So at 12:00:30, "the last 60 seconds" means 11:59:30 to 12:00:30. At 12:00:31, it means 11:59:31 to 12:00:31. Always exactly the past 60 seconds, never a fixed bucket.
How do you implement that without re-counting every time? Use a deque — a "double-ended queue".
```python
from collections import deque

# Per-IP and global windows: each one stores timestamps of recent requests
window = deque()
WINDOW_SECONDS = 60

def record(now: float):
    # 1. Drop everything older than 60 s from the front
    while window and window[0] <= now - WINDOW_SECONDS:
        window.popleft()
    # 2. Append the new request to the back
    window.append(now)

def current_rate() -> float:
    return len(window) / WINDOW_SECONDS
```
That's it. Two operations per request: pop old entries off the front (O(1)), append the new one to the back (O(1)). The current rate is just len(window) / 60.
The detector keeps one such window per source IP and one global window across all traffic.
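The per-IP bookkeeping can be sketched with a dictionary of deques. This is my own minimal version, not the project's actual code — names like `windows` and `record` are assumptions:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
windows = defaultdict(deque)  # one deque of request timestamps per source IP

def record(ip: str, now: float) -> float:
    """Record a request from `ip` and return its current rate in req/s."""
    w = windows[ip]
    # Evict timestamps that have slid out of the 60-second window
    while w and w[0] <= now - WINDOW_SECONDS:
        w.popleft()
    w.append(now)
    return len(w) / WINDOW_SECONDS

# Usage: three quick requests from one IP → 3 / 60 = 0.05 req/s
t = time.time()
for _ in range(3):
    rate = record("203.0.113.7", t)
```

A `defaultdict` means the first request from a new IP creates its window automatically; there's no separate registration step.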
⚠️ The Stage 3 brief explicitly forbids "faking the sliding window with a per-minute counter." This deque pattern is the cheapest way to satisfy that requirement honestly.
Step 2 — How the baseline learns from traffic
The detector now knows the current request rate. But it has no idea whether 5 requests per second is a lot or a little for this site. To answer that, it needs a baseline — a model of normal traffic.
The baseline keeps a 30-minute history of per-second request counts. Every second, the count of requests in that second is appended to a rolling list. The list is bounded so anything older than 30 minutes is automatically forgotten.
Every 60 seconds, the baseline thread does this:
```python
import statistics

mean = statistics.fmean(per_second_counts)
stddev = statistics.stdev(per_second_counts)
```
That gives us two numbers describing the baseline:
- mean — on average, how many requests does this site get per second?
- stddev ("standard deviation") — how much does that number bounce around?
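A minimal version of that rolling baseline, using a bounded deque so entries older than 30 minutes fall off automatically. This is a sketch under my own naming; the daemon's real structure may differ:

```python
import statistics
from collections import deque

HISTORY_SECONDS = 30 * 60                       # keep 30 minutes of per-second counts
per_second_counts = deque(maxlen=HISTORY_SECONDS)

def update_baseline():
    """Recompute mean/stddev from the rolling history (run every 60 s)."""
    if len(per_second_counts) < 2:
        return None, None                       # stdev needs at least two samples
    mean = statistics.fmean(per_second_counts)
    stddev = statistics.stdev(per_second_counts)
    return mean, stddev

# Feed a minute of quiet traffic: mostly 1-2 requests per second
for count in [1, 2, 1, 1, 2, 1] * 10:
    per_second_counts.append(count)
mean, stddev = update_baseline()
```

The `maxlen` argument does the forgetting for you: appending the 1801st sample silently drops the oldest one, so no cleanup pass is needed.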
Per-hour-of-day slots
Here's a subtle wrinkle: traffic at 3 a.m. is naturally lower than at 3 p.m. If you used one universal baseline, the detector would think 3 p.m. traffic was an attack just because it's higher than the all-day average. Or it would miss a real attack at 3 a.m. because it's still below the all-day average.
The fix: keep 24 separate slots, one per hour of day. Once the slot for the current hour has at least 5 minutes of data, use that slot instead of the global rolling baseline. So 3 a.m. traffic is judged against other 3 a.m. traffic, and 3 p.m. against 3 p.m.
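The per-hour bucketing might look like this. The 24 slots and the 5-minute warm-up threshold come from the post; the data layout and names are my guesses:

```python
from collections import deque
from datetime import datetime, timezone

MIN_SAMPLES = 5 * 60                                    # 5 minutes of per-second samples
hour_slots = [deque(maxlen=1800) for _ in range(24)]    # one 30-min history per hour of day

def record_second(count: int, when: datetime) -> None:
    hour_slots[when.hour].append(count)

def pick_history(when: datetime, global_history):
    """Prefer the current hour's slot once it has enough data."""
    slot = hour_slots[when.hour]
    return slot if len(slot) >= MIN_SAMPLES else global_history
```

Until a slot has its 5 minutes of samples, the detector keeps judging traffic against the global rolling baseline, so a fresh deployment still works on day one.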
Floors
If traffic is genuinely zero for a while, mean and stddev both crash toward zero. The very next request would then have an infinite z-score (you can't divide by zero) and trigger a false alarm. Two cheap "floor" values prevent this:
```yaml
mean_floor: 1.0
stddev_floor: 0.5
```
The mean is never allowed below 1.0; the stddev never below 0.5. Boring, but necessary.
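Applying the floors is a one-liner each. The clamp values are from the config above; the function itself is my illustration:

```python
MEAN_FLOOR = 1.0
STDDEV_FLOOR = 0.5

def floored(mean: float, stddev: float) -> tuple[float, float]:
    # Clamp both statistics so a quiet period can't push the z-score
    # denominator toward zero and trigger a false alarm.
    return max(mean, MEAN_FLOOR), max(stddev, STDDEV_FLOOR)
```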
Step 3 — How detection makes a decision
For every request that comes in, the detector asks two questions at once:

1. Is this rate too many sigmas above the mean? The "z-score" answers that: `z = (rate - mean) / stddev`. If `z > 3.0`, the rate is more than three standard deviations away from normal — which statistically happens in about 0.3% of normal traffic. Almost certainly an anomaly.
2. Is this rate flat-out higher than 5× the mean? This is a hard ceiling that catches steady, sustained floods even when the stddev is wide. If the mean is 2 req/s, anything above 10 req/s is suspicious regardless of statistics.

If either fires, the IP (or the global stream) is anomalous. Whichever fires first wins.
```python
def is_anomalous(rate: float, mean: float, stddev: float) -> bool:
    z = (rate - mean) / max(stddev, 1e-9)  # guard against a zero stddev
    return z > 3.0 or rate > mean * 5.0
```
There's one more wrinkle: an error surge. If an IP's 4xx/5xx response rate is at least 3× the baseline error rate, the detector tightens that IP's thresholds (multiplies them by 0.7). The intuition: an IP generating mostly errors is probing or scanning, and we want to ban it sooner.
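The error-surge tightening described above can be sketched as follows. The 3× trigger and the 0.7 multiplier are from the post; the function name and signature are mine:

```python
Z_THRESHOLD = 3.0
RATIO_THRESHOLD = 5.0
ERROR_SURGE_FACTOR = 3.0
TIGHTEN = 0.7

def thresholds(ip_error_rate: float, baseline_error_rate: float) -> tuple[float, float]:
    """Return (z_threshold, ratio_threshold), tightened if the IP is error-heavy."""
    if ip_error_rate >= ERROR_SURGE_FACTOR * baseline_error_rate:
        # Mostly-errors traffic smells like probing: ban it 30% sooner
        return Z_THRESHOLD * TIGHTEN, RATIO_THRESHOLD * TIGHTEN
    return Z_THRESHOLD, RATIO_THRESHOLD
```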
Step 4 — How iptables is used to block an IP
Once the detector decides an IP is bad, the question is: how do you actually block it?
Linux kernels include a packet-filtering subsystem called netfilter. The user-space tool to talk to it is iptables. When you run
```bash
sudo iptables -I INPUT -p tcp -s 1.2.3.4 --dport 80 -j DROP
```
…the kernel installs a rule that says: "any TCP packet coming in on port 80 from 1.2.3.4 — drop it silently." The packet never even reaches Nginx. It's the lowest-level, fastest way to block traffic.
The blocker module shells out to iptables directly:
```python
import subprocess

subprocess.run(["iptables", "-I", "INPUT", "-p", "tcp", "-s", ip,
                "--dport", str(port), "-j", "DROP"], check=True)
```
I scope the rule to TCP ports 80 and 443 only, not the whole IP. Why? So that when an admin's workstation accidentally trips the detector, the admin doesn't lose SSH at the same time. (Yes, I learned this the painful way.)
The ban isn't permanent — it's tiered:
```yaml
schedule_seconds: [600, 1800, 7200]  # 10 min, 30 min, 2 h
```
The first ban lasts 10 minutes. If the same IP misbehaves again later, the second ban lasts 30 minutes. The third lasts 2 hours. The fourth is permanent. This back-off pattern keeps you from permanently banning a real user who had one bad minute, but punishes repeat offenders harder each time.
A small "unbanner" thread polls every 5 seconds for expired bans, removes the corresponding iptables rule, and pings Slack: "unbanned 1.2.3.4 (ban #1)".
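The tiered schedule plus the unbanner's expiry check can be sketched like this. It's a simplification under my own names — the real code also shells out to iptables and posts to Slack at the commented points:

```python
SCHEDULE = [600, 1800, 7200]   # 10 min, 30 min, 2 h; 4th offence = permanent

bans = {}        # ip -> expiry timestamp, or None for a permanent ban
offences = {}    # ip -> how many times this IP has been banned

def ban(ip: str, now: float):
    """Record a ban; return its expiry time (None = permanent)."""
    n = offences.get(ip, 0)
    offences[ip] = n + 1
    expires = now + SCHEDULE[n] if n < len(SCHEDULE) else None
    bans[ip] = expires
    # Real code: insert the iptables DROP rule and notify Slack here
    return expires

def release_expired(now: float) -> list[str]:
    """Called every few seconds by the unbanner thread."""
    released = [ip for ip, exp in bans.items() if exp is not None and exp <= now]
    for ip in released:
        del bans[ip]
        # Real code: delete the iptables rule and notify Slack here
    return released
```

Note that `offences` is never cleared when a ban expires — that memory is exactly what makes the second and third bans longer.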
Putting it all together
Here's a real audit log line generated during testing:
```
[2026-04-26T00:31:22Z] BAN 102.88.55.19 | z-score 3.03 > 3.00 | rate=2.52 | baseline=1.00 | duration=600s
[2026-04-26T00:41:27Z] UNBAN 102.88.55.19 | … | duration=released | reason=scheduled-release
```
I sent a burst of 400 HTTP requests from my home IP. The detector took ~30 seconds to process them, computed z-score = 3.03, banned my IP for 10 minutes, and notified Slack. Exactly 600 seconds later, the unbanner removed the rule and notified Slack again. Meanwhile, the dashboard was updating every 3 seconds, the baseline was learning from the burst (its stddev climbed to 6.6 and decayed back over the next 30 minutes), and a synthetic multi-IP attack later that hour fired the global detection path with rate 11.28 > mean 2.26 × 5.00.
Everything I described above ran without me intervening once.
What I would change next time
- Production WSGI — Flask's built-in dev server runs the dashboard. For a real deployment I'd put it behind gunicorn.
- TLS — I left HTTP-only because the brief didn't mandate it; in real life I'd terminate TLS at Nginx via certbot.
- Persistence of the baseline — currently the baseline restarts cold on a daemon restart. Snapshotting the recent history to disk would let it warm up faster.
- Adaptive backoff — the schedule is static. A repeat-offender weighting that takes the gap between offences into account would feel more humane.
But those are polish. The core mechanism — sliding windows + rolling baseline + z-score + iptables — works.
Try it yourself
```bash
git clone https://github.com/ibraheembello/hng-stage3-anomaly-detector.git
cd hng-stage3-anomaly-detector
cp .env.example _private/.env
$EDITOR _private/.env
docker compose --env-file _private/.env up -d --build
```
The README has a full from-scratch runbook, including how to install Docker on a fresh Ubuntu 24.04 box, how to point a free DuckDNS domain at the EC2 IP, and how to set up the Slack webhook.
If you build something on top of this, ping me — I'd love to see what you change.