
Felix Gogodae

Building a Rolling-Baseline HTTP Anomaly Detector (No Fail2Ban)

Every VPS running a public web app gets hit with traffic it didn't ask for: scrapers, brute-force login attempts, or just someone's misconfigured bot hammering the same endpoint every second. Most tutorials say "install Fail2Ban and move on." But what if you want to understand the traffic before you block it? What if you need thresholds that adapt to your actual load instead of a hardcoded "5 failures in 10 minutes"?

That's what I built for the HNG DevOps track: a Python daemon that tails Nginx access logs, compares live request rates to a rolling 30-minute baseline, and reacts — Slack alerts for global spikes, iptables DROP for abusive individual IPs, with tiered auto-unban so a single bad minute doesn't permanently lock someone out.

Repository: github.com/Trojanhorse7/hng-anomaly-detector


The Stack

Detector daemon running

The whole system runs on a single Linux VPS with Docker Compose:

  • Nextcloud — the upstream kefaslungu/hng-nextcloud image, unmodified.
  • Nginx — reverse proxy in front of Nextcloud, configured to write JSON-formatted access logs (not the default combined format). This is critical — structured logs let the detector parse fields reliably instead of regex-guessing.
  • Detector — a Python 3.12 container that tails the shared log volume, runs the detection logic, calls Slack, and executes iptables commands on the host.
  • Shared volume — a named Docker volume (HNG-nginx-logs) that Nginx writes to and the detector reads from.

Architecture diagram

The detector container runs with network_mode: host and cap_add: NET_ADMIN so its iptables calls affect the actual host firewall — not an isolated container network.
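For reference, the relevant slice of the Compose file looks roughly like this — the service and volume names follow the description above, everything else is illustrative rather than copied from the repo:

services:
  detector:
    build: ./detector
    network_mode: host        # iptables rules hit the real host firewall
    cap_add:
      - NET_ADMIN             # required to modify iptables
    volumes:
      - HNG-nginx-logs:/var/log/nginx:ro

volumes:
  HNG-nginx-logs: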


How Detection Works

The detection pipeline has three layers: sliding windows, rolling baseline, and anomaly evaluation.

Layer 1: Sliding Windows (60 seconds)

Every parsed log line feeds into collections.deque structures — one global deque for all requests, and one per source IP. Timestamps older than 60 seconds are continuously evicted from the left side. At any moment, RPS = count / 60.

There's no "bucket per minute" approximation. Every request is tracked individually and aged out precisely. Parallel deques track 4xx/5xx errors separately for the error-surge path (more on that below).

Layer 2: Rolling Baseline (30 minutes)

A background thread recomputes the baseline every 60 seconds. It builds a dense vector of per-second request counts over the last 1,800 seconds (30 minutes) and calculates:

  • effective_mean — average requests per second
  • effective_std — standard deviation of per-second counts

There's an important twist: if enough samples exist in the current UTC hour, the baseline uses only that hour's data instead of the full 30-minute window. This matters because traffic patterns shift — 2 AM is different from 2 PM, and the baseline should reflect current conditions, not a blend of quiet and busy periods.

Floor values prevent divide-by-zero edge cases in z-score calculations. Every recompute is audited to a structured log file with the timestamp, source (hourly vs full window), and the computed mean/std.
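Roughly what the recompute does, assuming it receives the request timestamps from the last 30 minutes; MIN_HOURLY_SAMPLES and the floor values here are illustrative, not the repo's actual numbers:

import statistics
import time

BASELINE_SECONDS = 1800      # 30-minute window
MIN_HOURLY_SAMPLES = 300     # illustrative: how many per-second samples count as "enough"
STD_FLOOR = 0.1              # keeps z-score denominators away from zero

def recompute_baseline(timestamps, now=None):
    """Return (effective_mean, effective_std) over per-second request counts."""
    now = now or time.time()
    counts = [0] * BASELINE_SECONDS          # index 0 = the most recent second
    for ts in timestamps:
        age = int(now - ts)
        if 0 <= age < BASELINE_SECONDS:
            counts[age] += 1
    # Prefer only seconds from the current UTC hour when enough of them exist.
    hourly = counts[: int(now % 3600)]
    sample = hourly if len(hourly) >= MIN_HOURLY_SAMPLES else counts
    mean = statistics.fmean(sample) if sample else 0.0
    std = statistics.pstdev(sample) if sample else 0.0
    return max(mean, 0.01), max(std, STD_FLOOR)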

Layer 3: Anomaly Evaluation

For each incoming request, the detector compares current RPS to the baseline. An anomaly fires if either condition is true:

  • Z-score > threshold (default 3.0) — the current rate is more than 3 standard deviations above the baseline mean
  • Rate > multiplier × baseline mean (default 5×) — the current rate is more than 5 times the average

Error surge tightening: if an IP's error RPS (4xx/5xx responses) exceeds 3× the baseline error mean, thresholds tighten automatically — z-score drops to 2.0 and the rate multiplier drops to 3×. This means an IP generating lots of failed requests gets scrutinized more aggressively, which is exactly what you want for brute-force login attempts.

Normal:     z > 3.0  OR  rate > 5 × mean  →  anomaly
Error surge: z > 2.0  OR  rate > 3 × mean  →  anomaly (tighter)
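In code, the check is only a few lines. This is a sketch rather than the repo's exact function; the thresholds match the defaults above:

Z_THRESHOLD, RATE_MULTIPLIER = 3.0, 5.0      # normal thresholds
SURGE_Z, SURGE_MULTIPLIER = 2.0, 3.0         # tightened during an error surge

def is_anomalous(rate, mean, std, error_surge=False):
    z_limit = SURGE_Z if error_surge else Z_THRESHOLD
    multiplier = SURGE_MULTIPLIER if error_surge else RATE_MULTIPLIER
    z_score = (rate - mean) / std            # std is floored, so no divide-by-zero
    return z_score > z_limit or rate > multiplier * mean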

What Happens When an Anomaly Fires

The system distinguishes between global and per-IP anomalies, and they trigger different responses:

Global Anomaly → Slack Only

If the aggregate RPS across all IPs spikes above the baseline, the detector sends a Slack notification. It does not apply iptables rules — blocking all traffic would take the service down. Global alerts are informational: "your server is seeing unusual load right now."

Global anomaly Slack alert

A cooldown (default 120 seconds) prevents Slack spam if the global anomaly persists for minutes.
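The cooldown is just a timestamp check in front of the webhook call — sketched here with requests and illustrative names:

import os
import time
import requests

SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]
COOLDOWN_SECONDS = 120
_last_global_alert = 0.0

def notify_global_anomaly(rate, mean):
    global _last_global_alert
    now = time.time()
    if now - _last_global_alert < COOLDOWN_SECONDS:
        return                               # still cooling down, skip the alert
    _last_global_alert = now
    requests.post(SLACK_WEBHOOK_URL, timeout=5, json={
        "text": f"Global traffic anomaly: {rate:.2f} req/s vs baseline {mean:.2f} req/s",
    })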

Per-IP Anomaly → iptables DROP + Slack + Audit

If a single IP is responsible for anomalous traffic, the detector:

  1. Adds an iptables -I INPUT -s <IP> -j DROP rule — the IP is immediately blocked at the kernel level, before Nginx even sees the packets.
  2. Sends a Slack notification with the IP, the detection condition (z-score or rate multiplier), the current rate, and the baseline stats.
  3. Writes a structured audit log entry with all the same details plus the ban duration.
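The ban itself boils down to a single subprocess call (a sketch; the Slack and audit steps are elided here):

import subprocess

def ban_ip(ip, duration_seconds):
    # Insert a DROP rule at the top of INPUT so packets die in the kernel,
    # before Nginx ever sees them.
    subprocess.run(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"], check=True)
    # ...then send the Slack notification and write the audit entry.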

Ban Slack notification

iptables showing DROP rule

Tiered Auto-Unban

Permanently banning IPs from a single spike is too aggressive. The system uses escalating timeouts:

Strike  Ban Duration
1st     10 minutes
2nd     30 minutes
3rd     2 hours
4th+    Permanent (no auto-unban)

A background thread checks every 3 seconds for IPs whose ban has expired, removes the iptables rule, and sends an unban Slack notification. The strike counter persists across container restarts via a JSON file (ban_state.json).
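A compressed sketch of how the tiers and the unban loop fit together — the durations and the ban_state.json file match the description above, the rest of the names are illustrative:

import json
import subprocess
import time

BAN_TIERS = [600, 1800, 7200, None]    # 10 min, 30 min, 2 h, then permanent
banned = {}                            # ip -> {"expires": float or None, "strikes": int}

def ban_duration(strikes):
    return BAN_TIERS[min(strikes - 1, len(BAN_TIERS) - 1)]

def unban_expired():
    now = time.time()
    for ip, info in list(banned.items()):
        if info["expires"] is not None and now >= info["expires"]:
            subprocess.run(["iptables", "-D", "INPUT", "-s", ip, "-j", "DROP"], check=False)
            del banned[ip]
            # ...send the unban Slack notification here

def save_state(path="ban_state.json"):
    with open(path, "w") as fh:
        json.dump(banned, fh)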

Unban Slack notification

This means a legitimate user who triggered a false positive gets unblocked in 10 minutes. A repeat offender escalates through the tiers. By the 4th strike, they're gone for good.


The Audit Trail

Every significant event is appended to a structured log file at data/audit.log:

  • BASELINE_RECALC — every 60 seconds, with source (hourly vs full), mean, std
  • BAN — IP, condition, rate, baseline stats, duration
  • UNBAN — IP, reason, historical ban count

Structured audit log

This file is the source of truth for debugging, compliance, and the baseline graph (more below).
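Appending these entries is deliberately simple — something along the lines of the helper below, where the timestamp format and field names are my guess at the shape rather than the repo's exact schema:

import json
import time

AUDIT_PATH = "data/audit.log"

def audit(event, **fields):
    entry = {"ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()), "event": event, **fields}
    with open(AUDIT_PATH, "a") as fh:
        fh.write(json.dumps(entry) + "\n")

# e.g. audit("BAN", ip="203.0.113.7", condition="z-score", rate=12.4, duration=600)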


The Dashboard

A FastAPI server on port 8080 serves a single-page dashboard with live metrics via WebSocket push (every 2.5 seconds). If WebSocket fails (e.g., behind a proxy without Upgrade support), the page falls back to HTTP polling automatically.

The /api/state JSON endpoint returns:

  • Uptime, event count, CPU/memory
  • Current global RPS and baseline effective_mean / effective_std
  • List of currently banned IPs with tier info
  • Top 10 source IPs by request count in the current window
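The shape of that server is small enough to sketch here — get_state() stands in for the real metrics collection, and apart from /api/state the endpoint paths are assumptions:

import asyncio
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

def get_state():
    return {"rps": 0.0, "banned_ips": [], "top_ips": []}    # placeholder payload

@app.get("/api/state")
def api_state():
    return get_state()

@app.websocket("/ws")
async def ws_state(ws: WebSocket):
    await ws.accept()
    try:
        while True:
            await ws.send_json(get_state())    # push every 2.5 seconds
            await asyncio.sleep(2.5)
    except WebSocketDisconnect:
        pass

# run with: uvicorn dashboard:app --host 0.0.0.0 --port 8080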

Baseline Over Time

One of the requirements was demonstrating that the baseline actually adapts. By parsing BASELINE_RECALC lines from the audit log and plotting effective_mean over time, you can see the baseline shift as traffic patterns change between UTC hours.

During a busy period, effective_mean climbs. When traffic drops, it falls. The hourly-slice preference means the baseline reacts to the current hour's pattern rather than being dragged by stale data from 25 minutes ago.
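Producing that graph is a short script: filter the BASELINE_RECALC entries out of the audit log and plot the mean. The field names below assume the JSON-lines shape from the audit sketch earlier; matplotlib is an extra dependency:

import json
from datetime import datetime, timezone
import matplotlib.pyplot as plt

times, means = [], []
with open("data/audit.log") as fh:
    for line in fh:
        entry = json.loads(line)
        if entry.get("event") != "BASELINE_RECALC":
            continue
        ts = datetime.strptime(entry["ts"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        times.append(ts)
        means.append(entry["effective_mean"])

plt.plot(times, means)
plt.xlabel("time (UTC)")
plt.ylabel("effective_mean (req/s)")
plt.title("Rolling baseline over time")
plt.tight_layout()
plt.savefig("baseline.png")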


Lessons Learned

1. JSON logs are non-negotiable. Parsing regex against Nginx's default combined log format is fragile. One unusual user-agent string with spaces and quotes breaks your parser. JSON logs with escape=json in the Nginx config give you reliable field extraction every time.
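For reference, a trimmed-down example of such a log_format — the field list here is illustrative, and the repo's version likely logs more:

log_format json_combined escape=json
  '{'
    '"time":"$time_iso8601",'
    '"remote_addr":"$remote_addr",'
    '"request":"$request",'
    '"status":"$status",'
    '"request_time":"$request_time",'
    '"user_agent":"$http_user_agent"'
  '}';
access_log /var/log/nginx/access.log json_combined;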

2. Host networking in Docker is powerful but surprising. network_mode: host means the container shares the host's network stack — iptables rules apply to the actual server, not a virtual bridge. This is exactly what you want for blocking IPs, but it also means port conflicts are your problem.

3. Hardcoded thresholds are the enemy. "Block after 100 requests per minute" sounds reasonable until your app legitimately serves 200 req/s during peak hours. A rolling baseline that adapts to actual traffic means your thresholds stay meaningful whether you're serving 2 req/s at 3 AM or 50 req/s at noon.

4. Tiered responses prevent self-inflicted outages. The first time I tested with aggressive thresholds, my own monitoring IP got permanently banned. Escalating tiers (10m → 30m → 2h → permanent) give false positives a way to recover while still catching persistent abuse.

5. Audit everything. When something goes wrong — a legitimate user gets blocked, or an attack slips through — the audit log tells you exactly what the baseline was, what the detector saw, and why it made the decision it did. Without that, you're guessing.


Running It Yourself

git clone https://github.com/Trojanhorse7/hng-anomaly-detector
cd hng-anomaly-detector
cp .env.example .env
# Set SLACK_WEBHOOK_URL in .env
docker compose build && docker compose up -d

Nextcloud at http://<VPS_IP>/, dashboard at http://<VPS_IP>:8080/.

Thresholds, window sizes, and ban durations are all in detector/config.yaml — no code changes needed to tune the system.
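The key names below are hypothetical (check detector/config.yaml in the repo for the real ones), but the values mirror the defaults described in this post:

window_seconds: 60
baseline_seconds: 1800
z_threshold: 3.0
rate_multiplier: 5
error_surge_z: 2.0
error_surge_multiplier: 3
global_alert_cooldown_seconds: 120
ban_durations_minutes: [10, 30, 120]    # a 4th strike is permanent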


What I'd Improve

  • Per-IP baselines — currently all IPs are compared against the global baseline. High-traffic legitimate IPs (like a CDN edge) could benefit from their own rolling stats.
  • HTTPS on the dashboard — right now it's plain HTTP on 8080. A reverse proxy with TLS would be better for production.
  • Prometheus/Grafana — the audit log works, but a proper time-series database would make baseline visualization trivial.
  • IPv6 — the current implementation only handles IPv4 in iptables rules.

Built for the HNG DevOps track. The full source is at github.com/Trojanhorse7/hng-anomaly-detector.
