Recently, I was tasked with a challenge: Build an automated defense system for a live Nextcloud instance. The goal wasn't just to block "bad guys," but to build a system that actually learns what a normal day looks like and reacts when things get weird.
Here is the breakdown of how I engineered this system, the statistical math behind it, and why "Sliding Windows" are a developer's best defense.
The Architecture: Under the Hood
This isn't just a script running in a vacuum. To make this work in a production-style environment, I deployed a stack that mirrors real-world DevOps architecture:
The Source: A Nextcloud instance running in Docker.
The Proxy: Nginx, configured to write JSON access logs to a specific path.
The Bridge: A named Docker volume, HNG-nginx-logs, shared between Nginx (writer) and my Python Daemon (reader).
The Brain: A multi-module Python engine that tails these logs in real time.
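To give a feel for the "reader" side, here is a minimal sketch of tailing that shared log file, assuming Nginx writes one JSON object per line (the path and field names are placeholders, not the exact production config):

import json
import time

LOG_PATH = "/var/log/nginx/access.json"  # hypothetical mount point of the HNG-nginx-logs volume

def follow(path):
    """Yield parsed JSON log entries as Nginx appends them."""
    with open(path, "r") as f:
        f.seek(0, 2)  # jump to the end of the file; only new lines matter
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.2)  # nothing new yet, poll again shortly
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # skip partial or malformed lines

# for entry in follow(LOG_PATH):
#     handle(entry["remote_addr"], entry["status"])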
1. The Sliding Window: Beyond Simple Counters
Most beginners use a simple integer counter that resets every minute. That's a mistake: a burst that straddles the reset gets split across two windows. An attacker can send 1,000 requests in the last 10 seconds of one minute and another 1,000 in the first 10 seconds of the next, and neither counter ever sees the true peak.
Instead, I used a Time-Based Sliding Window built on Python's collections.deque.
from collections import deque
import time

# Each IP has its own deque of timestamps (shown here for a single IP)
window = deque()

def process_request():
    now = time.time()
    window.append(now)  # add the new hit

    # EVICTION LOGIC:
    # Drop timestamps older than 60 seconds so the deque always holds
    # a true time-based window, not a resettable counter.
    while window and window[0] < now - 60:
        window.popleft()
The Engineering Logic: This ensures that at any given millisecond, I am looking at exactly the last 60 seconds of activity. It’s a true rolling window that evicts old data as new data arrives.
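In the real engine every source IP gets its own window. A rough sketch of how that can look with a defaultdict (the helper names are mine, for illustration):

from collections import defaultdict, deque
import time

WINDOW_SECONDS = 60
windows = defaultdict(deque)  # ip -> deque of hit timestamps

def record_hit(ip):
    now = time.time()
    w = windows[ip]
    w.append(now)
    while w and w[0] < now - WINDOW_SECONDS:
        w.popleft()
    return len(w)  # requests from this IP in the last 60 seconds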
2. The Baseline: 1,800 Seconds of Learning
To know what's "weird," the system has to know what's "normal."
Rolling Memory: The system maintains 1,800 seconds (30 minutes) of per-second request counts.
Recalculation: Every 60 seconds, it recomputes the Mean and Standard Deviation.
Hourly Slots: Traffic at 3 PM is different from 3 AM. The system maintains 24 hourly slots, preferring the current hour’s baseline once it has enough data to be statistically significant.
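A rough sketch of that rolling baseline, leaving out the hourly-slot selection (class and method names are illustrative, not the engine's actual API):

import statistics
from collections import deque

class Baseline:
    """Rolling 30-minute baseline of per-second request counts."""

    def __init__(self, history_seconds=1800):
        self.counts = deque(maxlen=history_seconds)  # oldest second drops off automatically
        self.mean = 0.0
        self.stdev = 0.0

    def add_second(self, count):
        self.counts.append(count)

    def recompute(self):
        # The engine calls this every 60 seconds
        if len(self.counts) >= 2:
            self.mean = statistics.mean(self.counts)
            self.stdev = statistics.pstdev(self.counts)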
3. Detection Logic: The 3.0 Z-Score Rule
I didn't use a hardcoded limit like "100 hits = ban." Instead, the engine uses two triggers:
Z-Score > 3.0: A statistical flag meaning the traffic is 3 standard deviations away from the average.
The 5x Rule: If the current rate exceeds 5x the baseline mean.
Either condition on its own raises an anomaly flag. Because both are measured against a learned baseline, the system stays strict during quiet hours and flexible during peak traffic.
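In code, the two triggers boil down to a few lines (the constants and function name are my shorthand, not the engine's API):

Z_THRESHOLD = 3.0
MULTIPLIER_THRESHOLD = 5.0

def is_anomalous(current_rate, mean, stdev):
    """Flag traffic that is either 3 sigma above baseline or 5x the baseline mean."""
    z = (current_rate - mean) / stdev if stdev > 0 else 0.0
    return z > Z_THRESHOLD or (mean > 0 and current_rate > MULTIPLIER_THRESHOLD * mean)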
4. The "Zero-Trust" Error Surge
Attackers often leave a trail of 404 Not Found (scanning for hidden files) or 500 Internal Server Error (trying to crash the DB).
My engine tracks the Error Rate. If an IP's 4xx/5xx errors exceed 3x the baseline error rate, the system automatically tightens the detection threshold from 3.0 to 1.5. We stop giving them the benefit of the doubt.
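A minimal sketch of that tightening rule, with the thresholds from above baked in as defaults (the function itself is illustrative):

def effective_z_threshold(ip_error_rate, baseline_error_rate,
                          normal=3.0, strict=1.5, surge_factor=3.0):
    """Tighten the anomaly threshold for IPs producing an error surge."""
    if baseline_error_rate > 0 and ip_error_rate > surge_factor * baseline_error_rate:
        return strict  # stop giving this IP the benefit of the doubt
    return normal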
5. Enforcement & The Lifecycle of a Ban
Detection is useless without action. When a ban is triggered, the engine talks directly to the Linux kernel's firewall.
Action: Injects a DROP rule into iptables.
Backoff Schedule: Bans follow a schedule—10 minutes → 30 minutes → 2 hours → Permanent.
Alerting: A Slack notification is fired within 10 seconds, containing the Z-score, current rate, and baseline.
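Roughly how the enforcement step can be wired up, leaving out the Slack webhook call and the unban scheduler (the names and structure are illustrative, though the iptables invocation is the standard way to drop an IP):

import subprocess

# Escalating ban durations in seconds: 10 min -> 30 min -> 2 h -> permanent
BACKOFF = [600, 1800, 7200, None]

def ban_ip(ip, offense_count):
    """Insert a DROP rule for this IP; offense_count starts at 0 for the first ban."""
    duration = BACKOFF[min(offense_count, len(BACKOFF) - 1)]
    # Requires root: put the DROP rule at the top of the INPUT chain
    subprocess.run(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"], check=True)
    return duration  # the engine schedules removal after this many seconds (None = permanent)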
6. Real-Time Observability
The Live Metrics UI serves as the control room. Built with Flask and refreshing every 3 seconds, it provides full visibility:
Global req/s vs. Learned Effective Mean/StdDev.
Banned IPs with their "Time Remaining" countdowns.
Top 10 Source IPs and system health (CPU/Memory).
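Under the hood this is just a Flask app exposing a JSON endpoint that the page polls every 3 seconds. A minimal sketch, with placeholder values standing in for the shared detection state:

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/metrics")
def metrics():
    # In the real engine these values come from the detection engine's shared state
    return jsonify({
        "requests_per_second": 12.4,
        "effective_mean": 3.1,
        "effective_stdev": 1.8,
        "banned_ips": [{"ip": "203.0.113.7", "seconds_remaining": 412}],
    })

# The dashboard page fetches /metrics every 3 seconds and redraws.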
Lessons Learned
The biggest takeaway here: DevOps is about observation, not just maintenance. Honestly, the hardest part wasn't the architecture; it was the math. I spent way more time than I'd like to admit fine-tuning thresholds so the system could tell the difference between a successful product launch and a genuine DDoS attack.
One real-world quirk I ran into during testing: I actually ended up banning my own Docker Gateway (172.18.0.1).
Because Nginx was seeing internal traffic through the Docker bridge, the engine flagged the gateway as an "aggressive attacker" and promptly locked it out. It was a classic "it works too well" moment. It forced me to implement a more robust whitelisting strategy for internal CIDR ranges—proving that even the best math needs to be grounded in the reality of how your specific network is plumbed.
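The fix itself is small. A sketch of the CIDR whitelist check using the standard-library ipaddress module (the ranges are examples, 172.18.0.0/16 being the bridge network my gateway lived on):

import ipaddress

# Internal ranges that must never be banned (Docker bridge, loopback, etc.)
WHITELISTED_NETWORKS = [
    ipaddress.ip_network("172.18.0.0/16"),
    ipaddress.ip_network("127.0.0.0/8"),
]

def is_whitelisted(ip):
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in WHITELISTED_NETWORKS)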
