<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Wandile Ndlovu</title>
    <description>The latest articles on DEV Community by Wandile Ndlovu (@wandile_ndlovu_7dd22d4943).</description>
    <link>https://dev.to/wandile_ndlovu_7dd22d4943</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3666948%2F7e93652e-7f09-4d8c-b3e8-854417aa82a5.png</url>
      <title>DEV Community: Wandile Ndlovu</title>
      <link>https://dev.to/wandile_ndlovu_7dd22d4943</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/wandile_ndlovu_7dd22d4943"/>
    <language>en</language>
    <item>
      <title>How I Built an Adaptive "Immune System" for Cloud Traffic</title>
      <dc:creator>Wandile Ndlovu</dc:creator>
      <pubDate>Wed, 29 Apr 2026 08:18:42 +0000</pubDate>
      <link>https://dev.to/wandile_ndlovu_7dd22d4943/how-i-built-an-adaptive-immune-system-for-cloud-traffic-53b</link>
      <guid>https://dev.to/wandile_ndlovu_7dd22d4943/how-i-built-an-adaptive-immune-system-for-cloud-traffic-53b</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8kws2l516lbu6zl1e11.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8kws2l516lbu6zl1e11.png" alt="Image" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Recently, I was tasked with a challenge: Build an automated defense system for a live Nextcloud instance. The goal wasn't just to block "bad guys," but to build a system that actually learns what a normal day looks like and reacts when things get weird.&lt;/p&gt;

&lt;p&gt;Here is the breakdown of how I engineered this system, the statistical math behind it, and why "Sliding Windows" are a developer's best defense.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture: Under the Hood
&lt;/h2&gt;

&lt;p&gt;This isn't just a script running in a vacuum. To make this work in a production-style environment, I deployed a stack that mirrors real-world DevOps architecture:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The Source: A Nextcloud instance running in Docker.

The Proxy: Nginx, configured to write JSON access logs to a specific path.

The Bridge: A named Docker volume, HNG-nginx-logs, shared between Nginx (writer) and my Python Daemon (reader).

The Brain: A multi-module Python engine that tails these logs in real-time.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
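&lt;p&gt;As a rough illustration of what the reader side of that bridge has to do, here is a minimal sketch of parsing one JSON access-log line. The key names remote_addr and status are assumptions modeled on Nginx's variable names; the real keys depend on the log_format the proxy is configured with.&lt;/p&gt;

```python
import json

def parse_log_line(line):
    """Return (ip, status) from one JSON access-log line, or None if unusable."""
    try:
        entry = json.loads(line)
        # Key names are hypothetical; match them to your own log_format.
        return entry["remote_addr"], int(entry["status"])
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, missing keys, or a non-object line: skip it.
        return None

sample = '{"remote_addr": "203.0.113.7", "status": "404", "request": "GET /.env"}'
print(parse_log_line(sample))      # ('203.0.113.7', 404)
print(parse_log_line("not json"))  # None
```

&lt;p&gt;Returning None for bad lines keeps a single corrupt log entry from crashing a long-running daemon.&lt;/p&gt;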
&lt;h2&gt;
  
  
  1. The Sliding Window: Beyond Simple Counters
&lt;/h2&gt;

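&lt;p&gt;Most beginners use a simple integer counter that resets every minute. That’s a mistake. If an attacker sends 1,000 requests in a burst that straddles the reset, say the last 10 seconds of one minute and the first 10 seconds of the next, each counter sees only part of the burst and can miss the "peak."&lt;/p&gt;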

&lt;p&gt;I built a &lt;strong&gt;Time-Based Sliding Window&lt;/strong&gt; on top of Python’s collections.deque.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;deque&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="c1"&gt;# Timestamps for a single IP; the engine keeps one deque like this per IP
&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;deque&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_request&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Add the new hit
&lt;/span&gt;
    &lt;span class="c1"&gt;# EVICTION LOGIC:
&lt;/span&gt;    &lt;span class="c1"&gt;# This ensures only the last 60 seconds exist at any time.
&lt;/span&gt;    &lt;span class="c1"&gt;# Not a counter, but a true time-based window.
&lt;/span&gt;    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;window&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;popleft&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Engineering Logic:&lt;/strong&gt; Each time a hit is recorded, timestamps older than 60 seconds are evicted from the left of the deque, so the count always reflects the 60 seconds leading up to that hit rather than an arbitrary clock minute. It’s a true rolling window, not a counter that resets.&lt;/p&gt;
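&lt;p&gt;The snippet above shows a single window. A minimal sketch of the per-IP version could look like the following; the names windows and record_hit are my own for illustration, and the timestamp is passed in explicitly (instead of calling time.time() inside) only so the behavior is easy to check.&lt;/p&gt;

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60
windows = defaultdict(deque)  # ip -> timestamps of that IP's recent hits

def record_hit(ip, now):
    """Record one hit for ip and return its hit count over the last 60 seconds."""
    window = windows[ip]
    window.append(now)
    # Evict timestamps that have aged out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window)

record_hit("203.0.113.7", now=0.0)
record_hit("203.0.113.7", now=30.0)
print(record_hit("203.0.113.7", now=70.0))  # 2, because the hit at t=0 was evicted
```

&lt;p&gt;Because timestamps arrive in order, the oldest entry is always on the left, so eviction only ever needs popleft().&lt;/p&gt;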

&lt;h2&gt;
  
  
  2. The Baseline: 1,800 Seconds of Learning
&lt;/h2&gt;

&lt;p&gt;To know what's "weird," the system has to know what's "normal."&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Rolling Memory: The system maintains 1,800 seconds (30 minutes) of per-second request counts.

Recalculation: Every 60 seconds, it recomputes the Mean and Standard Deviation.

Hourly Slots: Traffic at 3 PM is different from 3 AM. The system maintains 24 hourly slots, preferring the current hour’s baseline once that slot has collected enough samples to be reliable.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
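&lt;p&gt;A minimal sketch of the rolling memory, assuming a bounded deque and the population standard deviation (whether the real engine uses population or sample deviation is not stated here). The function names are hypothetical, and the hourly slots are left out to keep it short.&lt;/p&gt;

```python
from collections import deque
from statistics import mean, pstdev

BASELINE_SECONDS = 1800
# With maxlen set, the oldest second falls off automatically.
per_second_counts = deque(maxlen=BASELINE_SECONDS)

def record_second(count):
    """Append the request count for the second that just ended."""
    per_second_counts.append(count)

def compute_baseline():
    """Return (mean, stddev) of the stored per-second counts."""
    if not per_second_counts:
        return 0.0, 0.0
    return mean(per_second_counts), pstdev(per_second_counts)

for count in [4, 6, 5, 5, 4, 6]:
    record_second(count)
print(compute_baseline())
```

&lt;p&gt;In a daemon, compute_baseline() would be called on a 60-second timer and the result cached, so the detection path never recomputes statistics per request.&lt;/p&gt;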
&lt;h2&gt;
  
  
  3. Detection Logic: The 3.0 Z-Score Rule
&lt;/h2&gt;

&lt;p&gt;I didn't use a hardcoded limit like "100 hits = ban." Instead, the engine uses two triggers:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Z-Score &amp;gt; 3.0: A statistical flag meaning the current rate is more than 3 standard deviations above the baseline mean.

The 5x Rule: The current rate exceeds 5x the baseline mean.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Whichever triggers first results in an anomaly flag. This allows the system to be strict during quiet hours and flexible during peak traffic.&lt;/p&gt;
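&lt;p&gt;The two triggers can be sketched as a single check. This is an illustration, not the engine's code: the function name is made up, and treating a zero standard deviation as a z-score of 0 is my own assumption about how to avoid dividing by zero.&lt;/p&gt;

```python
def is_anomalous(rate, mean, stddev, z_threshold=3.0, multiplier=5.0):
    """Flag the rate if either the z-score rule or the 5x rule fires."""
    # Guard against division by zero when the baseline is perfectly flat.
    z_score = (rate - mean) / stddev if stddev > 0 else 0.0
    return z_score > z_threshold or rate > multiplier * mean

print(is_anomalous(rate=12, mean=5, stddev=2))    # True: z = 3.5
print(is_anomalous(rate=60, mean=10, stddev=20))  # True: z = 2.5, but 60 exceeds 5 * 10
print(is_anomalous(rate=9, mean=5, stddev=2))     # False: z = 2.0 and 9 is under 25
```

&lt;p&gt;The second call shows why both rules exist: when traffic is noisy the standard deviation is large, the z-score stays low, and the 5x rule is what catches the spike.&lt;/p&gt;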

&lt;h2&gt;
  
  
  4. The "Zero-Trust" Error Surge
&lt;/h2&gt;

&lt;p&gt;Attackers often leave a trail of &lt;em&gt;404 Not Found&lt;/em&gt; (scanning for hidden files) or &lt;em&gt;500 Internal Server Error&lt;/em&gt; (trying to crash the DB).&lt;br&gt;
My engine tracks the &lt;strong&gt;Error Rate&lt;/strong&gt;. If an IP's 4xx/5xx error rate exceeds 3x the baseline error rate, the system automatically tightens the Z-score threshold for that IP from 3.0 to 1.5. We stop giving them the benefit of the doubt.&lt;/p&gt;
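&lt;p&gt;As a sketch of that rule, assuming error rates are expressed as fractions of requests (the function and parameter names are hypothetical):&lt;/p&gt;

```python
def z_threshold_for(ip_error_rate, baseline_error_rate,
                    normal=3.0, tightened=1.5, surge_factor=3.0):
    """Return the z-score threshold to apply to an IP, given its error rate."""
    if ip_error_rate > surge_factor * baseline_error_rate:
        return tightened
    return normal

print(z_threshold_for(ip_error_rate=0.40, baseline_error_rate=0.05))  # 1.5
print(z_threshold_for(ip_error_rate=0.10, baseline_error_rate=0.05))  # 3.0
```

&lt;p&gt;The returned value would then be used as the z-score threshold the next time that IP's request rate is evaluated.&lt;/p&gt;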

&lt;h2&gt;
  
  
  5. Enforcement &amp;amp; The Lifecycle of a Ban
&lt;/h2&gt;

&lt;p&gt;Detection is useless without action. When a ban is triggered, the engine blocks the offender at the Linux kernel's packet-filtering layer.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Action: Injects a DROP rule into iptables.

Backoff Schedule: Bans escalate from 10 minutes → 30 minutes → 2 hours → Permanent.

Alerting: A Slack notification is fired within 10 seconds, containing the Z-score, current rate, and baseline.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
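&lt;p&gt;Here is a minimal sketch of the backoff schedule and of the iptables command a ban could translate to. The helper names are hypothetical, and the INPUT chain is an assumption: with Docker-published ports, filtering rules typically belong in the DOCKER-USER chain instead. The sketch only builds the command; it does not execute it.&lt;/p&gt;

```python
BAN_SCHEDULE = [600, 1800, 7200, None]  # seconds; None means permanent

def ban_duration(offense_count):
    """Return the ban length in seconds for the nth offense (1-based), or None for permanent."""
    index = min(max(offense_count, 1), len(BAN_SCHEDULE)) - 1
    return BAN_SCHEDULE[index]

def drop_rule_args(ip):
    """Build the iptables argument list that inserts a DROP rule for ip."""
    # Chain name is an assumption; adjust it to where your traffic is filtered.
    return ["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"]

print(ban_duration(1), ban_duration(3), ban_duration(9))  # 600 7200 None
print(" ".join(drop_rule_args("203.0.113.7")))
```

&lt;p&gt;Passing the list to subprocess.run (with root privileges) would apply the rule. Building an argument list rather than a shell string means the IP is never interpreted by a shell.&lt;/p&gt;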
&lt;h2&gt;
  
  
  6. Real-Time Observability
&lt;/h2&gt;

&lt;p&gt;The Live Metrics UI serves as the control room. Built with Flask and refreshing every 3 seconds, it provides full visibility:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Global req/s vs. Learned Effective Mean/StdDev.

Banned IPs with their "Time Remaining" countdowns.

Top 10 Source IPs and system health (CPU/Memory).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;The biggest takeaway here: DevOps is about observation, not just maintenance. Honestly, the hardest part wasn't the architecture; it was the math. I spent way more time than I'd like to admit fine-tuning thresholds so the system could tell the difference between a successful product launch and a genuine DDoS attack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One real-world quirk I ran into during testing&lt;/strong&gt;: I actually ended up banning my own Docker Gateway (172.18.0.1).&lt;/p&gt;

&lt;p&gt;Because Nginx was seeing internal traffic through the Docker bridge, the engine flagged the gateway as an "aggressive attacker" and promptly locked it out. It was a classic "it works too well" moment. It forced me to implement a more robust whitelisting strategy for internal CIDR ranges—proving that even the best math needs to be grounded in the reality of how your specific network is plumbed.&lt;/p&gt;
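&lt;p&gt;A minimal sketch of that kind of check, using Python's standard ipaddress module. The specific ranges below (the RFC 1918 private blocks plus loopback) are an assumption for illustration; the right list depends on how your own network is laid out.&lt;/p&gt;

```python
import ipaddress

# Private ranges from RFC 1918 plus loopback; adjust to the actual network layout.
ALLOWLIST = [ipaddress.ip_network(cidr) for cidr in
             ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "127.0.0.0/8")]

def is_allowlisted(ip):
    """Return True if ip falls inside any allowlisted CIDR range."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in ALLOWLIST)

print(is_allowlisted("172.18.0.1"))   # True: the Docker gateway sits in 172.16.0.0/12
print(is_allowlisted("203.0.113.7"))  # False
```

&lt;p&gt;Running this check before the ban step means an internal address can still show up in the metrics without ever reaching iptables.&lt;/p&gt;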

</description>
      <category>programming</category>
      <category>devops</category>
      <category>security</category>
      <category>cloud</category>
    </item>
  </channel>
</rss>
