DEV Community: Timilehin Obalereko

I Built a Tool That Writes Its Own Infrastructure

Timilehin Obalereko — Wed, 06 May 2026 12:18:42 +0000

A complete beginner-friendly walkthrough of building SwiftDeploy: a CLI tool that generates Nginx configs, manages Docker containers, enforces deployment policies, and gives you a live metrics dashboard — all from a single YAML file.

Who This Post Is For

If you have heard words like "Docker", "Nginx", "deployment", or "policy-as-code" and felt a little lost — this post is for you. I will explain every concept from scratch before using it. By the end, you will understand not just what I built, but why every piece exists.

The Problem This Solves

Imagine you are deploying a web app. Normally you would have to:

Write a Docker Compose file to describe your containers
Write an Nginx config to set up your web server
Remember to update both files every time something changes
Hope you did not make a typo somewhere

That is a lot of manual work, and manual work leads to mistakes.

What if instead you just wrote one simple file describing what you want, and a smart tool generated everything else?

That is exactly what SwiftDeploy does. You write manifest.yaml. SwiftDeploy does the rest.

Quick Glossary (Read This First)

Before we dive in, here are the key terms used throughout this post:

Docker — A tool that packages your app into a "container" — think of it like a shipping container for software. It runs the same way everywhere.

Container — A lightweight isolated box running your app. Like a mini computer inside your computer.

Docker Compose — A tool that lets you run multiple containers together and define how they connect.

Nginx — A web server that sits in front of your app and handles incoming traffic. Think of it as a receptionist who forwards calls to the right person.

Reverse proxy — When Nginx receives a request from a user and forwards it to your app. The user never talks to your app directly.

YAML — A simple file format using indentation to represent data. Like a structured shopping list.

CLI — Command Line Interface. A tool you run by typing commands in the terminal.

Canary deployment — A technique where you run two versions of your app at the same time — a stable version for most users and a "canary" version for testing. Like sending one canary into a mine before sending all the miners.

OPA (Open Policy Agent) — A policy engine. You write rules in a language called Rego, and OPA decides whether an action is allowed based on those rules.

Prometheus metrics — A standard format for exposing app statistics like request counts and response times. Looks like plain text with specific formatting.

Part 1 — The Foundation: manifest.yaml as the Single Source of Truth

What Is a "Single Source of Truth"?

In software, a "single source of truth" means one place where the authoritative information lives. If you have the same information in multiple files and they disagree, which one is right? Nobody knows. That is a bug waiting to happen.

SwiftDeploy solves this by making manifest.yaml the only file you ever edit. Every other config file is generated from it automatically.

Here is what the manifest looks like:

services:
  image: swift-deploy-1-node:latest   # Which Docker image to run
  port: 3000                          # What port your app uses inside the container
  mode: stable                        # stable or canary
  version: "1.0.0"                    # App version
  restart: unless-stopped             # Restart if it crashes

nginx:
  image: nginx:latest                 # The Nginx web server image
  port: 8080                          # The port the outside world connects to
  proxy_timeout: 30                   # Give up after 30 seconds

network:
  name: swiftdeploy-net               # Name of the internal network
  driver_type: bridge                 # Type of network

contact: "admin@swiftdeploy.local"    # Shown in error messages

Plain English: "Run my app on port 3000, put Nginx in front of it on port 8080, connect them on an internal network."

How Does `swiftdeploy init` Work?

When you run ./swiftdeploy init, it:

Reads manifest.yaml
Opens templates/nginx.conf.tmpl — a template with placeholder values like {{NGINX_PORT}}
Replaces every placeholder with the real value from the manifest
Writes the result to nginx.conf
Does the same thing for docker-compose.yml

The key insight is string replacement. The template has {{NGINX_PORT}} and the script replaces it with 8080. Simple but powerful.

# This is the core of how it works (str() because YAML parses ports as integers)
nginx_conf = (
    nginx_conf
    .replace('{{NGINX_PORT}}', str(nginx_port))
    .replace('{{PROXY_TIMEOUT}}', str(proxy_timeout))
    .replace('{{APP_PORT}}', str(app_port))
)

The grader can delete the generated files and run ./swiftdeploy init again — they will be recreated perfectly from the manifest. That is the whole point.
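Here is a minimal, self-contained sketch of that replacement step, assuming the `{{NAME}}` placeholder style shown above. The function name `render_template` and the leftover-placeholder check are illustrative additions, not taken from the SwiftDeploy source.

```python
import re

def render_template(template: str, values: dict) -> str:
    """Replace every {{KEY}} placeholder with its value from `values`."""
    rendered = template
    for key, value in values.items():
        # str() because YAML parses ports and timeouts as integers
        rendered = rendered.replace("{{" + key + "}}", str(value))

    # Fail loudly if any placeholder was left unfilled
    leftover = re.findall(r"\{\{([A-Z_]+)\}\}", rendered)
    if leftover:
        raise ValueError(f"Unfilled placeholders: {sorted(set(leftover))}")
    return rendered


template = "listen {{NGINX_PORT}};\nproxy_read_timeout {{PROXY_TIMEOUT}}s;\n"
config = render_template(template, {"NGINX_PORT": 8080, "PROXY_TIMEOUT": 30})
```

Checking for leftovers is a cheap way to catch a manifest key that was renamed without updating the template.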

Part 2 — The Architecture: Three Containers, One Network

When you run ./swiftdeploy deploy, three containers start:

Outside World
      |
      | port 8080
      ↓
  ┌─────────┐
  │  Nginx  │  ← Receives all traffic, forwards to app
  └────┬────┘
       │ internal network (swiftdeploy-net)
       ↓
  ┌─────────┐
  │   App   │  ← Python service, port 3000 (never exposed to outside)
  └─────────┘

  ┌─────────┐
  │   OPA   │  ← Policy engine, port 8181 (CLI only, not through Nginx)
  └─────────┘

Why does the app port never get exposed?

This is a security decision. If port 3000 were exposed directly, anyone could bypass Nginx and hit your app without going through timeouts, error handling, logging, or security headers. By keeping it internal, all traffic is forced through Nginx.

Why is OPA isolated from Nginx?

OPA is only for the CLI to query. It is not a user-facing service. If it were accessible through Nginx on port 8080, anyone could query your policies. So OPA lives on the same internal Docker network but is never proxied by Nginx.

Part 3 — The App: A Python HTTP Service

The app (app/main.py) is a pure Python web server — no external frameworks, just Python's built-in http.server. This keeps the Docker image tiny (74MB, well under the 300MB limit).

The Three Endpoints

GET / — Welcome message with mode, version, and timestamp:

{
  "message": "Welcome to SwiftDeploy! Running in canary mode.",
  "mode": "canary",
  "version": "1.0.0",
  "time": "2026-05-05T21:00:00+00:00"
}

GET /healthz — Is the app alive? How long has it been running?

{
  "status": "ok",
  "uptime": 342.15
}

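Docker polls this endpoint every 10 seconds. If it fails three times in a row, Docker marks the container "unhealthy". Plain Docker and Docker Compose only flag that status; an orchestrator such as Swarm, or any script that watches health, is what actually replaces the container. That is why we never apply chaos to /healthz: if we did, a chaos test would make a working container look dead to everything that relies on the health check.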

POST /chaos — Only available in canary mode. Simulates failures:

# Make 50% of requests fail with 500 errors
curl -X POST http://localhost:8080/chaos \
  -d '{"mode": "error", "rate": 0.5}'

# Make every request sleep 5 seconds
curl -X POST http://localhost:8080/chaos \
  -d '{"mode": "slow", "duration": 5}'

# Cancel all chaos
curl -X POST http://localhost:8080/chaos \
  -d '{"mode": "recover"}'

Why Only in Canary Mode?

Because canary mode is the "I am testing something risky" mode. Stable mode means production traffic — you should never be able to break it intentionally. Canary mode is the sandbox where chaos makes sense.

Thread Safety

The app handles multiple requests at the same time. If two requests tried to update the chaos state simultaneously, they could corrupt each other. Python's threading.Lock() prevents this:

# Lock means: only one request can change this at a time
with chaos_lock:
    chaos_state["mode"] = "error"
    chaos_state["rate"] = 0.5

This is called "thread safety" — making sure shared data is not corrupted by concurrent access.
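To show the pattern end to end, here is a small sketch of a lock-protected chaos state. The helper names `set_chaos` and `should_fail`, the `"none"` mode value, and the `duration` field are assumptions for illustration; only `chaos_lock` and `chaos_state` appear in the snippet above.

```python
import random
import threading

chaos_lock = threading.Lock()
chaos_state = {"mode": "none", "rate": 0.0, "duration": 0.0}

def set_chaos(mode, rate=0.0, duration=0.0):
    """Update all chaos fields together so readers never see a half-written state."""
    with chaos_lock:
        chaos_state["mode"] = "none" if mode == "recover" else mode
        chaos_state["rate"] = rate
        chaos_state["duration"] = duration

def should_fail(roll=None):
    """Return True if this request should get a simulated 500."""
    with chaos_lock:
        # Copy what we need while holding the lock, decide after releasing it
        mode, rate = chaos_state["mode"], chaos_state["rate"]
    if mode != "error":
        return False
    if roll is None:
        roll = random.random()
    return roll < rate
```

The lock is held only long enough to copy two values, so request threads do not queue behind each other while the random roll is evaluated.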

Part 4 — The Nginx Configuration

Nginx sits between the user and the app. Here is what our generated nginx.conf does:

Custom Log Format

The task required logs in a specific format. In Nginx, you define log formats with log_format:

log_format swiftdeploy '$time_iso8601 | $status | ${request_time}s | $upstream_addr | $request';

This produces logs like:

2026-05-05T21:00:00+00:00 | 200 | 0.001s | 172.18.0.2:3000 | GET / HTTP/1.1

Timestamp | Status code | How long it took | Where it went | What was requested

JSON Error Responses

When the app is down or slow, Nginx returns an error. By default, Nginx returns HTML error pages — ugly and not useful for APIs. We override this with JSON:

location @err502 {
    default_type application/json;
    return 502 '{"error": "Bad Gateway", "code": 502, "service": "swiftdeploy-api", "contact": "admin@swiftdeploy.local"}';
}

502 means "Bad Gateway" — Nginx could not reach the app. 503 means "Service Unavailable". 504 means "Gateway Timeout" — the app took too long.

The DNS Resolution Trick

One problem we hit: when running nginx -t to validate the config syntax, Nginx tries to resolve the hostname app (the Docker container name). But app only exists inside the Docker network — not during standalone validation. This caused a "host not found" error even though the config was syntactically correct.

The fix is to use a variable for the upstream address:

resolver 127.0.0.11 valid=10s;  # Docker's internal DNS server

location / {
    set $upstream http://app:3000;  # Variable = skip DNS at startup
    proxy_pass $upstream;           # Nginx resolves it per-request instead
}

When proxy_pass uses a variable, Nginx skips DNS resolution at startup and resolves it per-request. This lets nginx -t pass even when the app container does not exist yet.

Part 5 — Stable vs Canary Mode

Canary deployments come from a real practice in mining. Miners used to bring canaries into mines — if the canary died, the air was bad and the miners knew to leave. In software, a "canary" is a small deployment that gets traffic first. If it fails, only a small percentage of users are affected before you roll back.

SwiftDeploy implements a simple version:

Stable — normal mode, chaos disabled
Canary — test mode, chaos enabled, every response gets X-Mode: canary header

When you promote:

./swiftdeploy promote canary

The script:

Runs a 30-second OPA policy check (explained in Part 7)
Updates mode: canary in manifest.yaml in-place (preserving all comments)
Regenerates docker-compose.yml with the new MODE=canary environment variable
Restarts only the app container — Nginx stays up, no downtime
Polls /healthz to confirm the new mode is active
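The "in-place, preserving all comments" step is worth a sketch, because loading the YAML and dumping it again with PyYAML would drop the comments. One possible approach, assuming the manifest has a single `mode:` key, is a targeted regex edit. The function name `set_mode` is hypothetical.

```python
import re

def set_mode(manifest_text: str, new_mode: str) -> str:
    """Rewrite only the value of the `mode:` key, leaving comments and spacing alone."""
    if new_mode not in ("stable", "canary"):
        raise ValueError(f"Unknown mode: {new_mode}")
    # group 1: indentation and key, group 2: old value, group 3: rest of the line
    pattern = re.compile(r"^([ \t]*mode:[ \t]*)(\S+)(.*)$", re.MULTILINE)
    updated, count = pattern.subn(
        lambda m: m.group(1) + new_mode + m.group(3), manifest_text, count=1
    )
    if count == 0:
        raise ValueError("No `mode:` key found in manifest")
    return updated


manifest = (
    "services:\n"
    "  port: 3000\n"
    "  mode: stable   # stable or canary\n"
)
updated = set_mode(manifest, "canary")
```

Only the value changes; the trailing comment and every other line come back byte for byte.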

The X-Mode: canary header indicates to clients that they are talking to the canary version. Nginx forwards it from the app to the user:

proxy_pass_header X-Mode;
add_header X-Mode $upstream_http_x_mode always;

Part 6 — Prometheus Metrics: The "Eyes" of the System

What Are Metrics?

Metrics are numbers that describe how your app is behaving. Things like:

How many requests per second?
What percentage are failing?
How long do requests take?

Prometheus is a popular monitoring system. It expects metrics in a specific text format.

What We Track

http_requests_total — A counter. Counts every request, labelled by method, path, and status code:

http_requests_total{method="GET",path="/",status_code="200"} 142
http_requests_total{method="GET",path="/",status_code="500"} 8

http_request_duration_seconds — A histogram. Groups request durations into buckets:

http_request_duration_seconds_bucket{le="0.1"} 140   # 140 requests took ≤ 0.1s
http_request_duration_seconds_bucket{le="0.5"} 142   # 142 requests took ≤ 0.5s
http_request_duration_seconds_bucket{le="+Inf"} 150  # 150 total requests

A histogram is what allows us to calculate percentile latency. The P99 (99th percentile) means "99% of requests completed within this time." It is more meaningful than the average because averages hide slow outliers.

app_mode — 0 for stable, 1 for canary.

chaos_active — 0 for none, 1 for slow, 2 for error.
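To make the percentile idea concrete, here is one simple way to read a P99 out of cumulative histogram buckets like the ones above. This is a sketch under assumptions: it reports the upper bound of the bucket containing the target request, whereas Prometheus's own `histogram_quantile` interpolates within the bucket. The function name is hypothetical.

```python
def percentile_from_buckets(buckets, q):
    """
    buckets: list of (upper_bound_seconds, cumulative_count), sorted by bound,
             ending with (float("inf"), total).
    q: quantile between 0 and 1, e.g. 0.99 for P99.
    Returns the upper bound of the bucket that contains the q-th request.
    """
    total = buckets[-1][1]
    if total == 0:
        return 0.0
    rank = q * total
    largest_finite = 0.0
    for bound, cumulative in buckets:
        if bound != float("inf"):
            largest_finite = bound
        if cumulative >= rank:
            # Requests in the +Inf bucket have no known upper bound,
            # so report the largest finite bound instead
            return largest_finite
    return largest_finite


buckets = [(0.1, 140), (0.5, 142), (float("inf"), 150)]
p50 = percentile_from_buckets(buckets, 0.50)
p99 = percentile_from_buckets(buckets, 0.99)
```

With these buckets the median lands in the 0.1s bucket, while the P99 falls among the 8 slow requests beyond 0.5s, which is exactly the outlier behaviour an average would hide.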

How `record_request()` Works

After every request completes, we call:

record_request("GET", "/", 200, duration)

This updates both the counter and the histogram in one call. We do NOT record /healthz or /metrics — those are infrastructure endpoints, not user traffic.
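A rough sketch of what a function like this might look like is below. The bucket boundaries, the module-level variable names, and `render_counter` are assumptions made for illustration; only the `record_request(method, path, status, duration)` call shape and the skipped endpoints come from the description above.

```python
import threading
from collections import defaultdict

BUCKET_BOUNDS = [0.1, 0.5, 1.0, 5.0, 10.0]  # seconds; illustrative values
SKIPPED_PATHS = {"/healthz", "/metrics"}

metrics_lock = threading.Lock()
request_counts = defaultdict(int)                # (method, path, status) -> count
bucket_counts = [0] * (len(BUCKET_BOUNDS) + 1)   # last slot is +Inf

def record_request(method, path, status_code, duration):
    """Update the request counter and the latency histogram together."""
    if path in SKIPPED_PATHS:
        return
    with metrics_lock:
        request_counts[(method, path, status_code)] += 1
        # Histogram buckets are cumulative: a 0.05s request counts in every bucket
        for i, bound in enumerate(BUCKET_BOUNDS):
            if duration <= bound:
                bucket_counts[i] += 1
        bucket_counts[-1] += 1  # +Inf always increments

def render_counter():
    """Render the counter in Prometheus text exposition format."""
    with metrics_lock:
        items = sorted(request_counts.items())
    return "\n".join(
        f'http_requests_total{{method="{m}",path="{p}",status_code="{s}"}} {n}'
        for (m, p, s), n in items
    )
```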

Part 7 — OPA: The "Brain" That Makes Decisions

What Is OPA?

Open Policy Agent is a general-purpose policy engine. You write rules in a language called Rego (pronounced "ray-go"), and OPA evaluates them against data you send it.

The core principle: The CLI never decides whether to allow or deny. It only collects data, sends it to OPA, and surfaces the result. All decision logic lives in Rego.

Why does this matter? Because it means you can change your policies without touching your deployment code. Policy and mechanics are separated.

The Infrastructure Policy

Before every deployment, we check if the host has enough resources. Here is policies/infrastructure.rego:

package infrastructure

default allow := false

allow if {
    count(deny) == 0    # Allow if there are zero deny reasons
}

deny contains reason if {
    input.disk_free_gb < data.infrastructure.min_disk_free_gb
    reason := sprintf("Disk free %.1fGB is below minimum %dGB",
        [input.disk_free_gb, data.infrastructure.min_disk_free_gb])
}

deny contains reason if {
    input.cpu_load > data.infrastructure.max_cpu_load
    reason := sprintf("CPU load %.2f exceeds maximum %.2f",
        [input.cpu_load, data.infrastructure.max_cpu_load])
}

Notice: the thresholds (min_disk_free_gb, max_cpu_load) are not written in the Rego file. They come from data.json:

{
  "infrastructure": {
    "min_disk_free_gb": 10,
    "max_cpu_load": 2.0,
    "max_mem_percent": 90
  }
}

This separation means: change the threshold by editing data.json, not the policy logic. Environments have different needs — a production server and a test server should not have the same disk requirements.

The Canary Safety Policy

Before every promotion, we check if the running service is healthy enough to switch modes. This is policies/canary.rego:

package canary

default allow := false

allow if { count(deny) == 0 }

deny contains reason if {
    input.error_rate_percent > data.canary.max_error_rate_percent
    reason := sprintf("Error rate %.2f%% exceeds maximum %.2f%% over last 30s",
        [input.error_rate_percent, data.canary.max_error_rate_percent])
}

deny contains reason if {
    input.p99_latency_ms > data.canary.max_p99_latency_ms
    reason := sprintf("P99 latency %.0fms exceeds maximum %dms over last 30s",
        [input.p99_latency_ms, data.canary.max_p99_latency_ms])
}

How the 30-Second Window Works

The requirement says "over the last 30 seconds." But Prometheus metrics are cumulative — they count from when the app started, not just the last 30 seconds.

The trick: take two snapshots 30 seconds apart and subtract them.

Snapshot 1 (time T)    →    wait 30s    →    Snapshot 2 (time T+30)

delta = Snapshot2 - Snapshot1   ←  this is what happened in the last 30 seconds

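# totals1/totals2 and errors1/errors2: request and error counters from each snapshot
delta_total  = sum(totals2.values()) - sum(totals1.values())
delta_errors = sum(errors2.values()) - sum(errors1.values())
# An idle window has no requests, so avoid dividing by zero
error_rate_percent = (delta_errors / delta_total * 100) if delta_total > 0 else 0.0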

This gives us a true "what happened in the last 30 seconds" measurement, regardless of historical traffic.
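The two-snapshot calculation can be sketched as a runnable function that works directly on the text returned by /metrics. Treating only 5xx responses as errors is an assumption, as are the function names.

```python
import re

LINE = re.compile(
    r'^http_requests_total\{(?P<labels>[^}]*)\}\s+(?P<value>\d+(?:\.\d+)?)$'
)

def parse_counts(metrics_text):
    """Return (total_requests, error_requests) from Prometheus text output."""
    total = errors = 0.0
    for line in metrics_text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue
        value = float(match.group("value"))
        status = re.search(r'status_code="(\d+)"', match.group("labels"))
        total += value
        if status and status.group(1).startswith("5"):
            errors += value
    return total, errors

def window_error_rate(snapshot1, snapshot2):
    """Error rate in percent for the traffic between the two snapshots."""
    total1, errors1 = parse_counts(snapshot1)
    total2, errors2 = parse_counts(snapshot2)
    delta_total = total2 - total1
    delta_errors = errors2 - errors1
    # No traffic, or counters reset by a restart: report 0 rather than divide
    if delta_total <= 0:
        return 0.0
    return delta_errors / delta_total * 100


before = (
    'http_requests_total{method="GET",path="/",status_code="200"} 62\n'
    'http_requests_total{method="GET",path="/",status_code="500"} 7\n'
)
after = (
    'http_requests_total{method="GET",path="/",status_code="200"} 106\n'
    'http_requests_total{method="GET",path="/",status_code="500"} 43\n'
)
rate = window_error_rate(before, after)
```

In this example the window saw 80 requests and 36 errors, so the historical 62 successful requests no longer dilute the result.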

How the CLI Talks to OPA

OPA exposes a REST API. The CLI sends a POST request with the data:

curl -X POST http://127.0.0.1:8181/v1/data/infrastructure \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "disk_free_gb": 11.0,
      "cpu_load": 0.18,
      "mem_percent": 53.5
    }
  }'

OPA responds with:

{
  "result": {
    "allow": true,
    "deny": []
  }
}

Or if blocked:

{
  "result": {
    "allow": false,
    "deny": ["Disk free 4.0GB is below minimum 10GB"]
  }
}

The CLI reads allow to know what to do, and prints the deny reasons so the operator knows exactly why something was blocked.
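The same exchange can be sketched in Python with only the standard library. The function names are hypothetical, and treating a failed or empty response as a denial is an assumption about how a cautious CLI might behave rather than something the post specifies.

```python
import json
import urllib.request

OPA_URL = "http://127.0.0.1:8181/v1/data/infrastructure"

def build_request(input_data, url=OPA_URL):
    """Build the POST request; OPA expects the payload under an "input" key."""
    body = json.dumps({"input": input_data}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )

def read_decision(response_body):
    """Extract (allow, reasons) from OPA's JSON response; deny by default."""
    result = json.loads(response_body).get("result", {})
    return bool(result.get("allow", False)), list(result.get("deny", []))

def query_opa(input_data, timeout=5):
    """Send the input to OPA. Any failure is treated as a denial (assumption)."""
    try:
        request = build_request(input_data)
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return read_decision(resp.read())
    except (OSError, ValueError) as exc:
        return False, [f"OPA query failed: {exc}"]
```

Note that the code never compares disk or CPU numbers itself. It forwards the input and reports whatever `allow` and `deny` come back, which keeps the decision logic in Rego.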

Part 8 — What Happened When I Injected Chaos

This is where it gets interesting. After deploying in canary mode and activating error chaos:

./swiftdeploy promote canary

curl -X POST http://localhost:8080/chaos \
  -d '{"mode": "error", "rate": 0.5}'

The status dashboard started showing failures:

══════════════════════════════════════════════════════
  SwiftDeploy Status Dashboard
  2026-05-05T21:55:12Z
══════════════════════════════════════════════════════
  Mode:        canary
  Chaos:       error

  Total reqs:  89
  Error rate:  48.31%
  P99 latency: 10000ms

  Policy Compliance:
    ✅ infrastructure: PASS (disk=11GB, cpu=0.22, mem=54.1%)
    ❌ canary: FAIL (error=48.31%, p99=10000ms)

Then when I tried to promote to stable while the chaos was running:

── OPA Pre-Promote Policy Check ──────────────────────
  Sampling metrics over a 30-second window...
  Last 30s: 80 requests, 36 errors
  Error rate: 45.0000%, P99 latency: 10000.00ms
  ❌ BLOCK: Canary policy denied promotion
    ↳ Error rate 45.00% exceeds maximum 1.00% over last 30s
    ↳ P99 latency 10000ms exceeds maximum 500ms over last 30s
❌ Promotion blocked by policy.

The policy worked. The system refused to let me promote while the service was actively failing. This is the whole point — you cannot accidentally promote a broken service.

After recovering:

curl -X POST http://localhost:8080/chaos -d '{"mode": "recover"}'
./swiftdeploy promote stable  # Now succeeds

Part 9 — The Audit Trail

Every significant event is written to history.jsonl:

{"timestamp":"2026-05-05T21:00:00Z","event":"deploy","details":{"mode":"stable"}}
{"timestamp":"2026-05-05T21:10:00Z","event":"promote","details":{"mode":"canary"}}
{"timestamp":"2026-05-05T21:15:00Z","event":"policy_check","details":{"policy":"canary","result":"fail","error_rate_percent":45.0}}
{"timestamp":"2026-05-05T21:20:00Z","event":"promote","details":{"mode":"stable"}}

Running ./swiftdeploy audit transforms this into a readable Markdown report:

# SwiftDeploy Audit Report

Generated: 2026-05-05T21:30:00Z
Total events: 24

## Timeline

| Timestamp | Event | Details |
|-----------|-------|---------|
| 2026-05-05T21:00:00Z | deploy | mode=stable |
| 2026-05-05T21:10:00Z | promote | → canary |
| 2026-05-05T21:15:00Z | policy_check | canary: fail |

## Policy Violations

| Timestamp | Policy | Reasons |
|-----------|--------|---------|
| 2026-05-05T21:15:00Z | canary | Error rate 45.00% exceeds maximum 1.00% over last 30s |
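For readers curious how a JSONL log becomes a table, here is a minimal sketch of the timeline part. It is an illustration only: `render_timeline` is a hypothetical name, and the `key=value` formatting of the details column is simpler than the report shown above.

```python
import json

def render_timeline(jsonl_text):
    """Turn history.jsonl content into a Markdown timeline table."""
    rows = ["| Timestamp | Event | Details |", "|-----------|-------|---------|"]
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip a corrupt line rather than abort the report
        if not isinstance(event, dict):
            continue
        details = ", ".join(f"{k}={v}" for k, v in event.get("details", {}).items())
        rows.append(
            f"| {event.get('timestamp', '')} | {event.get('event', '')} | {details} |"
        )
    return "\n".join(rows)


history = (
    '{"timestamp":"2026-05-05T21:00:00Z","event":"deploy","details":{"mode":"stable"}}\n'
    "not json\n"
)
table = render_timeline(history)
```

Because each line of a JSONL file is an independent JSON document, one damaged line costs one row, not the whole audit trail.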

Part 10 — Lessons Learned

1. The DNS Resolution Problem Was the Biggest Surprise

I spent a long time debugging why nginx -t kept failing with "host not found" even though my config was correct. The issue was that Nginx tries to resolve hostnames at startup — but app is a Docker hostname that only exists inside a running Docker network.

The fix (using set $upstream as a variable) was not obvious. It is an Nginx-specific trick that delays DNS resolution from startup to per-request. Learning this saved me from a broken validation check.

2. Capabilities Matter More Than You Think

Running containers as non-root is well-known advice. But dropping capabilities (like cap_drop: ALL) is less discussed. When I dropped all capabilities from Nginx, it crashed on startup with:

chown("/var/cache/nginx/client_temp", 101) failed (1: Operation not permitted)

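Nginx needs CHOWN to set up its temp directories. Dropping ALL capabilities breaks anything that depends on even one of them, and the failure only shows up at startup. The lesson: drop specific capabilities you know the container does not need, or drop ALL and add back the required ones with cap_add; do not drop everything blindly.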

3. OPA's `sprintf` Handles Integers and Floats Differently

When my Rego policy tried to format an integer from data.json using %.0f, OPA produced %!f(int=500) — a formatting error. The fix was to use %d for integers from data.json and %.0f only for float inputs. OPA does not silently convert types the way Python does.

4. Cumulative Metrics Need Careful Handling

Prometheus counters are cumulative: they only go up for as long as the process runs. To calculate "what happened in the last 30 seconds," you cannot just look at the current value. You need two snapshots and a delta. This is obvious in retrospect, but took a few failed attempts to get right.

5. The Manifest as the Single Source of Truth Actually Works

The grader requirement was: delete generated files, run ./swiftdeploy init, and verify they regenerate correctly. This forced a discipline that turned out to be genuinely useful. When you know everything comes from one file, debugging is much easier. There is only one place to look.

Complete Setup Instructions

# Clone the repo
git clone https://github.com/YOUR_USERNAME/swiftdeploy-automation.git
cd swiftdeploy-automation

# Install Python dependency
pip install pyyaml

# Build the Docker image
docker build -t swift-deploy-1-node:latest .

# Make CLI executable
chmod +x swiftdeploy

# Deploy
./swiftdeploy deploy

# Try everything
curl http://localhost:8080/
curl http://localhost:8080/healthz
curl http://localhost:8080/metrics
./swiftdeploy status
./swiftdeploy promote canary
./swiftdeploy promote stable
./swiftdeploy audit
./swiftdeploy teardown --clean

Conclusion

SwiftDeploy taught me that the most valuable thing in infrastructure is not the tools themselves — it is the discipline they enforce.

Making the manifest the single source of truth means you cannot have config drift. Putting all policy logic in OPA means your deployment tool cannot make silent decisions. Requiring metrics before promotion means you cannot promote a broken service by accident.

Each of these constraints feels like a restriction at first. But each one also prevents a whole category of mistakes.

If you want to replicate this project, the full source code is at the GitHub link below. Every file is commented and explained. Start with manifest.yaml, read swiftdeploy from top to bottom, then look at the Rego policies.

The best way to learn is to break it intentionally — fill up the disk, inject chaos, watch the policies fire. That is what the chaos endpoint is for.

Source code: github.com/devops-timi/swiftdeploy-automation

Real-Time Anomaly Detection Engine for a Cloud Storage Platform

Timilehin Obalereko — Tue, 28 Apr 2026 20:50:23 +0000

I built a Python daemon that watches incoming HTTP traffic in real time, learns what "normal" looks like, and automatically blocks attackers using Linux's built-in firewall — all without any third-party security tools.

The Problem

Imagine you run a cloud storage company. Your platform is public — anyone on the internet can send requests to it. Most of those people are legitimate users uploading files. But some of them are attackers — bots hammering your server with thousands of requests per second, trying to crash it, brute-force passwords, or scrape data.

You need something that:

Watches all incoming traffic in real time
Learns what normal traffic looks like
Detects when something looks abnormal
Automatically blocks the attacker
Alerts your team on Slack
Shows you a live dashboard of what is happening

That is exactly what I built for my HNG DevSecOps internship. Let me walk you through every part of it in plain English.

The Big Picture

Before we dive into code, here is how all the pieces fit together:

[User's Browser / Attacker Bot]
           │
           │ HTTP Request
           ▼
      ┌─────────┐
      │  Nginx  │  ← logs every request as JSON
      └────┬────┘
           │
           ▼
     ┌───────────┐
     │ Nextcloud │  ← the actual cloud storage app
     └───────────┘

     ┌──────────────────────────┐
     │   Shared Docker Volume   │  ← log file lives here
     └──────────────────────────┘
           │ reads logs
           ▼
     ┌─────────────────────────────────────┐
     │         Detector Daemon             │
     │                                     │
     │  monitor.py  → reads log lines      │
     │  baseline.py → learns normal        │
     │  detector.py → spots anomalies      │
     │  blocker.py  → runs iptables        │
     │  unbanner.py → lifts bans later     │
     │  notifier.py → sends Slack alerts   │
     │  dashboard.py → web UI              │
     └─────────────────────────────────────┘
           │
           ▼
     iptables blocks attacker IPs
     Slack receives alerts
     Dashboard shows live stats

The key insight is that the detector runs alongside Nextcloud — not inside it. It reads Nginx's logs from a shared Docker volume and acts as an automated security guard.

Part 1: Reading Logs in Real Time

The first challenge is reading the Nginx log file as new lines appear — like watching a live news feed.

In bash, you do this with tail -f. In Python, I built the same thing:

import os
import time

def tail_log(log_path):
    """
    Generator that continuously reads new lines from a log file.
    Yields one parsed log entry at a time.
    """
    # Wait until the log file exists
    while not os.path.exists(log_path):
        time.sleep(2)

    with open(log_path, "r") as f:
        # Jump to the END of the file
        # We only want NEW requests, not old history
        f.seek(0, 2)

        while True:
            line = f.readline()
            if line:
                parsed = parse_line(line.strip())
                if parsed:
                    yield parsed  # send one entry to the caller
            else:
                # No new line yet — wait 50ms and try again
                time.sleep(0.05)

A few things worth explaining here:

f.seek(0, 2) — This jumps to the end of the file. Without this, we would process every old log line from before the daemon started, which would give us garbage baseline data.

yield parsed — This makes the function a generator. Instead of returning a whole list of log entries (which would use lots of memory), it sends entries one at a time to the caller. The caller gets each entry the instant it arrives.

time.sleep(0.05) — When there are no new lines, we wait 50 milliseconds before checking again. This is the "tail" behaviour — fast enough to catch requests in real time, not so fast that we burn CPU.

Nginx writes logs in JSON format, so parsing is simple:

import json

def parse_line(line):
    try:
        data = json.loads(line)
    except json.JSONDecodeError:
        return None  # skip malformed lines

    # Valid JSON that is not an object (e.g. a bare number) is not a log entry
    if not isinstance(data, dict):
        return None

    # Make sure all required fields are present
    required = ["source_ip", "timestamp", "method",
                "path", "status", "response_size"]
    if any(field not in data for field in required):
        return None
    return data

Part 2: The Sliding Window

Now that we can read log lines, we need to calculate how many requests per second a given IP is making.

The naive approach is a per-minute counter: count the requests in each fixed one-minute block and divide by 60. The problem is that the number only updates once per minute, so a short burst is either diluted across the whole minute or noticed up to a minute late.

The right approach is a sliding window.

Think of it like this: you are standing on a moving train, looking out a window that shows exactly 60 seconds of track behind you. As the train moves forward, the window always shows the LAST 60 seconds — older track disappears from view automatically.

In code, this uses Python's collections.deque:

from collections import deque, defaultdict
import time

class SlidingWindowDetector:
    def __init__(self, config):
        self.window_seconds = 60  # look at last 60 seconds

        # One deque per IP — each entry is a timestamp
        self.ip_windows = defaultdict(deque)

        # One global deque for all traffic combined
        self.global_window = deque()

    def record(self, ip, status):
        """Called for every incoming request."""
        now = time.time()
        self.ip_windows[ip].append(now)   # add timestamp
        self.global_window.append(now)    # add to global too

    def _evict_old(self, dq, now, window):
        """
        Remove timestamps older than `window` seconds.

        Deques are ordered — oldest on the LEFT, newest on the RIGHT.
        We pop from the left until the oldest entry is within our window.
        This is the eviction logic — what makes it a SLIDING window.
        """
        cutoff = now - window  # anything before this is too old
        while dq and dq[0] < cutoff:
            dq.popleft()  # remove oldest entry

    def get_ip_rate(self, ip):
        """Returns requests per second for this IP."""
        now = time.time()
        dq = self.ip_windows[ip]
        self._evict_old(dq, now, self.window_seconds)
        # Whatever is left = requests in the last 60 seconds
        return len(dq) / self.window_seconds

Here is why this is elegant:

Every request adds one timestamp to the deque
When you want the rate, you throw away everything older than 60 seconds
Count what remains, divide by 60 — that is your rate
The window "slides" because old timestamps fall off automatically

We have two windows:

Per-IP window — detects a single attacker hammering the server
Global window — detects a distributed attack (botnet) where thousands of IPs each send moderate requests, overwhelming the server even though no single IP looks bad
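To see the eviction at work without waiting a real minute, here is a small standalone sketch that takes the current time as a parameter. The function `rate_in_window` is a hypothetical simplification of the class above, not part of the daemon.

```python
from collections import deque

def rate_in_window(timestamps, now, window_seconds=60):
    """Evict timestamps older than the window, then return requests per second."""
    cutoff = now - window_seconds
    while timestamps and timestamps[0] < cutoff:
        timestamps.popleft()
    return len(timestamps) / window_seconds


# 120 requests arrive, one every half second, between t=0 and t=59.5
window = deque(i * 0.5 for i in range(120))
rate_at_60 = rate_in_window(window, now=60)    # every request still in view
rate_at_90 = rate_in_window(window, now=90)    # requests before t=30 evicted
rate_at_200 = rate_in_window(window, now=200)  # window is empty again
```

The rate falls from 2.0 to 1.0 to 0.0 as the window slides past the burst, with no timer or reset logic involved. Passing `now` in explicitly also makes this kind of code much easier to unit test than calling `time.time()` inside the method.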

Part 3: The Baseline — Learning What Normal Looks Like

This is the most important part. To detect anomalies, we first need to know what "normal" looks like.

At 2am, maybe your server averages 1 request per second. At 2pm, maybe it averages 50. The baseline must adapt to these patterns — it cannot be a hardcoded number.

My approach: every second, I count how many requests arrived. I store these per-second counts in a rolling window of 30 minutes (1800 seconds). Every 60 seconds, I compute the mean and standard deviation of these counts.

import math
import time
from collections import deque

class BaselineEngine:
    def __init__(self, config):
        # Store up to 1800 per-second counts (30 min)
        self.global_samples = deque(maxlen=1800)

        # Floor values prevent division by zero at idle
        self.floor_mean = 1.0
        self.floor_stddev = 0.5

        # Start at floor values
        self.effective_mean = self.floor_mean
        self.effective_stddev = self.floor_stddev

    def _recalculate(self):
        """Compute mean and stddev from rolling samples."""
        samples = [count for (_, count) in self.global_samples]

        if len(samples) < 10:
            return  # not enough data yet

        # Mean = average requests per second
        mean = sum(samples) / len(samples)

        # Standard deviation = how much traffic varies
        # sqrt( average of squared differences from the mean )
        variance = sum((x - mean)**2 for x in samples) / len(samples)
        stddev = math.sqrt(variance)

        # Apply floor values — never go below minimum
        self.effective_mean = max(mean, self.floor_mean)
        self.effective_stddev = max(stddev, self.floor_stddev)

Mean tells us: "on average, how many requests per second do we get?"

Standard deviation tells us: "how much does that vary?" A small stddev means traffic is very consistent. A large stddev means traffic is spiky.
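A quick worked example may help. The sketch below applies the same formulas to two made-up sets of per-second counts, one busy and one nearly idle, to show what the floor values do. The sample numbers and the function name `baseline_stats` are invented for illustration.

```python
import math
import statistics

def baseline_stats(samples, floor_mean=1.0, floor_stddev=0.5):
    """Population mean and stddev of the samples, clamped to the floor values."""
    mean = sum(samples) / len(samples)
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    stddev = math.sqrt(variance)
    return max(mean, floor_mean), max(stddev, floor_stddev)


busy = [4, 5, 6, 5, 5, 4, 6, 5, 5, 5]   # per-second request counts
idle = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]

busy_mean, busy_stddev = baseline_stats(busy)
idle_mean, idle_stddev = baseline_stats(idle)
```

For the busy samples the mean is 5 and the standard deviation is about 0.63, so the floors do nothing. For the idle samples the raw values (0.2 and 0.4) are lifted to the floors of 1.0 and 0.5, which stops a single request at 2am from looking like a huge spike. Dividing by `len(samples)` gives the population standard deviation, the same result as `statistics.pstdev`.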

The Spike Guard — Keeping the Baseline Clean

Here is a subtle but critical problem: what happens when an attack occurs?

Without protection, the attacker's 500 req/s gets fed into the baseline. Every recalculation pulls the mean and standard deviation upward, and a sustained attack eventually makes 500 req/s look "normal." The next attack then looks completely fine and goes undetected.

The solution is a spike guard:

def _flush_second(self):
    count = self.current_second_count

    # If this second's count is more than 10x the current mean,
    # it is almost certainly attack traffic — discard it
    if len(self.global_samples) >= 10:
        if count > 10 * self.effective_mean:
            print(f"[Baseline] Spike guard: {count} req/s discarded")
            self.current_second_count = 0
            return  # do NOT save this to baseline

    # Normal traffic — save it
    self.global_samples.append((time.time(), count))
    self.current_second_count = 0

With this in place:

Attack happens → spike is discarded → baseline stays clean ✅
Attack stops → baseline is still accurate ✅
Second attack → detected immediately because baseline is still clean ✅
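The guard can also be sketched as a pure function, which makes its behaviour easy to check in isolation. The name `accept_sample` and its parameters are hypothetical; the 10x multiplier and the 10-sample warm-up mirror the method above.

```python
from collections import deque

def accept_sample(samples, count, effective_mean, multiplier=10, min_samples=10):
    """Append `count` to the baseline unless it looks like attack traffic."""
    if len(samples) >= min_samples and count > multiplier * effective_mean:
        return False  # spike: keep it out of the baseline
    samples.append(count)
    return True


samples = deque([5] * 20, maxlen=1800)
kept_normal = accept_sample(samples, 8, effective_mean=5.0)
kept_attack = accept_sample(samples, 500, effective_mean=5.0)
```

One trade-off to be aware of: a legitimate jump of more than 10x within a single second is discarded too, so the baseline only follows genuine growth that arrives gradually.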

Part 4: Detecting Anomalies

Now we have:

The current request rate (from the sliding window)
The baseline mean and stddev (from the baseline engine)

How do we decide if the rate is "too high"?

Method 1: Z-Score

The z-score measures how many standard deviations above the mean a value is:

z-score = (current_rate - mean) / stddev

If the mean is 5 req/s and stddev is 1, and we see 8.5 req/s:

z-score = (8.5 - 5) / 1 = 3.5

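For normally distributed data, a value more than 3 standard deviations above the mean shows up only about 0.13% of the time. So a z-score above 3.0 is very unlikely to be ordinary traffic. Treat it as a strong signal rather than a 99.87% probability of an attack, since real request rates are only roughly normal.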
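The worked example and the 0.13% figure can both be checked with the standard library. This is a small sketch; the helper names zscore and upper_tail are made up for illustration.

```python
import math

def zscore(rate, mean, stddev):
    """How many standard deviations `rate` sits above `mean`."""
    return (rate - mean) / stddev

def upper_tail(z):
    """P(Z > z) for a standard normal variable, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

z = zscore(8.5, mean=5.0, stddev=1.0)
print(z)                         # 3.5
print(f"{upper_tail(3.0):.4%}")  # 0.1350%
```

The tail probability only holds if the data really is normal, which per-second request counts are not exactly, so the 3.0 cutoff is best read as a practical heuristic.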

def check_ip_anomaly(self, ip, baseline):
    rate = self.get_ip_rate(ip)
    mean = baseline["mean"]
    stddev = baseline["stddev"]

    # Z-score check
    if stddev > 0:
        zscore = (rate - mean) / stddev
    else:
        zscore = 0

    if zscore > 3.0:
        return True, f"z-score {zscore:.2f} > 3.0", rate

    # Rate multiplier check (backup)
    if rate > 5.0 * mean:
        return True, f"rate {rate:.2f} > 5x mean {mean:.2f}", rate

    return False, "", rate

Method 2: Rate Multiplier

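The z-score can mislead when the baseline is noisy: a large stddev shrinks every z-score, so even a big jump can stay under 3.0, and a stddev of 0 forces the z-score to 0. If someone is sending 5x the normal amount of traffic, that is suspicious regardless. This backup check catches those edge cases.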

Error Surge Detection

If an IP is causing lots of 404s (page not found) and 401s (unauthorized), it is likely probing for vulnerabilities. We automatically tighten the thresholds for these IPs:

error_rate = self.get_ip_error_rate(ip)
if error_rate >= 3.0 * baseline["error_mean"]:
    # Suspicious error pattern — use tighter thresholds
    zscore_threshold = 2.0   # was 3.0
    rate_threshold = 3.0     # was 5.0
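Here is one way the tightened thresholds could feed back into the anomaly check. It is a sketch under assumptions: the function names are hypothetical, and the error_mean > 0 guard is an addition, since with an error baseline of exactly 0 the comparison above would be true for every IP.

```python
def pick_thresholds(error_rate, error_mean, surge_factor=3.0):
    """Return (zscore_threshold, rate_multiplier) for one IP."""
    # Guard (assumption): with no error baseline yet, keep the defaults
    if error_mean > 0 and error_rate >= surge_factor * error_mean:
        return 2.0, 3.0  # suspicious error pattern: tighter thresholds
    return 3.0, 5.0      # defaults

def is_anomalous(rate, mean, stddev, zscore_threshold, rate_multiplier):
    """Apply the z-score check and the rate multiplier check."""
    zscore = (rate - mean) / stddev if stddev > 0 else 0.0
    return zscore > zscore_threshold or rate > rate_multiplier * mean

# An IP at 7.5 req/s against a baseline of mean 5, stddev 1 (z-score 2.5)
calm = pick_thresholds(error_rate=0.1, error_mean=0.1)
probing = pick_thresholds(error_rate=0.9, error_mean=0.2)

print(is_anomalous(7.5, 5.0, 1.0, *calm))     # False
print(is_anomalous(7.5, 5.0, 1.0, *probing))  # True
```

The same request rate is tolerated from a well-behaved IP but flagged for one that is also generating an unusual number of errors.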

Part 5: Blocking with iptables

When we detect an anomaly, we block the IP using iptables — Linux's built-in packet filter.

iptables sits at the network level, before any application (Nginx, Nextcloud, Python) ever sees a packet. When you add a DROP rule, the Linux kernel silently discards all packets from that IP. The attacker's requests just time out — they get no response at all.

import subprocess

def ban_ip(self, ip):
    """Add an iptables DROP rule for this IP."""

    # -I INPUT 1 = insert at position 1 (top priority)
    # -s {ip}    = match packets FROM this IP
    # -j DROP    = silently discard the packet
    subprocess.run([
        "iptables", "-I", "INPUT", "1",
        "-s", ip, "-j", "DROP"
    ])

    print(f"[Blocker] BANNED {ip}")

def unban_ip(self, ip):
    """Remove the iptables DROP rule."""

    # -D = delete the matching rule
    subprocess.run([
        "iptables", "-D", "INPUT",
        "-s", ip, "-j", "DROP"
    ])

    print(f"[Blocker] UNBANNED {ip}")

We use -I INPUT 1 (insert at position 1) rather than -A (append) so the DROP rule has the highest priority — it is checked before any ACCEPT rules.
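Because the IP comes straight from a log line, it is worth validating before it reaches a privileged command. The sketch below is one possible hardening and not the project's code: build_ban_command is a hypothetical helper, and the switch to ip6tables for IPv6 addresses is an assumption about how you would want to handle them.

```python
import ipaddress
import subprocess

def build_ban_command(ip):
    """Return the argv list for banning `ip`, or raise ValueError if it is not an IP."""
    # ip_address() rejects anything that is not a literal IPv4/IPv6 address
    addr = ipaddress.ip_address(ip)
    # IPv6 rules are managed by the separate ip6tables command
    binary = "ip6tables" if addr.version == 6 else "iptables"
    return [binary, "-I", "INPUT", "1", "-s", str(addr), "-j", "DROP"]

def ban_ip(ip):
    """Run the ban command (needs root); raises CalledProcessError if it fails."""
    subprocess.run(build_ban_command(ip), check=True)

print(build_ban_command("203.0.113.7"))
# ['iptables', '-I', 'INPUT', '1', '-s', '203.0.113.7', '-j', 'DROP']
```

Passing check=True means a failed iptables call raises an exception instead of printing BANNED for a rule that was never added.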

The Auto-Unban Backoff Schedule

We do not ban forever immediately. The ban schedule escalates with repeated offenses:

Offense	Ban Duration
1st	10 minutes
2nd	30 minutes
3rd	2 hours
4th+	Permanent
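The schedule maps naturally onto a small lookup. This is a sketch: BAN_SCHEDULE and ban_duration are hypothetical names, and durations are in seconds with -1 standing for a permanent ban, matching the check in the unban loop.

```python
BAN_SCHEDULE = [10 * 60, 30 * 60, 2 * 60 * 60]  # 1st, 2nd, 3rd offense (seconds)
PERMANENT = -1

def ban_duration(offense_count):
    """Return the ban length in seconds for the Nth offense (1-based)."""
    if offense_count < 1:
        raise ValueError("offense_count starts at 1")
    if offense_count <= len(BAN_SCHEDULE):
        return BAN_SCHEDULE[offense_count - 1]
    return PERMANENT  # 4th offense and beyond

for n in range(1, 6):
    print(n, ban_duration(n))
```

Keeping the schedule in a list means adding a step, such as a 24-hour ban before the permanent one, is a one-line change.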

A background thread checks every 30 seconds and lifts expired bans:

def _check_bans(self):
    now = time.time()
    for ip, ban_info in self.blocker.get_active_bans().items():
        duration = ban_info["duration"]
        banned_at = ban_info["banned_at"]

        if duration == -1:
            continue  # permanent ban, never auto-unban

        if now >= banned_at + duration:
            self.blocker.unban_ip(ip)
            self.detector.banned_ips.discard(ip)
            self.notifier.send_unban(ip=ip, ...)
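The expiry test itself can be pulled out into a pure function, which makes it easy to check without touching iptables. This is a sketch: expired_bans is a hypothetical helper, and the ban dictionary layout is assumed from the loop above.

```python
def expired_bans(active_bans, now):
    """Return the IPs whose temporary ban has run out at time `now` (seconds)."""
    return [
        ip for ip, info in active_bans.items()
        # -1 marks a permanent ban, which never expires
        if info["duration"] != -1 and now >= info["banned_at"] + info["duration"]
    ]

bans = {
    "203.0.113.7": {"banned_at": 1000.0, "duration": 600},   # expires at 1600
    "198.51.100.9": {"banned_at": 1000.0, "duration": -1},   # permanent
    "192.0.2.44": {"banned_at": 1500.0, "duration": 600},    # expires at 2100
}

print(expired_bans(bans, now=1700.0))  # ['203.0.113.7']
```

Collecting the expired IPs into a list first also means the unban step can safely modify the ban dictionary afterwards, since nothing is iterating over it anymore.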

Part 6: Slack Alerts

Every ban, unban, and global anomaly sends a message to Slack via an Incoming Webhook — a special URL that posts messages to a channel when you send an HTTP POST request to it.

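import time

import requests

def send_ban(self, ip, reason, rate, baseline, duration):
    # duration is in seconds; -1 marks a permanent ban
    duration_str = "permanent" if duration == -1 else f"{duration // 60} minutes"
    # Timestamp format is an illustrative choice
    timestamp = time.strftime("%Y-%m-%d %H:%M:%S UTC", time.gmtime())

    message = (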
        f"🚨 *IP BANNED* — `{ip}`\n"
        f"*Condition:* {reason}\n"
        f"*Current Rate:* {rate:.2f} req/s\n"
        f"*Baseline Mean:* {baseline['mean']:.2f} req/s\n"
        f"*Ban Duration:* {duration_str}\n"
        f"*Time:* {timestamp}"
    )

    requests.post(
        self.webhook_url,
        json={"text": message},
        timeout=10
    )

The 10-second timeout is important: without it, requests.post can wait indefinitely on a slow Slack server and stall our detection loop. With it, the worst case is a 10-second pause per alert.
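If even a bounded pause is too much, one option is to hand alerts to a worker thread so the detection loop never waits on the network at all. The sketch below is not part of the project: AsyncNotifier is a hypothetical class, and the send argument stands in for whatever function performs the real requests.post call.

```python
import queue
import threading

class AsyncNotifier:
    """Hands messages to a worker thread so the caller never waits on the network."""

    def __init__(self, send):
        self._send = send  # e.g. a function wrapping requests.post(...)
        self._queue = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def notify(self, message):
        self._queue.put(message)  # returns immediately

    def _run(self):
        while True:
            message = self._queue.get()
            try:
                self._send(message)
            except Exception as exc:  # a failed alert must not kill the worker
                print(f"[Notifier] send failed: {exc}")
            finally:
                self._queue.task_done()

    def flush(self):
        self._queue.join()  # block until every queued message has been handled

sent = []
notifier = AsyncNotifier(sent.append)  # list.append stands in for the Slack call
notifier.notify("IP BANNED 203.0.113.7")
notifier.flush()
print(sent)  # ['IP BANNED 203.0.113.7']
```

The worker is a daemon thread, so it does not keep the process alive on shutdown; calling flush() before exiting gives queued alerts a chance to go out.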

Part 7: The Live Dashboard

The dashboard is a Flask web app that serves a single HTML page. The page polls a /api/metrics endpoint every 3 seconds and updates the display without reloading:

@app.route("/api/metrics")
def metrics():
    return jsonify({
        "global_rate": detector.get_global_rate(),
        "top_ips": detector.get_top_ips(10),
        "banned_ips": blocker.get_active_bans(),
        "cpu_percent": psutil.cpu_percent(),
        "mem_percent": psutil.virtual_memory().percent,
        "baseline_mean": baseline.effective_mean,
        "baseline_stddev": baseline.effective_stddev,
        "uptime": calculate_uptime(),
    })

On the JavaScript side, the polling looks like this:

async function refresh() {
    const res = await fetch('/api/metrics');
    const d = await res.json();

    document.getElementById('global-rate').textContent = d.global_rate;
    document.getElementById('baseline-mean').textContent = d.baseline_mean;
    // ... update all other fields
}

setInterval(refresh, 3000);  // run every 3 seconds

Part 8: Putting It All Together

The main.py orchestrator wires everything together in one loop:

for log_entry in tail_log(config["log_path"]):
    ip = log_entry["source_ip"]
    status = log_entry["status"]

    # 1. Record in sliding window for rate detection
    detector.record(ip, status)

    # 2. Feed into baseline ONLY if IP is not banned
    #    (prevents attack traffic from corrupting baseline)
    if not blocker.is_banned(ip):
        baseline.record_request(ip, status)

    # 3. Maybe recalculate baseline every 60 seconds
    if baseline.maybe_recalculate():
        audit_logger.log_baseline_recalc(...)

    # 4. Get current baseline
    b = baseline.get_baseline()

    # 5. Check if this IP is anomalous
    is_anomaly, reason, rate = detector.check_ip_anomaly(ip, b)
    if is_anomaly and not blocker.is_banned(ip):
        duration = blocker.ban_ip(ip)
        notifier.send_ban(ip, reason, rate, b, duration)
        audit_logger.log_ban(ip, reason, rate, b, duration)

    # 6. Check if global traffic is anomalous
    global_anomaly, reason, rate = detector.check_global_anomaly(b)
    if global_anomaly:
        notifier.send_global_alert(reason, rate, b)

Every single log line goes through this sequence within milliseconds of Nginx writing it.

The Results

When an attack hits the server, here is what happens end to end:

Attacker sends 150 concurrent requests per second
Nginx logs each request as JSON to the shared volume
monitor.py detects new lines within 50ms
Sliding window calculates rate: 150 req/s
Z-score: (150 - 1.5) / 0.8 ≈ 185.6, massively above threshold
iptables -I INPUT 1 -s {attacker_ip} -j DROP fires
Slack alert sent within seconds
Audit log entry written
After 10 minutes, auto-unban fires
Slack unban alert sent
If attacker returns — detected again immediately because baseline stayed clean

Key Lessons Learned

1. The baseline is everything.
If your baseline gets corrupted by attack traffic, your detector becomes blind. The spike guard is not optional — it is the difference between a system that works once and one that works reliably.

2. Two windows catch two attack types.
Per-IP windows catch single aggressive attackers. The global window catches distributed botnets. You need both.

3. Z-score beats fixed thresholds.
A fixed threshold of "flag if rate > 10 req/s" would raise false positives during busy periods, when normal traffic already exceeds 10 req/s, and miss attacks during quiet periods, when 8 req/s is already far above normal. Z-score adapts to whatever the current normal is.

4. iptables operates at the kernel level.
Blocking at the application level (in Nginx or Python) means the request is still accepted and parsed by that application before it is rejected. iptables drops the packet in the kernel before any application code runs, which is much cheaper.

5. Auto-unban is necessary.
Without it, you accumulate false positives forever. Legitimate users behind shared IPs (corporate NAT, university networks) would be permanently blocked.
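Lesson 3 is easy to demonstrate with two made-up scenarios. This is a sketch with invented numbers; the function names are hypothetical and the thresholds are the ones discussed earlier in this post.

```python
def fixed_flag(rate, threshold=10.0):
    """Flag anything above a hard-coded rate."""
    return rate > threshold

def zscore_flag(rate, mean, stddev, threshold=3.0):
    """Flag anything more than `threshold` standard deviations above the mean."""
    return (rate - mean) / stddev > threshold

# Quiet period: baseline mean 1.5 req/s, stddev 0.8; an IP sends 8 req/s
quiet = (fixed_flag(8.0), zscore_flag(8.0, 1.5, 0.8))

# Busy period: baseline mean 12 req/s, stddev 2; an IP sends 14 req/s
busy = (fixed_flag(14.0), zscore_flag(14.0, 12.0, 2.0))

print(quiet)  # (False, True): fixed threshold misses the attack
print(busy)   # (True, False): fixed threshold raises a false alarm
```

In the quiet case, 8 req/s is more than 8 standard deviations above normal, yet it slips under the fixed limit. In the busy case, 14 req/s is only 1 standard deviation above normal, yet the fixed limit flags it.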

Tech Stack

Python 3.11 — main language
Docker + Docker Compose — containerization
Nginx — reverse proxy and JSON logging
Nextcloud — the cloud storage application
Flask + Waitress — dashboard web server
iptables — IP blocking at kernel level
Slack Webhooks — alerting
psutil — system metrics

Source Code

The full source code is available at:
https://github.com/devops-timi/anomaly-detection-engine

The repository includes:

All Python source files with detailed comments
Nginx configuration
Docker Compose setup
Architecture diagram
Full README with setup instructions

Conclusion

Building this taught me that real security tooling is not about fancy AI or expensive software. At its core, it is about:

Watching what is happening (log tailing)
Learning what normal looks like (rolling baseline)
Spotting deviations quickly (z-score detection)
Responding automatically (iptables + alerts)

The hardest part was not the detection logic — it was making sure the baseline stayed honest. Once the spike guard was in place, everything else clicked into place.

If you are learning DevSecOps or cloud infrastructure, I highly recommend trying to build something like this from scratch. You will learn more about how the internet works in one project than in months of reading.
