<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: DevOps Journey</title>
    <description>The latest articles on DEV Community by DevOps Journey (@devops_journey_4b18fb2ab9).</description>
    <link>https://dev.to/devops_journey_4b18fb2ab9</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3898954%2Fb53109ad-8d26-4f7e-8993-9ec5e3b04517.png</url>
      <title>DEV Community: DevOps Journey</title>
      <link>https://dev.to/devops_journey_4b18fb2ab9</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/devops_journey_4b18fb2ab9"/>
    <language>en</language>
    <item>
      <title>SwiftDeploy: How I Built a CLI That Writes Its Own Infrastructure, Gates Deployments with OPA, and Audits Everything</title>
      <dc:creator>DevOps Journey</dc:creator>
      <pubDate>Wed, 06 May 2026 20:52:01 +0000</pubDate>
      <link>https://dev.to/devops_journey_4b18fb2ab9/swiftdeploy-how-i-built-a-cli-that-writes-its-own-infrastructure-gates-deployments-with-opa-and-1fdm</link>
      <guid>https://dev.to/devops_journey_4b18fb2ab9/swiftdeploy-how-i-built-a-cli-that-writes-its-own-infrastructure-gates-deployments-with-opa-and-1fdm</guid>
      <description>&lt;p&gt;HNG Internship 14—DevOps Track, Stage 4A &amp;amp; 4B&lt;br&gt;
A deep dive into building a declarative deployment tool from scratch—no Terraform, no Kubernetes, just bash, Python, Docker, and Open Policy Agent.&lt;/p&gt;

&lt;p&gt;Introduction&lt;br&gt;
Most DevOps tasks ask you to configure infrastructure manually. Stage 4 of the HNG DevOps track asked something harder: build the tool that does it for you.&lt;br&gt;
The result is swiftdeploy — a CLI tool that reads a single manifest.yaml file and automatically generates all configuration files, deploys a containerised stack, gates every deployment through policy checks, streams live metrics, and produces audit reports. Nothing is handwritten except the manifest.&lt;br&gt;
This post covers the full journey: the architecture decisions, the OPA policy design, what happened when chaos was injected, and every painful lesson learned along the way.&lt;/p&gt;

&lt;p&gt;Part 1 — The Design: A Tool That Writes Its Own Config&lt;br&gt;
The Core Principle: Declarative Infrastructure&lt;br&gt;
The fundamental idea behind tools like Terraform and Kubernetes is declarative infrastructure — you describe what you want, not how to build it. You write a spec, the tool figures out the implementation.&lt;br&gt;
swiftdeploy applies this same principle at a smaller scale. The entire deployment is described in one file:&lt;br&gt;
services:&lt;br&gt;
  image: travispocr-swiftdeploy:latest&lt;br&gt;
  port: 3000&lt;br&gt;
  mode: stable&lt;br&gt;
  version: "1.0.0"&lt;/p&gt;

&lt;p&gt;nginx:&lt;br&gt;
  image: nginx:latest&lt;br&gt;
  port: 8080&lt;br&gt;
  proxy_timeout: 30&lt;/p&gt;

&lt;p&gt;opa:&lt;br&gt;
  image: openpolicyagent/opa:latest&lt;br&gt;
  port: 8181&lt;/p&gt;

&lt;p&gt;network:&lt;br&gt;
  name: swiftdeploy-net&lt;br&gt;
  driver_type: bridge&lt;br&gt;
This is the only file you ever edit. Everything else — nginx.conf, docker-compose.yml — is generated from it.&lt;br&gt;
How init Works&lt;br&gt;
swiftdeploy init reads the manifest using a pure Python YAML parser embedded in the bash script, extracts all values, then uses sed to perform token substitution on two template files:&lt;br&gt;
templates/nginx.conf.tmpl       →  nginx.conf&lt;br&gt;
templates/docker-compose.yml.tmpl  →  docker-compose.yml&lt;br&gt;
Templates use {{TOKEN}} placeholders:&lt;br&gt;
# In template&lt;br&gt;
listen {{NGINX_PORT}};&lt;br&gt;
server app:{{SERVICE_PORT}};&lt;/p&gt;

&lt;h1&gt;
  
  
  After init
&lt;/h1&gt;

&lt;p&gt;listen 8080;&lt;br&gt;
server app:3000;&lt;br&gt;
The grader's test for this is brutal: delete the generated files and re-run init. If they don't regenerate identically, the stack breaks. This forced clean separation between templates (committed to git) and generated files (gitignored).&lt;br&gt;
The Architecture&lt;br&gt;
manifest.yaml&lt;br&gt;
     │&lt;br&gt;
     ▼&lt;br&gt;
swiftdeploy init&lt;br&gt;
     │&lt;br&gt;
     ├──► nginx.conf          (from templates/nginx.conf.tmpl)&lt;br&gt;
     └──► docker-compose.yml  (from templates/docker-compose.yml.tmpl)&lt;br&gt;
                │&lt;br&gt;
                ▼&lt;br&gt;
        swiftdeploy deploy&lt;br&gt;
                │&lt;br&gt;
                ├──► swiftdeploy-app    (Flask API)&lt;br&gt;
                ├──► swiftdeploy-nginx  (Reverse proxy)&lt;br&gt;
                └──► swiftdeploy-opa    (Policy engine)&lt;br&gt;
All traffic routes through Nginx. The app port (3000) is never exposed to the host — only port 8080. OPA runs on an isolated internal network and is only reachable by the CLI through the app container.&lt;/p&gt;
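&lt;p&gt;To make the init step concrete, here is a rough sketch of the parse-and-substitute idea in Python. It is illustrative only: the real swiftdeploy embeds a small Python YAML reader and drives the substitution with sed from bash, and the helper names below are assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch of init: read a flat manifest and fill {{TOKEN}} placeholders.
def read_manifest(path="manifest.yaml"):
    """Tiny flat 'section:' / '  key: value' reader, enough for the manifest above."""
    values, section = {}, None
    for raw in open(path):
        line = raw.split("#", 1)[0].rstrip()
        if not line:
            continue
        if not raw.startswith(" "):            # top-level section, e.g. 'nginx:'
            section = line.rstrip(":")
        else:                                   # indented 'key: value'
            key, _, val = line.strip().partition(":")
            values[f"{section}.{key}"] = val.strip().strip('"')
    return values

def render(template_path, out_path, tokens):
    text = open(template_path).read()
    for token, value in tokens.items():
        text = text.replace("{{" + token + "}}", value)
    open(out_path, "w").write(text)

if __name__ == "__main__":
    manifest = read_manifest()
    render("templates/nginx.conf.tmpl", "nginx.conf", {
        "NGINX_PORT": manifest.get("nginx.port", "8080"),
        "SERVICE_PORT": manifest.get("services.port", "3000"),
    })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;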

&lt;p&gt;Part 2 — The API Service&lt;br&gt;
The API is a Flask application running in two modes controlled by the MODE environment variable.&lt;br&gt;
Stable vs Canary&lt;br&gt;
Same Docker image, different runtime behaviour:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Stable&lt;/th&gt;
&lt;th&gt;Canary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GET /&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Welcome message&lt;/td&gt;
&lt;td&gt;Welcome message&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GET /healthz&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Status + uptime&lt;/td&gt;
&lt;td&gt;Status + uptime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GET /metrics&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prometheus metrics&lt;/td&gt;
&lt;td&gt;Prometheus metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;POST /chaos&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;403 Forbidden&lt;/td&gt;
&lt;td&gt;Active&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;X-Mode&lt;/code&gt; header&lt;/td&gt;
&lt;td&gt;Not present&lt;/td&gt;
&lt;td&gt;&lt;code&gt;X-Mode: canary&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;
Mode switching is done by swiftdeploy promote:&lt;br&gt;
./swiftdeploy promote canary   # stable → canary&lt;br&gt;
./swiftdeploy promote stable   # canary → stable&lt;br&gt;
Promote updates manifest.yaml in-place, regenerates docker-compose.yml with the new MODE env var, and restarts only the app container — Nginx stays up the entire time, so there's zero downtime.&lt;br&gt;
The /metrics Endpoint&lt;br&gt;
Rather than pulling in the prometheus_client library, metrics are tracked in-memory and rendered manually in Prometheus text format. This keeps the image small and avoids dependencies.&lt;br&gt;
Three categories of metrics are tracked:&lt;br&gt;
Counters:&lt;br&gt;
http_requests_total{method="GET",path="/healthz",status_code="200"} 42&lt;br&gt;
Histograms (with standard buckets from 5ms to 10s):&lt;br&gt;
http_request_duration_seconds_bucket{method="GET",path="/",le="0.005"} 38&lt;br&gt;
http_request_duration_seconds_sum{method="GET",path="/"} 0.187&lt;br&gt;
http_request_duration_seconds_count{method="GET",path="/"} 42&lt;br&gt;
Gauges:&lt;br&gt;
app_uptime_seconds 3600.123&lt;br&gt;
app_mode 0&lt;br&gt;
chaos_active 0&lt;br&gt;
A before_request hook records the start time, and an after_request hook calculates duration and updates all counters. The /metrics endpoint itself is excluded from tracking to avoid self-referential noise.&lt;br&gt;
The Chaos Endpoint&lt;br&gt;
In canary mode, POST /chaos accepts three commands:&lt;br&gt;
# Slow down responses&lt;br&gt;
curl -X POST &lt;a href="http://localhost:8080/chaos" rel="noopener noreferrer"&gt;http://localhost:8080/chaos&lt;/a&gt; \&lt;br&gt;
  -d '{"mode":"slow","duration":3}'&lt;/p&gt;

&lt;h1&gt;
  
  
  Inject errors at 50% rate
&lt;/h1&gt;

&lt;p&gt;curl -X POST &lt;a href="http://localhost:8080/chaos" rel="noopener noreferrer"&gt;http://localhost:8080/chaos&lt;/a&gt; \&lt;br&gt;
  -d '{"mode":"error","rate":0.5}'&lt;/p&gt;

&lt;h1&gt;
  
  
  Recover
&lt;/h1&gt;

&lt;p&gt;curl -X POST &lt;a href="http://localhost:8080/chaos" rel="noopener noreferrer"&gt;http://localhost:8080/chaos&lt;/a&gt; \&lt;br&gt;
  -d '{"mode":"recover"}'&lt;br&gt;
Chaos state is stored in a thread-safe dictionary and applied in before_request. The chaos_active gauge in /metrics reflects the current state (0=none, 1=slow, 2=error), making it visible in the status dashboard.&lt;/p&gt;
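&lt;p&gt;To make the hook wiring concrete, here is a trimmed Flask sketch of how the before/after request hooks, the in-memory counters, and the thread-safe chaos state can fit together. It is illustrative only, not the exact application code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch of the hooks described above; not the exact swiftdeploy app code.
import random, threading, time
from flask import Flask, g, request

app = Flask(__name__)
requests_total = {}                                   # counts keyed by (method, path, status)
chaos = {"mode": None, "rate": 0.0, "duration": 0}    # mutated by POST /chaos (not shown)
chaos_lock = threading.Lock()

@app.before_request
def before():
    g.start = time.time()                             # for request duration metrics
    with chaos_lock:                                   # chaos is applied before the view runs
        if chaos["mode"] == "slow":
            time.sleep(chaos["duration"])
        elif chaos["mode"] == "error" and random.random() &amp;lt; chaos["rate"]:
            return ("injected failure", 500)

@app.after_request
def after(response):
    if request.path != "/metrics":                     # keep /metrics out of its own stats
        key = (request.method, request.path, response.status_code)
        requests_total[key] = requests_total.get(key, 0) + 1
    return response

@app.route("/metrics")
def metrics():
    lines = [
        f'http_requests_total{{method="{m}",path="{p}",status_code="{s}"}} {n}'
        for (m, p, s), n in requests_total.items()
    ]
    return "\n".join(lines) + "\n", 200, {"Content-Type": "text/plain"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;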

&lt;p&gt;Part 3 — The Guardrails: OPA Policy Enforcement&lt;br&gt;
Why the CLI Never Makes Decisions&lt;br&gt;
The task requirement is strict: all allow/deny logic must live in OPA. The CLI only sends data and acts on the response. This separation matters because:&lt;/p&gt;

&lt;p&gt;Policies can be updated without touching the CLI&lt;br&gt;
Policy logic is auditable and testable independently&lt;br&gt;
Different teams can own different policy domains&lt;/p&gt;

&lt;p&gt;Policy Structure&lt;br&gt;
Two independent policy domains, each owning exactly one question:&lt;br&gt;
policies/infrastructure.rego — "Is the host healthy enough to deploy?"&lt;br&gt;
policies/canary.rego — "Is the canary safe to promote to stable?"&lt;br&gt;
Each domain is queried independently. A deploy could be blocked by infrastructure policy while canary policy passes — they never interfere with each other.&lt;br&gt;
The Rego Policies&lt;br&gt;
Infrastructure policy:&lt;br&gt;
package swiftdeploy.infrastructure&lt;/p&gt;

&lt;p&gt;import rego.v1&lt;/p&gt;

&lt;p&gt;default allow := false&lt;/p&gt;

&lt;p&gt;allow if {&lt;br&gt;
    count(violations) == 0&lt;br&gt;
}&lt;/p&gt;

&lt;p&gt;violations contains msg if {&lt;br&gt;
    input.disk_free_gb &amp;lt; data.thresholds.min_disk_free_gb&lt;br&gt;
    msg := sprintf(&lt;br&gt;
        "Disk free %.1fGB is below minimum %.1fGB",&lt;br&gt;
        [input.disk_free_gb, data.thresholds.min_disk_free_gb]&lt;br&gt;
    )&lt;br&gt;
}&lt;/p&gt;

&lt;p&gt;violations contains msg if {&lt;br&gt;
    input.cpu_load &amp;gt; data.thresholds.max_cpu_load&lt;br&gt;
    msg := sprintf(&lt;br&gt;
        "CPU load %.2f exceeds maximum %.2f",&lt;br&gt;
        [input.cpu_load, data.thresholds.max_cpu_load]&lt;br&gt;
    )&lt;br&gt;
}&lt;br&gt;
Canary safety policy:&lt;br&gt;
package swiftdeploy.canary&lt;/p&gt;

&lt;p&gt;import rego.v1&lt;/p&gt;

&lt;p&gt;default allow := false&lt;/p&gt;

&lt;p&gt;allow if {&lt;br&gt;
    count(violations) == 0&lt;br&gt;
}&lt;/p&gt;

&lt;p&gt;violations contains msg if {&lt;br&gt;
    input.error_rate_percent &amp;gt; data.thresholds.max_error_rate_percent&lt;br&gt;
    msg := sprintf(&lt;br&gt;
        "Error rate %.2f%% exceeds maximum %.2f%%",&lt;br&gt;
        [input.error_rate_percent, data.thresholds.max_error_rate_percent]&lt;br&gt;
    )&lt;br&gt;
}&lt;/p&gt;

&lt;p&gt;violations contains msg if {&lt;br&gt;
    input.chaos_active != 0&lt;br&gt;
    msg := sprintf(&lt;br&gt;
        "Chaos mode is active (%d) — recover before promoting",&lt;br&gt;
        [input.chaos_active]&lt;br&gt;
    )&lt;br&gt;
}&lt;br&gt;
Why Thresholds Live in data.json&lt;br&gt;
Notice the policies reference data.thresholds.* — never hardcoded numbers. All threshold values live in policies/data.json:&lt;br&gt;
{&lt;br&gt;
  "thresholds": {&lt;br&gt;
    "min_disk_free_gb": 10.0,&lt;br&gt;
    "max_cpu_load": 2.0,&lt;br&gt;
    "min_mem_free_percent": 10.0,&lt;br&gt;
    "max_error_rate_percent": 1.0,&lt;br&gt;
    "max_p99_latency_ms": 500.0&lt;br&gt;
  }&lt;br&gt;
}&lt;br&gt;
This means you can tighten or relax thresholds without touching Rego. The policy logic and the policy configuration are separate concerns.&lt;br&gt;
Every Decision Carries Reasoning&lt;br&gt;
OPA never returns a bare boolean. Every decision includes the violations that caused it:&lt;br&gt;
{&lt;br&gt;
  "result": {&lt;br&gt;
    "allow": false,&lt;br&gt;
    "violations": [&lt;br&gt;
      "Disk free 4.2GB is below minimum 10.0GB"&lt;br&gt;
    ],&lt;br&gt;
    "domain": "infrastructure",&lt;br&gt;
    "checked_at": "2026-05-06T08:35:27Z"&lt;br&gt;
  }&lt;br&gt;
}&lt;br&gt;
The CLI surfaces this clearly:&lt;br&gt;
── Policy check: infrastructure ──&lt;br&gt;
  [FAIL] infrastructure: policy violated&lt;br&gt;
  • Disk free 4.2GB is below minimum 10.0GB&lt;/p&gt;

&lt;p&gt;╔══ POLICY VIOLATION ══════════════════════════════╗&lt;br&gt;
║ Deployment blocked by infrastructure policy.&lt;br&gt;
╚══════════════════════════════════════════════════╝&lt;br&gt;
OPA Isolation&lt;br&gt;
OPA runs on a separate opa-internal Docker network. It is not connected to the Nginx network, so it's completely unreachable from the public-facing port 8080. The CLI reaches OPA by proxying requests through the app container using docker exec:&lt;br&gt;
docker exec swiftdeploy-app curl -sf -X POST \&lt;br&gt;
  "&lt;a href="http://swiftdeploy-opa:8181/v1/data/swiftdeploy/infrastructure/decision" rel="noopener noreferrer"&gt;http://swiftdeploy-opa:8181/v1/data/swiftdeploy/infrastructure/decision&lt;/a&gt;" \&lt;br&gt;
  -H "Content-Type: application/json" \&lt;br&gt;
  -d "$input_json"&lt;br&gt;
This means OPA is reachable by internal services but completely isolated from external traffic.&lt;/p&gt;
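&lt;p&gt;For illustration, the same round trip looks roughly like this in Python; the real CLI does it from bash via the docker exec call above, and the helper below is an assumption:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the decision round trip. Must run where swiftdeploy-opa is resolvable,
# i.e. inside the compose network (the real CLI shells out to curl in the app container).
import json, urllib.request

OPA_URL = "http://swiftdeploy-opa:8181/v1/data/swiftdeploy/infrastructure/decision"

def check_infrastructure(disk_free_gb, cpu_load):
    payload = json.dumps({"input": {"disk_free_gb": disk_free_gb, "cpu_load": cpu_load}}).encode()
    req = urllib.request.Request(OPA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    decision = json.load(urllib.request.urlopen(req))["result"]
    if not decision["allow"]:
        for violation in decision.get("violations", []):
            print("  • " + violation)
        raise SystemExit("Deployment blocked by infrastructure policy.")
    return decision
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;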

&lt;p&gt;Part 4 — The Chaos Experiment&lt;br&gt;
Injecting Chaos&lt;br&gt;
With the stack in canary mode:&lt;br&gt;
# Inject 50% error rate&lt;br&gt;
curl -X POST &lt;a href="http://localhost:8080/chaos" rel="noopener noreferrer"&gt;http://localhost:8080/chaos&lt;/a&gt; \&lt;br&gt;
  -H "Content-Type: application/json" \&lt;br&gt;
  -d '{"mode":"error","rate":0.5}'&lt;br&gt;
The status dashboard immediately captured the degradation:&lt;br&gt;
SwiftDeploy Status Dashboard  2026-05-06T20:31:00Z&lt;br&gt;
─────────────────────────────────────────────────&lt;br&gt;
  Throughput   2.4 req/s   Total 89&lt;br&gt;
  Error rate   48.31%&lt;br&gt;
  P99 latency  5ms   Avg 0ms&lt;br&gt;
  Chaos        error&lt;/p&gt;

&lt;p&gt;Policy Compliance&lt;br&gt;
    [✓]  infrastructure: passing&lt;br&gt;
    [✗]  canary: FAILING&lt;br&gt;
─────────────────────────────────────────────────&lt;br&gt;
Attempting Promotion Under Chaos&lt;br&gt;
With chaos active and error rate above 1%, attempting promote stable is blocked:&lt;br&gt;
./swiftdeploy promote stable&lt;br&gt;
── Policy check: canary ──&lt;br&gt;
  [FAIL] canary: policy violated&lt;br&gt;
  • Error rate 48.31% exceeds maximum 1.00%&lt;br&gt;
  • Chaos mode is active (2) — recover before promoting&lt;/p&gt;

&lt;p&gt;╔══ POLICY VIOLATION ══════════════════════════════╗&lt;br&gt;
║ Promotion to stable blocked by canary safety policy.&lt;br&gt;
╚══════════════════════════════════════════════════╝&lt;br&gt;
This is exactly the intended behaviour — the policy engine prevents a broken canary from being promoted to production.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsdxodv5u5fho82375fbg.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsdxodv5u5fho82375fbg.jpg" alt=" " width="800" height="542"&gt;&lt;/a&gt;&lt;br&gt;
Recovery&lt;br&gt;
curl -X POST &lt;a href="http://localhost:8080/chaos" rel="noopener noreferrer"&gt;http://localhost:8080/chaos&lt;/a&gt; -d '{"mode":"recover"}'&lt;br&gt;
./swiftdeploy promote stable&lt;br&gt;
── Policy check: canary ──&lt;br&gt;
  [PASS] canary: all checks passed&lt;br&gt;
[✓] Promotion complete! Mode is now: stable&lt;/p&gt;

&lt;p&gt;Part 5 — Observability: Status Dashboard and Audit Trail&lt;br&gt;
The Status Dashboard&lt;br&gt;
swiftdeploy status runs a live-refreshing terminal dashboard that scrapes /metrics every 5 seconds:&lt;br&gt;
SwiftDeploy Status Dashboard  2026-05-06T20:29:03Z&lt;br&gt;
───────────────────────────────────────────────────────&lt;br&gt;
  Throughput   2.2 req/s   Total 44&lt;br&gt;
  Error rate   0.00%&lt;br&gt;
  P99 latency  10ms   Avg 0ms&lt;br&gt;
  Chaos        none&lt;/p&gt;

&lt;p&gt;Policy Compliance&lt;br&gt;
    [✓]  infrastructure: passing&lt;br&gt;
    [✓]  canary: passing&lt;/p&gt;

&lt;p&gt;History log  history.jsonl&lt;br&gt;
───────────────────────────────────────────────────────&lt;br&gt;
  Refreshing every 5s — Ctrl+C to exit&lt;br&gt;
Every scrape is appended to history.jsonl as a structured JSON line:&lt;br&gt;
json{"timestamp":"2026-05-06T20:29:03Z","event":"status_scrape","data":{"rps":2.2,"total_requests":44,"error_rate_percent":0.0,"p99_latency_ms":10.0,"chaos_active":0}}&lt;br&gt;
The Audit Report&lt;br&gt;
swiftdeploy audit parses history.jsonl and generates audit_report.md:&lt;br&gt;
## Timeline&lt;br&gt;
| Timestamp | Event | Detail |&lt;br&gt;
|-----------|-------|--------|&lt;br&gt;
| &lt;code&gt;2026-05-06T20:15:30Z&lt;/code&gt; | &lt;code&gt;policy_pass&lt;/code&gt; | domain=infrastructure |&lt;br&gt;
| &lt;code&gt;2026-05-06T20:15:31Z&lt;/code&gt; | &lt;code&gt;deploy_success&lt;/code&gt; | mode=stable |&lt;br&gt;
| &lt;code&gt;2026-05-06T20:16:49Z&lt;/code&gt; | &lt;code&gt;mode_change&lt;/code&gt; | from=stable, to=canary |&lt;/p&gt;

&lt;h2&gt;
  
  
  Policy Violations
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Timestamp&lt;/th&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Data&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;2026-05-06T08:35:31Z&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;deploy_blocked&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;disk_free_gb=58.08&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Metrics Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total scrapes&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg req/s&lt;/td&gt;
&lt;td&gt;6.74&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg error rate&lt;/td&gt;
&lt;td&gt;0.0000%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max P99 latency&lt;/td&gt;
&lt;td&gt;10ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The report renders as clean GitHub Flavored Markdown and provides a complete audit trail of every deployment, mode change, policy decision, and chaos event.&lt;/p&gt;
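&lt;p&gt;The parsing step itself is straightforward. Here is a rough sketch that uses the event fields shown above; the real audit command may differ in detail:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: turn history.jsonl into the Markdown timeline shown above.
import json

def build_timeline(history_path="history.jsonl"):
    rows = ["| Timestamp | Event | Detail |", "|-----------|-------|--------|"]
    with open(history_path) as fh:
        for line in fh:
            entry = json.loads(line)
            detail = ", ".join(f"{k}={v}" for k, v in entry.get("data", {}).items())
            rows.append(f"| `{entry['timestamp']}` | `{entry['event']}` | {detail} |")
    return "\n".join(rows)

with open("audit_report.md", "w") as out:
    out.write("## Timeline\n\n" + build_timeline() + "\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;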

&lt;p&gt;Part 6 — Lessons Learned&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Windows Git Bash Mangles Unix Paths
Running docker exec container /opa eval from Git Bash on Windows translates /opa to C:/Program Files/Git/opa. The fix is MSYS_NO_PATHCONV=1:
MSYS_NO_PATHCONV=1 docker exec swiftdeploy-opa /opa eval "data" --data /policies
This was one of the most time-consuming bugs — completely invisible until you check the actual error message from Docker.&lt;/li&gt;
&lt;li&gt;The OPA Image Has No Shell, No curl, No wget
The openpolicyagent/opa:latest image contains only the opa binary. No sh, no curl, no wget. This means:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Healthchecks can't use CMD-SHELL&lt;br&gt;
You can't docker exec into it with a shell&lt;br&gt;
The solution: route all OPA queries through the app container which has curl&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;CRLF Line Endings Break Bash Scripts on Linux
Files edited on Windows have \r\n line endings. When bash on Linux/WSL reads them, the \r becomes part of variable values, causing cryptic failures. Always run:
sed -i 's/\r//' swiftdeploy
sed -i 's/\r//' templates/*.tmpl
sed -i 's/\r//' policies/*.rego&lt;/li&gt;
&lt;li&gt;Port Binding Only Applies at Container Creation
Docker applies port bindings when a container is created, not when it starts or restarts. docker compose restart does not re-apply port mappings. You need docker compose up --force-recreate to pick up port changes.&lt;/li&gt;
&lt;li&gt;OPA Rego v1 Requires Explicit if Keywords
The latest OPA image enforces Rego v1 syntax, which requires import rego.v1 and explicit if keywords on every rule body. The older future.keywords import approach no longer works cleanly with the latest image.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Conclusion&lt;br&gt;
swiftdeploy is a working example of several production DevOps patterns in miniature:&lt;/p&gt;

&lt;p&gt;Declarative infrastructure — one manifest drives everything&lt;br&gt;
Policy as code — OPA enforces guardrails without coupling to the CLI&lt;br&gt;
Observability — metrics, dashboards, and audit trails built in from the start&lt;br&gt;
Chaos engineering — deliberate failure injection to verify recovery paths&lt;/p&gt;

&lt;p&gt;The full source code is available at github.com/travispocr/hng14-stage-4a.&lt;/p&gt;

&lt;p&gt;Published as part of HNG Internship 14 — DevOps Track&lt;br&gt;
Author: travispocr | hngstage4b&lt;/p&gt;

</description>
      <category>automation</category>
      <category>cli</category>
      <category>devops</category>
      <category>showdev</category>
    </item>
    <item>
      <title>How I Built a Real-Time DDoS Detection Engine from Scratch (No Fail2Ban Allowed)</title>
      <dc:creator>DevOps Journey</dc:creator>
      <pubDate>Sun, 26 Apr 2026 15:25:48 +0000</pubDate>
      <link>https://dev.to/devops_journey_4b18fb2ab9/how-i-built-a-real-time-ddos-detection-engine-from-scratch-no-fail2ban-allowed-2e59</link>
      <guid>https://dev.to/devops_journey_4b18fb2ab9/how-i-built-a-real-time-ddos-detection-engine-from-scratch-no-fail2ban-allowed-2e59</guid>
      <description>&lt;h1&gt;
  
  
  How I Built a Real-Time DDoS Detection Engine from Scratch (No Fail2Ban Allowed)
&lt;/h1&gt;

&lt;p&gt;When my boss said "build something that watches all incoming traffic, learns what normal looks like, and automatically blocks attackers" — I had no idea where to start. No Fail2Ban. No rate-limiting libraries. Just Python, math, and Linux firewall rules.&lt;/p&gt;

&lt;p&gt;This is the story of how I built it, and how you can understand every piece of it — even if you've never worked on security tooling before.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Does This Project Do?
&lt;/h2&gt;

&lt;p&gt;Imagine a security guard standing at the entrance of a building. Their job is to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Watch&lt;/strong&gt; everyone who comes in&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learn&lt;/strong&gt; what a normal busy day looks like&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notice&lt;/strong&gt; when something unusual happens — like 200 people trying to enter at once&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Act&lt;/strong&gt; — block the suspicious person and alert the team&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's exactly what this tool does, but for HTTP traffic hitting a web server.&lt;/p&gt;

&lt;p&gt;The system runs alongside a Nextcloud instance and continuously monitors Nginx access logs. When it detects abnormal traffic — either from a single aggressive IP or a global traffic spike — it automatically blocks the attacker using Linux firewall rules and sends a Slack alert within 10 seconds.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Internet Traffic&lt;br&gt;
│&lt;br&gt;
▼&lt;br&gt;
Nginx (writes JSON logs)&lt;br&gt;
│&lt;br&gt;
▼&lt;br&gt;
Nextcloud&lt;br&gt;
│&lt;br&gt;
[shared log volume]&lt;br&gt;
│&lt;br&gt;
▼&lt;br&gt;
Detector Daemon (Python)&lt;br&gt;
├── Monitor    → reads log line by line&lt;br&gt;
├── Baseline   → learns normal traffic patterns&lt;br&gt;
├── Detector   → spots anomalies using math&lt;br&gt;
├── Blocker    → adds iptables firewall rules&lt;br&gt;
├── Unbanner   → releases bans on a schedule&lt;br&gt;
├── Notifier   → sends Slack alerts&lt;br&gt;
└── Dashboard  → live web UI&lt;/p&gt;

&lt;p&gt;Everything runs in Docker containers. The detector mounts the Nginx log volume read-only so it can watch logs without touching the web server.&lt;/p&gt;
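&lt;p&gt;The Monitor piece is essentially a structured tail -f. A minimal sketch follows; the JSON field names depend on the configured Nginx log_format, so ip and status here are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal log tailer sketch: follow the Nginx JSON access log and yield one parsed event per line.
import json, time

def follow(path="/var/log/nginx/access.log"):
    with open(path) as fh:
        fh.seek(0, 2)                     # start at the end of the file, like tail -f
        while True:
            line = fh.readline()
            if not line:
                time.sleep(0.1)           # nothing new yet
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue                  # skip malformed lines

for event in follow():
    ip, status = event.get("ip"), int(event.get("status", 0))
    # feed the per-IP and global sliding windows from here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;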


&lt;h2&gt;
  
  
  How the Sliding Window Works
&lt;/h2&gt;

&lt;p&gt;The first problem to solve: how do you measure "how many requests per second is this IP sending right now?"&lt;/p&gt;

&lt;p&gt;The answer is a &lt;strong&gt;sliding window&lt;/strong&gt; — a moving view of the last 60 seconds of traffic.&lt;/p&gt;

&lt;p&gt;Think of it like a 60-second conveyor belt. Every time a request arrives, we add a timestamp to the belt. Every time we want to know the current rate, we count how many timestamps are still on the belt (within the last 60 seconds). Old timestamps fall off the end automatically.&lt;/p&gt;

&lt;p&gt;In Python, we use &lt;code&gt;collections.deque&lt;/code&gt; — a double-ended queue that lets us add to the right and remove from the left efficiently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;deque&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="c1"&gt;# One deque per IP
&lt;/span&gt;&lt;span class="n"&gt;ip_window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;deque&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ip_window&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;cutoff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;  &lt;span class="c1"&gt;# 60-second window
&lt;/span&gt;
    &lt;span class="c1"&gt;# Add new timestamp
&lt;/span&gt;    &lt;span class="n"&gt;ip_window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Evict expired timestamps from the left
&lt;/span&gt;    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;ip_window&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;ip_window&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;cutoff&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ip_window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;popleft&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ip_window&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Rate = number of timestamps in the window
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ip_window&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We maintain two windows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Per-IP window&lt;/strong&gt;: tracks requests from a single IP address&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Global window&lt;/strong&gt;: tracks all requests across all IPs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the core of real-time rate measurement. No databases, no counters that reset every minute — just timestamps in a deque.&lt;/p&gt;




&lt;h2&gt;
  
  
  How the Baseline Learns From Traffic
&lt;/h2&gt;

&lt;p&gt;Detecting "too many requests" only makes sense if you know what "normal" looks like. That's where the baseline comes in.&lt;/p&gt;

&lt;p&gt;Every second, we record how many total requests arrived. We store these per-second counts in a rolling 30-minute window — again using a deque:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;baseline_window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;deque&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# stores (timestamp, count) tuples
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_second_count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;cutoff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 30 minutes
&lt;/span&gt;
    &lt;span class="n"&gt;baseline_window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Evict data older than 30 minutes
&lt;/span&gt;    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;baseline_window&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;baseline_window&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;cutoff&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;baseline_window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;popleft&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every 60 seconds, we recalculate the &lt;strong&gt;mean&lt;/strong&gt; (average) and &lt;strong&gt;standard deviation&lt;/strong&gt; from all the counts in the window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_stats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;
    &lt;span class="n"&gt;variance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;stddev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;variance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stddev&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The mean tells us "on average, how many requests per second do we get?" The standard deviation tells us "how much does it vary?"&lt;/p&gt;

&lt;p&gt;We also store results in hourly slots — so if your server gets more traffic in the afternoon than at night, the baseline adapts to each hour's pattern rather than using one global average.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Floor values&lt;/strong&gt;: When the system first starts, there isn't enough data yet. We use floor values (mean=1.0, stddev=1.0) until at least 10 samples are collected. This prevents false positives during startup.&lt;/p&gt;
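&lt;p&gt;A rough sketch of the hourly slots plus the warm-up floor; the 10-sample minimum and floor values follow the description above, while the slot layout itself is an assumption:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of hourly baseline slots with floor values during warm-up.
import math, time

FLOOR_MEAN, FLOOR_STDDEV, MIN_SAMPLES = 1.0, 1.0, 10
hourly_counts = {hour: [] for hour in range(24)}   # per-second request counts, keyed by hour of day

def record_second(count, now=None):
    hour = time.localtime(now or time.time()).tm_hour
    hourly_counts[hour].append(count)              # the real baseline also evicts old samples

def baseline_for(hour):
    values = hourly_counts[hour]
    if len(values) &amp;lt; MIN_SAMPLES:                # not enough data yet: use the floors
        return FLOOR_MEAN, FLOOR_STDDEV
    mean = sum(values) / len(values)
    variance = sum((x - mean) ** 2 for x in values) / (len(values) - 1)
    return mean, math.sqrt(variance)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;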




&lt;h2&gt;
  
  
  How the Detection Logic Makes a Decision
&lt;/h2&gt;

&lt;p&gt;Once we have a baseline, we can detect anomalies. We use two conditions — whichever fires first triggers a block:&lt;/p&gt;

&lt;h3&gt;
  
  
  Condition 1: Z-Score &amp;gt; 3.0
&lt;/h3&gt;

&lt;p&gt;The z-score measures how many standard deviations away from the mean the current rate is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;zscore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stddev&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;stddev&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;stddev&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A z-score of 3.0 means "this value is 3 standard deviations above normal." Statistically, that happens less than 0.3% of the time under normal conditions. If we see it, something unusual is happening.&lt;/p&gt;

&lt;h3&gt;
  
  
  Condition 2: Rate &amp;gt; 5x the Mean
&lt;/h3&gt;

&lt;p&gt;Even if the standard deviation is small, a rate 5 times higher than normal is suspicious:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;is_anomalous&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;zscore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ip_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stddev&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;3.0&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt;
    &lt;span class="n"&gt;ip_rate&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;5.0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Error Surge Tightening
&lt;/h3&gt;

&lt;p&gt;If an IP is also generating lots of 4xx/5xx errors (like hammering login endpoints), we tighten the thresholds automatically — the z-score threshold drops from 3.0 to 1.5 and the rate multiplier from 5x to 2.5x. A misbehaving IP gets less tolerance.&lt;/p&gt;
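&lt;p&gt;In code this is just a second, tighter set of thresholds selected per IP. A sketch follows; the error-share cutoff that decides when to tighten is an assumption:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: pick tighter thresholds for IPs with a high 4xx/5xx share in the current window.
NORMAL  = {"zscore": 3.0, "rate_multiplier": 5.0}
TIGHTER = {"zscore": 1.5, "rate_multiplier": 2.5}

def thresholds_for(error_ratio, surge_cutoff=0.5):
    """error_ratio = errors / requests for this IP; the 0.5 cutoff is an assumption."""
    return TIGHTER if error_ratio &amp;gt; surge_cutoff else NORMAL

def is_anomalous(ip_rate, mean, stddev, error_ratio):
    t = thresholds_for(error_ratio)
    z = 0.0 if stddev == 0 else (ip_rate - mean) / stddev
    return z &amp;gt; t["zscore"] or ip_rate &amp;gt; t["rate_multiplier"] * mean
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;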




&lt;h2&gt;
  
  
  How iptables Blocks an IP
&lt;/h2&gt;

&lt;p&gt;When an anomaly is detected, we need to actually stop the traffic. Linux has a built-in firewall called &lt;strong&gt;iptables&lt;/strong&gt; that operates at the kernel level — before the traffic even reaches Nginx or our application.&lt;/p&gt;

&lt;p&gt;We add a DROP rule using Python's &lt;code&gt;subprocess&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ban_ip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iptables&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INPUT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# append to INPUT chain
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# source IP to match
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-j&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DROP&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;     &lt;span class="c1"&gt;# action: drop the packet silently
&lt;/span&gt;    &lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a packet is dropped, the sender gets no response — it's like the server doesn't exist. This is more effective than sending a "rejected" response because it wastes the attacker's time waiting for a reply that never comes.&lt;/p&gt;

&lt;p&gt;To unban:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;unban_ip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iptables&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-D&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INPUT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# delete from INPUT chain
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-j&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DROP&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Auto-Unban with Backoff
&lt;/h3&gt;

&lt;p&gt;Bans aren't permanent by default. We use a backoff schedule:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First ban: 10 minutes&lt;/li&gt;
&lt;li&gt;Second ban: 30 minutes
&lt;/li&gt;
&lt;li&gt;Third ban: 2 hours&lt;/li&gt;
&lt;li&gt;Fourth ban onwards: permanent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is fair — if an IP triggers once, it might be a misconfigured client. If it keeps coming back, it gets progressively longer bans.&lt;/p&gt;
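&lt;p&gt;The schedule itself is a small lookup keyed by how many times the IP has been banned before. A sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the escalating ban durations: 10 min, 30 min, 2 h, then permanent.
BAN_SCHEDULE = [10 * 60, 30 * 60, 2 * 60 * 60]   # seconds
ban_counts = {}                                   # number of previous bans, keyed by IP

def next_ban_duration(ip):
    offences = ban_counts.get(ip, 0)
    ban_counts[ip] = offences + 1
    if offences &amp;gt;= len(BAN_SCHEDULE):
        return None                               # None means permanent: no auto-unban scheduled
    return BAN_SCHEDULE[offences]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;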




&lt;h2&gt;
  
  
  The Dashboard
&lt;/h2&gt;

&lt;p&gt;A live web dashboard built with Flask shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current global request rate&lt;/li&gt;
&lt;li&gt;Effective mean and standard deviation&lt;/li&gt;
&lt;li&gt;All currently banned IPs with their ban conditions&lt;/li&gt;
&lt;li&gt;Top 10 source IPs by request count&lt;/li&gt;
&lt;li&gt;CPU and memory usage&lt;/li&gt;
&lt;li&gt;System uptime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It refreshes every 3 seconds automatically so you can watch the system react in real time.&lt;/p&gt;
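&lt;p&gt;The dashboard is a thin layer over the detector's in-memory state. Here is a minimal sketch of the JSON endpoint such a polling UI could consume; the route and field names are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of a stats endpoint for a small polling dashboard (route and field names are assumptions).
from flask import Flask, jsonify

app = Flask(__name__)

# In the real daemon these would reference the live detector state instead of placeholders.
state = {"global_rate": 0.0, "mean": 1.0, "stddev": 1.0, "banned": {}, "top_ips": []}

@app.route("/api/stats")
def stats():
    return jsonify(
        current_rate=state["global_rate"],
        baseline_mean=state["mean"],
        baseline_stddev=state["stddev"],
        banned_ips=state["banned"],
        top_ips=state["top_ips"],
    )

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;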




&lt;h2&gt;
  
  
  Slack Alerts
&lt;/h2&gt;

&lt;p&gt;Every significant event sends a Slack message:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🚨 &lt;strong&gt;IP Ban&lt;/strong&gt;: IP address, condition that fired, current rate, baseline mean, timestamp&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;IP Unban&lt;/strong&gt;: same info plus next ban duration&lt;/li&gt;
&lt;li&gt;⚠️ &lt;strong&gt;Global Anomaly&lt;/strong&gt;: when total traffic spikes (no single IP to block)&lt;/li&gt;
&lt;/ul&gt;
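
&lt;p&gt;Sending an alert is a single POST to a Slack incoming webhook. A minimal sketch; the environment variable name and message wording are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: post a ban alert to a Slack incoming webhook.
import json, os, urllib.request

WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]     # assumed to come from the .env file

def notify_ban(ip, condition, rate, mean):
    text = (f"🚨 Banned {ip}: {condition} "
            f"(current rate {rate:.1f} req/s vs baseline mean {mean:.1f})")
    payload = json.dumps({"text": text}).encode()
    req = urllib.request.Request(WEBHOOK_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;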




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;Building this from scratch taught me things no tutorial covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Math matters in production&lt;/strong&gt;: z-scores aren't just textbook concepts — they're genuinely useful for anomaly detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deques are underrated&lt;/strong&gt;: Python's &lt;code&gt;collections.deque&lt;/code&gt; is perfect for sliding windows — O(1) append and popleft&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Baselines must be dynamic&lt;/strong&gt;: hardcoding thresholds fails in production because traffic patterns change by hour, day, and season&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;iptables is powerful&lt;/strong&gt;: blocking at the kernel level is far more effective than application-level rate limiting&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The full source code is available at:&lt;br&gt;
&lt;strong&gt;&lt;a href="https://github.com/travispocr/hng-stage3-devops" rel="noopener noreferrer"&gt;https://github.com/travispocr/hng-stage3-devops&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The live dashboard is running at:&lt;br&gt;
&lt;strong&gt;&lt;a href="http://psitdev.duckdns.org:8080" rel="noopener noreferrer"&gt;http://psitdev.duckdns.org:8080&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To run it on your own server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/travispocr/hng-stage3-devops.git
&lt;span class="nb"&gt;cd &lt;/span&gt;hng-stage3-devops
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Edit .env with your Slack webhook URL&lt;/span&gt;
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;em&gt;Built as part of the HNG Internship DevOps Track — Stage 3.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>python</category>
      <category>security</category>
      <category>networking</category>
    </item>
  </channel>
</rss>
