DEV Community: Chukwudum Agbasi

DevOps Magic: Building a Self-Healing, Policy-Driven Deployment Engine

Chukwudum Agbasi — Wed, 06 May 2026 21:11:36 +0000

Modern DevOps isn't just about moving code; it’s about creating a "clinical" environment where infrastructure is predictable, self-validating, and resilient. For my recent project, SwiftDeploy, I built a CLI tool that doesn't just deploy containers—it diagnoses the host environment and enforces strict policy guardrails before a single container is birthed.

Here is the technical deep dive into how I built a self-generating infrastructure stack with Open Policy Agent (OPA) integration.

The Design: Infrastructure as Logic

Most CI/CD pipelines rely on static YAML files. SwiftDeploy takes a different approach: it treats infrastructure as a dynamic output of a manifest.

How it works:
The tool uses a manifest.yaml as the "source of truth." When you run ./swiftdeploy init, the script acts as a compiler:

It parses service definitions (images, ports, environment variables) using yq.

It injects these variables into .template files using envsubst.

It generates a perfectly tailored docker-compose.yml and nginx.conf on the fly.

By "writing" its own infrastructure files, SwiftDeploy eliminates configuration drift. If the manifest changes, the infrastructure regenerates to match, ensuring that what you see in your config is exactly what runs in your stack.
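To make the "compiler" idea concrete, here is a minimal Python sketch of the substitution step. SwiftDeploy itself does this in Bash with yq and envsubst; this version swaps in Python's string.Template, which fills the same ${VAR} style of placeholder. The manifest keys and the template text are hypothetical, made up for illustration.

```python
from string import Template

# Hypothetical manifest values, already parsed into a dict.
# (SwiftDeploy reads these from manifest.yaml with yq.)
manifest = {
    "SERVICE_NAME": "api",
    "SERVICE_IMAGE": "example/api:1.0",
    "SERVICE_PORT": "8080",
}

# Hypothetical contents of a docker-compose .template file.
COMPOSE_TEMPLATE = """\
services:
  ${SERVICE_NAME}:
    image: ${SERVICE_IMAGE}
    ports:
      - "${SERVICE_PORT}:${SERVICE_PORT}"
"""


def render(template_text: str, values: dict) -> str:
    """Fill ${VAR} placeholders; raises KeyError if a value is missing."""
    return Template(template_text).substitute(values)


compose_yaml = render(COMPOSE_TEMPLATE, manifest)
print(compose_yaml)
```

Using substitute() rather than safe_substitute() means a placeholder with no matching manifest value stops the run instead of silently producing a broken file.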

The Guardrails: OPA and the "Pre-Flight" Check

In a production environment, isolation isn't just a preference—it's a requirement. I integrated Open Policy Agent (OPA) to act as the "Medical Board" for my deployments.

The Logic of Isolation:
We use OPA to evaluate two distinct policy sets:

Infrastructure Policy: Checks host health (CPU load, Disk space, Memory). If the host is "feverish" (e.g., CPU load > 2.0), OPA blocks the deployment to prevent a cascading failure.

Canary Safety: During promotion from Stable to Canary, OPA analyzes live metrics. If the error rate exceeds 5%, the promotion is "quarantined."

Why Rego?
Using Rego (OPA's query language) allows us to write policies like this:

package infrastructure
default allow = false
allow {
    input.cpu_load < data.infrastructure.max_cpu_load
    input.disk_free_gb > data.infrastructure.min_disk_free_gb
}

This decouples "how to deploy" from "when it is safe to deploy."

The Chaos: Watching the System Break

A deployment tool is only as good as its visibility. To test SwiftDeploy, I intentionally injected a "Slow State" and an "Error State" into the Python backend.

The Failure Scenario:
I updated the app to return 500 Internal Server Error on 20% of traffic. I then monitored the stack using the built-in status view.

The Status View Capture:

UPTIME: 124.5s
REQUESTS: 50 total

POLICY COMPLIANCE

metrics: error_rate=20.0% p99_latency=450ms
checking canary_safety policy...
BLOCK canary_safety: error_rate 20.0% exceeds maximum 5%
Because the status loop was feeding real-time metrics into OPA, the Promote command was automatically locked. The system "knew" it was sick before I did.
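The arithmetic behind that BLOCK line is simple enough to sketch. In SwiftDeploy the decision is made by OPA, so the Python below is only a mirror of the rule for illustration. The 5% threshold comes from the policy described earlier; the count of 10 failed requests is an assumption that matches 20% of the 50 requests shown in the capture.

```python
MAX_ERROR_RATE = 0.05  # the 5% canary threshold


def error_rate(total_requests: int, error_responses: int) -> float:
    """Fraction of requests that failed; 0.0 when there is no traffic yet."""
    if total_requests == 0:
        return 0.0
    return error_responses / total_requests


def canary_decision(total_requests: int, error_responses: int) -> str:
    """Mirror of the canary_safety rule: block above the threshold."""
    rate = error_rate(total_requests, error_responses)
    if rate > MAX_ERROR_RATE:
        return (
            f"BLOCK canary_safety: error_rate {rate:.1%} "
            f"exceeds maximum {MAX_ERROR_RATE:.0%}"
        )
    return "PASS canary_safety"


# Assumed: 10 of the 50 requests returned HTTP 500.
print(canary_decision(50, 10))
```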

Lessons Learned

The journey from Stage 2 through 4B taught me three critical lessons:

Environment Parity is Hard: Moving from Linux-based logic to a Windows/Git Bash environment revealed how much we rely on Linux-specific tools and interfaces, such as the free command or the /proc filesystem. Python is the ultimate "bridge" for cross-platform hardware stats.

Fail-Safe is the Only Way: My policy engine was designed to "Block if OPA is unavailable." This saved me multiple times when the OPA container hadn't fully mapped its ports to the host.

Observability is Part of Deployment: A deployment doesn't end when the container is Up. It ends when the metrics prove the container is healthy.
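As a rough illustration of the first lesson, here is a sketch of collecting host stats with only the Python standard library. The function name and the dictionary keys are my own choices (the keys echo the input fields in the Rego policy), not SwiftDeploy's actual code. Memory is left out because the standard library has no portable call for it; a third-party package such as psutil is one option there.

```python
import os
import shutil


def collect_host_stats(path: str = ".") -> dict:
    """Gather host stats using only the standard library."""
    usage = shutil.disk_usage(path)  # works on Linux, macOS, and Windows
    stats = {
        "disk_free_gb": round(usage.free / 1024**3, 2),
        "cpu_count": os.cpu_count(),
        "cpu_load": None,
    }
    # os.getloadavg() only exists on Unix-like systems, so guard the call.
    getloadavg = getattr(os, "getloadavg", None)
    if getloadavg is not None:
        try:
            stats["cpu_load"] = getloadavg()[0]  # 1-minute load average
        except OSError:
            pass  # load average unobtainable; leave it as None
    return stats


print(collect_host_stats())
```

A caller that enforces the fail-safe rule would treat a missing cpu_load as a reason to block rather than a reason to skip the check.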

Replicate the Work

If you want to build your own policy-driven CLI:

Tooling: Bash, Docker, OPA, and yq.

Step 1: Create templates for your YAML.

Step 2: Use curl to POST your system stats to OPA’s Data API.

Step 3: Only trigger docker compose up if the API returns {"result": true}.
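Steps 2 and 3 can be sketched in Python as well, including the "block if OPA is unavailable" behavior from the lessons above. The URL is an assumption: it uses OPA's default port 8181 and the path that corresponds to the allow rule in the infrastructure package. The function names are hypothetical.

```python
import json
import urllib.request

# Assumed address; adjust to wherever your OPA container listens.
OPA_URL = "http://localhost:8181/v1/data/infrastructure/allow"


def http_post_json(url: str, payload: dict, timeout: float = 2.0):
    """POST a JSON body and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def deployment_allowed(stats: dict, post=http_post_json) -> bool:
    """Return True only when OPA explicitly answers {"result": true}."""
    try:
        # The Data API expects the document to evaluate under "input".
        answer = post(OPA_URL, {"input": stats})
    except (OSError, ValueError):
        # OPA unreachable or the reply was not JSON: fail closed.
        return False
    return isinstance(answer, dict) and answer.get("result") is True
```

The post parameter exists so the transport can be swapped out in tests; a real run would call deployment_allowed(stats) and only then trigger docker compose up.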

Infrastructure should be smart enough to say "No" to a bad deployment. SwiftDeploy makes sure it does.

How I Built a Real-Time DDoS Detection Engine from Scratch (No Fail2Ban)

Chukwudum Agbasi — Wed, 29 Apr 2026 12:16:39 +0000

Have you ever wondered how a website "knows" it's being attacked and automatically pulls the plug on the attacker?

I recently built an anomaly detection engine from scratch. It’s a live system that watches incoming traffic, learns what "normal" looks like, and automatically blocks suspicious IPs using Linux firewall rules.

In this post, I’ll walk you through how it works in plain English. No prior security experience required. Let’s get into it.

🛠 What the Project Does (and Why It Matters)
Imagine a popular restaurant. Usually, customers walk in, order, and eat. But what if 500 people suddenly rushed in at once, stood at the counter, and ordered nothing? The staff would be so overwhelmed they couldn't serve real customers.

That is a DDoS (Distributed Denial of Service) attack.

The challenge is that you can't just say "block anyone who sends more than 100 requests." A busy server might normally get 200, while a quiet one gets 5. A hardcoded limit would either block real fans or miss real attackers.

The Solution: Build a system that learns your server's "rhythm" and flags anything that breaks it.

The Bird's Eye View
Here is how the data flows:

Internet Traffic hits the Nginx server.

Nginx writes logs to a shared folder.

My Detector Daemon (Python) reads those logs in real-time.

It calculates a Baseline, detects Anomalies, and executes a Ban.

It sends a Slack Alert and updates a Live Dashboard.

1. The Sliding Window: "Forgetting" the Past
To know how busy the server is right now, you can't look at all traffic since the beginning of time. You need a Sliding Window.

Think of a sliding window like a 60-second video clip. Every second, the "window" moves forward. It forgets the oldest second and adds the newest one.

In Python, I used a deque (a double-ended queue) to handle this efficiently:

from collections import deque
import time

# A deque of request timestamps (oldest on the left, newest on the right)
ip_window = deque()

def record_request(window):
    """Record one request and evict entries older than 60 seconds."""
    now = time.time()
    window.append(now)

    # EVICT OLD: Remove anything older than 60 seconds
    cutoff = now - 60
    while window and window[0] < cutoff:
        window.popleft()

This is the beauty: memory never grows beyond the traffic of the last minute. Old data literally "falls off" the conveyor belt, leaving you with a fresh count of exactly what happened in the last 60 seconds.
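Here is one way the window could be turned into a per-IP request rate. This is a sketch under my own assumptions: one deque per IP stored in a defaultdict, and the clock passed in as an argument so the behavior is easy to check without waiting a real minute. The IP address is a documentation-range placeholder.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60

# One deque of timestamps per client IP (an assumed layout).
windows = defaultdict(deque)


def record(ip: str, now: float) -> float:
    """Record a request at time `now` and return that IP's requests/sec."""
    window = windows[ip]
    window.append(now)
    cutoff = now - WINDOW_SECONDS
    while window and window[0] < cutoff:
        window.popleft()
    return len(window) / WINDOW_SECONDS


# 120 requests from one IP, spread over a single second.
for i in range(120):
    rate = record("203.0.113.7", 1000.0 + i / 120)
print(rate)  # 2.0 requests/sec averaged over the 60-second window
```

A request from the same IP 100 seconds later would evict all 120 old entries and bring the rate back down to 1/60.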

2. The Baseline: Learning What’s "Normal"
The sliding window tells us the current speed, but the Baseline tells us the "Speed Limit."

The engine keeps a 30-minute history of traffic. Every minute, it calculates the Average (Mean) and the Standard Deviation (how much the traffic usually fluctuates).

Quiet Morning: Average might be 2 requests/sec.

Busy Afternoon: Average might climb to 40 requests/sec.

Because the baseline is always recalculating, the system adapts. If your site gets a permanent boost in popularity, the "security guard" doesn't panic; it just learns the new normal. It is the Darwin of this security architecture.

Why this is the "Darwin" of security: Just like the X-Men's Darwin, who grows gills when submerged in water, this baseline evolves based on the "pressure" of the traffic. If the traffic stays high, the baseline grows to accommodate it. If it stays low, it tightens up. It adapts so it never has to panic.
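A rolling baseline like this can be sketched with a bounded deque and the statistics module. The class name, the one-sample-per-second granularity, and the choice of population standard deviation are all assumptions for illustration; the 30-minute history is the only number taken from the description above.

```python
from collections import deque
from statistics import mean, pstdev

HISTORY_SECONDS = 30 * 60  # 30-minute history


class Baseline:
    """Rolling mean and standard deviation of requests per second."""

    def __init__(self):
        # maxlen makes the deque discard the oldest sample automatically.
        self.samples = deque(maxlen=HISTORY_SECONDS)
        self.mean = 0.0
        self.std = 0.0

    def add_sample(self, requests_per_second: float) -> None:
        self.samples.append(requests_per_second)

    def recalculate(self) -> None:
        """Call once a minute to refresh the mean and standard deviation."""
        if self.samples:
            self.mean = mean(self.samples)
            self.std = pstdev(self.samples)


baseline = Baseline()
for sample in [2, 4, 4, 4, 5, 5, 7, 9]:
    baseline.add_sample(sample)
baseline.recalculate()
print(baseline.mean, baseline.std)  # mean 5, standard deviation 2.0
```

Because old samples drop out of the deque, a sustained rise in traffic gradually pulls the mean upward, which is the adaptive behavior described above.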

3. The Math: Z-Scores and Multipliers
How do we actually trigger a ban? We use two "sniff tests":

Test A: The Z-Score (The Statistical Freak-out)
A Z-score measures how many "standard deviations" a value is from the average.

Z-Score of 1: Totally normal.

Z-Score of 3+: This is mathematically "weird." If traffic followed a normal distribution, a value this far above the average would show up less than 0.2% of the time. Verdict: Blocked.

Test B: The Multiplier (The "Common Sense" Rule)
If the baseline is very quiet (e.g., 0.1 requests/sec), the standard deviation is tiny and the Z-score can get jumpy. So we add a backup: Is the current rate more than 5x the average? If yes, Verdict: Blocked.
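Both tests fit in a few lines of Python. The Z-score is the current rate minus the baseline mean, divided by the standard deviation. The thresholds are the ones quoted above; the function names, the handling of a zero standard deviation, and the standard deviation of 10 used in the example are assumptions. That value of 10 is chosen so that a rate of 50 against a baseline of 5 gives a Z-score of 4.5.

```python
Z_THRESHOLD = 3.0      # "Z-Score of 3+" from Test A
RATE_MULTIPLIER = 5.0  # the 5x rule from Test B


def z_score(rate: float, mean: float, std: float) -> float:
    """Standard deviations between the current rate and the baseline mean."""
    if std == 0:
        # Assumption: with no spread, defer entirely to the multiplier test.
        return 0.0
    return (rate - mean) / std


def is_anomalous(rate: float, mean: float, std: float) -> bool:
    """True if either the Z-score test or the multiplier test trips."""
    if z_score(rate, mean, std) >= Z_THRESHOLD:
        return True
    return mean > 0 and rate > RATE_MULTIPLIER * mean


print(z_score(50, 5, 10))       # 4.5
print(is_anomalous(50, 5, 10))  # True
```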

4. The Hammer: iptables
Once we catch a "bad actor," we have to stop them. We use iptables, the Linux kernel's built-in firewall.

When we detect an anomaly, the Python script runs a system command to DROP all traffic from that specific IP:

# What the code tells the Linux Kernel:
iptables -I INPUT -s 1.2.3.4 -j DROP

This is incredibly powerful. The traffic is blocked at the "front door" (the kernel level). It never even reaches the web server, saving your CPU and RAM for real users.
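A sketch of how a Python script might issue that command safely is below. The function names are hypothetical. The one design point worth showing is validating the address before it goes anywhere near a command line, since the IP comes from log data. The sketch covers IPv4 only, and actually running ban_ip needs root on a Linux host.

```python
import ipaddress
import subprocess


def build_drop_command(ip: str) -> list:
    """Return the iptables argument list for dropping traffic from `ip`."""
    # IPv4Address raises ValueError for anything that is not a valid
    # IPv4 address, so log-derived strings are never passed on unchecked.
    address = ipaddress.IPv4Address(ip)
    return ["iptables", "-I", "INPUT", "-s", str(address), "-j", "DROP"]


def ban_ip(ip: str) -> None:
    """Insert the DROP rule; requires root privileges on a Linux host."""
    # Passing a list (no shell=True) avoids shell interpretation entirely.
    subprocess.run(build_drop_command(ip), check=True)


print(build_drop_command("1.2.3.4"))
```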

The "Backoff" Schedule
We aren't monsters! Sometimes a user just refreshes too fast. We use an escalating "strikes" system:

1st Offense: 10-minute ban.

2nd Offense: 30-minute ban.

3rd Offense: 2-hour ban.

4th Offense: Permanent block.

Because if an IP hasn't learned by now, it’s not a visitor, it’s a threat.
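The schedule above maps naturally onto a small lookup. This is a sketch with hypothetical names; using None to mean "permanent" is my own convention, not necessarily how the engine stores it.

```python
from typing import Optional

# Ban lengths in seconds for the 1st, 2nd, and 3rd offense.
BAN_SCHEDULE = [10 * 60, 30 * 60, 2 * 60 * 60]


def ban_duration(offense_number: int) -> Optional[int]:
    """Seconds to ban for this offense, or None for a permanent block."""
    if offense_number < 1:
        raise ValueError("offense_number starts at 1")
    if offense_number <= len(BAN_SCHEDULE):
        return BAN_SCHEDULE[offense_number - 1]
    return None  # 4th offense and beyond: permanent


print([ban_duration(n) for n in range(1, 5)])  # [600, 1800, 7200, None]
```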

📢 Real-Time Alerts
Security is only good if you know it's working. Every time a ban happens, the system shoots a message to Slack:

🚨 IP BANNED
IP: 1.2.3.4
Reason: Z-Score 4.5 (Way above normal!)
Rate: 50 req/s (Baseline: 5 req/s)
Duration: 600s
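For anyone wiring up something similar, here is a sketch of building and sending that message through a Slack incoming webhook, which accepts a JSON body with a "text" field. The function names are hypothetical and the webhook URL is whatever Slack issues for your workspace; nothing here is taken from the engine's actual source.

```python
import json
import urllib.request


def build_alert(ip: str, z: float, rate: float, baseline: float, duration: int) -> dict:
    """Build the JSON payload for a Slack incoming webhook."""
    text = (
        "🚨 IP BANNED\n"
        f"IP: {ip}\n"
        f"Reason: Z-Score {z:.1f}\n"
        f"Rate: {rate:g} req/s (Baseline: {baseline:g} req/s)\n"
        f"Duration: {duration}s"
    )
    return {"text": text}


def send_alert(webhook_url: str, payload: dict) -> None:
    """POST the payload to the webhook URL supplied by the caller."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(request, timeout=5).close()


payload = build_alert("1.2.3.4", 4.5, 50, 5, 600)
print(payload["text"])
```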

💡 Wrapping Up
By building this from scratch, with no pre-made tools like Fail2Ban, I learned that security isn't just about "locking doors." It's about observation, statistics, and automation.

The beauty of this engine is that it doesn't care if you're a tiny blog or a massive store; it watches your traffic, learns your baseline, and protects you accordingly.

🔗 Resources
Source Code: [https://github.com/Valescaray/hng-stage-3]

Live Dashboard: [https://monitor.oppsdev.xyz]

What do you think? Would you trust an automated math equation to protect your server? Let me know in the comments!

#devsecops #python #programming #security