DEV Community: Frank

I Built a DevOps Tool That Thinks: Adding "Eyes" and a "Brain" to SwiftDeploy

Frank — Wed, 06 May 2026 20:35:31 +0000

Most DevOps tasks start with a manual checklist: Is the disk full? Is the latency too high? Should we promote this Canary? In my latest project for the HNG Internship, I decided that "manual" wasn't fast enough. I didn't just want to deploy code; I wanted to build a tool that protects itself.

I upgraded my CLI tool, swiftdeploy, from a simple script to a policy-driven engine with its own "Eyes" (Metrics) and "Brain" (Open Policy Agent). Here is how I did it.

The Architecture: A Single Source of Truth
The core of the project is the manifest.yaml. I wanted to follow the Declarative Infrastructure philosophy, where I describe what I want and the tool figures out how to build it.

My tool takes this manifest and programmatically generates the docker-compose.yml and nginx.conf. No more hand-writing configs or fixing typos in Nginx blocks.
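As a rough illustration of that generation step, here is a minimal sketch that renders both files from one manifest. The manifest keys, the template contents, and the `render` helper are all assumptions made for this example, not the actual swiftdeploy schema, and the manifest is shown as an already-parsed dict to keep the snippet dependency-free.

```python
from string import Template

# Hypothetical manifest, as it might look after parsing manifest.yaml.
manifest = {
    "service": "api",
    "image": "swiftdeploy/api:1.0",
    "app_port": 8000,
    "nginx_port": 8080,
}

# Illustrative templates; a real tool would likely have many more options.
NGINX_TEMPLATE = Template("""\
server {
    listen $nginx_port;
    location / {
        proxy_pass http://$service:$app_port;
    }
}
""")

COMPOSE_TEMPLATE = Template("""\
services:
  $service:
    image: $image
    expose:
      - "$app_port"
  nginx:
    image: nginx:stable
    ports:
      - "$nginx_port:$nginx_port"
""")


def render(manifest):
    """Return (nginx_conf, compose_yaml) rendered from a single manifest."""
    # substitute() raises KeyError if the manifest is missing a required key,
    # which turns an incomplete manifest into an early, obvious failure.
    return (
        NGINX_TEMPLATE.substitute(manifest),
        COMPOSE_TEMPLATE.substitute(manifest),
    )


if __name__ == "__main__":
    nginx_conf, compose_yaml = render(manifest)
    print(nginx_conf)
    print(compose_yaml)
```

Because both files come from the same dict, a port or service name can only ever be changed in one place.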

Giving it "Eyes" (Instrumentation)
You can't manage what you can't see. I instrumented my API service (the engine) to expose a /metrics endpoint in Prometheus format.

I focused on the Golden Signals:

Throughput: Tracking every request and status code.

Latency: Using histograms to calculate P99 latency, because if 1% of your users are waiting 5 seconds, your app is broken even if the average is fine.

Health: Tracking uptime and whether Chaos Mode was active.
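To make the P99 idea concrete, here is a small sketch of estimating a quantile from cumulative histogram buckets, the shape in which Prometheus-style histograms expose latency. The function name and the sample bucket values are made up for illustration; the linear interpolation inside a bucket is the usual approach, but this is not necessarily how swiftdeploy computes it.

```python
import math


def quantile_from_buckets(buckets, q):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    `buckets` is a list of (upper_bound_seconds, cumulative_count) pairs,
    sorted by upper bound, with math.inf as the last bound.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into the +Inf bucket; report the
                # highest finite bound instead.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up data: 1000 requests, 900 of them under 100 ms.
sample = [(0.1, 900), (0.5, 985), (1.0, 995), (math.inf, 1000)]

if __name__ == "__main__":
    print(f"P99 is roughly {quantile_from_buckets(sample, 0.99):.2f}s")
```

In this made-up sample, 90% of requests finish in under 100 ms, yet the P99 lands in the 0.5s to 1.0s bucket, which is exactly the kind of tail an average would hide.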

Giving it a "Brain" (The OPA Sidecar)
This was the biggest challenge. I integrated Open Policy Agent (OPA) as a sidecar container.

Instead of hardcoding "if" statements in my Python/Bash script, I moved all the decision-making logic into Rego files.

Why decoupling matters:
If I want to change the "Safety Standard" (e.g., changing the allowed error rate from 1% to 0.5%), I don't touch my CLI code. I just update the .rego policy.

I implemented two core policies:

Infra Policy: Denies deployment if the host has less than 10GB of disk space.

Canary Safety Policy: Denies promotion if the Canary's P99 Latency is over 500ms or error rates spike.

The "Gated" Lifecycle: Look Before You Leap
I updated the swiftdeploy CLI to be "Gated."

Before the promote command actually switches traffic from Canary to Stable, it does a Pre-Promote Check:

It scrapes the /metrics from the running Canary.

It sends that data to OPA.

OPA evaluates the data against the Rego policies.

If OPA says "Deny," the CLI stops the deployment and explains exactly why (e.g., "Error rate too high").
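The steps above could be sketched roughly as follows. OPA's Data API accepts a POST with an `{"input": ...}` body and answers with a `{"result": ...}` document; everything else here is an assumption for illustration: the policy path `swiftdeploy/canary`, the `allow` and `deny` rule names, the metric field name, and the helper functions themselves.

```python
import json
import urllib.request

# Hypothetical policy path; 8181 is OPA's default server port.
OPA_URL = "http://localhost:8181/v1/data/swiftdeploy/canary"


def build_opa_request(metrics, url=OPA_URL):
    """Wrap scraped metrics in the {"input": ...} envelope OPA expects."""
    body = json.dumps({"input": metrics}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def interpret_decision(response_body):
    """Return (allowed, reasons) from an OPA response document.

    Assumes the policy defines a boolean `allow` and a collection of
    human-readable `deny` messages. A missing result is treated as a deny.
    """
    result = json.loads(response_body).get("result", {})
    reasons = list(result.get("deny", []))
    allowed = bool(result.get("allow", False)) and not reasons
    return allowed, reasons


def check_promotion(metrics):
    """Ask OPA whether the canary may be promoted (needs a running OPA)."""
    with urllib.request.urlopen(build_opa_request(metrics), timeout=5) as resp:
        return interpret_decision(resp.read())


if __name__ == "__main__":
    sample = '{"result": {"allow": false, "deny": ["Error rate too high"]}}'
    allowed, reasons = interpret_decision(sample)
    print("Promote" if allowed else f"Promotion Blocked: {'; '.join(reasons)}")
```

Treating an empty or undefined result as a deny keeps the gate closed if the policy fails to load.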

Testing with Chaos
To prove it worked, I had to break things. I used a /chaos endpoint to inject a "slow" state into the Canary.

When I ran swiftdeploy status, my real-time dashboard showed the P99 latency shooting up. When I tried to promote that "sick" Canary to production, the CLI refused.

CLI Output: Promotion Blocked: P99 Latency is 2000ms (Threshold: 500ms).

That is the moment I knew the "Brain" was working.

Lessons Learned
Fail Fast: Pre-flight validation is a lifesaver. My tool checks if the Nginx port is already taken before it even tries to start a container.

Observability is not optional: Without the /metrics endpoint, I would have been flying blind.

Policy as Code: OPA makes infrastructure audit-friendly and incredibly flexible.
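For the "Fail Fast" lesson, one common way to check whether a port is already taken is to try binding to it. The sketch below assumes that approach; the `port_is_free` helper is hypothetical and may differ from what swiftdeploy actually does.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use", or a permission error
            # for ports below 1024 when not running as root.
            return False
        return True


if __name__ == "__main__":
    port = 8080  # example Nginx port
    if not port_is_free(port):
        raise SystemExit(f"Pre-flight failed: port {port} is already in use")
    print(f"Port {port} is free")
```

The check is inherently racy, since another process can grab the port between the check and the container start, so it works as an early warning rather than a guarantee.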

Final Thought
Most DevOps tasks ask you to configure infrastructure. This one asked me to build the tool that manages the infrastructure. It’s been an intense journey from writing basic PHP/MySQL apps to building self-healing DevOps CLI tools, but the control you gain is worth every line of code.
What’s your favorite tool for enforcing deployment policies? Let me know in the comments!

#DevOps #CloudEngineering #OpenPolicyAgent #Docker #HNG

How I Built a Real-Time DDoS Detection Engine from Scratch

Frank — Wed, 29 Apr 2026 01:11:25 +0000

Introduction

Imagine you own a popular website. Thousands of people visit every day.
Then one morning, an attacker floods your server with millions of fake
requests from many machines at once, trying to crash it. This is called
a DDoS attack (Distributed Denial of Service).

For HNG Stage 3, I was tasked with building a system that:

Watches all incoming web traffic in real time
Learns what "normal" traffic looks like
Automatically detects and blocks attackers
Sends instant Slack alerts
Shows everything on a live dashboard

Here's exactly how I built it — explained simply enough that
a complete beginner can follow along.

The Architecture — How Everything Connects

Think of the system like a security team for a building:
Internet → Nginx (doorman) → Nextcloud (the building)
             ↓
    Access Log (visitor diary)
             ↓
    Python Daemon (security guard reading the diary)
             ↓
┌──────────────────────────────┐
│ Detect attack → Ban IP       │
│ Send Slack alert             │
│ Show on live dashboard       │
└──────────────────────────────┘

Nginx sits in front of everything. Every single request that
comes in — legitimate user or attacker — passes through Nginx first.
Nginx writes a JSON log entry for every request containing the IP
address, timestamp, URL, and status code.

Our Python daemon reads those log entries in real time,
learns what normal traffic looks like, and fires when something
looks wrong.
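As a small illustration of the first step, here is a sketch of turning one JSON log line into a structured record. The field names (`ip`, `time`, `uri`, `status`) are assumptions, since they depend on the Nginx log_format in use, and `parse_log_line` is a hypothetical helper.

```python
import json
from datetime import datetime


def parse_log_line(line):
    """Parse one JSON access-log line, or return None if it is malformed."""
    try:
        entry = json.loads(line)
        return {
            "ip": entry["ip"],
            "timestamp": datetime.fromisoformat(entry["time"]),
            "uri": entry["uri"],
            "status": int(entry["status"]),
        }
    except (KeyError, ValueError, TypeError):
        # ValueError also covers json.JSONDecodeError. Skipping bad lines
        # keeps one corrupt entry from crashing the daemon.
        return None


if __name__ == "__main__":
    sample = (
        '{"ip": "1.2.3.4", "time": "2026-04-29T01:11:25+00:00", '
        '"uri": "/login", "status": "401"}'
    )
    print(parse_log_line(sample))
```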

How the Sliding Window Works

Here's the core question our system needs to answer at any moment:

"How many requests did this IP make in the last 60 seconds?"

We use a data structure called a deque (double-ended queue)
to answer this efficiently.

Think of it like a conveyor belt:

New items (request timestamps) come in from the right
Old items (timestamps older than 60 seconds) fall off the left automatically

from collections import deque
from datetime import datetime, timedelta

ip_window = deque()

def add_request(ip_window, timestamp):
    # Add new request timestamp to RIGHT
    ip_window.append(timestamp)

    # Remove old timestamps from LEFT
    cutoff = timestamp - timedelta(seconds=60)
    while ip_window and ip_window[0] < cutoff:
        ip_window.popleft()

    # Length = requests in last 60 seconds
    return len(ip_window)

popleft() is O(1): it removes from the front in constant time.
This is why we use a deque instead of a regular list, where
list.pop(0) is O(n) because every remaining element has to shift.

We maintain two of these windows:

One per IP — catches single aggressive attackers
One global — catches distributed attacks from many IPs
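One way to keep both windows side by side is a dictionary of deques for the per-IP view plus a single deque for the global view. This is a minimal sketch under that assumption; the `TrafficWindows` class is a name invented for the example.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

WINDOW = timedelta(seconds=60)


class TrafficWindows:
    """Tracks request timestamps per IP and across all IPs."""

    def __init__(self):
        self.per_ip = defaultdict(deque)  # ip -> deque of timestamps
        self.global_window = deque()      # timestamps from every IP

    @staticmethod
    def _push(window, timestamp):
        """Append a timestamp, evict expired ones, return the new length."""
        window.append(timestamp)
        cutoff = timestamp - WINDOW
        while window and window[0] < cutoff:
            window.popleft()
        return len(window)

    def record(self, ip, timestamp):
        """Return (requests from this IP, requests overall) in the window."""
        ip_count = self._push(self.per_ip[ip], timestamp)
        global_count = self._push(self.global_window, timestamp)
        return ip_count, global_count


if __name__ == "__main__":
    windows = TrafficWindows()
    now = datetime.now()
    print(windows.record("1.2.3.4", now))
    print(windows.record("5.6.7.8", now + timedelta(seconds=10)))
```

Note that in this sketch an IP's deque is only trimmed when that IP sends another request, so a long-running daemon would also want to drop entries for IPs that have gone quiet.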

How The Baseline Learns From Traffic

Knowing the current rate isn't enough. We need to know if
that rate is normal or abnormal.

We solve this with a rolling 30-minute baseline:

Every second, we record how many requests arrived in that second.
We keep a 30-minute history of these per-second counts.
Every 60 seconds, we calculate:

Mean — the average requests per second:

mean = sum(counts) / len(counts)

Standard Deviation — how much the traffic usually varies:

variance = sum((x - mean) ** 2 for x in counts) / len(counts)
stddev = math.sqrt(variance)

We apply floor values to both: mean never drops below 1.0
and stddev never drops below 0.5. This prevents false alarms
when traffic is extremely stable, because a near-zero stddev
would turn even a tiny fluctuation into a huge z-score.

We also store baselines in per-hour slots. Traffic at 3pm
looks different from traffic at 3am — so we prefer the current
hour's baseline when making decisions.
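Putting the pieces of this section together, a baseline tracker might look like the sketch below. The floors and the 30-minute history come from the description above; the `Baseline` class, its method names, and the choice to fall back to the floor values when an hour has no data yet are assumptions for illustration.

```python
import math
from collections import deque

MEAN_FLOOR = 1.0
STDDEV_FLOOR = 0.5
HISTORY_SECONDS = 30 * 60  # 30 minutes of per-second counts


class Baseline:
    def __init__(self):
        # maxlen makes the deque discard the oldest count automatically.
        self.counts = deque(maxlen=HISTORY_SECONDS)
        self.hourly = {}  # hour of day (0-23) -> (mean, stddev)

    def add_second(self, count):
        """Record how many requests arrived during one second."""
        self.counts.append(count)

    def recalculate(self, hour):
        """Recompute mean and stddev, apply floors, store under `hour`."""
        if not self.counts:
            return MEAN_FLOOR, STDDEV_FLOOR
        mean = sum(self.counts) / len(self.counts)
        variance = sum((x - mean) ** 2 for x in self.counts) / len(self.counts)
        result = (max(mean, MEAN_FLOOR), max(math.sqrt(variance), STDDEV_FLOOR))
        self.hourly[hour] = result
        return result

    def for_hour(self, hour):
        """Prefer the given hour's slot; fall back to the floors (assumed)."""
        return self.hourly.get(hour, (MEAN_FLOOR, STDDEV_FLOOR))


if __name__ == "__main__":
    baseline = Baseline()
    for count in [2, 4, 4, 4, 5, 5, 7, 9]:
        baseline.add_second(count)
    print(baseline.recalculate(hour=15))
```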

How The Detection Logic Makes A Decision

With the current rate and the baseline established, we calculate
a z-score:

z = (current_rate - baseline_mean) / baseline_stddev

The z-score answers: "How many standard deviations above
normal is this?"

Z-score	Meaning
1.0	Slightly above normal
2.0	Noticeably above normal
3.0	Very unusual: under a normal distribution, only about 0.3% of samples deviate this far from the mean
10.0+	Almost certainly an attack

We flag an IP as anomalous if:

z-score > 3.0 (statistical threshold), OR
rate > 5x the baseline mean (simple multiplier)

Whichever fires first wins. This dual-trigger approach catches
both gradual ramp-up attacks (caught by z-score) and sudden
flood attacks (caught by the multiplier).

Error surge detection: If an IP is generating a lot of
4xx/5xx errors — like trying hundreds of wrong passwords —
we tighten its detection thresholds by 30%. It's already
behaving suspiciously, so we watch it more closely.
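The decision rule can be sketched as a single function. The thresholds 3.0 and 5x come from the text above; reading "tighten by 30%" as multiplying both thresholds by 0.7 is an assumption, as are the function name and the returned reason strings.

```python
Z_THRESHOLD = 3.0
RATE_MULTIPLIER = 5.0
ERROR_SURGE_FACTOR = 0.7  # assumption: "tighten by 30%" scales thresholds by 0.7


def is_anomalous(rate, mean, stddev, error_surge=False):
    """Return (flagged, reason) for a request rate against a baseline."""
    factor = ERROR_SURGE_FACTOR if error_surge else 1.0
    z = (rate - mean) / stddev
    # Trigger 1: statistical threshold.
    if z > Z_THRESHOLD * factor:
        return True, f"z-score {z:.1f}"
    # Trigger 2: simple multiple of the baseline mean.
    if rate > RATE_MULTIPLIER * factor * mean:
        return True, f"rate {rate:.1f} is over {RATE_MULTIPLIER * factor:.1f}x the mean"
    return False, "normal"


if __name__ == "__main__":
    print(is_anomalous(rate=17, mean=10, stddev=2))
    print(is_anomalous(rate=55, mean=10, stddev=20))
    print(is_anomalous(rate=15, mean=10, stddev=2, error_surge=True))
```

In the second call the baseline is noisy (stddev of 20), so the z-score is only 2.25 and the multiplier is what fires; in the third, a rate that would normally pass is flagged because the error surge lowered the z-score threshold to 2.1.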

How iptables Blocks An IP

When an IP is flagged, we run this Linux firewall command:

iptables -I INPUT -s 1.2.3.4 -j DROP

Breaking it down:

iptables: the command-line tool for managing the Linux kernel's packet-filtering rules
-I INPUT: INSERT a rule at the top of the INPUT chain (incoming traffic)
-s 1.2.3.4: match packets from this SOURCE IP
-j DROP: silently DROP all matching packets

DROP means the attacker gets absolutely no response.
Their packets just disappear. They don't even know they've
been blocked — they just stop getting responses.

We call this from Python using subprocess:

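import subprocess

ip = "1.2.3.4"  # example: the address flagged by the detector

# Needs root privileges. Passing a list (no shell) keeps the IP string
# from being interpreted as shell syntax.
cmd = ['iptables', '-I', 'INPUT', '-s', ip, '-j', 'DROP']
result = subprocess.run(cmd, capture_output=True, text=True)

if result.returncode == 0:
    print(f"Successfully blocked {ip}")
else:
    print(f"Failed to block {ip}: {result.stderr.strip()}")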

Progressive ban schedule — repeat offenders get longer bans:

1st offence: 10 minutes
2nd offence: 30 minutes
3rd offence: 2 hours
4th+ offence: Permanent

When a ban expires, we delete the rule:

iptables -D INPUT -s 1.2.3.4 -j DROP
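The ban schedule and the matching insert/delete commands can be captured in two small helpers. This is a sketch: the function names are hypothetical, and the IPv4 validation is an added precaution rather than something described above.

```python
import ipaddress
from datetime import timedelta

# Durations for the 1st, 2nd and 3rd offence; anything later is permanent.
BAN_SCHEDULE = [
    timedelta(minutes=10),
    timedelta(minutes=30),
    timedelta(hours=2),
]


def ban_duration(offence_number):
    """Return the ban length for the nth offence, or None if permanent."""
    if offence_number < 1:
        raise ValueError("offence_number starts at 1")
    if offence_number <= len(BAN_SCHEDULE):
        return BAN_SCHEDULE[offence_number - 1]
    return None


def iptables_command(action, ip):
    """Build the argument list for banning or unbanning an IPv4 address."""
    ipaddress.IPv4Address(ip)  # raises ValueError on anything malformed
    flag = {"ban": "-I", "unban": "-D"}[action]
    return ["iptables", flag, "INPUT", "-s", ip, "-j", "DROP"]


if __name__ == "__main__":
    for n in range(1, 5):
        print(n, ban_duration(n) or "permanent")
    print(" ".join(iptables_command("unban", "1.2.3.4")))
```

Building the delete command from the same function as the insert keeps the two rules identical, which matters because `iptables -D` only removes a rule that matches exactly.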

The Live Dashboard

The dashboard is a Flask web server running in a background thread.
It serves an HTML page that calls a /api/stats endpoint every
3 seconds and updates the display with fresh data.

It shows:

Global requests per second
Current baseline mean and stddev
All banned IPs with ban details
Top 10 source IPs by request rate
CPU and memory usage
System uptime
Hourly baseline slots

Key Lessons Learned

1. Async Python is powerful — running log monitoring,
baseline calculation, ban checking, and serving a dashboard
simultaneously with asyncio.gather() is elegant and efficient.

2. Read the logs — when the Nextcloud container had issues,
the logs told us exactly what was wrong and how to fix it.

3. Never hardcode secrets — GitHub Push Protection caught
our Slack webhook URL in the code. Always use environment
variables for secrets.

4. Docker volumes are the glue — the named HNG-nginx-logs
volume is what allows Nginx and our detector (in separate
containers) to share log files seamlessly.

5. Z-scores are surprisingly simple — statistical anomaly
detection sounds intimidating but the math is just subtraction
and division.
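To illustrate the first lesson, here is a toy version of running several periodic jobs concurrently with asyncio.gather(). The task names mirror the jobs mentioned above, but the bodies are placeholders and the loops are bounded so the example terminates; the real daemon's tasks would run indefinitely.

```python
import asyncio


async def periodic(name, interval, ticks, log):
    """Stand-in for a long-running job: wake up, do some work, repeat."""
    for _ in range(ticks):
        await asyncio.sleep(interval)  # yields control to the other tasks
        log.append(name)


async def main():
    log = []
    # gather() schedules all coroutines on one event loop and waits for all.
    await asyncio.gather(
        periodic("monitor_logs", 0.01, 3, log),
        periodic("recalculate_baseline", 0.03, 1, log),
        periodic("expire_bans", 0.02, 1, log),
    )
    return log


if __name__ == "__main__":
    print(asyncio.run(main()))
```

All three jobs share a single thread, so any blocking call inside one of them (a slow subprocess or HTTP request, for example) stalls the others unless it is moved off the event loop.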

Conclusion

Building this system taught me that security tooling isn't magic —
it's just careful observation, smart math, and fast response.
Enterprise-scale DDoS protection from providers like Cloudflare
and AWS builds on the same basic principles.

The full source code is available at:
https://github.com/Frank363-hash/hng-anomaly-detector

If you have questions or suggestions, drop them in the comments!