DEV Community: Hezekiah Umoh

Building TheEpicBook: A Deep Dive into a Node.js Monolithic Web Application

Hezekiah Umoh — Tue, 26 May 2026 00:53:16 +0000

By a Full-Stack Developer | May 2026

Introduction

In an era where microservices and serverless architectures dominate tech conversations, there is still a strong case to be made for the classic monolithic application. TheEpicBook is a full-stack bookstore web application built as a monolith — a single, unified codebase that handles everything from serving HTML pages to managing a relational database. This post walks through the architecture, the tech stack, the challenges faced during deployment, and the lessons learned along the way.

What Is TheEpicBook?

TheEpicBook is an online bookstore application that allows users to browse a curated collection of books, view book details, add items to a shopping cart, and proceed through a checkout flow. The app greets visitors with the tagline "Discover Your Next Great Read" and delivers a clean, responsive UI backed by a real relational database.

At its core, TheEpicBook is a traditional server-rendered web application — no separate frontend framework calling a REST API, no independent microservices. Everything lives in one place, and the server does it all.

The Tech Stack

TheEpicBook is built on a straightforward but powerful stack:

Node.js + Express — the backbone of the application, handling routing, middleware, and HTTP request/response logic
Express-Handlebars — a server-side templating engine that renders dynamic HTML views on the server before sending them to the browser
Sequelize ORM — an abstraction layer over the MySQL database that lets the app interact with data using JavaScript models rather than raw SQL
MySQL — the relational database storing Authors, Books, Carts, Checkouts, and their relationships
Nginx — a reverse proxy sitting in front of the Node.js server, handling incoming traffic on port 80 and forwarding it to the app on port 8080
AWS EC2 — the cloud infrastructure running the entire application on an Ubuntu server

Application Architecture

The monolithic architecture means every concern — routing, templating, business logic, and database access — lives within a single deployable unit. Here is how the key layers fit together:

Models

Sequelize models define the database schema and relationships in JavaScript. TheEpicBook has five core models:

Author — stores author first and last names
Book — stores title, genre, publication year, price, inventory count, and description, with a foreign key linking to Author
Cart — tracks quantity and price for a shopping session
Checkout — stores shipping address and subtotal, linked to a Cart
Cartbook — a junction table managing the many-to-many relationship between Books and Carts

On startup, Sequelize syncs these models with the database, automatically creating tables if they do not exist.

Routes

Express routes define the URL structure of the application. Each route handler fetches data from the database via Sequelize and passes it to a Handlebars template for rendering.

Views

Handlebars templates receive data from route handlers and produce the final HTML sent to the browser. This server-side rendering approach means the browser receives fully-formed pages — no client-side data fetching required.

Static Assets

CSS, images, and client-side JavaScript are served as static files from the public directory via Express's built-in static middleware.

Deployment on AWS EC2

Deploying TheEpicBook to a live Ubuntu server on AWS involved several real-world challenges worth documenting.

Permissions Issues

A prior sudo npm install had left parts of the project directory owned by root, so after node_modules was removed to resolve an npm rename conflict, a plain npm install could no longer write to it. Running sudo chown -R ubuntu:ubuntu /home/ubuntu/theepicbook restored correct ownership and allowed npm to install cleanly. The lesson: never run npm install with sudo inside a project directory.

Nginx as a Reverse Proxy

The server ships with Nginx pre-installed, which intercepts traffic on port 80. Rather than expose the Node.js process directly to the internet, Nginx is configured as a reverse proxy — forwarding requests from port 80 to the app running on port 8080. This is best practice for production Node.js apps, providing a clean entry point and making it easy to add SSL termination later.

Security Groups

On AWS, inbound traffic is controlled by Security Groups at the network level. Port 8080 must be explicitly opened in the EC2 inbound rules for direct access, and port 80 for Nginx proxied access. Missing this step is a common gotcha — the app runs fine on the server, but the browser simply times out.

Database Seeding

Sequelize creates the schema automatically on app startup, but an empty database shows no books. TheEpicBook ships with SQL seed files — author_seed.sql and books_seed.sql — that populate the database with initial data using a simple mysql import command.

The Case for Monoliths

TheEpicBook is a great example of why monolithic applications remain relevant and valuable, especially for smaller projects and early-stage products:

Simplicity — one codebase, one deployment, one process to manage
Easier debugging — the entire request lifecycle is traceable within a single application
Faster development — no API contracts to maintain between services, no network calls between components
Lower operational overhead — no service mesh, no inter-service authentication, no distributed tracing needed

The trade-offs come at scale — a monolith becomes harder to scale independently, and a bug in one module can affect the whole app. But for a bookstore with a well-defined domain and a small team, the monolith is the right tool for the job.

What's Next for TheEpicBook

There are several natural next steps to evolve the application:

Process management with PM2 — keep the Node.js process alive across server restarts and crashes
SSL with Let's Encrypt — add HTTPS via Certbot and Nginx for secure connections
User authentication — session-based login so users can track their own carts and order history
Admin dashboard — a protected route for adding, editing, and removing books without touching the database directly
Extracting an API layer — a first step toward a more modular architecture, serving JSON alongside the server-rendered views

Conclusion

TheEpicBook demonstrates that a well-structured monolithic application can be a robust, maintainable, and deployable product. Built with Node.js, Express, Sequelize, and MySQL, and deployed on AWS EC2 behind Nginx, it covers the full stack from database to browser in a single cohesive codebase. The deployment journey — from permissions errors to Security Group configurations — reflects the real-world experience of shipping a web application to a live server.

Sometimes the right architecture is the simple one. TheEpicBook is proof of that.

Want to follow along and implement this project? The code is at https://github.com/ntonous/theepicbook.git

Have questions about the stack or the deployment process? Drop a comment below.

How I Built a Miniature Heroku with Chaos Engineering — And Fought Azure to Deploy It

Hezekiah Umoh — Mon, 11 May 2026 15:51:01 +0000

A self-service DevOps sandbox platform with auto-destroying environments, dynamic Nginx routing, and a chaos engineering toggle — plus every painful deployment war story.

The Challenge
Imagine you're on a DevOps team and every developer needs their own isolated environment to test their code. Spinning them up manually is slow. Forgetting to tear them down wastes resources. And nobody ever tests what happens when things actually break in production.
That was the problem I set out to solve.
The result? A fully self-service DevOps Sandbox Platform — a miniature internal Heroku where environments are short-lived by design, chaos is a feature, and everything cleans itself up automatically.
What I didn't plan for was the deployment war that followed.

What I Built
The platform lets any user:

Spin up an isolated environment with one command
Deploy an app into it automatically
Monitor its health every 30 seconds
Simulate outages — crashes, network failures, CPU stress
Auto-destroy everything when the TTL expires

All of this runs on a single Linux VM and starts with one command: make up.

The Architecture
Everything lives inside one Azure VM:
Client → Nginx (port 80) → App Containers
              ↑
        Auto-generated
        conf.d/*.conf

API (port 5000) → Bash Scripts → Docker
                       ↓
              envs/*.json (state)
              logs// (logs)

Background: Health Monitor (30s) + Cleanup Daemon (60s)
Five core components:

Nginx (the front door): Every environment gets its own config file, auto-written to nginx/conf.d/. When a new environment is created, the script writes the config and runs nginx -s reload. Traffic is routed by hostname.
FastAPI control API: Seven REST endpoints wrap the bash scripts: create, list, destroy, fetch logs, check health, and trigger outages. Swagger docs are served at /docs.
The bash engine: Four scripts power everything:

create_env.sh — spins up container, network, Nginx config, log shipping
destroy_env.sh — tears everything down cleanly, archives logs
simulate_outage.sh — chaos engineering with crash/pause/network/recover/stress
cleanup_daemon.sh — runs every 60 seconds, auto-destroys expired environments

Health monitor: A Python script polls every active environment's /health endpoint every 30 seconds. Three consecutive failures mark the environment as "degraded."
State management: JSON files in envs/ are written atomically using a temp file plus mv to prevent corruption.
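The temp-file-plus-mv pattern from the state management item can be sketched in Python as follows. This is a minimal illustration under assumptions, not the platform's actual code: write_state_atomically and the example fields are hypothetical names.

```python
import json
import os
import tempfile


def write_state_atomically(path: str, state: dict) -> None:
    """Write JSON state so readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the final rename
    # stays on one filesystem, which is what makes it atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # same effect as `mv tmp final`
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A reader that opens the file at any moment sees either the old content or the new content, never a partial write.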

Building It Was the Easy Part
The platform came together cleanly. One command to start everything, environments spinning up in seconds, chaos simulation working perfectly. make up → make create → make simulate → everything worked.
Then came deployment.

The Azure Deployment Wars
Battle 1 — The SSH Key That Didn't Exist
First attempt to SSH into the VM:
Warning: Identity file azureuser_key.pem not accessible: No such file or directory
Permission denied (publickey)
The .pem file path was wrong. Classic. Found the actual file — hng5-vm_key.pem in Downloads — and fixed the path. But then:
Permission denied (publickey)
Still failing. The key didn't match the VM. Had to reset the SSH key directly in Azure portal → Connect → Reset SSH public key. Twenty minutes lost.
Lesson: Always verify your SSH key matches the VM it was created with. Azure makes it easy to reset but you lose time.

Battle 2 — The Azure Firewall That Silently Blocked Everything
Platform was running. API was live on port 5000 inside the VM. But the browser couldn't reach it.
ERR_CONNECTION_REFUSED
Added inbound port rules in Azure NSG for port 5000. Still refused. Added them again. Still refused.
Tried routing through Nginx on port 8080. Tried a proxy container. Tried 172.17.0.1. Tried 127.0.0.1. Every attempt returned:
502 Bad Gateway
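The real problem? The NSG rules were saving, but traffic was still being stopped somewhere between Azure's network edge and the container. Port 5000 was listening inside the VM (ss -tlnp confirmed it), yet nothing from outside could reach it. I blamed an extra Azure firewall layer at the time, although a refused connection, as opposed to a timeout, can also mean the container's port was never reachable through Docker's bridge network.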
The fix that actually worked: Run the API container with --network host mode instead of the default bridge network:
docker run -d \
--name sandbox-api \
--network host \
-v $(pwd):/app \
-v /var/run/docker.sock:/var/run/docker.sock \
-w /app \
devops-sandbox-api \
python3 platform/api.py
Host network mode binds directly to the VM's network interface, bypassing Docker's bridge entirely. Suddenly port 5000 was reachable from outside.
Lesson: On Azure VMs, correct-looking NSG rules do not guarantee that a container on Docker's bridge network is reachable. Host network mode is a useful escape hatch, and a quick way to tell whether the bridge or the cloud firewall is the culprit.

Battle 3 — platform Is a Reserved Python Name
With host networking, the API still wouldn't start:
ERROR: Could not import module "platform.api"
platform is a built-in Python standard library module. Uvicorn was trying to import Python's built-in platform module instead of our platform/api.py file.
The fix: Run the API directly as a Python script instead of through uvicorn's module import:
python3 platform/api.py
instead of:
uvicorn platform.api:app
Lesson: Never name your application directory after a Python standard library module. platform, json, os, and sys are not reserved words, but the import system can resolve the stdlib module instead of yours. Use a name like app, api, or src instead.

Battle 4 — The EOF That Kept Disappearing
Writing Nginx config files directly in the terminal kept producing malformed output. The heredoc EOF delimiter was being swallowed or the content was getting duplicated:
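# This kept failing silently:
cat > nginx/conf.d/api.conf << 'EOF'
server {
...
}
EOF # ← this was the problem line
A heredoc only ends on a line that contains the delimiter and nothing else. Anything after EOF on that line (a trailing space, a comment, stray pasted characters) keeps the heredoc open, so the shell treats the delimiter as more file content.
The fix: Use tee instead of cat redirection, and keep the closing EOF alone on its own line:
tee nginx/conf.d/api.conf > /dev/null << 'EOF'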
server {
listen 8080;
location / {
proxy_pass http://172.17.0.1:5000/;
}
}
EOF
Then verify immediately:
cat nginx/conf.d/api.conf
Lesson: Always verify config files after writing them in the terminal. One malformed line silently breaks everything downstream.

The Moment It Worked
After hours of SSH key resets, firewall rules, proxy containers, network debugging, and Python module conflicts — the browser finally loaded:
http://20.121.185.0:5000/docs
DevOps Sandbox API — 1.0.0 — OAS 3.1
All 7 endpoints. Live. Publicly accessible. 🔥

The API
Method	Endpoint	What it does
POST	/envs	Create environment
GET	/envs	List all + TTL remaining
DELETE	/envs/:id	Destroy environment
GET	/envs/:id/logs	Last 100 lines of logs
GET	/envs/:id/health	Last 10 health checks
POST	/envs/:id/outage	Trigger simulation
GET	/health	API health check
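The "TTL remaining" value returned by the list endpoint, and the expiry test a cleanup pass would need, both come down to simple timestamp arithmetic. The sketch below assumes each environment records a creation time and a TTL in seconds; the function names and fields are hypothetical, not taken from the repository.

```python
from datetime import datetime, timedelta, timezone


def ttl_remaining_seconds(created_at: datetime, ttl_seconds: int,
                          now: datetime) -> int:
    """Seconds left before an environment expires, never below zero."""
    expires_at = created_at + timedelta(seconds=ttl_seconds)
    return max(0, int((expires_at - now).total_seconds()))


def is_expired(created_at: datetime, ttl_seconds: int, now: datetime) -> bool:
    """True once the TTL has fully elapsed (what a cleanup pass would check)."""
    return now >= created_at + timedelta(seconds=ttl_seconds)


created = datetime(2026, 5, 11, 12, 0, 0, tzinfo=timezone.utc)
# A 30-minute TTL checked 10 minutes after creation leaves 1200 seconds.
print(ttl_remaining_seconds(created, 1800, created + timedelta(minutes=10)))
```

Passing the current time in as an argument keeps both functions easy to test without waiting for a real TTL to run out.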

The Chaos Engineering Toggle
make simulate ENV=env-demo-123 MODE=crash # Kill container
make simulate ENV=env-demo-123 MODE=pause # Freeze processes
make simulate ENV=env-demo-123 MODE=network # Cut the network
make simulate ENV=env-demo-123 MODE=stress # Spike CPU
make simulate ENV=env-demo-123 MODE=recover # Fix everything
The health monitor detects crashes within 90 seconds. Watch it live:
tail -f logs/env-demo-123/health.log
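The 90-second figure follows from three consecutive failed checks at a 30-second interval. A minimal sketch of that counting logic is below; HealthTracker is a hypothetical name, and resetting to "healthy" after a single successful check is an assumption rather than something the post states.

```python
class HealthTracker:
    """Tracks consecutive failed health checks for one environment."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.status = "healthy"

    def record(self, check_passed: bool) -> str:
        """Record one check result and return the resulting status."""
        if check_passed:
            # Assumption: one good check clears the failure streak.
            self.consecutive_failures = 0
            self.status = "healthy"
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.status = "degraded"
        return self.status
```

Counting only consecutive failures means a single dropped request does not flip an otherwise healthy environment to degraded.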

What I Learned

The platform code was the easy part. Bash scripts, Docker networking, Python APIs — all of that came together in hours. The deployment took longer than the build.
Azure's firewall has layers. NSG rules are not the only thing blocking traffic. Docker bridge networking adds another layer. When in doubt, use --network host to isolate the variable.
platform is taken. Never name your directory after a Python standard library module. It will bite you at the worst possible moment — right before a deadline.
Always verify file writes. cat yourfile after every tee or heredoc. One silent corruption cascades into hours of 502 errors.
Chaos engineering is a mindset. Building the outage simulator forced me to think about every failure mode before they happened in production. The deployment battle was unplanned chaos engineering on the platform itself.
Deadlines are the best debugging tool. Nothing focuses the mind like a submission deadline. Every error gets solved eventually — you just move faster when the clock is running.

Try It Yourself
GitHub: github.com/ntonous/devops-sandbox
Live API: http://20.121.185.0:5000/docs
Clone it, spin it up, break something, watch it recover. That's the whole point.

Built and deployed as part of the HNG14 DevOps track — Stage 5 task. Special thanks to every Azure error message that taught me something.

Building SwiftDeploy: A Self-Writing Infrastructure Tool with OPA Policy Enforcement and Prometheus Observability

Hezekiah Umoh — Thu, 07 May 2026 10:01:48 +0000

Introduction

What if your deployment tool could refuse to deploy when your disk is full? What if it could block a canary promotion when error rates spike — automatically, based on policy — without a single hardcoded if statement in the CLI?

That's exactly what I built for Stage 4b of the HNG14 DevOps track. In this post I'll walk through the full journey: from a manifest-driven deployment engine to a policy-enforced, fully observable stack with a live terminal dashboard and audit trail.

The Architecture at a Glance

manifest.yaml  (single source of truth)
      |
      v
swiftdeploy CLI
      |
      +-- Jinja2 templates --> docker-compose.yml + nginx.conf
      |
      +-- OPA policy check --> allow / block + reason
      |
      v
Docker Compose Stack
  ├── app (FastAPI + /metrics)
  ├── nginx (public ingress on swiftdeploy-net)
  └── opa (isolated on opa-internal, queried via docker exec)

The core principle: manifest.yaml is the only file a human ever edits. Everything else — config files, policy decisions, audit reports — is generated.

Stage 4a Recap: The Engine

In Stage 4a I built the foundation:

A manifest.yaml that describes the entire stack (image, port, mode, network)
A Python CLI (swiftdeploy) that reads the manifest and renders Jinja2 templates into docker-compose.yml and nginx.conf
Subcommands: init, validate, deploy, promote, teardown
A FastAPI service with /, /healthz, and /chaos endpoints
Canary/stable mode switching via promote

The key insight: the CLI never writes config by hand — it always renders from templates. Change one field in manifest.yaml, re-run init, and the entire stack config regenerates consistently.

Stage 4b: The Eyes and the Brain

Stage 4b adds three major capabilities:

The Eyes — Prometheus /metrics endpoint
The Brain — OPA policy sidecar enforcing deploy/promote gates
The Memory — audit trail and report generation

1. Instrumentation: The /metrics Endpoint

The FastAPI service now exposes a /metrics endpoint in Prometheus text format. I implemented the metrics collector entirely in Python without any external library — just a middleware that intercepts every request and records it.

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.time()
    response = await call_next(request)
    dur = time.time() - start
    if request.url.path != "/metrics":
        record_request(request.method, request.url.path,
                       response.status_code, dur)
    return response

Five metrics are exposed:

Metric	Type	Description
`http_requests_total`	counter	Requests by method, path, status_code
`http_request_duration_seconds`	histogram	Latency with 11 standard buckets
`app_uptime_seconds`	gauge	Seconds since process start
`app_mode`	gauge	0=stable, 1=canary
`chaos_active`	gauge	0=none, 1=slow, 2=error

The histogram uses 11 cumulative buckets (0.005s through 10s), so P99 latency can be estimated from the bucket counts alone, with no extra libraries needed.
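Estimating a percentile from bucket counts can be sketched as follows. The function finds the bucket containing the target rank and interpolates linearly inside it, the same idea behind Prometheus' histogram_quantile(). It is an illustration, not the project's code: estimate_quantile is a hypothetical name, and observations above the last finite bound are ignored for simplicity.

```python
def estimate_quantile(q, bounds, cumulative_counts):
    """Estimate a quantile from cumulative histogram buckets.

    bounds are the bucket upper limits in ascending order and
    cumulative_counts[i] is the number of observations <= bounds[i].
    """
    total = cumulative_counts[-1]
    if total == 0:
        return None
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in zip(bounds, cumulative_counts):
        if count >= rank:
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Linear interpolation within the bucket holding the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return bounds[-1]
```

The estimate can never be more precise than the bucket widths, which is why the result is an approximation rather than an exact P99.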

2. The Policy Sidecar: OPA

Why OPA?

The spec had a critical requirement: the CLI must not make any allow/deny decision itself. All decision logic lives exclusively in OPA. This is the separation of concerns that makes the system auditable and extensible — you can change policy without touching the CLI.

Isolation Architecture

OPA runs as a sidecar in Docker Compose but on a completely separate network from nginx:

networks:
  swiftdeploy-net:    # nginx + app live here
    driver: bridge
  opa-internal:       # OPA lives here, isolated
    driver: bridge

services:
  nginx:
    networks: [swiftdeploy-net]   # can NOT reach OPA
  opa:
    networks: [opa-internal]      # can NOT be reached via nginx

This means there is no network path from the public port 8081 to the OPA API. The spec's "no leakage" requirement is satisfied architecturally, not just by configuration.

Domain-Isolated Policies

I wrote two completely independent Rego policies, each owning exactly one domain:

policies/infrastructure.rego — answers: Is this host safe to deploy onto?

package swiftdeploy.infrastructure

default allow := false

allow if { count(violations) == 0 }

violations contains msg if {
    input.disk_free_gb < data.infrastructure.min_disk_free_gb
    msg := sprintf("Disk free (%.1f GB) is below minimum threshold (%.1f GB)",
                   [input.disk_free_gb, data.infrastructure.min_disk_free_gb])
}

violations contains msg if {
    input.cpu_load > data.infrastructure.max_cpu_load
    msg := sprintf("CPU load (%.2f) exceeds maximum threshold (%.2f)",
                   [input.cpu_load, data.infrastructure.max_cpu_load])
}

policies/canary.rego — answers: Is the canary safe to promote?

package swiftdeploy.canary

default allow := false

allow if { count(violations) == 0 }

violations contains msg if {
    input.error_rate_percent > data.canary.max_error_rate_percent
    msg := sprintf("Error rate (%.2f%%) exceeds maximum threshold (%.2f%%)",
                   [input.error_rate_percent, data.canary.max_error_rate_percent])
}

Crucially, all threshold values live in policies/data.json — not in the Rego files:

{
  "infrastructure": {
    "min_disk_free_gb": 10.0,
    "max_cpu_load": 2.0,
    "min_mem_free_percent": 10.0
  },
  "canary": {
    "max_error_rate_percent": 1.0,
    "max_p99_latency_ms": 500
  }
}

To change the disk threshold from 10GB to 20GB, you edit only data.json. The Rego files never need to change. This is the single source of truth for policy thresholds.

OPA Never Returns a Bare Boolean

Every OPA decision carries the reasoning behind it:

{
  "allow": false,
  "violations": [
    "Error rate (46.94%) exceeds maximum threshold (1.00%) over the observation window"
  ]
}

The CLI surfaces this directly to the operator — no cryptic error codes, just a plain English explanation of why deployment was blocked.

3. The CLI: Gated Lifecycle

Pre-Deploy Check

Before bringing up the stack, swiftdeploy deploy collects host stats and sends them to OPA:

> swiftdeploy deploy
  Checking infrastructure policy...
  > Host -> disk: 328.6 GB | CPU: 0.20 | mem free: 37.4%
  + [OPA/INFRASTRUCTURE] Policy passed — proceeding
  > Bringing up the stack...
  + Stack healthy -> http://localhost:8081

If I were to fill up the disk to below 10GB, the output would instead show:

  x [OPA/INFRASTRUCTURE] Policy FAILED — blocked
              - Disk free (3.2 GB) is below minimum threshold (10.0 GB)
  x Deployment blocked by policy.
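Collecting the host stats that feed the infrastructure policy can be sketched with the standard library alone. The field names disk_free_gb and cpu_load match the policy input shown earlier; collect_host_stats is a hypothetical name, and memory is left out because the standard library has no portable call for it.

```python
import os
import shutil


def collect_host_stats(path: str = "/") -> dict:
    """Gather the host facts sent to the policy engine as input."""
    usage = shutil.disk_usage(path)
    try:
        cpu_load = os.getloadavg()[0]      # 1-minute load average
    except (AttributeError, OSError):
        cpu_load = None                    # not available on Windows
    return {
        "disk_free_gb": round(usage.free / 1024 ** 3, 1),
        "cpu_load": cpu_load,
    }


print(collect_host_stats())
```

The CLI only gathers facts here; deciding whether they are acceptable stays with the policy engine.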

Pre-Promote Check (The Chaos Test)

This is where it gets interesting. Before promoting a canary to stable, the CLI scrapes /metrics, calculates error rate and P99 latency, and sends them to OPA.

I injected an 80% error rate using the chaos endpoint:

Invoke-RestMethod -Method Post -Uri http://localhost:8081/chaos `
  -ContentType "application/json" `
  -Body '{"mode":"error","rate":0.8}'

Then tried to promote:

> swiftdeploy promote stable
  Checking canary health policy...
  > Canary -> error rate: 46.94% | P99: 10 ms
  x [OPA/CANARY] Policy FAILED — blocked
              - Error rate (46.94%) exceeds maximum threshold (1.00%) over the observation window
  x Promotion blocked — canary is not healthy enough.

The canary policy gate caught a 47x threshold breach and blocked the promotion. This is exactly the kind of automated safety net that prevents bad canaries from reaching production.
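The error-rate half of that pre-promote check can be sketched as a small parser over the /metrics text. Two assumptions are baked in: an "error" means any 5xx status code, and the rate is computed over the cumulative counters rather than a sliding window. error_rate_percent is a hypothetical name.

```python
import re

SAMPLE = re.compile(r'^http_requests_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
STATUS = re.compile(r'status_code="(?P<code>\d+)"')


def error_rate_percent(metrics_text: str) -> float:
    """Share of requests with a 5xx status, as a percentage."""
    total = 0.0
    errors = 0.0
    for line in metrics_text.splitlines():
        match = SAMPLE.match(line.strip())
        if not match:
            continue                      # comments and other metrics
        status = STATUS.search(match.group("labels"))
        if not status:
            continue
        value = float(match.group("value"))
        total += value
        if status.group("code").startswith("5"):
            errors += value
    return 0.0 if total == 0 else 100.0 * errors / total
```

Because the counters are cumulative, requests served before the chaos injection dilute the rate, which is one plausible reason a measured value can sit below the injected one.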

4. The Status Dashboard

swiftdeploy status runs a live-refreshing terminal dashboard that scrapes /metrics every 5 seconds:

------------------------------------------------------------
  SwiftDeploy Status              2026-05-07 09:54:00
------------------------------------------------------------

  Mode: canary   Chaos: none   Uptime: 3420s

  Metric                           Value
  --------------------------------------------
  Throughput (req/s)               2.40
  Error Rate                       0.00%
  P99 Latency                      10 ms

  Policy Compliance
  --------------------------------------------
  [+]  Infra: Disk >= 10 GB
  [+]  Infra: CPU load <= 2.0
  [+]  Infra: Mem free >= 10%
  [+]  Canary: Error rate <= 1%
  [+]  Canary: P99 latency <= 500ms

  Refreshing every 5s — Ctrl+C to exit

Every scrape is appended to history.jsonl — a newline-delimited JSON file that forms the audit trail.

5. The Audit Report

swiftdeploy audit parses history.jsonl and generates audit_report.md with four sections:

Timeline — every deploy, promote, teardown, and policy check with timestamps
Mode Changes — when the stack switched between stable and canary
Policy Violations — every time a check failed, with the full violation message
Metrics Summary — min/max/avg of error rate, P99 latency, and throughput

The report renders perfectly as GitHub Flavored Markdown.
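The metrics summary section boils down to a single pass over history.jsonl. A minimal sketch is below; summarize_history is a hypothetical name, and the record field used in the usage example is an assumption about the file's shape.

```python
import json


def summarize_history(path: str, field: str):
    """Return (min, max, avg) of one numeric field across history.jsonl."""
    values = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue                  # tolerate blank lines
            record = json.loads(line)
            if isinstance(record.get(field), (int, float)):
                values.append(record[field])
    if not values:
        return None
    return min(values), max(values), sum(values) / len(values)
```

For example, summarize_history("history.jsonl", "error_rate_percent") would yield the three numbers for one row of the summary table, assuming each scrape record carries that field.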

The Windows Challenge: OPA Port Binding

This section is for anyone running Docker Desktop on Windows — I hit a wall that took significant debugging to solve.

The problem: OPA's port 8181 was correctly configured in docker-compose.yml as "0.0.0.0:8181:8181", and docker inspect confirmed the binding was set. But netstat showed nothing listening on 8181, and curl http://localhost:8181/health failed with connection refused.

This looks like the commonly reported Docker Desktop + WSL2 problem in which port forwarding from containers to the Windows host fails for some ports, even though the binding appears correct in docker inspect.

The solution: Instead of querying OPA via HTTP from the host, I switched to docker exec with the OPA CLI directly inside the container:

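import subprocess

# Evaluate the policy with the OPA CLI inside the container. The input
# document is piped in on stdin, so no host port is involved.
# Passing a list avoids shell quoting issues with the query path.
cmd = [
    "docker", "exec", "-i", opa_container, "opa", "eval",
    "--data", "/policies",
    "--stdin-input",
    "--format", "json",
    opa_path,
]
r = subprocess.run(cmd, input=input_json,
                   capture_output=True, text=True, timeout=10)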

This approach:

Bypasses host port binding entirely
Works identically on Linux, Mac, and Windows
Is actually more reliable — no network stack involved at all
Satisfies the isolation requirement (OPA is still on its own network, nginx can't reach it)

The lesson: when Docker networking misbehaves on Windows, docker exec is your escape hatch.

Lessons Learned

1. Separation of concerns is worth the complexity. Having OPA own all policy decisions and the CLI own only orchestration made both parts easier to test and reason about independently.

2. Thresholds in data, logic in code. Putting OPA thresholds in data.json instead of hardcoding them in Rego files means ops teams can tune policy without touching code or redeploying anything.

3. Every failure mode needs a distinct message. The spec said "every distinct failure mode must produce a different, human-readable outcome." I ended up with five distinct OPA error states (unreachable, timeout, malformed JSON, undefined result, policy failed) each producing a clear, actionable message.

4. Platform-specific bugs are real. The Docker Desktop port binding issue cost hours. The fix (docker exec) is actually cleaner than HTTP anyway — but you only find that out after hitting the wall.

5. The audit trail is free if you build it from the start. Appending JSON to history.jsonl on every event costs almost nothing at runtime but provides complete forensic history for free.
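The distinct failure states from lesson 3 can be sketched as a function that maps raw opa eval output to a state plus messages. It assumes the query targets a whole policy package, so the value is an object with allow and violations, and it relies on opa eval --format json returning an empty object when the query is undefined. interpret_opa_output is a hypothetical name; the timeout state would be handled by the caller catching subprocess.TimeoutExpired.

```python
import json


def interpret_opa_output(returncode: int, stdout: str) -> tuple:
    """Map raw `opa eval --format json` output to (state, messages)."""
    if returncode != 0:
        return "unreachable", ["OPA could not be executed or exited with an error"]
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError:
        return "malformed", ["OPA returned output that is not valid JSON"]
    if not isinstance(payload, dict):
        return "malformed", ["OPA returned an unexpected JSON structure"]
    results = payload.get("result") or []
    if not results:
        return "undefined", ["Policy query produced no result; check the query path"]
    decision = results[0]["expressions"][0]["value"]
    if decision.get("allow"):
        return "passed", []
    return "failed", list(decision.get("violations", []))
```

Note that the function only reports what OPA decided; it never computes allow or deny itself, which keeps the decision logic in the policy engine.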

Conclusion

SwiftDeploy Stage 4b is a deployment tool that can see (metrics), think (OPA policy), remember (audit trail), and refuse (policy gates). The entire stack — from /metrics to audit_report.md — is driven by a single manifest.yaml.

The code is available at: [your GitHub repo URL here]

If you're building something similar, the key takeaways are: isolate your policy engine, never return bare booleans from policy checks, and always give operators a human-readable reason when you block them.

Built for HNG14 DevOps Track — Stage 4b

I Built a Tool That Builds My Infrastructure — Here's How It Went

Hezekiah Umoh — Tue, 05 May 2026 21:20:09 +0000

A brutally honest account of building SwiftDeploy for the HNG14 Stage 4A DevOps challenge

When I first read the Stage 4A task brief, one line jumped out at me:

"Most DevOps tasks ask you to configure infrastructure manually — this one asks you to build the tool that does it for you."

That single sentence changed how I approached the entire challenge. This wasn't about setting up servers or writing config files by hand. It was about building something that does all of that for you, from a single source of truth.

This is the story of how I built SwiftDeploy — and every wall I hit along the way.

What Is SwiftDeploy?

SwiftDeploy is a declarative deployment CLI tool. You describe your entire infrastructure in a single manifest.yaml file, and the tool generates your Nginx config, Docker Compose file, manages your container lifecycle, and keeps your stack healthy.

The stack consists of:

A FastAPI Python service that runs in either stable or canary mode
An Nginx reverse proxy that routes all traffic, logs every request, and returns JSON error responses
A CLI tool written in Python with five subcommands: init, validate, deploy, promote, and teardown
Everything generated from Jinja2 templates — no manually written config files allowed

The grader would delete my generated files and re-run swiftdeploy init to verify everything regenerates correctly. If the tool broke, the stack broke. No shortcuts.

The Architecture

Before writing a single line of code, I mapped out how everything would connect:

manifest.yaml  →  swiftdeploy init  →  nginx.conf + docker-compose.yml
                                              ↓
                                    docker-compose up
                                              ↓
                              [nginx:8080] → [app:3000]

The manifest.yaml is the only file a human ever edits. Everything else is derived from it. That constraint is what makes the tool interesting — and what made debugging it so painful at times.

Building the API Service

The API service is a FastAPI application with three endpoints:

GET / returns a welcome message including the current mode, version, and server timestamp.

GET /healthz returns a liveness check with process uptime in seconds — used by Docker's health check system to determine if the container is ready to serve traffic.

POST /chaos is the interesting one. It's only active in canary mode and lets you simulate degraded behaviour: slow responses, random 500 errors, or a full recovery. This is the kind of endpoint that makes canary deployments genuinely useful — you can test how your system behaves under failure before rolling it out to everyone.

Canary mode also adds an X-Mode: canary header to every response, so you can always tell which mode the service is running in just by inspecting the headers.

Building the CLI

The CLI is a single Python script with no external framework — just argparse-style argument handling, PyYAML for parsing the manifest, and Jinja2 for rendering templates.

The five subcommands each have a clear responsibility:

init reads the manifest and renders both templates. Simple, fast, deterministic.

validate runs five pre-flight checks before anything is deployed. It checks that the manifest exists and is valid YAML, that all required fields are present, that the Docker image exists locally, that the Nginx port is free on the host, and that the generated nginx.conf passes a syntax check.

deploy chains init and validate together, then brings up the stack and blocks until health checks pass — or times out after 60 seconds.

promote is the most complex command. It updates the mode field in manifest.yaml in-place, regenerates docker-compose.yml with the new MODE environment variable, restarts only the app container (not nginx), and then confirms the new mode is active by hitting /healthz.

teardown brings everything down cleanly. With --clean, it also deletes the generated config files.
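One of the validate checks, confirming that the Nginx port is free on the host, can be sketched with a plain socket bind. This is an illustration under assumptions rather than the CLI's actual implementation; port_is_free is a hypothetical name.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if nothing is currently bound to the given TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False                  # address already in use
        return True
```

A check like this is inherently racy, since another process can claim the port between the check and docker-compose up, but it catches the common case of a leftover container still holding the port.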

The Challenges — And There Were Many

I want to be honest here. This project did not go smoothly. Here is every wall I hit, in order.

The folder was named wrong

My templates folder was named template — without the s. The CLI was looking for templates/nginx.conf.j2 and kept throwing a TemplateNotFound error. I spent more time than I'd like to admit staring at that error before noticing the missing letter.

The file was named wrong too

Once the folder name was fixed, the next problem surfaced: the nginx config template was named nginx.config.j2 instead of nginx.conf.j2. Config versus conf: two extra characters were enough to stop the template from being found.

Windows doesn't have chmod

Running chmod +x swiftdeploy in PowerShell throws an error. On Windows, you just run python swiftdeploy <command> directly — no permissions needed. This caught me off guard because the instructions assumed a Linux environment.

The Dockerfile wasn't saving

This one was the most frustrating. I edited the Dockerfile in VSCode multiple times, but the changes weren't persisting. The tab showed unsaved changes that I kept missing. Every docker build was using the old version of the file, and pip install was installing nothing because the COPY paths were wrong.

The fix was bypassing VSCode entirely and writing the file content directly from the PowerShell terminal using Out-File. Once the file was written programmatically, the builds started working correctly.

app/requirements.txt was empty

The requirements.txt inside the app/ folder was created but had no content — completely empty. Because pip install on an empty file succeeds without error, the container built cleanly but had no packages installed. fastapi and uvicorn were both missing, and the container crashed on startup with No module named uvicorn.

I only caught this by running docker run --rm swift-deploy-1-node:latest pip list and seeing nothing but pip in the output. The fix was writing the dependencies directly from the terminal:

"fastapi==0.111.0`nuvicorn[standard]==0.29.0" | Out-File -FilePath app\requirements.txt -Encoding utf8

Docker kept caching broken layers

Even after fixing the Dockerfile and the requirements file, Docker kept serving the old cached image. The fix was force-removing the image entirely and rebuilding with --no-cache:

docker rmi -f swift-deploy-1-node:latest
docker build --no-cache -t swift-deploy-1-node:latest .

Port 3000 was already allocated

My Stage 2 project containers were still running in the background and had port 3000 allocated. Every attempt to test the app container on port 3000 failed with Bind for 0.0.0.0:3000 failed: port is already allocated. The fix was simply stopping the Stage 2 frontend container temporarily.

Nginx upstream validation broke pre-deployment

The validate command tests nginx syntax by spinning up a temporary nginx container. But before the stack is running, the app hostname doesn't exist on any Docker network, so nginx reports host not found in upstream. This looks like a failure but is completely expected — the config is syntactically correct, the hostname just doesn't resolve yet.

The fix was updating the validate logic to treat this specific error as a pass, not a failure. Any other nginx error would still cause validation to fail.
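The decision rule can be sketched as a small pure function. This is an assumed shape rather than the real validate code: it supposes the check runs nginx -t inside the temporary container and captures the exit code and combined output, and the function name is hypothetical.

```python
EXPECTED_PREDEPLOY_ERROR = "host not found in upstream"


def nginx_check_passed(returncode, output):
    """Decide whether an `nginx -t` run counts as a pass before the stack is up.

    A zero exit code always passes. A non-zero exit passes only when every
    [emerg] line is the expected unresolved-upstream error.
    """
    if returncode == 0:
        return True
    emerg_lines = [line for line in output.splitlines() if "[emerg]" in line]
    return bool(emerg_lines) and all(
        EXPECTED_PREDEPLOY_ERROR in line for line in emerg_lines
    )
```

Requiring at least one [emerg] line means a failure with no recognisable output is still treated as a failure, so the exception stays as narrow as possible.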

The Moment It Worked

After all of that, here is what the final deploy output looked like:

▶  swiftdeploy deploy
  ✔  nginx.conf generated
  ✔  docker-compose.yml generated
  ✔  manifest.yaml exists and is valid YAML
  ✔  All required manifest fields present and non-empty
  ✔  Docker image exists locally: swift-deploy-1-node:latest
  ✔  Nginx port 8080 is free
  ✔  nginx.conf is syntactically valid
✔  All checks passed — stack is ready to deploy
  ➜  Bringing up the stack…
  ✔  Container swiftdeploy-app-1    Healthy
  ✔  Container swiftdeploy-nginx-1  Started
  ✔  Stack is healthy → http://localhost:8080

And hitting http://localhost:8080 in the browser returned:

{
  "message": "Welcome to SwiftDeploy API — running in stable mode",
  "mode": "stable",
  "version": "1.0.0",
  "timestamp": "2026-05-02T17:44:59.804140+00:00"
}

Then promoting to canary:

▶  swiftdeploy promote canary
  ✔  manifest.yaml updated → mode: canary
  ✔  docker-compose.yml regenerated
  ➜  Restarting app container…
  ✔  Service healthy after promote → http://localhost:8080/healthz
  ➜  Active mode confirmed: canary

That moment — seeing Active mode confirmed: canary in the terminal — felt genuinely satisfying after everything it took to get there.

What I Learned

Declarative infrastructure is powerful but unforgiving. When the manifest is the single source of truth, every typo and every wrong path has consequences. But when it works, the elegance is undeniable — one file describes everything.

Always verify your file saves. On Windows especially, VSCode unsaved changes are easy to miss. When something isn't working despite your edits, verify the file content from the terminal before assuming the code is wrong.

Docker caching is a double-edged sword. It speeds up builds dramatically, but when you're debugging image content, it can hide your fixes behind stale layers. --no-cache should be your first instinct when something is inexplicably wrong.

Empty files fail silently. An empty requirements.txt is not an error — it's a valid file with no dependencies. Always verify what's actually inside your files, not just that they exist.
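A check like this is easy to automate. The sketch below is an assumed helper, not part of SwiftDeploy: it treats a requirements file as empty when it contains only blank lines and comments.

```python
from pathlib import Path


def has_requirements(path):
    """Return True if the file lists at least one requirement."""
    # utf-8-sig tolerates the byte-order mark some Windows tools prepend
    text = Path(path).read_text(encoding="utf-8-sig")
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            return True
    return False
```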

Pre-flight validation saves deployments. The five checks in the validate command caught real problems before they reached production. The nginx upstream check in particular required nuanced handling — not every nginx error is a real error.

Final Thoughts

SwiftDeploy is not a perfect tool. But it works. It deploys a full stack from a single manifest, handles canary deployments with a single command, and validates itself before touching anything.

More importantly, every challenge I hit while building it taught me something real about how infrastructure tools work — and why the details matter so much.

If you're working through HNG14 or any similar programme, my advice is simple: document your failures as carefully as your successes. The graders can tell the difference between someone who got lucky and someone who actually understands what they built.

Good luck out there. 🚀

Here's the repo in case you'd like to clone it and replicate the setup:
GitHub: https://github.com/ntonous/hng14-stage4-taask.git

Built with Python, FastAPI, Nginx, Docker, Jinja2 and a lot of patience.
HNG14 Stage 4A — DevOps Track

How I Built a Real-Time DDoS Detection Engine from Scratch

Hezekiah Umoh — Tue, 05 May 2026 20:25:39 +0000

How I Built a Real-Time DDoS Detection Engine from Scratch (No Fail2Ban, No Libraries)

A beginner-friendly walkthrough of how I built a system that watches live web traffic, learns what "normal" looks like, and automatically blocks attackers — all from scratch using Python.

Why This Project Exists

Imagine you run a cloud storage platform. Thousands of users upload and download files every day. Then one morning, a single IP address starts sending 500 requests per second to your server — way more than any normal user would ever send.

Your server starts slowing down. Real users can't log in. Files won't upload. Your platform is under attack.

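This is a denial-of-service attack. When the flood comes from many machines at once it is called a DDoS attack, short for Distributed Denial of Service. Either way the goal is simple: flood your server with so much traffic that it can't serve real users anymore.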

My job in this project was to build a tool that:

Watches all incoming traffic in real time
Learns what normal traffic looks like
Detects when something is wrong
Automatically blocks the attacker
Sends a Slack alert so the team knows what happened

And I had to do it without using Fail2Ban or any rate-limiting library. Everything had to be built from scratch.

Let's walk through how it works — step by step.

The Big Picture

Before diving into code, here's what the system looks like at a high level:

Internet Traffic
      ↓
   Nginx (reverse proxy)
      ↓ writes JSON logs
   /var/log/nginx/hng-access.log
      ↓ tailed continuously
   monitor.py (sliding windows)
      ↓ feeds counts
   baseline.py (learns normal)
      ↓ compares
   detector.py (flags anomalies)
      ↓ if anomaly found
   blocker.py → iptables DROP rule
   notifier.py → Slack alert
   audit.py → audit log
      ↓ always running
   dashboard.py → live web UI

Every component runs as a daemon — a background process that never stops. It's not a cron job that runs once a minute. It's always watching, always learning.

Part 1: Watching the Logs (monitor.py)

What is Nginx doing?

Nginx is a web server that sits in front of our Nextcloud application. Every time someone makes a request — loading a page, uploading a file, logging in — Nginx writes a line to an access log.

I configured Nginx to write logs in JSON format so they're easy to parse:

{
  "source_ip": "45.33.10.5",
  "timestamp": "2024-01-15T12:34:56+00:00",
  "method": "GET",
  "path": "/index.php",
  "status": 200,
  "response_size": 4521
}

One line per request. Millions of lines per day on a busy server.

How do we read the log in real time?

You know how tail -f in Linux shows you new lines as they appear in a file? That's exactly what monitor.py does — but in Python.

def tail_log(log_path):
    with open(log_path, "r") as fh:
        fh.seek(0, 2)   # jump to end of file — skip old history

        while True:
            line = fh.readline()

            if line:
                parsed = parse_line(line)
                if parsed:
                    yield parsed   # send to main loop
            else:
                time.sleep(0.05)  # no new data, wait a moment

The key line is fh.seek(0, 2) — this moves our reading position to the end of the file when we start. We don't want to process yesterday's logs, just new traffic from this moment forward.

Then we loop forever: read a line, parse it, yield the result. The yield makes this a generator — it produces one request at a time for the main detection loop to process.
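tail_log relies on a parse_line helper that isn't shown above. A minimal sketch of what it might look like, assuming each log line is a single JSON object with the source_ip and status fields from the log format shown earlier, and that the main loop wants an (ip, status) pair:

```python
import json


def parse_line(line):
    """Parse one JSON access-log line into (ip, status), or None if unusable."""
    try:
        entry = json.loads(line)
        return entry["source_ip"], int(entry["status"])
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing field, or a non-numeric status: skip the line
        return None
```

Returning None for bad lines lets tail_log skip them with its existing "if parsed" check instead of crashing the daemon on a single corrupt entry.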

The Sliding Window — tracking who's doing what

Now here's where it gets interesting. For every request that comes in, we need to answer: "How many requests has this IP sent in the last 60 seconds?"

The naive approach would be to count all requests and reset every minute. But that has a problem: what if someone sends 100 requests in the last second of 11:59 and 100 more in the first second of 12:00? A per-minute counter would show 100 for each minute and miss the burst of 200 requests in two seconds.

The right approach is a sliding window using a deque (double-ended queue).

Think of a deque like a conveyor belt. New requests go on the right. Old requests fall off the left. The length of the belt is always exactly 60 seconds.

import time
from collections import deque, defaultdict

WINDOW = 60  # seconds

global_window = deque()              # timestamps of all requests
ip_windows = defaultdict(deque)      # per-IP request timestamps

def add_request(ip, status):
    now = time.time()

    # Add this request to the right of both deques
    global_window.append(now)
    ip_windows[ip].append(now)

    # Evict entries older than 60 seconds from the left
    cutoff = now - WINDOW

    while global_window and global_window[0] < cutoff:
        global_window.popleft()

    # list() takes a snapshot so idle IPs can be deleted while looping
    for key, dq in list(ip_windows.items()):
        while dq and dq[0] < cutoff:
            dq.popleft()
        if not dq:
            del ip_windows[key]   # drop idle IPs so the dict doesn't grow forever

Every entry in the deque is just a timestamp. So to get the current rate:

ip_rate = len(ip_windows["45.33.10.5"])   # requests from this IP in last 60s
global_rate = len(global_window)           # all requests in last 60s

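The count needs no extra bookkeeping: just count how many timestamps are still in the window. Because the baseline in Part 2 is measured in requests per second, divide this count by WINDOW before comparing the two.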
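To see why the sliding window catches the 11:59/12:00 burst, here is a self-contained variant that takes explicit timestamps instead of calling time.time(). The SlidingWindow class and its method names are made up for this illustration; the snippets above use module-level deques instead.

```python
from collections import deque


class SlidingWindow:
    """Counts events whose timestamps fall within the last `window` seconds."""

    def __init__(self, window=60):
        self.window = window
        self.stamps = deque()

    def add(self, ts):
        self.stamps.append(ts)

    def count(self, now):
        cutoff = now - self.window
        # Timestamps arrive in order, so expired ones are always on the left
        while self.stamps and self.stamps[0] < cutoff:
            self.stamps.popleft()
        return len(self.stamps)

    def rate(self, now):
        """Average requests per second over the window."""
        return self.count(now) / self.window


w = SlidingWindow(window=60)
for _ in range(100):
    w.add(59.0)    # burst just before the minute boundary
for _ in range(100):
    w.add(60.0)    # burst just after it

print(w.count(now=60.0))    # 200: both bursts are inside the same window
print(w.count(now=119.5))   # 100: the first burst has aged out
```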

Part 2: Learning What "Normal" Looks Like (baseline.py)

Here's a critical insight: you can't hardcode what "too many requests" means.

At 3am, getting 5 requests per second might be unusual. At noon, getting 50 requests per second might be perfectly normal. If you hardcode a threshold of "more than 20 req/s = attack", you'll get false alarms at every midday peak and miss attacks at night.

The solution is a rolling baseline — the system learns what normal looks like by watching recent traffic.

How the baseline is calculated

Every second, we record how many requests came in that second:

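from collections import deque

history = deque()   # stores (timestamp, count, error_count)

_current_count = 0    # requests seen in the current second
_current_errors = 0   # 4xx/5xx responses seen in the current second

def record_request(is_error=False):
    # Without `global`, the += below would raise UnboundLocalError
    global _current_count, _current_errors
    _current_count += 1
    if is_error:
        _current_errors += 1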

Every second, we flush the current count into our history:

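def _flush():
    global _current_count, _current_errors
    now = int(time.time())
    history.append((now, _current_count, _current_errors))

    # Start the next second from zero
    _current_count = 0
    _current_errors = 0

    # Remove data older than 30 minutes
    cutoff = now - 1800
    while history and history[0][0] < cutoff:
        history.popleft()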

Every 60 seconds, we recalculate the baseline:

from math import sqrt

def _compute():
    data = [entry[1] for entry in history]
    if not data:
        return   # nothing recorded yet; keep the previous baseline

    mean = sum(data) / len(data)
    variance = sum((x - mean)**2 for x in data) / len(data)
    std = sqrt(variance)

    baseline["mean"] = max(mean, 1.0)   # floor: idle traffic shouldn't drive the mean to zero
    baseline["std"]  = max(std, 0.5)    # floor: avoids dividing by ~0 in the z-score

The per-hour slot trick

Traffic patterns change throughout the day. Morning rush hour is different from midnight. So instead of one global rolling average, we keep per-hour slots:

hourly = defaultdict(list)   # { hour_of_day -> [counts] }

# When adding a sample:
hour = time.localtime().tm_hour
hourly[hour].append(count)

When computing the baseline, we prefer the current hour's data if it has enough samples:

current_hour = time.localtime().tm_hour
hour_data = hourly.get(current_hour, [])

if len(hour_data) >= 10:
    data = hour_data        # use today's 2pm data to judge 2pm traffic
else:
    data = full_30min_window   # not enough hour data yet, use rolling window

This means at 2pm, the baseline reflects what 2pm traffic normally looks like, not 3am traffic from 11 hours earlier.
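Putting the two ideas together, the baseline calculation can be sketched as a pair of pure functions. This is an illustration rather than the actual baseline.py: the function names and the min_samples parameter are assumptions, while the floors (1.0 and 0.5) and the 10-sample cut-off come from the snippets above. Returning the floors for an empty sample list is also a choice made for this sketch.

```python
from math import sqrt


def pick_samples(hourly, hour, rolling, min_samples=10):
    """Prefer the current hour's samples; fall back to the rolling window."""
    hour_data = hourly.get(hour, [])
    return hour_data if len(hour_data) >= min_samples else list(rolling)


def compute_baseline(samples, mean_floor=1.0, std_floor=0.5):
    """Return (mean, std) of per-second counts, clamped to the floors."""
    if not samples:
        return mean_floor, std_floor   # assumed behaviour when no data exists yet
    mean = sum(samples) / len(samples)
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    return max(mean, mean_floor), max(sqrt(variance), std_floor)
```

Keeping these as pure functions makes the statistics easy to unit test without a running log tailer or background thread.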

Part 3: Detecting Attacks (detector.py)

Now we have two numbers:

current_rate — how many requests this IP sent in the last 60 seconds
baseline_mean and baseline_std — what normal looks like

The question is: how different does the current rate need to be before we call it an attack?

Z-score: the statistical approach

A z-score tells you how many standard deviations away from the mean a value is. The formula is:

z = (current_value - mean) / standard_deviation

For example:

Mean = 10 req/s, Std = 2 req/s
Current rate = 16 req/s
Z-score = (16 - 10) / 2 = 3.0

A z-score of 3.0 means the value is 3 standard deviations above normal. If request rates followed a normal distribution, a value that far above the mean would occur by chance less than 0.3% of the time. That's suspicious.

def detect_ip(ip_rate, mean, std, ip_error_rate=0, baseline_error=0):
    z = (ip_rate - mean) / std

    # Check z-score first
    if z > 3.0:
        return True, f"z-score={z:.2f}>3.0"

    # Also check raw multiplier (catches large spikes when a high std keeps the z-score low)
    if ip_rate > mean * 5.0:
        return True, f"{ip_rate:.1f}req/s > 5x baseline"

    return False, None

We use two conditions because they catch different attack patterns:

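Z-score catches rates that are unusual relative to the observed variance, so it is the more sensitive check when traffic is steady
5x multiplier catches large spikes even when variance is high and the z-score stays under the limit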

Error surge tightening

Here's a clever trick: if an IP is generating lots of 404 errors or failed login attempts (4xx/5xx responses), it's probably a scanner or brute-force attack. We tighten the thresholds automatically:

# Default thresholds
z_limit = 3.0
mult    = 5.0

# If IP's error rate > 3x the baseline error rate...
error_surge = ip_error_rate > 3 * baseline_error_rate

if error_surge:
    z_limit = 2.0    # tighter threshold (was 3.0)
    mult    = 3.0    # tighter multiplier (was 5.0)

This means suspicious IPs get caught faster, even if their total request rate isn't extreme yet.
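Combined with the z-score and multiplier checks from the previous section, the tightening might slot in like this. The function name detect_with_tightening is hypothetical, and the sketch assumes every rate is expressed in the same unit as the baseline.

```python
def detect_with_tightening(rate, mean, std, error_rate=0.0, baseline_error_rate=0.0):
    """Return (is_anomaly, reason) using thresholds that tighten on an error surge."""
    z_limit, mult = 3.0, 5.0
    if error_rate > 3 * baseline_error_rate:
        z_limit, mult = 2.0, 3.0   # error surge: be stricter with this IP

    z = (rate - mean) / std
    if z > z_limit:
        return True, f"z-score={z:.2f}>{z_limit}"
    if rate > mean * mult:
        return True, f"rate {rate:.1f} > {mult}x baseline"
    return False, None
```

With a mean of 10 and a std of 2, a rate of 16 gives a z-score of exactly 3.0, which does not exceed the default limit. The same rate from an IP with an error surge is flagged, because the limit drops to 2.0.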

Part 4: Blocking the Attacker (blocker.py)

Once we detect an anomaly, we need to block the IP within 10 seconds. We use iptables — Linux's built-in firewall.

import subprocess

def block_ip(ip, condition, rate, baseline_mean):
    # Add a DROP rule at the top of the INPUT chain.
    # check=True raises CalledProcessError if iptables fails (for example, when not run as root)
    subprocess.run([
        "iptables", "-I", "INPUT", "1",
        "-s", ip,        # source IP
        "-j", "DROP"     # drop all packets from this IP
    ], check=True)

The -I INPUT 1 means "insert at position 1" — the very top of the firewall rules. This ensures the block takes effect immediately for all subsequent packets.

The backoff schedule

We don't permanently ban IPs on the first offense — they might be a misconfigured bot, not a malicious attacker. Instead, we use a backoff schedule:

Offense	Ban Duration
1st	10 minutes
2nd	30 minutes
3rd	2 hours
4th+	Permanent

BAN_SCHEDULE = [600, 1800, 7200, -1]   # seconds (-1 = permanent)

def get_duration(ip):
    offense_count = ban_count.get(ip, 0)
    idx = min(offense_count, len(BAN_SCHEDULE) - 1)
    duration = BAN_SCHEDULE[idx]
    ban_count[ip] = offense_count + 1
    return duration

When a ban expires, unblock_expired() removes the iptables rule automatically.
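unblock_expired() isn't shown in this post. One way it could work, sketched under assumptions: bans are tracked in a dict mapping each IP to its expiry time (None for permanent), and the rule is removed with iptables -D using the same match that block_ip inserted. The bans dict, the run parameter, and the function signature are all invented for this example.

```python
import subprocess
import time


def unblock_expired(bans, now=None, run=subprocess.run):
    """Remove DROP rules for bans whose expiry time has passed.

    bans maps ip -> expiry timestamp, or None for a permanent ban.
    Returns the list of IPs that were unblocked.
    """
    now = time.time() if now is None else now
    expired = [ip for ip, until in bans.items() if until is not None and until <= now]
    for ip in expired:
        # -D deletes the rule matching the spec that -I added earlier
        run(["iptables", "-D", "INPUT", "-s", ip, "-j", "DROP"])
        del bans[ip]
    return expired
```

Passing the command runner as a parameter keeps the expiry logic testable on a machine without iptables or root access.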

Part 5: Slack Alerts (notifier.py)

The team needs to know when something happens. We send structured Slack messages for every ban, unban, and global anomaly.

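from datetime import datetime, timezone

import requests

def send_ban(ip, condition, rate, baseline_mean, duration):
    msg = (
        f":rotating_light: *IP BANNED*\n"
        f"*IP:* `{ip}`\n"
        f"*Condition:* {condition}\n"
        f"*Rate:* {rate:.2f} req/s\n"
        f"*Baseline:* {baseline_mean:.2f} req/s\n"
        f"*Duration:* {duration}\n"
        f"*Time:* {datetime.now(timezone.utc).isoformat()}"   # timezone-aware UTC timestamp
    )
    # The timeout stops a slow Slack response from stalling the detection loop
    requests.post(WEBHOOK_URL, json={"text": msg}, timeout=5)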

The webhook URL is stored as an environment variable — never hardcoded in source code. This is important for security: if you accidentally push your code to GitHub, your webhook won't be exposed.
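Reading the URL at startup could look like the sketch below. The variable name SLACK_WEBHOOK_URL and the helper name are assumptions; the post doesn't say which names the project uses.

```python
import os


def load_webhook_url(env=os.environ, name="SLACK_WEBHOOK_URL"):
    """Return the webhook URL from the environment, failing loudly if it is unset."""
    url = env.get(name, "").strip()
    if not url:
        raise RuntimeError(f"{name} is not set; refusing to start without it")
    return url
```

Failing at startup is a deliberate choice here: a detector that silently cannot send alerts is harder to notice than one that refuses to run.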

Part 6: The Live Dashboard (dashboard.py)

The dashboard is a Flask web app that shows live metrics and refreshes every 3 seconds:

from flask import Flask
app = Flask(__name__)

@app.route("/")
def home():
    return f"""
    <html>
    <head><meta http-equiv="refresh" content="3"></head>
    <body>
        <h1>Global Req/s: {get_global_rate()}</h1>
        <h2>Baseline Mean: {baseline["mean"]:.2f}</h2>
        <h2>Banned IPs: {len(get_blocked_list())}</h2>
        <!-- ... more stats ... -->
    </body>
    </html>
    """

The <meta http-equiv="refresh" content="3"> tag makes the browser automatically reload every 3 seconds — no JavaScript needed.

Part 7: Putting It All Together (main.py)

The main loop ties everything together. It's beautifully simple:

import threading

# Start background threads
threading.Thread(target=baseline.loop, daemon=True).start()
threading.Thread(target=dashboard.run, daemon=True).start()

# Main detection loop
for ip, status in tail_log(log_file):
    # 1. Add to sliding windows
    add_request(ip, status)
    baseline.record_request(is_error=(status >= 400))

    # 2. Unban expired IPs
    blocker.unblock_expired()

    # 3. Get current stats
    mean = baseline.baseline["mean"]
    std  = baseline.baseline["std"]
    ip_rate = get_ip_rate(ip)

    # 4. Check for IP anomaly
    anomaly, reason = detector.detect_ip(ip_rate, mean, std)
    if anomaly:
        blocker.block_ip(ip, reason, ip_rate, mean)

    # 5. Check for global anomaly
    global_rate = get_global_rate()
    g_anomaly, g_reason = detector.detect_global(global_rate, mean, std)
    if g_anomaly:
        notifier.send_global_alert(g_reason, global_rate, mean)

That's it. For every single HTTP request that hits the server, this code runs in milliseconds — checking whether it's part of an attack.

Deploying with Docker

The entire stack runs in Docker containers:

services:
  nextcloud:    # the actual app (pre-built image, not modified)
  nginx:        # reverse proxy + JSON logging
  detector:     # our Python daemon

The Nginx logs are shared via a named Docker volume called HNG-nginx-logs. Nginx writes to it, and our detector reads from it — even though they're in separate containers.

The detector runs with network_mode: host and privileged: true so that iptables commands affect the actual host machine's firewall, not just the container's network namespace.

Lessons Learned

1. Never hardcode thresholds. What's "too many requests" depends entirely on your traffic patterns. Build a system that learns.

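2. Deques are perfect for sliding windows. Python's collections.deque with manual eviction from the left is exactly the right data structure for time-based windows. (maxlen caps the number of entries, not their age, so on its own it doesn't give you a time window.)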

3. Two detection methods are better than one. The z-score flags rates that are unusual relative to normal variance. The rate multiplier flags large spikes that a noisy baseline would otherwise hide. Together they cover more attack patterns.

4. Store secrets in environment variables. Never commit API keys, webhook URLs, or passwords to git. Use .env files that are gitignored.

5. Daemons beat cron jobs. A continuously running daemon reacts in milliseconds. A cron job that runs every minute can miss a 30-second attack entirely.

The Result

After all this work, here's what the live dashboard looks like:

Global Req/s updating in real time
Baseline Mean and Std Dev learned from actual traffic
Active bans with conditions and durations
Top 10 source IPs
Audit log showing every ban, unban, and baseline recalculation

When an attack comes in, the sequence is:

Request arrives → sliding window updated
Z-score computed → exceeds 3.0
iptables rule added within 10 seconds
Slack alert sent to team
Audit log entry written
Dashboard updates to show new ban

All of this happens automatically, 24/7, without any human intervention.

Resources

Built for HNG Internship Stage 3 — DevOps Track
https://github.com/ntonous/hng14-stage3-ddos-detector.git