How I built a CLI tool that generates its own infrastructure configs, manages a full containerised stack, enforces deployment policies through OPA, exposes Prometheus metrics, and produces a live audit trail — all from a single YAML file.
Table of Contents
- The Problem We're Solving
- Architecture Overview
- Stage 4A — The Engine
- Stage 4B — The Eyes and Brain
- The Debugging Sagas
- Full Deployment Walkthrough
- Key Lessons Learned
The Problem We're Solving {#the-problem}
Every time you spin up a new service in a real DevOps environment, you repeat the same manual work:
- Write an Nginx config
- Write a Docker Compose file
- Run Docker commands
- Check if things are healthy
- Hope nobody deploys when the disk is full
- Hope nobody promotes a canary that's throwing 60% errors
SwiftDeploy solves all of this. One YAML manifest describes your entire deployment. A CLI tool generates every config file from it, manages the container lifecycle, enforces safety policies before allowing deployments, exposes real-time metrics, and produces a full audit trail.
The golden rule: the manifest is the single source of truth. Everything else is generated.
Architecture Overview {#architecture}
┌─────────────────────────────────────────────────────────────────────────────┐
│                   SwiftDeploy — Full System Architecture                     │
├──────────────────┬──────────────────────────────────────┬───────────────────┤
│      ZONE 1      │                ZONE 2                │      ZONE 3       │
│     Operator     │     Host Machine / Docker Engine     │  Generated Files  │
│                  │                                      │                   │
│   [Operator]     │  ┌─── swiftdeploy-net (bridge) ───┐  │  nginx.conf       │
│       │          │  │                                │  │  docker-compose   │
│       ▼          │  │  [nginx:8080]─────►[app:3000]  │  │  history.jsonl    │
│  manifest.yaml   │  │    PUBLIC            INTERNAL  │  │  audit_report.md  │
│  (source of      │  │      │                  │      │  │                   │
│   truth)         │  │      └────[logs vol]────┘      │  │                   │
│       │          │  │                                │  │                   │
│       ▼          │  │          [opa:8181]            │  │                   │
│   swiftdeploy    │  │        localhost only          │  │                   │
│      CLI         │  │        NOT via nginx ✗         │  │                   │
│   ├─ init        │  └────────────────────────────────┘  │                   │
│   ├─ validate    │                                      │                   │
│   ├─ deploy ─────┼──► pre-deploy: OPA infra check       │                   │
│   ├─ promote ────┼──► pre-promote: OPA canary check     │                   │
│   ├─ status ─────┼──► scrapes /metrics every 5s ───────►│  history.jsonl    │
│   ├─ audit ──────┼─────────────────────────────────────►│  audit_report.md  │
│   └─ teardown    │                                      │                   │
└──────────────────┴──────────────────────────────────────┴───────────────────┘
Stage 4A — The Engine {#stage-4a}
Stage 4A is the foundation. It answers one question: how do you build a tool that writes its own infrastructure configs?
The Project Structure
swiftdeploy/
├── manifest.yaml                   ← the ONLY file you edit
├── swiftdeploy                     ← CLI executable
├── Dockerfile                      ← app image definition
├── app/
│   └── main.py                     ← Python HTTP service
├── templates/
│   ├── nginx.conf.tmpl             ← nginx template
│   └── docker-compose.yml.tmpl     ← compose template
├── policies/                       ← Stage 4B addition
│   ├── infrastructure.rego
│   ├── canary.rego
│   └── data.json
├── nginx.conf                      ← generated (gitignored)
└── docker-compose.yml              ← generated (gitignored)
The Manifest {#the-manifest}
manifest.yaml is the brain of the entire system. Every component reads from it directly or via generated files.
services:
  image: swift-deploy-1-node:latest
  port: 3000
  mode: stable                # stable or canary
  version: "1.0.0"
  restart_policy: unless-stopped
  log_volume: swiftdeploy-logs
nginx:
  image: nginx:latest
  port: 8080
  proxy_timeout: 30
opa:
  image: openpolicyagent/opa:latest-static
  port: 8181
network:
  name: swiftdeploy-net
  driver_type: bridge
contact: "ops@swiftdeploy.local"
Every field propagates through the system. Change nginx.proxy_timeout here and it updates in nginx.conf on the next init. Change services.mode here and the entire deployment mode switches on the next promote.
The Python HTTP Service {#the-python-service}
The app is a from-scratch HTTP server using only Python's stdlib — no Flask, no FastAPI. Three endpoints in Stage 4A, four in Stage 4B.
Configuration from environment
MODE = os.environ.get("MODE", "stable")
APP_VERSION = os.environ.get("APP_VERSION", "1.0.0")
APP_PORT = int(os.environ.get("APP_PORT", "3000"))
START_TIME = time.time()
Configuration comes entirely from environment variables injected by Docker Compose at runtime. START_TIME is captured at module load — this is how /healthz calculates uptime without a database.
Thread-safe chaos state
chaos_lock = threading.Lock()
chaos_state = {"mode": None, "duration": None, "rate": None}

def get_chaos():
    with chaos_lock:
        return dict(chaos_state)  # returns a copy — callers can't mutate internal state

def set_chaos(state):
    with chaos_lock:
        chaos_state.update(state)
The Lock prevents race conditions when multiple requests read and write chaos state simultaneously. dict(chaos_state) returns a copy so the caller never holds a reference to the mutable internal dict.
The three Stage 4A endpoints
GET / — welcome
self.send_json(200, {
    "message": "Welcome to SwiftDeploy API",
    "mode": MODE,
    "version": APP_VERSION,
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
})
GET /healthz — liveness check
uptime = round(time.time() - START_TIME, 2)
self.send_json(200, {
    "status": "ok",
    "mode": MODE,
    "version": APP_VERSION,
    "uptime_seconds": uptime,
})
The /healthz endpoint does three jobs: proves the server is alive (Docker healthcheck), reports current mode (so promote can confirm the switch happened), and reports uptime (useful for debugging restart loops).
POST /chaos — chaos injection (canary only)
if MODE != "canary":
    self.send_json(403, {"error": "chaos endpoint only available in canary mode"})
    return

length = int(self.headers.get("Content-Length", 0))
data = json.loads(self.rfile.read(length))
mode = data.get("mode")

if mode == "slow":
    set_chaos({"mode": "slow", "duration": data.get("duration", 2), "rate": None})
elif mode == "error":
    set_chaos({"mode": "error", "duration": None, "rate": data.get("rate", 0.5)})
elif mode == "recover":
    set_chaos({"mode": None, "duration": None, "rate": None})
Reading Content-Length and passing it to rfile.read() is non-negotiable — without an explicit byte count, read() blocks forever waiting for data that never arrives.
The chaos modes (a sketch of how the handler applies them follows the list):

- slow — injects time.sleep(N) before responding, simulating a slow upstream
- error — uses random.random() < rate to return 500 on a configurable percentage of requests
- recover — clears all chaos state, returning to normal behaviour
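The POST handler above only records the requested mode; the article doesn't show how the GET path consumes it. A minimal sketch of that consumption, assuming the get_chaos() helper shown earlier and an illustrative apply_chaos() wrapper name:

import random
import time

def apply_chaos(handler):
    """Apply current chaos state. Returns True if a response was already sent."""
    chaos = get_chaos()  # thread-safe copy from the helper above
    if chaos["mode"] == "slow":
        time.sleep(chaos["duration"])      # simulate a slow upstream, then fall through
    elif chaos["mode"] == "error" and random.random() < chaos["rate"]:
        handler.send_json(500, {"error": "chaos: injected failure"})
        return True                        # short-circuit — a 500 was already sent
    return False

The real handler presumably does something equivalent inline before building each normal response.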
The Dockerfile {#the-dockerfile}
FROM python:3.12-alpine
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
WORKDIR /app
COPY app/main.py .
RUN chown -R appuser:appgroup /app
USER appuser
ENV MODE=stable
ENV APP_VERSION=1.0.0
ENV APP_PORT=3000
EXPOSE 3000
HEALTHCHECK --interval=10s --timeout=5s --start-period=15s --retries=5 \
CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:3000/healthz', timeout=4)" || exit 1
CMD ["python", "main.py"]
Why Alpine? python:3.12-alpine is ~60MB; python:3.12 (Debian-based) is closer to 1GB. The task requires the final image to stay under 300MB.
Why non-root? If someone exploits the app, they get a powerless user, not root access to the server.
Why Python urllib for the healthcheck? This was a hard-won lesson. wget couldn't resolve localhost inside Alpine on WSL2 + Docker Desktop — a known network namespace quirk. Python's urllib bypasses the system resolver entirely and connects directly via socket. More reliable, no external tool needed.
The Templates {#the-templates}
Templates are blueprints with {{ placeholder }} values that the CLI replaces with real values from the manifest.
nginx.conf.tmpl key sections:
upstream app_backend {
    server app:{{ service_port }};
    keepalive 32;
}

log_format swiftdeploy '$time_iso8601 | $status | ${request_time}s | $upstream_addr | $request';

server {
    listen {{ nginx_port }};

    proxy_connect_timeout {{ proxy_timeout }}s;
    proxy_send_timeout    {{ proxy_timeout }}s;
    proxy_read_timeout    {{ proxy_timeout }}s;

    add_header X-Deployed-By swiftdeploy always;
    proxy_pass_header X-Mode;

    location @error502 {
        default_type application/json;
        return 502 '{"error":"Bad Gateway","code":502,"service":"app","contact":"{{ contact }}"}';
    }
}
Design decisions:
- Custom log format — ISO timestamp, status code, response time, upstream IP, full request on one line
- JSON error bodies — APIs need machine-readable errors, not nginx HTML pages
- proxy_pass_header X-Mode — nginx strips custom headers by default; this forwards the canary header through to clients
- keepalive 32 — maintains 32 persistent connections to the upstream, reducing connection overhead
docker-compose.yml.tmpl key sections:
app:
  expose:
    - "{{ service_port }}"                  # container-to-container only, NEVER published to host
nginx:
  ports:
    - "{{ nginx_port }}:{{ nginx_port }}"   # only nginx faces the world
  depends_on:
    app:
      condition: service_healthy            # nginx waits for app healthcheck to pass
The expose vs ports distinction is a security boundary. expose = container-to-container only. ports = host-facing. The app is never reachable from outside Docker.
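A quick way to convince yourself the boundary holds, from the host. This is a sketch using the default ports from this post, not a SwiftDeploy command:

import urllib.error
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:8080/healthz", timeout=3) as resp:
        print("nginx (published port):", resp.status)          # expected: 200
except urllib.error.URLError as e:
    print("nginx unreachable:", e.reason)

try:
    urllib.request.urlopen("http://localhost:3000/healthz", timeout=3)
    print("app reachable from the host — boundary broken!")
except urllib.error.URLError as e:
    print("app not reachable from the host (expected):", e.reason)  # connection refused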
The CLI — Five Stage 4A Subcommands {#the-cli}
The template engine
def render_template(tmpl_path, context):
    with open(tmpl_path) as f:
        content = f.read()
    for key, val in context.items():
        content = content.replace("{{ " + key + " }}", str(val))
    return content
Five lines. No Jinja2. Simple string replacement. The templates are straightforward enough that a minimal custom engine is cleaner than pulling in a dependency.
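For illustration, a call with a hand-built context — in the real CLI the dict comes from build_context(manifest), so treat these literal values as assumptions:

ctx = {
    "service_port": 3000,
    "nginx_port": 8080,
    "proxy_timeout": 30,
    "contact": "ops@swiftdeploy.local",
}
rendered = render_template("templates/nginx.conf.tmpl", ctx)
# "listen {{ nginx_port }};"  becomes  "listen 8080;"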
init — generates everything from the manifest
def cmd_init():
    manifest = load_manifest()
    ctx = build_context(manifest)
    nginx_conf = render_template(NGINX_TMPL, ctx)
    compose_conf = render_template(COMPOSE_TMPL, ctx)
    with open(NGINX_OUT, "w") as f: f.write(nginx_conf)
    with open(COMPOSE_OUT, "w") as f: f.write(compose_conf)
The grader deletes generated files and re-runs init to verify regeneration. Because we built it correctly, this is a non-issue.
validate — five pre-flight checks
# Check 1: manifest.yaml exists and is valid YAML
# Check 2: all required fields present and non-empty
# Check 3: docker image inspect — exits 0 if exists
# Check 4: ss -tlnp | grep :8080 — non-empty means port in use
# Check 5: nginx -t via isolated Docker container
Check 5 is the most interesting. We run nginx -t in a temporary container — validating syntax without needing nginx installed on the host. But there's a complication: app:3000 (the upstream hostname) can't be resolved from an isolated container. The fix: swap server app: for server 127.0.0.1: in a temp copy before testing. The actual nginx.conf on disk is untouched.
test_content = data.replace("server app:", "server 127.0.0.1:")
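Putting check 5 together, a sketch of how the syntax test might run — the exact docker run flags and mount target in the real CLI may differ:

import os
import subprocess
import tempfile

def check_nginx_syntax(nginx_conf_path):
    with open(nginx_conf_path) as f:
        data = f.read()
    # the isolated container can't resolve the `app` hostname — test a patched copy
    test_content = data.replace("server app:", "server 127.0.0.1:")
    with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as tmp:
        tmp.write(test_content)
        tmp_path = tmp.name
    try:
        result = subprocess.run(
            ["docker", "run", "--rm",
             "-v", f"{tmp_path}:/etc/nginx/conf.d/default.conf:ro",
             "nginx:latest", "nginx", "-t"],
            capture_output=True, text=True,
        )
        return result.returncode == 0
    finally:
        os.unlink(tmp_path)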
deploy — brings up the stack and blocks until healthy
deadline = time.time() + 60
while time.time() < deadline:
    try:
        with urllib.request.urlopen(f"http://localhost:{nginx_port}/healthz", timeout=3) as resp:
            if json.loads(resp.read()).get("status") == "ok":
                healthy = True
                break
    except Exception:
        pass  # container still starting — connection refused is normal
    time.sleep(2)
docker compose up -d returns as soon as containers are created, not when they're healthy. The polling loop tries /healthz every 2 seconds for 60 seconds. Only {"status": "ok"} breaks the loop.
promote — rolling restart with zero nginx downtime
# 1. Update manifest in-place
content = re.sub(r"(mode:\s*)(\S+)", f"\\g<1>{target_mode}", content, count=1)
# 2. Regenerate docker-compose.yml with new MODE env var
# 3. Restart ONLY the app container — nginx stays up
run(compose_cmd("up -d --no-deps app"))
# 4. Confirm mode via /healthz
--no-deps is the key — it tells Compose to restart only the app service without touching nginx. Zero proxy downtime during the switch.
The Two Deployment Modes {#deployment-modes}
Stable mode — normal production behaviour. Clean responses. No special headers. /chaos returns 403.
Canary mode — test mode before full rollout:
if MODE == "canary":
    self.send_header("X-Mode", "canary")  # on EVERY response
Canary mode:

- Adds X-Mode: canary to every response — callers can identify which mode they're hitting
- Unlocks /chaos — lets you simulate slow responses, random errors, then recover
- You promote with ./swiftdeploy promote canary, stress test, then ./swiftdeploy promote stable to roll back
Stage 4B — The Eyes and The Brain {#stage-4b}
Stage 4A built the engine. Stage 4B adds observability and policy enforcement. The stack now has eyes (metrics), a brain (OPA policies), and memory (audit trail).
Prometheus Metrics — The Eyes {#prometheus-metrics}
The app gains a /metrics endpoint exposing five metric types in Prometheus text format.
Tracking infrastructure
# Counter: {(method, path, status_code): count}
request_counts = {}

# Histogram state per path
HISTOGRAM_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
request_durations = {}

def record_request(method, path, status_code, duration_seconds):
    with metrics_lock:
        key = (method, path, str(status_code))
        request_counts[key] = request_counts.get(key, 0) + 1
        if path not in request_durations:
            request_durations[path] = {
                "buckets": {str(le): 0 for le in HISTOGRAM_BUCKETS},
                "sum": 0.0,
                "count": 0,
            }
        hist = request_durations[path]
        hist["sum"] += duration_seconds
        hist["count"] += 1
        for le in HISTOGRAM_BUCKETS:
            if duration_seconds <= le:
                hist["buckets"][str(le)] += 1
record_request() is called after every request regardless of path — timing wraps the entire handler:
def do_GET(self):
    start = time.time()
    path = self.path.split("?")[0]
    status = self._handle_get()
    record_request("GET", path, status, time.time() - start)
The five metrics
# 1. http_requests_total — counter, labels: method, path, status_code
# 2. http_request_duration_seconds — histogram with standard buckets
# 3. app_uptime_seconds — gauge
# 4. app_mode — gauge: 0=stable, 1=canary
# 5. chaos_active — gauge: 0=none, 1=slow, 2=error
The /metrics output
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/",status_code="200"} 42
http_requests_total{method="GET",path="/healthz",status_code="200"} 60
http_requests_total{method="GET",path="/",status_code="500"} 38
# HELP http_request_duration_seconds HTTP request latency in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{path="/",le="0.005"} 40
http_request_duration_seconds_bucket{path="/",le="+Inf"} 80
http_request_duration_seconds_sum{path="/"} 0.042381
http_request_duration_seconds_count{path="/"} 80
# HELP app_mode Current deployment mode (0=stable, 1=canary)
# TYPE app_mode gauge
app_mode 1
# HELP chaos_active Current chaos state (0=none, 1=slow, 2=error)
# TYPE chaos_active gauge
chaos_active 2
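The post shows the output but not the rendering code. A sketch of how the counter and gauge sections might be built from the tracking state above (the function name build_metrics_text is illustrative, and the histogram section is omitted for brevity):

def build_metrics_text():
    lines = [
        "# HELP http_requests_total Total number of HTTP requests",
        "# TYPE http_requests_total counter",
    ]
    with metrics_lock:
        for (method, path, status_code), count in sorted(request_counts.items()):
            lines.append(
                f'http_requests_total{{method="{method}",path="{path}",'
                f'status_code="{status_code}"}} {count}'
            )
    lines += [
        "# HELP app_mode Current deployment mode (0=stable, 1=canary)",
        "# TYPE app_mode gauge",
        f"app_mode {1 if MODE == 'canary' else 0}",
    ]
    return "\n".join(lines) + "\n"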
OPA Policy Engine — The Brain {#opa-policy-engine}
Open Policy Agent is a dedicated container whose only job is making allow/deny decisions based on rules written in Rego. The core principle: the CLI never makes allow/deny decisions itself. All decision logic lives exclusively in OPA.
Why OPA instead of if/else in the CLI?
If the policy logic lives in the CLI, changing a threshold means editing Python code, rebuilding, redeploying. With OPA, you edit data.json, restart the OPA container, and the new threshold is live. Policy as code, not policy as application logic.
data.json — thresholds live here, never hardcoded in Rego
{
  "thresholds": {
    "min_disk_free_gb": 10.0,
    "max_cpu_load": 2.0,
    "min_mem_free_percent": 10.0,
    "max_error_rate_percent": 1.0,
    "max_p99_latency_ms": 500.0
  }
}
infrastructure.rego — pre-deploy policy
package swiftdeploy.infrastructure

import future.keywords.if
import future.keywords.contains

default allow := false

allow if {
    disk_ok
    cpu_ok
    mem_ok
}

disk_ok if { input.disk_free_gb >= data.thresholds.min_disk_free_gb }
cpu_ok if { input.cpu_load_1m <= data.thresholds.max_cpu_load }
mem_ok if { input.mem_free_percent >= data.thresholds.min_mem_free_percent }

reasons contains msg if {
    not disk_ok
    msg := sprintf(
        "disk_free_gb is %.1f, minimum required is %.1f",
        [input.disk_free_gb, data.thresholds.min_disk_free_gb]
    )
}

# decision is what the CLI reads — never a bare boolean
decision := {
    "allow": allow,
    "reasons": reasons,
    "domain": "infrastructure",
    "checked": {
        "disk_free_gb": input.disk_free_gb,
        "cpu_load_1m": input.cpu_load_1m,
        "mem_free_percent": input.mem_free_percent,
    },
}
Why import future.keywords? The openpolicyagent/opa:latest-static image uses Rego v1 which requires explicit if and contains keywords. Without these imports, OPA crashes on startup. This was discovered the hard way during testing.
canary.rego — pre-promote policy
package swiftdeploy.canary

import future.keywords.if
import future.keywords.contains

default allow := false

allow if {
    error_rate_ok
    latency_ok
}

error_rate_ok if { input.error_rate_percent <= data.thresholds.max_error_rate_percent }
latency_ok if { input.p99_latency_ms <= data.thresholds.max_p99_latency_ms }

reasons contains msg if {
    not error_rate_ok
    msg := sprintf(
        "error_rate is %.2f%%, maximum allowed is %.2f%%",
        [input.error_rate_percent, data.thresholds.max_error_rate_percent]
    )
}

decision := {
    "allow": allow,
    "reasons": reasons,
    "domain": "canary",
    "checked": {
        "error_rate_percent": input.error_rate_percent,
        "p99_latency_ms": input.p99_latency_ms,
        "window_seconds": input.window_seconds,
    },
}
OPA isolation — no leakage via nginx
In docker-compose.yml.tmpl:
opa:
  ports:
    - "127.0.0.1:{{ opa_port }}:8181"   # localhost only — never 0.0.0.0
127.0.0.1:8181 means only the host machine can reach OPA. The nginx container on port 8080 has no route to OPA. This is enforced at the Docker network binding level, not just convention.
The policy query function
def query_opa(manifest, package, input_data):
    url = f"{opa_url(manifest)}/v1/data/{package.replace('.', '/')}/decision"
    payload = json.dumps({"input": input_data}).encode()
    try:
        req = urllib.request.Request(url, data=payload,
                                     headers={"Content-Type": "application/json"}, method="POST")
        with urllib.request.urlopen(req, timeout=5) as resp:
            body = json.loads(resp.read())
        result = body.get("result")
        if result is None:
            return None, "OPA returned empty result — check policy package name"
        return result, None
    except urllib.error.URLError as e:
        return None, f"OPA unreachable: {e.reason}"
    except Exception as e:
        return None, f"OPA query failed: {e}"
Every distinct failure mode returns a different error string. The CLI never crashes or hangs when OPA is unavailable — it warns and fails open. This is intentional: you don't want OPA unavailability to block emergency deployments.
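The enforce_policy wrapper that the subcommands call isn't shown in full; a minimal sketch of the behaviour described above (the exact log wording is illustrative):

def enforce_policy(manifest, package, input_data, label):
    decision, err = query_opa(manifest, package, input_data)
    if err is not None:
        print(f"! WARNING: {err} — skipping {label} policy check (fail open)")
        return True   # an unreachable OPA must not block an emergency deploy
    if decision.get("allow"):
        print(f"✔ Policy [{label}] ALLOWED")
        return True
    print(f"✘ Policy [{label}] DENIED")
    for reason in decision.get("reasons", []):
        print(f"  ! {reason}")
    return False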
Gated Lifecycle — The CLI Brain {#gated-lifecycle}
Pre-deploy check
def cmd_deploy():
    manifest = load_manifest()

    # Collect host stats — inside get_host_stats():
    #   disk_free_gb     = shutil.disk_usage("/").free / (1024 ** 3)
    #   cpu_load_1m      = float(open("/proc/loadavg").read().split()[0])
    #   mem_free_percent = (meminfo["MemAvailable"] / meminfo["MemTotal"]) * 100
    host_stats = get_host_stats()

    # Send to OPA
    allowed = enforce_policy(manifest, "swiftdeploy.infrastructure", host_stats, "infrastructure")
    if not allowed:
        append_history({"event": "deploy_blocked", "reason": "infrastructure_policy"})
        sys.exit(1)

    # Only reach here if OPA allows
    run(compose_cmd("up -d --build"))
If the disk is full, OPA returns:
✘ Policy [infrastructure] DENIED
! disk_free_gb is 8.2, minimum required is 10.0
Deployment blocked by policy: infrastructure
Pre-promote check
def cmd_promote(target_mode):
    if target_mode == "canary":
        raw = scrape_metrics(nginx_port)
        metrics = parse_prometheus(raw)
        error_rate = calculate_error_rate(metrics)
        p99_ms = calculate_p99_latency_ms(metrics)
        allowed = enforce_policy(
            manifest, "swiftdeploy.canary",
            {"error_rate_percent": error_rate, "p99_latency_ms": p99_ms, "window_seconds": 30},
            "canary safety",
        )
        if not allowed:
            sys.exit(1)
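cmd_promote leans on parse_prometheus, which isn't shown in the post. A minimal sketch that turns the exposition text into {metric_name: [{"labels": ..., "value": ...}]}, assuming label values never contain commas:

def parse_prometheus(raw_text):
    metrics = {}
    for line in raw_text.splitlines():
        if not line or line.startswith("#"):
            continue
        name_part, _, value = line.rpartition(" ")
        labels = {}
        if "{" in name_part:
            name, label_blob = name_part.split("{", 1)
            for pair in label_blob.rstrip("}").split(","):
                if pair:
                    key, val = pair.split("=", 1)
                    labels[key] = val.strip('"')
        else:
            name = name_part
        metrics.setdefault(name, []).append({"labels": labels, "value": float(value)})
    return metrics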
P99 latency calculation from histogram
def calculate_p99_latency_ms(metrics, path_filter=None):
    buckets = {}
    total_count = 0
    for entry in metrics.get("http_request_duration_seconds_bucket", []):
        le = entry["labels"].get("le", "")
        if le == "+Inf":
            total_count = max(total_count, entry["value"])
            continue
        buckets[float(le)] = buckets.get(float(le), 0) + entry["value"]
    if total_count == 0:
        return 0.0
    p99_threshold = total_count * 0.99
    for le in sorted(buckets.keys()):
        if buckets[le] >= p99_threshold:
            return round(le * 1000, 2)  # seconds → milliseconds
    return 10000.0
P99 means: the smallest histogram bucket where 99% of requests have completed. If 99 out of 100 requests finished within 250ms, P99 = 250ms.
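The companion helper, calculate_error_rate, works from the counter metric instead. A sketch under the same assumptions — 5xx responses count as errors, and the result is a percentage to match the policy input:

def calculate_error_rate(metrics):
    total = 0
    errors = 0
    for entry in metrics.get("http_requests_total", []):
        count = entry["value"]
        total += count
        if entry["labels"].get("status_code", "").startswith("5"):
            errors += count
    if total == 0:
        return 0.0
    return round(errors / total * 100, 2)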
The Status Dashboard — The Eyes {#status-dashboard}
def cmd_status():
    while True:
        raw = scrape_metrics(nginx_port)
        metrics = parse_prometheus(raw)
        error_rate = calculate_error_rate(metrics)
        p99_ms = calculate_p99_latency_ms(metrics)

        # Query OPA for live compliance
        infra_dec, _ = query_opa(manifest, "swiftdeploy.infrastructure", get_host_stats())
        canary_dec, _ = query_opa(manifest, "swiftdeploy.canary",
            {"error_rate_percent": error_rate, "p99_latency_ms": p99_ms, "window_seconds": 30})

        os.system("clear")
        # ... render dashboard ...

        append_history({
            "event": "status_scrape",
            "error_rate_percent": error_rate,
            "p99_latency_ms": p99_ms,
            "mode": mode_str,
            "chaos": chaos_str,
            "policy_infra_pass": infra_dec.get("allow") if infra_dec else None,
            "policy_canary_pass": canary_dec.get("allow") if canary_dec else None,
        })
        time.sleep(5)
What the dashboard looks like with chaos active:
SwiftDeploy Status Dashboard        2026-05-05T21:00:37Z
────────────────────────────────────────────────
── Throughput ──────────────────────────────────
  req/s       : 2.4
  error rate  : 56.45%    ← red
  P99 latency : 5.0ms
── App State ───────────────────────────────────
  mode        : canary
  chaos       : error     ← red
  uptime      : 316s
── Policy Compliance ───────────────────────────
  ✔ infrastructure   PASS
  ✘ canary           FAIL
      ! error_rate is 56.45%, maximum allowed is 1.00%

Refreshing every 5s — Ctrl+C to exit
This is exactly what real SRE dashboards do — they show you the current state AND whether it violates policy in real time.
The Audit Trail — The Memory {#audit-trail}
Every event appends a JSON line to history.jsonl:
{"timestamp":"2026-05-05T20:34:51Z","event":"deploy","mode":"stable"}
{"timestamp":"2026-05-05T20:55:22Z","event":"promote","target_mode":"canary"}
{"timestamp":"2026-05-05T20:55:23Z","event":"status_scrape","error_rate_percent":62.5,"chaos":"error","policy_canary_pass":false}
{"timestamp":"2026-05-05T21:01:01Z","event":"promote","target_mode":"stable"}
swiftdeploy audit reads this file and generates audit_report.md:
## Mode Changes
| Timestamp | From | To |
|-----------|------|----|
| 2026-05-05T20:34:51Z | unknown | stable |
| 2026-05-05T20:55:22Z | stable | canary |
| 2026-05-05T21:01:01Z | canary | stable |
## Policy Violations
| Timestamp | Infrastructure | Canary | Error Rate | P99 |
|-----------|---------------|--------|------------|-----|
| 2026-05-05T20:55:23Z | ✔ PASS | ✘ FAIL | 62.5% | 5.0ms |
| 2026-05-05T20:55:28Z | ✔ PASS | ✘ FAIL | 63.6% | 5.0ms |
This report renders perfectly as GitHub Flavored Markdown — every table, every checkmark, every timestamp.
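A sketch of how the Mode Changes table might be derived from history.jsonl — the helper name and the exact events consulted are assumptions, not the CLI's verbatim code:

import json

def mode_changes(history_path="history.jsonl"):
    rows, previous = [], "unknown"
    with open(history_path) as f:
        for line in f:
            event = json.loads(line)
            mode = event.get("target_mode") or event.get("mode")
            if event.get("event") in ("deploy", "promote") and mode and mode != previous:
                rows.append((event["timestamp"], previous, mode))
                previous = mode
    return rows  # each tuple becomes one | Timestamp | From | To | row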
The Debugging Sagas {#debugging}
No real DevOps project ships without war stories. Here are the ones that taught the most.
Saga 1 — Six layers of healthcheck failure
The app container kept showing unhealthy despite the server running fine. The debugging sequence:
Failure 1: ${APP_PORT} didn't expand in the Dockerfile HEALTHCHECK CMD — the check ran against the literal placeholder instead of the real port. Fixed by hardcoding 3000.
Failure 2: localhost doesn't resolve inside Alpine's healthcheck context on WSL2 + Docker Desktop. Fixed by using 127.0.0.1.
Failure 3: wget with 127.0.0.1 still failed. Confirmed the server WAS listening:
docker exec swiftdeploy-app ss -tlnp
# tcp LISTEN 0.0.0.0:3000
docker exec swiftdeploy-app python -c "import urllib.request; print(urllib.request.urlopen('http://127.0.0.1:3000/healthz').read())"
# b'{"status": "ok", ...}' ← works via exec, not via healthcheck
This is a known WSL2 + Docker Desktop network namespace issue. Fixed by using Python's urllib instead of wget.
Failure 4: Docker cache was serving the old image despite the Dockerfile fix. Fixed with --no-cache.
Failure 5: The docker-compose.yml template had its own healthcheck block overriding the Dockerfile. Docker Compose healthcheck always wins. Fixed the template too.
Failure 6: The healthcheck YAML block had 3 spaces indent instead of 4. A single space difference caused a YAML parse error. Fixed by carefully rewriting the block.
Saga 2 — OPA Rego v1 syntax
The openpolicyagent/opa:latest-static image enforces strict Rego v1 syntax. Our policies used the older syntax:
# OLD — crashes on latest OPA
allow {
disk_ok
}
reasons[msg] {
not disk_ok
msg := "..."
}
# NEW — Rego v1 required syntax
allow if {
disk_ok
}
reasons contains msg if {
not disk_ok
msg := "..."
}
Without import future.keywords.if and import future.keywords.contains at the top of each file, OPA refuses to start.
Saga 3 — WSL2 path spaces breaking docker run
The project lived at /mnt/c/Users/RAZER BLADE/Desktop/HNG/hng-swiftdeploy. The space in RAZER BLADE caused docker run -v {path}:... to split the path at the space, making Docker interpret the second half as an image name.
docker: invalid reference format: repository name
(Desktop/HNG/hng-swiftdeploy/nginx.conf) must be lowercase
Fixed by quoting all paths containing the project directory and using subprocess.run with a list instead of shell=True to avoid shell word-splitting entirely.
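The fix in one picture — with the list form, each element is a single argv entry, so the space in the path never gets word-split (the path and mount target here are illustrative):

import subprocess

conf = "/mnt/c/Users/RAZER BLADE/Desktop/HNG/hng-swiftdeploy/nginx.conf"

# Fragile: a shell splits the command string at the space inside the path.
# subprocess.run(f"docker run --rm -v {conf}:/etc/nginx/conf.d/default.conf:ro nginx:latest nginx -t", shell=True)

# Robust: no shell involved, the path survives intact.
subprocess.run(
    ["docker", "run", "--rm",
     "-v", f"{conf}:/etc/nginx/conf.d/default.conf:ro",
     "nginx:latest", "nginx", "-t"],
    check=False,
)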
Full Deployment Walkthrough {#deployment}
# 1. Build the image
docker build -t swift-deploy-1-node:latest .
# 2. Validate pre-flight checks
./swiftdeploy validate
# 3. Deploy (OPA policy check runs first)
./swiftdeploy deploy
# 4. Verify metrics
curl http://localhost:8080/metrics
# 5. Verify OPA isolation
curl http://127.0.0.1:8181/health # works — internal
curl http://localhost:8080/v1/data # 404 — nginx blocks it
# 6. Launch status dashboard
./swiftdeploy status
# 7. Promote to canary (OPA canary policy check runs first)
./swiftdeploy promote canary
# 8. Inject chaos
curl -X POST http://localhost:8080/chaos \
-H "Content-Type: application/json" \
-d '{"mode": "error", "rate": 0.5}'
# 9. Watch status dashboard catch it — canary policy FAIL visible in real time
# 10. Recover
curl -X POST http://localhost:8080/chaos \
-H "Content-Type: application/json" \
-d '{"mode": "recover"}'
# 11. Promote back to stable
./swiftdeploy promote stable
# 12. Generate audit report
./swiftdeploy audit
cat audit_report.md
# 13. Teardown
./swiftdeploy teardown --clean
Key Lessons Learned {#lessons}
1. Docker Compose healthcheck overrides Dockerfile HEALTHCHECK. Always check both places when healthchecks misbehave. Compose wins every time.
2. WSL2 has a different network namespace for healthchecks than for docker exec. If something works via exec but not via healthcheck, it's almost certainly a tool or network namespace issue. Python's stdlib is more portable than wget in this environment.
3. OPA Rego v1 requires explicit keywords. latest-static means the latest OPA — which enforces Rego v1 syntax. Always import future.keywords.if and import future.keywords.contains.
4. expose vs ports is a security boundary, not documentation. expose = container-to-container only. ports = host-facing. Binding OPA to 127.0.0.1 enforces isolation at the network level.
5. The CLI should never make policy decisions. Every time you add an if/else for a deployment condition in the CLI, you're doing OPA's job badly. Push all allow/deny logic into Rego. The CLI's job is to collect data and surface decisions.
6. P99 latency is more useful than average. An average latency of 10ms can hide the fact that 1 in 100 requests takes 5 seconds. P99 exposes that tail. Always instrument histograms, not just averages.
7. Declarative infrastructure pays off immediately. The grader deletes generated files and re-runs init. Because the manifest is always there and regeneration is instantaneous, this is a non-issue. Manual configs would have been a problem.
8. An audit trail is not optional. history.jsonl made it trivial to answer "when did chaos start?", "which policy was failing?", "how long was the canary running before we promoted?" These questions matter in production incidents.
Conclusion
SwiftDeploy started as a task requirement and became a complete mental model for how modern deployment tooling works. Every major concept is here:
- Declarative infrastructure — describe what you want, generate everything else
- Immutable configs — generated files are outputs, never inputs
- Policy as code — OPA enforces safety standards that can't be bypassed
- Observability — Prometheus metrics feed the dashboard and the policy engine
- Audit trail — every event recorded, every violation surfaced
The combination of Stage 4A and 4B forms a complete deployment lifecycle: generate → validate → deploy (gated) → promote (gated) → observe → audit → tear down.
The full source code is available at: https://github.com/AirFluke/hng-swiftdeploy
Tags: #devops #docker #nginx #python #opa #prometheus #infrastructure #hng
