Abraham Acha
SwiftDeploy: Building a Self-Writing Infrastructure Manager with Policy Enforcement — A Complete Technical Walkthrough

How I built a CLI tool that generates its own infrastructure configs, manages a full containerised stack, enforces deployment policies through OPA, exposes Prometheus metrics, and produces a live audit trail — all from a single YAML file.


Table of Contents

  1. The Problem We're Solving
  2. Architecture Overview
  3. Stage 4A — The Engine
  4. Stage 4B — The Eyes and Brain
  5. The Debugging Sagas
  6. Full Deployment Walkthrough
  7. Key Lessons Learned

The Problem We're Solving {#the-problem}

Every time you spin up a new service in a real DevOps environment, you repeat the same manual work:

  • Write an Nginx config
  • Write a Docker Compose file
  • Run Docker commands
  • Check if things are healthy
  • Hope nobody deploys when the disk is full
  • Hope nobody promotes a canary that's throwing 60% errors

SwiftDeploy solves all of this. One YAML manifest describes your entire deployment. A CLI tool generates every config file from it, manages the container lifecycle, enforces safety policies before allowing deployments, exposes real-time metrics, and produces a full audit trail.

The golden rule: the manifest is the single source of truth. Everything else is generated.


Architecture Overview {#architecture}

┌─────────────────────────────────────────────────────────────────────────────┐
│                    SwiftDeploy — Full System Architecture                   │
├──────────────────┬──────────────────────────────────────┬───────────────────┤
│   ZONE 1         │          ZONE 2                      │   ZONE 3          │
│   Operator       │     Host Machine / Docker Engine     │   Generated Files │
│                  │                                      │                   │
│  [Operator]      │  ┌─── swiftdeploy-net (bridge) ───┐  │  nginx.conf       │
│      │           │  │                                │  │  docker-compose   │
│      ▼           │  │  [nginx:8080]──────►[app:3000] │  │  history.jsonl    │
│  manifest.yaml   │  │   PUBLIC           INTERNAL    │  │  audit_report.md  │
│  (source of      │  │      │                 │       │  │                   │
│   truth)         │  │      └──[logs vol]─────┘       │  │                   │
│      │           │  │                                │  │                   │
│      ▼           │  │  [opa:8181]                    │  │                   │
│  swiftdeploy     │  │  localhost only                │  │                   │
│  CLI             │  │  NOT via nginx ✗               │  │                   │
│  ├─ init         │  └────────────────────────────────┘  │                   │
│  ├─ validate     │                                      │                   │
│  ├─ deploy ──────┼──► pre-deploy: OPA infra check       │                   │
│  ├─ promote ─────┼──► pre-promote: OPA canary check     │                   │
│  ├─ status ──────┼──► scrapes /metrics every 5s ───────►│ history.jsonl     │
│  ├─ audit ───────┼────────────────────────────────────►│ audit_report.md   │
│  └─ teardown     │                                      │                   │
└──────────────────┴──────────────────────────────────────┴───────────────────┘

Stage 4A — The Engine {#stage-4a}

Stage 4A is the foundation. It answers one question: how do you build a tool that writes its own infrastructure configs?

The Project Structure

swiftdeploy/
├── manifest.yaml                    ← the ONLY file you edit
├── swiftdeploy                      ← CLI executable
├── Dockerfile                       ← app image definition
├── app/
│   └── main.py                      ← Python HTTP service
├── templates/
│   ├── nginx.conf.tmpl              ← nginx template
│   └── docker-compose.yml.tmpl     ← compose template
├── policies/                        ← Stage 4B addition
│   ├── infrastructure.rego
│   ├── canary.rego
│   └── data.json
├── nginx.conf                       ← generated (gitignored)
└── docker-compose.yml               ← generated (gitignored)

The Manifest {#the-manifest}

manifest.yaml is the brain of the entire system. Every component reads from it directly or via generated files.

services:
  image: swift-deploy-1-node:latest
  port: 3000
  mode: stable                    # stable or canary
  version: "1.0.0"
  restart_policy: unless-stopped
  log_volume: swiftdeploy-logs

nginx:
  image: nginx:latest
  port: 8080
  proxy_timeout: 30

opa:
  image: openpolicyagent/opa:latest-static
  port: 8181

network:
  name: swiftdeploy-net
  driver_type: bridge

contact: "ops@swiftdeploy.local"

Every field propagates through the system. Change nginx.proxy_timeout here and it updates in nginx.conf on the next init. Change services.mode here and the entire deployment mode switches on the next promote.
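To make the propagation concrete, here is an illustrative sketch (not the actual CLI code) of loading this manifest and flattening it into the context dict the templates consume. The parser handles only the flat two-level YAML shown above, and the `build_context` key names are assumptions inferred from the template placeholders that appear later in this post:

```python
def parse_manifest(text):
    """Minimal stdlib-only parser for the flat two-level manifest shown above."""
    manifest, section = {}, None
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        key, _, value = line.partition(":")
        value = value.split("#")[0].strip().strip('"')   # drop inline comments/quotes
        if not line.startswith(" "):                     # top-level key
            if value:
                manifest[key.strip()] = value            # e.g. contact
            else:
                section = key.strip()                    # e.g. services:, nginx:
                manifest[section] = {}
        else:                                            # nested key under a section
            manifest[section][key.strip()] = value
    return manifest

def build_context(manifest):
    # Key names assumed from the {{ placeholders }} in the templates below
    return {
        "service_port":  manifest["services"]["port"],
        "nginx_port":    manifest["nginx"]["port"],
        "proxy_timeout": manifest["nginx"]["proxy_timeout"],
        "contact":       manifest["contact"],
    }

manifest = parse_manifest("""\
services:
  port: 3000
  mode: stable                    # stable or canary
nginx:
  port: 8080
  proxy_timeout: 30
contact: "ops@swiftdeploy.local"
""")
ctx = build_context(manifest)
```

Change `proxy_timeout` in the text and the rendered `ctx` changes with it, which is exactly the single-source-of-truth property described above.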


The Python HTTP Service {#the-python-service}

The app is a from-scratch HTTP server using only Python's stdlib — no Flask, no FastAPI. Three endpoints in Stage 4A, four in Stage 4B.

Configuration from environment

MODE        = os.environ.get("MODE", "stable")
APP_VERSION = os.environ.get("APP_VERSION", "1.0.0")
APP_PORT    = int(os.environ.get("APP_PORT", "3000"))
START_TIME  = time.time()

Configuration comes entirely from environment variables injected by Docker Compose at runtime. START_TIME is captured at module load — this is how /healthz calculates uptime without a database.

Thread-safe chaos state

chaos_lock = threading.Lock()
chaos_state = {"mode": None, "duration": None, "rate": None}

def get_chaos():
    with chaos_lock:
        return dict(chaos_state)  # returns a copy — callers can't mutate internal state

def set_chaos(state):
    with chaos_lock:
        chaos_state.update(state)

The Lock prevents race conditions when multiple requests read and write chaos state simultaneously. dict(chaos_state) returns a copy so the caller never holds a reference to the mutable internal dict.
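A quick self-contained demo of why the copy matters — mutating the snapshot returned by `get_chaos()` leaves the internal state untouched:

```python
import threading

chaos_lock = threading.Lock()
chaos_state = {"mode": None, "duration": None, "rate": None}

def get_chaos():
    with chaos_lock:
        return dict(chaos_state)     # copy, not a reference

def set_chaos(state):
    with chaos_lock:
        chaos_state.update(state)

set_chaos({"mode": "slow", "duration": 2})
snapshot = get_chaos()
snapshot["mode"] = "error"           # mutate the copy...
unchanged = get_chaos()["mode"]      # ...internal state is still "slow"
```

Returning `chaos_state` directly instead would let any caller bypass the lock entirely.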

The three Stage 4A endpoints

GET / — welcome

self.send_json(200, {
    "message": "Welcome to SwiftDeploy API",
    "mode": MODE,
    "version": APP_VERSION,
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
})

GET /healthz — liveness check

uptime = round(time.time() - START_TIME, 2)
self.send_json(200, {
    "status": "ok",
    "mode": MODE,
    "version": APP_VERSION,
    "uptime_seconds": uptime,
})

The /healthz endpoint does three jobs: proves the server is alive (Docker healthcheck), reports current mode (so promote can confirm the switch happened), and reports uptime (useful for debugging restart loops).

POST /chaos — chaos injection (canary only)

if MODE != "canary":
    self.send_json(403, {"error": "chaos endpoint only available in canary mode"})
    return

length = int(self.headers.get("Content-Length", 0))
data   = json.loads(self.rfile.read(length))
mode   = data.get("mode")

if mode == "slow":
    set_chaos({"mode": "slow", "duration": data.get("duration", 2), "rate": None})
elif mode == "error":
    set_chaos({"mode": "error", "duration": None, "rate": data.get("rate", 0.5)})
elif mode == "recover":
    set_chaos({"mode": None, "duration": None, "rate": None})

Reading Content-Length before calling rfile.read() is required by the HTTP protocol: without an explicit length, read() blocks forever waiting for data that never arrives.

The chaos modes:

  • slow — injects time.sleep(N) before responding, simulating a slow upstream
  • error — uses random.random() < rate to return 500 on a configurable percentage of requests
  • recover — clears all chaos state, returning to normal behaviour
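The whole pattern can be exercised end to end with a stripped-down stand-in for the real main.py — a minimal sketch, not the actual service code. It spins up a stdlib HTTP server on a free port, injects error chaos at rate 1.0 (so the failure is deterministic), then recovers:

```python
import json, random, threading, time
import urllib.error, urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

chaos = {"mode": None, "duration": None, "rate": None}

class Handler(BaseHTTPRequestHandler):
    def log_message(self, *args):
        pass  # keep the demo quiet

    def do_GET(self):
        if chaos["mode"] == "slow":
            time.sleep(chaos["duration"])              # simulate a slow upstream
        if chaos["mode"] == "error" and random.random() < chaos["rate"]:
            return self._json(500, {"error": "chaos"})
        self._json(200, {"status": "ok"})

    def do_POST(self):
        # Content-Length must be read first; a bare rfile.read() would
        # block forever waiting for an EOF the client never sends.
        length = int(self.headers.get("Content-Length", 0))
        data = json.loads(self.rfile.read(length))
        if data.get("mode") == "recover":
            chaos.update({"mode": None, "duration": None, "rate": None})
        else:
            chaos.update({"mode": data.get("mode"),
                          "duration": data.get("duration"),
                          "rate": data.get("rate")})
        self._json(200, dict(chaos))

    def _json(self, code, obj):
        body = json.dumps(obj).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

server = HTTPServer(("127.0.0.1", 0), Handler)         # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

def post(payload):
    req = urllib.request.Request(base, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    return urllib.request.urlopen(req)

post({"mode": "error", "rate": 1.0})                   # every request now fails
try:
    urllib.request.urlopen(base)
except urllib.error.HTTPError as e:
    assert e.code == 500                               # chaos is live

post({"mode": "recover"})                              # clear chaos state
status = urllib.request.urlopen(base).status
```

With rate 1.0 the error injection is deterministic, which makes the demo reproducible; the real endpoint uses fractional rates to simulate partial failure.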

The Dockerfile {#the-dockerfile}

FROM python:3.12-alpine

RUN addgroup -S appgroup && adduser -S appuser -G appgroup

WORKDIR /app
COPY app/main.py .
RUN chown -R appuser:appgroup /app

USER appuser

ENV MODE=stable
ENV APP_VERSION=1.0.0
ENV APP_PORT=3000

EXPOSE 3000

HEALTHCHECK --interval=10s --timeout=5s --start-period=15s --retries=5 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:3000/healthz', timeout=4)" || exit 1

CMD ["python", "main.py"]

Why Alpine? python:3.12-alpine is ~60MB. python:3.12 (Debian) is ~1GB. We need under 300MB.

Why non-root? If someone exploits the app, they get a powerless user, not root access to the server.

Why Python urllib for the healthcheck? This was a hard-won lesson. wget couldn't resolve localhost inside Alpine on WSL2 + Docker Desktop — a known network namespace quirk. Python's urllib bypasses the system resolver entirely and connects directly via socket. More reliable, no external tool needed.


The Templates {#the-templates}

Templates are blueprints with {{ placeholder }} values that the CLI replaces with real values from the manifest.

nginx.conf.tmpl key sections:

upstream app_backend {
    server app:{{ service_port }};
    keepalive 32;
}

log_format swiftdeploy '$time_iso8601 | $status | ${request_time}s | $upstream_addr | $request';

server {
    listen {{ nginx_port }};

    proxy_connect_timeout {{ proxy_timeout }}s;
    proxy_send_timeout    {{ proxy_timeout }}s;
    proxy_read_timeout    {{ proxy_timeout }}s;

    add_header X-Deployed-By swiftdeploy always;
    proxy_pass_header X-Mode;

    location @error502 {
        default_type application/json;
        return 502 '{"error":"Bad Gateway","code":502,"service":"app","contact":"{{ contact }}"}';
    }
}

Design decisions:

  • Custom log format — ISO timestamp, status code, response time, upstream IP, full request on one line
  • JSON error bodies — APIs need machine-readable errors, not nginx HTML pages
  • proxy_pass_header X-Mode — nginx strips custom headers by default; this forwards the canary header through to clients
  • keepalive 32 — maintains 32 persistent connections to the upstream, reducing connection overhead

docker-compose.yml.tmpl key sections:

app:
  expose:
    - "{{ service_port }}"    # container-to-container only, NEVER published to host

nginx:
  ports:
    - "{{ nginx_port }}:{{ nginx_port }}"    # only nginx faces the world

  depends_on:
    app:
      condition: service_healthy    # nginx waits for app healthcheck to pass

The expose vs ports distinction is a security boundary. expose = container-to-container only. ports = host-facing. The app is never reachable from outside Docker.


The CLI — Five Stage 4A Subcommands {#the-cli}

The template engine

def render_template(tmpl_path, context):
    with open(tmpl_path) as f:
        content = f.read()
    for key, val in context.items():
        content = content.replace("{{ " + key + " }}", str(val))
    return content

Six lines. No Jinja2. Simple string replacement. The templates are straightforward enough that a minimal custom engine is cleaner than pulling in a dependency.
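Here is the same substitution in action on an inline template string (file I/O dropped so the example is self-contained):

```python
def render_template_string(content, context):
    # Same substitution render_template performs, minus the file read
    for key, val in context.items():
        content = content.replace("{{ " + key + " }}", str(val))
    return content

rendered = render_template_string(
    "listen {{ nginx_port }};\nproxy_read_timeout {{ proxy_timeout }}s;",
    {"nginx_port": 8080, "proxy_timeout": 30},
)
print(rendered)
# listen 8080;
# proxy_read_timeout 30s;
```

Note the fixed "{{ key }}" spacing: placeholders must be written exactly that way in the templates, which is the price of skipping a real template engine.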

init — generates everything from the manifest

def cmd_init():
    manifest = load_manifest()
    ctx      = build_context(manifest)

    nginx_conf   = render_template(NGINX_TMPL, ctx)
    compose_conf = render_template(COMPOSE_TMPL, ctx)

    with open(NGINX_OUT, "w") as f:   f.write(nginx_conf)
    with open(COMPOSE_OUT, "w") as f: f.write(compose_conf)

The grader deletes the generated files and re-runs init to verify regeneration. Because every generated file is derived from the manifest, regeneration is a non-issue.

validate — five pre-flight checks

# Check 1: manifest.yaml exists and is valid YAML
# Check 2: all required fields present and non-empty
# Check 3: docker image inspect — exits 0 if exists
# Check 4: ss -tlnp | grep :8080 — non-empty means port in use
# Check 5: nginx -t via isolated Docker container

Check 5 is the most interesting. We run nginx -t in a temporary container, validating syntax without needing nginx installed on the host. One complication: the upstream hostname app:3000 can't be resolved in an isolated container. The fix is to swap server app: for server 127.0.0.1: in a temporary copy before testing; the actual nginx.conf on disk is untouched.

test_content = data.replace("server app:", "server 127.0.0.1:")

deploy — brings up the stack and blocks until healthy

deadline = time.time() + 60
while time.time() < deadline:
    try:
        with urllib.request.urlopen(f"http://localhost:{nginx_port}/healthz", timeout=3) as resp:
            if json.loads(resp.read()).get("status") == "ok":
                healthy = True
                break
    except Exception:
        pass    # container still starting — connection refused is normal
    time.sleep(2)

docker compose up -d returns as soon as containers are created, not when they're healthy. The polling loop tries /healthz every 2 seconds for 60 seconds. Only {"status": "ok"} breaks the loop.
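The loop factors naturally into a reusable helper. This is a hypothetical refactor (the function name and signature are mine, not SwiftDeploy's) with the probe injected so the logic can be tested without a running container:

```python
import time

def wait_until_healthy(probe, timeout=60, interval=2):
    """Poll probe() until it returns truthy or timeout seconds elapse."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if probe():
                return True
        except Exception:
            pass  # connection refused while the container starts is expected
        time.sleep(interval)
    return False

# Simulated probe: healthy on the third attempt
attempts = {"n": 0}
def fake_probe():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionRefusedError
    return True

ok = wait_until_healthy(fake_probe, timeout=5, interval=0)
```

Swallowing exceptions inside a retry loop is normally a smell, but here "connection refused" is the expected state during startup, so it is the correct behaviour.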

promote — rolling restart with zero nginx downtime

# 1. Update manifest in-place
content = re.sub(r"(mode:\s*)(\S+)", f"\\g<1>{target_mode}", content, count=1)

# 2. Regenerate docker-compose.yml with new MODE env var

# 3. Restart ONLY the app container — nginx stays up
run(compose_cmd("up -d --no-deps app"))

# 4. Confirm mode via /healthz

--no-deps is the key — it tells Compose to restart only the app service without touching nginx. Zero proxy downtime during the switch.


The Two Deployment Modes {#deployment-modes}

Stable mode — normal production behaviour. Clean responses. No special headers. /chaos returns 403.

Canary mode — test mode before full rollout:

if MODE == "canary":
    self.send_header("X-Mode", "canary")    # on EVERY response

Canary mode:

  • Adds X-Mode: canary to every response — callers can identify which mode they're hitting
  • Unlocks /chaos — lets you simulate slow responses, random errors, then recover
  • You promote with ./swiftdeploy promote canary, stress test, then ./swiftdeploy promote stable to roll back

Stage 4B — The Eyes and The Brain {#stage-4b}

Stage 4A built the engine. Stage 4B adds observability and policy enforcement. The stack now has eyes (metrics), a brain (OPA policies), and memory (audit trail).


Prometheus Metrics — The Eyes {#prometheus-metrics}

The app gains a /metrics endpoint exposing five metric types in Prometheus text format.

Tracking infrastructure

# Counter: {(method, path, status_code): count}
request_counts = {}

# Histogram state per path
HISTOGRAM_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
request_durations = {}

def record_request(method, path, status_code, duration_seconds):
    with metrics_lock:
        key = (method, path, str(status_code))
        request_counts[key] = request_counts.get(key, 0) + 1

        if path not in request_durations:
            request_durations[path] = {
                "buckets": {str(le): 0 for le in HISTOGRAM_BUCKETS},
                "sum": 0.0,
                "count": 0,
            }
        hist = request_durations[path]
        hist["sum"] += duration_seconds
        hist["count"] += 1
        for le in HISTOGRAM_BUCKETS:
            if duration_seconds <= le:
                hist["buckets"][str(le)] += 1

record_request() is called after every request regardless of path — timing wraps the entire handler:

def do_GET(self):
    start  = time.time()
    path   = self.path.split("?")[0]
    status = self._handle_get()
    record_request("GET", path, status, time.time() - start)

The five metrics

# 1. http_requests_total — counter, labels: method, path, status_code
# 2. http_request_duration_seconds — histogram with standard buckets
# 3. app_uptime_seconds — gauge
# 4. app_mode — gauge: 0=stable, 1=canary
# 5. chaos_active — gauge: 0=none, 1=slow, 2=error
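Rendering these into the Prometheus text exposition format is plain string building. Here is a hedged sketch of the counter section only (function name mine; metric and label names taken from the sample output below):

```python
def render_counter(name, help_text, counts):
    """Render a labelled counter in Prometheus text exposition format.

    counts maps (method, path, status_code) tuples to request counts,
    mirroring the request_counts dict shown above.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for (method, path, status), count in sorted(counts.items()):
        lines.append(
            f'{name}{{method="{method}",path="{path}",status_code="{status}"}} {count}'
        )
    return "\n".join(lines)

text = render_counter(
    "http_requests_total",
    "Total number of HTTP requests",
    {("GET", "/", "200"): 42, ("GET", "/healthz", "200"): 60},
)
```

Sorting the label tuples keeps the output stable between scrapes, which makes diffs and tests predictable.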

The /metrics output

# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/",status_code="200"} 42
http_requests_total{method="GET",path="/healthz",status_code="200"} 60
http_requests_total{method="GET",path="/",status_code="500"} 38

# HELP http_request_duration_seconds HTTP request latency in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{path="/",le="0.005"} 40
http_request_duration_seconds_bucket{path="/",le="+Inf"} 80
http_request_duration_seconds_sum{path="/"} 0.042381
http_request_duration_seconds_count{path="/"} 80

# HELP app_mode Current deployment mode (0=stable, 1=canary)
# TYPE app_mode gauge
app_mode 1

# HELP chaos_active Current chaos state (0=none, 1=slow, 2=error)
# TYPE chaos_active gauge
chaos_active 2

OPA Policy Engine — The Brain {#opa-policy-engine}

Open Policy Agent is a dedicated container whose only job is making allow/deny decisions based on rules written in Rego. The core principle: the CLI never makes allow/deny decisions itself. All decision logic lives exclusively in OPA.

Why OPA instead of if/else in the CLI?

If the policy logic lives in the CLI, changing a threshold means editing Python code, rebuilding, redeploying. With OPA, you edit data.json, restart the OPA container, and the new threshold is live. Policy as code, not policy as application logic.

data.json — thresholds live here, never hardcoded in Rego

{
  "thresholds": {
    "min_disk_free_gb": 10.0,
    "max_cpu_load": 2.0,
    "min_mem_free_percent": 10.0,
    "max_error_rate_percent": 1.0,
    "max_p99_latency_ms": 500.0
  }
}

infrastructure.rego — pre-deploy policy

package swiftdeploy.infrastructure

import future.keywords.if
import future.keywords.contains

default allow := false

allow if {
    disk_ok
    cpu_ok
    mem_ok
}

disk_ok if { input.disk_free_gb >= data.thresholds.min_disk_free_gb }
cpu_ok  if { input.cpu_load_1m  <= data.thresholds.max_cpu_load }
mem_ok  if { input.mem_free_percent >= data.thresholds.min_mem_free_percent }

reasons contains msg if {
    not disk_ok
    msg := sprintf(
        "disk_free_gb is %.1f, minimum required is %.1f",
        [input.disk_free_gb, data.thresholds.min_disk_free_gb]
    )
}

# decision is what the CLI reads — never a bare boolean
decision := {
    "allow":   allow,
    "reasons": reasons,
    "domain":  "infrastructure",
    "checked": {
        "disk_free_gb":     input.disk_free_gb,
        "cpu_load_1m":      input.cpu_load_1m,
        "mem_free_percent": input.mem_free_percent,
    },
}

Why import future.keywords? The openpolicyagent/opa:latest-static image uses Rego v1 which requires explicit if and contains keywords. Without these imports, OPA crashes on startup. This was discovered the hard way during testing.

canary.rego — pre-promote policy

package swiftdeploy.canary

import future.keywords.if
import future.keywords.contains

default allow := false

allow if {
    error_rate_ok
    latency_ok
}

error_rate_ok if { input.error_rate_percent <= data.thresholds.max_error_rate_percent }
latency_ok    if { input.p99_latency_ms     <= data.thresholds.max_p99_latency_ms }

reasons contains msg if {
    not error_rate_ok
    msg := sprintf(
        "error_rate is %.2f%%, maximum allowed is %.2f%%",
        [input.error_rate_percent, data.thresholds.max_error_rate_percent]
    )
}

decision := {
    "allow":   allow,
    "reasons": reasons,
    "domain":  "canary",
    "checked": {
        "error_rate_percent": input.error_rate_percent,
        "p99_latency_ms":     input.p99_latency_ms,
        "window_seconds":     input.window_seconds,
    },
}

OPA isolation — no leakage via nginx

In docker-compose.yml.tmpl:

opa:
  ports:
    - "127.0.0.1:{{ opa_port }}:8181"    # localhost only — never 0.0.0.0

127.0.0.1:8181 means only the host machine can reach OPA. The nginx container on port 8080 has no route to OPA. This is enforced at the Docker network binding level, not just convention.

The policy query function

def query_opa(manifest, package, input_data):
    url     = f"{opa_url(manifest)}/v1/data/{package.replace('.', '/')}/decision"
    payload = json.dumps({"input": input_data}).encode()
    try:
        req = urllib.request.Request(url, data=payload,
              headers={"Content-Type": "application/json"}, method="POST")
        with urllib.request.urlopen(req, timeout=5) as resp:
            body   = json.loads(resp.read())
            result = body.get("result")
            if result is None:
                return None, "OPA returned empty result — check policy package name"
            return result, None
    except urllib.error.URLError as e:
        return None, f"OPA unreachable: {e.reason}"
    except Exception as e:
        return None, f"OPA query failed: {e}"

Every distinct failure mode returns a different error string. The CLI never crashes or hangs when OPA is unavailable — it warns and fails open. This is intentional: you don't want OPA unavailability to block emergency deployments.
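The enforcement wrapper around query_opa can be sketched like this — signature and name are my assumptions, with the query function injected so the fail-open gate can be exercised without a live OPA container:

```python
def enforce_policy(query, package, input_data, label, log=print):
    """Ask OPA for a decision; fail OPEN if OPA itself is unreachable.

    `query` is any callable following query_opa's (result, error) contract,
    injected here so the gate can be tested with fakes.
    """
    decision, err = query(package, input_data)
    if err is not None:
        log(f"  ! WARNING: {err} -- proceeding without policy check")
        return True                          # fail open: OPA outage never blocks
    if decision.get("allow"):
        log(f"  [PASS] policy [{label}]")
        return True
    log(f"  [DENY] policy [{label}]")
    for reason in decision.get("reasons", []):
        log(f"    ! {reason}")
    return False

# Exercise the gate with fake OPA responses
deny = lambda pkg, inp: (
    {"allow": False, "reasons": ["disk_free_gb is 8.2, minimum required is 10.0"]},
    None,
)
down = lambda pkg, inp: (None, "OPA unreachable: connection refused")

blocked = enforce_policy(deny, "swiftdeploy.infrastructure", {}, "infrastructure",
                         log=lambda *_: None)
open_anyway = enforce_policy(down, "swiftdeploy.infrastructure", {}, "infrastructure",
                             log=lambda *_: None)
```

Note the asymmetry: an explicit deny from OPA blocks the deploy, but an unreachable OPA does not — the two failure modes are deliberately treated differently.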


Gated Lifecycle — The CLI Brain {#gated-lifecycle}

Pre-deploy check

def cmd_deploy():
    manifest   = load_manifest()
    host_stats = get_host_stats()

    # Collect host stats
    disk_free_gb     = shutil.disk_usage("/").free / (1024 ** 3)
    cpu_load_1m      = float(open("/proc/loadavg").read().split()[0])
    mem_free_percent = (meminfo["MemAvailable"] / meminfo["MemTotal"]) * 100

    # Send to OPA
    allowed = enforce_policy(manifest, "swiftdeploy.infrastructure", host_stats, "infrastructure")

    if not allowed:
        append_history({"event": "deploy_blocked", "reason": "infrastructure_policy"})
        sys.exit(1)

    # Only reach here if OPA allows
    run(compose_cmd("up -d --build"))

If the disk is full, OPA returns:

  ✘ Policy [infrastructure] DENIED
    ! disk_free_gb is 8.2, minimum required is 10.0

  Deployment blocked by policy: infrastructure

Pre-promote check

def cmd_promote(target_mode):
    if target_mode == "canary":
        raw        = scrape_metrics(nginx_port)
        metrics    = parse_prometheus(raw)
        error_rate = calculate_error_rate(metrics)
        p99_ms     = calculate_p99_latency_ms(metrics)

        allowed = enforce_policy(manifest, "swiftdeploy.canary",
            {"error_rate_percent": error_rate, "p99_latency_ms": p99_ms, "window_seconds": 30},
            "canary safety"
        )
        if not allowed:
            sys.exit(1)

P99 latency calculation from histogram

def calculate_p99_latency_ms(metrics, path_filter=None):
    buckets     = {}
    total_count = 0

    for entry in metrics.get("http_request_duration_seconds_bucket", []):
        le = entry["labels"].get("le", "")
        if le == "+Inf":
            total_count = max(total_count, entry["value"])
            continue
        buckets[float(le)] = buckets.get(float(le), 0) + entry["value"]

    if total_count == 0:
        return 0.0

    p99_threshold = total_count * 0.99
    for le in sorted(buckets.keys()):
        if buckets[le] >= p99_threshold:
            return round(le * 1000, 2)    # seconds → milliseconds
    return 10000.0

P99 is the latency below which 99% of requests complete; with histogram data it is approximated by the smallest bucket boundary that covers 99% of the total count. If 99 out of 100 requests finished within 250ms, P99 = 250ms.
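A worked example makes the bucket walk concrete. This reuses the same logic over pre-parsed bucket entries (the helper name is mine): 100 requests, of which 99 land at or below the 0.25s boundary and the slowest falls in the 5s bucket.

```python
def p99_ms(bucket_entries):
    # Same logic as calculate_p99_latency_ms, over pre-parsed entries
    buckets, total = {}, 0
    for entry in bucket_entries:
        le = entry["labels"]["le"]
        if le == "+Inf":
            total = max(total, entry["value"])
            continue
        buckets[float(le)] = buckets.get(float(le), 0) + entry["value"]
    if total == 0:
        return 0.0
    threshold = total * 0.99
    for le in sorted(buckets):
        if buckets[le] >= threshold:
            return round(le * 1000, 2)      # seconds -> milliseconds
    return 10000.0

# Cumulative bucket counts, as Prometheus histograms report them
entries = [
    {"labels": {"le": "0.1"},  "value": 80},
    {"labels": {"le": "0.25"}, "value": 99},
    {"labels": {"le": "5.0"},  "value": 100},
    {"labels": {"le": "+Inf"}, "value": 100},
]
print(p99_ms(entries))  # → 250.0
```

The 0.1s bucket holds only 80 of the required 99 requests, so the walk continues to 0.25s, which is the first boundary meeting the threshold — hence 250.0ms.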


The Status Dashboard — The Eyes {#status-dashboard}

def cmd_status():
    while True:
        raw        = scrape_metrics(nginx_port)
        metrics    = parse_prometheus(raw)
        error_rate = calculate_error_rate(metrics)
        p99_ms     = calculate_p99_latency_ms(metrics)

        # Query OPA for live compliance
        infra_dec,  _ = query_opa(manifest, "swiftdeploy.infrastructure", get_host_stats())
        canary_dec, _ = query_opa(manifest, "swiftdeploy.canary",
                            {"error_rate_percent": error_rate, "p99_latency_ms": p99_ms, "window_seconds": 30})

        os.system("clear")
        # ... render dashboard ...

        append_history({
            "event":              "status_scrape",
            "error_rate_percent": error_rate,
            "p99_latency_ms":     p99_ms,
            "mode":               mode_str,
            "chaos":              chaos_str,
            "policy_infra_pass":  infra_dec.get("allow") if infra_dec else None,
            "policy_canary_pass": canary_dec.get("allow") if canary_dec else None,
        })

        time.sleep(5)

What the dashboard looks like with chaos active:

  SwiftDeploy Status Dashboard  2026-05-05T21:00:37Z
  ────────────────────────────────────────────────

  ── Throughput ──────────────────────────────────
  req/s       : 2.4
  error rate  : 56.45%      ← red
  P99 latency : 5.0ms

  ── App State ───────────────────────────────────
  mode        : canary
  chaos       : error       ← red
  uptime      : 316s

  ── Policy Compliance ───────────────────────────
  ✔  infrastructure       PASS
  ✘  canary               FAIL
       ! error_rate is 56.45%, maximum allowed is 1.00%

  Refreshing every 5s — Ctrl+C to exit

This is exactly what real SRE dashboards do — they show you the current state AND whether it violates policy in real time.


The Audit Trail — The Memory {#audit-trail}

Every event appends a JSON line to history.jsonl:

{"timestamp":"2026-05-05T20:34:51Z","event":"deploy","mode":"stable"}
{"timestamp":"2026-05-05T20:55:22Z","event":"promote","target_mode":"canary"}
{"timestamp":"2026-05-05T20:55:23Z","event":"status_scrape","error_rate_percent":62.5,"chaos":"error","policy_canary_pass":false}
{"timestamp":"2026-05-05T21:01:01Z","event":"promote","target_mode":"stable"}

swiftdeploy audit reads this file and generates audit_report.md:

## Mode Changes

| Timestamp | From | To |
|-----------|------|----|
| 2026-05-05T20:34:51Z | unknown | stable |
| 2026-05-05T20:55:22Z | stable  | canary |
| 2026-05-05T21:01:01Z | canary  | stable |

## Policy Violations

| Timestamp | Infrastructure | Canary | Error Rate | P99 |
|-----------|---------------|--------|------------|-----|
| 2026-05-05T20:55:23Z | ✔ PASS | ✘ FAIL | 62.5% | 5.0ms |
| 2026-05-05T20:55:28Z | ✔ PASS | ✘ FAIL | 63.6% | 5.0ms |

This report renders perfectly as GitHub Flavored Markdown — every table, every checkmark, every timestamp.
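The Mode Changes table above can be derived from history.jsonl in a few lines. A minimal sketch (function name mine, event shapes taken from the history lines shown above):

```python
import json

def mode_change_table(jsonl_text):
    """Build the Mode Changes markdown table from history.jsonl lines."""
    rows, prev = [], "unknown"
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        mode = event.get("target_mode") or event.get("mode")
        if event.get("event") in ("deploy", "promote") and mode:
            rows.append(f"| {event['timestamp']} | {prev} | {mode} |")
            prev = mode
    return "\n".join(
        ["| Timestamp | From | To |", "|-----------|------|----|"] + rows
    )

history = (
    '{"timestamp":"2026-05-05T20:34:51Z","event":"deploy","mode":"stable"}\n'
    '{"timestamp":"2026-05-05T20:55:22Z","event":"promote","target_mode":"canary"}\n'
)
table = mode_change_table(history)
```

Because the log is append-only JSON lines, a single pass with a `prev` variable is enough to reconstruct every transition; status_scrape events are skipped by the event-type filter.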


The Debugging Sagas {#debugging}

No real DevOps project ships without war stories. Here are the ones that taught the most.

Saga 1 — Six layers of healthcheck failure

The app container kept showing unhealthy despite the server running fine. The debugging sequence:

Failure 1: ${APP_PORT} doesn't expand in Dockerfile HEALTHCHECK CMD. Env vars evaluate at build time, not runtime. Fixed by hardcoding 3000.

Failure 2: localhost doesn't resolve inside Alpine's healthcheck context on WSL2 + Docker Desktop. Fixed by using 127.0.0.1.

Failure 3: wget with 127.0.0.1 still failed. Confirmed the server WAS listening:

docker exec swiftdeploy-app ss -tlnp
# tcp LISTEN 0.0.0.0:3000
docker exec swiftdeploy-app python -c "import urllib.request; print(urllib.request.urlopen('http://127.0.0.1:3000/healthz').read())"
# b'{"status": "ok", ...}'   ← works via exec, not via healthcheck

This is a known WSL2 + Docker Desktop network namespace issue. Fixed by using Python's urllib instead of wget.

Failure 4: Docker cache was serving the old image despite the Dockerfile fix. Fixed with --no-cache.

Failure 5: The docker-compose.yml template had its own healthcheck block overriding the Dockerfile. Docker Compose healthcheck always wins. Fixed the template too.

Failure 6: The healthcheck YAML block had 3 spaces indent instead of 4. A single space difference caused a YAML parse error. Fixed by carefully rewriting the block.

Saga 2 — OPA Rego v1 syntax

The openpolicyagent/opa:latest-static image enforces strict Rego v1 syntax. Our policies used the older syntax:

# OLD — crashes on latest OPA
allow {
    disk_ok
}
reasons[msg] {
    not disk_ok
    msg := "..."
}
# NEW — Rego v1 required syntax
allow if {
    disk_ok
}
reasons contains msg if {
    not disk_ok
    msg := "..."
}

Without import future.keywords.if and import future.keywords.contains at the top of each file, OPA refuses to start.

Saga 3 — WSL2 path spaces breaking docker run

The project lived at /mnt/c/Users/RAZER BLADE/Desktop/HNG/hng-swiftdeploy. The space in RAZER BLADE caused docker run -v {path}:... to split the path at the space, making Docker interpret the second half as an image name.

docker: invalid reference format: repository name
(Desktop/HNG/hng-swiftdeploy/nginx.conf) must be lowercase

Fixed by quoting all paths containing the project directory and using subprocess.run with a list instead of shell=True to avoid shell word-splitting entirely.
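The list-form fix can be demonstrated without Docker — here `ls` stands in for `docker run`, and the temp directory name deliberately contains a space, as the project path did:

```python
import os
import subprocess
import tempfile

# With shell=True, a path containing a space is word-split into two
# arguments by the shell. Passing argv as a LIST hands each element to
# the program verbatim: no shell, no splitting, no quoting headaches.
workdir = tempfile.mkdtemp(prefix="RAZER BLADE ")   # path with a space
target = os.path.join(workdir, "nginx.conf")
with open(target, "w") as f:
    f.write("# generated\n")

# Each list element is one argv entry, space and all
result = subprocess.run(["ls", workdir], capture_output=True, text=True)
```

The same principle applies to the real invocation: building `["docker", "run", "-v", f"{path}:/etc/nginx/nginx.conf:ro", ...]` as a list keeps the volume mount intact no matter what characters the path contains.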


Full Deployment Walkthrough {#deployment}

# 1. Build the image
docker build -t swift-deploy-1-node:latest .

# 2. Validate pre-flight checks
./swiftdeploy validate

# 3. Deploy (OPA policy check runs first)
./swiftdeploy deploy

# 4. Verify metrics
curl http://localhost:8080/metrics

# 5. Verify OPA isolation
curl http://127.0.0.1:8181/health    # works — internal
curl http://localhost:8080/v1/data   # 404 — nginx blocks it

# 6. Launch status dashboard
./swiftdeploy status

# 7. Promote to canary (OPA canary policy check runs first)
./swiftdeploy promote canary

# 8. Inject chaos
curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode": "error", "rate": 0.5}'

# 9. Watch status dashboard catch it — canary policy FAIL visible in real time

# 10. Recover
curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode": "recover"}'

# 11. Promote back to stable
./swiftdeploy promote stable

# 12. Generate audit report
./swiftdeploy audit
cat audit_report.md

# 13. Teardown
./swiftdeploy teardown --clean

Key Lessons Learned {#lessons}

1. Docker Compose healthcheck overrides Dockerfile HEALTHCHECK. Always check both places when healthchecks misbehave. Compose wins every time.

2. WSL2 has a different network namespace for healthchecks than for docker exec. If something works via exec but not via healthcheck, it's almost certainly a tool or network namespace issue. Python's stdlib is more portable than wget in this environment.

3. OPA Rego v1 requires explicit keywords. latest-static means the latest OPA — which enforces Rego v1 syntax. Always import future.keywords.if and import future.keywords.contains.

4. expose vs ports is a security boundary, not documentation. expose = container-to-container only. ports = host-facing. Binding OPA to 127.0.0.1 enforces isolation at the network level.

5. The CLI should never make policy decisions. Every time you add an if/else for a deployment condition in the CLI, you're doing OPA's job badly. Push all allow/deny logic into Rego. The CLI's job is to collect data and surface decisions.

6. P99 latency is more useful than average. An average latency of 10ms can hide the fact that 1 in 100 requests takes 5 seconds. P99 exposes that tail. Always instrument histograms, not just averages.

7. Declarative infrastructure pays off immediately. The grader deletes generated files and re-runs init. Because the manifest is always there and regeneration is instantaneous, this is a non-issue. Manual configs would have been a problem.

8. An audit trail is not optional. history.jsonl made it trivial to answer "when did chaos start?", "which policy was failing?", "how long was the canary running before we promoted?" These questions matter in production incidents.


Conclusion

SwiftDeploy started as a task requirement and became a complete mental model for how modern deployment tooling works. Every major concept is here:

  • Declarative infrastructure — describe what you want, generate everything else
  • Immutable configs — generated files are outputs, never inputs
  • Policy as code — OPA enforces safety standards that can't be bypassed
  • Observability — Prometheus metrics feed the dashboard and the policy engine
  • Audit trail — every event recorded, every violation surfaced

The combination of Stage 4A and 4B forms a complete deployment lifecycle: generate → validate → deploy (gated) → promote (gated) → observe → audit → tear down.

The full source code is available at: https://github.com/AirFluke/hng-swiftdeploy


Tags: #devops #docker #nginx #python #opa #prometheus #infrastructure #hng
