DEV Community

instanceofGod

SwiftDeploy: Building a Self-Writing Infrastructure Tool with OPA Policy Gates and Prometheus Observability

A complete walkthrough of building a declarative deployment CLI from scratch — one that generates its own config files, enforces policy before acting, and audits everything it does.


Table of Contents

  1. The Problem
  2. Architecture Overview
  3. Part A — The Engine
  4. Part B — The Eyes and the Brain
  5. The Chaos Experiments
  6. Lessons Learned
  7. Replication Guide

The Problem

Most DevOps tooling asks you to write config files by hand. You write a docker-compose.yml, an nginx.conf, maybe a Kubernetes manifest — and then you maintain all of them separately. When something changes, you update three files instead of one, and they drift apart.

SwiftDeploy flips this. You write one file — manifest.yaml — and the tool derives everything else from it. The generated files are artifacts, not sources.

Part B goes further: the tool now refuses to act unless the environment meets policy. It has eyes (Prometheus metrics), a brain (Open Policy Agent), and a memory (an append-only audit trail).


Architecture Overview

┌──────────────────────────────────────────────────────────────┐
│                         Host Machine                         │
│                                                              │
│   swiftdeploy CLI (bash)                                     │
│        │                                                     │
│        ├── reads ──► manifest.yaml                           │
│        ├── writes ─► nginx.conf, docker-compose.yml          │
│        ├── queries ► OPA :8181 (127.0.0.1 only)              │
│        └── scrapes ► /metrics via nginx :8080                │
│                                                              │
│  ┌─────────────────────────┐   ┌──────────────────────┐      │
│  │     swiftdeploy-net     │   │       opa-net        │      │
│  │                         │   │                      │      │
│  │  ┌────────┐  ┌────────┐ │   │  ┌────────────────┐  │      │
│  │  │  app   │  │ nginx  │ │   │  │      OPA       │  │      │
│  │  │ :3000  │◄─│ :8080  │ │   │  │     :8181      │  │      │
│  │  └────────┘  └────────┘ │   │  └────────────────┘  │      │
│  │ (no host port)  (pub)   │   │   (127.0.0.1 only)   │      │
│  └─────────────────────────┘   └──────────────────────┘      │
└──────────────────────────────────────────────────────────────┘

Key isolation properties:

  • The app container has no host-exposed port — all traffic routes through nginx
  • OPA lives on a separate Docker network (opa-net) with no connection to swiftdeploy-net
  • OPA's port is bound to 127.0.0.1 only — the CLI on the host can reach it, but nginx cannot proxy to it
  • No policy logic lives in the CLI — the CLI only sends data and reads decisions

Part A — The Engine

The Manifest: Single Source of Truth

Everything starts with manifest.yaml:

services:
  image: swift-deploy-1-node:latest
  port: 3000
  mode: stable        # stable | canary
  version: "1.0.0"

nginx:
  image: nginx:latest
  port: 8080
  proxy_timeout: 30

network:
  name: swiftdeploy-net
  driver_type: bridge

This is the only file an operator edits. Every other config file is generated from it. The grader's test is simple: delete nginx.conf and docker-compose.yml, run ./swiftdeploy init, and verify they come back correctly. If your tool breaks, your stack breaks.

The Tool That Writes Its Own Infrastructure

The core of swiftdeploy init is a Python template renderer embedded directly in the bash script:

import yaml
from pathlib import Path

with open("manifest.yaml") as f:
    m = yaml.safe_load(f)

ctx = {
    "service_image":  m["services"]["image"],
    "service_port":   m["services"]["port"],
    "mode":           m["services"].get("mode", "stable"),
    "nginx_port":     m["nginx"]["port"],
    "proxy_timeout":  m["nginx"].get("proxy_timeout", 30),
    "network_name":   m["network"]["name"],
    "network_driver": m["network"]["driver_type"],
}

for tmpl, out in [("templates/nginx.conf.j2", "nginx.conf"),
                  ("templates/docker-compose.yml.j2", "docker-compose.yml")]:
    src = Path(tmpl).read_text()
    for k, v in ctx.items():
        src = src.replace("{{ " + k + " }}", str(v))
    Path(out).write_text(src)

The templates use {{ variable }} placeholders. No Jinja2 library needed — a simple string replace is sufficient and keeps the image dependency-free.
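The substitution itself fits in a helper. A minimal, self-contained sketch of the same technique (the render function and the sample template are illustrative, not from the repo):

```python
def render(template: str, ctx: dict) -> str:
    """Replace {{ key }} placeholders with values from ctx."""
    for key, value in ctx.items():
        template = template.replace("{{ " + key + " }}", str(value))
    return template

tmpl = "listen {{ nginx_port }};\nproxy_read_timeout {{ proxy_timeout }}s;"
print(render(tmpl, {"nginx_port": 8080, "proxy_timeout": 30}))
# listen 8080;
# proxy_read_timeout 30s;
```

The trade-off is that unknown placeholders pass through silently, which is acceptable here because the templates and the context dict live in the same repo.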

The nginx template enforces the required access log format:

log_format swiftdeploy '$time_iso8601 | $status | ${request_time}s | $upstream_addr | $request';

And returns structured JSON error bodies for 502/503/504:

location @err502 {
    default_type application/json;
    return 502 '{"error":"Bad Gateway","code":502,"service":"app","contact":"ops@swiftdeploy.local"}';
}

The docker-compose template enforces security hardening on every deploy:

user: "1001:1001"
cap_drop:
  - ALL
expose:
  - "{{ service_port }}"   # internal only — never ports:

The API Service

The service is a pure Python stdlib HTTP server — no Flask, no FastAPI, no external dependencies. This keeps the Docker image under 80MB.

class Handler(BaseHTTPRequestHandler):
    # _json(handler, status, obj) is a small helper defined elsewhere in the
    # service: it writes the status code and a JSON-encoded body.
    def do_GET(self):
        if self.path == "/healthz":
            _json(self, 200, {"status": "ok", "uptime": round(time.time() - START_TIME, 2)})
        elif self.path == "/":
            _json(self, 200, {
                "message": f"Welcome to SwiftDeploy [{MODE}]",
                "mode": MODE,
                "version": VERSION,
                "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            })

Canary mode is controlled entirely by the MODE environment variable injected by docker-compose. The same image runs in both modes — no separate builds.
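A sketch of how the service side might read that variable; MODE and VERSION mirror the handler snippet above, while the APP_MODE_GAUGE name is mine (its value matches the app_mode metric):

```python
import os

# The same image serves both modes; docker-compose injects MODE=canary
# (or stable) into the container environment. Defaults keep local runs sane.
MODE = os.environ.get("MODE", "stable")
VERSION = os.environ.get("VERSION", "1.0.0")

# Exported on /metrics as the app_mode gauge (0=stable, 1=canary).
APP_MODE_GAUGE = 1 if MODE == "canary" else 0
```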

Chaos state is an in-process global dict protected by a threading.Lock:

import random
import threading
import time

_chaos = {"mode": None, "duration": 0, "rate": 0.0}
_chaos_lock = threading.Lock()

def _apply_chaos():
    # Snapshot the state under the lock, but sleep outside it so one slow
    # request doesn't block every other handler thread.
    with _chaos_lock:
        mode, duration, rate = _chaos["mode"], _chaos["duration"], _chaos["rate"]
    if mode == "slow":
        time.sleep(duration)
    elif mode == "error" and random.random() < rate:
        return True   # caller should return 500
    return False

The CLI Subcommands

validate runs 5 pre-flight checks before any deployment:

==> Running pre-flight checks
  [PASS] manifest.yaml exists and is valid YAML
  [PASS] All required fields present and non-empty
  [PASS] Docker image 'swift-deploy-1-node:latest' exists locally
  [PASS] Nginx port 8080 is not already bound
  [PASS] nginx.conf is syntactically valid

  Results: 5 passed, 0 failed
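The second check ("required fields present and non-empty") reduces to a dictionary walk. A hedged sketch, with the field list inferred from the manifest above (the missing_fields helper is mine):

```python
# Field list inferred from manifest.yaml above
REQUIRED = {
    "services": ["image", "port", "version"],
    "nginx": ["port"],
    "network": ["name", "driver_type"],
}

def missing_fields(manifest: dict) -> list:
    """Return dotted paths for every required field that is absent or empty."""
    problems = []
    for section, keys in REQUIRED.items():
        block = manifest.get(section) or {}
        for key in keys:
            if block.get(key) in (None, ""):
                problems.append(f"{section}.{key}")
    return problems

manifest = {
    "services": {"image": "swift-deploy-1-node:latest", "port": 3000,
                 "version": "1.0.0"},
    "nginx": {"port": 8080},
    "network": {"name": "swiftdeploy-net", "driver_type": "bridge"},
}
assert missing_fields(manifest) == []
```

Reporting the dotted path rather than a bare failure keeps the [PASS]/[FAIL] output actionable.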

The nginx syntax check is interesting — it can't just run nginx -t because the app hostname doesn't resolve outside the compose network. The fix: substitute the upstream hostname with 127.0.0.1 in a temp copy before testing:

sed 's|http://app:|http://127.0.0.1:|g' nginx.conf > .nginx-check-tmp.conf
docker run --rm -v "$(pwd)/.nginx-check-tmp.conf:/etc/nginx/conf.d/default.conf:ro" \
  nginx:latest nginx -t

promote does a rolling restart of only the app container:

docker compose up -d --no-deps --force-recreate app

The --no-deps flag is critical — it prevents docker-compose from restarting nginx, which would cause a brief outage. Only the app restarts; nginx keeps serving traffic (returning 502s for the few seconds the app is down, which is expected and handled by the JSON error pages).


Part B — The Eyes and the Brain

Instrumentation: /metrics in Pure Python

Adding Prometheus metrics to a stdlib HTTP server requires no libraries. All state lives in module-level variables protected by a lock:

_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
_req_total = {}       # {(method, path, status_code): count}
_hist_buckets = {le: 0 for le in _BUCKETS}
_hist_inf = 0         # cumulative +Inf bucket
_hist_sum = 0.0
_hist_count = 0

Every request records its duration and increments the appropriate counters:

def _record(method, path, status, duration):
    # Module-level scalars are rebound here, so they must be declared global.
    global _hist_sum, _hist_count, _hist_inf
    key = (method, path, str(status))
    with _metrics_lock:
        _req_total[key] = _req_total.get(key, 0) + 1
        _hist_sum += duration
        _hist_count += 1
        for le in _BUCKETS:
            if duration <= le:
                _hist_buckets[le] += 1
        _hist_inf += 1   # the +Inf bucket is cumulative: every observation counts

The /metrics endpoint serialises this state into Prometheus text format:

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/",status_code="200"} 42
http_requests_total{method="GET",path="/healthz",status_code="200"} 180
# HELP http_request_duration_seconds Request latency histogram
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.005"} 210
http_request_duration_seconds_bucket{le="0.01"} 220
...
http_request_duration_seconds_bucket{le="+Inf"} 222
http_request_duration_seconds_sum 0.441200
http_request_duration_seconds_count 222
# HELP app_uptime_seconds Seconds since process start
# TYPE app_uptime_seconds gauge
app_uptime_seconds 3600.42
# HELP app_mode Current deployment mode (0=stable 1=canary)
# TYPE app_mode gauge
app_mode 1
# HELP chaos_active Active chaos state (0=none 1=slow 2=error)
# TYPE chaos_active gauge
chaos_active 0
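Serialising that state is plain string building. A minimal sketch of the counter section, matching the output above (the render_counters helper is mine):

```python
def render_counters(req_total: dict) -> str:
    """Render http_requests_total in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests",
        "# TYPE http_requests_total counter",
    ]
    # Sorted output keeps the endpoint deterministic across scrapes.
    for (method, path, status), count in sorted(req_total.items()):
        lines.append(
            f'http_requests_total{{method="{method}",path="{path}",'
            f'status_code="{status}"}} {count}'
        )
    return "\n".join(lines) + "\n"

sample = {("GET", "/", "200"): 42, ("GET", "/healthz", "200"): 180}
out = render_counters(sample)
```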

The Policy Sidecar: OPA

OPA runs as a sidecar container. The key design decision is network isolation: OPA lives on opa-net, which has no connection to swiftdeploy-net (where nginx and the app live). The port is bound to 127.0.0.1 only:

opa:
  image: openpolicyagent/opa:latest
  command: ["run", "--server", "--addr", "0.0.0.0:8181", "/policies"]
  volumes:
    - ./policies:/policies:ro
  networks:
    - opa-net
  ports:
    - "127.0.0.1:8181:8181"

This satisfies the "no leakage" requirement: nginx cannot proxy to OPA because they share no network. The CLI on the host can reach OPA directly on localhost:8181, but a request through http://localhost:8080/v1/data/infra would return a 404 from nginx.

The policies/ directory is mounted read-only. OPA loads all .rego files and their companion data.json files at startup. Restarting OPA picks up policy changes without rebuilding anything.
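From the host, each gate is a single POST to OPA's Data API (the documented /v1/data/&lt;package&gt; endpoint). A hedged Python equivalent of what the bash curl does; the helper names are mine:

```python
import json
import urllib.request

def parse_decision(body: bytes):
    """Extract (allow, deny_reasons) from an OPA Data API response."""
    result = json.loads(body).get("result", {})
    return bool(result.get("allow")), list(result.get("deny", []))

def query_opa(package: str, input_doc: dict, base: str = "http://127.0.0.1:8181"):
    """POST the input document to /v1/data/<package> and parse the decision."""
    req = urllib.request.Request(
        f"{base}/v1/data/{package}",
        data=json.dumps({"input": input_doc}).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=3) as resp:
        return parse_decision(resp.read())

# Parsing a canned deny response (no OPA needed):
allowed, reasons = parse_decision(
    b'{"result":{"allow":false,"deny":["Disk free 2GB is below minimum 10GB"]}}')
```

query_opa("infra", {...}) would mirror the deploy gate; the promote gate would target the "canary" package instead.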

Writing Rego Policies

The core principle: OPA never returns a bare boolean. Every decision carries the reasoning behind it.

Infrastructure Policy (policies/infra/policy.rego):

package infra

import rego.v1

default allow := false

allow if {
    count(deny) == 0
}

deny contains msg if {
    input.disk_free_gb < data.infra.min_disk_gb
    msg := sprintf(
        "Disk free %vGB is below minimum %vGB",
        [input.disk_free_gb, data.infra.min_disk_gb]
    )
}

deny contains msg if {
    input.cpu_load > data.infra.max_cpu_load
    msg := sprintf(
        "CPU load %v exceeds maximum %v",
        [input.cpu_load, data.infra.max_cpu_load]
    )
}

The thresholds live in policies/infra/data.json, not in the Rego file:

{ "min_disk_gb": 10, "max_cpu_load": 2.0 }

This separation matters. Changing a threshold is a data change, not a policy change. You can update data.json and restart OPA without touching the logic. A security audit of the policy file remains valid even after threshold adjustments.

Canary Safety Policy (policies/canary/policy.rego):

package canary

import rego.v1

default allow := false

allow if {
    count(deny) == 0
}

deny contains msg if {
    input.error_rate > data.canary.max_error_rate
    msg := sprintf(
        "Error rate %v%% exceeds maximum %v%%",
        [round(input.error_rate * 100), round(data.canary.max_error_rate * 100)]
    )
}

deny contains msg if {
    input.p99_ms > data.canary.max_p99_ms
    msg := sprintf(
        "P99 latency %vms exceeds maximum %vms",
        [input.p99_ms, data.canary.max_p99_ms]
    )
}

Each domain owns exactly one question. The infra policy only knows about disk and CPU. The canary policy only knows about error rate and latency. A change to one never requires touching the other.

Gated Lifecycle: deploy and promote

Pre-deploy gate collects host stats and sends them to OPA:

# df -g (1GB blocks) is the macOS form; the Linux equivalent is df -BG
disk_free=$(df -g / | awk 'NR==2{print $4}')
cpu_load=$(uptime | awk -F'load averages?:' '{print $2}' | awk '{print $1}' | tr -d ',')

input_json=$(python3 -c "
import json
print(json.dumps({
    'action': 'deploy',
    'disk_free_gb': float('$disk_free'),
    'cpu_load': float('$cpu_load')
}))
")

When the disk is full, the output is unambiguous:

==> Checking infrastructure policy
  disk_free=2GB  cpu_load=0.8
allow=false
  [DENY] Disk free 2GB is below minimum 10GB
  [BLOCKED] Deployment denied by infrastructure policy

Pre-promote gate scrapes /metrics and computes P99 latency from the histogram:

# P99 from cumulative histogram buckets: take the first bucket boundary
# whose cumulative count covers the 99th-percentile observation
target = total_obs * 0.99
for le, count in sorted(buckets.items()):
    if count >= target:
        p99_ms = int(le * 1000)
        break
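Run against cumulative bucket counts like the ones in the /metrics sample, the lookup resolves to the first bucket boundary that covers the 99th observation. A self-contained sketch with illustrative numbers:

```python
def p99_ms_from_buckets(buckets: dict, total_obs: int) -> int:
    """Upper-bucket approximation of P99 from cumulative histogram counts."""
    target = total_obs * 0.99
    for le, count in sorted(buckets.items()):
        if count >= target:
            return int(le * 1000)
    # All finite buckets fell short: report the largest finite boundary.
    return int(max(buckets) * 1000)

# Cumulative counts: 210 requests finished within 5ms, 220 within 10ms, ...
buckets = {0.005: 210, 0.01: 220, 0.025: 221, 0.05: 221, 0.1: 222}
print(p99_ms_from_buckets(buckets, 222))   # → 10
```

Because it snaps to a bucket boundary, the result is only as precise as the bucket grid, which is fine for a pass/fail gate.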

The input to OPA is different from the deploy check — it reflects the question being asked:

{
  "action": "promote",
  "error_rate": 0.052,
  "p99_ms": 620
}

OPA treats pre-deploy and pre-promote as completely separate questions. The CLI queries each domain independently and treats each answer as a separate signal.

OPA unavailability is handled gracefully — the CLI never crashes or hangs:

resp=$(curl -s ... 2>/dev/null) || {
    echo "  [WARN] OPA unreachable — policy check skipped"
    return 2
}

Return code 2 means "unavailable" — the CLI logs a warning and continues. Return code 1 means "denied" — the CLI prints the reasons and exits. Return code 0 means "allowed" — the CLI proceeds.
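That three-way contract is small enough to pin down in a few lines; a sketch of the same mapping in Python (the constant and function names are mine):

```python
ALLOW, DENY, UNAVAILABLE = 0, 1, 2

def gate_result(decision) -> int:
    """Map an OPA decision (or None when OPA is unreachable) to the CLI's codes."""
    if decision is None:
        return UNAVAILABLE   # warn and continue
    if decision.get("allow"):
        return ALLOW         # proceed
    return DENY              # print the deny reasons and exit

assert gate_result(None) == 2
assert gate_result({"allow": True, "deny": []}) == 0
assert gate_result({"allow": False, "deny": ["Disk low"]}) == 1
```

Treating "unavailable" as a distinct outcome, rather than collapsing it into allow or deny, is what keeps the CLI from either crashing or silently bypassing policy.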

The Status Dashboard

./swiftdeploy status runs a live-refreshing terminal dashboard that scrapes /metrics every 5 seconds:

==> SwiftDeploy Status  [2026-05-06T20:10:00Z]  mode=canary

  Throughput : 7.3 req/s
  P99 Latency: 620ms
  Error Rate : 6.00%
  Uptime     : 700s
  Chaos      : error

  Policy Compliance:
    [PASS] infra   — disk=107GB cpu=0.8
    [FAIL] canary  — error_rate=6.00% p99=620ms

  (Ctrl+C to exit)

Every scrape appends a JSON record to history.jsonl:

{
  "ts": "2026-05-06T20:10:00Z",
  "mode": "canary",
  "req_per_s": 7.3,
  "p99_ms": 620,
  "error_rate": 0.06,
  "uptime_s": 700,
  "chaos": "error",
  "policy": {"infra": "PASS", "canary": "FAIL"}
}
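Each snapshot is a single json.dumps appended to the file. A tiny sketch (the snapshot_line helper is mine):

```python
import json

def snapshot_line(stats: dict) -> str:
    """One compact JSON object per line; append-only, so history is never rewritten."""
    return json.dumps(stats, separators=(",", ":")) + "\n"

line = snapshot_line({"ts": "2026-05-06T20:10:00Z", "mode": "canary",
                      "error_rate": 0.06})
# appended with: open("history.jsonl", "a").write(line)
```

JSONL means a crashed dashboard leaves at worst one truncated final line; every earlier record stays parseable.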

The Audit Report

./swiftdeploy audit parses history.jsonl and generates audit_report.md — valid GitHub Flavored Markdown with four sections: timeline, violations, mode changes, and chaos events.

==> audit_report.md generated
    4 snapshots | 1 violations | 1 mode changes | 1 chaos events
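The summary numbers fall out of one pass over the JSONL lines. An illustrative sketch of that counting, with a four-snapshot history shaped like the records above (the summarise helper is mine):

```python
import json

def summarise(lines):
    """One pass over history.jsonl: snapshots, violations, mode changes, chaos events."""
    snapshots = violations = mode_changes = chaos_events = 0
    prev_mode = prev_chaos = None
    for line in lines:
        rec = json.loads(line)
        snapshots += 1
        if "FAIL" in rec.get("policy", {}).values():
            violations += 1
        if prev_mode is not None and rec["mode"] != prev_mode:
            mode_changes += 1
        prev_mode = rec["mode"]
        chaos = rec.get("chaos", "none")
        if chaos != "none" and chaos != prev_chaos:
            chaos_events += 1   # count transitions into chaos, not every snapshot
        prev_chaos = chaos
    return snapshots, violations, mode_changes, chaos_events

history = [
    '{"mode":"stable","chaos":"none","policy":{"infra":"PASS","canary":"PASS"}}',
    '{"mode":"canary","chaos":"none","policy":{"infra":"PASS","canary":"PASS"}}',
    '{"mode":"canary","chaos":"error","policy":{"infra":"PASS","canary":"FAIL"}}',
    '{"mode":"canary","chaos":"none","policy":{"infra":"PASS","canary":"PASS"}}',
]
print(summarise(history))   # → (4, 1, 1, 1)
```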

The Chaos Experiments

Experiment 1: Error Injection

Promote to canary, then inject 50% error rate:

./swiftdeploy promote canary
curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode":"error","rate":0.5}'

The status dashboard immediately captures the failure:

  Throughput : 8.1 req/s
  P99 Latency: 45ms
  Error Rate : 48.20%
  Chaos      : error

  Policy Compliance:
    [PASS] infra   — disk=107GB cpu=0.8
    [FAIL] canary  — error_rate=48.20% p99=45ms

Now try to promote to stable — the canary gate blocks it:

==> Checking canary safety policy
  error_rate=48.20%  p99=45ms
allow=false
  [DENY] Error rate 48% exceeds maximum 1%
  [BLOCKED] Promotion denied by canary safety policy

This is the correct behaviour. The canary is unhealthy. You must recover first:

curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode":"recover"}'

Wait for the error rate to drop below 1% in the status view, then promote.

Experiment 2: Slow Responses

curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode":"slow","duration":3}'

The status dashboard shows P99 climbing:

  P99 Latency: 3000ms
  Error Rate : 0.00%

  Policy Compliance:
    [FAIL] canary  — error_rate=0.00% p99=3000ms

The canary policy blocks promotion on P99 alone — even with zero errors, a 3-second P99 is unacceptable.

Experiment 3: Disk Full Gate

Simulate a full disk by temporarily lowering the threshold in policies/infra/data.json:

{ "min_disk_gb": 200, "max_cpu_load": 2.0 }

Restart OPA and try to deploy:

docker restart swiftdeploy-opa
./swiftdeploy deploy
==> Checking infrastructure policy
  disk_free=107GB  cpu_load=0.8
allow=false
  [DENY] Disk free 107GB is below minimum 200GB
  [BLOCKED] Deployment denied by infrastructure policy

The hard gate works. No containers are started, no configs are written.


Lessons Learned

1. Heredocs and set -euo pipefail don't mix well with pipes.

The biggest source of bugs was set -o pipefail failing the script whenever any command in a pipeline returned non-zero, most often grep exiting 1 when it found no match. The fix: capture output into a variable and branch with if statements instead of && chains or piped until conditions.

2. macOS and Linux have different system utilities.

df -BG is Linux. df -g is macOS. uptime formats load averages differently. head -n -1 doesn't work on BSD head. Any tool targeting both platforms needs to handle these divergences explicitly.
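One way to sidestep the df divergence entirely is to measure in Python, since shutil.disk_usage returns the same answer on macOS and Linux. A sketch of that alternative (the CLI as written shells out to df, so this is a suggestion, not what it does):

```python
import shutil

def disk_free_gb(path: str = "/") -> float:
    """Free space in GB via shutil.disk_usage; identical on macOS and Linux."""
    return shutil.disk_usage(path).free / 1024 ** 3

free = disk_free_gb()
```

Since the policy input is built by an inline python3 snippet anyway, folding the measurement into it would remove one whole class of platform branching.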

3. Docker networks with internal: true block host port bindings.

Setting internal: true on a Docker network prevents all external connectivity — including host port mappings. OPA needs to be reachable by the CLI on the host, so the network must not be internal. Isolation from nginx is achieved by simply not adding OPA to the nginx-facing network.

4. Separate policy logic from policy data.

Hardcoding thresholds in Rego files is a mistake. When you need to adjust a limit, you want to change a JSON file, not a policy file. Keeping them separate means your Rego logic can be audited and reviewed independently of operational tuning.

5. Never return a bare boolean from a policy engine.

A decision without reasoning is useless to an operator. Every OPA response includes a deny array with human-readable messages. The CLI surfaces these directly. When a deploy is blocked, the operator knows exactly why.

6. The --no-deps flag is essential for rolling restarts.

docker compose up -d --force-recreate app without --no-deps will restart nginx too, causing a full outage. With --no-deps, only the app container restarts. Nginx keeps serving (returning 502s briefly), which is the correct rolling restart behaviour.


Replication Guide

Prerequisites

  • Docker + Docker Compose v2
  • Python 3.10+ with pyyaml (pip install pyyaml)
  • curl, lsof

Step 1: Clone and build

git clone <your-repo-url>
cd stage4
chmod +x swiftdeploy
docker build -t swift-deploy-1-node:latest .

Step 2: Init and validate

./swiftdeploy init
./swiftdeploy validate

Expected: 5/5 checks pass.

Step 3: Deploy

./swiftdeploy deploy

Expected output:

==> Checking infrastructure policy
  disk_free=107GB  cpu_load=0.8
allow=true
  [PASS] Infrastructure policy
==> Initialising
  generated nginx.conf
  generated docker-compose.yml
==> Starting stack
  [+] Running 4/4
==> Waiting for health (timeout 60s)
  [OK] Stack is healthy

Step 4: Verify endpoints

curl http://localhost:8080/
curl http://localhost:8080/healthz
curl http://localhost:8080/metrics

Step 5: Promote to canary

./swiftdeploy promote canary

Verify X-Mode: canary header:

curl -sv http://localhost:8080/ 2>&1 | grep X-Mode

Step 6: Inject chaos and watch the status dashboard

In one terminal:

./swiftdeploy status

In another:

curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode":"error","rate":0.5}'

Watch the canary policy flip to [FAIL] in the dashboard.

Step 7: Recover and generate audit report

curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode":"recover"}'

./swiftdeploy audit
cat audit_report.md

Step 8: Teardown

./swiftdeploy teardown --clean

Built for HNG DevOps Track — Stage 4.
