DEV Community

instanceofGod

SwiftDeploy: Building a Self-Writing Infrastructure Tool with OPA Policy Gates and Prometheus Observability

A complete walkthrough of building a declarative deployment CLI from scratch — one that generates its own config files, enforces policy before acting, and audits everything it does.


Table of Contents

  1. The Problem
  2. Architecture Overview
  3. Part A — The Engine
  4. Part B — The Eyes and the Brain
  5. The Chaos Experiments
  6. Lessons Learned
  7. Replication Guide

The Problem

Most DevOps tooling asks you to write config files by hand. You write a docker-compose.yml, an nginx.conf, maybe a Kubernetes manifest — and then you maintain all of them separately. When something changes, you update three files instead of one, and they drift apart.

SwiftDeploy flips this. You write one file — manifest.yaml — and the tool derives everything else from it. The generated files are artifacts, not sources.

Part B goes further: the tool now refuses to act unless the environment meets policy. It has eyes (Prometheus metrics), a brain (Open Policy Agent), and a memory (an append-only audit trail).


Architecture Overview

┌──────────────────────────────────────────────────────────────┐
│                         Host Machine                         │
│                                                              │
│   swiftdeploy CLI (bash)                                     │
│        │                                                     │
│        ├── reads ──► manifest.yaml                           │
│        ├── writes ─► nginx.conf, docker-compose.yml          │
│        ├── queries ► OPA :8181 (127.0.0.1 only)              │
│        └── scrapes ► /metrics via nginx :8080                │
│                                                              │
│  ┌─────────────────────────┐   ┌──────────────────────┐      │
│  │     swiftdeploy-net     │   │       opa-net        │      │
│  │                         │   │                      │      │
│  │  ┌────────┐  ┌────────┐ │   │  ┌────────────────┐  │      │
│  │  │  app   │  │ nginx  │ │   │  │      OPA       │  │      │
│  │  │ :3000  │◄─│ :8080  │ │   │  │     :8181      │  │      │
│  │  └────────┘  └────────┘ │   │  └────────────────┘  │      │
│  │ (no host port)  (pub)   │   │   (127.0.0.1 only)   │      │
│  └─────────────────────────┘   └──────────────────────┘      │
└──────────────────────────────────────────────────────────────┘

Key isolation properties:

  • The app container has no host-exposed port — all traffic routes through nginx
  • OPA lives on a separate Docker network (opa-net) with no connection to swiftdeploy-net
  • OPA's port is bound to 127.0.0.1 only — the CLI on the host can reach it, but nginx cannot proxy to it
  • No policy logic lives in the CLI — the CLI only sends data and reads decisions

Part A — The Engine

The Manifest: Single Source of Truth

Everything starts with manifest.yaml:

services:
  image: swift-deploy-1-node:latest
  port: 3000
  mode: stable        # stable | canary
  version: "1.0.0"

nginx:
  image: nginx:latest
  port: 8080
  proxy_timeout: 30

network:
  name: swiftdeploy-net
  driver_type: bridge

This is the only file an operator edits. Every other config file is generated from it. The grader's test is simple: delete nginx.conf and docker-compose.yml, run ./swiftdeploy init, and verify they come back correctly. If your tool breaks, your stack breaks.

The Tool That Writes Its Own Infrastructure

The core of swiftdeploy init is a Python template renderer embedded directly in the bash script:

import yaml
from pathlib import Path

with open("manifest.yaml") as f:
    m = yaml.safe_load(f)

ctx = {
    "service_image":  m["services"]["image"],
    "service_port":   m["services"]["port"],
    "mode":           m["services"].get("mode", "stable"),
    "nginx_port":     m["nginx"]["port"],
    "proxy_timeout":  m["nginx"].get("proxy_timeout", 30),
    "network_name":   m["network"]["name"],
    "network_driver": m["network"]["driver_type"],
}

for tmpl, out in [("templates/nginx.conf.j2", "nginx.conf"),
                  ("templates/docker-compose.yml.j2", "docker-compose.yml")]:
    src = Path(tmpl).read_text()
    for k, v in ctx.items():
        src = src.replace("{{ " + k + " }}", str(v))
    Path(out).write_text(src)

The templates use {{ variable }} placeholders. No Jinja2 library needed — a simple string replace is sufficient and keeps the image dependency-free.
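The substitution itself fits in a helper. A minimal, self-contained sketch of the same technique (the render function and the sample template are illustrative, not from the repo):

```python
def render(template: str, ctx: dict) -> str:
    """Replace {{ key }} placeholders with values from ctx."""
    for key, value in ctx.items():
        template = template.replace("{{ " + key + " }}", str(value))
    return template

tmpl = "listen {{ nginx_port }};\nproxy_read_timeout {{ proxy_timeout }}s;"
print(render(tmpl, {"nginx_port": 8080, "proxy_timeout": 30}))
# listen 8080;
# proxy_read_timeout 30s;
```

The trade-off is that unknown placeholders pass through silently, which is acceptable here because the templates and the context dict live in the same repo.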

The nginx template enforces the required access log format:

log_format swiftdeploy '$time_iso8601 | $status | ${request_time}s | $upstream_addr | $request';

And returns structured JSON error bodies for 502/503/504:

location @err502 {
    default_type application/json;
    return 502 '{"error":"Bad Gateway","code":502,"service":"app","contact":"ops@swiftdeploy.local"}';
}

The docker-compose template enforces security hardening on every deploy:

user: "1001:1001"
cap_drop:
  - ALL
expose:
  - "{{ service_port }}"   # internal only — never ports:

The API Service

The service is a pure Python stdlib HTTP server — no Flask, no FastAPI, no external dependencies. This keeps the Docker image under 80MB.

class Handler(BaseHTTPRequestHandler):
    # _json(handler, status, obj) is a small helper defined elsewhere in the
    # service: it writes the status code and a JSON-encoded body.
    def do_GET(self):
        if self.path == "/healthz":
            _json(self, 200, {"status": "ok", "uptime": round(time.time() - START_TIME, 2)})
        elif self.path == "/":
            _json(self, 200, {
                "message": f"Welcome to SwiftDeploy [{MODE}]",
                "mode": MODE,
                "version": VERSION,
                "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            })

Canary mode is controlled entirely by the MODE environment variable injected by docker-compose. The same image runs in both modes — no separate builds.
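A sketch of how the service side might read that variable; MODE and VERSION mirror the handler snippet above, while the APP_MODE_GAUGE name is mine (its value matches the app_mode metric):

```python
import os

# The same image serves both modes; docker-compose injects MODE=canary
# (or stable) into the container environment. Defaults keep local runs sane.
MODE = os.environ.get("MODE", "stable")
VERSION = os.environ.get("VERSION", "1.0.0")

# Exported on /metrics as the app_mode gauge (0=stable, 1=canary).
APP_MODE_GAUGE = 1 if MODE == "canary" else 0
```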

Chaos state is an in-process global dict protected by a threading.Lock:

import random
import threading
import time

_chaos = {"mode": None, "duration": 0, "rate": 0.0}
_chaos_lock = threading.Lock()

def _apply_chaos():
    # Snapshot the state under the lock, but sleep outside it so one slow
    # request doesn't block every other handler thread.
    with _chaos_lock:
        mode, duration, rate = _chaos["mode"], _chaos["duration"], _chaos["rate"]
    if mode == "slow":
        time.sleep(duration)
    elif mode == "error" and random.random() < rate:
        return True   # caller should return 500
    return False

The CLI Subcommands

validate runs 5 pre-flight checks before any deployment:

==> Running pre-flight checks
  [PASS] manifest.yaml exists and is valid YAML
  [PASS] All required fields present and non-empty
  [PASS] Docker image 'swift-deploy-1-node:latest' exists locally
  [PASS] Nginx port 8080 is not already bound
  [PASS] nginx.conf is syntactically valid

  Results: 5 passed, 0 failed
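The second check ("required fields present and non-empty") reduces to a dictionary walk. A hedged sketch, with the field list inferred from the manifest above (the missing_fields helper is mine):

```python
# Field list inferred from manifest.yaml above
REQUIRED = {
    "services": ["image", "port", "version"],
    "nginx": ["port"],
    "network": ["name", "driver_type"],
}

def missing_fields(manifest: dict) -> list:
    """Return dotted paths for every required field that is absent or empty."""
    problems = []
    for section, keys in REQUIRED.items():
        block = manifest.get(section) or {}
        for key in keys:
            if block.get(key) in (None, ""):
                problems.append(f"{section}.{key}")
    return problems

manifest = {
    "services": {"image": "swift-deploy-1-node:latest", "port": 3000,
                 "version": "1.0.0"},
    "nginx": {"port": 8080},
    "network": {"name": "swiftdeploy-net", "driver_type": "bridge"},
}
assert missing_fields(manifest) == []
```

Reporting the dotted path rather than a bare failure keeps the [PASS]/[FAIL] output actionable.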

The nginx syntax check is interesting — it can't just run nginx -t because the app hostname doesn't resolve outside the compose network. The fix: substitute the upstream hostname with 127.0.0.1 in a temp copy before testing:

sed 's|http://app:|http://127.0.0.1:|g' nginx.conf > .nginx-check-tmp.conf
docker run --rm -v "$(pwd)/.nginx-check-tmp.conf:/etc/nginx/conf.d/default.conf:ro" \
  nginx:latest nginx -t

promote does a rolling restart of only the app container:

docker compose up -d --no-deps --force-recreate app

The --no-deps flag is critical — it prevents docker-compose from restarting nginx, which would cause a brief outage. Only the app restarts; nginx keeps serving traffic (returning 502s for the few seconds the app is down, which is expected and handled by the JSON error pages).


Part B — The Eyes and the Brain

Instrumentation: /metrics in Pure Python

Adding Prometheus metrics to a stdlib HTTP server requires no libraries. All state lives in module-level variables protected by a lock:

_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
_req_total = {}       # {(method, path, status_code): count}
_hist_buckets = {le: 0 for le in _BUCKETS}
_hist_inf = 0         # cumulative +Inf bucket
_hist_sum = 0.0
_hist_count = 0

Every request records its duration and increments the appropriate counters:

def _record(method, path, status, duration):
    # Module-level scalars are rebound here, so they must be declared global.
    global _hist_sum, _hist_count, _hist_inf
    key = (method, path, str(status))
    with _metrics_lock:
        _req_total[key] = _req_total.get(key, 0) + 1
        _hist_sum += duration
        _hist_count += 1
        for le in _BUCKETS:
            if duration <= le:
                _hist_buckets[le] += 1
        _hist_inf += 1   # the +Inf bucket is cumulative: every observation counts

The /metrics endpoint serialises this state into Prometheus text format:

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/",status_code="200"} 42
http_requests_total{method="GET",path="/healthz",status_code="200"} 180
# HELP http_request_duration_seconds Request latency histogram
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.005"} 210
http_request_duration_seconds_bucket{le="0.01"} 220
...
http_request_duration_seconds_bucket{le="+Inf"} 222
http_request_duration_seconds_sum 0.441200
http_request_duration_seconds_count 222
# HELP app_uptime_seconds Seconds since process start
# TYPE app_uptime_seconds gauge
app_uptime_seconds 3600.42
# HELP app_mode Current deployment mode (0=stable 1=canary)
# TYPE app_mode gauge
app_mode 1
# HELP chaos_active Active chaos state (0=none 1=slow 2=error)
# TYPE chaos_active gauge
chaos_active 0
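Serialising that state is plain string building. A minimal sketch of the counter section, matching the output above (the render_counters helper is mine):

```python
def render_counters(req_total: dict) -> str:
    """Render http_requests_total in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests",
        "# TYPE http_requests_total counter",
    ]
    # Sorted output keeps the endpoint deterministic across scrapes.
    for (method, path, status), count in sorted(req_total.items()):
        lines.append(
            f'http_requests_total{{method="{method}",path="{path}",'
            f'status_code="{status}"}} {count}'
        )
    return "\n".join(lines) + "\n"

sample = {("GET", "/", "200"): 42, ("GET", "/healthz", "200"): 180}
out = render_counters(sample)
```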

The Policy Sidecar: OPA

OPA runs as a sidecar container. The key design decision is network isolation: OPA lives on opa-net, which has no connection to swiftdeploy-net (where nginx and the app live). The port is bound to 127.0.0.1 only:

opa:
  image: openpolicyagent/opa:latest
  command: ["run", "--server", "--addr", "0.0.0.0:8181", "/policies"]
  volumes:
    - ./policies:/policies:ro
  networks:
    - opa-net
  ports:
    - "127.0.0.1:8181:8181"

This satisfies the "no leakage" requirement: nginx cannot proxy to OPA because they share no network. The CLI on the host can reach OPA directly on localhost:8181, but a request through http://localhost:8080/v1/data/infra would return a 404 from nginx.

The policies/ directory is mounted read-only. OPA loads all .rego files and their companion data.json files at startup. Restarting OPA picks up policy changes without rebuilding anything.
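From the host, each gate is a single POST to OPA's Data API (the documented /v1/data/&lt;package&gt; endpoint). A hedged Python equivalent of what the bash curl does; the helper names are mine:

```python
import json
import urllib.request

def parse_decision(body: bytes):
    """Extract (allow, deny_reasons) from an OPA Data API response."""
    result = json.loads(body).get("result", {})
    return bool(result.get("allow")), list(result.get("deny", []))

def query_opa(package: str, input_doc: dict, base: str = "http://127.0.0.1:8181"):
    """POST the input document to /v1/data/<package> and parse the decision."""
    req = urllib.request.Request(
        f"{base}/v1/data/{package}",
        data=json.dumps({"input": input_doc}).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=3) as resp:
        return parse_decision(resp.read())

# Parsing a canned deny response (no OPA needed):
allowed, reasons = parse_decision(
    b'{"result":{"allow":false,"deny":["Disk free 2GB is below minimum 10GB"]}}')
```

query_opa("infra", {...}) would mirror the deploy gate; the promote gate would target the "canary" package instead.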

Writing Rego Policies

The core principle: OPA never returns a bare boolean. Every decision carries the reasoning behind it.

Infrastructure Policy (policies/infra/policy.rego):

package infra

import rego.v1

default allow := false

allow if {
    count(deny) == 0
}

deny contains msg if {
    input.disk_free_gb < data.infra.min_disk_gb
    msg := sprintf(
        "Disk free %vGB is below minimum %vGB",
        [input.disk_free_gb, data.infra.min_disk_gb]
    )
}

deny contains msg if {
    input.cpu_load > data.infra.max_cpu_load
    msg := sprintf(
        "CPU load %v exceeds maximum %v",
        [input.cpu_load, data.infra.max_cpu_load]
    )
}

The thresholds live in policies/infra/data.json, not in the Rego file:

{ "min_disk_gb": 10, "max_cpu_load": 2.0 }

This separation matters. Changing a threshold is a data change, not a policy change. You can update data.json and restart OPA without touching the logic. A security audit of the policy file remains valid even after threshold adjustments.

Canary Safety Policy (policies/canary/policy.rego):

package canary

import rego.v1

default allow := false

allow if {
    count(deny) == 0
}

deny contains msg if {
    input.error_rate > data.canary.max_error_rate
    msg := sprintf(
        "Error rate %v%% exceeds maximum %v%%",
        [round(input.error_rate * 100), round(data.canary.max_error_rate * 100)]
    )
}

deny contains msg if {
    input.p99_ms > data.canary.max_p99_ms
    msg := sprintf(
        "P99 latency %vms exceeds maximum %vms",
        [input.p99_ms, data.canary.max_p99_ms]
    )
}

Each domain owns exactly one question. The infra policy only knows about disk and CPU. The canary policy only knows about error rate and latency. A change to one never requires touching the other.

Gated Lifecycle: deploy and promote

Pre-deploy gate collects host stats and sends them to OPA:

# df -g (1GB blocks) is the macOS form; the Linux equivalent is df -BG
disk_free=$(df -g / | awk 'NR==2{print $4}')
cpu_load=$(uptime | awk -F'load averages?:' '{print $2}' | awk '{print $1}' | tr -d ',')

input_json=$(python3 -c "
import json
print(json.dumps({
    'action': 'deploy',
    'disk_free_gb': float('$disk_free'),
    'cpu_load': float('$cpu_load')
}))
")

When the disk is full, the output is unambiguous:

==> Checking infrastructure policy
  disk_free=2GB  cpu_load=0.8
allow=false
  [DENY] Disk free 2GB is below minimum 10GB
  [BLOCKED] Deployment denied by infrastructure policy

Pre-promote gate scrapes /metrics and computes P99 latency from the histogram:

# P99 from cumulative histogram buckets: take the first bucket boundary
# whose cumulative count covers the 99th-percentile observation
target = total_obs * 0.99
for le, count in sorted(buckets.items()):
    if count >= target:
        p99_ms = int(le * 1000)
        break
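Run against cumulative bucket counts like the ones in the /metrics sample, the lookup resolves to the first bucket boundary that covers the 99th observation. A self-contained sketch with illustrative numbers:

```python
def p99_ms_from_buckets(buckets: dict, total_obs: int) -> int:
    """Upper-bucket approximation of P99 from cumulative histogram counts."""
    target = total_obs * 0.99
    for le, count in sorted(buckets.items()):
        if count >= target:
            return int(le * 1000)
    # All finite buckets fell short: report the largest finite boundary.
    return int(max(buckets) * 1000)

# Cumulative counts: 210 requests finished within 5ms, 220 within 10ms, ...
buckets = {0.005: 210, 0.01: 220, 0.025: 221, 0.05: 221, 0.1: 222}
print(p99_ms_from_buckets(buckets, 222))   # → 10
```

Because it snaps to a bucket boundary, the result is only as precise as the bucket grid, which is fine for a pass/fail gate.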

The input to OPA is different from the deploy check — it reflects the question being asked:

{
  "action": "promote",
  "error_rate": 0.052,
  "p99_ms": 620
}

OPA treats pre-deploy and pre-promote as completely separate questions. The CLI queries each domain independently and treats each answer as a separate signal.

OPA unavailability is handled gracefully — the CLI never crashes or hangs:

resp=$(curl -s ... 2>/dev/null) || {
    echo "  [WARN] OPA unreachable — policy check skipped"
    return 2
}

Return code 2 means "unavailable" — the CLI logs a warning and continues. Return code 1 means "denied" — the CLI prints the reasons and exits. Return code 0 means "allowed" — the CLI proceeds.
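That three-way contract is small enough to pin down in a few lines; a sketch of the same mapping in Python (the constant and function names are mine):

```python
ALLOW, DENY, UNAVAILABLE = 0, 1, 2

def gate_result(decision) -> int:
    """Map an OPA decision (or None when OPA is unreachable) to the CLI's codes."""
    if decision is None:
        return UNAVAILABLE   # warn and continue
    if decision.get("allow"):
        return ALLOW         # proceed
    return DENY              # print the deny reasons and exit

assert gate_result(None) == 2
assert gate_result({"allow": True, "deny": []}) == 0
assert gate_result({"allow": False, "deny": ["Disk low"]}) == 1
```

Treating "unavailable" as a distinct outcome, rather than collapsing it into allow or deny, is what keeps the CLI from either crashing or silently bypassing policy.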

The Status Dashboard

./swiftdeploy status runs a live-refreshing terminal dashboard that scrapes /metrics every 5 seconds:

==> SwiftDeploy Status  [2026-05-06T20:10:00Z]  mode=canary

  Throughput : 7.3 req/s
  P99 Latency: 620ms
  Error Rate : 6.00%
  Uptime     : 700s
  Chaos      : error

  Policy Compliance:
    [PASS] infra   — disk=107GB cpu=0.8
    [FAIL] canary  — error_rate=6.00% p99=620ms

  (Ctrl+C to exit)

Every scrape appends a JSON record to history.jsonl:

{
  "ts": "2026-05-06T20:10:00Z",
  "mode": "canary",
  "req_per_s": 7.3,
  "p99_ms": 620,
  "error_rate": 0.06,
  "uptime_s": 700,
  "chaos": "error",
  "policy": {"infra": "PASS", "canary": "FAIL"}
}
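Each snapshot is a single json.dumps appended to the file. A tiny sketch (the snapshot_line helper is mine):

```python
import json

def snapshot_line(stats: dict) -> str:
    """One compact JSON object per line; append-only, so history is never rewritten."""
    return json.dumps(stats, separators=(",", ":")) + "\n"

line = snapshot_line({"ts": "2026-05-06T20:10:00Z", "mode": "canary",
                      "error_rate": 0.06})
# appended with: open("history.jsonl", "a").write(line)
```

JSONL means a crashed dashboard leaves at worst one truncated final line; every earlier record stays parseable.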

The Audit Report

./swiftdeploy audit parses history.jsonl and generates audit_report.md — valid GitHub Flavored Markdown with four sections: timeline, violations, mode changes, and chaos events.

==> audit_report.md generated
    4 snapshots | 1 violations | 1 mode changes | 1 chaos events
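The summary numbers fall out of one pass over the JSONL lines. An illustrative sketch of that counting, with a four-snapshot history shaped like the records above (the summarise helper is mine):

```python
import json

def summarise(lines):
    """One pass over history.jsonl: snapshots, violations, mode changes, chaos events."""
    snapshots = violations = mode_changes = chaos_events = 0
    prev_mode = prev_chaos = None
    for line in lines:
        rec = json.loads(line)
        snapshots += 1
        if "FAIL" in rec.get("policy", {}).values():
            violations += 1
        if prev_mode is not None and rec["mode"] != prev_mode:
            mode_changes += 1
        prev_mode = rec["mode"]
        chaos = rec.get("chaos", "none")
        if chaos != "none" and chaos != prev_chaos:
            chaos_events += 1   # count transitions into chaos, not every snapshot
        prev_chaos = chaos
    return snapshots, violations, mode_changes, chaos_events

history = [
    '{"mode":"stable","chaos":"none","policy":{"infra":"PASS","canary":"PASS"}}',
    '{"mode":"canary","chaos":"none","policy":{"infra":"PASS","canary":"PASS"}}',
    '{"mode":"canary","chaos":"error","policy":{"infra":"PASS","canary":"FAIL"}}',
    '{"mode":"canary","chaos":"none","policy":{"infra":"PASS","canary":"PASS"}}',
]
print(summarise(history))   # → (4, 1, 1, 1)
```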

The Chaos Experiments

Experiment 1: Error Injection

Promote to canary, then inject 50% error rate:

./swiftdeploy promote canary
curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode":"error","rate":0.5}'

The status dashboard immediately captures the failure:

  Throughput : 8.1 req/s
  P99 Latency: 45ms
  Error Rate : 48.20%
  Chaos      : error

  Policy Compliance:
    [PASS] infra   — disk=107GB cpu=0.8
    [FAIL] canary  — error_rate=48.20% p99=45ms

Now try to promote to stable — the canary gate blocks it:

==> Checking canary safety policy
  error_rate=48.20%  p99=45ms
allow=false
  [DENY] Error rate 48% exceeds maximum 1%
  [BLOCKED] Promotion denied by canary safety policy

This is the correct behaviour. The canary is unhealthy. You must recover first:

curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode":"recover"}'

Wait for the error rate to drop below 1% in the status view, then promote.

Experiment 2: Slow Responses

curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode":"slow","duration":3}'

The status dashboard shows P99 climbing:

  P99 Latency: 3000ms
  Error Rate : 0.00%

  Policy Compliance:
    [FAIL] canary  — error_rate=0.00% p99=3000ms

The canary policy blocks promotion on P99 alone — even with zero errors, a 3-second P99 is unacceptable.

Experiment 3: Disk Full Gate

Simulate a full disk by temporarily lowering the threshold in policies/infra/data.json:

{ "min_disk_gb": 200, "max_cpu_load": 2.0 }

Restart OPA and try to deploy:

docker restart swiftdeploy-opa
./swiftdeploy deploy
==> Checking infrastructure policy
  disk_free=107GB  cpu_load=0.8
allow=false
  [DENY] Disk free 107GB is below minimum 200GB
  [BLOCKED] Deployment denied by infrastructure policy

The hard gate works. No containers are started, no configs are written.


Lessons Learned

1. Heredocs and set -euo pipefail don't mix well with pipes.

The biggest source of bugs was set -o pipefail failing the script whenever any command in a pipeline returned non-zero, most often grep exiting 1 when it found no match. The fix: capture output into a variable and branch with if statements instead of && chains or piped until conditions.

2. macOS and Linux have different system utilities.

df -BG is Linux. df -g is macOS. uptime formats load averages differently. head -n -1 doesn't work on BSD head. Any tool targeting both platforms needs to handle these divergences explicitly.
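One way to sidestep the df divergence entirely is to measure in Python, since shutil.disk_usage returns the same answer on macOS and Linux. A sketch of that alternative (the CLI as written shells out to df, so this is a suggestion, not what it does):

```python
import shutil

def disk_free_gb(path: str = "/") -> float:
    """Free space in GB via shutil.disk_usage; identical on macOS and Linux."""
    return shutil.disk_usage(path).free / 1024 ** 3

free = disk_free_gb()
```

Since the policy input is built by an inline python3 snippet anyway, folding the measurement into it would remove one whole class of platform branching.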

3. Docker networks with internal: true block host port bindings.

Setting internal: true on a Docker network prevents all external connectivity — including host port mappings. OPA needs to be reachable by the CLI on the host, so the network must not be internal. Isolation from nginx is achieved by simply not adding OPA to the nginx-facing network.

4. Separate policy logic from policy data.

Hardcoding thresholds in Rego files is a mistake. When you need to adjust a limit, you want to change a JSON file, not a policy file. Keeping them separate means your Rego logic can be audited and reviewed independently of operational tuning.

5. Never return a bare boolean from a policy engine.

A decision without reasoning is useless to an operator. Every OPA response includes a deny array with human-readable messages. The CLI surfaces these directly. When a deploy is blocked, the operator knows exactly why.

6. The --no-deps flag is essential for rolling restarts.

docker compose up -d --force-recreate app without --no-deps will restart nginx too, causing a full outage. With --no-deps, only the app container restarts. Nginx keeps serving (returning 502s briefly), which is the correct rolling restart behaviour.


Replication Guide

Prerequisites

  • Docker + Docker Compose v2
  • Python 3.10+ with pyyaml (pip install pyyaml)
  • curl, lsof

Step 1: Clone and build

git clone <your-repo-url>
cd stage4
chmod +x swiftdeploy
docker build -t swift-deploy-1-node:latest .

Step 2: Init and validate

./swiftdeploy init
./swiftdeploy validate

Expected: 5/5 checks pass.

Step 3: Deploy

./swiftdeploy deploy

Expected output:

==> Checking infrastructure policy
  disk_free=107GB  cpu_load=0.8
allow=true
  [PASS] Infrastructure policy
==> Initialising
  generated nginx.conf
  generated docker-compose.yml
==> Starting stack
  [+] Running 4/4
==> Waiting for health (timeout 60s)
  [OK] Stack is healthy

Step 4: Verify endpoints

curl http://localhost:8080/
curl http://localhost:8080/healthz
curl http://localhost:8080/metrics

Step 5: Promote to canary

./swiftdeploy promote canary

Verify X-Mode: canary header:

curl -sv http://localhost:8080/ 2>&1 | grep X-Mode

Step 6: Inject chaos and watch the status dashboard

In one terminal:

./swiftdeploy status

In another:

curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode":"error","rate":0.5}'

Watch the canary policy flip to [FAIL] in the dashboard.

Step 7: Recover and generate audit report

curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode":"recover"}'

./swiftdeploy audit
cat audit_report.md

Step 8: Teardown

./swiftdeploy teardown --clean

Built for HNG DevOps Track — Stage 4.
