Felix Gogodae

SwiftDeploy: Building a Self-Governing Deployment Tool with OPA, Prometheus, and a Single YAML File

Series: First, I built the engine (manifest → rendered nginx + compose, gated lifecycle). Then I added the eyes (Prometheus /metrics) and the brain (OPA policy sidecar). This post covers the complete journey.


Table of Contents

  1. What problem are we solving?
  2. The single source of truth
  3. Architecture overview
  4. The engine: writing its own infrastructure
  5. The eyes: Prometheus instrumentation
  6. The brain: OPA policy sidecar
  7. Gated lifecycle: deploy and promote
  8. The live dashboard: swiftdeploy status
  9. The memory: swiftdeploy audit
  10. Injecting chaos and watching the gates fire
  11. Lessons learned

1. What problem are we solving?

Most deployment tooling separates three concerns that should be tightly coupled:

| Concern | Typical state |
| --- | --- |
| Configuration | Scattered across Compose files, env files, CI YAML files |
| Policy | Implicit — "the person who ran the deploy knew it was safe" |
| Observability | A separate system bolted on afterward |

SwiftDeploy collapses all three into one loop: a single manifest.yaml drives rendered infrastructure, feeds thresholds to OPA at deploy/promote time, and tells the API how long to keep rolling-window metrics.


2. The single source of truth

Every value that changes between environments lives in one file:

services:
  image: swiftdeploy-hng14-api:latest
  port: 3000
  mode: stable           # swiftdeploy promote rewrites this in-place

nginx:
  image: nginx:1.27-alpine
  port: 8080
  proxy_timeout: 60s

network:
  name: swiftdeploy-net
  driver_type: bridge

metadata:
  version: "1.0.0"
  service_name: swiftdeploy-api
  contact: "you@example.com"
  deployed_by: "swiftdeploy"

compose_project: swiftdeploy

policy:
  thresholds:          # fed to OPA as input.thresholds — never hardcoded in .rego
    min_disk_free_gb: 10
    min_mem_available_gb: 1
    max_cpu_load: 2.0
    max_error_rate_percent: 1
    max_p99_latency_ms: 500
    metrics_window_seconds: 30
  opa:
    image: openpolicyagent/opa:0.69.0
    host_port: 9182

The CLI (swiftdeploy) reads this file, renders nginx.conf and docker-compose.yml via Jinja2, and never asks you to edit either generated file.
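Conceptually the render step is just PyYAML plus Jinja2. A minimal sketch, with hypothetical helper names (load_manifest, render_all) rather than the real CLI's internals:

# Minimal sketch of manifest -> rendered files; helper names are illustrative.
import yaml
from jinja2 import Environment, FileSystemLoader

def load_manifest(path="manifest.yaml"):
    with open(path) as f:
        return yaml.safe_load(f)

def render_all(manifest):
    env = Environment(loader=FileSystemLoader("templates"))
    outputs = {"nginx.conf.j2": "nginx.conf",
               "docker-compose.yml.j2": "docker-compose.yml"}
    for template_name, output_path in outputs.items():
        rendered = env.get_template(template_name).render(**manifest)
        with open(output_path, "w") as f:
            f.write(rendered)

render_all(load_manifest())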


3. Architecture overview

Architecture Diagram

Key isolation properties

| Property | How it is enforced |
| --- | --- |
| OPA not reachable via Nginx | OPA bound to 127.0.0.1:9182 only; Nginx only proxies to api |
| API port not public | expose: only — no ports: mapping on the api service |
| No decision logic in CLI | CLI POSTs context, reads back allowed + checks[]; all logic lives in Rego |
| Thresholds not in Rego | .rego files reference only input.thresholds.* — values come from manifest.yaml |
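In compose terms those properties boil down to a few lines. An illustrative excerpt of what the rendered file expresses (not the verbatim template output; the container-side ports are assumptions):

services:
  api:
    expose:
      - "3000"                    # compose-network only; no ports: mapping
  nginx:
    ports:
      - "8080:80"                 # the single public entry point
  opa:
    ports:
      - "127.0.0.1:9182:8181"     # loopback-bound; unreachable through Nginx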

4. The engine: writing its own infrastructure

swiftdeploy init parses manifest.yaml with PyYAML and feeds the result into two Jinja2 templates:

  • templates/nginx.conf.j2 — upstream block, proxy timeouts, error pages, access log format, X-Deployed-By header, temp paths under /tmp so the nginx user can write
  • templates/docker-compose.yml.j2 — three services (api, nginx, opa), security hardening on api (cap_drop: ALL, no-new-privileges, user: 1000:1000), healthcheck, named volume

The METRICS_WINDOW_SECONDS env var is written from policy.thresholds.metrics_window_seconds — the same value that OPA uses as the SLO window — so the API's rolling gauge and the Rego rule are always in sync.
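In the compose template that amounts to a single interpolation. A sketch of how such a line might look (the real template may phrase it differently):

  api:
    environment:
      METRICS_WINDOW_SECONDS: "{{ policy.thresholds.metrics_window_seconds }}"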

swiftdeploy validate runs five pre-flight checks before any container starts (a condensed sketch follows the list):

  1. manifest.yaml exists and parses
  2. All required fields are non-empty (including the full policy block)
  3. docker image inspect <services.image> succeeds
  4. nginx.port is free on the host
  5. Rendered nginx.conf passes nginx -t inside a throwaway container
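A condensed sketch of those checks in Python; structure and error handling are simplified, and the real CLI reports each failure separately:

# Hypothetical condensed form of the five pre-flight checks.
import os, socket, subprocess, yaml

def validate(path="manifest.yaml"):
    with open(path) as f:
        manifest = yaml.safe_load(f)                                  # 1. exists and parses
    assert manifest["policy"]["thresholds"], "policy block is required"  # 2. required fields (simplified)
    subprocess.run(["docker", "image", "inspect",
                    manifest["services"]["image"]],
                   check=True, capture_output=True)                   # 3. image present locally
    with socket.socket() as s:                                        # 4. nginx.port free on host
        s.bind(("0.0.0.0", int(manifest["nginx"]["port"])))
    subprocess.run(["docker", "run", "--rm",
                    "-v", f"{os.getcwd()}/nginx.conf:/etc/nginx/nginx.conf:ro",
                    manifest["nginx"]["image"], "nginx", "-t"],
                   check=True)                                        # 5. nginx -t in a throwaway container
    return manifest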

5. The eyes: Prometheus instrumentation

The FastAPI app exposes GET /metrics in Prometheus text format. There are two layers of middleware:

request in
    │
    ▼
[chaos middleware]       <- injects slow/error in canary mode (skipped on POST /chaos)
    │
    ▼
[prometheus middleware]  <- times the full stack including chaos delay
    │
    ▼
route handler

Standard metrics:

http_requests_total{method, path, status_code}
http_request_duration_seconds{method, path}   <- histogram, standard buckets
app_uptime_seconds
app_mode                                       <- 0=stable, 1=canary
chaos_active                                   <- 0=none, 1=slow, 2=error

Rolling-window gauges (what OPA queries for canary SLOs):

swiftdeploy_window_requests_total       <- count of requests in last N seconds
swiftdeploy_window_errors_total         <- 5xx count in window
swiftdeploy_window_p99_latency_seconds  <- in-process P99 over same window

The window is a collections.deque. On every request, a (timestamp, duration, is_error) tuple is appended, stale entries are evicted, and the three gauges are recomputed — P99 via sorted index. No external TSDB needed; the gauge values are always current when scraped.
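A minimal sketch of that bookkeeping, using prometheus_client gauges with the names shown above (the real middleware differs in detail):

# Rolling-window gauges over a deque; a sketch, not the actual middleware.
import time
from collections import deque
from prometheus_client import Gauge

WINDOW_SECONDS = 30
WINDOW = deque()  # (timestamp, duration_seconds, is_error) tuples

window_requests = Gauge("swiftdeploy_window_requests_total", "Requests in window")
window_errors = Gauge("swiftdeploy_window_errors_total", "5xx responses in window")
window_p99 = Gauge("swiftdeploy_window_p99_latency_seconds", "In-process P99 over window")

def record(duration: float, is_error: bool) -> None:
    now = time.monotonic()
    WINDOW.append((now, duration, is_error))
    while WINDOW and now - WINDOW[0][0] > WINDOW_SECONDS:   # evict stale entries
        WINDOW.popleft()
    durations = sorted(d for _, d, _ in WINDOW)
    window_requests.set(len(WINDOW))
    window_errors.set(sum(1 for _, _, e in WINDOW if e))
    # P99 via sorted index, as described above
    window_p99.set(durations[int(0.99 * (len(durations) - 1))] if durations else 0.0)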


6. The brain: OPA policy sidecar

Why OPA instead of if-statements in the CLI?

The key constraint: the CLI must not make any allow/deny decision itself. With if-statements in Python, the logic and the thresholds are co-located with the operator tool. With OPA:

  • Thresholds live only in manifest.yaml (one place to change for all environments)
  • Policy logic lives only in .rego (auditable, testable with opa test)
  • The CLI is a dumb messenger — it assembles context, posts it, and reads back a decision object (see the sketch below)
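That messenger role is a single POST against OPA's standard Data API. A hedged sketch; the real CLI wraps this in the failure handling described further down:

# Post a context document to OPA and return the decision; a sketch only.
import requests

def opa_decision(domain: str, context: dict, host_port: int = 9182) -> dict:
    url = f"http://127.0.0.1:{host_port}/v1/data/swiftdeploy/{domain}/decision"
    resp = requests.post(url, json={"input": context}, timeout=5)
    resp.raise_for_status()
    # result carries allowed, reasons[] and checks[] as shown below
    return resp.json()["result"]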

Domain isolation

Each policy domain owns exactly one question and one data shape:

| Domain | Question | Input shape |
| --- | --- | --- |
| swiftdeploy.infrastructure | Is the host healthy enough to deploy? | {phase, host: {disk_free_gb, cpu_load_1m, mem_available_gb}, thresholds} |
| swiftdeploy.canary | Is the canary safe enough to promote? | {phase, promotion_target, metrics: {error_rate_percent, p99_latency_ms, window_seconds}, thresholds} |

A change to the infrastructure rules never touches canary/policy.rego and vice versa.

Decision structure (never a bare boolean)

Every decision document carries per-rule checks:

decision := {
    "allowed": count(reasons) == 0,
    "domain": "infrastructure",
    "phase": input.phase,
    "reasons": sort([r | reasons[r]]),
    "checks": [
        {"rule_id": "infra_disk_free_minimum",
         "passed": disk_ok, "detail": disk_detail},
        {"rule_id": "infra_cpu_load_maximum",
         "passed": cpu_ok,  "detail": cpu_detail},
        {"rule_id": "infra_memory_available_minimum",
         "passed": mem_ok,  "detail": mem_detail},
    ],
    ...
}

The CLI iterates checks[] directly for the live status display — it never infers pass/fail itself.
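The individual checks are plain comparisons against input.thresholds.*. A simplified sketch of what one such rule could look like, not the exact policy shipped in the repo:

package swiftdeploy.infrastructure

import rego.v1

default disk_ok := false

disk_ok if input.host.disk_free_gb >= input.thresholds.min_disk_free_gb

disk_detail := sprintf("PASS: disk free %.2f GB meets minimum %.2f GB.",
    [input.host.disk_free_gb, input.thresholds.min_disk_free_gb]) if disk_ok

disk_detail := sprintf("FAIL: disk free %.2f GB below minimum %.2f GB.",
    [input.host.disk_free_gb, input.thresholds.min_disk_free_gb]) if not disk_ok

reasons contains msg if {
    not disk_ok
    msg := sprintf("Policy violation: disk free (%.2f GB) below minimum (%.2f GB).",
        [input.host.disk_free_gb, input.thresholds.min_disk_free_gb])
}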

Failure handling

Every distinct failure mode has a unique failure_kind and a human-readable message:

| Situation | failure_kind | Message shown to operator |
| --- | --- | --- |
| OPA container not started | opa_connection_refused | "Start with: docker compose up -d opa" |
| OPA slow to respond | opa_timeout | "OPA request timed out (read)" |
| OPA returns non-JSON | opa_bad_json | Includes raw snippet |
| OPA returns no result key | opa_no_result | Includes raw snippet |
| psutil not installed | host_stats_unavailable | Install instructions |

None of these paths crash or hang the CLI.
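A hedged sketch of how those transport failures can be normalized into (failure_kind, message) pairs; the exact wording and structure in the CLI may differ:

# Map transport failures to stable failure_kind values; illustrative only.
import requests

def safe_opa_post(url: str, payload: dict) -> dict:
    try:
        resp = requests.post(url, json=payload, timeout=(3, 5))
    except requests.exceptions.ConnectionError:
        return {"failure_kind": "opa_connection_refused",
                "message": "Start with: docker compose up -d opa"}
    except requests.exceptions.ReadTimeout:
        return {"failure_kind": "opa_timeout",
                "message": "OPA request timed out (read)"}
    try:
        body = resp.json()
    except ValueError:
        return {"failure_kind": "opa_bad_json", "message": resp.text[:200]}
    if "result" not in body:
        return {"failure_kind": "opa_no_result", "message": str(body)[:200]}
    return {"result": body["result"]}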


7. Gated lifecycle: deploy and promote

swiftdeploy deploy

init (render nginx.conf + docker-compose.yml)
    |
    v
docker compose up -d opa
    |
    v
wait_opa_ready (polls /health, up to 75s)
    |
    v
collect_host_stats --> POST /v1/data/swiftdeploy/infrastructure/decision
                                |
                      +---------+-----------+
                      |                     |
                 allowed: false        allowed: true
                      |                     |
               print FAIL checks    docker compose up --build -d
               exit(1)                      |
               (no stack up)                v
                                   poll GET /healthz via nginx

Real output on a day when CPU spiked:

Policy compliance (infrastructure (pre-deploy)):
  [PASS] infra_disk_free_minimum: PASS: disk free 66.57 GB meets minimum 10.00 GB.
  [FAIL] infra_cpu_load_maximum: FAIL: CPU load 2.52 exceeds maximum 2.00.
  [PASS] infra_memory_available_minimum: PASS: memory available 8.10 GB meets minimum 1.00 GB.
[swiftdeploy] POLICY VIOLATION - deploy blocked (infrastructure).
  - Policy violation: CPU load (2.52) exceeds maximum allowed (2.00).

The stack never started. No compose up ran. The OPA sidecar is the only container that exists at this point.

swiftdeploy promote canary

Before rewriting manifest.yaml, the CLI:

  1. Scrapes GET /metrics via Nginx
  2. Derives error_rate_percent and p99_latency_ms from the rolling-window gauges
  3. Posts to swiftdeploy/canary/decision with promotion_target: "canary"
  4. On allowed: false — exits without touching manifest.yaml

Promoting to stable takes a different Rego branch that skips SLO evaluation entirely (there are no "canary metrics" to check when moving away from canary).
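Deriving the two SLO numbers from the scrape is straightforward. A minimal sketch using prometheus_client's text parser (the real CLI may parse differently):

# Scrape /metrics and derive the canary SLO inputs; a sketch only.
import requests
from prometheus_client.parser import text_string_to_metric_families

def canary_metrics(base_url="http://127.0.0.1:8080"):
    text = requests.get(f"{base_url}/metrics", timeout=5).text
    values = {}
    for family in text_string_to_metric_families(text):
        for sample in family.samples:
            values[sample.name] = sample.value
    total = values.get("swiftdeploy_window_requests_total", 0)
    errors = values.get("swiftdeploy_window_errors_total", 0)
    p99_seconds = values.get("swiftdeploy_window_p99_latency_seconds", 0.0)
    return {
        "error_rate_percent": (errors / total * 100) if total else 0.0,
        "p99_latency_ms": p99_seconds * 1000,
    }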


8. The live dashboard: swiftdeploy status

python swiftdeploy status --interval 2 -n 5

Each sample scrapes /healthz, /metrics, and both OPA domains independently, then prints:

=== 2026-05-06T18:16:17Z  mode='stable'  req/s~=3.2100 ===
  window(30s): errors=2/41 err_rate=4.8780% p99=312.45ms
  chaos_active: 2 (error)
  Policy compliance (infrastructure (pre-deploy)):
    [PASS] infra_disk_free_minimum: PASS: disk free 66.62 GB meets minimum 10.00 GB.
    [PASS] infra_cpu_load_maximum: PASS: CPU load 0.89 is within maximum 2.00.
    [PASS] infra_memory_available_minimum: PASS: memory available 11.38 GB meets minimum 1.00 GB.
  OPA [infrastructure (pre-deploy)] aggregate: ALLOW
  Policy compliance (canary (hypothetical promote->canary)):
    [FAIL] canary_error_rate_window: FAIL: error rate 4.8780% exceeds maximum 1.0000% over 30 s window.
    [PASS] canary_p99_latency_window: PASS: P99 latency 312.45 ms within maximum 500.00 ms over 30 s window.
  OPA [canary (hypothetical promote->canary)] aggregate: DENY

Every sample is appended as one JSON line to history.jsonl including chaos_active, window metrics, and both OPA snapshots with their checks[].
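The append itself is trivial. A sketch with an illustrative record shape (field names are assumptions; the values mirror the sample above):

# Append one status sample as a JSON line; record fields are illustrative.
import json, datetime

def append_history(sample: dict, path: str = "history.jsonl") -> None:
    with open(path, "a") as f:
        f.write(json.dumps(sample) + "\n")

append_history({
    "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "mode": "stable",
    "chaos_active": 2,
    "window": {"requests": 41, "errors": 2, "err_rate_percent": 4.878, "p99_ms": 312.45},
    "opa": {"infrastructure": {"allowed": True}, "canary": {"allowed": False}},
})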


9. The memory: swiftdeploy audit

python swiftdeploy audit

audit_report.md is generated from history.jsonl with four sections:

  • Summary — sample count, denial count, transport error count
  • Timeline events — mode transitions and chaos transitions detected by diffing consecutive records
  • Violations — every allowed: false from any domain, with reasons
  • Recent timeline — last 25 samples in a table with Chaos column and per-domain OPA status

Example timeline events table:

| Time (UTC) | Event | Detail |
| --- | --- | --- |
| 2026-05-06T18:17:02Z | chaos_change | none -> error |
| 2026-05-06T18:20:14Z | mode_change | stable -> canary |
| 2026-05-06T18:23:41Z | chaos_change | error -> none |
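Those timeline events fall out of a simple diff over consecutive history records. A minimal sketch, with illustrative field names:

# Detect mode/chaos transitions by diffing consecutive history records.
import json

def timeline_events(path: str = "history.jsonl"):
    events, prev = [], None
    for line in open(path):
        if not line.strip():
            continue
        rec = json.loads(line)
        if prev is not None:
            for field, label in (("mode", "mode_change"), ("chaos_active", "chaos_change")):
                if rec.get(field) != prev.get(field):
                    events.append({"time": rec.get("ts"), "event": label,
                                   "detail": f"{prev.get(field)} -> {rec.get(field)}"})
        prev = rec
    return events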

10. Injecting chaos and watching the gates fire

In canary mode, POST /chaos arms the process-global chaos state:

# arm 40% error rate
curl -s -X POST http://127.0.0.1:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode": "error", "rate": 0.40}'

# arm 2-second slow response on every request
curl -s -X POST http://127.0.0.1:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode": "slow", "duration": 2.0}'

# recover
curl -s -X POST http://127.0.0.1:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode": "recover"}'

With 40% error rate active and traffic flowing, the status dashboard shows canary_error_rate_window FAIL within one 30-second window. Attempting swiftdeploy promote canary while this is true produces:

Swiftdeploy promote canary blocked by OPA canary safety policy

  Policy compliance (canary (pre-promote)):
    [FAIL] canary_error_rate_window: FAIL: error rate 50.8772% exceeds maximum 1.0000% over 30 s window.
    [PASS] canary_p99_latency_window: PASS: P99 latency 1.96 ms within maximum 500.00 ms over 30 s window.
[swiftdeploy] POLICY VIOLATION - promote blocked (canary safety policy).
  - Policy violation: error rate (50.8772%) exceeds maximum (1.0000%) over last 30 seconds.

manifest.yaml is not modified. After recovering and waiting for the window to clear, the same command succeeds.


11. Lessons learned

1. One source of truth is a forcing function, not a convenience.
When thresholds are only in manifest.yaml and nowhere else, you cannot accidentally have a tighter limit in the Rego file than in your runbook. The manifest is the runbook.

2. OPA's value is in the separation, not the language.
Rego has a learning curve. The real benefit is that a policy change is a PR to a .rego file with a clear audit trail, not a diff buried inside deployment tooling.

3. Rolling-window gauges beat querying a TSDB for CLI gates.
The alternative — running Prometheus Server just to evaluate a PromQL expression at deploy time — adds infrastructure for something the app can compute in-process with a deque. The CLI scrapes the gauge, not the raw counter buckets.

4. Failure modes are the real API.
The most useful work in this project was not the happy path. It was giving every OPA transport failure a distinct failure_kind and message so an operator at 2am knows immediately whether OPA is down, slow, returning bad JSON, or returning a policy decision that says no.

5. Windows CPU approximation is not Linux load average.
The infrastructure policy uses 1-minute load average on Linux. On Windows, psutil.cpu_percent x logical_cpus spikes aggressively during container start. The gate working correctly the first time it fired was both the most satisfying and most annoying moment of the project.


Repository

github.com/Trojanhorse7/swift-deploy

