Felix Gogodae

SwiftDeploy: Building a Self-Governing Deployment Tool with OPA, Prometheus, and a Single YAML File

Series: First, I built the engine (manifest → rendered nginx + compose, gated lifecycle). Then I added the eyes (Prometheus /metrics) and the brain (OPA policy sidecar). This post covers the complete journey.


Table of Contents

  1. What problem are we solving?
  2. The single source of truth
  3. Architecture overview
  4. The engine: writing its own infrastructure
  5. The eyes: Prometheus instrumentation
  6. The brain: OPA policy sidecar
  7. Gated lifecycle: deploy and promote
  8. The live dashboard: swiftdeploy status
  9. The memory: swiftdeploy audit
  10. Injecting chaos and watching the gates fire
  11. Lessons learned

1. What problem are we solving?

Most deployment tooling separates three concerns that should be tightly coupled:

| Concern | Typical state |
| --- | --- |
| Configuration | Scattered across Compose files, env files, CI YAML files |
| Policy | Implicit — "the person who ran the deploy knew it was safe" |
| Observability | A separate system bolted on afterward |

SwiftDeploy collapses all three into one loop: a single manifest.yaml drives rendered infrastructure, feeds thresholds to OPA at deploy/promote time, and tells the API how long to keep rolling-window metrics.


2. The single source of truth

Every value that changes between environments lives in one file:

services:
  image: swiftdeploy-hng14-api:latest
  port: 3000
  mode: stable           # swiftdeploy promote rewrites this in-place

nginx:
  image: nginx:1.27-alpine
  port: 8080
  proxy_timeout: 60s

network:
  name: swiftdeploy-net
  driver_type: bridge

metadata:
  version: "1.0.0"
  service_name: swiftdeploy-api
  contact: "you@example.com"
  deployed_by: "swiftdeploy"

compose_project: swiftdeploy

policy:
  thresholds:          # fed to OPA as input.thresholds — never hardcoded in .rego
    min_disk_free_gb: 10
    min_mem_available_gb: 1
    max_cpu_load: 2.0
    max_error_rate_percent: 1
    max_p99_latency_ms: 500
    metrics_window_seconds: 30
  opa:
    image: openpolicyagent/opa:0.69.0
    host_port: 9182

The CLI (swiftdeploy) reads this file, renders nginx.conf and docker-compose.yml via Jinja2, and never asks you to edit either generated file.
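Conceptually the render step is just PyYAML plus Jinja2. A minimal sketch, with hypothetical helper names (load_manifest, render_all) rather than the real CLI's internals:

# Minimal sketch of manifest -> rendered files; helper names are illustrative.
import yaml
from jinja2 import Environment, FileSystemLoader

def load_manifest(path="manifest.yaml"):
    with open(path) as f:
        return yaml.safe_load(f)

def render_all(manifest):
    env = Environment(loader=FileSystemLoader("templates"))
    outputs = {"nginx.conf.j2": "nginx.conf",
               "docker-compose.yml.j2": "docker-compose.yml"}
    for template_name, output_path in outputs.items():
        rendered = env.get_template(template_name).render(**manifest)
        with open(output_path, "w") as f:
            f.write(rendered)

render_all(load_manifest())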


3. Architecture overview

Architecture Diagram

Key isolation properties

| Property | How it is enforced |
| --- | --- |
| OPA not reachable via Nginx | OPA bound to 127.0.0.1:9182 only; Nginx only proxies to api |
| API port not public | expose: only — no ports: mapping on the api service |
| No decision logic in CLI | CLI POSTs context, reads back allowed + checks[]; all logic lives in Rego |
| Thresholds not in Rego | .rego files reference only input.thresholds.* — values come from manifest.yaml |
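In compose terms those properties boil down to a few lines. An illustrative excerpt of what the rendered file expresses (not the verbatim template output; the container-side ports are assumptions):

services:
  api:
    expose:
      - "3000"                    # compose-network only; no ports: mapping
  nginx:
    ports:
      - "8080:80"                 # the single public entry point
  opa:
    ports:
      - "127.0.0.1:9182:8181"     # loopback-bound; unreachable through Nginx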

4. The engine: writing its own infrastructure

swiftdeploy init parses manifest.yaml with PyYAML and feeds the result into two Jinja2 templates:

  • templates/nginx.conf.j2 — upstream block, proxy timeouts, error pages, access log format, X-Deployed-By header, temp paths under /tmp so the nginx user can write
  • templates/docker-compose.yml.j2 — three services (api, nginx, opa), security hardening on api (cap_drop: ALL, no-new-privileges, user: 1000:1000), healthcheck, named volume

The METRICS_WINDOW_SECONDS env var is written from policy.thresholds.metrics_window_seconds — the same value that OPA uses as the SLO window — so the API's rolling gauge and the Rego rule are always in sync.
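In the compose template that amounts to a single interpolation. A sketch of how such a line might look (the real template may phrase it differently):

  api:
    environment:
      METRICS_WINDOW_SECONDS: "{{ policy.thresholds.metrics_window_seconds }}"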

swiftdeploy validate runs five pre-flight checks before any container starts (a condensed sketch follows the list):

  1. manifest.yaml exists and parses
  2. All required fields are non-empty (including the full policy block)
  3. docker image inspect <services.image> succeeds
  4. nginx.port is free on the host
  5. Rendered nginx.conf passes nginx -t inside a throwaway container
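A condensed sketch of those checks in Python; structure and error handling are simplified, and the real CLI reports each failure separately:

# Hypothetical condensed form of the five pre-flight checks.
import os, socket, subprocess, yaml

def validate(path="manifest.yaml"):
    with open(path) as f:
        manifest = yaml.safe_load(f)                                  # 1. exists and parses
    assert manifest["policy"]["thresholds"], "policy block is required"  # 2. required fields (simplified)
    subprocess.run(["docker", "image", "inspect",
                    manifest["services"]["image"]],
                   check=True, capture_output=True)                   # 3. image present locally
    with socket.socket() as s:                                        # 4. nginx.port free on host
        s.bind(("0.0.0.0", int(manifest["nginx"]["port"])))
    subprocess.run(["docker", "run", "--rm",
                    "-v", f"{os.getcwd()}/nginx.conf:/etc/nginx/nginx.conf:ro",
                    manifest["nginx"]["image"], "nginx", "-t"],
                   check=True)                                        # 5. nginx -t in a throwaway container
    return manifest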

5. The eyes: Prometheus instrumentation

The FastAPI app exposes GET /metrics in Prometheus text format. There are two layers of middleware:

request in
    │
    ▼
[chaos middleware]       <- injects slow/error in canary mode (skipped on POST /chaos)
    │
    ▼
[prometheus middleware]  <- times the full stack including chaos delay
    │
    ▼
route handler

Standard metrics:

http_requests_total{method, path, status_code}
http_request_duration_seconds{method, path}   <- histogram, standard buckets
app_uptime_seconds
app_mode                                       <- 0=stable, 1=canary
chaos_active                                   <- 0=none, 1=slow, 2=error

Rolling-window gauges (what OPA queries for canary SLOs):

swiftdeploy_window_requests_total       <- count of requests in last N seconds
swiftdeploy_window_errors_total         <- 5xx count in window
swiftdeploy_window_p99_latency_seconds  <- in-process P99 over same window

The window is a collections.deque. On every request, a (timestamp, duration, is_error) tuple is appended, stale entries are evicted, and the three gauges are recomputed — P99 via sorted index. No external TSDB needed; the gauge values are always current when scraped.
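A minimal sketch of that bookkeeping, using prometheus_client gauges with the names shown above (the real middleware differs in detail):

# Rolling-window gauges over a deque; a sketch, not the actual middleware.
import time
from collections import deque
from prometheus_client import Gauge

WINDOW_SECONDS = 30
WINDOW = deque()  # (timestamp, duration_seconds, is_error) tuples

window_requests = Gauge("swiftdeploy_window_requests_total", "Requests in window")
window_errors = Gauge("swiftdeploy_window_errors_total", "5xx responses in window")
window_p99 = Gauge("swiftdeploy_window_p99_latency_seconds", "In-process P99 over window")

def record(duration: float, is_error: bool) -> None:
    now = time.monotonic()
    WINDOW.append((now, duration, is_error))
    while WINDOW and now - WINDOW[0][0] > WINDOW_SECONDS:   # evict stale entries
        WINDOW.popleft()
    durations = sorted(d for _, d, _ in WINDOW)
    window_requests.set(len(WINDOW))
    window_errors.set(sum(1 for _, _, e in WINDOW if e))
    # P99 via sorted index, as described above
    window_p99.set(durations[int(0.99 * (len(durations) - 1))] if durations else 0.0)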


6. The brain: OPA policy sidecar

Why OPA instead of if-statements in the CLI?

The key constraint: the CLI must not make any allow/deny decision itself. With if-statements in Python, the logic and the thresholds are co-located with the operator tool. With OPA:

  • Thresholds live only in manifest.yaml (one place to change for all environments)
  • Policy logic lives only in .rego (auditable, testable with opa test)
  • The CLI is a dumb messenger — it assembles context, posts it, and reads back a decision object (see the sketch below)
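That messenger role is a single POST against OPA's standard Data API. A hedged sketch; the real CLI wraps this in the failure handling described further down:

# Post a context document to OPA and return the decision; a sketch only.
import requests

def opa_decision(domain: str, context: dict, host_port: int = 9182) -> dict:
    url = f"http://127.0.0.1:{host_port}/v1/data/swiftdeploy/{domain}/decision"
    resp = requests.post(url, json={"input": context}, timeout=5)
    resp.raise_for_status()
    # result carries allowed, reasons[] and checks[] as shown below
    return resp.json()["result"]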

Domain isolation

Each policy domain owns exactly one question and one data shape:

| Domain | Question | Input shape |
| --- | --- | --- |
| swiftdeploy.infrastructure | Is the host healthy enough to deploy? | {phase, host: {disk_free_gb, cpu_load_1m, mem_available_gb}, thresholds} |
| swiftdeploy.canary | Is the canary safe enough to promote? | {phase, promotion_target, metrics: {error_rate_percent, p99_latency_ms, window_seconds}, thresholds} |

A change to the infrastructure rules never touches canary/policy.rego and vice versa.

Decision structure (never a bare boolean)

Every decision document carries per-rule checks:

decision := {
    "allowed": count(reasons) == 0,
    "domain": "infrastructure",
    "phase": input.phase,
    "reasons": sort([r | reasons[r]]),
    "checks": [
        {"rule_id": "infra_disk_free_minimum",
         "passed": disk_ok, "detail": disk_detail},
        {"rule_id": "infra_cpu_load_maximum",
         "passed": cpu_ok,  "detail": cpu_detail},
        {"rule_id": "infra_memory_available_minimum",
         "passed": mem_ok,  "detail": mem_detail},
    ],
    ...
}

The CLI iterates checks[] directly for the live status display — it never infers pass/fail itself.
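The individual checks are plain comparisons against input.thresholds.*. A simplified sketch of what one such rule could look like, not the exact policy shipped in the repo:

package swiftdeploy.infrastructure

import rego.v1

default disk_ok := false

disk_ok if input.host.disk_free_gb >= input.thresholds.min_disk_free_gb

disk_detail := sprintf("PASS: disk free %.2f GB meets minimum %.2f GB.",
    [input.host.disk_free_gb, input.thresholds.min_disk_free_gb]) if disk_ok

disk_detail := sprintf("FAIL: disk free %.2f GB below minimum %.2f GB.",
    [input.host.disk_free_gb, input.thresholds.min_disk_free_gb]) if not disk_ok

reasons contains msg if {
    not disk_ok
    msg := sprintf("Policy violation: disk free (%.2f GB) below minimum (%.2f GB).",
        [input.host.disk_free_gb, input.thresholds.min_disk_free_gb])
}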

Failure handling

Every distinct failure mode has a unique failure_kind and a human-readable message:

| Situation | failure_kind | Message shown to operator |
| --- | --- | --- |
| OPA container not started | opa_connection_refused | "Start with: docker compose up -d opa" |
| OPA slow to respond | opa_timeout | "OPA request timed out (read)" |
| OPA returns non-JSON | opa_bad_json | Includes raw snippet |
| OPA returns no result key | opa_no_result | Includes raw snippet |
| psutil not installed | host_stats_unavailable | Install instructions |

None of these paths crash or hang the CLI.
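A hedged sketch of how those transport failures can be normalized into (failure_kind, message) pairs; the exact wording and structure in the CLI may differ:

# Map transport failures to stable failure_kind values; illustrative only.
import requests

def safe_opa_post(url: str, payload: dict) -> dict:
    try:
        resp = requests.post(url, json=payload, timeout=(3, 5))
    except requests.exceptions.ConnectionError:
        return {"failure_kind": "opa_connection_refused",
                "message": "Start with: docker compose up -d opa"}
    except requests.exceptions.ReadTimeout:
        return {"failure_kind": "opa_timeout",
                "message": "OPA request timed out (read)"}
    try:
        body = resp.json()
    except ValueError:
        return {"failure_kind": "opa_bad_json", "message": resp.text[:200]}
    if "result" not in body:
        return {"failure_kind": "opa_no_result", "message": str(body)[:200]}
    return {"result": body["result"]}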


7. Gated lifecycle: deploy and promote

swiftdeploy deploy

init (render nginx.conf + docker-compose.yml)
    |
    v
docker compose up -d opa
    |
    v
wait_opa_ready (polls /health, up to 75s)
    |
    v
collect_host_stats --> POST /v1/data/swiftdeploy/infrastructure/decision
                                |
                      +---------+-----------+
                      |                     |
                 allowed: false        allowed: true
                      |                     |
               print FAIL checks    docker compose up --build -d
               exit(1)                      |
               (no stack up)                v
                                   poll GET /healthz via nginx

Real output on a day when CPU spiked:

Policy compliance (infrastructure (pre-deploy)):
  [PASS] infra_disk_free_minimum: PASS: disk free 66.57 GB meets minimum 10.00 GB.
  [FAIL] infra_cpu_load_maximum: FAIL: CPU load 2.52 exceeds maximum 2.00.
  [PASS] infra_memory_available_minimum: PASS: memory available 8.10 GB meets minimum 1.00 GB.
[swiftdeploy] POLICY VIOLATION - deploy blocked (infrastructure).
  - Policy violation: CPU load (2.52) exceeds maximum allowed (2.00).

The stack never started. No compose up ran. The OPA sidecar is the only container that exists at this point.

swiftdeploy promote canary

Before rewriting manifest.yaml, the CLI:

  1. Scrapes GET /metrics via Nginx
  2. Derives error_rate_percent and p99_latency_ms from the rolling-window gauges
  3. Posts to swiftdeploy/canary/decision with promotion_target: "canary"
  4. On allowed: false — exits without touching manifest.yaml

Promoting to stable takes a different Rego branch that skips SLO evaluation entirely (there are no "canary metrics" to check when moving away from canary).
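Deriving the two SLO numbers from the scrape is straightforward. A minimal sketch using prometheus_client's text parser (the real CLI may parse differently):

# Scrape /metrics and derive the canary SLO inputs; a sketch only.
import requests
from prometheus_client.parser import text_string_to_metric_families

def canary_metrics(base_url="http://127.0.0.1:8080"):
    text = requests.get(f"{base_url}/metrics", timeout=5).text
    values = {}
    for family in text_string_to_metric_families(text):
        for sample in family.samples:
            values[sample.name] = sample.value
    total = values.get("swiftdeploy_window_requests_total", 0)
    errors = values.get("swiftdeploy_window_errors_total", 0)
    p99_seconds = values.get("swiftdeploy_window_p99_latency_seconds", 0.0)
    return {
        "error_rate_percent": (errors / total * 100) if total else 0.0,
        "p99_latency_ms": p99_seconds * 1000,
    }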


8. The live dashboard: swiftdeploy status

python swiftdeploy status --interval 2 -n 5

Each sample scrapes /healthz, /metrics, and both OPA domains independently, then prints:

=== 2026-05-06T18:16:17Z  mode='stable'  req/s~=3.2100 ===
  window(30s): errors=2/41 err_rate=4.8780% p99=312.45ms
  chaos_active: 2 (error)
  Policy compliance (infrastructure (pre-deploy)):
    [PASS] infra_disk_free_minimum: PASS: disk free 66.62 GB meets minimum 10.00 GB.
    [PASS] infra_cpu_load_maximum: PASS: CPU load 0.89 is within maximum 2.00.
    [PASS] infra_memory_available_minimum: PASS: memory available 11.38 GB meets minimum 1.00 GB.
  OPA [infrastructure (pre-deploy)] aggregate: ALLOW
  Policy compliance (canary (hypothetical promote->canary)):
    [FAIL] canary_error_rate_window: FAIL: error rate 4.8780% exceeds maximum 1.0000% over 30 s window.
    [PASS] canary_p99_latency_window: PASS: P99 latency 312.45 ms within maximum 500.00 ms over 30 s window.
  OPA [canary (hypothetical promote->canary)] aggregate: DENY

Every sample is appended as one JSON line to history.jsonl including chaos_active, window metrics, and both OPA snapshots with their checks[].
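The append itself is trivial. A sketch with an illustrative record shape (field names are assumptions; the values mirror the sample above):

# Append one status sample as a JSON line; record fields are illustrative.
import json, datetime

def append_history(sample: dict, path: str = "history.jsonl") -> None:
    with open(path, "a") as f:
        f.write(json.dumps(sample) + "\n")

append_history({
    "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "mode": "stable",
    "chaos_active": 2,
    "window": {"requests": 41, "errors": 2, "err_rate_percent": 4.878, "p99_ms": 312.45},
    "opa": {"infrastructure": {"allowed": True}, "canary": {"allowed": False}},
})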


9. The memory: swiftdeploy audit

python swiftdeploy audit

audit_report.md is generated from history.jsonl with four sections:

  • Summary — sample count, denial count, transport error count
  • Timeline events — mode transitions and chaos transitions detected by diffing consecutive records
  • Violations — every allowed: false from any domain, with reasons
  • Recent timeline — last 25 samples in a table with Chaos column and per-domain OPA status

Example timeline events table:

| Time (UTC) | Event | Detail |
| --- | --- | --- |
| 2026-05-06T18:17:02Z | chaos_change | none -> error |
| 2026-05-06T18:20:14Z | mode_change | stable -> canary |
| 2026-05-06T18:23:41Z | chaos_change | error -> none |
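Those timeline events fall out of a simple diff over consecutive history records. A minimal sketch, with illustrative field names:

# Detect mode/chaos transitions by diffing consecutive history records.
import json

def timeline_events(path: str = "history.jsonl"):
    events, prev = [], None
    for line in open(path):
        if not line.strip():
            continue
        rec = json.loads(line)
        if prev is not None:
            for field, label in (("mode", "mode_change"), ("chaos_active", "chaos_change")):
                if rec.get(field) != prev.get(field):
                    events.append({"time": rec.get("ts"), "event": label,
                                   "detail": f"{prev.get(field)} -> {rec.get(field)}"})
        prev = rec
    return events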

10. Injecting chaos and watching the gates fire

In canary mode, POST /chaos arms the process-global chaos state:

# arm 40% error rate
curl -s -X POST http://127.0.0.1:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode": "error", "rate": 0.40}'

# arm 2-second slow response on every request
curl -s -X POST http://127.0.0.1:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode": "slow", "duration": 2.0}'

# recover
curl -s -X POST http://127.0.0.1:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode": "recover"}'

With 40% error rate active and traffic flowing, the status dashboard shows canary_error_rate_window FAIL within one 30-second window. Attempting swiftdeploy promote canary while this is true produces:

Swiftdeploy promote canary blocked by OPA canary safety policy

  Policy compliance (canary (pre-promote)):
    [FAIL] canary_error_rate_window: FAIL: error rate 50.8772% exceeds maximum 1.0000% over 30 s window.
    [PASS] canary_p99_latency_window: PASS: P99 latency 1.96 ms within maximum 500.00 ms over 30 s window.
[swiftdeploy] POLICY VIOLATION - promote blocked (canary safety policy).
  - Policy violation: error rate (50.8772%) exceeds maximum (1.0000%) over last 30 seconds.

manifest.yaml is not modified. After recovering and waiting for the window to clear, the same command succeeds.


11. Lessons learned

1. One source of truth is a forcing function, not a convenience.
When thresholds are only in manifest.yaml and nowhere else, you cannot accidentally have a tighter limit in the Rego file than in your runbook. The manifest is the runbook.

2. OPA's value is in the separation, not the language.
Rego has a learning curve. The real benefit is that a policy change is a PR to a .rego file with a clear audit trail, not a diff buried inside deployment tooling.

3. Rolling-window gauges beat querying a TSDB for CLI gates.
The alternative — running Prometheus Server just to evaluate a PromQL expression at deploy time — adds infrastructure for something the app can compute in-process with a deque. The CLI scrapes the gauge, not the raw counter buckets.

4. Failure modes are the real API.
The most useful work in this project was not the happy path. It was giving every OPA transport failure a distinct failure_kind and message so an operator at 2am knows immediately whether OPA is down, slow, returning bad JSON, or returning a policy decision that says no.

5. Windows CPU approximation is not Linux load average.
The infrastructure policy uses 1-minute load average on Linux. On Windows, psutil.cpu_percent x logical_cpus spikes aggressively during container start. The gate working correctly the first time it fired was both the most satisfying and most annoying moment of the project.


Repository

github.com/Trojanhorse7/swift-deploy

