How I built a manifest-driven CLI that generates its own infrastructure, enforces environment policy through OPA, observes the running system in real time, and audits every decision it makes
Most deployment tools ask you to write configuration files. SwiftDeploy asks you to describe your intent once, then writes the configuration files for you - and refuses to deploy unless the environment is safe enough to proceed.
This post covers the complete journey of building SwiftDeploy across two stages. Stage A established the foundation: a declarative CLI that generates Docker Compose and Nginx configuration from a single manifest, manages the container lifecycle, and supports stable/canary promotion. Stage B added the intelligence layer: Prometheus metrics, an Open Policy Agent sidecar that gates every deployment and promotion, a live status dashboard, and an append-only audit trail.
A reader who follows this post from beginning to end should be able to replicate everything.
## The Problem SwiftDeploy Solves
Consider a typical deployment workflow without tooling:
- Write a `docker-compose.yml` by hand
- Write an `nginx.conf` by hand
- Manually update both files every time a port, timeout, or service name changes
- Hope you didn't introduce drift between the two files
- Have no policy gate before deployment
- Have no audit trail of what changed and when
SwiftDeploy inverts this. You edit exactly one file - `manifest.yaml`. The CLI derives everything else from it. If you delete the generated files and run `swiftdeploy init`, they come back identically. The manifest is the single source of truth, and the tool enforces that guarantee mechanically.
## Architecture
Here is the complete system after both stages:
```text
manifest.yaml            ← the only file you edit manually
      │
      ▼
swiftdeploy CLI
      │
      ├── OPA policy check (pre-deploy / pre-promote)
      │        │
      │        ▼
      │   policies/*.rego
      │   + policy_limits from manifest
      │
      ▼
templates/
  ├── docker-compose.yml.tpl
  └── nginx.conf.tpl
      │
      ▼
┌──────────────────────────────────────────┐
│              Docker Network              │
│                                          │
│  ┌──────────────┐    ┌─────────────────┐ │
│  │     App      │    │      Nginx      │ │
│  │    :3000     │◄───│ :8080 (public)  │ │
│  │   /metrics   │    └─────────────────┘ │
│  │   /healthz   │                        │
│  │    /chaos    │    ┌─────────────────┐ │
│  └──────────────┘    │       OPA       │ │
│                      │ :8181 (loopback)│ │
│                      └─────────────────┘ │
└──────────────────────────────────────────┘
      │
      ▼
history.jsonl            ← append-only audit trail
      │
      ▼
audit_report.md          ← generated on demand
```
Key isolation rules:
- The app is never exposed directly to the host - all traffic goes through Nginx
- OPA is bound to `127.0.0.1:8181` only - not reachable via the Nginx port or from the internet
- The CLI is the only component that talks to OPA
- The app has no knowledge of policy decisions
## Stage A: The Engine

### The Manifest

Everything starts with `manifest.yaml`. This file describes the entire deployment intent:
```yaml
services:
  image: swift-deploy-1-node:latest
  port: 3000
  mode: stable
  version: "1.0.0"
  restart_policy: unless-stopped

nginx:
  image: nginx:latest
  port: 8080
  proxy_timeout: 10
  contact: o.odimayo@gbadedata.com

opa:
  image: openpolicyagent/opa:latest
  port: 8181
  policies_dir: policies
  decision_timeout_seconds: 5

network:
  name: swiftdeploy-net
  driver_type: bridge

logs:
  volume_name: swiftdeploy-logs

policy_limits:
  infrastructure:
    min_disk_free_gb: 10
    max_cpu_load: 2.0
  canary:
    max_error_rate: 0.01
    max_p99_latency_ms: 500

audit:
  history_file: history.jsonl
  report_file: audit_report.md
```
Every value that the generated configuration needs comes from here. No hardcoded values exist in any template or policy file.
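The post doesn't show the loader itself, but it is only a few lines. A minimal sketch, assuming PyYAML and the `load_manifest` name the later snippets use:

```python
import yaml
from pathlib import Path

def load_manifest(path=Path("manifest.yaml")):
    # utf-8-sig transparently strips a BOM if one sneaks in on Windows
    # (see lesson 7 at the end of this post)
    with path.open("r", encoding="utf-8-sig") as f:
        return yaml.safe_load(f)
```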
### Template-based Config Generation

The CLI uses Jinja2 to render `docker-compose.yml` and `nginx.conf` from templates. This is how `swiftdeploy init` works:
```python
from jinja2 import Environment, FileSystemLoader

def render_template(template_name, output_path, manifest):
    env = Environment(
        # utf-8-sig strips a leading BOM from templates saved on Windows
        loader=FileSystemLoader(TEMPLATE_DIR, encoding="utf-8-sig"),
        autoescape=False,
        trim_blocks=True,
        lstrip_blocks=True,
    )
    template = env.get_template(template_name)
    # strip any stray BOM, then write raw bytes so no new one is introduced
    rendered = template.render(**manifest).lstrip("\ufeff")
    output_path.write_bytes(rendered.encode("utf-8"))
```
The template for the app service in `docker-compose.yml.tpl` looks like this:
```yaml
services:
  app:
    image: {{ services.image }}
    container_name: swiftdeploy-app
    restart: {{ services.restart_policy }}
    user: "10001:10001"
    cap_drop:
      - ALL
    security_opt:
      - no-new-privileges:true
    environment:
      MODE: "{{ services.mode }}"
      APP_VERSION: "{{ services.version }}"
      APP_PORT: "{{ services.port }}"
    expose:
      - "{{ services.port }}"
```
Notice `expose` instead of `ports`. This is deliberate - the app is reachable inside the Docker network but not from the host machine. All external traffic must go through Nginx.
### The Five Validation Checks
Before any deployment, the CLI runs five pre-flight checks:
```text
[PASS] manifest.yaml exists and is valid YAML
[PASS] All required fields are present and non-empty
[PASS] Docker image exists locally: swift-deploy-1-node:latest
[PASS] Nginx port is free on host: 8080
[PASS] Generated nginx.conf is syntactically valid
```
The Nginx syntax check is particularly interesting - it runs `nginx -t` inside a temporary container with a host mapping for the `app` upstream name, so name resolution works without the full stack being active:
```python
command = [
    "docker", "run", "--rm",
    "--add-host", "app:127.0.0.1",
    "-v", f"{NGINX_FILE}:/etc/nginx/nginx.conf:ro",
    nginx_image, "nginx", "-t",
]
```
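The post's `run_command` helper isn't shown; one plausible way to execute the check and surface the result, assuming plain `subprocess`:

```python
import subprocess

# nginx -t exits non-zero when the rendered config is invalid
result = subprocess.run(command, capture_output=True, text=True)
if result.returncode != 0:
    print("[FAIL] Generated nginx.conf failed syntax check:")
    print(result.stderr.strip())
```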
### The Application Service
The app is a Python Flask service that exposes four endpoints:
- `GET /` - welcome message with mode, version, and timestamp
- `GET /healthz` - liveness check with uptime in seconds
- `GET /metrics` - Prometheus-format metrics (added in Stage B)
- `POST /chaos` - fault injection, canary mode only (a hedged sketch follows)
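The post doesn't reproduce the chaos handler, but its behaviour is described well enough to sketch one. This is a guess at the shape, not the actual implementation - `chaos_state` and the 403 guard are assumptions:

```python
import os
from flask import abort, jsonify, request

chaos_state = {"mode": "none", "rate": 0.0}

@app.post("/chaos")
def set_chaos():
    if os.environ.get("MODE") != "canary":
        abort(403)  # fault injection is canary-only
    body = request.get_json(force=True)
    chaos_state["mode"] = body.get("mode", "none")      # e.g. "slow" or "error"
    chaos_state["rate"] = float(body.get("rate", 0.0))  # fraction of requests affected
    return jsonify(chaos_state)
```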
### Stable and Canary Promotion
Promotion updates the manifest in-place, regenerates the Compose file, and restarts only the app container - Nginx is untouched:
```python
def cmd_promote(args):
    target_mode = args.mode.lower()
    manifest = load_manifest()
    # policy check runs here for `promote stable` (Stage B)
    manifest["services"]["mode"] = target_mode
    write_manifest(manifest)
    render_template("docker-compose.yml.tpl", COMPOSE_FILE, manifest)
    run_command([
        "docker", "compose", "up", "-d",
        "--no-deps", "--force-recreate", "app",
    ])
```
The same image runs in both modes. The difference is the `MODE` environment variable injected by the generated Compose file - no rebuild required.
### Nginx: Structured Logging and JSON Errors
The generated nginx.conf enforces three important behaviours.
**Required access log format:**

```nginx
log_format swiftdeploy '$time_iso8601 | $status | ${request_time}s | $upstream_addr | $request';
```

**Example output:**

```text
2026-05-06T23:04:50+00:00 | 200 | 0.001s | 172.18.0.2:3000 | GET / HTTP/1.1
```
**JSON error responses for upstream failures:**

```nginx
error_page 502 = @error502;

location @error502 {
    return 502 '{"error":"bad gateway","code":502,"service":"swiftdeploy","contact":"o.odimayo@gbadedata.com"}';
}
```
**Platform headers on every response:**

```nginx
add_header X-Deployed-By swiftdeploy always;
proxy_pass_header X-Mode;
```
### Security Hardening

Every container runs with least privilege:

```yaml
user: "10001:10001"   # app
cap_drop:
  - ALL
security_opt:
  - no-new-privileges:true
```
Images are built from `python:3.12-slim` and verified to stay under 300 MB. No secrets are baked into any image - all configuration is injected at runtime via environment variables.
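You can reproduce the size check yourself with a one-liner:

```bash
docker image ls swift-deploy-1-node:latest --format "{{.Size}}"
```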
## Stage B: The Intelligence Layer

### Part 1: Instrumentation

The first requirement was a `/metrics` endpoint in Prometheus text format. I used the `prometheus_client` library and Flask's request hooks to instrument every endpoint automatically:
```python
from prometheus_client import Counter, Gauge, Histogram

HTTP_REQUESTS_TOTAL = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "path", "status_code"],
)

HTTP_REQUEST_DURATION_SECONDS = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    ["method", "path"],
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2, 5),
)

APP_MODE = Gauge("app_mode", "Application mode: 0=stable, 1=canary")
CHAOS_ACTIVE = Gauge("chaos_active", "Chaos state: 0=none, 1=slow, 2=error")
APP_UPTIME_SECONDS = Gauge("app_uptime_seconds", "Application uptime in seconds")
```
The `after_request` hook records every request automatically:
```python
# the matching before_request hook (implied by the getattr fallback below)
# stamps the start time on Flask's per-request `g` object
@app.before_request
def before_request():
    g.request_started_at = time.monotonic()

@app.after_request
def after_request(response):
    duration = time.monotonic() - getattr(g, "request_started_at", time.monotonic())
    HTTP_REQUESTS_TOTAL.labels(
        method=request.method,
        path=request.path,
        status_code=str(response.status_code),
    ).inc()
    HTTP_REQUEST_DURATION_SECONDS.labels(
        method=request.method,
        path=request.path,
    ).observe(duration)
    return response
```
The `/metrics` endpoint itself is just two lines:
```python
@app.get("/metrics")
def metrics():
    update_runtime_metrics()
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)
```
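What a scrape actually returns is plain text in the Prometheus exposition format - an illustrative excerpt (values made up):

```text
http_requests_total{method="GET",path="/",status_code="200"} 42.0
http_request_duration_seconds_bucket{le="0.05",method="GET",path="/"} 41.0
http_request_duration_seconds_bucket{le="+Inf",method="GET",path="/"} 42.0
app_mode 1.0
chaos_active 0.0
app_uptime_seconds 128.42
```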
### Part 2: The Policy Engine
This was the most architecturally significant part of Stage B. The requirement was absolute: the CLI must not make any allow/deny decision itself. All decision logic lives exclusively in OPA.
#### Why this matters

If the CLI embeds policy logic in Python, changing a threshold requires deploying new CLI code. When OPA owns the decisions, changing a threshold means editing one line in `manifest.yaml`. The operational difference is enormous - and the auditing story is much cleaner.
#### OPA in the Docker Compose template
OPA runs as a fourth service in the generated stack:
```yaml
opa:
  image: {{ opa.image }}
  container_name: swiftdeploy-opa
  command:
    - "run"
    - "--server"
    - "--addr=0.0.0.0:{{ opa.port }}"
    - "/policies"
  ports:
    - "127.0.0.1:{{ opa.port }}:{{ opa.port }}"
  volumes:
    - ./{{ opa.policies_dir }}:/policies:ro
  cap_drop:
    - ALL
  security_opt:
    - no-new-privileges:true
```
The critical detail: `127.0.0.1:8181:8181` binds OPA only to the host loopback interface. It is reachable by the CLI but not via the Nginx port. The OPA API cannot be queried or probed from the internet.
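You can verify the binding from the host, which is the only place OPA listens. A manual decision query against OPA's standard Data API looks like this (payload values illustrative):

```bash
curl -s http://127.0.0.1:8181/v1/data/swiftdeploy/infrastructure/decision \
  -H "Content-Type: application/json" \
  -d '{"input":{"context":"pre_deploy","stats":{"disk_free_gb":42,"cpu_load":0.5},"limits":{"min_disk_free_gb":10,"max_cpu_load":2.0}}}'
```

The same request sent to the public port 8080 is proxied to the app, which has no such route - the policy API is simply unreachable from outside.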
#### Policy organisation
Each domain owns exactly one question and one set of input data. A change to the canary policy never touches the infrastructure policy.
**Infrastructure policy** (`policies/infrastructure.rego`) - answers: *is this host safe to deploy on?*
```rego
package swiftdeploy.infrastructure

default allow := false

disk_ok if {
    input.stats.disk_free_gb >= input.limits.min_disk_free_gb
}

cpu_ok if {
    input.stats.cpu_load <= input.limits.max_cpu_load
}

allow if {
    input.context == "pre_deploy"
    disk_ok
    cpu_ok
}

reasons contains msg if {
    not disk_ok
    msg := sprintf("Disk free %vGB is below required minimum %vGB",
        [input.stats.disk_free_gb, input.limits.min_disk_free_gb])
}

reasons contains msg if {
    not cpu_ok
    msg := sprintf("CPU load %.2f exceeds allowed maximum %.2f",
        [input.stats.cpu_load, input.limits.max_cpu_load])
}

decision := {
    "domain": "infrastructure",
    "question": "pre_deploy",
    "allow": allow,
    "reasons": reasons,
}
```
**Canary safety policy** (`policies/canary.rego`) - answers: *is the canary healthy enough to promote to stable?*
```rego
package swiftdeploy.canary

default allow := false

error_rate_ok if {
    input.context == "pre_promote"
    input.metrics.error_rate <= input.limits.max_error_rate
}

p99_latency_ok if {
    input.context == "pre_promote"
    input.metrics.p99_latency_ms <= input.limits.max_p99_latency_ms
}

allow if {
    input.context == "pre_promote"
    error_rate_ok
    p99_latency_ok
}

# `reasons` rules (elided here) mirror the infrastructure policy,
# emitting one message per failed check

decision := {
    "domain": "canary",
    "question": "pre_promote",
    "allow": allow,
    "reasons": reasons,
}
```
#### Thresholds in the manifest, not in Rego

Neither policy file contains a hardcoded number. The limits come from `input.limits`, which the CLI sends as part of the JSON payload. The actual values live in `manifest.yaml` under `policy_limits`. Tune thresholds by editing the manifest - no Rego files need to be touched.
#### The pre-deploy gate
Before starting the stack, the CLI collects host stats and queries OPA:
```python
def pre_deploy_policy_check(manifest):
    stats = get_host_stats()
    limits = manifest["policy_limits"]["infrastructure"]
    payload = {
        "input": {
            "context": "pre_deploy",
            "stats": stats,
            "limits": limits,
        }
    }
    decision = call_opa(manifest, "swiftdeploy/infrastructure/decision", payload)
    print_policy_decision(decision)
    if not decision.get("allow"):
        append_history(manifest, "policy_violation", {
            "domain": decision.get("domain"),
            "reasons": decision.get("reasons", []),
            "stats": stats,
        })
        return False
    return True
```
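`get_host_stats` isn't shown in the post; a minimal cross-platform sketch using only the standard library, with key names matching what the policy expects:

```python
import os
import shutil

def get_host_stats():
    disk = shutil.disk_usage("/")
    # os.getloadavg is Unix-only; fall back to 0.0 on Windows
    load = os.getloadavg()[0] if hasattr(os, "getloadavg") else 0.0
    return {
        "disk_free_gb": round(disk.free / 1024**3, 1),
        "cpu_load": round(load, 2),
    }
```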
To prove the hard gate, I temporarily set `min_disk_free_gb: 99999` and ran `swiftdeploy deploy`:
```text
[POLICY][FAIL] infrastructure.pre_deploy
  - Disk free 2GB is below required minimum 99999GB
[FAIL] Deployment blocked by policy.
```
The stack never starts. The block is absolute.
#### The pre-promote gate

Before promoting from canary to stable, the CLI scrapes `/metrics`, calculates the current error rate and P99 latency, and queries OPA:
```python
def pre_promote_policy_check(manifest):
    metrics_text = scrape_metrics(manifest)
    samples = parse_prometheus_metrics(metrics_text)
    observed = calculate_observed_metrics(samples)
    limits = manifest["policy_limits"]["canary"]
    payload = {
        "input": {
            "context": "pre_promote",
            "metrics": {
                "error_rate": observed["error_rate"],
                "p99_latency_ms": observed["p99_latency_ms"],
            },
            "limits": limits,
        }
    }
    decision = call_opa(manifest, "swiftdeploy/canary/decision", payload)
    print_policy_decision(decision)
    return bool(decision.get("allow"))
```
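`calculate_observed_metrics` isn't reproduced in the post. A hedged sketch of the error-rate half, using the text parser that ships with `prometheus_client`:

```python
from prometheus_client.parser import text_string_to_metric_families

def calculate_error_rate(metrics_text):
    total = errors = 0.0
    for family in text_string_to_metric_families(metrics_text):
        for sample in family.samples:
            if sample.name != "http_requests_total":
                continue
            if sample.labels.get("path") in ("/healthz", "/metrics"):
                continue  # probe endpoints must not skew the rate
            total += sample.value
            if sample.labels.get("status_code", "").startswith("5"):
                errors += sample.value
    return errors / total if total else 0.0
```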
#### OPA failure handling
The CLI handles every distinct failure mode explicitly:
| Failure | Message |
|---|---|
| Connection refused | OPA unavailable at http://127.0.0.1:8181 |
| Request timed out | OPA decision timed out after 5s |
| Non-200 response | OPA returned HTTP 503 |
| Non-JSON response | OPA returned non-JSON response |
| Missing result | OPA response did not include a decision result |
In every case the operation is blocked and the event is recorded in the audit trail.
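The `call_opa` wrapper isn't shown either; the failure table above implies a shape like this (a sketch, not the actual code):

```python
import requests

def fail(message):
    print(f"[FAIL] {message}")
    return {}  # empty decision: .get("allow") is falsy, so the operation is blocked

def call_opa(manifest, decision_path, payload):
    port = manifest["opa"]["port"]
    timeout = manifest["opa"]["decision_timeout_seconds"]
    url = f"http://127.0.0.1:{port}/v1/data/{decision_path}"
    try:
        response = requests.post(url, json=payload, timeout=timeout)
    except requests.exceptions.Timeout:
        return fail(f"OPA decision timed out after {timeout}s")
    except requests.exceptions.ConnectionError:
        return fail(f"OPA unavailable at http://127.0.0.1:{port}")
    if response.status_code != 200:
        return fail(f"OPA returned HTTP {response.status_code}")
    try:
        body = response.json()
    except ValueError:
        return fail("OPA returned non-JSON response")
    if "result" not in body:
        return fail("OPA response did not include a decision result")
    return body["result"]
```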
### Part 3: The Chaos - What Happened When I Injected Failures
With the canary safety gate in place, I needed to prove it actually blocks unsafe promotions.
**Deploy stable, promote to canary:**

```text
[PASS] Health check passed: mode=stable, version=1.0.0
[PASS] Promotion confirmed through /healthz: mode=canary
```
**Inject 50% error chaos:**

```bash
curl -X POST http://127.0.0.1:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode":"error","rate":0.5}'
```
**Generate traffic to build up the error rate:**

```text
500 200 500 200 500 500 200 500 200 200
```
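The post doesn't show the traffic generator; any loop against the public port works, for example:

```bash
for i in $(seq 1 20); do
  curl -s -o /dev/null -w "%{http_code} " http://127.0.0.1:8080/
done; echo
```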
Run the status dashboard - this is where it gets interesting:
```text
SwiftDeploy Status
==================
Timestamp: 2026-05-06T23:12:26.328363+00:00
Mode: canary
Chaos: error
Req/s: 0.00
Error rate: 38.10%
P99 latency: 5.00ms
Uptime: 128.42s

Policy Compliance
-----------------
[PASS] infrastructure.pre_deploy
       - Infrastructure policy passed
[FAIL] canary.pre_promote
       - Error rate 0.380952 exceeds allowed maximum 0.01
```
The status dashboard scrapes `/metrics`, calculates error rate from raw Prometheus counters, and queries both OPA domains independently on every interval. The infrastructure policy passes (the host is healthy) while the canary policy fails (the service is broken).
**Attempt promotion - blocked:**
```text
[POLICY][FAIL] canary.pre_promote
  - Error rate 0.380952 exceeds allowed maximum 0.01
[FAIL] Promotion blocked by policy.
```
This is the safety guarantee the system is designed to provide. A broken canary cannot accidentally become the stable deployment.
**Recover, generate clean traffic, promote successfully:**

```text
[POLICY][PASS] canary.pre_promote
  - Canary safety policy passed
[PASS] Promotion confirmed through /healthz: mode=stable
```
### Part 4: The Status Dashboard

`swiftdeploy status` is a live-refreshing terminal dashboard. It runs continuously, scraping `/metrics` on every interval and appending every result to `history.jsonl`.
The interesting engineering challenge was calculating P99 latency from raw Prometheus histogram buckets without a Prometheus server. Prometheus histograms store cumulative bucket counts - to find P99, you find the smallest bucket whose cumulative count reaches 99% of total requests:
```python
def calculate_p99_from_buckets(bucket_totals):
    total_count = bucket_totals.get("+Inf", 0.0)
    if total_count == 0:
        return 0.0
    target = total_count * 0.99
    numeric_buckets = sorted(
        [(float(le), count)
         for le, count in bucket_totals.items()
         if le != "+Inf"],
        key=lambda x: x[0],
    )
    for upper_bound, count in numeric_buckets:
        if count >= target:
            return upper_bound * 1000  # convert seconds to milliseconds
    return numeric_buckets[-1][0] * 1000
```
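As a worked example: with cumulative buckets `{0.005: 90, 0.01: 99, "+Inf": 100}`, the target is `100 × 0.99 = 99`; the first bucket whose count reaches 99 is `le=0.01`, so the function reports a P99 of 10ms. The answer is quantised to bucket boundaries - an inherent property of histogram-based percentiles.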
Health check and metrics paths are excluded from all calculations to avoid skewing the error rate and latency numbers.
### Part 5: The Audit Trail

Every significant event is written to `history.jsonl` as a JSON line:
{"timestamp": "2026-05-06T23:40:04+00:00", "event_type": "deploy", "data": {"mode": "stable", "version": "1.0.0"}}
{"timestamp": "2026-05-06T23:42:53+00:00", "event_type": "policy_violation", "data": {"domain": "canary", "reasons": ["Error rate 0.380952 exceeds allowed maximum 0.01"]}}
{"timestamp": "2026-05-06T23:45:32+00:00", "event_type": "mode_change", "data": {"mode": "stable"}}
Running `swiftdeploy audit` generates `audit_report.md` - a GitHub Flavored Markdown report with a deployment timeline and violations table that renders directly on GitHub.
## Complete CLI Reference

| Command | What it does |
|---|---|
| `swiftdeploy init` | Parse manifest, generate `docker-compose.yml` and `nginx.conf` |
| `swiftdeploy validate` | Run 5 pre-flight checks, exit non-zero on any failure |
| `swiftdeploy deploy` | init → validate → OPA infra check → compose up → health gate |
| `swiftdeploy promote canary` | Update manifest, regenerate Compose, restart app only |
| `swiftdeploy promote stable` | OPA canary check → update manifest → regenerate → restart app |
| `swiftdeploy status` | Live metrics + policy compliance dashboard, appends to history |
| `swiftdeploy status --once` | Single scrape and exit |
| `swiftdeploy audit` | Parse `history.jsonl`, generate `audit_report.md` |
| `swiftdeploy teardown` | Remove containers, networks, and volumes |
| `swiftdeploy teardown --clean` | Teardown + delete generated config files |
## Lessons Learned

### 1. The manifest discipline pays off at every stage
Every time I was tempted to hardcode a value - a port, a timeout, a threshold - I put it in the manifest instead. This cost five extra minutes each time and saved hours. Any value can be changed in the manifest and the entire system adapts without touching any other file.
### 2. Separation of concerns is a survival strategy, not just a principle

When the CLI makes policy decisions in Python, changing a threshold means deploying new CLI code. When OPA owns the decisions, changing a threshold means editing one line in `manifest.yaml`. The operational difference is enormous. More importantly, the policy files can be reviewed, versioned, and audited independently of the tool that enforces them.
### 3. Every failure mode deserves its own message
The first version of the OPA client raised a generic exception on any failure. Connection refused, timeout, bad JSON, missing result - all looked the same. Adding distinct handling for each case costs twenty lines of code and saves hours of debugging in production.
### 4. Prometheus counters persist for the process lifetime

After recovering from error chaos and generating clean traffic, `promote stable` was still blocked. The reason: Prometheus counters never reset. The old error counts were still there from before the recovery. Restarting the container resets them because it starts a fresh process. In production, you would use a time-windowed approach with a proper Prometheus server and PromQL range queries.
### 5. OPA isolation is a hard security requirement
If OPA were reachable via the Nginx port, anyone could query your policy engine, discover your exact thresholds, and craft traffic that stays just below detection limits. Binding OPA to the loopback interface and keeping it off the public network is not a preference - it is the minimum viable security posture for a policy engine.
### 6. Generated files should never be committed

Committing `docker-compose.yml` and `nginx.conf` to Git creates a false source of truth. Developers start editing the generated file instead of the manifest, the template drifts from reality, and the tool becomes meaningless. Keeping generated files in `.gitignore` enforces the discipline mechanically.
### 7. The Windows BOM problem is real

On Windows, writing YAML with Python's `yaml.safe_dump` can produce a file with a UTF-8 BOM prefix. When that file is later read by the Jinja template loader, the BOM gets rendered into the first key name, producing `"\xEF\xBB\xBFservices"` instead of `services`. The fix is to always write files using `write_bytes(content.encode("utf-8"))` rather than `write_text(..., encoding="utf-8")`. The difference is subtle and the debugging is painful.
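A quick way to check whether a generated file has picked up a BOM:

```python
from pathlib import Path

raw = Path("manifest.yaml").read_bytes()
if raw.startswith(b"\xef\xbb\xbf"):
    print("UTF-8 BOM present - this will corrupt the first key name")
```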
Running It Yourself
# Clone the repository
git clone https://github.com/gbadedata/swiftdeploy.git
cd swiftdeploy
# Set up Python environment
python -m venv .venv
source .venv/bin/activate # Linux/macOS
# .\.venv\Scripts\Activate.ps1 # Windows PowerShell
pip install pyyaml jinja2 requests
# Build the app image
docker build -t swift-deploy-1-node:latest .
# Full lifecycle
python ./swiftdeploy deploy
python ./swiftdeploy promote canary
python ./swiftdeploy status --once
python ./swiftdeploy promote stable
python ./swiftdeploy audit
# Clean up
python ./swiftdeploy teardown --clean
The full source code, Rego policies, Jinja templates, and screenshots are at:
https://github.com/gbadedata/swiftdeploy
## Final Thought
The most important insight from this project is that a deployment tool should be more than a script that runs `docker compose up`. It should be a control plane - one that enforces standards before acting, surfaces observable state while running, and leaves an auditable record of every decision it makes.
SwiftDeploy is deliberately local-first and small in scope. But the patterns it demonstrates - manifest-driven generation, policy-gated lifecycle, metrics-based promotion gates, and append-only audit trails - are the same patterns that underpin tools like Argo CD, Flux, and every serious production deployment system.
The small version teaches you the patterns. The patterns scale to any size.
Source code: github.com/gbadedata/swiftdeploy