SwiftDeploy: How I Built a CLI That Writes Its Own Infrastructure, Gates Deployments with OPA, and Audits Everything

HNG Internship 14—DevOps Track, Stage 4A & 4B
A deep dive into building a declarative deployment tool from scratch—no Terraform, no Kubernetes, just bash, Python, Docker, and Open Policy Agent.

Introduction
Most DevOps tasks ask you to configure infrastructure manually. Stage 4 of the HNG DevOps track asked something harder: build the tool that does it for you.
The result is swiftdeploy — a CLI tool that reads a single manifest.yaml file and automatically generates all configuration files, deploys a containerised stack, gates every deployment through policy checks, streams live metrics, and produces audit reports. Nothing is handwritten except the manifest.
This post covers the full journey: the architecture decisions, the OPA policy design, what happened when chaos was injected, and every painful lesson learned along the way.

Part 1 — The Design: A Tool That Writes Its Own Config
The Core Principle: Declarative Infrastructure
The fundamental idea behind tools like Terraform and Kubernetes is declarative infrastructure — you describe what you want, not how to build it. You write a spec, the tool figures out the implementation.
swiftdeploy applies this same principle at a smaller scale. The entire deployment is described in one file:
services:
  image: travispocr-swiftdeploy:latest
  port: 3000
  mode: stable
  version: "1.0.0"

nginx:
  image: nginx:latest
  port: 8080
  proxy_timeout: 30

opa:
  image: openpolicyagent/opa:latest
  port: 8181

network:
  name: swiftdeploy-net
  driver_type: bridge
This is the only file you ever edit. Everything else — nginx.conf, docker-compose.yml — is generated from it.
How init Works
swiftdeploy init reads the manifest using a pure Python YAML parser embedded in the bash script, extracts all values, then uses sed to perform token substitution on two template files:
templates/nginx.conf.tmpl → nginx.conf
templates/docker-compose.yml.tmpl → docker-compose.yml
Templates use {{TOKEN}} placeholders:
# In template
listen {{NGINX_PORT}};
server app:{{SERVICE_PORT}};

# After init
listen 8080;
server app:3000;
The grader's test for this is brutal: delete the generated files and re-run init. If they don't regenerate identically, the stack breaks. This forced a clean separation between templates (committed to git) and generated files (gitignored).
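To make the token substitution concrete, here is a minimal Python sketch of what init effectively does. The real CLI performs this step in bash with sed; render_template and the hard-coded token values below are illustrative only.

```python
# Minimal sketch of the {{TOKEN}} substitution that init performs.
# The real CLI does this in bash with sed; render_template and the
# hard-coded token values below are illustrative.
from pathlib import Path

def render_template(template_path: str, output_path: str, tokens: dict) -> None:
    """Replace every {{KEY}} placeholder in the template with its value."""
    text = Path(template_path).read_text()
    for key, value in tokens.items():
        text = text.replace("{{" + key + "}}", str(value))
    Path(output_path).write_text(text)

# In the real tool these values come from manifest.yaml via the embedded parser.
tokens = {"NGINX_PORT": 8080, "SERVICE_PORT": 3000}
render_template("templates/nginx.conf.tmpl", "nginx.conf", tokens)
```

Because the templates are the single source of structure and the manifest is the single source of values, regenerating the files is always deterministic.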
The Architecture
manifest.yaml
      │
      ▼
swiftdeploy init
      ├──► nginx.conf (from templates/nginx.conf.tmpl)
      └──► docker-compose.yml (from templates/docker-compose.yml.tmpl)
      │
      ▼
swiftdeploy deploy
      ├──► swiftdeploy-app (Flask API)
      ├──► swiftdeploy-nginx (Reverse proxy)
      └──► swiftdeploy-opa (Policy engine)
All traffic routes through Nginx. The app port (3000) is never exposed to the host — only port 8080. OPA runs on an isolated internal network and is only reachable by the CLI through the app container.

Part 2 — The API Service
The API is a Flask application running in two modes controlled by the MODE environment variable.
Stable vs Canary
Same Docker image, different runtime behaviour:
| Feature | Stable | Canary |
|---------|--------|--------|
| GET / | Welcome message | Welcome message |
| GET /healthz | Status + uptime | Status + uptime |
| GET /metrics | Prometheus metrics | Prometheus metrics |
| POST /chaos | 403 Forbidden | Active |
| X-Mode header | Not present | X-Mode: canary |
Mode switching is done by swiftdeploy promote:
./swiftdeploy promote canary # stable → canary
./swiftdeploy promote stable # canary → stable
Promote updates manifest.yaml in-place, regenerates docker-compose.yml with the new MODE env var, and restarts only the app container — Nginx stays up the entire time, so there's zero downtime.
The /metrics Endpoint
Rather than pull in the prometheus_client library, the app tracks metrics in memory and renders them manually in Prometheus text format. This keeps the image small and avoids extra dependencies.
Three categories of metrics are tracked:
Counters:
http_requests_total{method="GET",path="/healthz",status_code="200"} 42
Histograms (with standard buckets from 5ms to 10s):
http_request_duration_seconds_bucket{method="GET",path="/",le="0.005"} 38
http_request_duration_seconds_sum{method="GET",path="/"} 0.187
http_request_duration_seconds_count{method="GET",path="/"} 42
Gauges:
app_uptime_seconds 3600.123
app_mode 0
chaos_active 0
A before_request hook records the start time, and an after_request hook calculates duration and updates all counters. The /metrics endpoint itself is excluded from tracking to avoid self-referential noise.
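A condensed sketch of that hook pattern, assuming Flask: histogram buckets and the mode/chaos gauges are omitted, and the variable names here are illustrative rather than the project's actual code.

```python
# Simplified sketch of the request-tracking hooks. Only counters, a duration
# sum, and uptime are shown; the real app also renders histogram buckets and
# the app_mode / chaos_active gauges.
import time
from collections import defaultdict
from flask import Flask, Response, g, request

app = Flask(__name__)
START_TIME = time.time()
request_counts = defaultdict(int)    # (method, path, status_code) -> count
duration_sums = defaultdict(float)   # (method, path) -> total seconds

@app.before_request
def start_timer():
    g.start = time.time()

@app.after_request
def record_request(response):
    if request.path != "/metrics":  # exclude /metrics to avoid self-referential noise
        elapsed = time.time() - g.start
        request_counts[(request.method, request.path, response.status_code)] += 1
        duration_sums[(request.method, request.path)] += elapsed
    return response

@app.route("/metrics")
def metrics():
    lines = [f"app_uptime_seconds {time.time() - START_TIME:.3f}"]
    for (method, path, status), count in request_counts.items():
        lines.append(
            f'http_requests_total{{method="{method}",path="{path}",status_code="{status}"}} {count}'
        )
    for (method, path), total in duration_sums.items():
        lines.append(
            f'http_request_duration_seconds_sum{{method="{method}",path="{path}"}} {total:.3f}'
        )
    return Response("\n".join(lines) + "\n", mimetype="text/plain")
```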
The Chaos Endpoint
In canary mode, POST /chaos accepts three commands:
# Slow down responses
curl -X POST http://localhost:8080/chaos \
  -d '{"mode":"slow","duration":3}'

# Inject errors at 50% rate
curl -X POST http://localhost:8080/chaos \
  -d '{"mode":"error","rate":0.5}'

# Recover
curl -X POST http://localhost:8080/chaos \
  -d '{"mode":"recover"}'
Chaos state is stored in a thread-safe dictionary and applied in before_request. The chaos_active gauge in /metrics reflects the current state (0=none, 1=slow, 2=error), making it visible in the status dashboard.
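A rough sketch of that mechanism, standalone and simplified: the canary-only gating (403 in stable mode) is omitted, and names like CHAOS are illustrative, not the project's actual code.

```python
# Illustrative chaos middleware: state is mutated by POST /chaos and consulted
# on every request in before_request.
import random
import threading
import time
from flask import Flask, abort, jsonify, request

app = Flask(__name__)
CHAOS = {"mode": "none", "duration": 0.0, "rate": 0.0}
CHAOS_LOCK = threading.Lock()

@app.route("/chaos", methods=["POST"])
def set_chaos():
    payload = request.get_json(force=True)
    with CHAOS_LOCK:
        if payload.get("mode") == "recover":
            CHAOS.update(mode="none", duration=0.0, rate=0.0)
        else:
            CHAOS.update(
                mode=payload["mode"],
                duration=float(payload.get("duration", 0)),
                rate=float(payload.get("rate", 0.0)),
            )
    return jsonify(CHAOS)

@app.before_request
def apply_chaos():
    if request.path == "/chaos":
        return  # never degrade the control endpoint itself
    with CHAOS_LOCK:
        mode, duration, rate = CHAOS["mode"], CHAOS["duration"], CHAOS["rate"]
    if mode == "slow":
        time.sleep(duration)
    elif mode == "error" and random.random() < rate:
        abort(500)
```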

Part 3 — The Guardrails: OPA Policy Enforcement
Why the CLI Never Makes Decisions
The task requirement is strict: all allow/deny logic must live in OPA. The CLI only sends data and acts on the response. This separation matters because:

- Policies can be updated without touching the CLI
- Policy logic is auditable and testable independently
- Different teams can own different policy domains

Policy Structure
Two independent policy domains, each owning exactly one question:
policies/infrastructure.rego — "Is the host healthy enough to deploy?"
policies/canary.rego — "Is the canary safe to promote to stable?"
Each domain is queried independently. A deploy could be blocked by infrastructure policy while canary policy passes — they never interfere with each other.
The Rego Policies
Infrastructure policy:
package swiftdeploy.infrastructure

import rego.v1

default allow := false

allow if {
    count(violations) == 0
}

violations contains msg if {
    input.disk_free_gb < data.thresholds.min_disk_free_gb
    msg := sprintf(
        "Disk free %.1fGB is below minimum %.1fGB",
        [input.disk_free_gb, data.thresholds.min_disk_free_gb]
    )
}

violations contains msg if {
    input.cpu_load > data.thresholds.max_cpu_load
    msg := sprintf(
        "CPU load %.2f exceeds maximum %.2f",
        [input.cpu_load, data.thresholds.max_cpu_load]
    )
}
Canary safety policy:
package swiftdeploy.canary

import rego.v1

default allow := false

allow if {
    count(violations) == 0
}

violations contains msg if {
    input.error_rate_percent > data.thresholds.max_error_rate_percent
    msg := sprintf(
        "Error rate %.2f%% exceeds maximum %.2f%%",
        [input.error_rate_percent, data.thresholds.max_error_rate_percent]
    )
}

violations contains msg if {
    input.chaos_active != 0
    msg := sprintf(
        "Chaos mode is active (%d) — recover before promoting",
        [input.chaos_active]
    )
}
Why Thresholds Live in data.json
Notice the policies reference data.thresholds.* — never hardcoded numbers. All threshold values live in policies/data.json:
{
  "thresholds": {
    "min_disk_free_gb": 10.0,
    "max_cpu_load": 2.0,
    "min_mem_free_percent": 10.0,
    "max_error_rate_percent": 1.0,
    "max_p99_latency_ms": 500.0
  }
}
This means you can tighten or relax thresholds without touching Rego. The policy logic and the policy configuration are separate concerns.
Every Decision Carries Reasoning
OPA never returns a bare boolean. Every decision includes the violations that caused it:
{
  "result": {
    "allow": false,
    "violations": [
      "Disk free 4.2GB is below minimum 10.0GB"
    ],
    "domain": "infrastructure",
    "checked_at": "2026-05-06T08:35:27Z"
  }
}
The CLI surfaces this clearly:
── Policy check: infrastructure ──
[FAIL] infrastructure: policy violated
• Disk free 4.2GB is below minimum 10.0GB

╔══ POLICY VIOLATION ══════════════════════════════╗
║ Deployment blocked by infrastructure policy.
╚══════════════════════════════════════════════════╝
OPA Isolation
OPA runs on a separate opa-internal Docker network. It is not connected to the Nginx network, so it's completely unreachable from the public-facing port 8080. The CLI reaches OPA by proxying requests through the app container using docker exec:
docker exec swiftdeploy-app curl -sf -X POST \
"http://swiftdeploy-opa:8181/v1/data/swiftdeploy/infrastructure/decision" \
-H "Content-Type: application/json" \
-d "$input_json"
This means OPA is reachable by internal services but completely isolated from external traffic.
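For illustration, here is a Python sketch of the same decision query. It assumes it runs somewhere that can resolve the swiftdeploy-opa hostname (in the real setup only the app container can), and gather_input with its two fields simply mirrors what the infrastructure policy above expects; the real CLI gathers these values in bash.

```python
# Sketch of the infrastructure policy query. The real CLI gathers these values
# in bash and proxies the curl call through the app container; this standalone
# Python version only illustrates the shape of the input and the decision.
import json
import os
import shutil
import urllib.request

# Only resolvable from inside the opa-internal Docker network.
OPA_URL = "http://swiftdeploy-opa:8181/v1/data/swiftdeploy/infrastructure/decision"

def gather_input() -> dict:
    """Collect the host facts the infrastructure policy checks."""
    disk = shutil.disk_usage("/")
    return {
        "disk_free_gb": disk.free / 1024**3,
        "cpu_load": os.getloadavg()[0],
    }

req = urllib.request.Request(
    OPA_URL,
    data=json.dumps({"input": gather_input()}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    decision = json.load(resp)["result"]

print("allow:", decision["allow"])
for violation in decision.get("violations", []):
    print(" -", violation)
```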

Part 4 — The Chaos Experiment
Injecting Chaos
With the stack in canary mode:
# Inject 50% error rate
curl -X POST http://localhost:8080/chaos \
-H "Content-Type: application/json" \
-d '{"mode":"error","rate":0.5}'
The status dashboard immediately captured the degradation:
SwiftDeploy Status Dashboard 2026-05-06T20:31:00Z
─────────────────────────────────────────────────
Throughput 2.4 req/s Total 89
Error rate 48.31%
P99 latency 5ms Avg 0ms
Chaos error

Policy Compliance
[✓] infrastructure: passing
[✗] canary: FAILING
─────────────────────────────────────────────────
Attempting Promotion Under Chaos
With chaos active and the error rate above 1%, attempting to promote back to stable is blocked:
./swiftdeploy promote stable
── Policy check: canary ──
[FAIL] canary: policy violated
• Error rate 48.31% exceeds maximum 1.00%
• Chaos mode is active (2) — recover before promoting

╔══ POLICY VIOLATION ══════════════════════════════╗
║ Promotion to stable blocked by canary safety policy.
╚══════════════════════════════════════════════════╝
This is exactly the intended behaviour — the policy engine prevents a broken canary from being promoted to production.
Recovery
curl -X POST http://localhost:8080/chaos -d '{"mode":"recover"}'
./swiftdeploy promote stable
── Policy check: canary ──
[PASS] canary: all checks passed
[✓] Promotion complete! Mode is now: stable

Part 5 — Observability: Status Dashboard and Audit Trail
The Status Dashboard
swiftdeploy status runs a live-refreshing terminal dashboard that scrapes /metrics every 5 seconds:
SwiftDeploy Status Dashboard 2026-05-06T20:29:03Z
───────────────────────────────────────────────────────
Throughput 2.2 req/s Total 44
Error rate 0.00%
P99 latency 10ms Avg 0ms
Chaos none

Policy Compliance
[✓] infrastructure: passing
[✓] canary: passing

History log history.jsonl
───────────────────────────────────────────────────────
Refreshing every 5s — Ctrl+C to exit
Every scrape is appended to history.jsonl as a structured JSON line:
json{"timestamp":"2026-05-06T20:29:03Z","event":"status_scrape","data":{"rps":2.2,"total_requests":44,"error_rate_percent":0.0,"p99_latency_ms":10.0,"chaos_active":0}}
The Audit Report
swiftdeploy audit parses history.jsonl and generates audit_report.md:
## Timeline
| Timestamp | Event | Detail |
|-----------|-------|--------|
| 2026-05-06T20:15:30Z | policy_pass | domain=infrastructure |
| 2026-05-06T20:15:31Z | deploy_success | mode=stable |
| 2026-05-06T20:16:49Z | mode_change | from=stable, to=canary |

## Policy Violations
| Timestamp | Event | Data |
|-----------|-------|------|
| 2026-05-06T08:35:31Z | deploy_blocked | disk_free_gb=58.08 |

## Metrics Summary
| Metric | Value |
|--------|-------|
| Total scrapes | 31 |
| Avg req/s | 6.74 |
| Avg error rate | 0.0000% |
| Max P99 latency | 10ms |

The report renders as clean GitHub Flavored Markdown and provides a complete audit trail of every deployment, mode change, policy decision, and chaos event.
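A rough Python sketch of that parsing step, assuming the event names shown above. The real report also breaks mode changes and policy decisions into their own sections, so treat this as a minimal outline rather than the tool's actual code.

```python
# Rough sketch of the audit step: read history.jsonl and emit markdown tables.
import json

rows, scrapes = [], []
with open("history.jsonl") as fh:
    for line in fh:
        entry = json.loads(line)
        if entry["event"] == "status_scrape":
            scrapes.append(entry["data"])
        else:
            detail = ", ".join(f"{k}={v}" for k, v in entry.get("data", {}).items())
            rows.append((entry["timestamp"], entry["event"], detail))

with open("audit_report.md", "w") as out:
    out.write("## Timeline\n| Timestamp | Event | Detail |\n|---|---|---|\n")
    for ts, event, detail in rows:
        out.write(f"| {ts} | {event} | {detail} |\n")
    if scrapes:
        avg_err = sum(s["error_rate_percent"] for s in scrapes) / len(scrapes)
        out.write("\n## Metrics Summary\n| Metric | Value |\n|---|---|\n")
        out.write(f"| Total scrapes | {len(scrapes)} |\n")
        out.write(f"| Avg error rate | {avg_err:.4f}% |\n")
```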

Part 6 — Lessons Learned

1. Windows Git Bash Mangles Unix Paths. Running docker exec container /opa eval from Git Bash on Windows translates /opa to C:/Program Files/Git/opa. The fix is MSYS_NO_PATHCONV=1:

   MSYS_NO_PATHCONV=1 docker exec swiftdeploy-opa /opa eval "data" --data /policies

   This was one of the most time-consuming bugs — completely invisible until you check the actual error message from Docker.

2. The OPA Image Has No Shell, No curl, No wget. The openpolicyagent/opa:latest image contains only the opa binary. No sh, no curl, no wget. This means:

   - Healthchecks can't use CMD-SHELL
   - You can't docker exec into it with a shell
   - The solution: route all OPA queries through the app container, which has curl

3. CRLF Line Endings Break Bash Scripts on Linux. Files edited on Windows have \r\n line endings. When bash on Linux/WSL reads them, the \r becomes part of variable values, causing cryptic failures. Always run:

   sed -i 's/\r//' swiftdeploy
   sed -i 's/\r//' templates/*.tmpl
   sed -i 's/\r//' policies/*.rego

4. Port Binding Only Applies at Container Creation. Docker applies port bindings when a container is created, not when it starts or restarts. docker compose restart does not re-apply port mappings; you need docker compose up --force-recreate to pick up port changes.

5. OPA Rego v1 Requires Explicit if Keywords. The latest OPA image enforces Rego v1 syntax, which requires import rego.v1 and explicit if keywords on every rule body. The older future.keywords import approach no longer works cleanly with the latest image.

Conclusion
swiftdeploy is a working example of several production DevOps patterns in miniature:

- Declarative infrastructure — one manifest drives everything
- Policy as code — OPA enforces guardrails without coupling to the CLI
- Observability — metrics, dashboards, and audit trails built in from the start
- Chaos engineering — deliberate failure injection to verify recovery paths

The full source code is available at github.com/travispocr/hng14-stage-4a.

Published as part of HNG Internship 14 — DevOps Track
Author: travispocr | hngstage4b
