A complete walkthrough of building a declarative deployment CLI from scratch — one that generates its own config files, enforces policy before acting, and audits everything it does.
Table of Contents
- The Problem
- Architecture Overview
- Part A — The Engine
- Part B — The Eyes and the Brain
- The Chaos Experiments
- Lessons Learned
- Replication Guide
The Problem
Most DevOps tooling asks you to write config files by hand. You write a docker-compose.yml, an nginx.conf, maybe a Kubernetes manifest — and then you maintain all of them separately. When something changes, you update three files instead of one, and they drift apart.
SwiftDeploy flips this. You write one file — manifest.yaml — and the tool derives everything else from it. The generated files are artifacts, not sources.
Part B goes further: the tool now refuses to act unless the environment meets policy. It has eyes (Prometheus metrics), a brain (Open Policy Agent), and a memory (an append-only audit trail).
Architecture Overview
┌──────────────────────────────────────────────────────────┐
│ Host Machine                                              │
│                                                           │
│  swiftdeploy CLI (bash)                                   │
│    │                                                      │
│    ├── reads ──►  manifest.yaml                           │
│    ├── writes ─►  nginx.conf, docker-compose.yml          │
│    ├── queries ►  OPA :8181 (127.0.0.1 only)              │
│    └── scrapes ►  /metrics via nginx :8080                │
│                                                           │
│  ┌─────────────────────────┐   ┌──────────────────────┐  │
│  │ swiftdeploy-net         │   │ opa-net              │  │
│  │                         │   │                      │  │
│  │ ┌────────┐  ┌────────┐  │   │ ┌─────────────────┐  │  │
│  │ │  app   │  │ nginx  │  │   │ │       OPA       │  │  │
│  │ │ :3000  │◄─│ :8080  │  │   │ │      :8181      │  │  │
│  │ └────────┘  └────────┘  │   │ └─────────────────┘  │  │
│  │ (no host port)  (pub)   │   │ (127.0.0.1 only)     │  │
│  └─────────────────────────┘   └──────────────────────┘  │
└──────────────────────────────────────────────────────────┘
Key isolation properties:
- The app container has no host-exposed port — all traffic routes through nginx
- OPA lives on a separate Docker network (opa-net) with no connection to swiftdeploy-net
- OPA's port is bound to 127.0.0.1 only — the CLI on the host can reach it, but nginx cannot proxy to it
- No policy logic lives in the CLI — the CLI only sends data and reads decisions
Part A — The Engine
The Manifest: Single Source of Truth
Everything starts with manifest.yaml:
services:
  image: swift-deploy-1-node:latest
  port: 3000
  mode: stable          # stable | canary
  version: "1.0.0"
nginx:
  image: nginx:latest
  port: 8080
  proxy_timeout: 30
network:
  name: swiftdeploy-net
  driver_type: bridge
This is the only file an operator edits. Every other config file is generated from it. The grader's test is simple: delete nginx.conf and docker-compose.yml, run ./swiftdeploy init, and verify they come back correctly. If your tool breaks, your stack breaks.
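A minimal version of that regeneration check, using only the file names above (the loop itself is illustrative):

# Sketch of the regeneration contract: delete the artifacts, re-run init, confirm they exist again
rm -f nginx.conf docker-compose.yml
./swiftdeploy init
for f in nginx.conf docker-compose.yml; do
    [ -s "$f" ] || { echo "missing generated file: $f"; exit 1; }
done
echo "artifacts regenerated"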
The Tool That Writes Its Own Infrastructure
The core of swiftdeploy init is a Python template renderer embedded directly in the bash script:
import yaml
from pathlib import Path

with open("manifest.yaml") as f:
    m = yaml.safe_load(f)

ctx = {
    "service_image": m["services"]["image"],
    "service_port": m["services"]["port"],
    "mode": m["services"].get("mode", "stable"),
    "nginx_port": m["nginx"]["port"],
    "proxy_timeout": m["nginx"].get("proxy_timeout", 30),
    "network_name": m["network"]["name"],
    "network_driver": m["network"]["driver_type"],
}

for tmpl, out in [("templates/nginx.conf.j2", "nginx.conf"),
                  ("templates/docker-compose.yml.j2", "docker-compose.yml")]:
    src = Path(tmpl).read_text()
    for k, v in ctx.items():
        src = src.replace("{{ " + k + " }}", str(v))
    Path(out).write_text(src)
The templates use {{ variable }} placeholders. No Jinja2 library needed — a simple string replace is sufficient and keeps the image dependency-free.
The nginx template enforces the required access log format:
log_format swiftdeploy '$time_iso8601 | $status | ${request_time}s | $upstream_addr | $request';
And returns structured JSON error bodies for 502/503/504:
location @err502 {
default_type application/json;
return 502 '{"error":"Bad Gateway","code":502,"service":"app","contact":"ops@swiftdeploy.local"}';
}
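A quick way to see those error bodies in practice is to stop the app while nginx keeps serving; a sketch using the compose service names above:

# Stop only the app; nginx stays up and should answer with the JSON 502 body
docker compose stop app
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/   # expect 502
curl -s http://localhost:8080/ | python3 -m json.tool             # structured error body
docker compose start app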
The docker-compose template enforces security hardening on every deploy:
user: "1001:1001"
cap_drop:
- ALL
expose:
- "{{ service_port }}" # internal only — never ports:
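Because the app only uses expose, nothing should appear under its host port bindings after a deploy; one way to double-check (assumes the compose services are named app and nginx, as elsewhere in this post):

# The app container should publish nothing; only nginx maps 8080 to the host
docker inspect -f '{{json .HostConfig.PortBindings}}' "$(docker compose ps -q app)"    # expect {} (or null)
docker inspect -f '{{json .HostConfig.PortBindings}}' "$(docker compose ps -q nginx)"  # expect the 8080 mapping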
The API Service
The service is a pure Python stdlib HTTP server — no Flask, no FastAPI, no external dependencies. This keeps the Docker image under 80MB.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            _json(self, 200, {"status": "ok", "uptime": round(time.time() - START_TIME, 2)})
        elif self.path == "/":
            _json(self, 200, {
                "message": f"Welcome to SwiftDeploy [{MODE}]",
                "mode": MODE,
                "version": VERSION,
                "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            })
Canary mode is controlled entirely by the MODE environment variable injected by docker-compose. The same image runs in both modes — no separate builds.
Chaos state is an in-process global dict protected by a threading.Lock:
_chaos = {"mode": None, "duration": 0, "rate": 0.0}
_chaos_lock = threading.Lock()

def _apply_chaos():
    # Read the chaos settings under the lock, then act without holding it
    with _chaos_lock:
        mode = _chaos["mode"]
        duration = _chaos["duration"]
        rate = _chaos["rate"]
    if mode == "slow":
        time.sleep(duration)
    elif mode == "error":
        if random.random() < rate:
            return True  # caller should return 500
    return False
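The dict is mutated by the POST /chaos handler (not shown here); the chaos experiments later in this post drive it with requests of exactly this shape:

# Slow every response by 2 seconds, then clear the chaos state again
curl -X POST http://localhost:8080/chaos -H "Content-Type: application/json" -d '{"mode":"slow","duration":2}'
curl -X POST http://localhost:8080/chaos -H "Content-Type: application/json" -d '{"mode":"recover"}'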
The CLI Subcommands
validate runs 5 pre-flight checks before any deployment:
==> Running pre-flight checks
[PASS] manifest.yaml exists and is valid YAML
[PASS] All required fields present and non-empty
[PASS] Docker image 'swift-deploy-1-node:latest' exists locally
[PASS] Nginx port 8080 is not already bound
[PASS] nginx.conf is syntactically valid
Results: 5 passed, 0 failed
The nginx syntax check is interesting — it can't just run nginx -t because the app hostname doesn't resolve outside the compose network. The fix: substitute the upstream hostname with 127.0.0.1 in a temp copy before testing:
sed 's|http://app:|http://127.0.0.1:|g' nginx.conf > .nginx-check-tmp.conf
docker run --rm -v "$(pwd)/.nginx-check-tmp.conf:/etc/nginx/conf.d/default.conf:ro" \
nginx:latest nginx -t
promote does a rolling restart of only the app container:
docker compose up -d --no-deps --force-recreate app
The --no-deps flag is critical — it prevents docker-compose from restarting nginx, which would cause a brief outage. Only the app restarts; nginx keeps serving traffic (returning 502s for the few seconds the app is down, which is expected and handled by the JSON error pages).
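You can watch that brief 502 window from a second terminal while promote runs; a simple polling loop against the health endpoint:

# Poll the health endpoint once a second and print the status code; expect a few 502s mid-promote
while true; do
    printf '%s ' "$(date +%T)"
    curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/healthz
    sleep 1
done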
Part B — The Eyes and the Brain
Instrumentation: /metrics in Pure Python
Adding Prometheus metrics to a stdlib HTTP server requires no libraries. All state lives in module-level variables protected by a lock:
_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
_metrics_lock = threading.Lock()
_req_total = {}                              # {(method, path, status_code): count}
_hist_buckets = {le: 0 for le in _BUCKETS}
_hist_inf = 0                                # observations in the +Inf bucket
_hist_sum = 0.0
_hist_count = 0
Every request records its duration and increments the appropriate counters:
def _record(method, path, status, duration):
    global _hist_sum, _hist_count, _hist_inf
    key = (method, path, str(status))
    with _metrics_lock:
        _req_total[key] = _req_total.get(key, 0) + 1
        _hist_sum += duration
        _hist_count += 1
        for le in _BUCKETS:
            if duration <= le:
                _hist_buckets[le] += 1
        _hist_inf += 1  # the +Inf bucket counts every observation
The /metrics endpoint serialises this state into Prometheus text format:
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/",status_code="200"} 42
http_requests_total{method="GET",path="/healthz",status_code="200"} 180
# HELP http_request_duration_seconds Request latency histogram
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.005"} 210
http_request_duration_seconds_bucket{le="0.01"} 220
...
http_request_duration_seconds_bucket{le="+Inf"} 222
http_request_duration_seconds_sum 0.441200
http_request_duration_seconds_count 222
# HELP app_uptime_seconds Seconds since process start
# TYPE app_uptime_seconds gauge
app_uptime_seconds 3600.42
# HELP app_mode Current deployment mode (0=stable 1=canary)
# TYPE app_mode gauge
app_mode 1
# HELP chaos_active Active chaos state (0=none 1=slow 2=error)
# TYPE chaos_active gauge
chaos_active 0
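The canary gate later needs an error rate, and it can be derived from http_requests_total alone; a sketch of that calculation over the scrape output (cumulative since process start):

# Share of 5xx responses across all recorded requests
curl -s http://localhost:8080/metrics | awk '
    /^http_requests_total/ {
        total += $2
        if ($0 ~ /status_code="5/) errors += $2
    }
    END { if (total > 0) printf "error_rate=%.4f\n", errors / total }
'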
The Policy Sidecar: OPA
OPA runs as a sidecar container. The key design decision is network isolation: OPA lives on opa-net, which has no connection to swiftdeploy-net (where nginx and the app live). The port is bound to 127.0.0.1 only:
opa:
  image: openpolicyagent/opa:latest
  command: ["run", "--server", "--addr", "0.0.0.0:8181", "/policies"]
  volumes:
    - ./policies:/policies:ro
  networks:
    - opa-net
  ports:
    - "127.0.0.1:8181:8181"
This satisfies the "no leakage" requirement: nginx cannot proxy to OPA because they share no network. The CLI on the host can reach OPA directly on localhost:8181, but a request through http://localhost:8080/v1/data/infra would return a 404 from nginx.
The policies/ directory is mounted read-only. OPA loads all .rego files and their companion data.json files at startup. Restarting OPA picks up policy changes without rebuilding anything.
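Both properties are easy to spot-check from the host using OPA's standard HTTP API (the health and policy-listing endpoints are part of OPA itself):

# Reachable only on the loopback binding
curl -s http://127.0.0.1:8181/health
# Lists the loaded .rego modules (infra and canary, once the policies/ mount is in place)
curl -s http://127.0.0.1:8181/v1/policies | python3 -m json.tool | grep '"id"'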
Writing Rego Policies
The core principle: OPA never returns a bare boolean. Every decision carries the reasoning behind it.
Infrastructure Policy (policies/infra/policy.rego):
package infra
import rego.v1
default allow := false
allow if {
    count(deny) == 0
}

deny contains msg if {
    input.disk_free_gb < data.infra.min_disk_gb
    msg := sprintf(
        "Disk free %vGB is below minimum %vGB",
        [input.disk_free_gb, data.infra.min_disk_gb]
    )
}

deny contains msg if {
    input.cpu_load > data.infra.max_cpu_load
    msg := sprintf(
        "CPU load %v exceeds maximum %v",
        [input.cpu_load, data.infra.max_cpu_load]
    )
}
The thresholds live in policies/infra/data.json, not in the Rego file:
{ "min_disk_gb": 10, "max_cpu_load": 2.0 }
This separation matters. Changing a threshold is a data change, not a policy change. You can update data.json and restart OPA without touching the logic. A security audit of the policy file remains valid even after threshold adjustments.
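In practice a tuning change is just an edit to data.json followed by an OPA restart; a sketch using the container name from the disk-full experiment below:

# Raise the disk floor from 10GB to 20GB without touching policy.rego
cat > policies/infra/data.json <<'EOF'
{ "min_disk_gb": 20, "max_cpu_load": 2.0 }
EOF
docker restart swiftdeploy-opa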
Canary Safety Policy (policies/canary/policy.rego):
package canary
import rego.v1
default allow := false
allow if {
    count(deny) == 0
}

deny contains msg if {
    input.error_rate > data.canary.max_error_rate
    msg := sprintf(
        "Error rate %v%% exceeds maximum %v%%",
        [round(input.error_rate * 100), round(data.canary.max_error_rate * 100)]
    )
}

deny contains msg if {
    input.p99_ms > data.canary.max_p99_ms
    msg := sprintf(
        "P99 latency %vms exceeds maximum %vms",
        [input.p99_ms, data.canary.max_p99_ms]
    )
}
Each domain owns exactly one question. The infra policy only knows about disk and CPU. The canary policy only knows about error rate and latency. A change to one never requires touching the other.
Gated Lifecycle: deploy and promote
Pre-deploy gate collects host stats and sends them to OPA:
disk_free=$(df -g / | awk 'NR==2{print $4}')
cpu_load=$(uptime | awk -F'load averages?:' '{print $2}' | awk '{print $1}' | tr -d ',')
input_json=$(python3 -c "
import json
print(json.dumps({
'action': 'deploy',
'disk_free_gb': float('$disk_free'),
'cpu_load': float('$cpu_load')
}))
")
When the disk is full, the output is unambiguous:
==> Checking infrastructure policy
disk_free=2GB cpu_load=0.8
allow=false
[DENY] Disk free 2GB is below minimum 10GB
[BLOCKED] Deployment denied by infrastructure policy
Pre-promote gate scrapes /metrics and computes P99 latency from the histogram:
# P99 estimate from cumulative histogram bucket counts
target = total_obs * 0.99
for le, count in sorted(buckets):
    if count >= target:
        p99_ms = int(le * 1000)
        break
The input to OPA is different from the deploy check — it reflects the question being asked:
{
"action": "promote",
"error_rate": 0.052,
"p99_ms": 620
}
OPA treats pre-deploy and pre-promote as completely separate questions. The CLI queries each domain independently and treats each answer as a separate signal.
OPA unavailability is handled gracefully — the CLI never crashes or hangs:
resp=$(curl -s ... 2>/dev/null) || {
echo " [WARN] OPA unreachable — policy check skipped"
return 2
}
Return code 2 means "unavailable" — the CLI logs a warning and continues. Return code 1 means "denied" — the CLI prints the reasons and exits. Return code 0 means "allowed" — the CLI proceeds.
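Wired together, the gate looks roughly like this. It is a sketch under the conventions above: the check_infra_policy name and the JSON parsing are illustrative, while the endpoint, messages, and return codes follow the article.

check_infra_policy() {
    local input_json="$1" resp allow
    # Return 2 (warn and continue) if OPA cannot be reached at all
    resp=$(curl -s --max-time 5 -d "{\"input\": $input_json}" \
        http://127.0.0.1:8181/v1/data/infra 2>/dev/null) || {
        echo "  [WARN] OPA unreachable — policy check skipped"
        return 2
    }
    allow=$(printf '%s' "$resp" | python3 -c 'import json,sys; print(json.load(sys.stdin)["result"].get("allow", False))')
    [ "$allow" = "True" ] && return 0
    # Denied: surface every reason OPA gave back
    printf '%s' "$resp" | python3 -c 'import json,sys; [print("  [DENY]", m) for m in json.load(sys.stdin)["result"].get("deny", [])]'
    return 1
}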
The Status Dashboard
./swiftdeploy status runs a live-refreshing terminal dashboard that scrapes /metrics every 5 seconds:
==> SwiftDeploy Status [2026-05-06T20:10:00Z] mode=canary
Throughput : 7.3 req/s
P99 Latency: 620ms
Error Rate : 6.00%
Uptime : 700s
Chaos : error
Policy Compliance:
[PASS] infra — disk=107GB cpu=0.8
[FAIL] canary — error_rate=6.00% p99=620ms
(Ctrl+C to exit)
Every scrape appends a JSON record to history.jsonl:
{
"ts": "2026-05-06T20:10:00Z",
"mode": "canary",
"req_per_s": 7.3,
"p99_ms": 620,
"error_rate": 0.06,
"uptime_s": 700,
"chaos": "error",
"policy": {"infra": "PASS", "canary": "FAIL"}
}
The Audit Report
./swiftdeploy audit parses history.jsonl and generates audit_report.md — valid GitHub Flavored Markdown with four sections: timeline, violations, mode changes, and chaos events.
==> audit_report.md generated
4 snapshots | 1 violations | 1 mode changes | 1 chaos events
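The report's counts fall straight out of history.jsonl; for example, the violation count is just the number of snapshots where any policy shows FAIL (field names as in the record above):

python3 - <<'EOF'
import json

violations = sum(
    1 for line in open("history.jsonl")
    if any(v == "FAIL" for v in json.loads(line)["policy"].values())
)
print(f"{violations} snapshots contained at least one policy violation")
EOF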
The Chaos Experiments
Experiment 1: Error Injection
Promote to canary, then inject 50% error rate:
./swiftdeploy promote canary
curl -X POST http://localhost:8080/chaos \
-H "Content-Type: application/json" \
-d '{"mode":"error","rate":0.5}'
The status dashboard immediately captures the failure:
Throughput : 8.1 req/s
P99 Latency: 45ms
Error Rate : 48.20%
Chaos : error
Policy Compliance:
[PASS] infra — disk=107GB cpu=0.8
[FAIL] canary — error_rate=48.20% p99=45ms
Now try to promote to stable — the canary gate blocks it:
==> Checking canary safety policy
error_rate=48.20% p99=45ms
allow=false
[DENY] Error rate 48% exceeds maximum 1%
[BLOCKED] Promotion denied by canary safety policy
This is the correct behaviour. The canary is unhealthy. You must recover first:
curl -X POST http://localhost:8080/chaos \
-H "Content-Type: application/json" \
-d '{"mode":"recover"}'
Wait for the error rate to drop below 1% in the status view, then promote.
Experiment 2: Slow Responses
curl -X POST http://localhost:8080/chaos \
-H "Content-Type: application/json" \
-d '{"mode":"slow","duration":3}'
The status dashboard shows P99 climbing:
P99 Latency: 3000ms
Error Rate : 0.00%
Policy Compliance:
[FAIL] canary — error_rate=0.00% p99=3000ms
The canary policy blocks promotion on P99 alone — even with zero errors, a 3-second P99 is unacceptable.
Experiment 3: Disk Full Gate
Simulate a full disk by temporarily lowering the threshold in policies/infra/data.json:
{ "min_disk_gb": 200, "max_cpu_load": 2.0 }
Restart OPA and try to deploy:
docker restart swiftdeploy-opa
./swiftdeploy deploy
==> Checking infrastructure policy
disk_free=107GB cpu_load=0.8
allow=false
[DENY] Disk free 107GB is below minimum 200GB
[BLOCKED] Deployment denied by infrastructure policy
The hard gate works. No containers are started, no configs are written.
Lessons Learned
1. Heredocs and set -euo pipefail don't mix well with pipes.
The biggest source of bugs was set -e killing the script when a pipe's right-hand side returned non-zero (e.g., grep finding no match). The fix: capture output into a variable and use if statements instead of && chains or piped until conditions.
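A minimal illustration of the pattern that survives set -euo pipefail (the health check shown is just an example):

set -euo pipefail

# Capture first, then branch with `if`; a bare `curl ... | grep -q` pipeline would
# abort the whole script the first time grep finds no match.
body=$(curl -s http://localhost:8080/healthz || true)
if grep -q '"status": "ok"' <<<"$body"; then
    echo "app healthy"
else
    echo "app not healthy yet"
fi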
2. macOS and Linux have different system utilities.
df -BG is Linux. df -g is macOS. uptime formats load averages differently. head -n -1 doesn't work on BSD head. Any tool targeting both platforms needs to handle these divergences explicitly.
3. internal: true networks block host port bindings.
Setting internal: true on a Docker network prevents all external connectivity — including host port mappings. OPA needs to be reachable by the CLI on the host, so the network must not be internal. Isolation from nginx is achieved by simply not adding OPA to the nginx-facing network.
4. Separate policy logic from policy data.
Hardcoding thresholds in Rego files is a mistake. When you need to adjust a limit, you want to change a JSON file, not a policy file. Keeping them separate means your Rego logic can be audited and reviewed independently of operational tuning.
5. Never return a bare boolean from a policy engine.
A decision without reasoning is useless to an operator. Every OPA response includes a deny array with human-readable messages. The CLI surfaces these directly. When a deploy is blocked, the operator knows exactly why.
6. The --no-deps flag is essential for rolling restarts.
docker compose up -d --force-recreate app without --no-deps will restart nginx too, causing a full outage. With --no-deps, only the app container restarts. Nginx keeps serving (returning 502s briefly), which is the correct rolling restart behaviour.
Replication Guide
Prerequisites
- Docker + Docker Compose v2
- Python 3.10+ with pyyaml (pip install pyyaml)
- curl, lsof
Step 1: Clone and build
git clone <your-repo-url>
cd stage4
chmod +x swiftdeploy
docker build -t swift-deploy-1-node:latest .
Step 2: Init and validate
./swiftdeploy init
./swiftdeploy validate
Expected: 5/5 checks pass.
Step 3: Deploy
./swiftdeploy deploy
Expected output:
==> Checking infrastructure policy
disk_free=107GB cpu_load=0.8
allow=true
[PASS] Infrastructure policy
==> Initialising
generated nginx.conf
generated docker-compose.yml
==> Starting stack
[+] Running 4/4
==> Waiting for health (timeout 60s)
[OK] Stack is healthy
Step 4: Verify endpoints
curl http://localhost:8080/
curl http://localhost:8080/healthz
curl http://localhost:8080/metrics
Step 5: Promote to canary
./swiftdeploy promote canary
Verify X-Mode: canary header:
curl -sv http://localhost:8080/ 2>&1 | grep X-Mode
Step 6: Inject chaos and watch the status dashboard
In one terminal:
./swiftdeploy status
In another:
curl -X POST http://localhost:8080/chaos \
-H "Content-Type: application/json" \
-d '{"mode":"error","rate":0.5}'
Watch the canary policy flip to [FAIL] in the dashboard.
Step 7: Recover and generate audit report
curl -X POST http://localhost:8080/chaos \
-H "Content-Type: application/json" \
-d '{"mode":"recover"}'
./swiftdeploy audit
cat audit_report.md
Step 8: Teardown
./swiftdeploy teardown --clean
Built for HNG DevOps Track — Stage 4.