Introduction
As part of the HNG Internship Stage 4 DevOps track, I built SwiftDeploy — a declarative deployment CLI tool that generates Nginx and Docker Compose configurations from a single manifest.yaml file. In Stage 4B, I extended it with Prometheus metrics, Open Policy Agent (OPA) policy enforcement, a live status dashboard, and an audit trail.
This post walks through the entire journey — the design decisions, the guardrails, the chaos, and the lessons learned.
The Design: A Tool That Writes Its Own Infrastructure
The core idea behind SwiftDeploy is simple: you should only ever edit one file.
That file is manifest.yaml:
```yaml
services:
  name: app
  image: wonderfullymade-api:latest
  port: 3000
  version: "1.0.0"
  mode: stable
nginx:
  image: nginx:latest
  port: 8080
  proxy_timeout: 30
network:
  name: swiftdeploy-net
  driver_type: bridge
```
From this single file, the swiftdeploy CLI generates:
- `nginx.conf` — Nginx reverse proxy configuration
- `docker-compose.yml` — Full stack definition with app, Nginx, and OPA containers
This is done using Jinja2 templates. The CLI reads the manifest, renders the templates with the values, and writes the output files to the project root.
```
manifest.yaml → swiftdeploy init → nginx.conf + docker-compose.yml
```
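Under the hood, that step is small: load the manifest, render each template, write the output. Here is a minimal sketch, assuming hypothetical template filenames (`nginx.conf.j2`, `docker-compose.yml.j2`) and a `render_configs` helper; the real CLI may organize this differently:

```python
# render.py: minimal sketch of the manifest -> config rendering step
# (template names and helper are illustrative, not the actual SwiftDeploy code)
import yaml
from jinja2 import Environment, FileSystemLoader

def render_configs(manifest_path="manifest.yaml", template_dir="templates"):
    # manifest.yaml is the single source of truth
    with open(manifest_path) as f:
        manifest = yaml.safe_load(f)

    env = Environment(loader=FileSystemLoader(template_dir))

    # Each template sees the full manifest as its rendering context
    for template_name, output_name in [
        ("nginx.conf.j2", "nginx.conf"),
        ("docker-compose.yml.j2", "docker-compose.yml"),
    ]:
        rendered = env.get_template(template_name).render(**manifest)
        with open(output_name, "w") as f:
            f.write(rendered)

if __name__ == "__main__":
    render_configs()
```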
The CLI supports these subcommands (a small dispatch sketch follows the list):

- `init` — Generate config files from manifest
- `validate` — Run 5 pre-flight checks
- `deploy` — Start the full stack
- `promote canary|stable` — Switch modes with a rolling restart
- `teardown` — Remove all containers and volumes
- `status` — Live metrics dashboard
- `audit` — Generate audit report
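To give a feel for the dispatch, here is a minimal argparse sketch. Whether the real CLI uses argparse at all is an assumption on my part:

```python
# cli.py: sketch of the swiftdeploy subcommand dispatch
# (argparse shown for illustration; the real wiring may differ)
import argparse

def main():
    parser = argparse.ArgumentParser(prog="swiftdeploy")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("init", help="generate config files from manifest")
    sub.add_parser("validate", help="run pre-flight checks")
    sub.add_parser("deploy", help="start the full stack")
    promote = sub.add_parser("promote", help="switch deployment mode")
    promote.add_argument("mode", choices=["canary", "stable"])
    sub.add_parser("teardown", help="remove all containers and volumes")
    sub.add_parser("status", help="live metrics dashboard")
    sub.add_parser("audit", help="generate audit report")

    args = parser.parse_args()
    # Each subcommand would map to a handler function here
    print(f"running: {args.command}")

if __name__ == "__main__":
    main()
```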
*SwiftDeploy architecture: from manifest.yaml to running stack*
The API Service
The API service is a Python Flask app that exposes three endpoints:
- `GET /` — Welcome message with current mode, version, and timestamp
- `GET /healthz` — Liveness check with uptime
- `POST /chaos` — Simulate degraded behaviour (canary mode only)
The service runs in either stable or canary mode, controlled by the MODE environment variable. Canary mode adds an X-Mode: canary header to every response and activates the chaos endpoint.
The chaos endpoint accepts three modes:
{"mode": "slow", "duration": 3}
{"mode": "error", "rate": 0.5}
{"mode": "recover"}
Instrumentation: The /metrics Endpoint
In Stage 4B, I added a /metrics endpoint in Prometheus text format using the prometheus-client library.
The following metrics are tracked:
- `http_requests_total` — Total requests with method, path, and status code labels
- `http_request_duration_seconds` — Request duration histogram
- `app_uptime_seconds` — How long the app has been running
- `app_mode` — Current mode (0 = stable, 1 = canary)
- `chaos_active` — Current chaos state (0 = none, 1 = slow, 2 = error)
This is what the /metrics output looks like:
```
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/",status_code="200"} 150.0
http_requests_total{method="GET",path="/",status_code="500"} 12.0
```
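Wiring this up with prometheus-client mostly comes down to a pair of request hooks. A sketch under the same illustrative-Flask assumption (metric names match the list above; the hook functions are mine):

```python
# metrics.py: sketch of the Flask instrumentation with prometheus-client
import time

from flask import Flask, g, request
from prometheus_client import (CONTENT_TYPE_LATEST, Counter, Gauge,
                               Histogram, generate_latest)

app = Flask(__name__)
START = time.monotonic()

REQUESTS = Counter("http_requests_total", "Total HTTP requests",
                   ["method", "path", "status_code"])
DURATION = Histogram("http_request_duration_seconds",
                     "Request duration in seconds", ["method", "path"])
UPTIME = Gauge("app_uptime_seconds", "Seconds since the app started")

@app.before_request
def start_timer():
    g.start = time.monotonic()

@app.after_request
def record_request(response):
    elapsed = time.monotonic() - g.start
    REQUESTS.labels(request.method, request.path,
                    str(response.status_code)).inc()
    DURATION.labels(request.method, request.path).observe(elapsed)
    return response

@app.route("/metrics")
def metrics():
    UPTIME.set(time.monotonic() - START)
    # Prometheus text exposition format, as shown above
    return generate_latest(), 200, {"Content-Type": CONTENT_TYPE_LATEST}
```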
The Guardrails: Open Policy Agent
The most important part of Stage 4B is the policy enforcement layer using OPA.
Why OPA?
The key principle is: the CLI must never make allow/deny decisions itself. All decision logic lives exclusively in OPA. This separation means:
- Policies can be updated without touching the CLI code
- Each policy domain is completely isolated
- The CLI just sends data and acts on the response
Policy Structure
I wrote two Rego policy files:
Infrastructure Policy — runs before deploy:
```rego
package infrastructure

deny_reasons := reasons if {
    reasons := [msg |
        checks := [
            [input.disk_free_gb < data.thresholds.min_disk_free_gb,
                sprintf("Disk free (%.1fGB) is below minimum", [input.disk_free_gb])],
            [input.cpu_load > data.thresholds.max_cpu_load,
                sprintf("CPU load (%.2f) exceeds maximum", [input.cpu_load])]
        ]
        check := checks[_]
        check[0] == true
        msg := check[1]
    ]
}
```
Canary Safety Policy — runs before promote:
```rego
package canary

deny_reasons := reasons if {
    reasons := [msg |
        checks := [
            [input.error_rate > data.thresholds.max_error_rate,
                sprintf("Error rate exceeds maximum of 1.00%% (current: %.2f%%)", [input.error_rate * 100])],
            [input.p99_latency_ms > data.thresholds.max_p99_latency_ms,
                sprintf("P99 latency exceeds maximum of 500ms (current: %.0fms)", [input.p99_latency_ms])]
        ]
        check := checks[_]
        check[0] == true
        msg := check[1]
    ]
}
```
Threshold values live in policies/data.json — never hardcoded in the Rego files:
```json
{
  "thresholds": {
    "min_disk_free_gb": 10.0,
    "max_cpu_load": 2.0,
    "max_error_rate": 0.01,
    "max_p99_latency_ms": 500
  }
}
```
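With the policies and thresholds in place, the CLI side stays tiny: it POSTs an input document to OPA's standard Data API (`POST /v1/data/<package>/<rule>`) and acts on whatever deny reasons come back. A sketch, with error handling elided and the helper name invented for illustration:

```python
# opa_client.py: sketch of asking OPA for a decision via its Data API
import requests

OPA_URL = "http://localhost:8181"

def check_policy(package, rule, input_doc):
    resp = requests.post(f"{OPA_URL}/v1/data/{package}/{rule}",
                         json={"input": input_doc}, timeout=5)
    resp.raise_for_status()
    # OPA wraps the rule's value in a "result" key; an empty list means allow
    return resp.json().get("result", [])

reasons = check_policy("infrastructure", "deny_reasons",
                       {"disk_free_gb": 42.0, "cpu_load": 0.8})
if reasons:
    raise SystemExit("deploy blocked:\n" + "\n".join(reasons))
```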
OPA Isolation
The OPA container is on a separate internal network (opa-internal) and is not accessible via the Nginx port. Only the CLI can reach it directly on port 8181.
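In Compose terms, that isolation can be expressed as a dedicated network that only OPA joins, plus a port binding restricted to localhost for the CLI. The fragment below is a sketch of the idea, not the generated file; the OPA command flags and the localhost-only binding are my assumptions:

```yaml
# docker-compose.yml fragment: sketch of the OPA isolation (illustrative)
services:
  nginx:
    image: nginx:latest
    ports:
      - "8080:80"             # public ingress
    networks:
      - swiftdeploy-net       # never shares a network with OPA

  opa:
    image: openpolicyagent/opa:latest
    command: ["run", "--server", "--addr", "0.0.0.0:8181", "/policies"]
    volumes:
      - ./policies:/policies  # Rego files plus data.json
    ports:
      - "127.0.0.1:8181:8181" # reachable only from the host, i.e. the CLI
    networks:
      - opa-internal

networks:
  swiftdeploy-net:
    driver: bridge
  opa-internal:
    driver: bridge
```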
The Chaos: What Happens When Things Go Wrong
This was the most interesting part of the project. I injected a 90% error rate into the canary deployment:
```bash
curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode":"error","rate":0.9}'
```
Immediately, the status dashboard showed:
```
📊 SwiftDeploy Status Dashboard
==================================================
Mode: canary
Chaos: error
Error rate: 66.67%
P99 latency: 5ms

📋 Policy Compliance:
✅ Infrastructure policy: PASSING
❌ Canary safety policy: FAILING
   → Error rate exceeds maximum of 1.00% (current: 66.67%)
```
When I tried to promote back to stable:
```
🔍 Running pre-promote policy checks...
Error rate: 61.54% | P99 latency: 5ms
❌ Canary safety policy: FAILED
   → Error rate exceeds maximum of 1.00% (current: 61.54%)
🚫 Promotion blocked by canary safety policy
```
The promotion was completely blocked until the error rate dropped below 1%. This is exactly what a production guardrail should do.
The Audit Trail
Every deploy, promote, and status scrape is logged to history.jsonl. Running swiftdeploy audit generates a clean Markdown report:
```markdown
## Policy Violations

| Timestamp | Policy | Details |
|-----------|--------|---------|
| 2026-05-05T22:30:50Z | Canary Safety | error_rate=53.33% p99=5ms |
| 2026-05-06T00:47:10Z | Canary Safety | error_rate=50.00% p99=5ms |
```
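The mechanics behind history.jsonl are plain append-only JSON lines. A sketch of what the logging helper could look like (the function name and the exact fields are illustrative):

```python
# audit.py: sketch of the append-only audit log (illustrative)
import json
from datetime import datetime, timezone

def log_event(event, **details):
    entry = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "event": event,
        **details,
    }
    # One JSON object per line keeps the log trivially appendable and parseable
    with open("history.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")

log_event("promote_blocked", policy="canary_safety",
          error_rate=0.5333, p99_ms=5)
```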
Lessons Learned
1. Declarative manifests are powerful
Having a single source of truth makes the entire system predictable. The grader can delete generated files and re-run init — everything regenerates correctly.
2. Policy isolation matters
Keeping OPA on a separate network means it can never be reached through the public Nginx ingress. This is a real security principle — your policy engine should never be publicly accessible.
3. Metrics tell the truth
Without the /metrics endpoint, I would have had no way to know the error rate was 66%. Instrumentation is not optional in production systems.
4. Windows PowerShell is different
Most DevOps tooling assumes Linux. Running this on Windows required adjustments for path handling, JSON escaping, and port checking. Always test on the target environment.
5. Chaos engineering is fun
Intentionally breaking things to verify your guardrails work is one of the most satisfying parts of DevOps. If your system can't handle chaos in testing, it will fail in production.
Source Code
The full source code is available on GitHub:
github.com/Wonderfullymade01/swiftdeploy
Built as part of the HNG Internship Stage 4 DevOps track.
