Ian alex

Building SwiftDeploy: A Declarative CLI That Writes Its Own Infrastructure, With Observability, Policy Enforcement, and Auditing

Most DevOps tasks hand you a broken server and ask you to fix it. This one asked me to build the tool that manages the server for you. Over the course of HNG Stage 4A and 4B, I built SwiftDeploy — a CLI tool that reads a single YAML manifest and generates everything needed to run, monitor, and enforce policy on a containerized service.
This post walks through how I designed it, what went wrong along the way, and what I'd do differently next time.


The Architecture
SwiftDeploy manages four containers from a single manifest.yaml:

The Go app serves HTTP traffic. Nginx sits in front as a reverse proxy. OPA (Open Policy Agent) acts as a policy sidecar that the CLI queries before deploying or promoting. The CLI never makes allow/deny decisions itself — all that logic lives in Rego policy files.


Part 1: The Design — A Tool That Writes Its Own Config
The Manifest as Single Source of Truth
Everything starts with manifest.yaml:

services:
  image: swift-deploy-1-node:latest
  port: 3000
  mode: stable
  version: "1.0.0"
  restart_policy: unless-stopped

nginx:
  image: nginx:latest
  port: 8080
  proxy_timeout: 30

network:
  name: swiftdeploy-net
  driver_type: bridge

This is the only file you edit manually. When you run ./swiftdeploy init, the CLI reads this manifest and generates two files from templates:

  • nginx.conf — the full Nginx reverse proxy configuration

  • docker-compose.yml — the complete container orchestration file

Template Rendering
The templates live in templates/ and use {{placeholder}} syntax. The CLI replaces each placeholder with the corresponding value from the manifest:

def render_template(template_path, output_path, manifest):
    replacements = {
        "services_image": str(get_val(manifest, "services.image", "")),
        "services_port": str(get_val(manifest, "services.port", "3000")),
        "nginx_port": str(get_val(manifest, "nginx.port", "8080")),
        # ... more replacements
    }
    with open(template_path, "r") as f:
        content = f.read()
    for key, value in replacements.items():
        content = content.replace("{{" + key + "}}", value)
    with open(output_path, "w") as f:
        f.write(content)

This approach means the grader can delete nginx.conf and docker-compose.yml, run ./swiftdeploy init, and they regenerate perfectly from the manifest. No handwritten configs to drift out of sync.

The Go Service
The API service is a single Go binary built with the Chi router. It exposes four endpoints:

  • GET / — welcome message with mode, version, and timestamp
  • GET /healthz — liveness check with uptime
  • GET /metrics — Prometheus-format metrics (added in Stage 4B)
  • POST /chaos — chaos injection (only works in canary mode)

The service runs in one of two modes controlled by the MODE environment variable: stable or canary. In canary mode, it adds an X-Mode: canary header to every response and activates the chaos endpoint.

The CLI Subcommands
SwiftDeploy supports seven subcommands:

Command     What it does
init        Parse manifest, generate configs
validate    Run 5 pre-flight checks
deploy      Init + OPA policy check + start stack + health check (sketched below)
promote     Switch mode with policy gate + rolling restart
status      Live terminal dashboard
audit       Generate markdown report from history
teardown    Remove everything
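
Of these, deploy is the one that ties the pieces together. Roughly, it boils down to the following (simplified; the template filenames and retry counts here are illustrative, not the exact values from the real CLI):

import subprocess
import sys
import time
import urllib.request
import yaml

def deploy():
    # Re-render both configs from the manifest (render_template is shown earlier;
    # the template filenames here are illustrative)
    with open("manifest.yaml") as f:
        manifest = yaml.safe_load(f)
    render_template("templates/nginx.conf.tpl", "nginx.conf", manifest)
    render_template("templates/docker-compose.yml.tpl", "docker-compose.yml", manifest)

    # Policy gate: ask OPA whether it is safe to deploy (covered in Part 2)

    # Start the stack and wait for the service to answer through Nginx
    subprocess.run(["docker", "compose", "up", "-d"], check=True)
    for _ in range(30):
        try:
            with urllib.request.urlopen("http://localhost:8080/healthz", timeout=2) as resp:
                if resp.status == 200:
                    print("Service is healthy")
                    return
        except OSError:
            pass
        time.sleep(1)
    sys.exit("Service never became healthy")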

Part 2: The Guardrails — OPA Policy Enforcement

Why OPA?
The core principle is separation of concerns: the CLI handles orchestration, OPA handles decisions. The CLI sends data (disk space, CPU load, error rates) to OPA and asks "is this safe?" OPA evaluates its Rego policies and returns a structured decision with reasoning.

This means I can change policy thresholds without touching the CLI code, and the CLI never contains hardcoded limits.

Infrastructure Policy
Before every deploy, the CLI gathers host stats and sends them to OPA:

def get_host_stats():
    stats = {}
    usage = shutil.disk_usage("/")
    stats["disk_free_gb"] = round(usage.free / (1024 ** 3), 2)
    with open("/proc/loadavg", "r") as f:
        stats["cpu_load"] = float(f.read().split()[0])
    return stats

The Rego policy checks these against thresholds defined in a separate data.json file:

package swiftdeploy.infra

violations contains msg if {
    input.disk_free_gb < data.thresholds.min_disk_free_gb
    msg := sprintf("Disk free space %.1fGB is below minimum %.1fGB",
        [input.disk_free_gb, data.thresholds.min_disk_free_gb])
}

violations contains msg if {
    input.cpu_load > data.thresholds.max_cpu_load
    msg := sprintf("CPU load %.2f exceeds maximum %.2f",
        [input.cpu_load, data.thresholds.max_cpu_load])
}

The thresholds (min_disk_free_gb: 10.0, max_cpu_load: 2.0) live in policies/data.json, not in the Rego files. This separation means ops teams can tune limits without understanding Rego syntax.
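
On the CLI side, the whole policy check is one HTTP call to OPA's data API. The URL path mirrors the Rego package name, and the violations set comes back as a JSON array (error handling is trimmed here; it's covered in the failure-handling section below):

import json
import urllib.request

def check_infra_policy(stats):
    # POST the host stats as "input"; package swiftdeploy.infra maps to
    # /v1/data/swiftdeploy/infra on OPA's REST API
    req = urllib.request.Request(
        "http://localhost:8181/v1/data/swiftdeploy/infra",
        data=json.dumps({"input": stats}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        result = json.load(resp).get("result", {})
    violations = result.get("violations", [])
    return len(violations) == 0, violations

If the violations list is empty, the deploy goes ahead; otherwise the CLI prints each message and refuses to continue.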

Canary Safety Policy
Before promoting from canary to stable, the CLI scrapes /metrics, calculates the error rate and P99 latency, and asks OPA if the canary is healthy:

package swiftdeploy.canary

violations contains msg if {
    input.error_rate_pct > data.thresholds.max_error_rate_pct
    msg := sprintf("Error rate %.2f%% exceeds maximum %.2f%%",
        [input.error_rate_pct, data.thresholds.max_error_rate_pct])
}

violations contains msg if {
    input.p99_latency_ms > data.thresholds.max_p99_latency_ms
    msg := sprintf("P99 latency %.0fms exceeds maximum %.0fms",
        [input.p99_latency_ms, data.thresholds.max_p99_latency_ms])
}

OPA Isolation
The OPA container sits on its own internal Docker network (opa-internal). It's reachable by the CLI (via localhost:8181 port mapping) but not accessible through the Nginx ingress. This means external users can never query or tamper with policy decisions.

Failure Handling
If OPA is down, the CLI doesn't crash or hang. Each failure mode produces a different message:

  • OPA unreachable → "OPA unreachable: Connection refused"
  • OPA timeout → "OPA request timed out"
  • OPA returns bad data → "OPA returned invalid JSON"

In all cases, the deploy proceeds with a warning (fail-open). This is a deliberate design choice — a broken policy engine shouldn't prevent emergency deploys.
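
In code, fail-open is just a wrapper around the policy call that maps each failure mode to its own warning instead of raising (a simplified sketch, assuming the urllib-based client from the earlier snippet):

import json
import urllib.error

def check_policy_fail_open(check_fn, payload):
    # Never let an OPA outage block a deploy: warn and allow instead.
    try:
        return check_fn(payload)
    except urllib.error.URLError as e:
        print(f"WARNING: OPA unreachable: {e.reason} (proceeding without policy check)")
    except TimeoutError:
        print("WARNING: OPA request timed out (proceeding without policy check)")
    except (json.JSONDecodeError, KeyError):
        print("WARNING: OPA returned invalid JSON (proceeding without policy check)")
    return True, []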


Part 3: The Chaos — Breaking Things on Purpose

Injecting Slow Responses
With the service in canary mode, I injected a 2-second delay:

./swiftdeploy promote canary
curl -X POST -H "Content-Type: application/json" \
  -d '{"mode":"slow","duration":2}' http://localhost:8080/chaos

The chaos middleware intercepts every request and sleeps before passing it to the handler:

if slowDur > 0 {
    time.Sleep(slowDur)
}

The status dashboard immediately reflected this — P99 latency jumped above 2000ms and the canary safety policy flipped to failing.

Injecting Error Responses
Next, I injected a 50% error rate:

curl -X POST -H "Content-Type: application/json" \
  -d '{"mode":"error","rate":0.5}' http://localhost:8080/chaos

The middleware rolls a random number on each request:

if errRate > 0 && rand.Float64() < errRate {
    writeJSON(w, http.StatusInternalServerError, map[string]string{
        "error": "Chaos-induced server error",
    })
    return
}

The status dashboard showed the error rate climbing above 1%, and attempting ./swiftdeploy promote stable was blocked by the canary safety policy with a clear violation message.

Recovery

curl -X POST -H "Content-Type: application/json" \
  -d '{"mode":"recover"}' http://localhost:8080/chaos

Error rate dropped to 0%, P99 returned to normal, and promotion was allowed again.


Part 4: Observability — The /metrics Endpoint
Stage 4B required Prometheus-format metrics. I built a custom metrics collector rather than importing the full Prometheus client library, keeping the binary small:

type MetricsCollector struct {
    mu             sync.Mutex
    requestCounts  map[string]int64        // http_requests_total
    histogramSums  map[string]float64      // duration sums
    histogramCount map[string]int64        // duration counts
    histogramBuckets map[string]map[float64]int64  // bucket counts
    buckets        []float64               // standard histogram buckets
}

A middleware records every request's method, path, status code, and duration. The /metrics endpoint renders everything in Prometheus text format.

The CLI's status command scrapes this endpoint every 3 seconds, calculates real-time error rates and P99 latency, checks both OPA policies, and appends everything to history.jsonl for the audit trail.
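
The error-rate math on the CLI side is small once the metrics text is in hand. A trimmed version of one scrape cycle (the metric name matches the collector above; the exact label keys and history fields are illustrative):

import json
import re
import time
import urllib.request

def scrape_once():
    # Fetch the raw Prometheus text, e.g. lines like:
    # http_requests_total{method="GET",path="/",status="200"} 42
    with urllib.request.urlopen("http://localhost:8080/metrics", timeout=2) as resp:
        text = resp.read().decode()

    total = errors = 0.0
    for line in text.splitlines():
        m = re.match(r'http_requests_total\{([^}]*)\}\s+(\S+)', line)
        if not m:
            continue
        labels, count = m.group(1), float(m.group(2))
        total += count
        if 'status="5' in labels:
            errors += count

    error_rate_pct = (errors / total * 100) if total else 0.0
    # P99 is derived from the duration histogram buckets (omitted here);
    # both numbers feed the OPA canary check and the audit trail
    entry = {"ts": time.time(), "error_rate_pct": round(error_rate_pct, 2)}
    with open("history.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")
    return error_rate_pct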


Part 5: The Audit Trail
Every significant event — deploys, promotions, policy checks, teardowns — gets appended to history.jsonl. Running ./swiftdeploy audit parses this file and generates a GitHub Flavored Markdown report (sketched after the list below) with:

  • A timeline table of all events
  • A violations section listing any policy denials with reasons
  • A metrics summary with min/max/average error rates and latencies
  • A chaos events table showing when faults were injected and their impact
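
The heavy lifting is just reading JSON lines and formatting Markdown. The timeline table, for instance, comes down to something like this (the event field names here are illustrative; the real records carry more detail):

import json

def build_timeline(history_path="history.jsonl"):
    # Each line in history.jsonl is one JSON event appended by the CLI
    rows = []
    with open(history_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            rows.append((event.get("timestamp", ""),
                         event.get("event", ""),
                         event.get("detail", "")))

    # Render a GitHub Flavored Markdown table
    out = ["| Timestamp | Event | Detail |", "|---|---|---|"]
    out += [f"| {ts} | {ev} | {detail} |" for ts, ev, detail in rows]
    return "\n".join(out)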

Lessons Learned

  1. Template rendering is surprisingly tricky. Handling both {{key}} and {{ key }} (with spaces) caused bugs early on. A simple string replacement approach worked better than pulling in Jinja2, which added dependency headaches across different Python versions.

  2. OPA's Rego language has a learning curve. The declarative style (rules are evaluated as sets, not imperatively) takes adjustment. But once it clicks, the separation between "what data do I have" (CLI) and "what does it mean" (OPA) is powerful.

  3. Fail-open vs fail-closed is a real decision. I chose fail-open (deploy proceeds if OPA is down) because blocking deploys during an OPA outage could prevent fixing the OPA outage itself. In a production environment, you'd want monitoring on the policy engine separately.

  4. Building your own Prometheus metrics is educational but painful. Histogram bucket math, cumulative counting, and the text exposition format all have gotchas. For a real project, use the official Prometheus Go client. For learning, building it from scratch teaches you what the library actually does.


Repository
Full source code: GitHub - SwiftDeploy

To replicate:

git clone <repo-url>
cd SwiftDeploy
docker build -t swift-deploy-1-node:latest .
pip install pyyaml
./swiftdeploy deploy
./swiftdeploy promote canary
./swiftdeploy status    # Ctrl+C to stop
./swiftdeploy audit
./swiftdeploy teardown --clean

This post covers HNG Internship Stage 4A and 4B — DevOps Track.
