Introduction
As part of the HNG Internship Stage 4 DevOps track, I built SwiftDeploy — a declarative deployment CLI tool that generates Nginx and Docker Compose configurations from a single manifest.yaml file. In Stage 4B, I extended it with Prometheus metrics, Open Policy Agent (OPA) policy enforcement, a live status dashboard, and an audit trail.
This post walks through the entire journey — the design decisions, the guardrails, the chaos, and the lessons learned.
The Design: A Tool That Writes Its Own Infrastructure
The core idea behind SwiftDeploy is simple: you should only ever edit one file.
That file is manifest.yaml:
```yaml
services:
  name: app
  image: wonderfullymade-api:latest
  port: 3000
  version: "1.0.0"
  mode: stable
nginx:
  image: nginx:latest
  port: 8080
  proxy_timeout: 30
network:
  name: swiftdeploy-net
  driver_type: bridge
```
From this single file, the swiftdeploy CLI generates:
- `nginx.conf` — Nginx reverse proxy configuration
- `docker-compose.yml` — Full stack definition with app, Nginx, and OPA containers
This is done using Jinja2 templates. The CLI reads the manifest, renders the templates with the values, and writes the output files to the project root.
```
manifest.yaml → swiftdeploy init → nginx.conf + docker-compose.yml
```
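Under the hood, that step is small: load the manifest, render each template, write the output. Here is a minimal sketch, assuming hypothetical template filenames (`nginx.conf.j2`, `docker-compose.yml.j2`) and a `render_configs` helper; the real CLI may organize this differently:

```python
# render.py: minimal sketch of the manifest -> config rendering step
# (template names and helper are illustrative, not the actual SwiftDeploy code)
import yaml
from jinja2 import Environment, FileSystemLoader

def render_configs(manifest_path="manifest.yaml", template_dir="templates"):
    # manifest.yaml is the single source of truth
    with open(manifest_path) as f:
        manifest = yaml.safe_load(f)

    env = Environment(loader=FileSystemLoader(template_dir))

    # Each template sees the full manifest as its rendering context
    for template_name, output_name in [
        ("nginx.conf.j2", "nginx.conf"),
        ("docker-compose.yml.j2", "docker-compose.yml"),
    ]:
        rendered = env.get_template(template_name).render(**manifest)
        with open(output_name, "w") as f:
            f.write(rendered)

if __name__ == "__main__":
    render_configs()
```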
The CLI supports these subcommands (a small dispatch sketch follows the list):

- `init` — Generate config files from manifest
- `validate` — Run 5 pre-flight checks
- `deploy` — Start the full stack
- `promote canary|stable` — Switch modes with a rolling restart
- `teardown` — Remove all containers and volumes
- `status` — Live metrics dashboard
- `audit` — Generate audit report
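To give a feel for the dispatch, here is a minimal argparse sketch. Whether the real CLI uses argparse at all is an assumption on my part:

```python
# cli.py: sketch of the swiftdeploy subcommand dispatch
# (argparse shown for illustration; the real wiring may differ)
import argparse

def main():
    parser = argparse.ArgumentParser(prog="swiftdeploy")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("init", help="generate config files from manifest")
    sub.add_parser("validate", help="run pre-flight checks")
    sub.add_parser("deploy", help="start the full stack")
    promote = sub.add_parser("promote", help="switch deployment mode")
    promote.add_argument("mode", choices=["canary", "stable"])
    sub.add_parser("teardown", help="remove all containers and volumes")
    sub.add_parser("status", help="live metrics dashboard")
    sub.add_parser("audit", help="generate audit report")

    args = parser.parse_args()
    # Each subcommand would map to a handler function here
    print(f"running: {args.command}")

if __name__ == "__main__":
    main()
```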
*SwiftDeploy architecture: from manifest.yaml to running stack*
The API Service
The API service is a Python Flask app that exposes three endpoints:
- `GET /` — Welcome message with current mode, version, and timestamp
- `GET /healthz` — Liveness check with uptime
- `POST /chaos` — Simulate degraded behaviour (canary mode only)
The service runs in either stable or canary mode, controlled by the MODE environment variable. Canary mode adds an X-Mode: canary header to every response and activates the chaos endpoint.
The chaos endpoint accepts three modes:
{"mode": "slow", "duration": 3}
{"mode": "error", "rate": 0.5}
{"mode": "recover"}
Instrumentation: The /metrics Endpoint
In Stage 4B, I added a /metrics endpoint in Prometheus text format using the prometheus-client library.
The following metrics are tracked:
- `http_requests_total` — Total requests with method, path, and status code labels
- `http_request_duration_seconds` — Request duration histogram
- `app_uptime_seconds` — How long the app has been running
- `app_mode` — Current mode (0 = stable, 1 = canary)
- `chaos_active` — Current chaos state (0 = none, 1 = slow, 2 = error)
This is what the /metrics output looks like:
```
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/",status_code="200"} 150.0
http_requests_total{method="GET",path="/",status_code="500"} 12.0
```
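Wiring this up with prometheus-client mostly comes down to a pair of request hooks. A sketch under the same illustrative-Flask assumption (metric names match the list above; the hook functions are mine):

```python
# metrics.py: sketch of the Flask instrumentation with prometheus-client
import time

from flask import Flask, g, request
from prometheus_client import (CONTENT_TYPE_LATEST, Counter, Gauge,
                               Histogram, generate_latest)

app = Flask(__name__)
START = time.monotonic()

REQUESTS = Counter("http_requests_total", "Total HTTP requests",
                   ["method", "path", "status_code"])
DURATION = Histogram("http_request_duration_seconds",
                     "Request duration in seconds", ["method", "path"])
UPTIME = Gauge("app_uptime_seconds", "Seconds since the app started")

@app.before_request
def start_timer():
    g.start = time.monotonic()

@app.after_request
def record_request(response):
    elapsed = time.monotonic() - g.start
    REQUESTS.labels(request.method, request.path,
                    str(response.status_code)).inc()
    DURATION.labels(request.method, request.path).observe(elapsed)
    return response

@app.route("/metrics")
def metrics():
    UPTIME.set(time.monotonic() - START)
    # Prometheus text exposition format, as shown above
    return generate_latest(), 200, {"Content-Type": CONTENT_TYPE_LATEST}
```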
The Guardrails: Open Policy Agent
The most important part of Stage 4B is the policy enforcement layer using OPA.
Why OPA?
The key principle is: the CLI must never make allow/deny decisions itself. All decision logic lives exclusively in OPA. This separation means:
- Policies can be updated without touching the CLI code
- Each policy domain is completely isolated
- The CLI just sends data and acts on the response
Policy Structure
I wrote two Rego policy files:
Infrastructure Policy — runs before deploy:
```rego
package infrastructure

deny_reasons := reasons if {
    reasons := [msg |
        checks := [
            [input.disk_free_gb < data.thresholds.min_disk_free_gb,
                sprintf("Disk free (%.1fGB) is below minimum", [input.disk_free_gb])],
            [input.cpu_load > data.thresholds.max_cpu_load,
                sprintf("CPU load (%.2f) exceeds maximum", [input.cpu_load])]
        ]
        check := checks[_]
        check[0] == true
        msg := check[1]
    ]
}
```
Canary Safety Policy — runs before promote:
```rego
package canary

deny_reasons := reasons if {
    reasons := [msg |
        checks := [
            [input.error_rate > data.thresholds.max_error_rate,
                sprintf("Error rate exceeds maximum of 1.00%% (current: %.2f%%)", [input.error_rate * 100])],
            [input.p99_latency_ms > data.thresholds.max_p99_latency_ms,
                sprintf("P99 latency exceeds maximum of 500ms (current: %.0fms)", [input.p99_latency_ms])]
        ]
        check := checks[_]
        check[0] == true
        msg := check[1]
    ]
}
```
Threshold values live in policies/data.json — never hardcoded in the Rego files:
```json
{
  "thresholds": {
    "min_disk_free_gb": 10.0,
    "max_cpu_load": 2.0,
    "max_error_rate": 0.01,
    "max_p99_latency_ms": 500
  }
}
```
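With the policies and thresholds in place, the CLI side stays tiny: it POSTs an input document to OPA's standard Data API (`POST /v1/data/<package>/<rule>`) and acts on whatever deny reasons come back. A sketch, with error handling elided and the helper name invented for illustration:

```python
# opa_client.py: sketch of asking OPA for a decision via its Data API
import requests

OPA_URL = "http://localhost:8181"

def check_policy(package, rule, input_doc):
    resp = requests.post(f"{OPA_URL}/v1/data/{package}/{rule}",
                         json={"input": input_doc}, timeout=5)
    resp.raise_for_status()
    # OPA wraps the rule's value in a "result" key; an empty list means allow
    return resp.json().get("result", [])

reasons = check_policy("infrastructure", "deny_reasons",
                       {"disk_free_gb": 42.0, "cpu_load": 0.8})
if reasons:
    raise SystemExit("deploy blocked:\n" + "\n".join(reasons))
```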
OPA Isolation
The OPA container is on a separate internal network (opa-internal) and is not accessible via the Nginx port. Only the CLI can reach it directly on port 8181.
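In Compose terms, that isolation can be expressed as a dedicated network that only OPA joins, plus a port binding restricted to localhost for the CLI. The fragment below is a sketch of the idea, not the generated file; the OPA command flags and the localhost-only binding are my assumptions:

```yaml
# docker-compose.yml fragment: sketch of the OPA isolation (illustrative)
services:
  nginx:
    image: nginx:latest
    ports:
      - "8080:80"             # public ingress
    networks:
      - swiftdeploy-net       # never shares a network with OPA

  opa:
    image: openpolicyagent/opa:latest
    command: ["run", "--server", "--addr", "0.0.0.0:8181", "/policies"]
    volumes:
      - ./policies:/policies  # Rego files plus data.json
    ports:
      - "127.0.0.1:8181:8181" # reachable only from the host, i.e. the CLI
    networks:
      - opa-internal

networks:
  swiftdeploy-net:
    driver: bridge
  opa-internal:
    driver: bridge
```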
The Chaos: What Happens When Things Go Wrong
This was the most interesting part of the project. I injected a 90% error rate into the canary deployment:
```bash
curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode":"error","rate":0.9}'
```
Immediately, the status dashboard showed:
```
📊 SwiftDeploy Status Dashboard
==================================================
Mode: canary
Chaos: error
Error rate: 66.67%
P99 latency: 5ms

📋 Policy Compliance:
✅ Infrastructure policy: PASSING
❌ Canary safety policy: FAILING
   → Error rate exceeds maximum of 1.00% (current: 66.67%)
```
When I tried to promote back to stable:
```
🔍 Running pre-promote policy checks...
Error rate: 61.54% | P99 latency: 5ms
❌ Canary safety policy: FAILED
   → Error rate exceeds maximum of 1.00% (current: 61.54%)
🚫 Promotion blocked by canary safety policy
```
The promotion was completely blocked until the error rate dropped below 1%. This is exactly what a production guardrail should do.
The Audit Trail
Every deploy, promote, and status scrape is logged to history.jsonl. Running swiftdeploy audit generates a clean Markdown report:
```markdown
## Policy Violations

| Timestamp | Policy | Details |
|-----------|--------|---------|
| 2026-05-05T22:30:50Z | Canary Safety | error_rate=53.33% p99=5ms |
| 2026-05-06T00:47:10Z | Canary Safety | error_rate=50.00% p99=5ms |
```
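The mechanics behind history.jsonl are plain append-only JSON lines. A sketch of what the logging helper could look like (the function name and the exact fields are illustrative):

```python
# audit.py: sketch of the append-only audit log (illustrative)
import json
from datetime import datetime, timezone

def log_event(event, **details):
    entry = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "event": event,
        **details,
    }
    # One JSON object per line keeps the log trivially appendable and parseable
    with open("history.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")

log_event("promote_blocked", policy="canary_safety",
          error_rate=0.5333, p99_ms=5)
```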
Lessons Learned
1. Declarative manifests are powerful
Having a single source of truth makes the entire system predictable. The grader can delete generated files and re-run init — everything regenerates correctly.
2. Policy isolation matters
Keeping OPA on a separate network means it can never be reached through the public Nginx ingress. This is a real security principle — your policy engine should never be publicly accessible.
3. Metrics tell the truth
Without the /metrics endpoint, I would have had no way to know the error rate was 66%. Instrumentation is not optional in production systems.
4. Windows PowerShell is different
Most DevOps tooling assumes Linux. Running this on Windows required adjustments for path handling, JSON escaping, and port checking. Always test on the target environment.
5. Chaos engineering is fun
Intentionally breaking things to verify your guardrails work is one of the most satisfying parts of DevOps. If your system can't handle chaos in testing, it will fail in production.
Source Code
The full source code is available on GitHub:
github.com/Wonderfullymade01/swiftdeploy
Built as part of the HNG Internship Stage 4 DevOps track.
