Edith Asante


# I Built a Deployment CLI That Says No — And Here's the Policy Engine Behind It

Most deployment tools ask you to configure infrastructure manually. This one writes it for you — and refuses to deploy if it is not safe.


The Problem I Set Out to Solve

Every time I deployed a new service, I found myself doing the same things:

  • Writing a Docker Compose file
  • Writing an Nginx config
  • Hoping both were consistent with each other
  • Manually checking if the server had enough resources
  • Deploying and hoping for the best

There had to be a better way. What if a single file described everything — and a tool generated all the configs, checked all the policies, and deployed the stack automatically?

That is what SwiftDeploy does.


What Is SwiftDeploy?

SwiftDeploy is a CLI tool built in Python that:

  1. Reads a single manifest.yaml file
  2. Generates nginx.conf and docker-compose.yml from templates
  3. Asks OPA (Open Policy Agent) if it is safe to deploy
  4. Brings up the stack and waits for health checks
  5. Lets you promote between stable and canary modes — but only if the canary is healthy
  6. Records every decision in an audit trail
  7. Shows you a live dashboard of what is happening

The manifest is the only file you ever edit. Everything else is generated.
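
Step 4 in the list above (bring up the stack and wait for health checks) can be sketched as a polling loop. This is a minimal sketch, not SwiftDeploy's actual implementation: `wait_until_healthy` and its parameters are hypothetical names, and the `probe` callable stands in for whatever request the CLI makes to the service's health endpoint.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection refused or timed out: the service is not up yet
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Passing the probe in as a callable keeps the loop testable without a running container, and returning False on timeout lets the caller decide whether to tear the stack down.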


Part 1 — The Design: A Tool That Writes Its Own Infrastructure

The Manifest

Here is what manifest.yaml looks like:

services:
  image: swift-deploy-1-node:latest
  port: 3000
  mode: stable
  version: v1

nginx:
  image: nginx:latest
  port: 8080
  proxy_timeout: 60

network:
  name: swiftdeploy-net
  driver_type: bridge

That is the entire configuration. One file. Everything else is derived from it.
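
Because everything is derived from this one file, it can help to check it before generating anything. The sketch below validates a manifest that has already been parsed into a dictionary. The function name, the choice of which keys are required, and the port range check are assumptions made for illustration, not rules taken from SwiftDeploy.

```python
# Sections and keys taken from the manifest shown above; treating
# them as mandatory is an assumption of this sketch.
REQUIRED_KEYS = {
    "services": ("image", "port"),
    "nginx": ("image", "port"),
    "network": ("name",),
}


def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for section, keys in REQUIRED_KEYS.items():
        body = manifest.get(section)
        if not isinstance(body, dict):
            problems.append(f"missing section: {section}")
            continue
        for key in keys:
            if key not in body:
                problems.append(f"missing key: {section}.{key}")
    # Ports must be integers in the valid TCP range
    for section in ("services", "nginx"):
        body = manifest.get(section)
        port = body.get("port") if isinstance(body, dict) else None
        if port is not None and not (isinstance(port, int) and 1 <= port <= 65535):
            problems.append(f"invalid port: {section}.port={port!r}")
    return problems
```

Collecting every problem instead of stopping at the first one means a broken manifest can be fixed in a single pass.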

The Templates

The init command reads the manifest and fills in template files:

def init():
    manifest = load_manifest()

    # Read the Compose template shipped with the CLI
    with open("templates/docker-compose.yml.tpl", "r") as f:
        compose_tpl = f.read()

    # Substitute manifest values for the {{ ... }} placeholders
    compose_out = compose_tpl.replace("{{ app_image }}", manifest["services"]["image"])
    compose_out = compose_out.replace("{{ mode }}", manifest["services"].get("mode", "stable"))

    # Write the generated file; it is overwritten on every run
    with open("docker-compose.yml", "w") as f:
        f.write(compose_out)

If you delete your configs, running init gives you the exact same stack back. No guessing. No inconsistency.
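
Chained `replace` calls work, but a placeholder that is never substituted passes through silently into the generated file. As an alternative sketch (not the author's code), a single regular expression can fill every `{{ name }}` placeholder and fail loudly on unknown names. The `render` helper and the flat `values` dictionary are hypothetical.

```python
import re

# Matches {{ name }} with optional whitespace inside the braces
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, values: dict) -> str:
    """Replace every {{ name }} placeholder, failing on unknown names."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"template references unknown value: {name}")
        return str(values[name])

    return PLACEHOLDER.sub(substitute, template)
```

The same inputs always produce the same output, which is the property that makes regenerating from the manifest safe.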

Why This Matters

In most projects, configs drift over time. Someone edits docker-compose.yml directly. Someone else edits nginx.conf. After six months, nobody knows what the source of truth is.

With SwiftDeploy, the source of truth is always manifest.yaml. If it is not in the manifest, it does not exist.


Part 2 — The Guardrails: Policy Enforcement with OPA

Why OPA?

I could have written the policy checks directly in Python. But the task required something more important — separation of concerns.

The CLI should not decide what is safe. That decision should live in a separate system that can be updated independently. That system is OPA — Open Policy Agent.

OPA runs as a separate container. The CLI sends data to OPA and OPA sends back a decision. The CLI just follows orders.
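
The exchange can be sketched with the standard library alone. OPA's Data API accepts a POST to `/v1/data/<package>` with the input wrapped in an `{"input": ...}` object and answers with a `{"result": ...}` document. The helper names below are hypothetical and this is not the CLI's actual code. Treating any error as a denial is a design assumption of the sketch.

```python
import json
import urllib.request


def build_opa_request(base_url: str, package: str, data: dict) -> urllib.request.Request:
    """Build a POST to OPA's Data API for one policy package."""
    body = json.dumps({"input": data}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/data/{package}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_decision(response_body: bytes) -> tuple[bool, str]:
    """Extract (allow, reason); anything unexpected counts as a denial."""
    payload = json.loads(response_body)
    result = payload.get("result") if isinstance(payload, dict) else None
    if not isinstance(result, dict):
        return False, "policy returned no decision"
    return result.get("allow") is True, str(result.get("reason", ""))


def ask_opa(base_url: str, package: str, data: dict, timeout_s: float = 5.0) -> tuple[bool, str]:
    """Send the input to OPA and return its decision, failing closed on errors."""
    request = build_opa_request(base_url, package, data)
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as resp:
            return parse_decision(resp.read())
    except (OSError, ValueError) as exc:
        # Network failures and malformed JSON both block the deployment
        return False, f"policy engine unavailable: {exc}"
```

A call such as `ask_opa("http://localhost:8181", "infra", stats)` would query the infrastructure package, assuming OPA listens on its default port 8181. Failing closed means an unreachable policy engine blocks a deployment instead of waving it through.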

Infrastructure Policy

Before deploying, the CLI collects host statistics and sends them to OPA:

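import shutil

import psutil


def get_host_stats():
    # Free space on the root filesystem, in GiB
    disk = shutil.disk_usage("/")
    disk_free_gb = disk.free / (1024 ** 3)
    # Sample over one second: without an interval, the first call returns a meaningless 0.0
    # The value is a 0.0 to 1.0 fraction, so it can never exceed the policy's 2.0 ceiling
    cpu_load = psutil.cpu_percent(interval=1) / 100
    return {
        "disk_free_gb": round(disk_free_gb, 2),
        "cpu_load": round(cpu_load, 2),
    }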

OPA evaluates the infrastructure policy:

package infra

default allow := false

allow := true if {
    input.disk_free_gb >= 10
    input.cpu_load <= 2.0
}

reason := "Disk space too low" if {
    input.disk_free_gb < 10
}

If free disk space is below 10 GB or the CPU load is above 2.0, the deployment is blocked:

Running pre-deploy policy check...
   Disk free: 5.2GB | CPU: 0.3 | Memory: 45%
Infrastructure policy: BLOCKED
   Reason: Disk space too low

Canary Safety Policy

Before promoting to canary mode, the CLI scrapes the /metrics endpoint and calculates the error rate and P99 latency:

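def calc_error_rate(metrics):
    """Return 5xx responses as a percentage of all requests."""
    # Restrict both sums to the request counter so that other metrics
    # carrying a status_code label cannot inflate the error count
    requests = {k: v for k, v in metrics.items() if k.startswith("http_requests_total")}
    total = sum(requests.values())
    errors = sum(v for k, v in requests.items() if 'status_code="5' in k)
    return round((errors / total) * 100, 2) if total > 0 else 0.0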
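
The P99 calculation is not shown above. One way to estimate it, assuming the service exports a Prometheus histogram, is to walk the cumulative `le` buckets until 99% of observations are covered. The metric name `http_request_duration_seconds` and the helper `calc_p99_ms` are assumptions of this sketch. It reports the bucket's upper bound and does not interpolate, so the estimate is conservative.

```python
import math
import re

LE_LABEL = re.compile(r'le="([^"]+)"')


def calc_p99_ms(metrics: dict, name: str = "http_request_duration_seconds") -> float:
    """Estimate P99 latency in ms from cumulative histogram buckets."""
    buckets = {}
    for key, value in metrics.items():
        match = LE_LABEL.search(key)
        if key.startswith(name + "_bucket") and match:
            bound = float(match.group(1))  # "+Inf" parses to infinity
            # Sum across label sets (routes, methods) sharing a bound
            buckets[bound] = buckets.get(bound, 0.0) + value
    if not buckets:
        return 0.0
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]  # the largest bucket holds every observation
    if total <= 0:
        return 0.0
    target = 0.99 * total
    for bound in bounds:
        if buckets[bound] >= target:
            if math.isinf(bound):
                # Fall back to the highest finite bound, as Prometheus does
                finite = [b for b in bounds if not math.isinf(b)]
                bound = finite[-1] if finite else 0.0
            return round(bound * 1000, 1)
    return 0.0
```

Bucket boundaries limit the resolution: with bounds at 0.5 s and 2.5 s, any P99 between them is reported as 2500 ms.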

OPA evaluates the canary safety policy:

package canary

default allow := false

allow := true if {
    input.error_rate <= 1.0
    input.p99_latency_ms <= 500
}

reason := "P99 latency too high (must be <= 500ms)" if {
    input.p99_latency_ms > 500
}

If the canary is unhealthy, the promotion is blocked:

Running pre-promote policy check...
   Error rate: 0.0% | P99 latency: 2100.0ms
Canary safety policy: BLOCKED
   Reason: P99 latency too high (must be <= 500ms)

Why Isolation Matters

OPA runs as a separate container and is only reachable by the CLI — not through Nginx. This means:

  • No external actor can query or manipulate policy decisions
  • Policies can be updated without touching the CLI code
  • Each domain (infrastructure, canary) owns exactly one question

Part 3 — The Chaos: What Happened When Things Broke

Injecting Slow Chaos

The API exposes a /chaos endpoint that simulates degraded behaviour:

curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode": "slow", "duration": 2}'

This makes every request sleep for 2 seconds before responding. The metrics immediately reflect the change — P99 latency spikes.

The Status View Catches It

Running swiftdeploy status shows the live state:

--- Scrape @ Fri May 15 12:38:05 2026 ---
  Mode:        canary
  Uptime:      115s
  Error rate:  0.0%
  P99 latency: 2100.0ms
  Chaos:       active

  Policy Compliance:
    Infrastructure: PASS
    Canary safety:  FAIL - P99 latency too high

The Promotion Is Blocked

When I tried to promote:

Running pre-promote policy check...
   Error rate: 0.0% | P99 latency: 2100.0ms
Canary safety policy: BLOCKED
   Reason: P99 latency too high (must be <= 500ms)

The system worked exactly as designed. The broken canary could not be promoted.

Recovery

curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode": "recover"}'

Latency dropped back to normal and the next promote attempt passed.


Part 4 — The Audit Trail

Every action is recorded in history.jsonl:

{"event": "deploy", "status": "success", "timestamp": 1778794519.2}
{"event": "pre_promote_check", "result": {"allow": false, "reason": "P99 latency too high"}, "timestamp": 1778799306.5}
{"event": "promote", "mode": "canary", "status": "success", "timestamp": 1778799535.0}
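
Writing such a log takes very little code. The sketch below appends one JSON object per line, matching the shape of the entries above. The `record_event` helper and its signature are hypothetical, not taken from the SwiftDeploy source.

```python
import json
import time


def record_event(path: str, event: str, **fields) -> dict:
    """Append one JSON object per line so the log stays append-only."""
    entry = {"event": event, **fields, "timestamp": round(time.time(), 1)}
    # Mode "a" never truncates, so earlier history is preserved
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

For example, `record_event("history.jsonl", "deploy", status="success")` would add a line like the first entry shown. JSON Lines suits this job because each record is independent: a crash mid-write damages at most the last line.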

Running swiftdeploy audit generates audit_report.md:

## Timeline

| Time | Event | Details |
|---|---|---|
| Fri May 15 12:36:48 | deploy | status=success |
| Fri May 15 12:40:17 | pre_promote_check | BLOCKED reason=P99 latency too high |
| Fri May 15 12:44:50 | promote | mode=canary status=success |

## Policy Violations

| Time | Check | Reason |
|---|---|---|
| Fri May 15 12:40:17 | pre_promote_check | P99 latency too high |

You can always answer the question "what happened and when" with a single command.
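
Turning the log into the Timeline table is mostly string formatting. The following is a sketch under stated assumptions, not the code behind swiftdeploy audit: the helper names are hypothetical, and timestamps are rendered in UTC to keep the output reproducible, whereas the report above may well use local time.

```python
import json
from datetime import datetime, timezone


def format_time(ts: float) -> str:
    """Render a Unix timestamp like 'Fri May 15 12:36:48' (UTC)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%a %b %d %H:%M:%S")


def describe(entry: dict) -> str:
    """Summarise one history entry for the Details column."""
    result = entry.get("result")
    if isinstance(result, dict):
        if result.get("allow"):
            return "ALLOWED"
        return f"BLOCKED reason={result.get('reason', 'unknown')}"
    skip = {"event", "timestamp", "result"}
    return " ".join(f"{k}={v}" for k, v in entry.items() if k not in skip)


def render_timeline(jsonl_text: str) -> str:
    """Build the Markdown timeline table from history.jsonl contents."""
    rows = ["| Time | Event | Details |", "|---|---|---|"]
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the log
        entry = json.loads(line)
        rows.append(
            f"| {format_time(entry['timestamp'])} | {entry['event']} | {describe(entry)} |"
        )
    return "\n".join(rows)
```

The Policy Violations table could be produced the same way by keeping only entries whose result has allow set to false.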


Lessons Learned

1. Declarative infrastructure is worth the investment
Writing templates takes time upfront but saves far more time later. When something breaks, you regenerate from the manifest and know the configs are consistent with it.

2. Policies should be external
Keeping policy logic in OPA means you can update thresholds without touching the CLI code. Production systems commonly draw the same line between the tool that deploys and the rules that decide.

3. Metrics drive decisions — not just monitoring
I used to think metrics were for dashboards. Now I use them to gate deployments. If the canary is unhealthy, the metrics prove it and the policy enforces the consequence.

4. Audit trails matter more than you think
During debugging, I could look at history.jsonl and see exactly what happened and in what order. Without it, I would have been guessing.

5. The CLI is just an orchestrator
SwiftDeploy does not make decisions. It collects data, asks OPA, and follows the answer. This separation makes the system trustworthy and testable.


The Final Result

A complete declarative deployment system that:

  • Generates infrastructure from a single manifest
  • Validates pre-flight conditions before deploying
  • Enforces infrastructure and canary safety policies via OPA
  • Tracks metrics in Prometheus format
  • Shows a live dashboard of system state and policy compliance
  • Records every decision in a structured audit trail
  • Generates a clean audit report in GitHub-flavored Markdown

Full source code: https://github.com/asanteedith/swiftdeploy-project


Written by **Edith Asante**, *Cloud & DevOps Engineer*
