How I built a manifest-driven CLI that generates its own infrastructure, enforces environment policy through OPA, observes the running system in real time, and audits every decision it makes
Most deployment tools ask you to write configuration files. SwiftDeploy asks you to describe your intent once, then writes the configuration files for you - and refuses to deploy unless the environment is safe enough to proceed.
This post covers the complete journey of building SwiftDeploy across two stages. Stage A established the foundation: a declarative CLI that generates Docker Compose and Nginx configuration from a single manifest, manages the container lifecycle, and supports stable/canary promotion. Stage B added the intelligence layer: Prometheus metrics, an Open Policy Agent sidecar that gates every deployment and promotion, a live status dashboard, and an append-only audit trail.
A reader who follows this post from beginning to end should be able to replicate everything.
## The Problem SwiftDeploy Solves
Consider a typical deployment workflow without tooling:
- Write a `docker-compose.yml` by hand
- Write an `nginx.conf` by hand
- Manually update both files every time a port, timeout, or service name changes
- Hope you didn't introduce drift between the two files
- Have no policy gate before deployment
- Have no audit trail of what changed and when
SwiftDeploy inverts this. You edit exactly one file - `manifest.yaml`. The CLI derives everything else from it. If you delete the generated files and run `swiftdeploy init`, they come back identically. The manifest is the single source of truth, and the tool enforces that guarantee mechanically.
## Architecture
Here is the complete system after both stages:
```text
manifest.yaml            ← the only file you edit manually
      │
      ▼
swiftdeploy CLI
      │
      ├── OPA policy check (pre-deploy / pre-promote)
      │        │
      │        ▼
      │   policies/*.rego
      │   + policy_limits from manifest
      │
      ▼
templates/
  ├── docker-compose.yml.tpl
  └── nginx.conf.tpl
      │
      ▼
┌──────────────────────────────────────────┐
│              Docker Network              │
│                                          │
│  ┌──────────────┐    ┌─────────────────┐ │
│  │     App      │    │      Nginx      │ │
│  │    :3000     │◄───│ :8080 (public)  │ │
│  │   /metrics   │    └─────────────────┘ │
│  │   /healthz   │                        │
│  │    /chaos    │    ┌─────────────────┐ │
│  └──────────────┘    │       OPA       │ │
│                      │ :8181 (loopback)│ │
│                      └─────────────────┘ │
└──────────────────────────────────────────┘
      │
      ▼
history.jsonl            ← append-only audit trail
      │
      ▼
audit_report.md          ← generated on demand
```
Key isolation rules:
- The app is never exposed directly to the host - all traffic goes through Nginx
- OPA is bound to `127.0.0.1:8181` only - not reachable via the Nginx port or from the internet
- The CLI is the only component that talks to OPA
- The app has no knowledge of policy decisions
## Stage A: The Engine

### The Manifest

Everything starts with `manifest.yaml`. This file describes the entire deployment intent:
```yaml
services:
  image: swift-deploy-1-node:latest
  port: 3000
  mode: stable
  version: "1.0.0"
  restart_policy: unless-stopped

nginx:
  image: nginx:latest
  port: 8080
  proxy_timeout: 10
  contact: o.odimayo@gbadedata.com

opa:
  image: openpolicyagent/opa:latest
  port: 8181
  policies_dir: policies
  decision_timeout_seconds: 5

network:
  name: swiftdeploy-net
  driver_type: bridge

logs:
  volume_name: swiftdeploy-logs

policy_limits:
  infrastructure:
    min_disk_free_gb: 10
    max_cpu_load: 2.0
  canary:
    max_error_rate: 0.01
    max_p99_latency_ms: 500

audit:
  history_file: history.jsonl
  report_file: audit_report.md
```
Every value that the generated configuration needs comes from here. No hardcoded values exist in any template or policy file.
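The post doesn't show the loader itself, but it is only a few lines. A minimal sketch, assuming PyYAML and the `load_manifest` name the later snippets use:

```python
import yaml
from pathlib import Path

def load_manifest(path=Path("manifest.yaml")):
    # utf-8-sig transparently strips a BOM if one sneaks in on Windows
    # (see lesson 7 at the end of this post)
    with path.open("r", encoding="utf-8-sig") as f:
        return yaml.safe_load(f)
```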
### Template-based Config Generation

The CLI uses Jinja2 to render `docker-compose.yml` and `nginx.conf` from templates. This is how `swiftdeploy init` works:
```python
from jinja2 import Environment, FileSystemLoader

def render_template(template_name, output_path, manifest):
    env = Environment(
        # utf-8-sig strips a leading BOM from templates saved on Windows
        loader=FileSystemLoader(TEMPLATE_DIR, encoding="utf-8-sig"),
        autoescape=False,
        trim_blocks=True,
        lstrip_blocks=True,
    )
    template = env.get_template(template_name)
    # strip any stray BOM, then write raw bytes so no new one is introduced
    rendered = template.render(**manifest).lstrip("\ufeff")
    output_path.write_bytes(rendered.encode("utf-8"))
```
The template for the app service in `docker-compose.yml.tpl` looks like this:
```yaml
services:
  app:
    image: {{ services.image }}
    container_name: swiftdeploy-app
    restart: {{ services.restart_policy }}
    user: "10001:10001"
    cap_drop:
      - ALL
    security_opt:
      - no-new-privileges:true
    environment:
      MODE: "{{ services.mode }}"
      APP_VERSION: "{{ services.version }}"
      APP_PORT: "{{ services.port }}"
    expose:
      - "{{ services.port }}"
```
Notice `expose` instead of `ports`. This is deliberate - the app is reachable inside the Docker network but not from the host machine. All external traffic must go through Nginx.
### The Five Validation Checks
Before any deployment, the CLI runs five pre-flight checks:
```text
[PASS] manifest.yaml exists and is valid YAML
[PASS] All required fields are present and non-empty
[PASS] Docker image exists locally: swift-deploy-1-node:latest
[PASS] Nginx port is free on host: 8080
[PASS] Generated nginx.conf is syntactically valid
```
The Nginx syntax check is particularly interesting - it runs `nginx -t` inside a temporary container with a host mapping for the `app` upstream name, so name resolution works without the full stack being active:
```python
command = [
    "docker", "run", "--rm",
    "--add-host", "app:127.0.0.1",
    "-v", f"{NGINX_FILE}:/etc/nginx/nginx.conf:ro",
    nginx_image, "nginx", "-t",
]
```
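The post's `run_command` helper isn't shown; one plausible way to execute the check and surface the result, assuming plain `subprocess`:

```python
import subprocess

# nginx -t exits non-zero when the rendered config is invalid
result = subprocess.run(command, capture_output=True, text=True)
if result.returncode != 0:
    print("[FAIL] Generated nginx.conf failed syntax check:")
    print(result.stderr.strip())
```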
### The Application Service
The app is a Python Flask service that exposes four endpoints:
- `GET /` - welcome message with mode, version, and timestamp
- `GET /healthz` - liveness check with uptime in seconds
- `GET /metrics` - Prometheus-format metrics (added in Stage B)
- `POST /chaos` - fault injection, canary mode only (a hedged sketch follows)
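The post doesn't reproduce the chaos handler, but its behaviour is described well enough to sketch one. This is a guess at the shape, not the actual implementation - `chaos_state` and the 403 guard are assumptions:

```python
import os
from flask import abort, jsonify, request

chaos_state = {"mode": "none", "rate": 0.0}

@app.post("/chaos")
def set_chaos():
    if os.environ.get("MODE") != "canary":
        abort(403)  # fault injection is canary-only
    body = request.get_json(force=True)
    chaos_state["mode"] = body.get("mode", "none")      # e.g. "slow" or "error"
    chaos_state["rate"] = float(body.get("rate", 0.0))  # fraction of requests affected
    return jsonify(chaos_state)
```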
### Stable and Canary Promotion
Promotion updates the manifest in-place, regenerates the Compose file, and restarts only the app container - Nginx is untouched:
```python
def cmd_promote(args):
    target_mode = args.mode.lower()
    manifest = load_manifest()
    # policy check runs here for `promote stable` (Stage B)
    manifest["services"]["mode"] = target_mode
    write_manifest(manifest)
    render_template("docker-compose.yml.tpl", COMPOSE_FILE, manifest)
    run_command([
        "docker", "compose", "up", "-d",
        "--no-deps", "--force-recreate", "app",
    ])
```
The same image runs in both modes. The difference is the `MODE` environment variable injected by the generated Compose file - no rebuild required.
### Nginx: Structured Logging and JSON Errors
The generated nginx.conf enforces three important behaviours.
**Required access log format:**

```nginx
log_format swiftdeploy '$time_iso8601 | $status | ${request_time}s | $upstream_addr | $request';
```

**Example output:**

```text
2026-05-06T23:04:50+00:00 | 200 | 0.001s | 172.18.0.2:3000 | GET / HTTP/1.1
```
**JSON error responses for upstream failures:**

```nginx
error_page 502 = @error502;

location @error502 {
    return 502 '{"error":"bad gateway","code":502,"service":"swiftdeploy","contact":"o.odimayo@gbadedata.com"}';
}
```
**Platform headers on every response:**

```nginx
add_header X-Deployed-By swiftdeploy always;
proxy_pass_header X-Mode;
```
### Security Hardening

Every container runs with least privilege:

```yaml
user: "10001:10001"   # app
cap_drop:
  - ALL
security_opt:
  - no-new-privileges:true
```
Images are built from `python:3.12-slim` and verified to stay under 300 MB. No secrets are baked into any image - all configuration is injected at runtime via environment variables.
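You can reproduce the size check yourself with a one-liner:

```bash
docker image ls swift-deploy-1-node:latest --format "{{.Size}}"
```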
## Stage B: The Intelligence Layer

### Part 1: Instrumentation

The first requirement was a `/metrics` endpoint in Prometheus text format. I used the `prometheus_client` library and Flask's request hooks to instrument every endpoint automatically:
```python
from prometheus_client import Counter, Gauge, Histogram

HTTP_REQUESTS_TOTAL = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "path", "status_code"],
)

HTTP_REQUEST_DURATION_SECONDS = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    ["method", "path"],
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2, 5),
)

APP_MODE = Gauge("app_mode", "Application mode: 0=stable, 1=canary")
CHAOS_ACTIVE = Gauge("chaos_active", "Chaos state: 0=none, 1=slow, 2=error")
APP_UPTIME_SECONDS = Gauge("app_uptime_seconds", "Application uptime in seconds")
```
The `after_request` hook records every request automatically:
```python
# the matching before_request hook (implied by the getattr fallback below)
# stamps the start time on Flask's per-request `g` object
@app.before_request
def before_request():
    g.request_started_at = time.monotonic()

@app.after_request
def after_request(response):
    duration = time.monotonic() - getattr(g, "request_started_at", time.monotonic())
    HTTP_REQUESTS_TOTAL.labels(
        method=request.method,
        path=request.path,
        status_code=str(response.status_code),
    ).inc()
    HTTP_REQUEST_DURATION_SECONDS.labels(
        method=request.method,
        path=request.path,
    ).observe(duration)
    return response
```
The `/metrics` endpoint itself is just two lines:
```python
@app.get("/metrics")
def metrics():
    update_runtime_metrics()
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)
```
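What a scrape actually returns is plain text in the Prometheus exposition format - an illustrative excerpt (values made up):

```text
http_requests_total{method="GET",path="/",status_code="200"} 42.0
http_request_duration_seconds_bucket{le="0.05",method="GET",path="/"} 41.0
http_request_duration_seconds_bucket{le="+Inf",method="GET",path="/"} 42.0
app_mode 1.0
chaos_active 0.0
app_uptime_seconds 128.42
```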
### Part 2: The Policy Engine
This was the most architecturally significant part of Stage B. The requirement was absolute: the CLI must not make any allow/deny decision itself. All decision logic lives exclusively in OPA.
#### Why this matters

If the CLI embeds policy logic in Python, changing a threshold requires deploying new CLI code. When OPA owns the decisions, changing a threshold means editing one line in `manifest.yaml`. The operational difference is enormous - and the auditing story is much cleaner.
#### OPA in the Docker Compose template
OPA runs as a fourth service in the generated stack:
```yaml
opa:
  image: {{ opa.image }}
  container_name: swiftdeploy-opa
  command:
    - "run"
    - "--server"
    - "--addr=0.0.0.0:{{ opa.port }}"
    - "/policies"
  ports:
    - "127.0.0.1:{{ opa.port }}:{{ opa.port }}"
  volumes:
    - ./{{ opa.policies_dir }}:/policies:ro
  cap_drop:
    - ALL
  security_opt:
    - no-new-privileges:true
```
The critical detail: `127.0.0.1:8181:8181` binds OPA only to the host loopback interface. It is reachable by the CLI but not via the Nginx port. The OPA API cannot be queried or probed from the internet.
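You can verify the binding from the host, which is the only place OPA listens. A manual decision query against OPA's standard Data API looks like this (payload values illustrative):

```bash
curl -s http://127.0.0.1:8181/v1/data/swiftdeploy/infrastructure/decision \
  -H "Content-Type: application/json" \
  -d '{"input":{"context":"pre_deploy","stats":{"disk_free_gb":42,"cpu_load":0.5},"limits":{"min_disk_free_gb":10,"max_cpu_load":2.0}}}'
```

The same request sent to the public port 8080 is proxied to the app, which has no such route - the policy API is simply unreachable from outside.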
#### Policy organisation
Each domain owns exactly one question and one set of input data. A change to the canary policy never touches the infrastructure policy.
**Infrastructure policy** (`policies/infrastructure.rego`) - answers: *is this host safe to deploy on?*
```rego
package swiftdeploy.infrastructure

default allow := false

disk_ok if {
    input.stats.disk_free_gb >= input.limits.min_disk_free_gb
}

cpu_ok if {
    input.stats.cpu_load <= input.limits.max_cpu_load
}

allow if {
    input.context == "pre_deploy"
    disk_ok
    cpu_ok
}

reasons contains msg if {
    not disk_ok
    msg := sprintf("Disk free %vGB is below required minimum %vGB",
        [input.stats.disk_free_gb, input.limits.min_disk_free_gb])
}

reasons contains msg if {
    not cpu_ok
    msg := sprintf("CPU load %.2f exceeds allowed maximum %.2f",
        [input.stats.cpu_load, input.limits.max_cpu_load])
}

decision := {
    "domain": "infrastructure",
    "question": "pre_deploy",
    "allow": allow,
    "reasons": reasons,
}
```
**Canary safety policy** (`policies/canary.rego`) - answers: *is the canary healthy enough to promote to stable?*
```rego
package swiftdeploy.canary

default allow := false

error_rate_ok if {
    input.context == "pre_promote"
    input.metrics.error_rate <= input.limits.max_error_rate
}

p99_latency_ok if {
    input.context == "pre_promote"
    input.metrics.p99_latency_ms <= input.limits.max_p99_latency_ms
}

allow if {
    input.context == "pre_promote"
    error_rate_ok
    p99_latency_ok
}

# `reasons` rules (elided here) mirror the infrastructure policy,
# emitting one message per failed check

decision := {
    "domain": "canary",
    "question": "pre_promote",
    "allow": allow,
    "reasons": reasons,
}
```
#### Thresholds in the manifest, not in Rego

Neither policy file contains a hardcoded number. The limits come from `input.limits`, which the CLI sends as part of the JSON payload. The actual values live in `manifest.yaml` under `policy_limits`. Tune thresholds by editing the manifest - no Rego files need to be touched.
#### The pre-deploy gate
Before starting the stack, the CLI collects host stats and queries OPA:
```python
def pre_deploy_policy_check(manifest):
    stats = get_host_stats()
    limits = manifest["policy_limits"]["infrastructure"]
    payload = {
        "input": {
            "context": "pre_deploy",
            "stats": stats,
            "limits": limits,
        }
    }
    decision = call_opa(manifest, "swiftdeploy/infrastructure/decision", payload)
    print_policy_decision(decision)
    if not decision.get("allow"):
        append_history(manifest, "policy_violation", {
            "domain": decision.get("domain"),
            "reasons": decision.get("reasons", []),
            "stats": stats,
        })
        return False
    return True
```
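`get_host_stats` isn't shown in the post; a minimal cross-platform sketch using only the standard library, with key names matching what the policy expects:

```python
import os
import shutil

def get_host_stats():
    disk = shutil.disk_usage("/")
    # os.getloadavg is Unix-only; fall back to 0.0 on Windows
    load = os.getloadavg()[0] if hasattr(os, "getloadavg") else 0.0
    return {
        "disk_free_gb": round(disk.free / 1024**3, 1),
        "cpu_load": round(load, 2),
    }
```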
To prove the hard gate, I temporarily set `min_disk_free_gb: 99999` and ran `swiftdeploy deploy`:
```text
[POLICY][FAIL] infrastructure.pre_deploy
  - Disk free 2GB is below required minimum 99999GB
[FAIL] Deployment blocked by policy.
```
The stack never starts. The block is absolute.
#### The pre-promote gate

Before promoting from canary to stable, the CLI scrapes `/metrics`, calculates the current error rate and P99 latency, and queries OPA:
```python
def pre_promote_policy_check(manifest):
    metrics_text = scrape_metrics(manifest)
    samples = parse_prometheus_metrics(metrics_text)
    observed = calculate_observed_metrics(samples)
    limits = manifest["policy_limits"]["canary"]
    payload = {
        "input": {
            "context": "pre_promote",
            "metrics": {
                "error_rate": observed["error_rate"],
                "p99_latency_ms": observed["p99_latency_ms"],
            },
            "limits": limits,
        }
    }
    decision = call_opa(manifest, "swiftdeploy/canary/decision", payload)
    print_policy_decision(decision)
    return bool(decision.get("allow"))
```
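`calculate_observed_metrics` isn't reproduced in the post. A hedged sketch of the error-rate half, using the text parser that ships with `prometheus_client`:

```python
from prometheus_client.parser import text_string_to_metric_families

def calculate_error_rate(metrics_text):
    total = errors = 0.0
    for family in text_string_to_metric_families(metrics_text):
        for sample in family.samples:
            if sample.name != "http_requests_total":
                continue
            if sample.labels.get("path") in ("/healthz", "/metrics"):
                continue  # probe endpoints must not skew the rate
            total += sample.value
            if sample.labels.get("status_code", "").startswith("5"):
                errors += sample.value
    return errors / total if total else 0.0
```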
#### OPA failure handling
The CLI handles every distinct failure mode explicitly:
| Failure | Message |
|---|---|
| Connection refused | OPA unavailable at http://127.0.0.1:8181 |
| Request timed out | OPA decision timed out after 5s |
| Non-200 response | OPA returned HTTP 503 |
| Non-JSON response | OPA returned non-JSON response |
| Missing result | OPA response did not include a decision result |
In every case the operation is blocked and the event is recorded in the audit trail.
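The `call_opa` wrapper isn't shown either; the failure table above implies a shape like this (a sketch, not the actual code):

```python
import requests

def fail(message):
    print(f"[FAIL] {message}")
    return {}  # empty decision: .get("allow") is falsy, so the operation is blocked

def call_opa(manifest, decision_path, payload):
    port = manifest["opa"]["port"]
    timeout = manifest["opa"]["decision_timeout_seconds"]
    url = f"http://127.0.0.1:{port}/v1/data/{decision_path}"
    try:
        response = requests.post(url, json=payload, timeout=timeout)
    except requests.exceptions.Timeout:
        return fail(f"OPA decision timed out after {timeout}s")
    except requests.exceptions.ConnectionError:
        return fail(f"OPA unavailable at http://127.0.0.1:{port}")
    if response.status_code != 200:
        return fail(f"OPA returned HTTP {response.status_code}")
    try:
        body = response.json()
    except ValueError:
        return fail("OPA returned non-JSON response")
    if "result" not in body:
        return fail("OPA response did not include a decision result")
    return body["result"]
```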
### Part 3: The Chaos - What Happened When I Injected Failures
With the canary safety gate in place, I needed to prove it actually blocks unsafe promotions.
**Deploy stable, promote to canary:**

```text
[PASS] Health check passed: mode=stable, version=1.0.0
[PASS] Promotion confirmed through /healthz: mode=canary
```
**Inject 50% error chaos:**

```bash
curl -X POST http://127.0.0.1:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode":"error","rate":0.5}'
```
**Generate traffic to build up the error rate:**

```text
500 200 500 200 500 500 200 500 200 200
```
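The post doesn't show the traffic generator; any loop against the public port works, for example:

```bash
for i in $(seq 1 20); do
  curl -s -o /dev/null -w "%{http_code} " http://127.0.0.1:8080/
done; echo
```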
Run the status dashboard - this is where it gets interesting:
```text
SwiftDeploy Status
==================
Timestamp: 2026-05-06T23:12:26.328363+00:00
Mode: canary
Chaos: error
Req/s: 0.00
Error rate: 38.10%
P99 latency: 5.00ms
Uptime: 128.42s

Policy Compliance
-----------------
[PASS] infrastructure.pre_deploy
       - Infrastructure policy passed
[FAIL] canary.pre_promote
       - Error rate 0.380952 exceeds allowed maximum 0.01
```
The status dashboard scrapes `/metrics`, calculates error rate from raw Prometheus counters, and queries both OPA domains independently on every interval. The infrastructure policy passes (the host is healthy) while the canary policy fails (the service is broken).
**Attempt promotion - blocked:**
```text
[POLICY][FAIL] canary.pre_promote
  - Error rate 0.380952 exceeds allowed maximum 0.01
[FAIL] Promotion blocked by policy.
```
This is the safety guarantee the system is designed to provide. A broken canary cannot accidentally become the stable deployment.
**Recover, generate clean traffic, promote successfully:**

```text
[POLICY][PASS] canary.pre_promote
  - Canary safety policy passed
[PASS] Promotion confirmed through /healthz: mode=stable
```
### Part 4: The Status Dashboard

`swiftdeploy status` is a live-refreshing terminal dashboard. It runs continuously, scraping `/metrics` on every interval and appending every result to `history.jsonl`.
The interesting engineering challenge was calculating P99 latency from raw Prometheus histogram buckets without a Prometheus server. Prometheus histograms store cumulative bucket counts - to find P99, you find the smallest bucket whose cumulative count reaches 99% of total requests:
```python
def calculate_p99_from_buckets(bucket_totals):
    total_count = bucket_totals.get("+Inf", 0.0)
    if total_count == 0:
        return 0.0
    target = total_count * 0.99
    numeric_buckets = sorted(
        [(float(le), count)
         for le, count in bucket_totals.items()
         if le != "+Inf"],
        key=lambda x: x[0],
    )
    for upper_bound, count in numeric_buckets:
        if count >= target:
            return upper_bound * 1000  # convert seconds to milliseconds
    return numeric_buckets[-1][0] * 1000
```
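As a worked example: with cumulative buckets `{0.005: 90, 0.01: 99, "+Inf": 100}`, the target is `100 × 0.99 = 99`; the first bucket whose count reaches 99 is `le=0.01`, so the function reports a P99 of 10ms. The answer is quantised to bucket boundaries - an inherent property of histogram-based percentiles.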
Health check and metrics paths are excluded from all calculations to avoid skewing the error rate and latency numbers.
### Part 5: The Audit Trail

Every significant event is written to `history.jsonl` as a JSON line:
{"timestamp": "2026-05-06T23:40:04+00:00", "event_type": "deploy", "data": {"mode": "stable", "version": "1.0.0"}}
{"timestamp": "2026-05-06T23:42:53+00:00", "event_type": "policy_violation", "data": {"domain": "canary", "reasons": ["Error rate 0.380952 exceeds allowed maximum 0.01"]}}
{"timestamp": "2026-05-06T23:45:32+00:00", "event_type": "mode_change", "data": {"mode": "stable"}}
Running `swiftdeploy audit` generates `audit_report.md` - a GitHub Flavored Markdown report with a deployment timeline and violations table that renders directly on GitHub.
## Complete CLI Reference

| Command | What it does |
|---|---|
| `swiftdeploy init` | Parse manifest, generate `docker-compose.yml` and `nginx.conf` |
| `swiftdeploy validate` | Run 5 pre-flight checks, exit non-zero on any failure |
| `swiftdeploy deploy` | init → validate → OPA infra check → compose up → health gate |
| `swiftdeploy promote canary` | Update manifest, regenerate Compose, restart app only |
| `swiftdeploy promote stable` | OPA canary check → update manifest → regenerate → restart app |
| `swiftdeploy status` | Live metrics + policy compliance dashboard, appends to history |
| `swiftdeploy status --once` | Single scrape and exit |
| `swiftdeploy audit` | Parse `history.jsonl`, generate `audit_report.md` |
| `swiftdeploy teardown` | Remove containers, networks, and volumes |
| `swiftdeploy teardown --clean` | Teardown + delete generated config files |
## Lessons Learned

### 1. The manifest discipline pays off at every stage
Every time I was tempted to hardcode a value - a port, a timeout, a threshold - I put it in the manifest instead. This cost five extra minutes each time and saved hours. Any value can be changed in the manifest and the entire system adapts without touching any other file.
### 2. Separation of concerns is a survival strategy, not just a principle

When the CLI makes policy decisions in Python, changing a threshold means deploying new CLI code. When OPA owns the decisions, changing a threshold means editing one line in `manifest.yaml`. The operational difference is enormous. More importantly, the policy files can be reviewed, versioned, and audited independently of the tool that enforces them.
### 3. Every failure mode deserves its own message
The first version of the OPA client raised a generic exception on any failure. Connection refused, timeout, bad JSON, missing result - all looked the same. Adding distinct handling for each case costs twenty lines of code and saves hours of debugging in production.
### 4. Prometheus counters persist for the process lifetime

After recovering from error chaos and generating clean traffic, `promote stable` was still blocked. The reason: Prometheus counters never reset. The old error counts were still there from before the recovery. Restarting the container resets them because it starts a fresh process. In production, you would use a time-windowed approach with a proper Prometheus server and PromQL range queries.
### 5. OPA isolation is a hard security requirement
If OPA were reachable via the Nginx port, anyone could query your policy engine, discover your exact thresholds, and craft traffic that stays just below detection limits. Binding OPA to the loopback interface and keeping it off the public network is not a preference - it is the minimum viable security posture for a policy engine.
### 6. Generated files should never be committed

Committing `docker-compose.yml` and `nginx.conf` to Git creates a false source of truth. Developers start editing the generated file instead of the manifest, the template drifts from reality, and the tool becomes meaningless. Keeping generated files in `.gitignore` enforces the discipline mechanically.
### 7. The Windows BOM problem is real

On Windows, writing YAML with Python's `yaml.safe_dump` can produce a file with a UTF-8 BOM prefix. When that file is later read by the Jinja template loader, the BOM gets rendered into the first key name, producing `"\xEF\xBB\xBFservices"` instead of `services`. The fix is to always write files using `write_bytes(content.encode("utf-8"))` rather than `write_text(..., encoding="utf-8")`. The difference is subtle and the debugging is painful.
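A quick way to check whether a generated file has picked up a BOM:

```python
from pathlib import Path

raw = Path("manifest.yaml").read_bytes()
if raw.startswith(b"\xef\xbb\xbf"):
    print("UTF-8 BOM present - this will corrupt the first key name")
```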
Running It Yourself
# Clone the repository
git clone https://github.com/gbadedata/swiftdeploy.git
cd swiftdeploy
# Set up Python environment
python -m venv .venv
source .venv/bin/activate # Linux/macOS
# .\.venv\Scripts\Activate.ps1 # Windows PowerShell
pip install pyyaml jinja2 requests
# Build the app image
docker build -t swift-deploy-1-node:latest .
# Full lifecycle
python ./swiftdeploy deploy
python ./swiftdeploy promote canary
python ./swiftdeploy status --once
python ./swiftdeploy promote stable
python ./swiftdeploy audit
# Clean up
python ./swiftdeploy teardown --clean
The full source code, Rego policies, Jinja templates, and screenshots are at:
https://github.com/gbadedata/swiftdeploy
## Final Thought
The most important insight from this project is that a deployment tool should be more than a script that runs `docker compose up`. It should be a control plane - one that enforces standards before acting, surfaces observable state while running, and leaves an auditable record of every decision it makes.
SwiftDeploy is deliberately local-first and small in scope. But the patterns it demonstrates - manifest-driven generation, policy-gated lifecycle, metrics-based promotion gates, and append-only audit trails - are the same patterns that underpin tools like Argo CD, Flux, and every serious production deployment system.
The small version teaches you the patterns. The patterns scale to any size.
Source code: github.com/gbadedata/swiftdeploy