anitaalicloud
How I Built SwiftDeploy: A Tool That Writes Its Own Infrastructure

A deep dive into declarative deployments, OPA policy gates, and chaos engineering from Stage 4A to 4B


Introduction

Most DevOps tasks ask you to configure infrastructure manually. This one asked me to build the tool that does it for me.

The result is SwiftDeploy, a CLI tool that reads a single manifest.yaml file and generates your entire deployment stack from it. Nginx configs, Docker Compose files, policy checks, and live metrics dashboards are all derived from one source of truth.

This post covers the full journey: the design decisions, the guardrails, the chaos, and the lessons learned.


The Architecture

Here is how all the pieces connect:

┌─────────────────────────────────────────────────────┐
│                    manifest.yaml                     │
│              (single source of truth)                │
└──────────────────────┬──────────────────────────────┘
                       │
                       ▼
              ./swiftdeploy init
                       │
          ┌────────────┴────────────┐
          ▼                         ▼
     nginx.conf              docker-compose.yml
   (generated)                 (generated)
          │                         │
          ▼                         ▼
┌─────────────────────────────────────────────────────┐
│                   Docker Stack                       │
│                                                      │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐      │
│   │  Nginx   │───▶│   App    │    │   OPA    │      │
│   │  :8080   │    │  :3000   │    │  :8181   │      │
│   └──────────┘    └──────────┘    └──────────┘      │
│   (public)        (internal)      (internal)         │
└─────────────────────────────────────────────────────┘
          │                         ▲
          ▼                         │
     curl :8080              CLI queries OPA
    (your browser)          before deploy/promote

The key insight: you only ever touch manifest.yaml. The tool handles everything else.


Part 1 — The Design: A Tool That Writes Its Own Files

The Problem with Handwritten Config

When you write nginx.conf and docker-compose.yml by hand, you introduce drift. Change a port in one place and forget to update it in another. After a few weeks, nobody knows which file is the source of truth.

SwiftDeploy solves this with a three-layer system:

manifest.yaml          →    templates/*.tmpl    →    generated files
(VALUES)                    (STRUCTURE)              (VALUES + STRUCTURE)

The manifest.yaml holds all the values — ports, image names, modes, timeouts. The templates hold the structure — how nginx.conf and docker-compose.yml should look. The CLI combines them at runtime.
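Keyed to the replacements map the CLI builds, a manifest might look like this (a hypothetical sketch; field names beyond nginx.port and services.port are assumptions):

```yaml
# manifest.yaml (illustrative sketch; only nginx.port and services.port
# are confirmed by the CLI code, the rest is assumed structure)
nginx:
  port: 8080            # the only port mapped to the host
services:
  port: 3000            # internal app port, never exposed directly
  image: swiftdeploy-app:latest
  mode: stable          # stable | canary
```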

How swiftdeploy init Works

import yaml

# Read manifest into a Python dict
with open("manifest.yaml") as f:
    m = yaml.safe_load(f)

# Build a replacements map
replacements = {
    "{{NGINX_PORT}}":   str(m["nginx"]["port"]),
    "{{SERVICE_PORT}}": str(m["services"]["port"]),
    # ... etc
}

# Read template, replace placeholders, write output
with open("templates/nginx.conf.tmpl") as f:
    content = f.read()

for placeholder, value in replacements.items():
    content = content.replace(placeholder, value)

with open("nginx.conf", "w") as f:
    f.write(content)

Simple string replacement. No Jinja2, no templating engine — just Python's built-in str.replace(). The grader can delete nginx.conf and docker-compose.yml, run ./swiftdeploy init, and they regenerate perfectly every time.

The API Service

The API is a Python HTTP server using only the standard library — no Flask, no FastAPI. This keeps the Docker image under 60MB (well under the 300MB limit).

It runs in two modes controlled by a MODE environment variable injected by Docker Compose:

stable mode  →  normal behaviour
canary mode  →  adds X-Mode: canary header + activates /chaos endpoint

The same image runs both modes. The only difference is the environment variable.
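In the app, this can be as small as a single environment lookup; a minimal sketch (the helper name is illustrative, not the tool's actual code):

```python
import os

def mode_headers(mode: str) -> dict:
    """Extra response headers per mode; canary announces itself via X-Mode."""
    return {"X-Mode": "canary"} if mode == "canary" else {}

# Docker Compose injects MODE; default to stable when it is unset.
MODE = os.environ.get("MODE", "stable")
```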

The Nginx Reverse Proxy

Nginx sits in front of the app and adds:

  • An X-Deployed-By: swiftdeploy header on every response
  • JSON error bodies on 502/503/504 (instead of ugly HTML)
  • Structured access logs in the required format
  • The X-Mode header forwarded from the upstream app

Critically, the app port is never exposed directly. Only Nginx's port is mapped to the host. All traffic must flow through it.
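A sketch of what the generated nginx.conf might contain (the layout is an assumption, and for brevity it collapses all three gateway errors to one 502 JSON body; the real file is rendered from the template):

```nginx
server {
    listen 8080;

    location / {
        proxy_pass http://app:3000;

        # Stamp every response, including error pages
        add_header X-Deployed-By swiftdeploy always;
        # Upstream response headers such as X-Mode pass through by default
    }

    # JSON bodies instead of nginx's default HTML error pages
    error_page 502 503 504 /error.json;
    location = /error.json {
        internal;
        default_type application/json;
        return 502 '{"error": "upstream unavailable"}';
    }
}
```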


Part 2 — The Guardrails: OPA Policy Enforcement

Why OPA?

The task required that the CLI never make allow/deny decisions itself. All logic must live in OPA (Open Policy Agent).

This matters because it separates concerns cleanly:

CLI  →  collects data, calls OPA, surfaces the result
OPA  →  owns all decision logic, never called by the app

If you want to change a policy, you edit a .rego file. You never touch the CLI. If you want to change a threshold, you edit data.json. You never touch the Rego files.

Policy Structure

Each policy domain owns exactly one question:

Infrastructure policy: Is the host healthy enough to deploy?

package infrastructure

import rego.v1

default allow := false

allow if {
    count(violations) == 0
}

violations contains msg if {
    input.disk_free_gb < data.infrastructure.min_disk_free_gb
    msg := sprintf(
        "Disk free (%.1fGB) is below minimum threshold (%.1fGB)",
        [input.disk_free_gb, data.infrastructure.min_disk_free_gb]
    )
}

violations contains msg if {
    input.cpu_load > data.infrastructure.max_cpu_load
    msg := sprintf(
        "CPU load (%.2f) exceeds maximum threshold (%.2f)",
        [input.cpu_load, data.infrastructure.max_cpu_load]
    )
}

Canary safety policy: Is the canary healthy enough to promote?

package canary

import rego.v1

default allow := false

allow if {
    count(violations) == 0
}

violations contains msg if {
    input.error_rate > data.canary.max_error_rate
    msg := sprintf(
        "Error rate (%.2f%%) exceeds maximum threshold (%.2f%%)",
        [input.error_rate * 100, data.canary.max_error_rate * 100]
    )
}

Threshold values live in data.json — never hardcoded in Rego:

{
  "infrastructure": {
    "min_disk_free_gb": 10.0,
    "max_cpu_load": 16.0,
    "min_mem_free_percent": 10.0
  },
  "canary": {
    "max_error_rate": 0.01,
    "max_p99_latency_ms": 500
  }
}

The Hard Gate in Action

When the CPU load exceeded the threshold, swiftdeploy deploy was blocked:

[deploy] Running OPA pre-deploy policy checks...
  Host stats: disk=80.08GB free, cpu_load=12.88, mem_free=50.0%
[policy] Checking Infrastructure...
  [BLOCK] Infrastructure policy FAILED:
    x CPU load (12.88) exceeds maximum threshold (2.00)
[deploy] BLOCKED by policy. Fix violations above before deploying.

The deploy never started. The CLI surfaced the exact violation reason from OPA — no guessing required.
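The CLI side of this gate needs nothing beyond the standard library. A sketch of the query against OPA's Data API (helper names are hypothetical; the real tool wraps this with the failure handling shown later in the post):

```python
import json
import urllib.request

OPA_URL = "http://localhost:8181/v1/data"  # OPA's Data API inside the stack


def parse_decision(body: dict) -> dict:
    """Turn OPA's response envelope into an allow/deny plus reasons."""
    result = body.get("result", {})
    return {
        "allowed": result.get("allow", False),
        "violations": sorted(result.get("violations", [])),
    }


def check_policy(package: str, payload: dict) -> dict:
    """POST the input document to OPA and parse the decision."""
    req = urllib.request.Request(
        f"{OPA_URL}/{package}",
        data=json.dumps({"input": payload}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return parse_decision(json.load(resp))
```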

OPA Isolation

OPA is intentionally isolated from public Nginx ingress. It runs on port 8181 inside the Docker network. It is NOT behind Nginx, and its port is only accessible to the CLI running on the host. A user hitting localhost:8080 cannot reach OPA.

Failure Handling

The CLI gives each OPA failure mode its own handling:

# Inside the CLI's policy-check helper; _post_to_opa is the HTTP call
# (name illustrative)
try:
    return _post_to_opa(payload)
except urllib.error.URLError as e:
    # OPA unreachable — warn but don't crash
    return {"allowed": False, "violations": [],
            "error": f"OPA unreachable: {e.reason}"}
except json.JSONDecodeError:
    # OPA returned garbage — different message
    return {"allowed": False, "violations": [],
            "error": "OPA returned invalid JSON"}
except Exception as e:
    # Catch-all — still doesn't crash
    return {"allowed": False, "violations": [],
            "error": f"Unexpected OPA error: {e}"}

The CLI never crashes or hangs when OPA is unavailable. It warns the operator and continues.


Part 3 — The Chaos: Breaking Things on Purpose

The /metrics Endpoint

The API exposes a /metrics endpoint in Prometheus text format:

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/",status_code="200"} 42

# HELP http_request_duration_seconds Request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{method="GET",path="/",le="0.005"} 38

# HELP app_mode Current deployment mode (0=stable, 1=canary)
# TYPE app_mode gauge
app_mode 1

# HELP chaos_active Current chaos state (0=none, 1=slow, 2=error)
# TYPE chaos_active gauge
chaos_active 0

No third-party libraries — pure Python calculating histogram buckets manually.
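Computing the cumulative buckets by hand takes only a few lines. A sketch of the idea (bucket boundaries and the function name are assumptions):

```python
def histogram_lines(durations, buckets=(0.005, 0.01, 0.05, 0.1, 0.5, 1.0)):
    """Render cumulative Prometheus histogram buckets without a client library."""
    lines = []
    for le in buckets:
        # Prometheus buckets are cumulative: count everything at or below `le`
        count = sum(1 for d in durations if d <= le)
        lines.append(f'http_request_duration_seconds_bucket{{le="{le}"}} {count}')
    # The +Inf bucket always equals the total observation count
    lines.append(f'http_request_duration_seconds_bucket{{le="+Inf"}} {len(durations)}')
    lines.append(f"http_request_duration_seconds_sum {sum(durations)}")
    lines.append(f"http_request_duration_seconds_count {len(durations)}")
    return lines
```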

Injecting Chaos

After promoting to canary mode, chaos was injected:

# Slow mode — every request sleeps 3 seconds
curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode": "slow", "duration": 3}'

# Error mode — 50% of requests return HTTP 500
curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode": "error", "rate": 0.5}'

The Status Dashboard Capturing the Failure

With error mode active at 50%, the status dashboard showed:

=======================================================
  SwiftDeploy Status Dashboard
  2026-05-06T14:18:43Z
=======================================================

  Mode:        CANARY
  Uptime:      892s
  Req/s:       2.40
  P99 Latency: 250ms
  Error Rate:  48.20%

  Policy Compliance:
    + Infrastructure: PASSING
    x Canary Safety: FAILING
      -> Error rate (48.20%) exceeds maximum threshold (1.00%)

The canary safety policy immediately flagged the failure. Attempting to promote to stable at this point would have been blocked by OPA.

Recovery

curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode": "recover"}'

Within one scrape cycle the dashboard showed error rate back to 0% and canary safety back to PASSING.


Part 4 — The Audit Trail

Every significant event is written to history.jsonl:

{"event": "deploy_success", "timestamp": "2026-05-06T13:53:39Z"}
{"event": "promote_success", "target": "canary", "timestamp": "2026-05-06T14:01:22Z"}
{"event": "status_scrape", "mode": "canary", "error_rate": 0.482, "timestamp": "2026-05-06T14:18:43Z"}
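Appending to the log is one line per event. A sketch of the pattern (the function name is hypothetical):

```python
import json
from datetime import datetime, timezone

def log_event(event: str, path: str = "history.jsonl", **fields) -> None:
    """Append one structured event per line; the audit command replays the file."""
    record = {
        "event": event,
        **fields,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```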

Running ./swiftdeploy audit parses this file and generates a clean Markdown report:

## Timeline
| Timestamp | Event | Details |
|---|---|---|
| 2026-05-06T13:53:39Z | Deploy | Stack deployed successfully |
| 2026-05-06T14:01:22Z | Promote | Mode switched to canary |

## Policy Violations
| Timestamp | Policy | Reason |
|---|---|---|
| 2026-05-06T14:18:43Z | Canary Safety | error_rate=48.20%, p99=250ms |
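Building the timeline table is a straightforward fold over those JSON lines; a sketch (the helper name is illustrative):

```python
import json

def timeline_table(jsonl_text: str) -> list:
    """Build Markdown timeline rows from history.jsonl contents."""
    rows = ["| Timestamp | Event | Details |", "|---|---|---|"]
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        # Everything that isn't the event name or timestamp goes in Details
        details = {k: v for k, v in rec.items() if k not in ("event", "timestamp")}
        rows.append(f"| {rec['timestamp']} | {rec['event']} | {details or ''} |")
    return rows
```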

Lessons Learned

1. Single source of truth is worth the extra complexity
It felt like overkill to build a template engine just to generate two config files. But when the grader deletes your generated files and reruns init, you're grateful every value comes from one place.

2. OPA's syntax changes between versions
The latest OPA image requires import rego.v1 and the if/contains keywords. Older Rego syntax silently fails to load. Always check your OPA container logs first.

3. Start OPA before running policy checks
OPA is part of the stack, so it doesn't exist before docker compose up. The fix was to start OPA first as a separate step, wait 3 seconds for it to load policies, then run the pre-deploy check.
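A more robust fix than a fixed sleep is to poll OPA's /health endpoint until it answers; a sketch of that alternative (the post's version simply waits 3 seconds):

```python
import time
import urllib.request
import urllib.error

def wait_for_opa(url: str = "http://localhost:8181/health", timeout: float = 30) -> bool:
    """Poll OPA's health endpoint until it responds, instead of sleeping blindly."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not up yet; retry shortly
        time.sleep(0.5)
    return False
```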

4. Chaos engineering reveals what metrics matter
Before injecting chaos, the /metrics endpoint felt like box-ticking. After watching the error rate spike to 48% in real time on the status dashboard while OPA simultaneously flagged the canary safety policy — the value became obvious.

5. Policy as code beats policy as documentation
A README saying "don't deploy if CPU load is above 2.0" gets ignored. A Rego file that blocks the deploy enforces it automatically.


The Full Subcommand Reference

./swiftdeploy init              # generate nginx.conf + docker-compose.yml
./swiftdeploy validate          # 5 pre-flight checks
./swiftdeploy deploy            # OPA check + start stack + health wait
./swiftdeploy promote canary    # switch to canary mode
./swiftdeploy promote stable    # switch back to stable
./swiftdeploy status            # live metrics + policy compliance dashboard
./swiftdeploy audit             # generate audit_report.md
./swiftdeploy teardown          # stop all containers
./swiftdeploy teardown --clean  # stop + delete generated files

Source Code

The full project is available on GitHub: https://github.com/AnitaAliCloud/hng4-devops


Built as part of the HNG DevOps Track — Stage 4A and 4B
