A deep dive into declarative deployments, OPA policy gates, and chaos engineering from Stage 4A to 4B
Introduction
Most DevOps tasks ask you to configure infrastructure manually. This one asked me to build the tool that does it for me.
The result is SwiftDeploy, a CLI tool that reads a single manifest.yaml file and generates your entire deployment stack from it. Nginx configs, Docker Compose files, policy checks, and live metrics dashboards are all derived from one source of truth.
This post covers the full journey: the design decisions, the guardrails, the chaos, and the lessons learned.
The Architecture
Here is how all the pieces connect:
┌─────────────────────────────────────────────────────┐
│                    manifest.yaml                    │
│               (single source of truth)              │
└──────────────────────┬──────────────────────────────┘
                       │
                       ▼
              ./swiftdeploy init
                       │
          ┌────────────┴────────────┐
          ▼                         ▼
     nginx.conf            docker-compose.yml
    (generated)                (generated)
          │                         │
          ▼                         ▼
┌─────────────────────────────────────────────────────┐
│                    Docker Stack                     │
│                                                     │
│  ┌──────────┐     ┌──────────┐     ┌──────────┐     │
│  │  Nginx   │───▶ │   App    │     │   OPA    │     │
│  │  :8080   │     │  :3000   │     │  :8181   │     │
│  └──────────┘     └──────────┘     └──────────┘     │
│   (public)        (internal)        (internal)      │
└─────────────────────────────────────────────────────┘
          │                              ▲
          ▼                              │
    curl :8080                   CLI queries OPA
  (your browser)              before deploy/promote
The key insight: you only ever touch manifest.yaml. The tool handles everything else.
Part 1 — The Design: A Tool That Writes Its Own Files
The Problem with Handwritten Config
When you write nginx.conf and docker-compose.yml by hand, you introduce drift. Change a port in one place and forget to update it in another. After a few weeks, nobody knows which file is the source of truth.
SwiftDeploy solves this with a three-layer system:
manifest.yaml  →  templates/*.tmpl  →  generated files
  (VALUES)          (STRUCTURE)     (VALUES + STRUCTURE)
The manifest.yaml holds all the values — ports, image names, modes, timeouts. The templates hold the structure — how nginx.conf and docker-compose.yml should look. The CLI combines them at runtime.
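For a concrete picture, a minimal manifest might look like this (nginx.port and services.port are the only keys confirmed by the code below; the image key is hypothetical, shown for shape only):

nginx:
  port: 8080               # becomes {{NGINX_PORT}} in the templates
services:
  port: 3000               # becomes {{SERVICE_PORT}} in the templates
  image: swiftdeploy-app   # hypothetical key, for illustration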
How swiftdeploy init Works
import yaml

# Read manifest into a Python dict
with open("manifest.yaml") as f:
    m = yaml.safe_load(f)

# Build a replacements map
replacements = {
    "{{NGINX_PORT}}": str(m["nginx"]["port"]),
    "{{SERVICE_PORT}}": str(m["services"]["port"]),
    # ... etc
}

# Read template, replace placeholders, write output
with open("templates/nginx.conf.tmpl") as f:
    content = f.read()
for placeholder, value in replacements.items():
    content = content.replace(placeholder, value)
with open("nginx.conf", "w") as f:
    f.write(content)
Simple string replacement. No Jinja2, no templating engine — just Python's built-in str.replace(). The grader can delete nginx.conf and docker-compose.yml, run ./swiftdeploy init, and they regenerate perfectly every time.
The API Service
The API is a Python HTTP server using only the standard library — no Flask, no FastAPI. This keeps the Docker image under 60MB (well under the 300MB limit).
It runs in two modes controlled by a MODE environment variable injected by Docker Compose:
stable mode → normal behaviour
canary mode → adds X-Mode: canary header + activates /chaos endpoint
The same image runs both modes. The only difference is the environment variable.
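A sketch of how a standard-library server can branch on that variable (the handler shape is illustrative, not SwiftDeploy's actual code):

import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

MODE = os.environ.get("MODE", "stable")  # injected by Docker Compose

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"status": "ok", "mode": MODE}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        if MODE == "canary":
            self.send_header("X-Mode", "canary")  # only canary tags responses
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("0.0.0.0", 3000), Handler).serve_forever()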
The Nginx Reverse Proxy
Nginx sits in front of the app and adds:
- An X-Deployed-By: swiftdeploy header on every response
- JSON error bodies on 502/503/504 (instead of ugly HTML)
- Structured access logs in the required format
- Forwards the X-Mode header from the upstream app
Critically, the app port is never exposed directly. Only Nginx's port is mapped to the host. All traffic must flow through it.
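You can verify the headers from the host with a single request. A hypothetical session after promoting to canary (output abbreviated):

$ curl -i http://localhost:8080/
HTTP/1.1 200 OK
X-Deployed-By: swiftdeploy
X-Mode: canary
...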
Part 2 — The Guardrails: OPA Policy Enforcement
Why OPA?
The task required that the CLI never make allow/deny decisions itself. All logic must live in OPA (Open Policy Agent).
This matters because it separates concerns cleanly:
CLI → collects data, calls OPA, surfaces the result
OPA → owns all decision logic, never called by the app
If you want to change a policy, you edit a .rego file. You never touch the CLI. If you want to change a threshold, you edit data.json. You never touch the Rego files.
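A sketch of the CLI's side of that contract (the function name is my own; the /v1/data endpoint is OPA's standard Data API): collect the input, POST it, relay the verdict.

import json
import urllib.request

def check_policy(package: str, input_doc: dict) -> dict:
    """POST an input document to OPA's Data API and relay its verdict."""
    req = urllib.request.Request(
        f"http://localhost:8181/v1/data/{package}",
        data=json.dumps({"input": input_doc}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        result = json.loads(resp.read()).get("result", {})
    # No allow/deny logic lives here: that is all owned by the .rego files
    return {"allowed": result.get("allow", False),
            "violations": sorted(result.get("violations", []))}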
Policy Structure
Each policy domain owns exactly one question:
Infrastructure policy — Is the host healthy enough to deploy?
package infrastructure

import rego.v1

default allow := false

allow if {
    count(violations) == 0
}

violations contains msg if {
    input.disk_free_gb < data.infrastructure.min_disk_free_gb
    msg := sprintf(
        "Disk free (%.1fGB) is below minimum threshold (%.1fGB)",
        [input.disk_free_gb, data.infrastructure.min_disk_free_gb]
    )
}

violations contains msg if {
    input.cpu_load > data.infrastructure.max_cpu_load
    msg := sprintf(
        "CPU load (%.2f) exceeds maximum threshold (%.2f)",
        [input.cpu_load, data.infrastructure.max_cpu_load]
    )
}
Canary safety policy — Is the canary healthy enough to promote?
package canary

import rego.v1

default allow := false

allow if {
    count(violations) == 0
}

violations contains msg if {
    input.error_rate > data.canary.max_error_rate
    msg := sprintf(
        "Error rate (%.2f%%) exceeds maximum threshold (%.2f%%)",
        [input.error_rate * 100, data.canary.max_error_rate * 100]
    )
}
Threshold values live in data.json — never hardcoded in Rego:
{
  "infrastructure": {
    "min_disk_free_gb": 10.0,
    "max_cpu_load": 16.0,
    "min_mem_free_percent": 10.0
  },
  "canary": {
    "max_error_rate": 0.01,
    "max_p99_latency_ms": 500
  }
}
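You can watch the policy and the data meet by querying OPA directly. A hypothetical session, with input values chosen to trip the disk rule:

curl -s http://localhost:8181/v1/data/infrastructure \
  -H "Content-Type: application/json" \
  -d '{"input": {"disk_free_gb": 4.2, "cpu_load": 1.1}}'

{"result": {"allow": false, "violations": ["Disk free (4.2GB) is below minimum threshold (10.0GB)"]}}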
The Hard Gate in Action
During one run, with max_cpu_load set to 2.0 (stricter than the data.json shown above), the CPU load exceeded the threshold and swiftdeploy deploy was blocked:
[deploy] Running OPA pre-deploy policy checks...
Host stats: disk=80.08GB free, cpu_load=12.88, mem_free=50.0%
[policy] Checking Infrastructure...
[BLOCK] Infrastructure policy FAILED:
x CPU load (12.88) exceeds maximum threshold (2.00)
[deploy] BLOCKED by policy. Fix violations above before deploying.
The deploy never started. The CLI surfaced the exact violation reason from OPA — no guessing required.
OPA Isolation
OPA is intentionally isolated from public Nginx ingress. It runs on port 8181 inside the Docker network. It is NOT behind Nginx, and its port is only accessible to the CLI running on the host. A user hitting localhost:8080 cannot reach OPA.
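In docker-compose.yml terms, that isolation might look roughly like this (a sketch, not the generated file; binding OPA to 127.0.0.1 is one way to keep it host-only):

services:
  nginx:
    ports:
      - "8080:8080"              # the only publicly mapped port
  app:
    expose:
      - "3000"                   # internal network only; nginx proxies to it
  opa:
    ports:
      - "127.0.0.1:8181:8181"    # host CLI can reach it; nginx never proxies it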
Failure Handling
The CLI handles every distinct OPA failure mode differently:
try:
    ...  # POST the policy input to OPA and parse the JSON response
except urllib.error.URLError as e:
    # OPA unreachable — warn but don't crash
    return {"allowed": False, "violations": [],
            "error": f"OPA unreachable: {e.reason}"}
except json.JSONDecodeError:
    # OPA returned garbage — different message
    return {"allowed": False, "violations": [],
            "error": "OPA returned invalid JSON"}
except Exception as e:
    # Catch-all — still doesn't crash
    return {"allowed": False, "violations": [],
            "error": f"Unexpected OPA error: {e}"}
The CLI never crashes or hangs when OPA is unavailable. It warns the operator and continues.
Part 3 — The Chaos: Breaking Things on Purpose
The /metrics Endpoint
The API exposes a /metrics endpoint in Prometheus text format:
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/",status_code="200"} 42
# HELP http_request_duration_seconds Request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{method="GET",path="/",le="0.005"} 38
# HELP app_mode Current deployment mode (0=stable, 1=canary)
# TYPE app_mode gauge
app_mode 1
# HELP chaos_active Current chaos state (0=none, 1=slow, 2=error)
# TYPE chaos_active gauge
chaos_active 0
No third-party libraries — pure Python calculating histogram buckets manually.
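Rendering a histogram by hand is mostly cumulative counting per le bound. A minimal sketch (the bucket boundaries here are illustrative):

BUCKETS = (0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0)

def render_histogram(durations):
    """Render one Prometheus histogram family from raw request durations."""
    lines = ["# TYPE http_request_duration_seconds histogram"]
    for le in BUCKETS:
        n = sum(1 for d in durations if d <= le)  # buckets are cumulative
        lines.append(f'http_request_duration_seconds_bucket{{le="{le}"}} {n}')
    lines.append(f'http_request_duration_seconds_bucket{{le="+Inf"}} {len(durations)}')
    lines.append(f"http_request_duration_seconds_sum {sum(durations)}")
    lines.append(f"http_request_duration_seconds_count {len(durations)}")
    return "\n".join(lines)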
Injecting Chaos
After promoting to canary mode, chaos was injected:
# Slow mode — every request sleeps 3 seconds
curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode": "slow", "duration": 3}'

# Error mode — 50% of requests return HTTP 500
curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode": "error", "rate": 0.5}'
The Status Dashboard Capturing the Failure
With error mode active at 50%, the status dashboard showed:
=======================================================
SwiftDeploy Status Dashboard
2026-05-06T14:18:43Z
=======================================================
Mode:         CANARY
Uptime:       892s
Req/s:        2.40
P99 Latency:  250ms
Error Rate:   48.20%

Policy Compliance:
  + Infrastructure:  PASSING
  x Canary Safety:   FAILING
      -> Error rate (48.20%) exceeds maximum threshold (1.00%)
The canary safety policy immediately flagged the failure. Attempting to promote to stable at this point would have been blocked by OPA.
Recovery
curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode": "recover"}'
Within one scrape cycle the dashboard showed error rate back to 0% and canary safety back to PASSING.
Part 4 — The Audit Trail
Every significant event is written to history.jsonl:
{"event": "deploy_success", "timestamp": "2026-05-06T13:53:39Z"}
{"event": "promote_success", "target": "canary", "timestamp": "2026-05-06T14:01:22Z"}
{"event": "status_scrape", "mode": "canary", "error_rate": 0.482, "timestamp": "2026-05-06T14:18:43Z"}
Running ./swiftdeploy audit parses this file and generates a clean Markdown report:
## Timeline
| Timestamp | Event | Details |
|---|---|---|
| 2026-05-06T13:53:39Z | Deploy | Stack deployed successfully |
| 2026-05-06T14:01:22Z | Promote | Mode switched to canary |
## Policy Violations
| Timestamp | Policy | Reason |
|---|---|---|
| 2026-05-06T14:18:43Z | Canary Safety | error_rate=48.20%, p99=250ms |
Lessons Learned
1. Single source of truth is worth the extra complexity
It felt like overkill to build a template engine just to generate two config files. But when the grader deletes your generated files and reruns init, you're grateful every value comes from one place.
2. OPA's syntax changes between versions
The latest OPA image requires import rego.v1 and the if/contains keywords. Older Rego syntax silently fails to load. Always check your OPA container logs first.
3. Start OPA before running policy checks
OPA is part of the stack, so it doesn't exist before docker compose up. The fix was to start OPA first as a separate step, wait 3 seconds for it to load policies, then run the pre-deploy check.
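A fixed sleep works, but polling OPA's /health endpoint until it answers is sturdier. A sketch of that alternative (not what the CLI currently does):

import time
import urllib.request

def wait_for_opa(url="http://localhost:8181/health", timeout=30):
    """Poll OPA's health endpoint until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # connection refused while OPA is still starting
            pass
        time.sleep(0.5)
    return False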
4. Chaos engineering reveals what metrics matter
Before injecting chaos, the /metrics endpoint felt like box-ticking. The value became obvious the moment I watched the error rate spike to 48% in real time on the status dashboard while OPA simultaneously flagged the canary safety policy.
5. Policy as code beats policy as documentation
A README saying "don't deploy if CPU load is above 2.0" gets ignored. A Rego file that blocks the deploy enforces it automatically.
The Full Subcommand Reference
./swiftdeploy init # generate nginx.conf + docker-compose.yml
./swiftdeploy validate # 5 pre-flight checks
./swiftdeploy deploy # OPA check + start stack + health wait
./swiftdeploy promote canary # switch to canary mode
./swiftdeploy promote stable # switch back to stable
./swiftdeploy status # live metrics + policy compliance dashboard
./swiftdeploy audit # generate audit_report.md
./swiftdeploy teardown # stop all containers
./swiftdeploy teardown --clean # stop + delete generated files
Source Code
The full project is available on GitHub: https://github.com/AnitaAliCloud/hng4-devops
Built as part of the HNG DevOps Track — Stage 4A and 4B
