SwiftDeploy: Building an Observable, Policy-Driven Deployment Engine with OPA

Chris Ameh — Fri, 08 May 2026 04:06:27 +0000

Introduction
As part of the HNG Internship DevOps Track Stage 4B, I extended my Stage 4A project — SwiftDeploy — into a fully observable, policy-aware deployment platform.
In Stage 4A, SwiftDeploy could:

generate infrastructure files from a declarative manifest
deploy containers using Docker Compose
manage deployment modes (stable/canary)
configure Nginx automatically

Stage 4B transformed it into something much closer to a real production deployment system by adding:

Prometheus instrumentation
Open Policy Agent (OPA) policy enforcement
live operational dashboards
deployment safety gates
audit logging and reporting
chaos engineering validation

The result is a deployment tool that not only deploys services, but also decides whether deployments are safe enough to proceed.

The Core Philosophy: One Manifest, Everything Else Generated
SwiftDeploy is built around a single principle:

manifest.yaml is the only file you should ever edit manually.

Everything else is generated from it.
Here is the manifest structure:
services:
  name: app
  image: swift-deploy-1-node:latest
  port: 3000
  version: "1.0.0"
  mode: stable
nginx:
  image: nginx:latest
  port: 8080
  proxy_timeout: 30
network:
  name: swiftdeploy-net
  driver_type: bridge
From this manifest, the CLI generates:

generated/nginx.conf
generated/docker-compose.yml
OPA runtime configuration

This design provides:

consistency
reproducibility
environment portability
infrastructure-as-code discipline

The grader can delete all generated files and rerun:
./swiftdeploy init
and the entire stack regenerates correctly.
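To make the "everything is generated" idea concrete, here is a minimal sketch of rendering an Nginx config from the manifest values. The template text, the `render_nginx_conf` name, and the upstream name are assumptions for illustration; the article does not show SwiftDeploy's real templates. The manifest is written as a Python dict so the sketch needs only the standard library.

```python
from string import Template

# Hypothetical template: the real SwiftDeploy templates are not shown in the article.
NGINX_TEMPLATE = Template("""\
events {}
http {
  upstream app_upstream {
    server $app_name:$app_port;
  }
  server {
    listen $nginx_port;
    location / {
      proxy_pass http://app_upstream;
      proxy_read_timeout ${proxy_timeout}s;
    }
  }
}
""")

# The manifest from the article, expressed as a dict (a real CLI would load manifest.yaml).
MANIFEST = {
    "services": {
        "name": "app",
        "image": "swift-deploy-1-node:latest",
        "port": 3000,
        "version": "1.0.0",
        "mode": "stable",
    },
    "nginx": {"image": "nginx:latest", "port": 8080, "proxy_timeout": 30},
    "network": {"name": "swiftdeploy-net", "driver_type": "bridge"},
}


def render_nginx_conf(manifest: dict) -> str:
    """Fill the template using only values that come from the manifest."""
    return NGINX_TEMPLATE.substitute(
        app_name=manifest["services"]["name"],
        app_port=manifest["services"]["port"],
        nginx_port=manifest["nginx"]["port"],
        proxy_timeout=manifest["nginx"]["proxy_timeout"],
    )


if __name__ == "__main__":
    print(render_nginx_conf(MANIFEST))
```

Because the output depends only on the manifest, deleting the generated file and rendering again gives the same result.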

Architecture Overview
The request and decision flow runs through the following components:
User
  ↓
Nginx Reverse Proxy
  ↓
Flask API Service
  ↓
Prometheus Metrics
  ↓
SwiftDeploy CLI
  ↓
OPA Policy Engine
The deployment stack includes:

Flask application container
Nginx reverse proxy
Open Policy Agent (OPA)
internal Docker network
named log volumes

The SwiftDeploy CLI
The heart of the project is the swiftdeploy executable.
It is a Python-based CLI tool that manages the entire deployment lifecycle.
Supported Commands
| Command | Purpose |
|---------|---------|
| init | Generate config files from templates |
| validate | Run pre-flight validation checks |
| deploy | Start the stack |
| promote canary | Switch deployment into canary mode |
| promote stable | Return deployment to stable mode |
| status | Live metrics dashboard |
| audit | Generate audit report |
| teardown | Destroy containers and networks |

The API Service
The API service is a Flask application that supports both stable and canary deployment modes.
Deployment mode is controlled through the MODE environment variable.
Endpoints
Root Endpoint
GET /
Returns:

deployment mode
version
timestamp

Example:
{ "message": "Welcome to SwiftDeploy", "mode": "stable", "version": "1.0.0"}

Health Endpoint
GET /healthz
Returns:

health status
application uptime

Chaos Endpoint
POST /chaos
Available only in canary mode.
Supports:
{ "mode": "slow", "duration": 3 }
{ "mode": "error", "rate": 0.5 }
{ "mode": "recover" }
This endpoint was used to simulate:

degraded latency
random failures
recovery workflows
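The three chaos payloads above can be modeled as a small piece of state that each request consults. This is a sketch only: the `ChaosState` class and its method names are hypothetical, and the real Flask handler is not shown in the article.

```python
import random
import time


class ChaosState:
    """Holds the current chaos mode. Names are illustrative, not from the project."""

    def __init__(self):
        self.mode = "none"
        self.duration = 0.0  # seconds of added delay in "slow" mode
        self.rate = 0.0      # probability of a failure in "error" mode

    def apply(self, payload: dict) -> None:
        """Update the state from a POST /chaos request body."""
        mode = payload.get("mode")
        if mode == "slow":
            self.mode, self.duration, self.rate = "slow", float(payload.get("duration", 0)), 0.0
        elif mode == "error":
            self.mode, self.duration, self.rate = "error", 0.0, float(payload.get("rate", 0))
        elif mode == "recover":
            self.mode, self.duration, self.rate = "none", 0.0, 0.0
        else:
            raise ValueError(f"unknown chaos mode: {mode!r}")

    def before_request(self, rng=random.random, sleep=time.sleep) -> int:
        """Return the HTTP status a request should receive (200 or 500)."""
        if self.mode == "slow":
            sleep(self.duration)
        elif self.mode == "error" and rng() < self.rate:
            return 500
        return 200
```

Passing `rng` and `sleep` as parameters keeps the behavior testable without real delays or randomness.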

Instrumentation: The /metrics Endpoint
One of the biggest upgrades in Stage 4B was observability.
I instrumented the Flask service using the prometheus_client library.
The service now exposes:
GET /metrics
in Prometheus text format.

Metrics Collected
Request Throughput
http_requests_total
Labels:

method
path
status_code

Example:
http_requests_total{method="GET",path="/",status_code="200"} 152
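For readers who have not seen how a labelled counter turns into lines like the one above, here is a standard-library stand-in. It is not the `prometheus_client` implementation the service uses; the `observe` and `render` names and the HELP text are assumptions made for illustration.

```python
from collections import Counter

# Maps (method, path, status_code) to the number of requests seen.
requests_total = Counter()


def observe(method: str, path: str, status_code: int) -> None:
    """Record one finished request."""
    requests_total[(method, path, str(status_code))] += 1


def render() -> str:
    """Render the counter in Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests.",
        "# TYPE http_requests_total counter",
    ]
    for (method, path, status), count in sorted(requests_total.items()):
        lines.append(
            f'http_requests_total{{method="{method}",path="{path}",status_code="{status}"}} {count}'
        )
    return "\n".join(lines) + "\n"
```

Each distinct label combination becomes its own time series, which is why the path label should stay low-cardinality.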
Request Latency
http_request_duration_seconds
Histogram used for:

latency analysis
P99 calculation
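A Prometheus histogram exposes cumulative bucket counts, so P99 has to be estimated rather than read directly. The sketch below interpolates linearly inside the bucket that contains the target rank, which is similar in spirit to PromQL's `histogram_quantile`. The function name and the input shape are assumptions; the article does not show how SwiftDeploy computes P99.

```python
import math


def quantile_from_buckets(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation inside the matching bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound
```

The histogram is in seconds, so the result would be multiplied by 1000 before being compared with a millisecond threshold such as p99_latency_ms.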

Application Uptime

app_uptime_seconds
Tracks process uptime.

Deployment Mode
app_mode
Values:

0 = stable
1 = canary

Chaos State
chaos_active
Values:

0 = none
1 = slow
2 = error

Why Metrics Matter
Without metrics:

deployments are blind
failures become invisible
canary safety cannot be enforced

Metrics became the foundation for:

policy decisions
dashboards
auditing
promotion safety

Open Policy Agent (OPA): The Brain of SwiftDeploy
The most important design principle in Stage 4B was:
The CLI must never make allow/deny decisions itself.
All decision-making lives entirely inside OPA.
SwiftDeploy only:

gathers data
sends context to OPA
acts on the response
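The gather, send, act loop maps onto OPA's Data API, which accepts a POST with the context under an `input` key and replies with a `result` key. The sketch below assumes the policies expose a boolean `allow` rule at paths such as `infra/allow`; the helper names are hypothetical, and the CLI's real request code is not shown in the article.

```python
import json
import urllib.request

OPA_URL = "http://localhost:8181"  # address given in the article


def build_request(policy_path: str, context: dict) -> urllib.request.Request:
    """Build a POST to OPA's Data API, for example /v1/data/infra/allow."""
    body = json.dumps({"input": context}).encode("utf-8")
    return urllib.request.Request(
        f"{OPA_URL}/v1/data/{policy_path}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_allowed(response_body: bytes) -> bool:
    """OPA answers {"result": <value>}; a missing result means the rule is undefined."""
    return json.loads(response_body).get("result") is True


def ask_opa(policy_path: str, context: dict, timeout: float = 3.0) -> bool:
    """Send the context to OPA and return its decision unchanged."""
    request = build_request(policy_path, context)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return is_allowed(response.read())
```

Note that `is_allowed` only reads OPA's answer. No threshold appears anywhere in the Python code, which is the point of the separation.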

This separation makes the system:

modular
secure
maintainable
extensible

OPA Policy Domains
I separated policies into independent domains.
Each policy:

answers one question
owns its own logic
operates independently

Infrastructure Policy
Runs before deployment.
Blocks deployment when:

disk free space is below 10GB
CPU load exceeds 2.0

Rego Example
package infra

default allow = false

allow {
    input.disk_free_gb >= data.thresholds.disk_free_gb
    input.cpu_load <= data.thresholds.cpu_load
}

Canary Safety Policy
Runs before promotion.
Blocks promotion when:

error rate exceeds 1%
P99 latency exceeds 500ms

Rego Example
package canary

default allow = false

allow {
    input.error_rate <= data.thresholds.error_rate
    input.p99_latency_ms <= data.thresholds.p99_latency_ms
}

Policy Thresholds
Thresholds are stored separately in:
policies/data.json
Example:
{ "thresholds": { "disk_free_gb": 10, "cpu_load": 2.0, "error_rate": 0.01, "p99_latency_ms": 500 }}

This prevents:

hardcoded values
duplicated configuration
policy coupling

OPA Isolation
The OPA container runs on an internal Docker network.
It is intentionally NOT exposed through Nginx.
Only the CLI can access OPA directly via:
http://localhost:8181

This prevents external users from:

querying policies
bypassing deployment logic
inspecting internal rules

This mirrors real production security architecture.

Pre-Deploy Policy Enforcement
Before deployment, SwiftDeploy collects:

CPU load
available disk space

Example payload:
{ "disk_free_gb": 8.5, "cpu_load": 2.4}
OPA evaluates the payload.

If policies fail:

Deployment blocked: Infrastructure policy violation
The deployment never proceeds.
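The collection step that produces the payload above could look like the following sketch. It assumes the 1-minute load average and binary gigabytes; the article does not say which load window or unit SwiftDeploy uses, and `collect_infra_input` is a hypothetical name.

```python
import os
import shutil


def collect_infra_input(path: str = "/") -> dict:
    """Gather the two values the infrastructure policy reads."""
    free_bytes = shutil.disk_usage(path).free
    load_1m, _, _ = os.getloadavg()  # Unix only; 1-minute window is an assumption
    return {
        "disk_free_gb": round(free_bytes / 1024**3, 2),
        "cpu_load": round(load_1m, 2),
    }


if __name__ == "__main__":
    print(collect_infra_input())
```

The function only measures. Whether the numbers are acceptable is left to the policy and its thresholds.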

Canary Safety Enforcement
Before promotion, SwiftDeploy:

scrapes /metrics
calculates error rate
calculates P99 latency
submits metrics to OPA

If the canary is unhealthy:

promotion is blocked
rollout is prevented

This introduces production-grade deployment safety.
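The "calculates error rate" step can be sketched as a small parser over the /metrics text. It assumes that any 5xx status counts as an error and that label values contain no closing brace; both are assumptions, as the article does not define the error criterion. The `error_rate` function name is hypothetical.

```python
import re

# Matches only http_requests_total samples, not helper series such as *_created.
SAMPLE = re.compile(r'^http_requests_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def error_rate(metrics_text: str) -> float:
    """Return the share of requests with a 5xx status_code label (0.0 if no traffic)."""
    total = errors = 0.0
    for line in metrics_text.splitlines():
        match = SAMPLE.match(line.strip())
        if not match:
            continue  # comments, other metrics, blank lines
        labels = dict(LABEL.findall(match["labels"]))
        value = float(match["value"])
        total += value
        if labels.get("status_code", "").startswith("5"):
            errors += value
    return errors / total if total else 0.0
```

The result is a fraction, so it can be submitted as error_rate and compared with a threshold of 0.01 on the OPA side.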

The Status Dashboard
The status command provides a live operational dashboard.
./swiftdeploy status
The dashboard:

refreshes continuously
scrapes live metrics
calculates request rate
calculates P99 latency
evaluates policy compliance
appends results to history.jsonl
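Request rate is not exposed directly. It can be derived from two consecutive scrapes of the http_requests_total counter. A minimal sketch, with a hypothetical function name and a simple treatment of counter resets:

```python
def request_rate(prev_total: float, curr_total: float, interval_s: float) -> float:
    """Requests per second between two scrapes of a monotonically increasing counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr_total - prev_total
    if delta < 0:
        # The counter went backwards, so the app restarted: count from zero.
        delta = curr_total
    return delta / interval_s
```

For example, going from 100 to 160 requests over a 5 second refresh gives 12 requests per second.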

Example output:
SwiftDeploy Status Dashboard
==================================================
Mode: canary
Chaos: error
Error Rate: 52%
P99 Latency: 430ms
Policy Compliance:
✓ Infrastructure policy: PASSING
✗ Canary safety policy: FAILING

Chaos Engineering
This was one of the most interesting parts of the project.
I intentionally injected:

high error rates
slow responses

Example:
curl -X POST http://localhost:8080/chaos -H "Content-Type: application/json" -d '{"mode":"error","rate":0.9}'

Immediately:

metrics reflected failures
policies began failing
promotions were blocked

This validated that:

metrics were accurate
policies were functional
safety gates worked correctly

Audit Logging
The following events are appended to history.jsonl:

deploy
promote
status scrape
policy violation

Example entry:
{ "timestamp": "2026-05-06T12:00:00", "mode": "canary", "error_rate": 0.52}
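Appending to a JSON Lines file takes only a few lines, which is part of why the format suits audit trails. The sketch below is an assumption about how such a helper might look: the `append_event` name and the `event` field are hypothetical, while `timestamp`, `mode`, and `error_rate` come from the example entry above.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def append_event(path: str, event: str, **fields) -> dict:
    """Append one JSON object per line so the file stays easy to stream and grep."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "event": event,  # hypothetical field naming the kind of record
        **fields,
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "history.jsonl")
    append_event(demo_path, "promote", mode="canary", error_rate=0.52)
    with open(demo_path, encoding="utf-8") as fh:
        print(fh.read(), end="")
```

Opening in append mode means earlier entries are never rewritten, which suits an audit log.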

Audit Report Generation
Running:
./swiftdeploy audit
generates:
audit_report.md

The report includes:

deployment timeline
mode changes
chaos injections
policy violations

Example:
| Timestamp | Policy | Details |
|-----------|--------|---------|
| 2026-05-06T00:47:10Z | Canary Safety | error_rate=50% |
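A table like the one above can be produced by filtering the history for violation records. This sketch assumes each violation entry carries hypothetical `event`, `policy`, and `details` fields; the article does not show the full history schema or the real report generator.

```python
import json


def render_violations(jsonl_text: str) -> str:
    """Build a Markdown table from the policy violation entries in a JSON Lines log."""
    rows = [
        "| Timestamp | Policy | Details |",
        "|-----------|--------|---------|",
    ]
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        if entry.get("event") != "policy_violation":  # hypothetical event name
            continue
        rows.append(f"| {entry['timestamp']} | {entry['policy']} | {entry['details']} |")
    return "\n".join(rows) + "\n"
```

The report reuses decisions that were already recorded, so generating it never requires a new call to OPA.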

Challenges Faced
a. Python Virtual Environment Issues
Ubuntu’s externally-managed Python environment caused repeated package installation failures.

The solution was:

recreating the virtual environment
installing dependencies inside the venv only

b. Nginx Validation Problems
Generated Nginx configs initially failed validation due to unresolved upstream references.

Fix:

validate only inside container context
avoid host-side upstream resolution

c. Metrics Parsing

Calculating error rate and P99 latency from Prometheus text format required careful parsing and aggregation.

d. OPA Failure Handling
The CLI had to gracefully handle:

OPA downtime
connection failures
malformed responses

The system never crashes when OPA becomes unavailable.
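One way to cover all three failure cases is to wrap the OPA call and turn every failure into an explicit result with a reason. The sketch below denies when OPA cannot answer, which is an assumption: the article only states that the CLI does not crash. The `safe_decision` name and the returned dict shape are hypothetical.

```python
import json
import urllib.error


def safe_decision(fetch) -> dict:
    """Call fetch(), which returns OPA's raw response bytes, and never raise."""
    try:
        payload = json.loads(fetch())
    except OSError as exc:
        # Covers urllib.error.URLError, refused connections, and timeouts.
        return {"allowed": False, "reason": f"OPA unreachable: {exc}"}
    except ValueError as exc:
        # json.JSONDecodeError is a subclass of ValueError.
        return {"allowed": False, "reason": f"malformed OPA response: {exc}"}
    if not isinstance(payload, dict) or not isinstance(payload.get("result"), bool):
        return {"allowed": False, "reason": "malformed OPA response: missing boolean result"}
    return {"allowed": payload["result"], "reason": "decision returned by OPA"}
```

Taking `fetch` as a parameter keeps the error handling separate from the HTTP code, so each failure path can be exercised without a running OPA container.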

Lessons Learned
Declarative Systems Scale Better
A single source of truth drastically reduces configuration drift.

Observability Is Mandatory
Without metrics:

policy enforcement becomes impossible
deployments become blind

Policy Engines Should Be Isolated
Keeping OPA internal-only mirrors real enterprise architectures.

Chaos Engineering Builds Confidence
Breaking the system intentionally proved that:

metrics were accurate
policies were effective
safety mechanisms worked

Automation Must Be Explainable
Every policy response included human-readable reasoning.
This made debugging and operational decisions much easier.

Final Thoughts
Stage 4B transformed SwiftDeploy from a deployment generator into a lightweight deployment platform with:

observability
governance
auditing
deployment safety

The project demonstrated how metrics, policy engines, infrastructure generation, and deployment orchestration can work together to create reliable deployment systems.

Most importantly, it reinforced a key DevOps principle:

Safe automation is more valuable than fast automation.

DEV Community: Chris Ameh
