Chris Ameh

SwiftDeploy: Building an Observable, Policy-Driven Deployment Engine with OPA

Introduction
As part of the HNG Internship DevOps Track Stage 4B, I extended my Stage 4A project — SwiftDeploy — into a fully observable, policy-aware deployment platform.
In Stage 4A, SwiftDeploy could:

  • generate infrastructure files from a declarative manifest
  • deploy containers using Docker Compose
  • manage deployment modes (stable/canary)
  • configure Nginx automatically

Stage 4B transformed it into something much closer to a real production deployment system by adding:

  • Prometheus instrumentation
  • Open Policy Agent (OPA) policy enforcement
  • live operational dashboards
  • deployment safety gates
  • audit logging and reporting
  • chaos engineering validation

The result is a deployment tool that not only deploys services, but also decides whether deployments are safe enough to proceed.

The Core Philosophy: One Manifest, Everything Else Generated
SwiftDeploy is built around a single principle:

manifest.yaml is the only file you should ever edit manually.

Everything else is generated from it.
Here is the manifest structure:
services:
  name: app
  image: swift-deploy-1-node:latest
  port: 3000
  version: "1.0.0"
  mode: stable
nginx:
  image: nginx:latest
  port: 8080
  proxy_timeout: 30
network:
  name: swiftdeploy-net
  driver_type: bridge
From this manifest, the CLI generates:

  • generated/nginx.conf
  • generated/docker-compose.yml
  • OPA runtime configuration

This design provides:

  • consistency
  • reproducibility
  • environment portability
  • infrastructure-as-code discipline

The grader can delete all generated files and rerun:
./swiftdeploy init
and the entire stack regenerates correctly.
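
Under the hood, this generation step is simple templating. Here is a minimal sketch of what an init implementation could look like, assuming PyYAML and Jinja2; the template file names are illustrative, not SwiftDeploy's actual ones:

import os
import yaml
from jinja2 import Environment, FileSystemLoader

def init(manifest_path="manifest.yaml"):
    # Load the single source of truth
    with open(manifest_path) as f:
        manifest = yaml.safe_load(f)

    os.makedirs("generated", exist_ok=True)
    env = Environment(loader=FileSystemLoader("templates"))

    # Render each template with the manifest as context
    for template_name, output_path in [
        ("nginx.conf.j2", "generated/nginx.conf"),
        ("docker-compose.yml.j2", "generated/docker-compose.yml"),
    ]:
        rendered = env.get_template(template_name).render(**manifest)
        with open(output_path, "w") as out:
            out.write(rendered)

Because every output is a pure function of the manifest, deleting generated/ is always safe.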

Architecture Overview
The system architecture consists of four major components:
User
  ↓
Nginx Reverse Proxy
  ↓
Flask API Service
  ↓
Prometheus Metrics
  ↓
SwiftDeploy CLI
  ↓
OPA Policy Engine
The deployment stack includes:

  • Flask application container
  • Nginx reverse proxy
  • Open Policy Agent (OPA)
  • internal Docker network
  • named log volumes

The SwiftDeploy CLI
The heart of the project is the swiftdeploy executable.
It is a Python-based CLI tool that manages the entire deployment lifecycle.
Supported Commands
| Command | Purpose |
|---------|---------|
| init | Generate config files from templates |
| validate | Run pre-flight validation checks |
| deploy | Start the stack |
| promote canary | Switch deployment into canary mode |
| promote stable | Return deployment to stable mode |
| status | Live metrics dashboard |
| audit | Generate audit report |
| teardown | Destroy containers and networks |
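
For illustration, here is one way these subcommands could be wired up with argparse; the real swiftdeploy entry point may be structured differently:

import argparse

def main():
    parser = argparse.ArgumentParser(prog="swiftdeploy")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in ("init", "validate", "deploy", "status", "audit", "teardown"):
        sub.add_parser(name)
    # "promote" takes a positional target mode
    promote = sub.add_parser("promote")
    promote.add_argument("target", choices=["canary", "stable"])
    args = parser.parse_args()
    print(f"dispatching: {args.command}")

if __name__ == "__main__":
    main()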

The API Service
The API service is a Flask application that supports both stable and canary deployment modes.
Deployment mode is controlled through the MODE environment variable.
Endpoints
Root Endpoint
GET /
Returns:

  • deployment mode
  • version
  • timestamp

Example:
{ "message": "Welcome to SwiftDeploy", "mode": "stable", "version": "1.0.0"}

Health Endpoint
GET /healthz
Returns:

  • health status
  • application uptime

Chaos Endpoint
POST /chaos
Available only in canary mode.
Supports:
{ "mode": "slow", "duration": 3 }
{ "mode": "error", "rate": 0.5 }
{ "mode": "recover" }
This endpoint was used to simulate:

  • degraded latency
  • random failures
  • recovery workflows
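
One plausible implementation stores the requested failure mode in process state that ordinary handlers consult on every request; the names below are illustrative, not SwiftDeploy's actual code:

import os, random, time
from flask import Flask, abort, jsonify, request

app = Flask(__name__)
chaos = {"mode": None, "rate": 0.0, "duration": 0}

@app.route("/chaos", methods=["POST"])
def set_chaos():
    if os.environ.get("MODE") != "canary":
        abort(403)  # chaos injection is canary-only
    body = request.get_json(force=True)
    if body.get("mode") == "recover":
        chaos.update(mode=None, rate=0.0, duration=0)
    else:
        chaos.update(body)
    return jsonify(chaos)

def apply_chaos():
    # Called at the top of ordinary request handlers
    if chaos["mode"] == "slow":
        time.sleep(chaos.get("duration", 0))
    elif chaos["mode"] == "error" and random.random() < chaos.get("rate", 0):
        abort(500)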

Instrumentation: The /metrics Endpoint
One of the biggest upgrades in Stage 4B was observability.
I instrumented the Flask service using the prometheus_client library.
The service now exposes:
GET /metrics
in Prometheus text format.
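
The wiring is roughly this (a simplified sketch using prometheus_client with Flask request hooks, not SwiftDeploy's exact code):

import time
from flask import Flask, Response, request
from prometheus_client import (CONTENT_TYPE_LATEST, Counter, Gauge,
                               Histogram, generate_latest)

app = Flask(__name__)
START = time.time()

REQUESTS = Counter("http_requests_total", "Total HTTP requests",
                   ["method", "path", "status_code"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency",
                    ["method", "path"])
UPTIME = Gauge("app_uptime_seconds", "Process uptime in seconds")

@app.before_request
def start_timer():
    request.start_time = time.time()

@app.after_request
def record(response):
    elapsed = time.time() - request.start_time
    REQUESTS.labels(request.method, request.path, str(response.status_code)).inc()
    LATENCY.labels(request.method, request.path).observe(elapsed)
    return response

@app.route("/metrics")
def metrics():
    UPTIME.set(time.time() - START)
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)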

Metrics Collected
Request Throughput
http_requests_total
Labels:

  • method
  • path
  • status_code

Example:
http_requests_total{method="GET",path="/",status_code="200"} 152
Request Latency
http_request_duration_seconds
Histogram used for:

  • latency analysis
  • P99 calculation

Application Uptime
app_uptime_seconds
Tracks process uptime.

Deployment Mode
app_mode
Values:

  • 0 = stable
  • 1 = canary

Chaos State
chaos_active
Values:

  • 0 = none
  • 1 = slow
  • 2 = error

Why Metrics Matter
Without metrics:

  • deployments are blind
  • failures become invisible
  • canary safety cannot be enforced

Metrics became the foundation for:

  • policy decisions
  • dashboards
  • auditing
  • promotion safety

Open Policy Agent (OPA): The Brain of SwiftDeploy
The most important design principle in Stage 4B was:
The CLI must never make allow/deny decisions itself.
All decision-making lives entirely inside OPA.
SwiftDeploy only:

  • gathers data
  • sends context to OPA
  • acts on the response

This separation makes the system:

  • modular
  • secure
  • maintainable
  • extensible

OPA Policy Domains
I separated policies into independent domains.
Each policy:

  • answers one question
  • owns its own logic
  • operates independently

Infrastructure Policy
Runs before deployment.
Blocks deployment when:

  • disk free space is below 10GB
  • CPU load exceeds 2.0

Rego Example
package infra

default allow = false

allow {
  input.disk_free_gb >= data.thresholds.disk_free_gb
  input.cpu_load <= data.thresholds.cpu_load
}

Canary Safety Policy
Runs before promotion.
Blocks promotion when:

  • error rate exceeds 1%
  • P99 latency exceeds 500ms

Rego Example

package canary

default allow = false

allow {
  input.error_rate <= data.thresholds.error_rate
  input.p99_latency_ms <= data.thresholds.p99_latency_ms
}

Policy Thresholds
Thresholds are stored separately in:
policies/data.json
Example:
{ "thresholds": { "disk_free_gb": 10, "cpu_load": 2.0, "error_rate": 0.01, "p99_latency_ms": 500 }}

This prevents:

  • hardcoded values
  • duplicated configuration
  • policy coupling

OPA Isolation
The OPA container runs on an internal Docker network.
It is intentionally NOT exposed through Nginx.
Only the CLI can access OPA directly via:
http://localhost:8181

This prevents external users from:

  • querying policies
  • bypassing deployment logic
  • inspecting internal rules

This mirrors real production security architecture.

Pre-Deploy Policy Enforcement
Before deployment, SwiftDeploy collects:

  • CPU load
  • available disk space

Example payload:
{ "disk_free_gb": 8.5, "cpu_load": 2.4}
OPA evaluates the payload.

If policies fail:

Deployment blocked: Infrastructure policy violation
The deployment never proceeds.
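
A sketch of this gate, assuming the requests library and OPA's standard Data API (POST /v1/data/<package>/<rule>); note that os.getloadavg is Unix-only:

import os
import shutil
import sys
import requests

def predeploy_check(opa_url="http://localhost:8181"):
    # Gather host facts; OPA makes the actual decision
    payload = {"input": {
        "disk_free_gb": shutil.disk_usage("/").free / 1024**3,
        "cpu_load": os.getloadavg()[0],
    }}
    resp = requests.post(f"{opa_url}/v1/data/infra/allow",
                         json=payload, timeout=5)
    resp.raise_for_status()
    if not resp.json().get("result", False):
        sys.exit("Deployment blocked: Infrastructure policy violation")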

Canary Safety Enforcement
Before promotion, SwiftDeploy:

  • scrapes /metrics
  • calculates error rate
  • calculates P99 latency
  • submits metrics to OPA

If the canary is unhealthy:

  • promotion is blocked
  • rollout is prevented

This introduces production-grade deployment safety.
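
For illustration, the error rate can be derived from the scraped /metrics text like this (a simplified parser; P99 would be estimated from the http_request_duration_seconds histogram buckets in a similar pass):

import re
import requests

def error_rate(metrics_text):
    total = errors = 0.0
    for line in metrics_text.splitlines():
        m = re.match(r'http_requests_total\{(.*)\} ([\d.eE+-]+)', line)
        if not m:
            continue
        labels, value = m.group(1), float(m.group(2))
        total += value
        if 'status_code="5' in labels:  # count 5xx responses as errors
            errors += value
    return errors / total if total else 0.0

text = requests.get("http://localhost:8080/metrics", timeout=5).text
print(f"error rate: {error_rate(text):.2%}")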

The Status Dashboard
The status command provides a live operational dashboard.
./swiftdeploy status
The dashboard:

  • refreshes continuously
  • scrapes live metrics
  • calculates request rate
  • calculates P99 latency
  • evaluates policy compliance
  • appends results to history.jsonl

Example output:
SwiftDeploy Status Dashboard
==================================================
Mode: canary
Chaos: error
Error Rate: 52%
P99 Latency: 430ms
Policy Compliance:
✓ Infrastructure policy: PASSING
✗ Canary safety policy: FAILING

Chaos Engineering
This was one of the most interesting parts of the project.
I intentionally injected:

  • high error rates
  • slow responses

Example:
curl -X POST http://localhost:8080/chaos -d '{"mode":"error","rate":0.9}'

Immediately:

  • metrics reflected failures
  • policies began failing
  • promotions were blocked
This validated that:

  • metrics were accurate
  • policies were functional
  • safety gates worked correctly

Audit Logging
Every:

  • deploy
  • promote
  • status scrape
  • policy violation

is appended to history.jsonl.

Example entry:
{ "timestamp": "2026-05-06T12:00:00", "mode": "canary", "error_rate": 0.52}

Audit Report Generation
Running:
./swiftdeploy audit
generates:
audit_report.md

The report includes:

  • deployment timeline
  • mode changes
  • chaos injections
  • policy violations

Example:
| Timestamp | Policy | Details |
|-----------|--------|---------|
| 2026-05-06T00:47:10Z | Canary Safety | error_rate=50% |
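
One plausible way to produce the report is to replay history.jsonl and emit a row per violation; the field names here are hypothetical:

import json

rows = []
with open("history.jsonl") as f:
    for line in f:
        e = json.loads(line)
        if e.get("policy_violation"):  # hypothetical field name
            rows.append(f"| {e['timestamp']} | {e['policy_violation']} | {e.get('details', '')} |")

with open("audit_report.md", "w") as out:
    out.write("| Timestamp | Policy | Details |\n")
    out.write("|-----------|--------|---------|\n")
    out.write("\n".join(rows) + "\n")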

Challenges Faced
a. Python Virtual Environment Issues
Ubuntu’s externally-managed Python environment caused repeated package installation failures.

The solution was:

  • recreating the virtual environment
  • installing dependencies inside the venv only

b. Nginx Validation Problems
Generated Nginx configs initially failed validation due to unresolved upstream references.

Fix:

  • validate only inside container context
  • avoid host-side upstream resolution

c. Metrics Parsing

Calculating error rate and P99 latency from the Prometheus text format required careful parsing and aggregation.

d. OPA Failure Handling
The CLI had to gracefully handle:

  • OPA downtime
  • connection failures
  • malformed responses

The system never crashes when OPA becomes unavailable.
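
A sketch of this defensive pattern, assuming a fail-closed default (any OPA failure counts as a denial):

import requests

def query_opa(url, payload):
    try:
        resp = requests.post(url, json=payload, timeout=5)
        resp.raise_for_status()
        return bool(resp.json().get("result", False))
    except (requests.RequestException, ValueError) as exc:
        # Downtime, timeouts, and malformed JSON all land here
        print(f"OPA unreachable or malformed response ({exc}); denying by default")
        return False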

Lessons Learned
Declarative Systems Scale Better
A single source of truth drastically reduces configuration drift.

Observability Is Mandatory
Without metrics:

  • policy enforcement becomes impossible
  • deployments become blind

Policy Engines Should Be Isolated
Keeping OPA internal-only mirrors real enterprise architectures.

Chaos Engineering Builds Confidence
Breaking the system intentionally proved that:

  • metrics were accurate
  • policies were effective
  • safety mechanisms worked

Automation Must Be Explainable
Every policy response included human-readable reasoning.
This made debugging and operational decisions much easier.

Final Thoughts
Stage 4B transformed SwiftDeploy from a deployment generator into a lightweight deployment platform with:

  • observability
  • governance
  • auditing
  • deployment safety

The project demonstrated how:

  • metrics
  • policy engines
  • infrastructure generation
  • deployment orchestration

can work together to create reliable deployment systems.

Most importantly, it reinforced a key DevOps principle:

Safe automation is more valuable than fast automation.
