Introduction
As part of the HNG Internship DevOps Track Stage 4B, I extended my Stage 4A project — SwiftDeploy — into a fully observable, policy-aware deployment platform.
In Stage 4A, SwiftDeploy could:
- generate infrastructure files from a declarative manifest
- deploy containers using Docker Compose
- manage deployment modes (stable/canary)
- configure Nginx automatically
Stage 4B transformed it into something much closer to a real production deployment system by adding:
- Prometheus instrumentation
- Open Policy Agent (OPA) policy enforcement
- live operational dashboards
- deployment safety gates
- audit logging and reporting
- chaos engineering validation
The result is a deployment tool that not only deploys services, but also decides whether deployments are safe enough to proceed.
The Core Philosophy: One Manifest, Everything Else Generated
SwiftDeploy is built around a single principle:
manifest.yaml is the only file you should ever edit manually.
Everything else is generated from it.
Here is the manifest structure:
```yaml
services:
  name: app
  image: swift-deploy-1-node:latest
  port: 3000
  version: "1.0.0"
  mode: stable

nginx:
  image: nginx:latest
  port: 8080
  proxy_timeout: 30

network:
  name: swiftdeploy-net
  driver_type: bridge
```
From this manifest, the CLI generates:
- generated/nginx.conf
- generated/docker-compose.yml
- OPA runtime configuration
This design provides:
- consistency
- reproducibility
- environment portability
- infrastructure-as-code discipline
The grader can delete all generated files and rerun:
./swiftdeploy init
and the entire stack regenerates correctly.
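To make this concrete, here is a minimal sketch of what the generation step can look like, assuming Jinja2 templates (the template filenames and helper below are illustrative, not the actual SwiftDeploy source):

```python
# Illustrative sketch of the init generation step, assuming Jinja2
# templates; names here are hypothetical, not the real CLI internals.
from pathlib import Path

import yaml
from jinja2 import Environment, FileSystemLoader


def generate(manifest_path="manifest.yaml", out_dir="generated"):
    # Load the single source of truth.
    manifest = yaml.safe_load(Path(manifest_path).read_text())

    env = Environment(loader=FileSystemLoader("templates"))
    Path(out_dir).mkdir(exist_ok=True)

    # Every generated file is a pure function of the manifest, so
    # deleting generated/ and re-running init is always safe.
    for template_name, output_name in [
        ("nginx.conf.j2", "nginx.conf"),
        ("docker-compose.yml.j2", "docker-compose.yml"),
    ]:
        rendered = env.get_template(template_name).render(**manifest)
        (Path(out_dir) / output_name).write_text(rendered)


if __name__ == "__main__":
    generate()
```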
Architecture Overview
The system's request and decision flow runs through the following components:

```
User
  ↓
Nginx Reverse Proxy
  ↓
Flask API Service
  ↓
Prometheus Metrics
  ↓
SwiftDeploy CLI
  ↓
OPA Policy Engine
```
The deployment stack includes:
- Flask application container
- Nginx reverse proxy
- Open Policy Agent (OPA)
- internal Docker network
- named log volumes
The SwiftDeploy CLI
The heart of the project is the swiftdeploy executable.
It is a Python-based CLI tool that manages the entire deployment lifecycle.
Supported Commands
| Command | Purpose |
|---------|---------|
| init | Generate config files from templates |
| validate | Run pre-flight validation checks |
| deploy | Start the stack |
| promote canary | Switch deployment into canary mode |
| promote stable | Return deployment to stable mode |
| status | Live metrics dashboard |
| audit | Generate audit report |
| teardown | Destroy containers and networks |
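As a rough sketch, a command surface like this can be wired with argparse subcommands (illustrative only; the real CLI's internals may differ):

```python
# Hypothetical argparse wiring for the command table above,
# not the actual SwiftDeploy source.
import argparse


def main():
    parser = argparse.ArgumentParser(prog="swiftdeploy")
    sub = parser.add_subparsers(dest="command", required=True)

    for name in ("init", "validate", "deploy", "status", "audit", "teardown"):
        sub.add_parser(name)

    # promote takes a positional mode: canary or stable
    promote = sub.add_parser("promote")
    promote.add_argument("mode", choices=["canary", "stable"])

    args = parser.parse_args()
    print(f"running: {args.command}")  # dispatch to the real handlers here


if __name__ == "__main__":
    main()
```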
The API Service
The API service is a Flask application that supports both stable and canary deployment modes.
Deployment mode is controlled through the MODE environment variable.
Endpoints
Root Endpoint
GET /
Returns:
- deployment mode
- version
- timestamp
Example:
{ "message": "Welcome to SwiftDeploy", "mode": "stable", "version": "1.0.0"}
Health Endpoint
GET /healthz
Returns:
- health status
- application uptime
Chaos Endpoint
POST /chaos
Available only in canary mode.
Supports:
{ "mode": "slow", "duration": 3 }
{ "mode": "error", "rate": 0.5 }
{ "mode": "recover" }
This endpoint was used to simulate:
- degraded latency
- random failures
- recovery workflows
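Below is a minimal Flask sketch of how such a chaos handler can work, matching the payloads above (an assumed implementation, not the actual SwiftDeploy source):

```python
# Assumed /chaos implementation sketch; only the payload shapes and
# the canary-only rule come from the real service.
import os
import random
import time

from flask import Flask, jsonify, request

app = Flask(__name__)
MODE = os.environ.get("MODE", "stable")
chaos = {"mode": None, "duration": 0, "rate": 0.0}


@app.route("/chaos", methods=["POST"])
def set_chaos():
    if MODE != "canary":
        return jsonify(error="chaos is only available in canary mode"), 403
    body = request.get_json(force=True)
    if body.get("mode") == "recover":
        chaos.update(mode=None, duration=0, rate=0.0)
    else:
        chaos.update(body)  # e.g. {"mode": "slow", "duration": 3}
    return jsonify(chaos=chaos)


@app.before_request
def inject_chaos():
    if request.path == "/chaos":
        return  # never break the recovery path
    if chaos["mode"] == "slow":
        time.sleep(chaos.get("duration", 0))  # degraded latency
    elif chaos["mode"] == "error" and random.random() < chaos.get("rate", 0):
        return jsonify(error="injected failure"), 500  # random failures
```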
Instrumentation: The /metrics Endpoint
One of the biggest upgrades in Stage 4B was observability.
I instrumented the Flask service using the prometheus_client library.
The service now exposes:
GET /metrics
in Prometheus text format.
Metrics Collected
Request Throughput
http_requests_total
Labels:
- method
- path
- status_code
Example:
http_requests_total{method="GET",path="/",status_code="200"} 152
Request Latency
http_request_duration_seconds
A histogram used for:
- latency analysis
- P99 calculation
Application Uptime
app_uptime_seconds
Tracks process uptime.
Deployment Mode
app_mode
Values:
- 0 = stable
- 1 = canary
Chaos State
chaos_active
Values:
- 0 = none
- 1 = slow
- 2 = error
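Putting these together, here is a condensed sketch of how metrics like the ones above can be declared and exposed with prometheus_client (the request hooks are an assumed wiring; only the metric names come from the list above):

```python
# Instrumentation sketch; metric names match the ones documented
# above, but the Flask wiring is an assumption.
import time

from flask import Flask, g, request
from prometheus_client import Counter, Gauge, Histogram, generate_latest

app = Flask(__name__)
START = time.time()

REQUESTS = Counter("http_requests_total", "Total HTTP requests",
                   ["method", "path", "status_code"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency",
                    ["method", "path"])
UPTIME = Gauge("app_uptime_seconds", "Process uptime in seconds")
APP_MODE = Gauge("app_mode", "0 = stable, 1 = canary")
CHAOS_ACTIVE = Gauge("chaos_active", "0 = none, 1 = slow, 2 = error")


@app.before_request
def start_timer():
    g.start_time = time.time()


@app.after_request
def record(response):
    REQUESTS.labels(request.method, request.path, response.status_code).inc()
    LATENCY.labels(request.method, request.path).observe(
        time.time() - g.start_time)
    return response


@app.route("/metrics")
def metrics():
    UPTIME.set(time.time() - START)
    return generate_latest(), 200, {"Content-Type": "text/plain; version=0.0.4"}
```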
Why Metrics Matter
Without metrics:
- deployments are blind
- failures become invisible
- canary safety cannot be enforced
Metrics became the foundation for:
- policy decisions
- dashboards
- auditing
- promotion safety
Open Policy Agent (OPA): The Brain of SwiftDeploy
The most important design principle in Stage 4B was:
The CLI must never make allow/deny decisions itself.
All decision-making lives entirely inside OPA.
SwiftDeploy only:
- gathers data
- sends context to OPA
- acts on the response
This separation makes the system:
- modular
- secure
- maintainable
- extensible
OPA Policy Domains
I separated policies into independent domains.
Each policy:
- answers one question
- owns its own logic
- operates independently
Infrastructure Policy
Runs before deployment.
Blocks deployment when:
- disk free space is below 10GB
- CPU load exceeds 2.0
Rego Example
```rego
package infra

default allow = false

allow {
  input.disk_free_gb >= data.thresholds.disk_free_gb
  input.cpu_load <= data.thresholds.cpu_load
}
```
Canary Safety Policy
Runs before promotion.
Blocks promotion when:
- error rate exceeds 1%
- P99 latency exceeds 500ms

Rego Example

```rego
package canary

default allow = false

allow {
  input.error_rate <= data.thresholds.error_rate
  input.p99_latency_ms <= data.thresholds.p99_latency_ms
}
```
Policy Thresholds
Thresholds are stored separately in:
policies/data.json
Example:
{ "thresholds": { "disk_free_gb": 10, "cpu_load": 2.0, "error_rate": 0.01, "p99_latency_ms": 500 }}
This prevents:
- hardcoded values
- duplicated configuration
- policy coupling

OPA Isolation
The OPA container runs on an internal Docker network. It is intentionally NOT exposed through Nginx.
Only the CLI can access OPA directly via:
http://localhost:8181
This prevents external users from:
- querying policies
- bypassing deployment logic
- inspecting internal rules

This mirrors real production security architecture.
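As a sketch, the compose wiring for an isolated OPA container can look like this, assuming the official openpolicyagent/opa image (service names and paths here are illustrative, not the generated file itself):

```yaml
# Illustrative OPA service definition; binding the published port to
# 127.0.0.1 keeps it reachable from the host-side CLI only.
services:
  opa:
    image: openpolicyagent/opa:latest
    command: ["run", "--server", "--addr", "0.0.0.0:8181", "/policies"]
    volumes:
      - ./policies:/policies   # .rego files plus data.json thresholds
    ports:
      - "127.0.0.1:8181:8181"  # loopback only; never routed via Nginx
    networks:
      - swiftdeploy-net

networks:
  swiftdeploy-net:
    driver: bridge
```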
Pre-Deploy Policy Enforcement
Before deployment, SwiftDeploy collects:
- CPU load
- available disk space
Example payload:
{ "disk_free_gb": 8.5, "cpu_load": 2.4}
OPA evaluates the payload.
If policies fail:
```
Deployment blocked: Infrastructure policy violation
```
The deployment never proceeds.
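A minimal sketch of this gate, assuming OPA's standard Data API (the helper below is hypothetical, not the actual CLI source), might look like:

```python
# Hypothetical pre-deploy gate: gather host facts, ask OPA, act on
# the verdict. Only the policy path and payload shape come from the post.
import os
import shutil
import sys

import requests

OPA_URL = "http://localhost:8181/v1/data/infra/allow"


def predeploy_gate():
    payload = {
        "disk_free_gb": shutil.disk_usage("/").free / 1e9,
        "cpu_load": os.getloadavg()[0],  # 1-minute load average
    }
    # OPA Data API: POST /v1/data/<path> with {"input": ...}
    resp = requests.post(OPA_URL, json={"input": payload}, timeout=5)
    if not resp.json().get("result", False):
        sys.exit("Deployment blocked: Infrastructure policy violation")


if __name__ == "__main__":
    predeploy_gate()
```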
Canary Safety Enforcement
Before promotion, SwiftDeploy:
- scrapes /metrics
- calculates error rate
- calculates P99 latency
- submits metrics to OPA
If the canary is unhealthy:
- promotion is blocked
- rollout is prevented

This introduces production-grade deployment safety.
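Here is a sketch of how the error rate and P99 can be derived from a raw /metrics scrape (an assumed implementation; the real CLI's parsing may differ):

```python
# Assumed canary-health calculation: aggregate counter and histogram
# samples from the Prometheus text format.
import requests
from prometheus_client.parser import text_string_to_metric_families


def canary_health(metrics_url="http://localhost:8080/metrics"):
    text = requests.get(metrics_url, timeout=5).text
    total = errors = 0.0
    buckets = {}  # histogram upper bound ("le") -> cumulative count

    for family in text_string_to_metric_families(text):
        for s in family.samples:
            if s.name == "http_requests_total":
                total += s.value
                if s.labels.get("status_code", "").startswith("5"):
                    errors += s.value
            elif s.name == "http_request_duration_seconds_bucket":
                le = s.labels["le"]
                buckets[le] = buckets.get(le, 0.0) + s.value

    error_rate = errors / total if total else 0.0

    # P99 estimate: smallest bucket covering 99% of all observations.
    count = buckets.get("+Inf", 0.0)
    p99_seconds = 0.0
    for le in sorted(buckets, key=float):
        if count and buckets[le] >= 0.99 * count:
            p99_seconds = float(le)
            break
    return error_rate, p99_seconds * 1000  # ms, for the OPA payload
```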
The Status Dashboard
The status command provides a live operational dashboard.
./swiftdeploy status
The dashboard:
- refreshes continuously
- scrapes live metrics
- calculates request rate
- calculates P99 latency
- evaluates policy compliance
- appends results to history.jsonl
Example output:
```
SwiftDeploy Status Dashboard
==================================================
Mode: canary
Chaos: error
Error Rate: 52%
P99 Latency: 430ms

Policy Compliance:
✓ Infrastructure policy: PASSING
✗ Canary safety policy: FAILING
```
Chaos Engineering
This was one of the most interesting parts of the project.
I intentionally injected:
- high error rates
- slow responses
Example:
curl -X POST http://localhost:8080/chaos -d '{"mode":"error","rate":0.9}'
Immediately:
- metrics reflected failures
- policies began failing
- promotions were blocked

This validated that:
- metrics were accurate
- policies were functional
- safety gates worked correctly
Audit Logging
Every:
- deploy
- promote
- status scrape
- policy violation

is appended to history.jsonl.
Example entry:
{ "timestamp": "2026-05-06T12:00:00", "mode": "canary", "error_rate": 0.52}
Audit Report Generation
Running:
./swiftdeploy audit
generates:
audit_report.md
The report includes:
- deployment timeline
- mode changes
- chaos injections
- policy violations
Example:
| Timestamp | Policy | Details |
|-----------|--------|---------|
| 2026-05-06T00:47:10Z | Canary Safety | error_rate=50% |
Challenges Faced
a. Python Virtual Environment Issues
Ubuntu’s externally-managed Python environment caused repeated package installation failures.
The solution was:
- recreating the virtual environment
- installing dependencies inside the venv only
b. Nginx Validation Problems
Generated Nginx configs initially failed validation due to unresolved upstream references.
Fix:
- validate only inside container context
- avoid host-side upstream resolution
c. Metrics Parsing
Calculating error rate and P99 latency from the Prometheus text format required careful parsing and aggregation.
d. OPA Failure Handling
The CLI had to gracefully handle:
- OPA downtime
- connection failures
- malformed responses

The system never crashes when OPA becomes unavailable.
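A sketch of that fail-safe behavior, assuming a fail-closed design (illustrative, not the actual CLI source):

```python
# Assumed fail-closed OPA client: any outage or malformed response
# is treated as a denial, never a crash.
import requests


def ask_opa(url: str, payload: dict) -> tuple[bool, str]:
    try:
        resp = requests.post(url, json={"input": payload}, timeout=3)
        resp.raise_for_status()
        result = resp.json().get("result")
        if not isinstance(result, bool):
            return False, "malformed OPA response"
        return result, "ok"
    except (requests.RequestException, ValueError) as exc:
        return False, f"OPA unavailable: {exc}"
```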
Lessons Learned
Declarative Systems Scale Better
A single source of truth drastically reduces configuration drift.
Observability Is Mandatory
Without metrics:
- policy enforcement becomes impossible
- deployments become blind
Policy Engines Should Be Isolated
Keeping OPA internal-only mirrors real enterprise architectures.
Chaos Engineering Builds Confidence
Breaking the system intentionally proved that:
- metrics were accurate
- policies were effective
- safety mechanisms worked
Automation Must Be Explainable
Every policy response included human-readable reasoning.
This made debugging and operational decisions much easier.
Final Thoughts
Stage 4B transformed SwiftDeploy from a deployment generator into a lightweight deployment platform with:
- observability
- governance
- auditing
- deployment safety
The project demonstrated how:
- metrics
- policy engines
- infrastructure generation
- deployment orchestration

can work together to create reliable deployment systems.

Most importantly, it reinforced a key DevOps principle:

Safe automation is more valuable than fast automation.