Introduction
In Stage 4A, I built a CLI tool (swiftdeploy) that generates infrastructure from a single file (manifest.yaml).
In Stage 4B, I extended it to include:
- Observability (metrics)
- Policy enforcement (OPA)
- Auditing (history + reports)
The goal was simple but strict:
The system must refuse to deploy or promote if it is unsafe.
This meant moving from just “running containers” to building a system that can think and decide before acting.
⸻
Architectural Overview
manifest.yaml
↓
swiftdeploy CLI
↓
docker-compose + nginx
↓
Docker Network
↓
[ NGINX ] → [ APP (/metrics) ]
↓
metrics
↓
CLI
↓
OPA
At a high level:
- manifest.yaml is the single source of truth
- swiftdeploy CLI reads it and generates:
  - docker-compose.yml
  - nginx.conf
- Docker runs:
  - API service
  - Nginx (reverse proxy)
  - OPA (policy engine)
Decision flow:
CLI → collect data → send to OPA → receive decision → deploy or block
The Design: A Tool That Writes Its Own Infrastructure
The core idea was:
I don’t manually write configs — I generate them.
Instead of editing multiple files, I only update:
manifest.yaml
then:
python swiftdeploy.py init
This generates:
- docker-compose.yml
- nginx.conf
Why this matters
- Reduces manual errors
- Keeps configuration consistent
- Makes the system reproducible
If I delete my generated configs, I can regenerate everything from the manifest.
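To make the generation step concrete, here is a minimal sketch of how a manifest could be turned into the two config files. The manifest keys (service_name, image, app_port, nginx_port) and the template contents are assumptions for illustration; the real manifest.yaml schema is not shown in this post. The manifest is represented as an already-parsed dict so the example needs only the standard library.

```python
from pathlib import Path
from string import Template

# Hypothetical manifest contents, as they might look after parsing manifest.yaml.
MANIFEST = {
    "service_name": "api",
    "image": "swiftdeploy-api:latest",
    "app_port": 3000,
    "nginx_port": 8080,
}

# Illustrative templates only; a real setup would likely carry more options.
COMPOSE_TEMPLATE = Template("""\
services:
  $service_name:
    image: $image
    expose:
      - "$app_port"
  nginx:
    image: nginx:alpine
    ports:
      - "$nginx_port:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - $service_name
""")

NGINX_TEMPLATE = Template("""\
events {}
http {
  server {
    listen 80;
    location / {
      proxy_pass http://$service_name:$app_port;
    }
  }
}
""")


def render_configs(manifest):
    """Return a mapping of file name to rendered content.

    substitute() raises KeyError if the manifest lacks a required key,
    so an incomplete manifest fails loudly instead of producing a broken config.
    """
    return {
        "docker-compose.yml": COMPOSE_TEMPLATE.substitute(manifest),
        "nginx.conf": NGINX_TEMPLATE.substitute(manifest),
    }


def write_configs(manifest, out_dir="."):
    """Write every rendered file into out_dir, overwriting older copies."""
    for name, content in render_configs(manifest).items():
        Path(out_dir, name).write_text(content, encoding="utf-8")
```

Because both files are pure functions of the manifest, deleting them and running the generator again yields the same output.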
Observability: Adding the “Eyes” (/metrics)
I added a /metrics endpoint to the API in Prometheus format.
It tracks:
Throughput & Errors
http_requests_total{method, path, status_code}
Latency
http_request_duration_seconds_bucket
Application State
app_uptime_seconds
app_mode (0=stable, 1=canary)
chaos_active
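As a rough illustration of what the endpoint returns, the sketch below renders the counter and the three state gauges in the Prometheus text exposition format. The function names and the in-memory Counter are assumptions, not the actual API code, and the latency histogram is left out to keep the example short.

```python
import time
from collections import Counter

START_TIME = time.monotonic()
REQUESTS = Counter()  # keyed by (method, path, status_code)


def record_request(method, path, status_code):
    """Count one finished request under its label combination."""
    REQUESTS[(method, path, str(status_code))] += 1


def render_metrics(app_mode=0, chaos_active=0):
    """Build the response body for GET /metrics."""
    lines = ["# TYPE http_requests_total counter"]
    for (method, path, status), count in sorted(REQUESTS.items()):
        lines.append(
            f'http_requests_total{{method="{method}",path="{path}",'
            f'status_code="{status}"}} {count}'
        )
    uptime = time.monotonic() - START_TIME
    lines += [
        "# TYPE app_uptime_seconds gauge",
        f"app_uptime_seconds {uptime:.0f}",
        "# TYPE app_mode gauge",
        f"app_mode {app_mode}",  # 0=stable, 1=canary
        "# TYPE chaos_active gauge",
        f"chaos_active {chaos_active}",
    ]
    return "\n".join(lines) + "\n"
```

In practice a client library such as prometheus_client would usually handle this formatting; the hand-rolled version is only meant to show what the scraped text looks like.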
The Guardrails: Policy Enforcement with OPA
Instead of writing logic inside the CLI, I used Open Policy Agent.
Key Rule:
The CLI must NOT decide anything — OPA decides everything.
🔹 Infrastructure Policy (Pre-Deploy)
Checks:
- Disk space
- CPU load
Example rule:
Deny if disk_free < 10GB
Deny if cpu_load > 2.0
If I artificially reduce disk space:
BLOCKED: Disk below threshold
👉 This satisfies the Hard Gate requirement
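A sketch of how the CLI side of this check might look is shown below. It gathers host facts, posts them to OPA's Data API as the input document, and only reports what OPA answered. The URL, the package path swiftdeploy/infra, the input field names, and the allow/deny shape of the result are all assumptions for illustration. Note that no threshold appears in the Python code; the 10GB and 2.0 limits would live in the policy.

```python
import json
import os
import shutil
import urllib.request

# Assumed address and policy path; adjust to the real OPA setup.
OPA_URL = "http://localhost:8181/v1/data/swiftdeploy/infra"


def collect_host_stats(path="/"):
    """Gather raw facts only; no thresholds are evaluated here."""
    usage = shutil.disk_usage(path)
    load_1m = os.getloadavg()[0]  # Unix only
    return {
        "disk_free_gb": round(usage.free / 1024**3, 2),
        "cpu_load": load_1m,
    }


def build_opa_request(url, input_doc):
    """Wrap the facts in the {"input": ...} envelope OPA's Data API expects."""
    body = json.dumps({"input": input_doc}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_decision(response_body):
    """Return (allowed, reasons) from OPA's response.

    Fails closed: a missing result or a missing allow key counts as a block.
    """
    result = json.loads(response_body).get("result", {})
    allowed = result.get("allow", False) is True
    reasons = list(result.get("deny", []))
    return allowed, reasons


def check_infra(url=OPA_URL):
    """Ask OPA whether deployment may proceed."""
    request = build_opa_request(url, collect_host_stats())
    with urllib.request.urlopen(request, timeout=5) as response:
        return parse_decision(response.read())
```

Failing closed matters here: if the policy is missing or OPA returns an undefined result, the deploy is blocked rather than waved through.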
⸻
🔹 Canary Safety Policy (Pre-Promote)
Before promoting, the CLI:
- Scrapes /metrics
- Calculates:
  - Error rate
  - P99 latency
- Sends both values to OPA
Policy:
Deny if error_rate > 1%
Deny if p99_latency > 500ms
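The two values can be derived from the scraped text roughly as follows. This is a simplified sketch under a few assumptions: errors are taken to be 5xx status codes, the parser handles only simple sample lines, and P99 is approximated by the upper bound of the first histogram bucket that covers 99% of requests (Prometheus itself interpolates inside the bucket, so this estimate is more conservative).

```python
import re

# name{labels} value [timestamp]
SAMPLE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+(\S+)(?:\s+\d+)?$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Parse Prometheus text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE.match(line)
        if match:
            name, labels, value = match.groups()
            samples.append((name, dict(LABEL.findall(labels or "")), float(value)))
    return samples


def error_rate(samples):
    """Fraction of requests with a 5xx status code (0.0 when there is no traffic)."""
    total = errors = 0.0
    for name, labels, value in samples:
        if name == "http_requests_total":
            total += value
            if labels.get("status_code", "").startswith("5"):
                errors += value
    return errors / total if total else 0.0


def p99_latency_seconds(samples):
    """Upper bound of the first bucket holding at least 99% of observations."""
    buckets = {}
    for name, labels, value in samples:
        if name == "http_request_duration_seconds_bucket":
            upper = float(labels["le"])  # float("+Inf") parses to infinity
            buckets[upper] = buckets.get(upper, 0.0) + value
    if not buckets:
        return None
    ordered = sorted(buckets.items())
    total = ordered[-1][1]  # the +Inf bucket holds the overall count
    if total == 0:
        return None
    for upper, cumulative in ordered:
        if cumulative >= 0.99 * total:
            return upper
    return None
```

The CLI would then send something like {"error_rate": ..., "p99_latency_ms": ...} to OPA and leave the comparison against 1% and 500ms to the policy.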
⸻
Why Isolation Matters
OPA runs as a separate container and:
- Is reachable by the CLI
- Is NOT exposed through Nginx
👉 This ensures:
- No external access to policy engine
- Clear separation of responsibilities
This satisfies the “No Leakage” requirement
⸻
🧪 The Chaos: Testing Failure Scenarios
I implemented a /chaos endpoint:
Modes:
- slow → delays responses
- error → randomly returns 500
- recover → resets system
{ "mode": "slow", "duration": 2 }
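One way the three modes could be modelled inside the API is sketched below. It is an illustration under stated assumptions, not the actual handler: duration is interpreted as seconds of delay added to each response, and error_probability is a hypothetical parameter standing in for however "randomly" is configured.

```python
import random


class ChaosState:
    """Holds the current chaos mode and decides how the next request behaves."""

    def __init__(self):
        self.mode = None
        self.delay_seconds = 0.0

    def apply(self, payload):
        """Update the state from a POST /chaos body such as {"mode": "slow", "duration": 2}."""
        mode = payload.get("mode")
        if mode == "slow":
            self.mode = "slow"
            self.delay_seconds = float(payload.get("duration", 1))
        elif mode == "error":
            self.mode = "error"
            self.delay_seconds = 0.0
        elif mode == "recover":
            self.mode = None
            self.delay_seconds = 0.0
        else:
            raise ValueError(f"unknown chaos mode: {mode!r}")

    @property
    def active(self):
        """Value to expose as the chaos_active gauge."""
        return 1 if self.mode else 0

    def plan_response(self, rng=random.random, error_probability=0.5):
        """Return (delay_seconds, status_code) for the next request."""
        if self.mode == "slow":
            return self.delay_seconds, 200
        if self.mode == "error" and rng() < error_probability:
            return 0.0, 500
        return 0.0, 200
```

Passing rng as a parameter keeps the random failure path deterministic in tests.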
What Happened
When I injected chaos:
- Latency increased
- Error rate increased
- Metrics reflected the change
When I tried to promote:
BLOCKED: Latency too high
👉 This confirmed:
The system reacts to real runtime conditions, not assumptions
⸻
The Eyes: swiftdeploy status
This command:
python swiftdeploy.py status
- Continuously scrapes /metrics
- Displays live system state
- Logs everything to:
history.jsonl
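Appending to a JSON Lines file takes only a few lines. The record shape below (timestamp, event, details) is an assumption for illustration; the actual fields in history.jsonl are not shown in this post.

```python
import json
from datetime import datetime, timezone


def append_history(path, event, details):
    """Append one JSON object per line so the log stays easy to stream and parse."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": event,
        "details": details,
    }
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record
```

Opening in append mode means earlier entries are never rewritten, which is what makes the file usable as an audit trail.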
The Memory: Audit System
From the logs in history.jsonl, I generate a report with:
python swiftdeploy.py audit
This creates:
audit_report.md
Contents:
- Timeline of events
- Policy violations
👉 The report renders cleanly in GitHub Markdown
(Satisfies submission requirement)
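A report like this can be produced with a small transformation over the log lines. The sketch below assumes each record has timestamp, event, and details fields and that blocked actions are logged with the event name policy_violation; both are hypothetical, as is the report layout.

```python
import json


def build_audit_report(history_lines):
    """Turn JSON Lines records into a Markdown report with a timeline and violations."""
    events = [json.loads(line) for line in history_lines if line.strip()]

    report = ["# Audit Report", "", "## Timeline", "", "| Time | Event |", "| --- | --- |"]
    for entry in events:
        report.append(f"| {entry.get('timestamp', '')} | {entry.get('event', '')} |")

    report += ["", "## Policy Violations", ""]
    violations = [e for e in events if e.get("event") == "policy_violation"]
    if violations:
        for entry in violations:
            reason = entry.get("details", {}).get("reason", "unknown")
            report.append(f"- {entry.get('timestamp', '')}: {reason}")
    else:
        report.append("None recorded.")

    return "\n".join(report) + "\n"
```

Writing the result to audit_report.md is then a single file write; the pipe table and dash list are standard GitHub-flavored Markdown.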
⸻
Lessons Learned
This stage changed how I think about DevOps:
- Deployment is not just execution
It’s decision-making
⸻
- Policies should be external
Keeping logic in OPA:
- makes it reusable
- avoids tightly coupled code
⸻
- Metrics are not just for monitoring
They actively drive decisions
⸻
- Debugging is part of the process
I faced:
- YAML errors
- Docker rebuild issues
- Nginx misconfigurations
- OPA connection failures
Fixing them helped me understand the system deeply.
⸻
✅ Final Checklist (Submission Criteria)
✔ manifest.yaml is the only edited file
✔ Deployment blocked when disk is low
✔ OPA not exposed via Nginx
✔ Metrics fully implemented
✔ Audit report generated and readable
✔ Blog includes architecture diagram
⸻
Conclusion
This project helped me move from:
running commands → building systems that enforce rules
I now better understand how:
- observability
- policy
- infrastructure
work together in real-world systems.
⸻
If you’re learning DevOps, my biggest takeaway is:
Don’t just deploy — build systems that decide when deployment is safe.