Edith Asante
Building a Policy-Gated Deployment System with Observability (SwiftDeploy Stage 4B)

Introduction

In Stage 4A, I built a CLI tool (swiftdeploy) that generates infrastructure from a single file (manifest.yaml).
In Stage 4B, I extended it to include:

  • Observability (metrics)
  • Policy enforcement (OPA)
  • Auditing (history + reports)

The goal was simple but strict:

The system must refuse to deploy or promote if it is unsafe.

This meant moving from just “running containers” to building a system that can think and decide before acting.

Architectural Overview

```
manifest.yaml
      ↓
swiftdeploy CLI
      ↓
docker-compose + nginx
      ↓
Docker Network

[ NGINX ] → [ APP (/metrics) ]
                ↓
             metrics
                ↓
              CLI
                ↓
              OPA
```

At a high level:

  • manifest.yaml is the single source of truth
  • swiftdeploy CLI reads it and generates:
    • docker-compose.yml
    • nginx.conf
  • Docker runs:
    • API service
    • Nginx (reverse proxy)
    • OPA (policy engine)

Flow:
CLI → collect data → send to OPA → receive decision → deploy or block
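A minimal Python sketch of that flow; the OPA URL and policy path below are illustrative assumptions, not swiftdeploy's actual ones:

```python
# Sketch of the CLI -> OPA decision flow. The policy path
# "swiftdeploy/deploy/allow" is a hypothetical example.
import json
import urllib.request

OPA_URL = "http://localhost:8181/v1/data/swiftdeploy/deploy/allow"

def ask_opa(payload: dict, opa_url: str = OPA_URL) -> bool:
    """POST collected data to OPA and return its allow/deny decision."""
    req = urllib.request.Request(
        opa_url,
        data=json.dumps({"input": payload}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("result", False)

def deploy_or_block(payload: dict, decide=None) -> str:
    """Deploy only if the policy engine allows it; the CLI itself decides nothing."""
    decide = decide or ask_opa
    return "DEPLOY" if decide(payload) else "BLOCKED"
```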

The Design: A Tool That Writes Its Own Infrastructure

The core idea was:

I don’t manually write configs — I generate them.

Instead of editing multiple files, I only update:
manifest.yaml

then:

```
python swiftdeploy.py init
```

This generates:

  • docker-compose.yml
  • nginx.conf

Why this matters

  • Reduces manual errors
  • Keeps configuration consistent
  • Makes the system reproducible

If I delete my configs, I can regenerate everything from the manifest.
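As a rough illustration of the generate-don't-hand-write idea, here is a minimal sketch; the real swiftdeploy manifest schema and templates are richer, and the keys below are assumptions:

```python
# Sketch of config generation from a parsed manifest
# (the manifest keys "app", "name", "image", "port" are illustrative).

def render_compose(manifest: dict) -> str:
    """Emit a minimal docker-compose.yml from manifest data."""
    app = manifest["app"]
    return (
        "services:\n"
        f"  {app['name']}:\n"
        f"    image: {app['image']}\n"
        f"    ports:\n      - \"{app['port']}:{app['port']}\"\n"
    )

def render_nginx(manifest: dict) -> str:
    """Emit a minimal nginx.conf proxying to the app service."""
    app = manifest["app"]
    return (
        "server {\n"
        "  listen 80;\n"
        f"  location / {{ proxy_pass http://{app['name']}:{app['port']}; }}\n"
        "}\n"
    )
```

Either output can then be regenerated at any time from the manifest alone.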

Observability: Adding the “Eyes” (/metrics)

I added a /metrics endpoint to the API in Prometheus format.

It tracks:

  1. Throughput & Errors: `http_requests_total{method, path, status_code}`
  2. Latency: `http_request_duration_seconds_bucket`
  3. Application State: `app_uptime_seconds`, `app_mode` (0 = stable, 1 = canary), `chaos_active`
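A hand-rolled sketch of what the /metrics endpoint emits in Prometheus text format (a real implementation would likely use a client library such as prometheus_client; only the metric names above come from the project):

```python
# Minimal Prometheus text exposition by hand, for illustration.
import time

START = time.monotonic()
REQUESTS: dict = {}  # (method, path, status_code) -> count

def count_request(method: str, path: str, status_code: int) -> None:
    """Increment the labeled request counter."""
    key = (method, path, status_code)
    REQUESTS[key] = REQUESTS.get(key, 0) + 1

def render_metrics(app_mode: int = 0, chaos_active: int = 0) -> str:
    """Render current state in Prometheus exposition format."""
    lines = ["# TYPE http_requests_total counter"]
    for (method, path, status), n in REQUESTS.items():
        lines.append(
            f'http_requests_total{{method="{method}",path="{path}",status_code="{status}"}} {n}'
        )
    lines.append(f"app_uptime_seconds {time.monotonic() - START:.0f}")
    lines.append(f"app_mode {app_mode}")        # 0 = stable, 1 = canary
    lines.append(f"chaos_active {chaos_active}")
    return "\n".join(lines) + "\n"
```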

The Guardrails: Policy Enforcement with OPA

Instead of writing logic inside the CLI, I used Open Policy Agent.

Key Rule:

The CLI must NOT decide anything — OPA decides everything.

🔹 Infrastructure Policy (Pre-Deploy)

Checks:

  • Disk space
  • CPU load

Example rules:

  • Deny if `disk_free` < 10 GB
  • Deny if `cpu_load` > 2.0

If I artificially reduce disk space:

BLOCKED: Disk below threshold

👉 This satisfies the Hard Gate requirement

🔹 Canary Safety Policy (Pre-Promote)

Before promoting, the CLI:

  1. Scrapes /metrics
  2. Calculates:
    • Error rate
    • P99 latency
  3. Sends to OPA

Policy:

  • Deny if error rate > 1%
  • Deny if P99 latency > 500 ms
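For illustration, the pre-promote calculation and a local mirror of those rules might look like this; in the real flow OPA, not the CLI, makes the decision:

```python
# Sketch of the pre-promote check: derive error rate and p99 latency
# from scraped metrics, then build the input the CLI sends to OPA.

def canary_input(total: int, errors: int, p99_latency_s: float) -> dict:
    """Build the payload sent to OPA before promotion."""
    error_rate = errors / total if total else 0.0
    return {"error_rate": error_rate, "p99_latency_ms": p99_latency_s * 1000}

def violates_canary_policy(inp: dict) -> bool:
    """Local mirror of the OPA rules, for illustration only."""
    return inp["error_rate"] > 0.01 or inp["p99_latency_ms"] > 500
```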

Why Isolation Matters

OPA runs as a separate container and:

  • Is reachable by the CLI
  • Is NOT exposed through Nginx

👉 This ensures:

  • No external access to policy engine
  • Clear separation of responsibilities

This satisfies the “No Leakage” requirement

🧪 The Chaos: Testing Failure Scenarios

I implemented a /chaos endpoint:

Modes:

  • slow → delays responses
  • error → randomly returns 500
  • recover → resets system

```json
{ "mode": "slow", "duration": 2 }
```
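A framework-agnostic sketch of how those modes could behave; the real endpoint's wiring into the API is specific to swiftdeploy:

```python
# Illustrative chaos-mode state machine: slow, error, recover.
import random
import time

CHAOS = {"mode": None, "duration": 0}

def set_chaos(mode: str, duration: float = 0) -> None:
    """Activate a chaos mode, or reset to normal with "recover"."""
    if mode == "recover":
        CHAOS.update(mode=None, duration=0)
    else:
        CHAOS.update(mode=mode, duration=duration)

def handle_request() -> int:
    """Return a status code, applying the active chaos mode first."""
    if CHAOS["mode"] == "slow":
        time.sleep(CHAOS["duration"])      # inflate latency
    if CHAOS["mode"] == "error" and random.random() < 0.5:
        return 500                          # random server errors
    return 200
```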

What Happened

When I injected chaos:

  • Latency increased
  • Error rate increased
  • Metrics reflected the change

When I tried to promote:

```
BLOCKED: Latency too high
```

👉 This confirmed:
The system reacts to real runtime conditions, not assumptions

The Eyes: swiftdeploy status

This command:

```
python swiftdeploy.py status
```

  • Continuously scrapes /metrics
  • Displays live system state
  • Logs everything to `history.jsonl`
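The logging side of that loop can be sketched as appending one JSON object per line; the exact record fields in history.jsonl are an assumption here:

```python
# Append one timestamped metrics snapshot per line (JSON Lines format).
import json
import time

def log_snapshot(path: str, metrics: dict) -> None:
    """Append a snapshot to the history file as a single JSON line."""
    record = {"ts": time.time(), **metrics}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```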

The Memory: Audit System

From the logs, I generate:

```
python swiftdeploy.py audit
```

This creates `audit_report.md`.

Contents:

  • Timeline of events
  • Policy violations

👉 The report renders cleanly in GitHub Markdown
(Satisfies submission requirement)
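One way such a report could be generated from the history file; the record shape and report layout below are assumptions, not swiftdeploy's exact format:

```python
# Turn JSON Lines history into a Markdown timeline with violations flagged.
import json

def build_audit_report(jsonl_text: str) -> str:
    """Render a Markdown audit report from history records."""
    lines = ["# Audit Report", "", "## Timeline"]
    violations = []
    for raw in jsonl_text.splitlines():
        event = json.loads(raw)
        lines.append(f"- `{event['ts']}` {event['event']}")
        if event.get("blocked"):
            violations.append(event)
    lines += ["", "## Policy Violations"]
    lines += [f"- `{v['ts']}` {v['event']}" for v in violations] or ["- none"]
    return "\n".join(lines) + "\n"
```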

Lessons Learned

This stage changed how I think about DevOps:

  1. Deployment is not just execution

It’s decision-making

  2. Policies should be external

Keeping logic in OPA:

  • makes it reusable
  • avoids tightly coupled code

  3. Metrics are not just for monitoring

They actively drive decisions

  4. Debugging is part of the process

I faced:

  • YAML errors
  • Docker rebuild issues
  • Nginx misconfigurations
  • OPA connection failures

Fixing them helped me understand the system deeply.

✅ Final Checklist (Submission Criteria)

✔ manifest.yaml is the only edited file
✔ Deployment blocked when disk is low
✔ OPA not exposed via Nginx
✔ Metrics fully implemented
✔ Audit report generated and readable
✔ Blog includes architecture diagram

Conclusion

This project helped me move from:

running commands → building systems that enforce rules

I now better understand how:

  • observability
  • policy
  • infrastructure

work together in real-world systems.

If you’re learning DevOps, my biggest takeaway is:

Don’t just deploy — build systems that decide when deployment is safe.
