<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kelechi Uba</title>
    <description>The latest articles on DEV Community by Kelechi Uba (@kelechi_uba_d8ec694684838).</description>
    <link>https://dev.to/kelechi_uba_d8ec694684838</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3916545%2F76de581c-086e-4a6c-be61-c1c975a9d7bc.jpg</url>
      <title>DEV Community: Kelechi Uba</title>
      <link>https://dev.to/kelechi_uba_d8ec694684838</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kelechi_uba_d8ec694684838"/>
    <language>en</language>
    <item>
      <title>I Built a CLI That Writes Its Own Docker Config — Then Taught It to Say No</title>
      <dc:creator>Kelechi Uba</dc:creator>
      <pubDate>Fri, 08 May 2026 13:28:17 +0000</pubDate>
      <link>https://dev.to/kelechi_uba_d8ec694684838/i-built-a-cli-that-writes-its-own-docker-config-then-taught-it-to-say-no-4on7</link>
      <guid>https://dev.to/kelechi_uba_d8ec694684838/i-built-a-cli-that-writes-its-own-docker-config-then-taught-it-to-say-no-4on7</guid>
      <description>&lt;p&gt;Every time I set up a stack from scratch I'd end up touching at least four files: &lt;code&gt;docker-compose.yml&lt;/code&gt;, &lt;code&gt;nginx.conf&lt;/code&gt;, a &lt;code&gt;.env&lt;/code&gt; file, maybe a &lt;code&gt;Makefile&lt;/code&gt;. Change the port in one place and forget to update the others and something silently breaks. I wanted to fix that. Stage 4A was the fix. Stage 4B was the moment I realised the fix was incomplete.&lt;/p&gt;

&lt;p&gt;This post covers the whole journey: how I built &lt;code&gt;swiftdeploy&lt;/code&gt;, why I wired in Prometheus metrics and an OPA policy sidecar, and what actually happened when I deliberately tried to break my own canary deployment.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 4A: One file, everything else is generated
&lt;/h2&gt;

&lt;p&gt;The idea was simple. One file — &lt;code&gt;manifest.yaml&lt;/code&gt; — owns every setting. The CLI reads it and writes &lt;code&gt;nginx.conf&lt;/code&gt; and &lt;code&gt;docker-compose.yml&lt;/code&gt;. You never touch the generated files. If you need to change something, you change the manifest and run &lt;code&gt;./swiftdeploy init&lt;/code&gt; again.&lt;/p&gt;

&lt;p&gt;At its core, the manifest looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;swiftdeploy-stage4b-app:1.0.0&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stable&lt;/span&gt;

&lt;span class="na"&gt;nginx&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx:latest&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;18080&lt;/span&gt;
  &lt;span class="na"&gt;proxy_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;

&lt;span class="na"&gt;network&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;swiftdeploy-net&lt;/span&gt;
  &lt;span class="na"&gt;driver_type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bridge&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;swiftdeploy init&lt;/code&gt; takes that and renders two generated files using Python's &lt;code&gt;string.Template&lt;/code&gt;. The templates live in &lt;code&gt;templates/&lt;/code&gt; and contain &lt;code&gt;${VARIABLE}&lt;/code&gt; placeholders that get substituted from the manifest context. Here is the critical bit from &lt;code&gt;config.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;render_templates&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;ensure_policy_source&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;manifest_context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;          &lt;span class="c1"&gt;# reads every ${VAR} from manifest.yaml
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tmpl_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_path&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;NGINX_TMPL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NGINX_OUT&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;COMPOSE_TMPL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;COMPOSE_OUT&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="n"&gt;rendered&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmpl_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;safe_substitute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;atomic_write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rendered&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I used &lt;code&gt;safe_substitute&lt;/code&gt; instead of &lt;code&gt;substitute&lt;/code&gt; because &lt;code&gt;substitute&lt;/code&gt; raises an exception on any unknown &lt;code&gt;${...}&lt;/code&gt; token. Nginx config files are full of variables like &lt;code&gt;${request_time}&lt;/code&gt; — if I had used &lt;code&gt;substitute&lt;/code&gt;, rendering would blow up on every nginx variable. &lt;code&gt;safe_substitute&lt;/code&gt; leaves tokens it doesn't recognise alone, so nginx gets its variables and the manifest gets its values.&lt;/p&gt;
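&lt;p&gt;The difference is easy to see with a toy template (the values here are illustrative, not from the repo):&lt;/p&gt;

```python
from string import Template

# ${NGINX_PORT} comes from the manifest; ${request_time} is an nginx
# runtime variable that must survive rendering untouched.
line = Template('listen ${NGINX_PORT}; log_format t "${request_time}";')

print(line.safe_substitute({"NGINX_PORT": "18080"}))
# -> listen 18080; log_format t "${request_time}";

# substitute() refuses to leave ${request_time} alone:
try:
    line.substitute({"NGINX_PORT": "18080"})
except KeyError as exc:
    print("substitute raised KeyError for:", exc)
```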

&lt;p&gt;The &lt;code&gt;atomic_write&lt;/code&gt; helper writes to a temp file first, then does &lt;code&gt;os.replace&lt;/code&gt; into the final path. Without that, a crash mid-write would leave a half-written, corrupt config on disk. &lt;code&gt;os.replace&lt;/code&gt; is atomic on every OS Python runs on, so you either get the new file or the old one, never half of each.&lt;/p&gt;
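&lt;p&gt;A minimal sketch of that helper (the repo's real &lt;code&gt;atomic_write&lt;/code&gt; may differ in details):&lt;/p&gt;

```python
import os
import tempfile
from pathlib import Path

def atomic_write(path: Path, text: str) -> None:
    """Write text so readers only ever see the old file or the new one."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        os.replace(tmp_name, path)  # atomic swap into the final name
    except BaseException:
        os.unlink(tmp_name)  # clean up the partial temp file
        raise
```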

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdal985fca5gwpga1q666.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdal985fca5gwpga1q666.jpg" alt="manifest.yaml feeds swiftdeploy init which renders nginx.conf and docker-compose.yml — generated files carry DO NOT HAND-EDIT headers" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The app
&lt;/h3&gt;

&lt;p&gt;The API service is a FastAPI app with three endpoints: &lt;code&gt;GET /&lt;/code&gt; returns the mode and version, &lt;code&gt;GET /healthz&lt;/code&gt; returns uptime, and &lt;code&gt;POST /chaos&lt;/code&gt; lets you inject failure (more on that later). The &lt;code&gt;MODE&lt;/code&gt; environment variable controls whether the app is in stable or canary mode — same image, different behaviour. In canary mode every response carries an &lt;code&gt;X-Mode: canary&lt;/code&gt; header.&lt;/p&gt;
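&lt;p&gt;Framework aside, the mode behaviour boils down to a tiny function (names below are mine, not the repo's):&lt;/p&gt;

```python
import os

VERSION = "1.0.0"

def root_response(mode: str) -> tuple[dict, dict]:
    """Body and extra headers for GET /: canary mode tags every response."""
    body = {"mode": mode, "version": VERSION}
    headers = {"X-Mode": "canary"} if mode == "canary" else {}
    return body, headers

# MODE is injected by docker-compose; the default mirrors the manifest's "stable".
body, headers = root_response(os.environ.get("MODE", "stable"))
```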

&lt;h3&gt;
  
  
  The deployment lifecycle
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;./swiftdeploy deploy&lt;/code&gt; calls &lt;code&gt;init&lt;/code&gt; first, then does &lt;code&gt;docker compose up -d&lt;/code&gt;, then polls &lt;code&gt;/healthz&lt;/code&gt; through nginx every second until it gets a 200 or 60 seconds pass. Nginx waits for the app to be healthy before it starts (&lt;code&gt;depends_on: condition: service_healthy&lt;/code&gt;), so the health poll through nginx is a genuine end-to-end check.&lt;/p&gt;
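&lt;p&gt;The polling loop is roughly this (a sketch; the CLI's actual helper may differ):&lt;/p&gt;

```python
import time
import urllib.error
import urllib.request

def wait_healthy(url: str, timeout_s: float = 60.0) -> bool:
    """Poll url once per second until it answers 200 or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while deadline - time.monotonic() > 0:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # nginx or the app is not up yet; keep polling
        time.sleep(1)
    return False
```

&lt;p&gt;&lt;code&gt;deploy&lt;/code&gt; would call something like &lt;code&gt;wait_healthy("http://127.0.0.1:18080/healthz")&lt;/code&gt; right after &lt;code&gt;docker compose up -d&lt;/code&gt;.&lt;/p&gt;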

&lt;p&gt;&lt;code&gt;./swiftdeploy promote canary&lt;/code&gt; mutates &lt;code&gt;services.mode&lt;/code&gt; in &lt;code&gt;manifest.yaml&lt;/code&gt; using a targeted regex — one line changes, nothing else. It then re-renders &lt;code&gt;docker-compose.yml&lt;/code&gt;, recreates only the app container (&lt;code&gt;--no-deps --force-recreate&lt;/code&gt;), and confirms the mode by checking both the JSON body and the &lt;code&gt;X-Mode&lt;/code&gt; header. If either signal is wrong, the promote fails.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;./swiftdeploy teardown --clean&lt;/code&gt; brings everything down and deletes the generated configs. Running &lt;code&gt;./swiftdeploy init&lt;/code&gt; afterwards regenerates byte-identical files. The grader can verify this. That idempotency guarantee is the whole point of the manifest-driven approach.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Stage 4A wasn't enough
&lt;/h2&gt;

&lt;p&gt;After building that I realised I had no visibility into what was happening inside the stack once it was running, and no automatic safety check before promoting. I was flying blind. I could deploy a canary that was returning 500 errors on every request and &lt;code&gt;promote stable&lt;/code&gt; would just do it, no questions asked.&lt;/p&gt;

&lt;p&gt;Stage 4B adds three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Eyes&lt;/strong&gt; — a &lt;code&gt;/metrics&lt;/code&gt; endpoint in Prometheus text format so I can see what is happening&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Brain&lt;/strong&gt; — an OPA sidecar that makes every allow/deny decision so the CLI never has to&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt; — &lt;code&gt;history.jsonl&lt;/code&gt; and &lt;code&gt;audit_report.md&lt;/code&gt; so there is a record of what happened and when&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwp9yxgmci0zqr3h0fc1g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwp9yxgmci0zqr3h0fc1g.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The metrics endpoint
&lt;/h2&gt;

&lt;p&gt;The app exposes &lt;code&gt;GET /metrics&lt;/code&gt;, returning the Prometheus text exposition format, hand-rolled with no client library. Here is what it looks like right after a fresh deploy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;curl &lt;span class="nt"&gt;-i&lt;/span&gt; http://127.0.0.1:18080/metrics
&lt;span class="go"&gt;HTTP/1.1 200 OK
&lt;/span&gt;&lt;span class="gp"&gt;Content-Type: text/plain;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.0.4&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nv"&gt;charset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;utf-8
&lt;span class="go"&gt;X-Deployed-By: swiftdeploy

&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;HELP http_requests_total Total HTTP requests by method, path, and status code.
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;TYPE http_requests_total counter
&lt;span class="go"&gt;http_requests_total{method="GET",path="/healthz",status_code="200"} 2
&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;HELP http_request_duration_seconds HTTP request latency histogram &lt;span class="k"&gt;in &lt;/span&gt;seconds.
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;TYPE http_request_duration_seconds histogram
&lt;span class="go"&gt;http_request_duration_seconds_bucket{method="GET",path="/healthz",le="0.005"} 2
http_request_duration_seconds_bucket{method="GET",path="/healthz",le="0.01"} 2
&lt;/span&gt;&lt;span class="c"&gt;...
&lt;/span&gt;&lt;span class="go"&gt;http_request_duration_seconds_bucket{method="GET",path="/healthz",le="+Inf"} 2
http_request_duration_seconds_sum{method="GET",path="/healthz"} 0.001272395
http_request_duration_seconds_count{method="GET",path="/healthz"} 2
&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;HELP app_uptime_seconds Process &lt;span class="nb"&gt;uptime &lt;/span&gt;&lt;span class="k"&gt;in &lt;/span&gt;seconds.
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;TYPE app_uptime_seconds gauge
&lt;span class="go"&gt;app_uptime_seconds 4.557
&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;HELP app_mode Current deployment mode, &lt;span class="nv"&gt;stable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 and &lt;span class="nv"&gt;canary&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1.
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;TYPE app_mode gauge
&lt;span class="go"&gt;app_mode 0
&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;HELP chaos_active Current chaos state, &lt;span class="nv"&gt;none&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="nv"&gt;slow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nv"&gt;error&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2.
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;TYPE chaos_active gauge
&lt;span class="go"&gt;chaos_active 0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The histogram buckets are cumulative — each &lt;code&gt;le&lt;/code&gt; bucket contains all requests at or below that latency. Two requests, both under 5 ms, so every bucket from &lt;code&gt;le="0.005"&lt;/code&gt; upward shows 2. The &lt;code&gt;+Inf&lt;/code&gt; bucket always equals &lt;code&gt;_count&lt;/code&gt;.&lt;/p&gt;
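&lt;p&gt;A few lines of Python make the cumulative behaviour concrete (the bucket boundaries are illustrative):&lt;/p&gt;

```python
BUCKETS = (0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0)

def bucket_counts(samples):
    """Each le bucket counts every sample at or below that latency."""
    counts = {str(le): sum(1 for s in samples if le >= s) for le in BUCKETS}
    counts["+Inf"] = len(samples)  # by construction, always equals _count
    return counts

# Two requests, both under 5 ms: every bucket reports 2.
print(bucket_counts([0.0006, 0.0007]))
```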

&lt;p&gt;&lt;code&gt;/metrics&lt;/code&gt; and &lt;code&gt;/chaos&lt;/code&gt; are deliberately exempt from the chaos middleware. The reason: if error chaos were injected at a 100% rate and &lt;code&gt;/metrics&lt;/code&gt; also returned 500s, the CLI would lose its ability to observe the failure and the policy loop would go blind.&lt;/p&gt;




&lt;h2&gt;
  
  
  The policy brain: why OPA and not a Python if-statement
&lt;/h2&gt;

&lt;p&gt;My first instinct was to put the threshold checks directly in the CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# what I did NOT do
&lt;/span&gt;&lt;span class="n"&gt;disk_free&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;disk_usage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;free&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;disk_free&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deploy blocked: not enough disk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem with this is that the threshold is a magic number in Python code. If you want to change it you edit the Python. If someone else has a different threshold they fork the script. There is no audit trail of what value was used when. And the policy is not testable in isolation.&lt;/p&gt;

&lt;p&gt;OPA solves this differently. The CLI collects facts and sends them to OPA as a JSON document. OPA evaluates Rego rules against the document and returns a decision. The CLI enforces whatever OPA says. The CLI never checks a threshold itself.&lt;/p&gt;
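&lt;p&gt;Against OPA's standard Data API, that round trip looks roughly like this (the package path in the example call is an assumption; the repo's Rego package may be named differently):&lt;/p&gt;

```python
import json
import urllib.request

def query_opa(url: str, facts: dict) -> dict:
    """POST the collected facts as OPA 'input'; return the decision document."""
    body = json.dumps({"input": facts}).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read())["result"]

# e.g. query_opa("http://127.0.0.1:18181/v1/data/swiftdeploy/infrastructure", facts)
```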

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0982mk60id5y4fsju01y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0982mk60id5y4fsju01y.png" alt=" " width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The infrastructure policy lives in &lt;code&gt;policies/infrastructure.rego&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rego"&gt;&lt;code&gt;&lt;span class="n"&gt;deny&lt;/span&gt; &lt;span class="n"&gt;contains&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"disk_free_too_low"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"message"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"disk free %vGB is below required %vGB"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;disk_free_gb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thresholds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min_disk_free_gb&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="s2"&gt;"observed"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;disk_free_gb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"threshold"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thresholds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min_disk_free_gb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;supported_question&lt;/span&gt;
    &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;disk_free_gb&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thresholds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min_disk_free_gb&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice &lt;code&gt;input.thresholds.min_disk_free_gb&lt;/code&gt; — not a hardcoded number. The threshold comes from the input document, which the CLI builds from &lt;code&gt;manifest.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;infrastructure_input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;manifest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_manifest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;host&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;host_stats&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;                                        &lt;span class="c1"&gt;# disk_free_gb, cpu_load
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thresholds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;policy_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;infrastructure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;   &lt;span class="c1"&gt;# from manifest.yaml
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The manifest has:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;infrastructure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;min_disk_free_gb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;max_cpu_load&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2.0&lt;/span&gt;
  &lt;span class="na"&gt;canary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;max_error_rate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.01&lt;/span&gt;
    &lt;span class="na"&gt;max_p99_latency_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Changing a threshold is a one-line edit in &lt;code&gt;manifest.yaml&lt;/code&gt;. The Rego file never changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why OPA runs as a sidecar
&lt;/h3&gt;

&lt;p&gt;OPA runs as a separate Docker container on the same internal network. The CLI talks to it on &lt;code&gt;127.0.0.1:18181&lt;/code&gt;. The important thing is what is NOT there: there is no nginx upstream for OPA. The nginx config has exactly one &lt;code&gt;location / { proxy_pass http://app_backend; }&lt;/code&gt; block. Requests through port 18080 reach the app and nothing else.&lt;/p&gt;

&lt;p&gt;The OPA port binding is &lt;code&gt;127.0.0.1:18181:8181&lt;/code&gt; — loopback only on the host. External machines cannot reach OPA directly. And even from inside the Docker network, nginx has no route to the OPA container's address, so a client hitting nginx cannot tunnel through to OPA.&lt;/p&gt;
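&lt;p&gt;In compose terms the whole distinction is the bind address (a sketch, not the generated file verbatim; the container-side ports are assumptions):&lt;/p&gt;

```yaml
services:
  opa:
    ports:
      - "127.0.0.1:18181:8181"   # host loopback only; no route via nginx
  nginx:
    ports:
      - "18080:80"               # the only externally reachable entry point
```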

&lt;h3&gt;
  
  
  OPA never returns a bare boolean
&lt;/h3&gt;

&lt;p&gt;Every decision object carries &lt;code&gt;allowed&lt;/code&gt;, &lt;code&gt;reason&lt;/code&gt;, and a &lt;code&gt;violations&lt;/code&gt; list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"domain"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"infrastructure"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"question"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pre_deploy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"allowed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"infrastructure policy denied: 1 violation(s)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"violations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"disk_free_too_low"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"disk free 121.359GB is below required 1e+06GB"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"observed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;121.359&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"threshold"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1000000.0&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CLI prints the reason and each violation ID. An operator looking at a denied deploy sees exactly which rule fired and what values triggered it, not just "denied".&lt;/p&gt;




&lt;h2&gt;
  
  
  The pre-deploy gate in action
&lt;/h2&gt;

&lt;p&gt;To prove the deploy gate worked I temporarily set &lt;code&gt;min_disk_free_gb: 1000000&lt;/code&gt; in the manifest — an impossible threshold — and ran &lt;code&gt;./swiftdeploy deploy&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;./swiftdeploy deploy    &lt;span class="c"&gt;# with min_disk_free_gb set to 1000000&lt;/span&gt;
&lt;span class="go"&gt;swiftdeploy deploy: rendering and starting policy sidecar
swiftdeploy init: rendering generated files from manifest.yaml
rendered nginx.conf &amp;lt;- templates\nginx.conf.tmpl
rendered docker-compose.yml &amp;lt;- templates\docker-compose.tmpl
OK: nginx.conf and docker-compose.yml regenerated.
swiftdeploy policy: starting OPA sidecar
[PASS] OPA health check passed
swiftdeploy deploy: querying pre-deploy policy
[FAIL] policy/infrastructure: infrastructure policy denied: 1 violation(s)
  - disk_free_too_low: disk free 121.359GB is below required 1e+06GB
&lt;/span&gt;&lt;span class="gp"&gt;[FAIL] deploy blocked by policy;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;app and nginx were not started
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OPA starts. OPA checks. OPA denies. The app and nginx containers never even get created. After restoring the threshold to 10, deploy succeeds in under 2 seconds.&lt;/p&gt;




&lt;h2&gt;
  
  
  The live status dashboard
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;./swiftdeploy status&lt;/code&gt; scrapes &lt;code&gt;/metrics&lt;/code&gt; every 5 seconds, calculates req/s and P99 latency against the previous snapshot, queries both OPA domains for their current verdict, and appends a record to &lt;code&gt;history.jsonl&lt;/code&gt;. Here is what a healthy stable deployment looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;./swiftdeploy status &lt;span class="nt"&gt;--once&lt;/span&gt;
&lt;span class="go"&gt;SwiftDeploy status @ 2026-05-06T18:37:02.816576+00:00
mode=stable chaos=none uptime=5.2s
req/s=2.000 error_rate=0.00% p99=0.005s window=0.0s
Policy Compliance:
[PASS] policy/infrastructure: infrastructure policy passed
[PASS] policy/canary: canary safety policy passed
history appended: history.jsonl
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both policies show green. P99 is 5 ms. Now watch what happens after chaos.&lt;/p&gt;
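&lt;p&gt;One implementation note before breaking things: &lt;code&gt;req/s&lt;/code&gt; and &lt;code&gt;error_rate&lt;/code&gt; fall out of diffing two consecutive counter snapshots, since Prometheus counters only ever grow (the field names below are mine, not the CLI's):&lt;/p&gt;

```python
def rates(prev: dict, curr: dict) -> dict:
    """Traffic figures for the window between two /metrics scrapes."""
    window = curr["ts"] - prev["ts"]
    d_total = curr["requests_total"] - prev["requests_total"]
    d_errors = curr["errors_total"] - prev["errors_total"]
    return {
        "req_per_s": d_total / window if window > 0 else 0.0,
        "error_rate": d_errors / d_total if d_total > 0 else 0.0,
    }
```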




&lt;h2&gt;
  
  
  Chaos mode and what the dashboard showed
&lt;/h2&gt;

&lt;p&gt;After promoting to canary I injected a 100% error rate:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rojn638op9meg9cgfp8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rojn638op9meg9cgfp8.jpg" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="go"&gt;    -d '{"mode":"error","rate":1.0}' http://127.0.0.1:18080/chaos

{"chaos":{"mode":"error","duration":0.0,"rate":1.0}}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
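&lt;p&gt;The app's actual chaos handler is not shown in this post, but the behaviour that curl triggers can be sketched roughly like this — the &lt;code&gt;chaos&lt;/code&gt; dict and &lt;code&gt;apply_chaos&lt;/code&gt; helper are illustrative names, not the real implementation:&lt;/p&gt;

```python
import random

# Hypothetical sketch of an error-injection hook -- not the app's real
# code.  State is set by POST /chaos; every non-exempt request consults
# it before responding.
chaos = {"mode": None, "rate": 0.0}

def apply_chaos(ok_status: int = 200) -> int:
    """Return the status code a request gets under the current chaos mode."""
    if chaos["mode"] == "error" and chaos["rate"] > random.random():
        return 500  # injected failure
    return ok_status

chaos.update(mode="error", rate=1.0)  # what the curl above sets
print(apply_chaos())                  # rate=1.0: every request fails with 500
```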



&lt;p&gt;The canary is now returning 500 on every non-exempt request. The status dashboard immediately picked this up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;SwiftDeploy status @ 2026-05-06T18:37:10.281061+00:00
mode=canary chaos=error uptime=6.7s
req/s=4.469 error_rate=100.00% p99=0.005s window=1.1s
Policy Compliance:
[PASS] policy/infrastructure: infrastructure policy passed
[FAIL] policy/canary: canary safety policy denied: 1 violation(s)
  - error_rate_too_high: error rate 1.0000 is above allowed 0.0100
history appended: history.jsonl
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The canary policy is now red. &lt;code&gt;error_rate=100.00%&lt;/code&gt;. OPA knows. The status loop is recording this to &lt;code&gt;history.jsonl&lt;/code&gt; every scrape cycle.&lt;/p&gt;

&lt;p&gt;Now I tried to promote back to stable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;./swiftdeploy promote stable
&lt;span class="go"&gt;swiftdeploy promote: target mode=stable
&lt;/span&gt;&lt;span class="c"&gt;...
&lt;/span&gt;&lt;span class="go"&gt;[PASS] OPA health check passed
  policy: querying canary safety before manifest mutation
[FAIL] policy/canary: canary safety policy denied: 1 violation(s)
  - error_rate_too_high: error rate 1.0000 is above allowed 0.0100
&lt;/span&gt;&lt;span class="gp"&gt;[FAIL] promote blocked by policy;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;manifest.yaml was not changed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last line is the safety guarantee that matters: &lt;code&gt;manifest.yaml was not changed&lt;/code&gt;. The policy check runs &lt;strong&gt;before&lt;/strong&gt; the manifest mutation. A failed check leaves the stack exactly as it was. No half-promote. No corrupted state.&lt;/p&gt;
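&lt;p&gt;The check-before-mutate ordering can be sketched in a few lines — &lt;code&gt;check_policy&lt;/code&gt; here is a hypothetical stand-in for the real OPA query, and the dict stands in for &lt;code&gt;manifest.yaml&lt;/code&gt;:&lt;/p&gt;

```python
# Sketch of the check-before-mutate ordering.  check_policy stands in
# for the real OPA query; the manifest dict stands in for manifest.yaml.
manifest = {"mode": "canary"}

def check_policy(facts: dict):
    """Deny when the observed error rate exceeds the allowed threshold."""
    if facts["error_rate"] > facts["max_error_rate"]:
        return False, ["error_rate_too_high"]
    return True, []

def promote(target_mode: str, facts: dict) -> bool:
    ok, violations = check_policy(facts)  # 1. ask the policy first
    if not ok:
        return False                      # 2. fail -> manifest untouched
    manifest["mode"] = target_mode        # 3. mutate only after a PASS
    return True

promote("stable", {"error_rate": 1.0, "max_error_rate": 0.01})
print(manifest["mode"])  # still "canary": no half-promote
```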

&lt;p&gt;After recovering from chaos:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="go"&gt;    -d '{"mode":"recover"}' http://127.0.0.1:18080/chaos

{"chaos":{"mode":null,"duration":0.0,"rate":0.0}}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next &lt;code&gt;promote stable&lt;/code&gt; succeeds because OPA now sees a clean error rate.&lt;/p&gt;




&lt;h2&gt;
  
  
  The audit trail
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;./swiftdeploy audit&lt;/code&gt; reads &lt;code&gt;history.jsonl&lt;/code&gt; and generates &lt;code&gt;audit_report.md&lt;/code&gt;. After the whole lifecycle above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;./swiftdeploy audit
&lt;span class="go"&gt;audit: wrote audit_report.md from 6 history record(s)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The report's timeline section:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Chaos&lt;/th&gt;
&lt;th&gt;Req/s&lt;/th&gt;
&lt;th&gt;Error Rate&lt;/th&gt;
&lt;th&gt;P99 Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-06T18:36:54&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;0.000&lt;/td&gt;
&lt;td&gt;0.00%&lt;/td&gt;
&lt;td&gt;0.000s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-06T18:36:55&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;0.000&lt;/td&gt;
&lt;td&gt;0.00%&lt;/td&gt;
&lt;td&gt;0.000s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-06T18:37:02&lt;/td&gt;
&lt;td&gt;stable&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;2.000&lt;/td&gt;
&lt;td&gt;0.00%&lt;/td&gt;
&lt;td&gt;0.005s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-06T18:37:04&lt;/td&gt;
&lt;td&gt;stable&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;4.721&lt;/td&gt;
&lt;td&gt;0.00%&lt;/td&gt;
&lt;td&gt;0.005s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-06T18:37:10&lt;/td&gt;
&lt;td&gt;canary&lt;/td&gt;
&lt;td&gt;error&lt;/td&gt;
&lt;td&gt;4.469&lt;/td&gt;
&lt;td&gt;100.00%&lt;/td&gt;
&lt;td&gt;0.005s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-06T18:37:12&lt;/td&gt;
&lt;td&gt;canary&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;4.519&lt;/td&gt;
&lt;td&gt;0.00%&lt;/td&gt;
&lt;td&gt;0.005s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The violations section:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- 2026-05-06T18:36:54  infrastructure  deny: disk_free_too_low
    disk free 121.359GB is below required 1e+06GB
- 2026-05-06T18:37:10  canary  deny: error_rate_too_high
    error rate 1.0000 is above allowed 0.0100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two violations, two causes, timestamps on both. The first was the intentional disk threshold test. The second was the chaos injection. Both are there even though neither resulted in a broken deployment — that is the point of an audit trail.&lt;/p&gt;
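&lt;p&gt;Reconstructing that timeline from &lt;code&gt;history.jsonl&lt;/code&gt; is a small transform. A sketch, assuming field names that match the output above — the tool's real schema may differ:&lt;/p&gt;

```python
import json

# Sketch of turning history.jsonl records into a markdown timeline.
# Field names are assumptions based on the output shown above.
raw = "\n".join([
    '{"ts": "2026-05-06T18:37:02", "mode": "stable", "chaos": "none", '
    '"req_s": 2.0, "error_rate": 0.0, "p99": 0.005}',
    '{"ts": "2026-05-06T18:37:10", "mode": "canary", "chaos": "error", '
    '"req_s": 4.469, "error_rate": 1.0, "p99": 0.005}',
])  # in the real tool this would come from reading history.jsonl

rows = ["| Time | Mode | Chaos | Req/s | Error Rate | P99 Latency |",
        "| --- | --- | --- | --- | --- | --- |"]
for line in raw.splitlines():
    r = json.loads(line)
    rows.append(f"| {r['ts']} | {r['mode']} | {r['chaos']} "
                f"| {r['req_s']:.3f} | {r['error_rate']:.2%} | {r['p99']:.3f}s |")
report = "\n".join(rows)
print(report)
```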




&lt;h2&gt;
  
  
  Replicate it yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Clone&lt;/span&gt;
git clone https://github.com/Kaycee-dev/hng14-devops-stage4A swiftdeploy
&lt;span class="nb"&gt;cd &lt;/span&gt;swiftdeploy

&lt;span class="c"&gt;# 2. Install the one Python dependency&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;pyyaml

&lt;span class="c"&gt;# 3. Build the app image&lt;/span&gt;
docker build &lt;span class="nt"&gt;-t&lt;/span&gt; swiftdeploy-stage4b-app:1.0.0 &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# 4. Validate — should show 5 PASS lines&lt;/span&gt;
./swiftdeploy validate

&lt;span class="c"&gt;# 5. Deploy (OPA starts first, policy check runs, then app + nginx)&lt;/span&gt;
./swiftdeploy deploy

&lt;span class="c"&gt;# 6. Check status&lt;/span&gt;
./swiftdeploy status &lt;span class="nt"&gt;--once&lt;/span&gt;

&lt;span class="c"&gt;# 7. Promote to canary&lt;/span&gt;
./swiftdeploy promote canary

&lt;span class="c"&gt;# 8. Inject chaos&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"mode":"error","rate":1.0}'&lt;/span&gt; http://127.0.0.1:18080/chaos

&lt;span class="c"&gt;# 9. Watch status go red&lt;/span&gt;
./swiftdeploy status &lt;span class="nt"&gt;--once&lt;/span&gt;

&lt;span class="c"&gt;# 10. Try to promote to stable — policy blocks it&lt;/span&gt;
./swiftdeploy promote stable

&lt;span class="c"&gt;# 11. Recover&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"mode":"recover"}'&lt;/span&gt; http://127.0.0.1:18080/chaos

&lt;span class="c"&gt;# 12. Now promote succeeds&lt;/span&gt;
./swiftdeploy promote stable

&lt;span class="c"&gt;# 13. Generate the audit report&lt;/span&gt;
./swiftdeploy audit
&lt;span class="nb"&gt;cat &lt;/span&gt;audit_report.md

&lt;span class="c"&gt;# 14. Tear down and prove regeneration&lt;/span&gt;
./swiftdeploy teardown &lt;span class="nt"&gt;--clean&lt;/span&gt;
./swiftdeploy init      &lt;span class="c"&gt;# nginx.conf and docker-compose.yml come back byte-identical&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Windows note:&lt;/strong&gt; run everything inside Git Bash. &lt;code&gt;os.getloadavg()&lt;/code&gt; does not exist on Windows, so CPU load always reads as 0.0. The CPU policy check still works — to prove it, set &lt;code&gt;max_cpu_load: -1.0&lt;/code&gt; in &lt;code&gt;manifest.yaml&lt;/code&gt; and run deploy. That forces &lt;code&gt;0.0 &amp;gt; -1.0&lt;/code&gt; and you get a CPU denial. On Linux or macOS the real load average is used and a threshold of 2.0 is meaningful.&lt;/p&gt;
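&lt;p&gt;A portable way to read the load average, matching the fallback behaviour described above, might look like this:&lt;/p&gt;

```python
import os

# Portable 1-minute load average, matching the fallback described above:
# os.getloadavg() exists on Linux/macOS but not on Windows, where the
# CLI reports 0.0 instead.
def cpu_load_1m() -> float:
    try:
        return os.getloadavg()[0]
    except (AttributeError, OSError):
        return 0.0  # Windows fallback

load = cpu_load_1m()
# With max_cpu_load: -1.0 in manifest.yaml the policy compares
# load > -1.0, which is always true, so the denial path is exercised.
print(load > -1.0)  # True
```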




&lt;h2&gt;
  
  
  Lessons learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The CLI is an enforcer, not a judge
&lt;/h3&gt;

&lt;p&gt;The most tempting shortcut was putting threshold comparisons directly in the Python. It would have been three lines of code. The problem is that once you put a threshold in Python, OPA is just logging middleware — you can bypass it by changing the Python. The design that actually holds is: the CLI gathers facts, calls OPA, reads the decision, acts on it. The CLI never knows what the thresholds are. If you want to understand why a deploy was blocked, you read the Rego file and the manifest, not the Python.&lt;/p&gt;
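&lt;p&gt;The gather-facts, ask-OPA, act shape is easy to sketch. OPA's Data API is real — a POST with an &lt;code&gt;input&lt;/code&gt; document returns a &lt;code&gt;result&lt;/code&gt; — but the policy path, port, and response fields below are assumptions about this project, not its actual code:&lt;/p&gt;

```python
import json
import urllib.request

# Sketch of the gather-facts / ask-OPA / act shape.  OPA's Data API is
# real (POST /v1/data/policy/canary with {"input": ...} returns
# {"result": ...}); the path, port, and result fields are assumptions.
OPA_URL = "http://127.0.0.1:8181/v1/data/policy/canary"

def build_opa_request(facts: dict) -> urllib.request.Request:
    return urllib.request.Request(
        OPA_URL,
        data=json.dumps({"input": facts}).encode(),
        headers={"Content-Type": "application/json"},
    )

def decide(opa_response: dict):
    """The CLI only reads the verdict; the thresholds live in the Rego."""
    result = opa_response.get("result", {})
    return result.get("allow", False), result.get("violations", [])

# with urllib.request.urlopen(build_opa_request(facts)) as r:
#     allow, violations = decide(json.load(r))
```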

&lt;h3&gt;
  
  
  The single source of truth saves you at 2am
&lt;/h3&gt;

&lt;p&gt;Everything flows from &lt;code&gt;manifest.yaml&lt;/code&gt;. When the grader deletes &lt;code&gt;nginx.conf&lt;/code&gt; and &lt;code&gt;docker-compose.yml&lt;/code&gt; and runs &lt;code&gt;./swiftdeploy init&lt;/code&gt;, they get the same files back. The SHA256 hash of the generated files is deterministic given the manifest. If something breaks, you open the manifest. You do not hunt through five separate files trying to find where the port is defined.&lt;/p&gt;
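&lt;p&gt;The determinism claim is easy to check: the same manifest must always render the same bytes. A sketch — &lt;code&gt;render_nginx&lt;/code&gt; is a hypothetical stand-in for the real template step:&lt;/p&gt;

```python
import hashlib

# Sketch of the determinism claim.  render_nginx is a hypothetical
# stand-in for the real template step; the point is that identical
# manifest input yields byte-identical output, hence a stable SHA256.
def render_nginx(manifest: dict) -> str:
    return ("# DO NOT HAND-EDIT -- generated from manifest.yaml\n"
            f"server {{ listen {manifest['port']}; }}\n")

m = {"port": 8080}
h1 = hashlib.sha256(render_nginx(m).encode()).hexdigest()
h2 = hashlib.sha256(render_nginx(m).encode()).hexdigest()
print(h1 == h2)  # True: regeneration is byte-identical
```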

&lt;h3&gt;
  
  
  Generated artifacts and source files must be clearly separated
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;nginx.conf&lt;/code&gt; and &lt;code&gt;docker-compose.yml&lt;/code&gt; have &lt;code&gt;DO NOT HAND-EDIT&lt;/code&gt; headers. &lt;code&gt;history.jsonl&lt;/code&gt; and &lt;code&gt;audit_report.md&lt;/code&gt; are generated outputs of the CLI runtime. None of these are source files. Committing them to the repo is fine as evidence and for the grader, but they must never be the thing you edit to configure the stack. The moment you hand-edit a generated file you break the invariant the whole tool is built on.&lt;/p&gt;

&lt;h3&gt;
  
  
  The two-scrape window is a real trade-off
&lt;/h3&gt;

&lt;p&gt;The brief asks for error rate "over the last 30 seconds." What the implementation actually does is take two metrics scrapes about 1 second apart and evaluate the delta. This gives an immediate signal — if the canary is broken right now, the next promote is blocked within 1 second of the command starting. The trade-off is that a bursty error spike from 10 seconds ago would not block promotion. The right answer for production is to run &lt;code&gt;./swiftdeploy status&lt;/code&gt; for 30 seconds before promoting so the rolling history is warm. For this project the live window proved the policy gate works; the 30-second window remains a design goal rather than something the current implementation enforces.&lt;/p&gt;
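&lt;p&gt;The two-scrape delta boils down to subtracting cumulative counters. A sketch with assumed metric names:&lt;/p&gt;

```python
# Sketch of the two-scrape delta.  Counter names are assumptions; the
# idea is that Prometheus counters are cumulative, so subtracting two
# snapshots gives exactly the traffic in the window between them.
def window_error_rate(prev: dict, curr: dict) -> float:
    total = curr["requests_total"] - prev["requests_total"]
    errors = curr["errors_total"] - prev["errors_total"]
    return errors / total if total else 0.0

prev = {"requests_total": 100, "errors_total": 0}
curr = {"requests_total": 105, "errors_total": 5}  # 5 requests, all failed
print(f"{window_error_rate(prev, curr):.2%}")  # 100.00%
```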

</description>
      <category>devops</category>
      <category>opa</category>
      <category>prometheus</category>
      <category>docker</category>
    </item>
  </channel>
</rss>
