<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kelechi Uba</title>
    <description>The latest articles on DEV Community by Kelechi Uba (@kelechi_uba_d8ec694684838).</description>
    <link>https://dev.to/kelechi_uba_d8ec694684838</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3916545%2F76de581c-086e-4a6c-be61-c1c975a9d7bc.jpg</url>
      <title>DEV Community: Kelechi Uba</title>
      <link>https://dev.to/kelechi_uba_d8ec694684838</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kelechi_uba_d8ec694684838"/>
    <language>en</language>
    <item>
      <title>The Bug the Happy Path Hides</title>
      <dc:creator>Kelechi Uba</dc:creator>
      <pubDate>Sat, 13 Jun 2026 11:33:47 +0000</pubDate>
      <link>https://dev.to/kelechi_uba_d8ec694684838/the-bug-the-happy-path-hides-3gjo</link>
      <guid>https://dev.to/kelechi_uba_d8ec694684838/the-bug-the-happy-path-hides-3gjo</guid>
      <description>&lt;p&gt;&lt;em&gt;Two backend tasks from my internship, and the failure modes that only appear after the demo is over.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Most backends I build pass their own demo.&lt;/p&gt;

&lt;p&gt;You click through the intended flow, the happy path turns green, and the feature feels finished. But the recurring lesson of this internship was that "passes the demo" and "holds up under failure" are separated by a specific class of bug: the one that only appears under a condition the demo never creates.&lt;/p&gt;

&lt;p&gt;A crash at exactly the wrong byte. Two requests arriving inside the same critical section. A background task running longer than the timeout its author guessed. A transport failing after the database has already committed.&lt;/p&gt;

&lt;p&gt;This is about two of those tasks. One was an individual project: a write-ahead log built from first principles. The other was a team feature: a streaming persona-refinement endpoint for an AI product. Different stacks, different failure modes, same lesson.&lt;/p&gt;

&lt;h2&gt;
  
  
  Task 1: A write-ahead log without a database
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What it was
&lt;/h3&gt;

&lt;p&gt;The task was to build an append-only event store without a database or third-party storage system.&lt;/p&gt;

&lt;p&gt;I built it with Python 3.11 and FastAPI. Each event is serialized as compact, newline-delimited JSON and appended to one flat file, &lt;code&gt;events.log&lt;/code&gt;. An in-memory dictionary maps every &lt;code&gt;event_id&lt;/code&gt; to a &lt;code&gt;(byte_offset, byte_length)&lt;/code&gt; pair. Reads use that index to jump directly to the record:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;log_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;log_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On startup, the service replays the log and rebuilds the index before accepting requests.&lt;/p&gt;

&lt;p&gt;The core implementation is small. That was part of the challenge. Once I removed the database, I also removed the layer that normally handles durability, indexing, transaction boundaries, and recovery. Every assumption about bytes and concurrency became my responsibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  The problem it solved
&lt;/h3&gt;

&lt;p&gt;The event store needed four properties:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Append-only writes.&lt;/li&gt;
&lt;li&gt;Byte-accurate random-access reads.&lt;/li&gt;
&lt;li&gt;Startup recovery by replaying the log.&lt;/li&gt;
&lt;li&gt;Consistent state when concurrent requests race.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The "no database" constraint forced me to implement those guarantees explicitly instead of delegating them to Postgres.&lt;/p&gt;

&lt;h3&gt;
  
  
  How I approached it
&lt;/h3&gt;

&lt;p&gt;The file format is deliberately simple: one UTF-8 JSON object per write, followed by a newline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;serialized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;separators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;encoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;serialized&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;log_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoded&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;log_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The index stores the offset of the JSON object's first byte and its encoded length, excluding the newline. That lets reads return exactly the record bytes without scanning or trimming.&lt;/p&gt;

&lt;p&gt;A single &lt;code&gt;asyncio.Lock&lt;/code&gt; guards the full append transaction:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read the current file size, which becomes the new record's byte offset.&lt;/li&gt;
&lt;li&gt;Append the encoded event.&lt;/li&gt;
&lt;li&gt;Flush the file buffer.&lt;/li&gt;
&lt;li&gt;Add the matching index entry.&lt;/li&gt;
&lt;/ol&gt;
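&lt;p&gt;The four steps above can be sketched as follows. This is a minimal illustration, not the original implementation: &lt;code&gt;TinyEventStore&lt;/code&gt; and &lt;code&gt;demo&lt;/code&gt; are hypothetical names, and the sketch reopens the file on every call for simplicity, which the real service may not do.&lt;/p&gt;

```python
import asyncio
import json
import os
import tempfile


class TinyEventStore:
    """Hypothetical stand-in for the store described above, not the original code."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # event_id maps to (byte_offset, byte_length)
        self._lock = asyncio.Lock()

    async def append(self, event):
        text = json.dumps(event, ensure_ascii=False, separators=(",", ":"))
        encoded = text.encode("utf-8")
        async with self._lock:
            with open(self.path, "ab") as log_file:
                offset = log_file.seek(0, os.SEEK_END)  # 1. current size is the new offset
                log_file.write(encoded + b"\n")         # 2. append record plus newline
                log_file.flush()                        # 3. hand the bytes to the OS
            # 4. index the record; the length excludes the newline
            self.index[event["id"]] = (offset, len(encoded))
        return self.index[event["id"]]

    def read(self, event_id):
        offset, length = self.index[event_id]
        with open(self.path, "rb") as log_file:
            log_file.seek(offset)
            return json.loads(log_file.read(length).decode("utf-8"))


async def demo():
    path = os.path.join(tempfile.mkdtemp(), "events.log")
    store = TinyEventStore(path)
    first = await store.append({"id": "a", "name": "Chid\u00e9"})
    second = await store.append({"id": "b", "n": 2})
    return store, first, second


store, first, second = asyncio.run(demo())
```

&lt;p&gt;Because the lock covers the size read, the write, and the index update together, the second record's offset is always the first record's length plus one byte for the newline.&lt;/p&gt;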

&lt;p&gt;Recovery runs inside FastAPI's lifespan hook, so the service rebuilds its index before serving traffic.&lt;/p&gt;

&lt;p&gt;That design worked. Then the edge cases started testing what "worked" actually meant.&lt;/p&gt;

&lt;h3&gt;
  
  
  What broke, and how I fixed it
&lt;/h3&gt;

&lt;h4&gt;
  
  
  The Unicode byte trap
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb11w1r50fsqdrtbi0q07.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb11w1r50fsqdrtbi0q07.png" alt="Two rows comparing the same string measured in characters versus UTF-8 bytes; a seek that is correct by character lands inside a multibyte character" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Characters are counted in code points; the file is addressed in bytes. A seek that is correct by character can land mid-character.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Python's &lt;code&gt;len(s)&lt;/code&gt; counts Unicode code points. Files are addressed in bytes.&lt;/p&gt;

&lt;p&gt;For ASCII, those values happen to match. That is what makes the bug dangerous: it passes tests written with payloads such as &lt;code&gt;"test"&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;"é"&lt;/code&gt; is one character but two UTF-8 bytes.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;"🎉"&lt;/code&gt; is one Unicode code point but four UTF-8 bytes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the index stores &lt;code&gt;len(serialized)&lt;/code&gt;, then &lt;code&gt;seek()&lt;/code&gt; moves to a byte offset but &lt;code&gt;read(length)&lt;/code&gt; is handed a character count where a byte count is required. For any non-ASCII payload that count is too small, so the read can stop inside a multibyte character, leaving invalid UTF-8 or truncated JSON.&lt;/p&gt;

&lt;p&gt;The implementation stores &lt;code&gt;len(encoded)&lt;/code&gt;, and the test makes the reason explicit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_bytes&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The payload is &lt;code&gt;"Chidé 🎉"&lt;/code&gt; on purpose. A future change back to character length cannot pass that test by accident.&lt;/p&gt;
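&lt;p&gt;To make the mismatch concrete, here is a small standalone sketch using the same payload text. The &lt;code&gt;name&lt;/code&gt; key and the variable names are illustrative assumptions; only the standard library is used.&lt;/p&gt;

```python
import json

# The post's payload, written with escapes so the code points are unambiguous.
payload = {"name": "Chid\u00e9 \U0001F389"}
text = json.dumps(payload, ensure_ascii=False, separators=(",", ":"))
encoded = text.encode("utf-8")

char_count = len(text)     # Unicode code points
byte_count = len(encoded)  # bytes as stored on disk

# Reading only char_count bytes cuts the record short, inside the emoji.
truncated = encoded[:char_count]
try:
    json.loads(truncated.decode("utf-8"))
    outcome = "parsed"
except (UnicodeDecodeError, json.JSONDecodeError) as exc:
    outcome = type(exc).__name__

print(char_count, byte_count, outcome)
```

&lt;p&gt;The byte count exceeds the character count by four: one extra byte for the accented letter and three for the emoji.&lt;/p&gt;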

&lt;h4&gt;
  
  
  Stats could observe half a write
&lt;/h4&gt;

&lt;p&gt;My first &lt;code&gt;/stats&lt;/code&gt; endpoint read the event count from memory and the byte count from disk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;bytes_written&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stat&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;st_size&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bytes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bytes_written&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both reads happened outside the append lock.&lt;/p&gt;

&lt;p&gt;That created a small but real race. If a &lt;code&gt;POST&lt;/code&gt; had written and flushed the bytes but had not yet updated the index, a simultaneous &lt;code&gt;GET /stats&lt;/code&gt; could report the new file size with the old event count. Each number was individually true, but the pair described no consistent state the system had ever held.&lt;/p&gt;

&lt;p&gt;I moved the snapshot into the store and protected it with the same lock as append:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;bytes_written&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stat&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;st_size&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bytes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bytes_written&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now both values come from one transaction boundary.&lt;/p&gt;

&lt;h4&gt;
  
  
  Recovery failed on a torn tail
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falf8vq6lw9gw2tma5yzv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falf8vq6lw9gw2tma5yzv.png" alt="An append-only log of intact records ending in a torn final fragment; lookahead skips an interrupted final append but fails loudly on interior corruption" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Lookahead lets recovery treat a torn final record as an interrupted append, while interior corruption still fails loudly.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The most serious bug was in startup recovery.&lt;/p&gt;

&lt;p&gt;The first recovery loop parsed every line as JSON. It could handle a valid final record without a trailing newline, but not a process crash in the middle of a write. If the file ended with bytes such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"x":3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;json.loads()&lt;/code&gt; raised. That exception escaped the lifespan hook, so the service could not start until someone manually repaired the log.&lt;/p&gt;

&lt;p&gt;A crash during a write had made the write-ahead log prevent recovery.&lt;/p&gt;

&lt;p&gt;The fix needed more than a broad &lt;code&gt;try/except&lt;/code&gt;. A malformed final fragment can be evidence of an interrupted append. A malformed interior record means previously committed data was damaged. Silently ignoring both would hide real corruption.&lt;/p&gt;

&lt;p&gt;Recovery therefore uses one-line lookahead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;raw_line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;log_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;raw_line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;next_line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;log_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;is_last_line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;next_line&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;UnicodeDecodeError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_last_line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Skipped 1 incomplete trailing record in events.log&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt;

    &lt;span class="n"&gt;raw_line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;next_line&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The distinction is deliberate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Valid final JSON without a newline is recovered.&lt;/li&gt;
&lt;li&gt;Invalid final bytes are treated as a torn append and skipped.&lt;/li&gt;
&lt;li&gt;Invalid interior data fails loudly.&lt;/li&gt;
&lt;/ul&gt;
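&lt;p&gt;The three cases above can be exercised with a self-contained sketch of the lookahead loop. This is an approximation under stated assumptions rather than the original code: the &lt;code&gt;recover&lt;/code&gt; and &lt;code&gt;write_log&lt;/code&gt; helpers, the &lt;code&gt;id&lt;/code&gt; key, and the returned index shape are hypothetical.&lt;/p&gt;

```python
import json
import os
import sys
import tempfile


def recover(path):
    """Rebuild an index of event id to (offset, length); hypothetical sketch."""
    index = {}
    offset = 0
    with open(path, "rb") as log_file:
        raw_line = log_file.readline()
        while raw_line:
            next_line = log_file.readline()
            is_last_line = not next_line
            content = raw_line.rstrip(b"\n")
            try:
                event = json.loads(content.decode("utf-8"))
            except (json.JSONDecodeError, UnicodeDecodeError):
                if is_last_line:
                    print("Skipped 1 incomplete trailing record in events.log",
                          file=sys.stderr, flush=True)
                    break
                raise  # interior damage: committed data is unreadable
            index[event["id"]] = (offset, len(content))
            offset += len(raw_line)
            raw_line = next_line
    return index


def write_log(data):
    # Test helper: write raw bytes to a fresh temporary log file.
    path = os.path.join(tempfile.mkdtemp(), "events.log")
    with open(path, "wb") as log_file:
        log_file.write(data)
    return path


torn_tail = write_log(b'{"id":"a"}\n{"id":"b"}\n{"x":3')
no_newline = write_log(b'{"id":"a"}\n{"id":"b"}')
interior = write_log(b'{"id":"a"}\n{"x":3\n{"id":"b"}\n')
```

&lt;p&gt;Calling &lt;code&gt;recover(torn_tail)&lt;/code&gt; and &lt;code&gt;recover(no_newline)&lt;/code&gt; both yield two indexed records, while &lt;code&gt;recover(interior)&lt;/code&gt; raises instead of skipping the damaged line.&lt;/p&gt;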

&lt;p&gt;The test injects the exact crash shape instead of approximating it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;log_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ab&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;log_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;log_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;recovered&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EventStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_path&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;recover&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;recovered&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;captured&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Skipped 1 incomplete trailing record in events.log&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The commit history made the lesson measurable. I built append, indexed reads, stats, and startup recovery in under two hours. Roughly fourteen and a half hours later, harder tests exposed the stats race and torn-tail failure.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I took away
&lt;/h3&gt;

&lt;p&gt;The write path and recovery path are one protocol implemented in two places. They must agree on byte lengths, record boundaries, and what counts as committed. If either side interprets the file differently, the system eventually reads garbage or refuses to boot.&lt;/p&gt;

&lt;p&gt;The stats bug was the same lesson at a smaller scale: related state needs one observation boundary.&lt;/p&gt;

&lt;p&gt;I also learned to state durability honestly. &lt;code&gt;flush()&lt;/code&gt; moves Python's buffered bytes to the operating system, which protects against an application-process crash. It is not &lt;code&gt;fsync()&lt;/code&gt;, so it does not promise survival after a machine crash or power loss. Calling the prototype "fully durable" would have hidden an important boundary.&lt;/p&gt;
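&lt;p&gt;For comparison, a stronger variant would follow the flush with &lt;code&gt;os.fsync()&lt;/code&gt;. The sketch below is an assumption about what that could look like, not something the prototype does; &lt;code&gt;append_durably&lt;/code&gt; is a hypothetical helper, and the extra call adds latency to every write.&lt;/p&gt;

```python
import os
import tempfile


def append_durably(path, record):
    """Append one record and ask the OS to push it to stable storage."""
    with open(path, "ab") as log_file:
        log_file.write(record + b"\n")
        log_file.flush()             # Python buffer to the OS page cache
        os.fsync(log_file.fileno())  # OS page cache to the storage device
    return os.path.getsize(path)


path = os.path.join(tempfile.mkdtemp(), "events.log")
size = append_durably(path, b'{"id":"a"}')
```

&lt;p&gt;Even then, the guarantee depends on the filesystem and hardware honoring the sync request, so the durability claim should still be stated with that boundary in mind.&lt;/p&gt;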

&lt;h3&gt;
  
  
  Why I picked it
&lt;/h3&gt;

&lt;p&gt;This task gave me both kinds of useful bug.&lt;/p&gt;

&lt;p&gt;The Unicode trap was one I anticipated and encoded into a regression test. The stats race and torn-tail failure were bugs I shipped in the first version and found only after testing the moments between the happy-path steps.&lt;/p&gt;

&lt;p&gt;With no database to absorb my mistakes, the failure boundaries were impossible to ignore.&lt;/p&gt;

&lt;h2&gt;
  
  
  Task 2: Streaming an AI refinement without duplicating the turn
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What it was
&lt;/h3&gt;

&lt;p&gt;Anvila is an AI persona-generation platform. I worked on its FastAPI backend, where Celery workers perform long-running generation and Redis carries live events.&lt;/p&gt;

&lt;p&gt;My largest feature was the persona-refinement endpoint. It lets a user continue a conversation with an existing persona and watch the response arrive over Server-Sent Events. The model may return a normal streaming answer or regenerate the persona's files and skills.&lt;/p&gt;

&lt;p&gt;The change was substantial: roughly 1,300 inserted lines in its first commit, including a new Celery task, the endpoint and SSE relay, a shared persona-application module, and 637 lines of endpoint and task tests.&lt;/p&gt;

&lt;p&gt;The architecture sounds straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser &amp;lt;- SSE &amp;lt;- FastAPI relay &amp;lt;- Redis pub/sub &amp;lt;- Celery worker &amp;lt;- model
                                      |
                                      +-----------------------&amp;gt; database
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4ohglemgi5jmqwhle7t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4ohglemgi5jmqwhle7t.png" alt="Left-to-right streaming pipeline: browser, FastAPI relay, Redis pub/sub, Celery worker, model, with a durable database branch off the worker" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Browser, relay, Redis, worker, and database each have their own lifetime and failure mode.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every arrow has an independent lifetime and a different failure mode.&lt;/p&gt;
&lt;h3&gt;
  
  
  The problem it solved
&lt;/h3&gt;

&lt;p&gt;Generation is too slow and stateful to run inside a normal request-response cycle. The work belongs in a background worker.&lt;/p&gt;

&lt;p&gt;But users still expect immediate feedback. They should see tokens as the model produces them, not stare at a loading spinner until the full turn is complete.&lt;/p&gt;

&lt;p&gt;That required the API process and worker process to agree on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;where events are published,&lt;/li&gt;
&lt;li&gt;when the listener is ready,&lt;/li&gt;
&lt;li&gt;how long silence is tolerated,&lt;/li&gt;
&lt;li&gt;which state is durable,&lt;/li&gt;
&lt;li&gt;and whether a delivery failure is allowed to retry the task.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The hard part was not streaming a token. It was preventing timing and transport failures from losing the beginning of a stream or duplicating the durable result.&lt;/p&gt;
&lt;h3&gt;
  
  
  How I approached it
&lt;/h3&gt;

&lt;p&gt;I extended the model adapter with an asynchronous &lt;code&gt;.stream()&lt;/code&gt; generator. The endpoint creates a Redis pub/sub subscription, enqueues the Celery task, and relays channel messages as SSE frames. The worker processes the model response, persists conversation or persona changes, and publishes status, token, file, skill, and completion events.&lt;/p&gt;

&lt;p&gt;One ordering rule mattered immediately: subscribe before enqueue.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pubsub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subscribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;refine_persona&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If those lines are reversed, a fast worker can publish its first events before the relay is listening. Redis pub/sub does not retain them for a late subscriber.&lt;/p&gt;

&lt;p&gt;I added a test that records both operations and asserts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subscribe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delay&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That guards against the race directly instead of hoping the worker is always slower than the request.&lt;/p&gt;
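&lt;p&gt;One way such an ordering test might be structured is with recording fakes. Everything here is a hypothetical stand-in: &lt;code&gt;FakePubSub&lt;/code&gt;, &lt;code&gt;FakeTask&lt;/code&gt;, &lt;code&gt;start_refinement&lt;/code&gt;, and the channel name are illustrative, not the project's real objects.&lt;/p&gt;

```python
import asyncio


class FakePubSub:
    """Hypothetical stand-in for a Redis pub/sub handle."""

    def __init__(self, order):
        self.order = order

    async def subscribe(self, channel):
        self.order.append("subscribe")


class FakeTask:
    """Hypothetical stand-in for a Celery task exposing delay()."""

    def __init__(self, order):
        self.order = order

    def delay(self, *args, **kwargs):
        self.order.append("delay")


async def start_refinement(pubsub, task, channel):
    # Listen first so no early event is published before anyone is subscribed.
    await pubsub.subscribe(channel)
    task.delay(channel)


order = []
asyncio.run(start_refinement(FakePubSub(order), FakeTask(order), "refine:123"))
```

&lt;p&gt;Swapping the two calls inside &lt;code&gt;start_refinement&lt;/code&gt; flips the recorded order, which is exactly what the assertion is meant to catch.&lt;/p&gt;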

&lt;h3&gt;
  
  
  What broke, and how I fixed it
&lt;/h3&gt;

&lt;p&gt;The feature worked in manual testing. Review still found two production-shaped failures.&lt;/p&gt;

&lt;h4&gt;
  
  
  The timeout measured age, not idleness
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnrmsyoui4zjdueru2gd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnrmsyoui4zjdueru2gd.png" alt="Two timelines: a fixed wall-clock cutoff severs a healthy but slow stream, while an idle timeout resets on each event and still delivers the terminal event" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A wall-clock deadline cuts a healthy-but-slow stream; an idle timeout resets on every delivered event.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;My first relay used one wall-clock deadline. Once the stream reached that age, the relay stopped.&lt;/p&gt;

&lt;p&gt;That confuses two very different states:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no events are arriving because the worker is dead;&lt;/li&gt;
&lt;li&gt;events are arriving, but the task is legitimately slow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A long refinement could still be delivering tokens and get disconnected simply because it crossed the original cutoff.&lt;/p&gt;

&lt;p&gt;The fix turned the deadline into an idle timeout. Every delivered event resets it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;deadline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;REFINE_RELAY_IDLE_TIMEOUT_SECONDS&lt;/span&gt;

&lt;span class="c1"&gt;# after each delivered event
&lt;/span&gt;&lt;span class="n"&gt;deadline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;REFINE_RELAY_IDLE_TIMEOUT_SECONDS&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The regression test advances a fake clock beyond the original deadline between two events. The relay must still deliver the terminal event because the first event reset its idle window.&lt;/p&gt;

&lt;p&gt;The rule became precise: time out silence, not duration.&lt;/p&gt;
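
&lt;p&gt;The regression test described above can be sketched in a few lines. This is a simplified, synchronous model rather than the real relay: &lt;code&gt;FakeClock&lt;/code&gt;, &lt;code&gt;scripted_events&lt;/code&gt;, &lt;code&gt;relay&lt;/code&gt;, and the &lt;code&gt;reset_on_event&lt;/code&gt; switch are hypothetical names, and the 20 s and 30 s values are made up for illustration.&lt;/p&gt;

```python
class FakeClock:
    """Manually advanced clock so the test controls the passage of time."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


def scripted_events(clock, script):
    """Yield each event after advancing the fake clock by its delay."""
    for delay, event in script:
        clock.advance(delay)
        yield event


def relay(events, clock, idle_timeout, reset_on_event=True):
    """Deliver events until the terminal one arrives or the window elapses."""
    deadline = clock.time() + idle_timeout
    delivered = []
    for event in events:
        if clock.time() > deadline:
            # The window ran out before this event arrived: give up.
            delivered.append("timeout")
            break
        delivered.append(event)
        if event == "done":
            break
        if reset_on_event:
            # Idle timeout: every delivered event opens a fresh window.
            deadline = clock.time() + idle_timeout
    return delivered


clock = FakeClock()
# 60 s in total, but never more than 20 s of silence.
script = [(20, "token"), (20, "token"), (20, "done")]
print(relay(scripted_events(clock, script), clock, idle_timeout=30))
```

&lt;p&gt;With &lt;code&gt;reset_on_event=False&lt;/code&gt; the same script behaves like the original wall-clock deadline and is cut off after the first token, which is the failure the test is meant to pin down.&lt;/p&gt;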

&lt;h4&gt;
  
  
  A Redis failure could retry durable work
&lt;/h4&gt;

&lt;p&gt;The more dangerous bug crossed a consistency boundary.&lt;/p&gt;

&lt;p&gt;The Celery task writes durable state: assistant messages, token usage, persona files, and skills. It also publishes ephemeral Redis events for the live stream.&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;publish()&lt;/code&gt; raises and that exception escapes the task, Celery may retry the entire refinement. That is acceptable before durable work is committed. It is dangerous after commit, because the retry can duplicate an assistant turn, usage accounting, or file updates.&lt;/p&gt;

&lt;p&gt;The two systems cannot form one atomic transaction. Redis cannot roll back the database, and the database cannot guarantee SSE delivery.&lt;/p&gt;

&lt;p&gt;So I chose the failure hierarchy explicitly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Durable state is authoritative.&lt;/li&gt;
&lt;li&gt;Redis delivery is best-effort.&lt;/li&gt;
&lt;li&gt;A publish failure is logged but never controls task retry.&lt;/li&gt;
&lt;li&gt;Terminal events are attempted only after the matching durable state is committed.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed to publish refine event&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Live token events still occur while the model is streaming, before the final turn commit. The important guarantee is not that every publish happens after a commit. It is that a transport failure cannot turn into a second execution of durable work.&lt;/p&gt;

&lt;p&gt;That trades a degraded live stream for consistent stored state. The user may miss an event and need to reload the committed result. The system must not create the same turn twice.&lt;/p&gt;
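
&lt;p&gt;A test for this boundary can be sketched with a fake client that always raises. The names &lt;code&gt;FailingRedis&lt;/code&gt;, &lt;code&gt;publish_best_effort&lt;/code&gt;, and &lt;code&gt;refine_task&lt;/code&gt; are hypothetical, and a list append stands in for the database commit; the real task is more involved.&lt;/p&gt;

```python
import asyncio
import logging

logger = logging.getLogger("refine")


class FailingRedis:
    """Hypothetical stand-in for a Redis client whose publish always fails."""

    async def publish(self, channel, payload):
        raise ConnectionError("redis unavailable")


async def publish_best_effort(redis_client, channel, payload):
    """Try to publish; report failure to the caller instead of raising."""
    try:
        await redis_client.publish(channel, payload)
        return True
    except Exception:
        logger.exception("failed to publish refine event")
        return False


async def refine_task(redis_client, committed_turns):
    # Stands in for committing the durable state first.
    committed_turns.append("assistant turn")
    # The terminal event is attempted only after the commit, and its
    # failure never escapes the task, so nothing can trigger a retry.
    return await publish_best_effort(redis_client, "refine:1", "done")


committed = []
delivered = asyncio.run(refine_task(FailingRedis(), committed))
```

&lt;p&gt;After the run, &lt;code&gt;committed&lt;/code&gt; holds exactly one turn and &lt;code&gt;delivered&lt;/code&gt; is &lt;code&gt;False&lt;/code&gt;: the stream is degraded, the stored state is not.&lt;/p&gt;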

&lt;h4&gt;
  
  
  Shared behavior was starting to fork
&lt;/h4&gt;

&lt;p&gt;My first implementation also duplicated the logic that applies model output to a persona: validating files, matching skills, and rebuilding derived content.&lt;/p&gt;

&lt;p&gt;That was not an immediate runtime failure, but it was a future correctness bug. Generation and refinement would eventually disagree about what a valid persona update meant.&lt;/p&gt;

&lt;p&gt;I extracted the behavior into a shared &lt;code&gt;persona_apply&lt;/code&gt; module used by both paths. The rule for applying model output now has one implementation and one place to harden.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I took away
&lt;/h3&gt;

&lt;p&gt;Distributed correctness often lives in ordering rather than in a large algorithm.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Subscribe, then enqueue.&lt;/li&gt;
&lt;li&gt;Commit durable state before announcing completion.&lt;/li&gt;
&lt;li&gt;Reset a timeout on activity.&lt;/li&gt;
&lt;li&gt;Do not let an ephemeral transport decide whether committed work runs again.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each fix was small. Finding the need for it required treating the API connection, Redis channel, Celery task, and database transaction as separate systems instead of one feature-shaped box.&lt;/p&gt;

&lt;p&gt;The tests had to force those boundaries: a worker that publishes immediately, a clock that passes the original deadline while events still arrive, and a Redis client that raises at the worst possible moment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why I picked it
&lt;/h3&gt;

&lt;p&gt;It is the same lesson as the write-ahead log at a larger scale.&lt;/p&gt;

&lt;p&gt;In the file-backed store, correctness depended on what happened between writing bytes and updating the index, or between a complete record and a torn tail. In Anvila, it depended on what happened between subscribing and enqueueing, or between committing the database transaction and publishing the completion event.&lt;/p&gt;

&lt;p&gt;The happy path compresses each pair into one mental step. Production separates them again.&lt;/p&gt;

&lt;h2&gt;
  
  
  The throughline
&lt;/h2&gt;

&lt;p&gt;The demo is the weakest test I have.&lt;/p&gt;

&lt;p&gt;It exercises the sequence I designed under the timing I expected, with every dependency available. It does not crash between two instructions. It does not schedule the unlucky request. It does not slow a healthy worker down or fail Redis after Postgres commits.&lt;/p&gt;

&lt;p&gt;The work that made both tasks defensible was the same:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find the boundary the happy path hides.&lt;/li&gt;
&lt;li&gt;Decide which side of that boundary is authoritative.&lt;/li&gt;
&lt;li&gt;Define the failure behavior before writing the fix.&lt;/li&gt;
&lt;li&gt;Write a test that forces the failure on purpose.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The bugs that taught me the most were not in the line I was looking at. They were in the space between two lines I had treated as one operation.&lt;/p&gt;

</description>
      <category>backend</category>
      <category>python</category>
      <category>fastapi</category>
      <category>testing</category>
    </item>
    <item>
      <title>I Built a CLI That Writes Its Own Docker Config — Then Taught It to Say No</title>
      <dc:creator>Kelechi Uba</dc:creator>
      <pubDate>Fri, 08 May 2026 13:28:17 +0000</pubDate>
      <link>https://dev.to/kelechi_uba_d8ec694684838/i-built-a-cli-that-writes-its-own-docker-config-then-taught-it-to-say-no-4on7</link>
      <guid>https://dev.to/kelechi_uba_d8ec694684838/i-built-a-cli-that-writes-its-own-docker-config-then-taught-it-to-say-no-4on7</guid>
      <description>&lt;p&gt;Every time I set up a stack from scratch I'd end up touching at least four files: &lt;code&gt;docker-compose.yml&lt;/code&gt;, &lt;code&gt;nginx.conf&lt;/code&gt;, a &lt;code&gt;.env&lt;/code&gt; file, maybe a &lt;code&gt;Makefile&lt;/code&gt;. Change the port in one place and forget to update the others and something silently breaks. I wanted to fix that. Stage 4A was the fix. Stage 4B was the moment I realised the fix was incomplete.&lt;/p&gt;

&lt;p&gt;This post covers the whole journey: how I built &lt;code&gt;swiftdeploy&lt;/code&gt;, why I wired in Prometheus metrics and an OPA policy sidecar, and what actually happened when I deliberately tried to break my own canary deployment.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 4A: One file, everything else is generated
&lt;/h2&gt;

&lt;p&gt;The idea was simple. One file — &lt;code&gt;manifest.yaml&lt;/code&gt; — owns every setting. The CLI reads it and writes &lt;code&gt;nginx.conf&lt;/code&gt; and &lt;code&gt;docker-compose.yml&lt;/code&gt;. You never touch the generated files. If you need to change something, you change the manifest and run &lt;code&gt;./swiftdeploy init&lt;/code&gt; again.&lt;/p&gt;

&lt;p&gt;The manifest looks like this at its base:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;swiftdeploy-stage4b-app:1.0.0&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stable&lt;/span&gt;

&lt;span class="na"&gt;nginx&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx:latest&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;18080&lt;/span&gt;
  &lt;span class="na"&gt;proxy_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;

&lt;span class="na"&gt;network&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;swiftdeploy-net&lt;/span&gt;
  &lt;span class="na"&gt;driver_type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bridge&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;swiftdeploy init&lt;/code&gt; takes that and renders two generated files using Python's &lt;code&gt;string.Template&lt;/code&gt;. The templates live in &lt;code&gt;templates/&lt;/code&gt; and contain &lt;code&gt;${VARIABLE}&lt;/code&gt; placeholders that get substituted from the manifest context. Here is the critical bit from &lt;code&gt;config.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;render_templates&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;ensure_policy_source&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;manifest_context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;          &lt;span class="c1"&gt;# reads every ${VAR} from manifest.yaml
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tmpl_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_path&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;NGINX_TMPL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NGINX_OUT&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;COMPOSE_TMPL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;COMPOSE_OUT&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="n"&gt;rendered&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmpl_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;safe_substitute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;atomic_write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rendered&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I used &lt;code&gt;safe_substitute&lt;/code&gt; instead of &lt;code&gt;substitute&lt;/code&gt; because &lt;code&gt;substitute&lt;/code&gt; raises &lt;code&gt;KeyError&lt;/code&gt; for any &lt;code&gt;${...}&lt;/code&gt; placeholder that is missing from the context. Nginx config files are full of variables like &lt;code&gt;${request_time}&lt;/code&gt;, so with &lt;code&gt;substitute&lt;/code&gt; rendering would fail on the first nginx variable it met. &lt;code&gt;safe_substitute&lt;/code&gt; leaves tokens it doesn't recognise alone, so nginx gets its variables and the manifest gets its values. The trade-off is that a misspelled manifest placeholder is also left in place silently instead of raising.&lt;/p&gt;
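
&lt;p&gt;The difference is easy to see in isolation. In this small sketch the template text and the &lt;code&gt;NGINX_PORT&lt;/code&gt; placeholder are made up for illustration; only the &lt;code&gt;string.Template&lt;/code&gt; behaviour is the point.&lt;/p&gt;

```python
from string import Template

# One manifest placeholder and one nginx runtime variable.
template = Template("listen ${NGINX_PORT};\nlog_format timing '${request_time}';\n")
ctx = {"NGINX_PORT": 18080}

# safe_substitute fills what it knows and leaves the rest verbatim.
rendered = template.safe_substitute(ctx)

# substitute refuses to render because request_time is not in ctx.
try:
    template.substitute(ctx)
    strict_error = None
except KeyError as exc:
    strict_error = exc.args[0]

print(rendered)
print("substitute failed on:", strict_error)
```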

&lt;p&gt;The &lt;code&gt;atomic_write&lt;/code&gt; helper writes to a temp file first, then does &lt;code&gt;os.replace&lt;/code&gt; into the final path. The reason: if a direct write crashes halfway, you end up with a corrupt config. &lt;code&gt;os.replace&lt;/code&gt; is atomic on POSIX systems as long as the temp file sits on the same filesystem as the target, so you either get the new file or the old one, never half of each.&lt;/p&gt;
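
&lt;p&gt;A minimal sketch of such a helper, assuming the temp file is created next to the target so the final rename stays on one filesystem. This is not necessarily how the &lt;code&gt;swiftdeploy&lt;/code&gt; helper is written; the signature and the &lt;code&gt;fsync&lt;/code&gt; call are assumptions.&lt;/p&gt;

```python
import os
import tempfile
from pathlib import Path


def atomic_write(path, text):
    """Write text to a temp file beside path, then swap it into place."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(
        dir=str(path.parent), prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before the swap
        os.replace(tmp_name, path)
    except BaseException:
        # The old file is untouched; remove the partial temp file if it exists.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

&lt;p&gt;A reader of the target path sees either the previous contents or the complete new contents, because the only step that touches the target is the rename.&lt;/p&gt;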

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdal985fca5gwpga1q666.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdal985fca5gwpga1q666.jpg" alt="manifest.yaml feeds swiftdeploy init which renders nginx.conf and docker-compose.yml — generated files carry DO NOT HAND-EDIT headers" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The app
&lt;/h3&gt;

&lt;p&gt;The API service is a FastAPI app with three endpoints: &lt;code&gt;GET /&lt;/code&gt; returns the mode and version, &lt;code&gt;GET /healthz&lt;/code&gt; returns uptime, and &lt;code&gt;POST /chaos&lt;/code&gt; lets you inject failure (more on that later). The &lt;code&gt;MODE&lt;/code&gt; environment variable controls whether the app is in stable or canary mode — same image, different behaviour. In canary mode every response carries an &lt;code&gt;X-Mode: canary&lt;/code&gt; header.&lt;/p&gt;

&lt;h3&gt;
  
  
  The deployment lifecycle
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;./swiftdeploy deploy&lt;/code&gt; calls &lt;code&gt;init&lt;/code&gt; first, then does &lt;code&gt;docker compose up -d&lt;/code&gt;, then polls &lt;code&gt;/healthz&lt;/code&gt; through nginx every second until it gets a 200 or 60 seconds pass. Nginx waits for the app to be healthy before it starts (&lt;code&gt;depends_on: condition: service_healthy&lt;/code&gt;), so the health poll through nginx is a genuine end-to-end check.&lt;/p&gt;
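
&lt;p&gt;The polling step could look roughly like this. It is a sketch with hypothetical names (&lt;code&gt;http_status&lt;/code&gt;, &lt;code&gt;wait_until_healthy&lt;/code&gt;); the clock and sleep functions are injectable only so the loop can be tested without waiting.&lt;/p&gt;

```python
import time
import urllib.error
import urllib.request


def http_status(url):
    """Return the HTTP status for url, including error statuses."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code


def wait_until_healthy(check, timeout=60.0, interval=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Call check() every interval seconds until it returns 200 or time runs out."""
    deadline = clock() + timeout
    while True:
        try:
            if check() == 200:
                return True
        except OSError:
            # Connection refused or reset while containers are still starting.
            pass
        if clock() >= deadline:
            return False
        sleep(interval)
```

&lt;p&gt;A deploy command would call something like &lt;code&gt;wait_until_healthy(lambda: http_status("http://127.0.0.1:18080/healthz"))&lt;/code&gt; and treat &lt;code&gt;False&lt;/code&gt; as a failed deploy.&lt;/p&gt;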

&lt;p&gt;&lt;code&gt;./swiftdeploy promote canary&lt;/code&gt; mutates &lt;code&gt;services.mode&lt;/code&gt; in &lt;code&gt;manifest.yaml&lt;/code&gt; using a targeted regex — one line changes, nothing else. It then re-renders &lt;code&gt;docker-compose.yml&lt;/code&gt;, recreates only the app container (&lt;code&gt;--no-deps --force-recreate&lt;/code&gt;), and confirms the mode by checking both the JSON body and the &lt;code&gt;X-Mode&lt;/code&gt; header. If either signal is wrong, the promote fails.&lt;/p&gt;
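
&lt;p&gt;A targeted edit of that kind can be sketched as below. The pattern and the &lt;code&gt;set_mode&lt;/code&gt; name are assumptions, and the sketch relies on the manifest containing a single indented &lt;code&gt;mode:&lt;/code&gt; line; it does not check that the line sits under &lt;code&gt;services&lt;/code&gt;.&lt;/p&gt;

```python
import re

# Match one indented "mode: stable|canary" line without touching the newline.
MODE_LINE = re.compile(r"^([ \t]+mode:[ \t]*)(stable|canary)[ \t]*$", re.MULTILINE)


def set_mode(manifest_text, new_mode):
    """Rewrite the mode line and leave every other byte of the manifest alone."""
    if new_mode not in ("stable", "canary"):
        raise ValueError(f"unknown mode: {new_mode}")
    updated, count = MODE_LINE.subn(
        lambda match: match.group(1) + new_mode, manifest_text, count=1
    )
    if count == 0:
        raise ValueError("no mode line found in manifest")
    return updated


manifest = (
    "services:\n"
    "  image: swiftdeploy-stage4b-app:1.0.0\n"
    "  port: 3000\n"
    "  mode: stable\n"
    "\n"
    "nginx:\n"
    "  image: nginx:latest\n"
)
promoted = set_mode(manifest, "canary")
```

&lt;p&gt;Editing the text instead of loading and re-dumping the YAML keeps comments, key order, and blank lines exactly as they were.&lt;/p&gt;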

&lt;p&gt;&lt;code&gt;./swiftdeploy teardown --clean&lt;/code&gt; brings everything down and deletes the generated configs. Running &lt;code&gt;./swiftdeploy init&lt;/code&gt; afterwards regenerates byte-identical files. The grader can verify this. That idempotency guarantee is the whole point of the manifest-driven approach.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Stage 4A wasn't enough
&lt;/h2&gt;

&lt;p&gt;After building that I realised I had no visibility into what was happening inside the stack once it was running, and no automatic safety check before promoting. I was flying blind. I could deploy a canary that was returning 500 errors on every request and &lt;code&gt;promote stable&lt;/code&gt; would just do it, no questions asked.&lt;/p&gt;

&lt;p&gt;Stage 4B adds three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Eyes&lt;/strong&gt; — a &lt;code&gt;/metrics&lt;/code&gt; endpoint in Prometheus text format so I can see what is happening&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Brain&lt;/strong&gt; — an OPA sidecar that makes every allow/deny decision so the CLI never has to&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt; — &lt;code&gt;history.jsonl&lt;/code&gt; and &lt;code&gt;audit_report.md&lt;/code&gt; so there is a record of what happened and when&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwp9yxgmci0zqr3h0fc1g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwp9yxgmci0zqr3h0fc1g.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The metrics endpoint
&lt;/h2&gt;

&lt;p&gt;The app exposes &lt;code&gt;GET /metrics&lt;/code&gt; and returns Prometheus text format — no Prometheus library, hand-rolled. Here is what it looks like right after a fresh deploy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;curl &lt;span class="nt"&gt;-i&lt;/span&gt; http://127.0.0.1:18080/metrics
&lt;span class="go"&gt;HTTP/1.1 200 OK
&lt;/span&gt;&lt;span class="gp"&gt;Content-Type: text/plain;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.0.4&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nv"&gt;charset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;utf-8
&lt;span class="go"&gt;X-Deployed-By: swiftdeploy

&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;HELP http_requests_total Total HTTP requests by method, path, and status code.
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;TYPE http_requests_total counter
&lt;span class="go"&gt;http_requests_total{method="GET",path="/healthz",status_code="200"} 2
&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;HELP http_request_duration_seconds HTTP request latency histogram &lt;span class="k"&gt;in &lt;/span&gt;seconds.
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;TYPE http_request_duration_seconds histogram
&lt;span class="go"&gt;http_request_duration_seconds_bucket{method="GET",path="/healthz",le="0.005"} 2
http_request_duration_seconds_bucket{method="GET",path="/healthz",le="0.01"} 2
&lt;/span&gt;&lt;span class="c"&gt;...
&lt;/span&gt;&lt;span class="go"&gt;http_request_duration_seconds_bucket{method="GET",path="/healthz",le="+Inf"} 2
http_request_duration_seconds_sum{method="GET",path="/healthz"} 0.001272395
http_request_duration_seconds_count{method="GET",path="/healthz"} 2
&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;HELP app_uptime_seconds Process &lt;span class="nb"&gt;uptime &lt;/span&gt;&lt;span class="k"&gt;in &lt;/span&gt;seconds.
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;TYPE app_uptime_seconds gauge
&lt;span class="go"&gt;app_uptime_seconds 4.557
&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;HELP app_mode Current deployment mode, &lt;span class="nv"&gt;stable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 and &lt;span class="nv"&gt;canary&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1.
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;TYPE app_mode gauge
&lt;span class="go"&gt;app_mode 0
&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;HELP chaos_active Current chaos state, &lt;span class="nv"&gt;none&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="nv"&gt;slow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nv"&gt;error&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2.
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;TYPE chaos_active gauge
&lt;span class="go"&gt;chaos_active 0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The histogram buckets are cumulative — each &lt;code&gt;le&lt;/code&gt; bucket contains all requests at or below that latency. Two requests, both under 5 ms, so every bucket from &lt;code&gt;le="0.005"&lt;/code&gt; upward shows 2. The &lt;code&gt;+Inf&lt;/code&gt; bucket always equals &lt;code&gt;_count&lt;/code&gt;.&lt;/p&gt;
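
&lt;p&gt;Rendering cumulative buckets by hand takes only a few lines. This sketch is not the app's actual code: &lt;code&gt;render_histogram&lt;/code&gt; is a hypothetical name and the bucket bounds beyond the two shown in the output above are assumed.&lt;/p&gt;

```python
# Upper bounds in seconds; values past 0.01 are assumed for illustration.
BUCKETS = (0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0)


def render_histogram(name, labels, durations, buckets=BUCKETS):
    """Return Prometheus text lines for one labelled histogram series."""
    lines = []
    for bound in buckets:
        # Cumulative: count every observation at or below this bound.
        count = sum(1 for value in durations if bound >= value)
        lines.append(f'{name}_bucket{{{labels},le="{bound}"}} {count}')
    # The +Inf bucket holds every observation, so it matches _count.
    lines.append(f'{name}_bucket{{{labels},le="+Inf"}} {len(durations)}')
    lines.append(f"{name}_sum{{{labels}}} {sum(durations)}")
    lines.append(f"{name}_count{{{labels}}} {len(durations)}")
    return lines


lines = render_histogram(
    "http_request_duration_seconds",
    'method="GET",path="/healthz"',
    [0.0006, 0.0007],
)
print("\n".join(lines))
```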

&lt;p&gt;&lt;code&gt;/metrics&lt;/code&gt; and &lt;code&gt;/chaos&lt;/code&gt; are deliberately exempt from chaos middleware. The reason: if error chaos is injected at 100% rate and &lt;code&gt;/metrics&lt;/code&gt; also returned 500s, the CLI would lose its ability to observe the failure and the policy loop would go blind.&lt;/p&gt;




&lt;h2&gt;
  
  
  The policy brain: why OPA and not a Python if-statement
&lt;/h2&gt;

&lt;p&gt;My first instinct was to put the threshold checks directly in the CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# what I did NOT do
&lt;/span&gt;&lt;span class="n"&gt;disk_free&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;disk_usage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;free&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;disk_free&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deploy blocked: not enough disk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem with this is that the threshold is a magic number in Python code. If you want to change it you edit the Python. If someone else has a different threshold they fork the script. There is no audit trail of what value was used when. And the policy is not testable in isolation.&lt;/p&gt;

&lt;p&gt;OPA solves this differently. The CLI collects facts and sends them to OPA as a JSON document. OPA evaluates Rego rules against the document and returns a decision. The CLI enforces whatever OPA says. The CLI never checks a threshold itself.&lt;/p&gt;
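
&lt;p&gt;The round trip can be sketched with the standard library and OPA's Data API, which accepts the facts under an &lt;code&gt;input&lt;/code&gt; key and answers under &lt;code&gt;result&lt;/code&gt;. The function names and the policy path passed in are hypothetical, and failing closed when OPA returns no result is an assumption of this sketch.&lt;/p&gt;

```python
import json
import urllib.request

OPA_URL = "http://127.0.0.1:18181"


def http_post_json(url, body):
    """POST a JSON body and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


def ask_opa(policy_path, facts, post=http_post_json):
    """Send facts as OPA input and return the decision document unchanged."""
    reply = post(f"{OPA_URL}/v1/data/{policy_path}", {"input": facts})
    decision = reply.get("result")
    if decision is None:
        # OPA omits "result" when the path is undefined; treat that as a deny.
        return {"allowed": False, "reason": "policy returned no decision", "violations": []}
    return decision
```

&lt;p&gt;Nothing in the sketch compares a number to a threshold. The CLI side only transports facts and reads back &lt;code&gt;allowed&lt;/code&gt;.&lt;/p&gt;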

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0982mk60id5y4fsju01y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0982mk60id5y4fsju01y.png" alt=" " width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The infrastructure policy lives in &lt;code&gt;policies/infrastructure.rego&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rego"&gt;&lt;code&gt;&lt;span class="n"&gt;deny&lt;/span&gt; &lt;span class="n"&gt;contains&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"disk_free_too_low"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"message"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"disk free %vGB is below required %vGB"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;disk_free_gb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thresholds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min_disk_free_gb&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="s2"&gt;"observed"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;disk_free_gb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"threshold"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thresholds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min_disk_free_gb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;supported_question&lt;/span&gt;
    &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;disk_free_gb&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thresholds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min_disk_free_gb&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice &lt;code&gt;input.thresholds.min_disk_free_gb&lt;/code&gt; — not a hardcoded number. The threshold comes from the input document, which the CLI builds from &lt;code&gt;manifest.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;infrastructure_input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;manifest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_manifest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;host&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;host_stats&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;                                        &lt;span class="c1"&gt;# disk_free_gb, cpu_load
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thresholds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;policy_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;infrastructure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;   &lt;span class="c1"&gt;# from manifest.yaml
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The manifest has:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;infrastructure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;min_disk_free_gb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;max_cpu_load&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2.0&lt;/span&gt;
  &lt;span class="na"&gt;canary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;max_error_rate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.01&lt;/span&gt;
    &lt;span class="na"&gt;max_p99_latency_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Changing a threshold is a one-line edit in &lt;code&gt;manifest.yaml&lt;/code&gt;. The Rego file never changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why OPA runs as a sidecar
&lt;/h3&gt;

&lt;p&gt;OPA runs as a separate Docker container on the same internal network. The CLI talks to it on &lt;code&gt;127.0.0.1:18181&lt;/code&gt;. The important thing is what is NOT there: there is no nginx upstream for OPA. The nginx config has exactly one &lt;code&gt;location / { proxy_pass http://app_backend; }&lt;/code&gt; block. Requests through port 18080 reach the app and nothing else.&lt;/p&gt;

&lt;p&gt;The OPA port binding is &lt;code&gt;127.0.0.1:18181:8181&lt;/code&gt;, which publishes it on the host's loopback interface only. External machines cannot reach OPA directly. And although nginx shares the Docker network with OPA, its config has no upstream or location that proxies to the OPA container, so a client hitting nginx cannot tunnel through to OPA.&lt;/p&gt;

&lt;h3&gt;
  
  
  OPA never returns a bare boolean
&lt;/h3&gt;

&lt;p&gt;Every decision object carries &lt;code&gt;allowed&lt;/code&gt;, &lt;code&gt;reason&lt;/code&gt;, and a &lt;code&gt;violations&lt;/code&gt; list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"domain"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"infrastructure"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"question"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pre_deploy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"allowed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"infrastructure policy denied: 1 violation(s)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"violations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"disk_free_too_low"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"disk free 121.359GB is below required 1e+06GB"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"observed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;121.359&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"threshold"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1000000.0&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CLI prints the reason and each violation ID. An operator looking at a denied deploy sees exactly which rule fired and what values triggered it, not just "denied".&lt;/p&gt;
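
&lt;p&gt;As a rough sketch, turning a decision object like the one above into CLI lines could look like this. The function name is hypothetical; the field names are the ones in the JSON, and a missing &lt;code&gt;allowed&lt;/code&gt; field is treated as a denial.&lt;/p&gt;

```python
def format_decision(decision: dict) -> list[str]:
    """Render one decision object as a verdict line plus one line per violation."""
    # Fail closed: anything other than an explicit True is reported as FAIL.
    tag = "PASS" if decision.get("allowed") is True else "FAIL"
    lines = [f"[{tag}] policy/{decision['domain']}: {decision['reason']}"]
    for violation in decision.get("violations", []):
        lines.append(f"  - {violation['id']}: {violation['message']}")
    return lines


decision = {
    "domain": "infrastructure",
    "question": "pre_deploy",
    "allowed": False,
    "reason": "infrastructure policy denied: 1 violation(s)",
    "violations": [
        {
            "id": "disk_free_too_low",
            "message": "disk free 121.359GB is below required 1e+06GB",
            "observed": 121.359,
            "threshold": 1000000.0,
        }
    ],
}

for line in format_decision(decision):
    print(line)
```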




&lt;h2&gt;
  
  
  The pre-deploy gate in action
&lt;/h2&gt;

&lt;p&gt;To prove the deploy gate worked I temporarily set &lt;code&gt;min_disk_free_gb: 1000000&lt;/code&gt; in the manifest — an impossible threshold — and ran &lt;code&gt;./swiftdeploy deploy&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;./swiftdeploy deploy    &lt;span class="c"&gt;# with min_disk_free_gb set to 1000000&lt;/span&gt;
&lt;span class="go"&gt;swiftdeploy deploy: rendering and starting policy sidecar
swiftdeploy init: rendering generated files from manifest.yaml
rendered nginx.conf &amp;lt;- templates\nginx.conf.tmpl
rendered docker-compose.yml &amp;lt;- templates\docker-compose.tmpl
OK: nginx.conf and docker-compose.yml regenerated.
swiftdeploy policy: starting OPA sidecar
[PASS] OPA health check passed
swiftdeploy deploy: querying pre-deploy policy
[FAIL] policy/infrastructure: infrastructure policy denied: 1 violation(s)
  - disk_free_too_low: disk free 121.359GB is below required 1e+06GB
&lt;/span&gt;&lt;span class="gp"&gt;[FAIL] deploy blocked by policy;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;app and nginx were not started
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OPA starts. OPA checks. OPA denies. The app and nginx containers never even get created. After restoring the threshold to 10, deploy succeeds in under 2 seconds.&lt;/p&gt;




&lt;h2&gt;
  
  
  The live status dashboard
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;./swiftdeploy status&lt;/code&gt; scrapes &lt;code&gt;/metrics&lt;/code&gt; every 5 seconds, calculates req/s and P99 latency against the previous snapshot, queries both OPA domains for their current verdict, and appends a record to &lt;code&gt;history.jsonl&lt;/code&gt;. Here is what a healthy stable deployment looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;./swiftdeploy status &lt;span class="nt"&gt;--once&lt;/span&gt;
&lt;span class="go"&gt;SwiftDeploy status @ 2026-05-06T18:37:02.816576+00:00
mode=stable chaos=none uptime=5.2s
req/s=2.000 error_rate=0.00% p99=0.005s window=0.0s
Policy Compliance:
[PASS] policy/infrastructure: infrastructure policy passed
[PASS] policy/canary: canary safety policy passed
history appended: history.jsonl
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both policies show green. P99 is 5 ms. Now watch what happens after chaos.&lt;/p&gt;
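
&lt;p&gt;The req/s and P99 figures are computed from the difference between two scrapes. Here is a minimal sketch of that arithmetic, assuming the app exposes a Prometheus-style cumulative histogram. The function names and the sample numbers are illustrative, and P99 is reported as a bucket upper bound with no interpolation, which may differ from what the project does.&lt;/p&gt;

```python
def rate_per_second(prev_total: float, curr_total: float, window_s: float) -> float:
    """Requests per second across the window between two scrapes."""
    if not window_s > 0:
        return 0.0
    # max() guards against a counter reset (for example, an app restart).
    return max(curr_total - prev_total, 0.0) / window_s


def p99_from_buckets(prev: dict[float, float], curr: dict[float, float]) -> float:
    """Estimate P99 as the smallest bucket bound covering 99% of new observations.

    Both arguments map a cumulative bucket's upper bound (the `le` label) to its
    count; float("inf") stands for the +Inf bucket.
    """
    deltas = {le: curr[le] - prev.get(le, 0.0) for le in sorted(curr)}
    total = deltas.get(float("inf"), 0.0)
    if not total > 0:
        return 0.0  # no new observations in this window
    target = 0.99 * total
    for le, count in deltas.items():  # iterates in ascending bucket order
        if count >= target:
            return le
    return float("inf")


inf = float("inf")
previous = {0.005: 0.0, 0.01: 0.0, inf: 0.0}
current = {0.005: 98.0, 0.01: 100.0, inf: 100.0}
print(rate_per_second(100.0, 110.0, 5.0), p99_from_buckets(previous, current))
```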




&lt;h2&gt;
  
  
  Chaos mode and what the dashboard showed
&lt;/h2&gt;

&lt;p&gt;After promoting to canary I injected a 100% error rate:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rojn638op9meg9cgfp8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rojn638op9meg9cgfp8.jpg" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="go"&gt;    -d '{"mode":"error","rate":1.0}' http://127.0.0.1:18080/chaos

{"chaos":{"mode":"error","duration":0.0,"rate":1.0}}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The canary now returns 500 on every non-exempt request, and the next status scrape shows it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;SwiftDeploy status @ 2026-05-06T18:37:10.281061+00:00
mode=canary chaos=error uptime=6.7s
req/s=4.469 error_rate=100.00% p99=0.005s window=1.1s
Policy Compliance:
[PASS] policy/infrastructure: infrastructure policy passed
[FAIL] policy/canary: canary safety policy denied: 1 violation(s)
  - error_rate_too_high: error rate 1.0000 is above allowed 0.0100
history appended: history.jsonl
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The canary policy is now red. &lt;code&gt;error_rate=100.00%&lt;/code&gt;. OPA knows. The status loop is recording this to &lt;code&gt;history.jsonl&lt;/code&gt; every scrape cycle.&lt;/p&gt;

&lt;p&gt;Next, I tried to promote back to stable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;./swiftdeploy promote stable
&lt;span class="go"&gt;swiftdeploy promote: target mode=stable
&lt;/span&gt;&lt;span class="c"&gt;...
&lt;/span&gt;&lt;span class="go"&gt;[PASS] OPA health check passed
  policy: querying canary safety before manifest mutation
[FAIL] policy/canary: canary safety policy denied: 1 violation(s)
  - error_rate_too_high: error rate 1.0000 is above allowed 0.0100
&lt;/span&gt;&lt;span class="gp"&gt;[FAIL] promote blocked by policy;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;manifest.yaml was not changed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last line is the safety guarantee that matters: &lt;code&gt;manifest.yaml was not changed&lt;/code&gt;. The policy check runs &lt;strong&gt;before&lt;/strong&gt; the manifest mutation. A failed check leaves the stack exactly as it was. No half-promote. No corrupted state.&lt;/p&gt;
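
&lt;p&gt;The ordering is the whole trick, and it is small enough to sketch. This is not the project's code: the &lt;code&gt;promote&lt;/code&gt; signature, the &lt;code&gt;mode:&lt;/code&gt; key, and the stubbed policy checks are assumptions made so the example runs on its own.&lt;/p&gt;

```python
import tempfile
from pathlib import Path
from typing import Callable


def promote(manifest: Path, new_text: str, check_policy: Callable[[], dict]) -> bool:
    """Write the new manifest only after the policy check has passed."""
    decision = check_policy()  # runs first, before any file is touched
    if decision.get("allowed") is not True:
        print("[FAIL] promote blocked by policy; manifest.yaml was not changed")
        return False
    manifest.write_text(new_text, encoding="utf-8")
    return True


with tempfile.TemporaryDirectory() as workdir:
    manifest = Path(workdir) / "manifest.yaml"
    manifest.write_text("mode: canary\n", encoding="utf-8")

    # A denied check leaves the file exactly as it was.
    blocked = promote(manifest, "mode: stable\n", lambda: {"allowed": False})
    after_block = manifest.read_text(encoding="utf-8")

    # An allowed check is the only path that reaches the write.
    passed = promote(manifest, "mode: stable\n", lambda: {"allowed": True})
    after_pass = manifest.read_text(encoding="utf-8")
```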

&lt;p&gt;To recover from chaos:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="go"&gt;    -d '{"mode":"recover"}' http://127.0.0.1:18080/chaos

{"chaos":{"mode":null,"duration":0.0,"rate":0.0}}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next &lt;code&gt;promote stable&lt;/code&gt; succeeds because OPA now sees a clean error rate.&lt;/p&gt;




&lt;h2&gt;
  
  
  The audit trail
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;./swiftdeploy audit&lt;/code&gt; reads &lt;code&gt;history.jsonl&lt;/code&gt; and generates &lt;code&gt;audit_report.md&lt;/code&gt;. After the whole lifecycle above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;./swiftdeploy audit
&lt;span class="go"&gt;audit: wrote audit_report.md from 6 history record(s)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The report's timeline section:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Chaos&lt;/th&gt;
&lt;th&gt;Req/s&lt;/th&gt;
&lt;th&gt;Error Rate&lt;/th&gt;
&lt;th&gt;P99 Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-06T18:36:54&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;0.000&lt;/td&gt;
&lt;td&gt;0.00%&lt;/td&gt;
&lt;td&gt;0.000s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-06T18:36:55&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;0.000&lt;/td&gt;
&lt;td&gt;0.00%&lt;/td&gt;
&lt;td&gt;0.000s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-06T18:37:02&lt;/td&gt;
&lt;td&gt;stable&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;2.000&lt;/td&gt;
&lt;td&gt;0.00%&lt;/td&gt;
&lt;td&gt;0.005s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-06T18:37:04&lt;/td&gt;
&lt;td&gt;stable&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;4.721&lt;/td&gt;
&lt;td&gt;0.00%&lt;/td&gt;
&lt;td&gt;0.005s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-06T18:37:10&lt;/td&gt;
&lt;td&gt;canary&lt;/td&gt;
&lt;td&gt;error&lt;/td&gt;
&lt;td&gt;4.469&lt;/td&gt;
&lt;td&gt;100.00%&lt;/td&gt;
&lt;td&gt;0.005s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-06T18:37:12&lt;/td&gt;
&lt;td&gt;canary&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;4.519&lt;/td&gt;
&lt;td&gt;0.00%&lt;/td&gt;
&lt;td&gt;0.005s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The violations section:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- 2026-05-06T18:36:54  infrastructure  deny: disk_free_too_low
    disk free 121.359GB is below required 1e+06GB
- 2026-05-06T18:37:10  canary  deny: error_rate_too_high
    error rate 1.0000 is above allowed 0.0100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two violations, two causes, timestamps on both. The first was the intentional disk threshold test. The second was the chaos injection. Both are there even though neither resulted in a broken deployment — that is the point of an audit trail.&lt;/p&gt;
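
&lt;p&gt;A sketch of how a violations section like that could be derived from a JSONL log. The record layout used here (a &lt;code&gt;timestamp&lt;/code&gt; plus a &lt;code&gt;decisions&lt;/code&gt; list) is an assumption, since the article does not show the actual &lt;code&gt;history.jsonl&lt;/code&gt; schema, and the function name is hypothetical.&lt;/p&gt;

```python
import json


def violation_lines(history_lines: list[str]) -> list[str]:
    """Turn JSONL history records into the violation entries of an audit report."""
    report = []
    for raw in history_lines:
        if not raw.strip():
            continue  # tolerate blank lines in the log
        record = json.loads(raw)
        for decision in record.get("decisions", []):
            if decision.get("allowed") is True:
                continue  # only denials are listed
            for violation in decision.get("violations", []):
                report.append(
                    f"- {record['timestamp']}  {decision['domain']}  deny: {violation['id']}"
                )
                report.append(f"    {violation['message']}")
    return report


sample = [
    json.dumps({
        "timestamp": "2026-05-06T18:37:10",
        "decisions": [
            {"domain": "infrastructure", "allowed": True, "violations": []},
            {
                "domain": "canary",
                "allowed": False,
                "violations": [{
                    "id": "error_rate_too_high",
                    "message": "error rate 1.0000 is above allowed 0.0100",
                }],
            },
        ],
    }),
]
print("\n".join(violation_lines(sample)))
```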




&lt;h2&gt;
  
  
  Replicate it yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Clone&lt;/span&gt;
git clone https://github.com/Kaycee-dev/hng14-devops-stage4A swiftdeploy
&lt;span class="nb"&gt;cd &lt;/span&gt;swiftdeploy

&lt;span class="c"&gt;# 2. Install the one Python dependency&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;pyyaml

&lt;span class="c"&gt;# 3. Build the app image&lt;/span&gt;
docker build &lt;span class="nt"&gt;-t&lt;/span&gt; swiftdeploy-stage4b-app:1.0.0 &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# 4. Validate — should show 5 PASS lines&lt;/span&gt;
./swiftdeploy validate

&lt;span class="c"&gt;# 5. Deploy (OPA starts first, policy check runs, then app + nginx)&lt;/span&gt;
./swiftdeploy deploy

&lt;span class="c"&gt;# 6. Check status&lt;/span&gt;
./swiftdeploy status &lt;span class="nt"&gt;--once&lt;/span&gt;

&lt;span class="c"&gt;# 7. Promote to canary&lt;/span&gt;
./swiftdeploy promote canary

&lt;span class="c"&gt;# 8. Inject chaos&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"mode":"error","rate":1.0}'&lt;/span&gt; http://127.0.0.1:18080/chaos

&lt;span class="c"&gt;# 9. Watch status go red&lt;/span&gt;
./swiftdeploy status &lt;span class="nt"&gt;--once&lt;/span&gt;

&lt;span class="c"&gt;# 10. Try to promote to stable — policy blocks it&lt;/span&gt;
./swiftdeploy promote stable

&lt;span class="c"&gt;# 11. Recover&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"mode":"recover"}'&lt;/span&gt; http://127.0.0.1:18080/chaos

&lt;span class="c"&gt;# 12. Now promote succeeds&lt;/span&gt;
./swiftdeploy promote stable

&lt;span class="c"&gt;# 13. Generate the audit report&lt;/span&gt;
./swiftdeploy audit
&lt;span class="nb"&gt;cat &lt;/span&gt;audit_report.md

&lt;span class="c"&gt;# 14. Tear down and prove regeneration&lt;/span&gt;
./swiftdeploy teardown &lt;span class="nt"&gt;--clean&lt;/span&gt;
./swiftdeploy init      &lt;span class="c"&gt;# nginx.conf and docker-compose.yml come back byte-identical&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Windows note:&lt;/strong&gt; run everything inside Git Bash. &lt;code&gt;os.getloadavg()&lt;/code&gt; does not exist on Windows, so CPU load always reads as 0.0. The CPU policy check still works — to prove it, set &lt;code&gt;max_cpu_load: -1.0&lt;/code&gt; in &lt;code&gt;manifest.yaml&lt;/code&gt; and run deploy. That forces &lt;code&gt;0.0 &amp;gt; -1.0&lt;/code&gt; and you get a CPU denial. On Linux or macOS the real load average is used and a threshold of 2.0 is meaningful.&lt;/p&gt;
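
&lt;p&gt;A minimal sketch of a portable load reader that behaves the way the note describes. The function names and the &lt;code&gt;cpu_load&lt;/code&gt; fact name are assumptions; &lt;code&gt;os.getloadavg()&lt;/code&gt; itself is a standard library call that exists only on Unix-like platforms.&lt;/p&gt;

```python
import os


def read_cpu_load() -> float:
    """Return the 1-minute load average, or 0.0 where the platform has none."""
    getloadavg = getattr(os, "getloadavg", None)
    if getloadavg is None:
        return 0.0  # Windows: os.getloadavg is not defined
    try:
        return float(getloadavg()[0])
    except OSError:
        return 0.0  # the call can also fail on Unix if the load is unobtainable


def cpu_facts() -> dict:
    """Facts sent to the policy engine; the comparison itself stays in Rego."""
    return {"cpu_load": read_cpu_load()}


print(cpu_facts())
```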




&lt;h2&gt;
  
  
  Lessons learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The CLI is an enforcer, not a judge
&lt;/h3&gt;

&lt;p&gt;The most tempting shortcut was putting the threshold comparisons directly in the Python code. It would have been three lines. The problem is that once a threshold lives in Python, OPA is just logging middleware: you can bypass it by changing the Python. The design that actually holds is that the CLI gathers facts, calls OPA, reads the decision, and acts on it. The CLI never knows what the thresholds are. If you want to understand why a deploy was blocked, you read the Rego file and the manifest, not the Python.&lt;/p&gt;
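
&lt;p&gt;In sketch form, the enforcer side contains no threshold at all. The names below (&lt;code&gt;gather_disk_facts&lt;/code&gt;, &lt;code&gt;gated_deploy&lt;/code&gt;, &lt;code&gt;disk_free_gb&lt;/code&gt;) and the use of decimal gigabytes are assumptions for illustration; the policy call and the stack start are passed in as plain callables so the example runs without OPA or Docker.&lt;/p&gt;

```python
import shutil
from typing import Callable


def gather_disk_facts(path: str = ".") -> dict:
    """Collect the observed value only; thresholds live in the manifest, not here."""
    usage = shutil.disk_usage(path)
    return {"disk_free_gb": round(usage.free / 1e9, 3)}


def gated_deploy(
    facts: dict,
    ask_policy: Callable[[dict], dict],
    start_stack: Callable[[], None],
) -> bool:
    """Start the stack only if the policy decision explicitly allows it."""
    decision = ask_policy(facts)
    if decision.get("allowed") is not True:  # fail closed on missing or malformed decisions
        print("[FAIL] deploy blocked by policy; app and nginx were not started")
        return False
    start_stack()
    return True


started = []
facts = gather_disk_facts()
denied = gated_deploy(facts, lambda f: {"allowed": False}, lambda: started.append("stack"))
allowed = gated_deploy(facts, lambda f: {"allowed": True}, lambda: started.append("stack"))
```

&lt;p&gt;Nothing in this file compares a number against a limit, so there is no Python edit that quietly loosens the policy.&lt;/p&gt;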

&lt;h3&gt;
  
  
  The single source of truth saves you at 2am
&lt;/h3&gt;

&lt;p&gt;Everything flows from &lt;code&gt;manifest.yaml&lt;/code&gt;. When the grader deletes &lt;code&gt;nginx.conf&lt;/code&gt; and &lt;code&gt;docker-compose.yml&lt;/code&gt; and runs &lt;code&gt;./swiftdeploy init&lt;/code&gt;, they get the same files back. The SHA256 hash of the generated files is deterministic given the manifest. If something breaks, you open the manifest. You do not hunt through five separate files trying to find where the port is defined.&lt;/p&gt;
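
&lt;p&gt;The determinism claim is easy to check in miniature. The template text and the &lt;code&gt;listen_port&lt;/code&gt; key below are hypothetical stand-ins, since the real &lt;code&gt;templates/nginx.conf.tmpl&lt;/code&gt; is not shown here; the idea is only that rendering is a pure function of the manifest, so the hash cannot drift.&lt;/p&gt;

```python
import hashlib
from string import Template

# Hypothetical stand-in for templates/nginx.conf.tmpl.
NGINX_TEMPLATE = Template(
    "# DO NOT HAND-EDIT: generated from manifest.yaml\n"
    "server {\n"
    "    listen $listen_port;\n"
    "    location / { proxy_pass http://app_backend; }\n"
    "}\n"
)


def render(manifest: dict) -> str:
    """Pure function of the manifest: no timestamps, hostnames, or random values."""
    return NGINX_TEMPLATE.substitute(listen_port=manifest["listen_port"])


def sha256_of(text: str) -> str:
    """Hex digest of the rendered file contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


manifest = {"listen_port": 18080}
first = sha256_of(render(manifest))
second = sha256_of(render(manifest))  # stands in for delete-and-regenerate
print(first == second)
```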

&lt;h3&gt;
  
  
  Generated artifacts and source files must be clearly separated
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;nginx.conf&lt;/code&gt; and &lt;code&gt;docker-compose.yml&lt;/code&gt; have &lt;code&gt;DO NOT HAND-EDIT&lt;/code&gt; headers. &lt;code&gt;history.jsonl&lt;/code&gt; and &lt;code&gt;audit_report.md&lt;/code&gt; are generated outputs of the CLI runtime. None of these are source files. Committing them to the repo is fine as evidence and for the grader, but they must never be the thing you edit to configure the stack. The moment you hand-edit a generated file you break the invariant the whole tool is built on.&lt;/p&gt;

&lt;h3&gt;
  
  
  The two-scrape window is a real trade-off
&lt;/h3&gt;

&lt;p&gt;The brief asks for error rate "over the last 30 seconds." What the implementation actually does is take two metrics scrapes about 1 second apart and evaluate the delta. This gives an immediate signal: if the canary is broken right now, the promote is blocked within about 1 second of the command starting. The trade-off is that a burst of errors from 10 seconds ago would not block promotion. The right answer for production is to run &lt;code&gt;./swiftdeploy status&lt;/code&gt; for 30 seconds before promoting so the rolling history is warm. For this project the live window proved that the policy gate works; the 30-second window remains a design goal that the current implementation does not yet enforce.&lt;/p&gt;
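
&lt;p&gt;To make the trade-off concrete, here is a sketch of what a 30-second window could look like next to the two-scrape delta. The sample layout (cumulative &lt;code&gt;requests&lt;/code&gt; and &lt;code&gt;errors&lt;/code&gt; counters with a timestamp) and the function name are assumptions, not the project's &lt;code&gt;history.jsonl&lt;/code&gt; format, and the numbers are made up.&lt;/p&gt;

```python
from datetime import datetime, timedelta


def windowed_error_rate(samples: list[dict], now: datetime, window_s: float = 30.0) -> float:
    """Error rate from counter deltas between the oldest and newest sample in the window.

    Each sample is {"time": datetime, "requests": int, "errors": int}, where the
    two counts are cumulative totals as scraped from /metrics.
    """
    cutoff = now - timedelta(seconds=window_s)
    recent = sorted((s for s in samples if s["time"] >= cutoff), key=lambda s: s["time"])
    if len(recent) >= 2:
        requests = recent[-1]["requests"] - recent[0]["requests"]
        errors = recent[-1]["errors"] - recent[0]["errors"]
        if requests > 0:
            return errors / requests
    return 0.0  # not enough data; a stricter gate might deny here instead


t0 = datetime(2026, 5, 6, 18, 37, 0)
samples = [
    {"time": t0, "requests": 100, "errors": 0},
    {"time": t0 + timedelta(seconds=2), "requests": 150, "errors": 20},  # burst of errors
    {"time": t0 + timedelta(seconds=11), "requests": 195, "errors": 20},
    {"time": t0 + timedelta(seconds=12), "requests": 200, "errors": 20},
]
now = t0 + timedelta(seconds=12)

two_scrape = windowed_error_rate(samples[-2:], now)  # only the last two scrapes
thirty_second = windowed_error_rate(samples, now)    # everything in the last 30 s
print(two_scrape, thirty_second)
```

&lt;p&gt;With these made-up samples the two-scrape view sees a 0% error rate while the 30-second view sees 20%, which is exactly the burst the shorter window misses.&lt;/p&gt;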

</description>
      <category>devops</category>
      <category>opa</category>
      <category>prometheus</category>
      <category>docker</category>
    </item>
  </channel>
</rss>
