<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Edith Asante</title>
    <description>The latest articles on DEV Community by Edith Asante (@edithasante).</description>
    <link>https://dev.to/edithasante</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3901913%2Ff989e0fc-e130-4ca5-a86b-35ae5199e0b8.png</url>
      <title>DEV Community: Edith Asante</title>
      <link>https://dev.to/edithasante</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/edithasante"/>
    <language>en</language>
    <item>
      <title>I Built a Tool That Watches Your Server, Learns Your Traffic, and Blocks Attackers Automatically</title>
      <dc:creator>Edith Asante</dc:creator>
      <pubDate>Tue, 12 May 2026 06:27:34 +0000</pubDate>
      <link>https://dev.to/edithasante/-i-built-a-tool-that-watches-your-server-learns-your-traffic-and-blocks-attackers-automatically-11f7</link>
      <guid>https://dev.to/edithasante/-i-built-a-tool-that-watches-your-server-learns-your-traffic-and-blocks-attackers-automatically-11f7</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Most developers deploy servers. Few think about what happens when someone tries to take them down. I did. I built ShieldDaemon — a tool that watches every request hitting your server, learns your normal traffic patterns, and automatically blocks attackers the moment something looks wrong. No manual intervention. No hardcoded rules. Just a daemon that never sleeps. Here is exactly how I built it.&lt;/strong&gt;
&lt;/h2&gt;




&lt;h2&gt;
  
  
  What Is This Project About?
&lt;/h2&gt;

&lt;p&gt;Imagine you run an online shop. Everything is working fine until one day thousands of fake requests flood your website all at once. Your server crashes. Real customers can't access your shop. You lose money and trust.&lt;/p&gt;

&lt;p&gt;That is called a &lt;strong&gt;DDoS attack&lt;/strong&gt; — Distributed Denial of Service. It is one of the most common ways attackers take down websites.&lt;/p&gt;

&lt;p&gt;In this project I built &lt;strong&gt;ShieldDaemon&lt;/strong&gt; — a tool that watches every request coming into a server, learns what normal traffic looks like, and automatically blocks any IP address that starts behaving suspiciously.&lt;/p&gt;

&lt;p&gt;The best part? It does all of this in real time, without any human intervention.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt; — the detection daemon&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nginx&lt;/strong&gt; — reverse proxy that logs all traffic in JSON format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nextcloud&lt;/strong&gt; — the application being protected&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker Compose&lt;/strong&gt; — runs everything together&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;iptables&lt;/strong&gt; — Linux firewall used to block bad IPs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flask&lt;/strong&gt; — powers the live dashboard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slack&lt;/strong&gt; — receives instant alert notifications&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How the System Works — In Plain English
&lt;/h2&gt;

&lt;p&gt;Think of it like a security camera system at a shopping mall:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Camera (Nginx)&lt;/strong&gt;&lt;br&gt;
Every person who walks through the mall entrance gets recorded. Their face, the time they arrived, which shop they visited, and whether they were let in or turned away. Nginx does the same thing — it records every request that hits your server in JSON format and saves it to a shared log file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The Recording (JSON Log File)&lt;/strong&gt;&lt;br&gt;
All that information is saved to a log file in real time. Every single request — who made it, when, what they asked for, and what happened. It looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source_ip"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"45.33.32.156"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-11T22:07:28+00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"response_size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6674&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. The Security Guard (ShieldDaemon)&lt;/strong&gt;&lt;br&gt;
There is a guard watching that recording live. Not checking it hours later — watching it as it happens. The guard has been watching long enough to know what a normal busy day looks like versus something suspicious.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The Pattern Recognition&lt;/strong&gt;&lt;br&gt;
If one person walks past the same shop 300 times in one minute, the guard knows that is not normal. ShieldDaemon does the same — it compares current traffic against what it has learned is normal and raises an alarm when something is off.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. The Bouncer (iptables)&lt;/strong&gt;&lt;br&gt;
When the alarm is raised, the bouncer steps in. The suspicious visitor is blocked at the door — they cannot get back in. This happens automatically within 10 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. The Radio (Slack)&lt;/strong&gt;&lt;br&gt;
Every time someone is blocked or unblocked, a message is sent to the security team instantly via Slack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. The Monitor Screen (Dashboard)&lt;/strong&gt;&lt;br&gt;
A live screen shows everything happening in real time — who is visiting, how fast, who is blocked, and how the system is performing.&lt;/p&gt;


&lt;h2&gt;
  
  
  Part 1 — Watching the Logs
&lt;/h2&gt;

&lt;p&gt;The first thing ShieldDaemon does is read the Nginx access log line by line as new requests come in. This is called &lt;strong&gt;tailing&lt;/strong&gt; a file.&lt;/p&gt;

&lt;p&gt;Nginx is configured to write logs in JSON format like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source_ip"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"45.33.32.156"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-11T22:07:28+00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"response_size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6674&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every line tells us exactly who made a request, when, what they requested, and whether it succeeded.&lt;/p&gt;

&lt;p&gt;My monitor script tails this file and passes each line to the detector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tail_log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# start at end of file
&lt;/span&gt;        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_log_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="nf"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
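
&lt;p&gt;The &lt;code&gt;parse_log_line&lt;/code&gt; helper is just defensive JSON parsing. A minimal sketch of what it could look like (the field handling here is illustrative, not the exact ShieldDaemon code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

def parse_log_line(line):
    """Parse one JSON log line; return None for blank or malformed lines."""
    line = line.strip()
    if not line:
        return None
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return None
    # Keep only the fields the detector cares about
    return {
        "ip": entry.get("source_ip"),
        "timestamp": entry.get("timestamp"),
        "status": entry.get("status"),
        "path": entry.get("path"),
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;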






&lt;h2&gt;
  
  
  Part 2 — The Sliding Window
&lt;/h2&gt;

&lt;p&gt;Now that we can see every request, we need to measure how fast they are coming.&lt;/p&gt;

&lt;p&gt;I use a &lt;strong&gt;sliding window&lt;/strong&gt; — a structure that tracks requests over the last 60 seconds. I use Python's &lt;code&gt;deque&lt;/code&gt; (double-ended queue) for this.&lt;/p&gt;

&lt;p&gt;Here is how it works in simple terms:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Imagine a conveyor belt that is 60 seconds long. Every new request gets placed on the right end. Any request older than 60 seconds falls off the left end automatically. The number of items on the belt at any moment is the current request rate.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;deque&lt;/span&gt;

&lt;span class="n"&gt;ip_window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;deque&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;ip_window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Remove entries older than 60 seconds
&lt;/span&gt;    &lt;span class="n"&gt;cutoff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;ip_window&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;ip_window&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;cutoff&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ip_window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;popleft&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Current rate = items on belt / belt length
&lt;/span&gt;    &lt;span class="n"&gt;rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ip_window&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives us an accurate requests-per-second value for every IP at any moment.&lt;/p&gt;
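
&lt;p&gt;The snippet above keeps a single window for brevity. In practice there is one window per IP; a minimal per-IP variant (the &lt;code&gt;defaultdict&lt;/code&gt; keying is my illustration, not the exact ShieldDaemon code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import defaultdict, deque

WINDOW_SECONDS = 60
windows = defaultdict(deque)  # one conveyor belt per source IP

def record(ip, timestamp):
    window = windows[ip]
    window.append(timestamp)

    # Drop anything older than the window
    cutoff = timestamp - WINDOW_SECONDS
    while window and window[0] &amp;lt; cutoff:
        window.popleft()

    # Requests per second for this IP right now
    return len(window) / WINDOW_SECONDS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;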




&lt;h2&gt;
  
  
  Part 3 — The Rolling Baseline
&lt;/h2&gt;

&lt;p&gt;Knowing the current rate is not enough. We need to know whether that rate is &lt;strong&gt;normal or not&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For example, 10 requests per second might be completely normal for a busy website during the day. But at 3am it might be a sign of an attack.&lt;/p&gt;

&lt;p&gt;This is where the &lt;strong&gt;rolling baseline&lt;/strong&gt; comes in. It learns what normal traffic looks like over the last 30 minutes.&lt;/p&gt;

&lt;p&gt;Every second we record how many requests came in. Every 60 seconds we calculate the &lt;strong&gt;mean&lt;/strong&gt; (average) and &lt;strong&gt;standard deviation&lt;/strong&gt; (how much it varies) of those counts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;variance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;std&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;variance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The baseline also maintains &lt;strong&gt;per-hour slots&lt;/strong&gt; — so it learns that traffic during business hours is higher than traffic at night, and adjusts accordingly.&lt;/p&gt;

&lt;p&gt;Floor values of 0.1 are applied to both mean and standard deviation to prevent false positives when there is zero traffic.&lt;/p&gt;
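
&lt;p&gt;Putting those pieces together, here is a minimal sketch of the baseline idea: hourly slots, a rolling window of per-second counts, and the 0.1 floors (class and method names are illustrative, not the exact ShieldDaemon code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math
from collections import defaultdict, deque

class RollingBaseline:
    def __init__(self, window_minutes=30):
        # One rolling window of per-second request counts per hour of the day
        self.slots = defaultdict(lambda: deque(maxlen=window_minutes * 60))

    def record_second(self, hour, count):
        self.slots[hour].append(count)

    def stats(self, hour):
        counts = self.slots[hour]
        if not counts:
            return 0.1, 0.1  # nothing learned yet for this hour
        mean = sum(counts) / len(counts)
        variance = sum((x - mean) ** 2 for x in counts) / len(counts)
        std = math.sqrt(variance)
        # Floors prevent divide-by-zero and false positives at zero traffic
        return max(mean, 0.1), max(std, 0.1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;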




&lt;h2&gt;
  
  
  Part 4 — Detecting Anomalies
&lt;/h2&gt;

&lt;p&gt;Now we have two things: the current rate and the baseline. We compare them using two methods.&lt;/p&gt;

&lt;h3&gt;
  
  
  Method 1 — Z-Score
&lt;/h3&gt;

&lt;p&gt;The z-score tells us how many standard deviations the current rate is above normal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;z_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_rate&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;baseline_mean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;baseline_std&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the z-score is above 2.0, something is unusual. A z-score of 2.0 means the rate is so high it would only happen naturally about 2% of the time, assuming roughly normally distributed traffic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Method 2 — Rate Multiplier
&lt;/h3&gt;

&lt;p&gt;We also check if the rate is simply more than 2 times the baseline mean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_rate&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;baseline_mean&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# anomaly detected
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Whichever fires first triggers the response.&lt;/strong&gt; This gives us two layers of protection.&lt;/p&gt;

&lt;p&gt;If an IP also has a high rate of error responses (4xx and 5xx), the thresholds tighten automatically to catch it sooner.&lt;/p&gt;
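
&lt;p&gt;A condensed sketch of how the two checks and the error-rate tightening might combine (the 2.0 thresholds are the ones described above; the 50% error cutoff and 0.75 tightening factor are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def is_anomalous(current_rate, baseline_mean, baseline_std, error_ratio):
    z_threshold = 2.0
    rate_multiplier = 2.0

    # Tighten the thresholds for IPs producing mostly 4xx/5xx responses
    if error_ratio &amp;gt; 0.5:
        z_threshold *= 0.75
        rate_multiplier *= 0.75

    z_score = (current_rate - baseline_mean) / baseline_std
    if z_score &amp;gt; z_threshold:
        return True, f"z-score={z_score:.2f} &amp;gt; threshold={z_threshold}"
    if current_rate &amp;gt; rate_multiplier * baseline_mean:
        return True, f"rate {current_rate:.2f} req/s &amp;gt; {rate_multiplier}x baseline"
    return False, None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;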




&lt;h2&gt;
  
  
  Part 5 — Blocking with iptables
&lt;/h2&gt;

&lt;p&gt;When an anomaly is detected the IP gets blocked at the &lt;strong&gt;firewall level&lt;/strong&gt; using iptables. This means the server stops accepting any traffic from that IP before it even reaches Nginx or Nextcloud.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iptables&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-I&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INPUT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-j&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DROP&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This happens within 10 seconds of detection.&lt;/p&gt;

&lt;p&gt;Here is what a blocked IP looks like in iptables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;Chain INPUT (policy ACCEPT)
target     prot opt source               destination
DROP       all  --  45.33.32.156         0.0.0.0/0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Part 6 — Auto-Unban with Backoff Schedule
&lt;/h2&gt;

&lt;p&gt;Blocking an IP forever for a first offence is too harsh — it might be a false positive. But being too lenient encourages repeat attacks.&lt;/p&gt;

&lt;p&gt;I implemented a &lt;strong&gt;progressive backoff schedule&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Offence&lt;/th&gt;
&lt;th&gt;Ban Duration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1st ban&lt;/td&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2nd ban&lt;/td&gt;
&lt;td&gt;30 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3rd ban&lt;/td&gt;
&lt;td&gt;2 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4th+ ban&lt;/td&gt;
&lt;td&gt;Permanent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each ban is scheduled using a Python timer thread that fires after the duration and removes the iptables rule automatically. A Slack notification is sent every time an IP is unbanned.&lt;/p&gt;
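
&lt;p&gt;A minimal sketch of that unban scheduling with &lt;code&gt;threading.Timer&lt;/code&gt; (the schedule mirrors the table above; function names are illustrative, and the iptables call simply reverses the block rule shown earlier):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess
import threading

BAN_SCHEDULE = [600, 1800, 7200]  # 10 min, 30 min, 2 hours; 4th+ offence is permanent

def unblock_ip(ip):
    # Remove the DROP rule that the ban inserted
    subprocess.run(["iptables", "-D", "INPUT", "-s", ip, "-j", "DROP"])
    # A Slack "unbanned" notification is also sent at this point

def schedule_unban(ip, offence_count):
    if offence_count &amp;gt; len(BAN_SCHEDULE):
        return  # permanent ban, no timer
    duration = BAN_SCHEDULE[offence_count - 1]
    timer = threading.Timer(duration, unblock_ip, args=(ip,))
    timer.daemon = True  # timers should not keep the process alive
    timer.start()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;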




&lt;h2&gt;
  
  
  Part 7 — Slack Alerts
&lt;/h2&gt;

&lt;p&gt;Every significant event sends an alert to Slack:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ban alert example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; IP BANNED
• IP: 45.33.32.156
• Condition: z-score=5.43 &amp;gt; threshold=2.0
• Current rate: 3.72 req/s
• Baseline: 0.10 req/s
• Ban duration: 600 seconds
• Timestamp: 2026-05-11T22:07:33Z
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Global anomaly alert:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; GLOBAL TRAFFIC ANOMALY
• Condition: Global request rate spike
• Current rate: 3.10 req/s
• Baseline: 0.10 req/s
• Action: No IP ban — monitoring closely
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
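
&lt;p&gt;Sending these alerts is a single POST to a Slack incoming-webhook URL. A minimal sketch using the &lt;code&gt;requests&lt;/code&gt; library (the webhook URL should come from config, never from a hardcoded string):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

def notify_slack(webhook_url, message):
    """Post a plain-text alert to a Slack incoming webhook."""
    try:
        resp = requests.post(webhook_url, json={"text": message}, timeout=5)
        resp.raise_for_status()
    except requests.RequestException as exc:
        # Alerting must never crash the detection loop
        print(f"Slack notification failed: {exc}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;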






&lt;h2&gt;
  
  
  Part 8 — The Live Dashboard
&lt;/h2&gt;

&lt;p&gt;The dashboard at port 8080 refreshes every 3 seconds and shows everything happening in real time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Global request rate&lt;/li&gt;
&lt;li&gt;Baseline mean and standard deviation&lt;/li&gt;
&lt;li&gt;Blocked IPs with ban count&lt;/li&gt;
&lt;li&gt;CPU and memory usage&lt;/li&gt;
&lt;li&gt;System uptime&lt;/li&gt;
&lt;li&gt;Top 10 source IPs&lt;/li&gt;
&lt;li&gt;Live traffic chart vs baseline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is built with Flask and Chart.js, styled with a dark blue security-themed design.&lt;/p&gt;
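
&lt;p&gt;Under the hood the dashboard is mostly a Flask route serving the current stats as JSON, which the page polls every 3 seconds. A stripped-down sketch (field names are illustrative, and I am assuming &lt;code&gt;psutil&lt;/code&gt; for the CPU and memory numbers):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from flask import Flask, jsonify
import psutil  # CPU / memory stats shown on the dashboard

app = Flask(__name__)

# The detector thread keeps this dict up to date
state = {"current_rate": 0.0, "baseline_mean": 0.1, "baseline_std": 0.1, "blocked_ips": []}

@app.route("/api/stats")
def stats():
    return jsonify({
        **state,
        "cpu_percent": psutil.cpu_percent(),
        "memory_percent": psutil.virtual_memory().percent,
    })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;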




&lt;h2&gt;
  
  
  Challenges I Faced
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The baseline kept adapting to attack traffic.&lt;/strong&gt; When I injected test requests the baseline learned those high rates as normal and stopped flagging them. The fix was to restart the daemon with a clean baseline before testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The latency calculation was wrong.&lt;/strong&gt; My first attempt used &lt;code&gt;date +%s%N&lt;/code&gt;, which is not supported by every &lt;code&gt;date&lt;/code&gt; implementation. I switched to curl's built-in &lt;code&gt;%{time_total}&lt;/code&gt; timing instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Slack webhook was accidentally exposed.&lt;/strong&gt; I committed the webhook URL to GitHub and GitHub's secret scanning blocked the push. I revoked the token immediately and used a placeholder in the config file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker volume mounting.&lt;/strong&gt; The detector container needed to read the Nginx log file through a shared Docker volume called &lt;code&gt;HNG-nginx-logs&lt;/code&gt;. Getting the volume permissions right took some debugging.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;Building ShieldDaemon taught me that &lt;strong&gt;real security tools are statistical, not rule-based&lt;/strong&gt;. A fixed threshold of "block anyone who sends more than 100 requests per minute" would block legitimate users during a product launch. A statistical baseline that learns from actual traffic patterns is far more accurate.&lt;/p&gt;

&lt;p&gt;I also learned that &lt;strong&gt;the order of operations matters in security&lt;/strong&gt;. You must detect before you block. You must verify before you unban. You must log everything so you can audit what happened.&lt;/p&gt;

&lt;p&gt;Most importantly I learned that &lt;strong&gt;security is a continuous process&lt;/strong&gt;. ShieldDaemon runs forever, constantly learning and adapting. There is no finish line — only a daemon that never sleeps.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;A fully working DDoS detection engine that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Watches Nginx logs in real time&lt;/li&gt;
&lt;li&gt;Learns normal traffic patterns automatically&lt;/li&gt;
&lt;li&gt;Detects attacks within seconds using z-scores&lt;/li&gt;
&lt;li&gt;Blocks malicious IPs with iptables&lt;/li&gt;
&lt;li&gt;Unbans automatically on a backoff schedule&lt;/li&gt;
&lt;li&gt;Alerts the team via Slack&lt;/li&gt;
&lt;li&gt;Shows everything on a live dashboard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can see it running at &lt;strong&gt;&lt;a href="http://13.60.224.73:8080" rel="noopener noreferrer"&gt;http://13.60.224.73:8080&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The full source code is at &lt;strong&gt;&lt;a href="https://github.com/asanteedith/Shield-Daemon-Detection-Engine" rel="noopener noreferrer"&gt;https://github.com/asanteedith/Shield-Daemon-Detection-Engine&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;Written by Edith Asante — Cloud &amp;amp; DevOps Engineer. Find me on GitHub | Dev.to&lt;/p&gt;

</description>
      <category>devops</category>
      <category>security</category>
      <category>python</category>
      <category>docker</category>
    </item>
    <item>
      <title>Building a Self-Service Sandbox Platform from Scratch</title>
      <dc:creator>Edith Asante</dc:creator>
      <pubDate>Mon, 11 May 2026 16:31:41 +0000</pubDate>
      <link>https://dev.to/edithasante/building-a-self-service-sandbox-platform-from-scratch-4ff8</link>
      <guid>https://dev.to/edithasante/building-a-self-service-sandbox-platform-from-scratch-4ff8</guid>
      <description>&lt;p&gt;&lt;em&gt;This is part of my HNG DevOps internship series. Follow along as I document every stage.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  A Quick Recap
&lt;/h2&gt;

&lt;p&gt;Stage 0 was about securing a Linux server. Stage 1 was deploying an API behind Nginx. Stage 2 was containerizing a microservices app. Stage 3 was building a DDoS detection engine. Stage 4 was writing a declarative deployment tool. Stage 5 is the most ambitious yet.&lt;/p&gt;

&lt;p&gt;This time there was no starter code. No bugs to fix. No existing app to containerize. I had to build the entire platform from scratch — a self-service system where users can spin up isolated temporary environments, deploy apps into them, simulate outages, monitor health, and have everything auto-destroyed when the lifetime expires. Think of it as a miniature internal Heroku with a chaos engineering toggle.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Task
&lt;/h2&gt;

&lt;p&gt;The platform had to do all of this on a single Linux VM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Environment Lifecycle&lt;/strong&gt; — create and destroy isolated Docker environments on demand with a configurable TTL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto Cleanup Daemon&lt;/strong&gt; — a background process that scans every 60 seconds and destroys expired environments automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Nginx Routing&lt;/strong&gt; — every new environment gets its own Nginx config written and reloaded automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log Shipping&lt;/strong&gt; — container logs captured and queryable by environment ID&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Health Monitoring&lt;/strong&gt; — a poller that hits every environment's &lt;code&gt;/health&lt;/code&gt; endpoint every 30 seconds and marks environments as degraded after 3 consecutive failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outage Simulation&lt;/strong&gt; — a script that can crash, pause, disconnect, or stress-test any environment on demand&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control API&lt;/strong&gt; — a REST API with 6 endpoints wrapping all the scripts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Makefile&lt;/strong&gt; — every action available as a make target&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The stack was Docker, Docker Compose, Nginx, Bash, Python 3, and Flask. Everything had to spin up with one command.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Repo Structure and Scaffold
&lt;/h2&gt;

&lt;p&gt;Before writing a single line of logic I set up the repo structure exactly as specified:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;devops-sandbox/
├── platform/
│   ├── create_env.sh
│   ├── destroy_env.sh
│   ├── cleanup_daemon.sh
│   ├── simulate_outage.sh
│   └── api.py
├── nginx/
│   ├── nginx.conf
│   └── conf.d/
├── monitor/
│   └── health_poller.sh
├── logs/
├── envs/
├── Makefile
├── docker-compose.yml
├── README.md
├── .env.example
└── .gitignore
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Getting this right first saved a lot of headaches later. Every script references paths relative to the project root, and if those paths don't exist at runtime the scripts fail silently. I also set &lt;code&gt;chmod +x&lt;/code&gt; on all shell scripts immediately — forgetting this causes confusing permission errors later.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;.gitignore&lt;/code&gt; was set up to exclude &lt;code&gt;envs/&lt;/code&gt;, &lt;code&gt;logs/&lt;/code&gt;, and &lt;code&gt;.env&lt;/code&gt; from the start. These directories contain runtime state and secrets that should never be committed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: The Demo App
&lt;/h2&gt;

&lt;p&gt;The platform needed something to run inside each environment. The task was clear that the demo app is not the project — the platform is. So I kept it simple: a Flask app with two routes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;jsonify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello from the sandbox!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;env_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ENV_ID&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/health&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;health&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;jsonify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;env_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ENV_ID&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;/health&lt;/code&gt; route is the critical one. The health poller depends on it. Every environment container gets its &lt;code&gt;ENV_ID&lt;/code&gt; injected as an environment variable so you can always tell which container you are talking to.&lt;/p&gt;

&lt;p&gt;The app binds to &lt;code&gt;0.0.0.0&lt;/code&gt;, not &lt;code&gt;127.0.0.1&lt;/code&gt;. This is a mistake I see constantly. If you bind to localhost inside a container, nothing outside the container can reach it — including Nginx.&lt;/p&gt;
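
&lt;p&gt;In practice the bind address is just the last line of the app, something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;if __name__ == "__main__":
    # 0.0.0.0 so Nginx, which lives outside this container, can reach port 5000
    app.run(host="0.0.0.0", port=5000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;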




&lt;h2&gt;
  
  
  Step 3: Nginx Dynamic Routing
&lt;/h2&gt;

&lt;p&gt;Nginx is the front door for every environment. The key insight is that &lt;code&gt;nginx.conf&lt;/code&gt; never needs to change. It just includes everything in &lt;code&gt;conf.d/&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;http&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;include&lt;/span&gt; &lt;span class="n"&gt;/etc/nginx/conf.d/*.conf&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;listen&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt; &lt;span class="s"&gt;default_server&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;404&lt;/span&gt; &lt;span class="s"&gt;"No&lt;/span&gt; &lt;span class="s"&gt;environment&lt;/span&gt; &lt;span class="s"&gt;found&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;n"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;create_env.sh&lt;/code&gt; runs, it writes a new file to &lt;code&gt;nginx/conf.d/$ENV_ID.conf&lt;/code&gt; and reloads Nginx. When &lt;code&gt;destroy_env.sh&lt;/code&gt; runs, it deletes that file and reloads Nginx again. No manual config editing ever.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;conf.d/&lt;/code&gt; directory is mounted as a Docker volume into the Nginx container. This means files written to &lt;code&gt;nginx/conf.d/&lt;/code&gt; on the host appear immediately inside the container. Only a reload is needed, not a rebuild.&lt;/p&gt;

&lt;p&gt;One critical mistake to avoid: never write the Nginx config before the container is running. Nginx validates upstream hostnames on reload. If you write a config pointing to a container that doesn't exist yet, the reload fails and Nginx goes down. The order matters — start the container first, then write the config.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Environment Lifecycle
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;create_env.sh&lt;/code&gt; is the heart of the platform. It has to do six things in the right order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate a unique env ID from the name and a timestamp suffix&lt;/li&gt;
&lt;li&gt;Create a dedicated Docker network for the environment&lt;/li&gt;
&lt;li&gt;Connect the Nginx container to that network&lt;/li&gt;
&lt;li&gt;Start the app container on that network with a &lt;code&gt;sandbox.env=$ENV_ID&lt;/code&gt; label&lt;/li&gt;
&lt;li&gt;Write the Nginx config and reload&lt;/li&gt;
&lt;li&gt;Write the state file to &lt;code&gt;envs/$ENV_ID.json&lt;/code&gt; atomically&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The atomic write is important. The cleanup daemon reads these state files in a loop. If a write crashes halfway, the daemon reads garbage and fails. The fix is to write to a temp file first and then &lt;code&gt;mv&lt;/code&gt; it into place:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;TEMP_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;mktemp&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ENVS_DIR&lt;/span&gt;&lt;span class="s2"&gt;/.tmp.XXXXXX"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TEMP_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;JSON&lt;/span&gt;&lt;span class="sh"&gt;
{
  "id": "&lt;/span&gt;&lt;span class="nv"&gt;$ENV_ID&lt;/span&gt;&lt;span class="sh"&gt;",
  "name": "&lt;/span&gt;&lt;span class="nv"&gt;$ENV_NAME&lt;/span&gt;&lt;span class="sh"&gt;",
  "container": "&lt;/span&gt;&lt;span class="nv"&gt;$CONTAINER_NAME&lt;/span&gt;&lt;span class="sh"&gt;",
  "network": "&lt;/span&gt;&lt;span class="nv"&gt;$NETWORK_NAME&lt;/span&gt;&lt;span class="sh"&gt;",
  "created_at": "&lt;/span&gt;&lt;span class="nv"&gt;$CREATED_AT&lt;/span&gt;&lt;span class="sh"&gt;",
  "ttl": &lt;/span&gt;&lt;span class="nv"&gt;$TTL&lt;/span&gt;&lt;span class="sh"&gt;,
  "status": "running"
}
&lt;/span&gt;&lt;span class="no"&gt;JSON
&lt;/span&gt;&lt;span class="nb"&gt;mv&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TEMP_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ENVS_DIR&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$ENV_ID&lt;/span&gt;&lt;span class="s2"&gt;.json"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;mv&lt;/code&gt; is atomic on Linux when source and destination are on the same filesystem. The daemon either reads the complete file or nothing.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;destroy_env.sh&lt;/code&gt; reverses all of this in the correct order — kill the log shipper first, stop and remove containers, disconnect Nginx from the network, remove the network, delete the Nginx config, reload Nginx, archive logs, delete the state file. Order matters here too. You cannot remove a network while containers are still connected to it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5: The Cleanup Daemon
&lt;/h2&gt;

&lt;p&gt;The daemon runs in an infinite loop with a 60 second sleep. On each iteration it reads every file in &lt;code&gt;envs/&lt;/code&gt;, computes how much time has passed since &lt;code&gt;created_at&lt;/code&gt;, and calls &lt;code&gt;destroy_env.sh&lt;/code&gt; if the TTL has been exceeded.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;CREATED_EPOCH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CREATED_AT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;NOW_EPOCH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;EXPIRES_AT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;CREATED_EPOCH &lt;span class="o"&gt;+&lt;/span&gt; TTL&lt;span class="k"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$NOW_EPOCH&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-ge&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$EXPIRES_AT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;bash &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DESTROY_SCRIPT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ENV_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One thing that breaks this: not using &lt;code&gt;nullglob&lt;/code&gt;. If &lt;code&gt;envs/&lt;/code&gt; is empty, &lt;code&gt;*.json&lt;/code&gt; expands to the literal string &lt;code&gt;*.json&lt;/code&gt; and the loop tries to process a file called &lt;code&gt;*.json&lt;/code&gt; which doesn't exist. Fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;shopt&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; nullglob
&lt;span class="nv"&gt;STATE_FILES&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ENVS_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;/&lt;span class="k"&gt;*&lt;/span&gt;.json&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;shopt&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; nullglob
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every action is timestamped and written to &lt;code&gt;logs/cleanup.log&lt;/code&gt;. The daemon runs in the background with &lt;code&gt;nohup&lt;/code&gt; and its PID is saved so &lt;code&gt;make down&lt;/code&gt; can stop it cleanly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 6: Health Monitoring
&lt;/h2&gt;

&lt;p&gt;The health poller runs every 30 seconds. For each active environment it finds the container's IP address, hits &lt;code&gt;GET /health&lt;/code&gt;, measures the latency, and writes the result to &lt;code&gt;logs/$ENV_ID/health.log&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Getting latency right was harder than expected. My first approach used &lt;code&gt;date +%s%N&lt;/code&gt; for nanosecond timestamps. This failed because the &lt;code&gt;%N&lt;/code&gt; flag is not supported by the &lt;code&gt;date&lt;/code&gt; implementation on the VM. The numbers came out as something like &lt;code&gt;14209454ms&lt;/code&gt; for a request that obviously took under a second.&lt;/p&gt;

&lt;p&gt;The fix was to use curl's own built-in timing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;RESULT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; /dev/null &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-w&lt;/span&gt; &lt;span class="s2"&gt;"%{http_code} %{time_total}"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-time&lt;/span&gt; 5 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"http://&lt;/span&gt;&lt;span class="nv"&gt;$CONTAINER_IP&lt;/span&gt;&lt;span class="s2"&gt;:5000/health"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nv"&gt;HTTP_STATUS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$RESULT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $1}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;TIME_SEC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$RESULT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $2}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;LATENCY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TIME_SEC&lt;/span&gt;&lt;span class="s2"&gt; * 1000"&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{printf "%d", $1 * 1000}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;curl&lt;/code&gt;'s &lt;code&gt;%{time_total}&lt;/code&gt; gives you wall clock time in seconds as a decimal. Multiply by 1000 and you have milliseconds. Accurate and reliable.&lt;/p&gt;

&lt;p&gt;After 3 consecutive failures the poller marks the environment as degraded by updating the state file. It also resets the fail counter and restores the status to running when checks pass again. The status update uses the same atomic write pattern as the lifecycle scripts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 7: Outage Simulation
&lt;/h2&gt;

&lt;p&gt;The simulation script accepts &lt;code&gt;--env&lt;/code&gt; and &lt;code&gt;--mode&lt;/code&gt; flags. The modes map directly to Docker commands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;crash&lt;/code&gt; → &lt;code&gt;docker kill&lt;/code&gt; (SIGKILL, not graceful)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pause&lt;/code&gt; → &lt;code&gt;docker pause&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;network&lt;/code&gt; → &lt;code&gt;docker network disconnect&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;recover&lt;/code&gt; → inspects current state and reverses whichever mode is active&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;stress&lt;/code&gt; → &lt;code&gt;stress-ng&lt;/code&gt; inside the container for 60 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The guard at the top of the script is not optional. It checks whether the target container name matches any protected service names and refuses to run if it does:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;PROTECTED&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="s2"&gt;"sandbox-nginx"&lt;/span&gt; &lt;span class="s2"&gt;"cleanup_daemon"&lt;/span&gt; &lt;span class="s2"&gt;"sandbox-api"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;PROTECTED_NAME &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROTECTED&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CONTAINER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PROTECTED_NAME&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"ERROR: Refusing to simulate outage against protected container"&lt;/span&gt;
        &lt;span class="nb"&gt;exit &lt;/span&gt;1
    &lt;span class="k"&gt;fi
done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without this guard, nothing stops someone from passing the Nginx container ID and taking down the entire platform.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;recover&lt;/code&gt; mode was the most interesting to write. It does not know which mode caused the problem — it just inspects the current state and fixes whatever is wrong. Paused? Unpause. Exited? Restart. Network disconnected? Reconnect. This makes recover genuinely useful rather than just a wrapper around one specific undo.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 8: The Control API
&lt;/h2&gt;

&lt;p&gt;The Flask API wraps all the scripts via &lt;code&gt;subprocess.run&lt;/code&gt;. It has 6 endpoints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST   /envs              → create env
GET    /envs              → list active envs + TTL remaining
DELETE /envs/:id          → destroy env
GET    /envs/:id/logs     → last 100 lines of app.log
GET    /envs/:id/health   → last 10 health check results
POST   /envs/:id/outage   → trigger simulation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The TTL remaining calculation happens in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ttl_remaining&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;created&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromisoformat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;+00:00&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;created&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;total_seconds&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ttl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;elapsed&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API runs inside a Docker container with the project directory mounted as a volume and the Docker socket mounted so it can execute Docker commands. This is the standard pattern for tools that need to manage Docker from inside Docker.&lt;/p&gt;
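
&lt;p&gt;As an illustration of the wrapping pattern, here is roughly what the create endpoint could look like (the request fields, script arguments, and responses are simplified assumptions, not the exact implementation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/envs", methods=["POST"])
def create_env():
    body = request.get_json(silent=True) or {}
    name = body.get("name", "sandbox")
    ttl = str(body.get("ttl", 3600))

    # The API is a thin wrapper: the shell script does the real work
    result = subprocess.run(
        ["bash", "platform/create_env.sh", name, ttl],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return jsonify({"error": result.stderr.strip()}), 500
    return jsonify({"message": "environment created", "output": result.stdout.strip()}), 201
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;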




&lt;h2&gt;
  
  
  Step 9: The Makefile
&lt;/h2&gt;

&lt;p&gt;Every action has a make target. The two most important ones are &lt;code&gt;up&lt;/code&gt; and &lt;code&gt;down&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;make up&lt;/code&gt; starts Nginx and the API via Docker Compose, then starts the cleanup daemon and health poller as background processes with &lt;code&gt;nohup&lt;/code&gt;, saving their PIDs to files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nl"&gt;up&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--build&lt;/span&gt;
    &lt;span class="nb"&gt;nohup &lt;/span&gt;bash platform/cleanup_daemon.sh &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; logs/cleanup.log 2&amp;gt;&amp;amp;1 &amp;amp;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$$&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; logs/cleanup_daemon.pid
    &lt;span class="nb"&gt;nohup &lt;/span&gt;bash monitor/health_poller.sh &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; logs/poller.log 2&amp;gt;&amp;amp;1 &amp;amp;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$$&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; logs/health_poller.pid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;make down&lt;/code&gt; reads those PID files and kills the processes cleanly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nl"&gt;down&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; logs/cleanup_daemon.pid &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nb"&gt;kill&lt;/span&gt; &lt;span class="p"&gt;$$(&lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;logs/cleanup_daemon.pid&lt;span class="p"&gt;)&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; logs/cleanup_daemon.pid&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Makefile syntax has one rule that catches everyone: recipe indentation must use tabs, not spaces. If you use spaces, make throws a cryptic &lt;code&gt;missing separator&lt;/code&gt; error that gives no hint the fix is a tab.&lt;/p&gt;




&lt;h2&gt;
  
  
  Problems I Hit Along the Way
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Docker permission denied on a fresh VM&lt;/strong&gt; — The ubuntu user is not in the docker group by default. Fix: &lt;code&gt;sudo usermod -aG docker $USER&lt;/code&gt; followed by &lt;code&gt;newgrp docker&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nginx crashing on startup&lt;/strong&gt; — I left a sample &lt;code&gt;example.conf&lt;/code&gt; file in &lt;code&gt;nginx/conf.d/&lt;/code&gt; as a reference. Nginx tried to resolve the upstream hostname &lt;code&gt;example:5000&lt;/code&gt; on startup, failed, and crashed. The fix was obvious in hindsight: delete the sample file before starting Nginx.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disk full during Docker build&lt;/strong&gt; — &lt;code&gt;docker system prune -af&lt;/code&gt; recovered the space. The build cache had accumulated several GB from previous builds and test runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;demo-app:latest&lt;/code&gt; image lost after prune&lt;/strong&gt; — Docker prune removes all images not referenced by a running container. After cleaning disk space the demo app image was gone. Always rebuild the demo app image after a prune: &lt;code&gt;docker build -t demo-app:latest ./demo-app&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Health log latency showing 14 million milliseconds&lt;/strong&gt; — Caused by &lt;code&gt;date +%s%N&lt;/code&gt; not being supported in that environment, so the timestamp arithmetic produced garbage values. Fixed by switching to curl's built-in &lt;code&gt;%{time_total}&lt;/code&gt; timing.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Big Picture
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What we built&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dedicated Docker network per environment&lt;/td&gt;
&lt;td&gt;Complete isolation — environments cannot interfere with each other&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Atomic state file writes&lt;/td&gt;
&lt;td&gt;Prevents corruption when daemon and scripts write concurrently&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nginx config as code&lt;/td&gt;
&lt;td&gt;Dynamic routing without touching the main config&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Log shipper PID tracking&lt;/td&gt;
&lt;td&gt;Prevents zombie processes on destroy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guard in simulation script&lt;/td&gt;
&lt;td&gt;Prevents accidental destruction of platform infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Health-based degraded detection&lt;/td&gt;
&lt;td&gt;Automated observability without external tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;REST API over raw scripts&lt;/td&gt;
&lt;td&gt;Makes the platform programmable and integratable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The hardest part of this task was not any single script. It was understanding the correct order of operations. Create the container before writing the Nginx config. Kill the log shipper before removing the container. Disconnect the network before removing it. Write state files atomically. These ordering constraints are not obvious until something breaks, and when they break they break in confusing ways.&lt;/p&gt;

&lt;p&gt;That is the difference between infrastructure that works in a demo and infrastructure that works at 3am when something goes wrong.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Stage 5 complete. Find me on Dev.to | &lt;a href="https://github.com/asanteedith/devops-sandbox" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>docker</category>
      <category>bash</category>
      <category>beginners</category>
    </item>
    <item>
      <title># Containerizing a Broken Microservices App and Shipping It with a Full CI/CD Pipeline</title>
      <dc:creator>Edith Asante</dc:creator>
      <pubDate>Mon, 11 May 2026 01:03:21 +0000</pubDate>
      <link>https://dev.to/edithasante/-containerizing-a-broken-microservices-app-and-shipping-it-with-a-full-cicd-pipeline-407b</link>
      <guid>https://dev.to/edithasante/-containerizing-a-broken-microservices-app-and-shipping-it-with-a-full-cicd-pipeline-407b</guid>
      <description>&lt;p&gt;&lt;em&gt;This is part of my HNG DevOps internship series. In Stage 1 I deployed a personal API behind Nginx on a live server. Stage 2 is where things got serious.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Task
&lt;/h2&gt;

&lt;p&gt;We were handed a broken codebase and told to make it production-ready. No hints about what was wrong. No list of bugs. Just the code and the instruction: &lt;em&gt;"Finding them is part of the task."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The application was a distributed job processing system made up of four services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;frontend&lt;/strong&gt; (Node.js/Express) where users submit and track jobs&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;API&lt;/strong&gt; (Python/FastAPI) that creates jobs and serves status updates&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;worker&lt;/strong&gt; (Python) that picks up and processes jobs from a queue&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;Redis&lt;/strong&gt; instance shared between the API and worker as a message broker&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My job was to find every bug, fix every misconfiguration, containerize all three services with production-quality Dockerfiles, wire everything together with Docker Compose, and build a full CI/CD pipeline that runs lint, tests, security scanning, integration tests, and rolling deployment — all in strict order.&lt;/p&gt;




&lt;h2&gt;
  
  
  Reading the Code Before Touching Anything
&lt;/h2&gt;

&lt;p&gt;The first thing I did was read every file carefully before writing a single line of infrastructure. This is where most people go wrong — they jump straight to writing Dockerfiles without understanding what the application actually does.&lt;/p&gt;

&lt;p&gt;Here is what I found.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Redis hostname problem
&lt;/h3&gt;

&lt;p&gt;Both &lt;code&gt;api/main.py&lt;/code&gt; and &lt;code&gt;frontend/app.js&lt;/code&gt; had hardcoded &lt;code&gt;localhost&lt;/code&gt; as the Redis and API hostname respectively. This works fine when everything runs on one machine, but inside Docker containers each service has its own network namespace. &lt;code&gt;localhost&lt;/code&gt; inside the API container points to the API container itself, not Redis.&lt;/p&gt;

&lt;p&gt;The fix was straightforward — use environment variables and Docker's built-in DNS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before
&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6379&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# After
&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REDIS_HOST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6379&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Docker Compose automatically creates DNS entries for each service using the service name. So &lt;code&gt;redis&lt;/code&gt; resolves to the Redis container's IP address inside the network.&lt;/p&gt;

&lt;h3&gt;
  
  
  The silent queue mismatch
&lt;/h3&gt;

&lt;p&gt;This one was subtle. The API was pushing job IDs to a Redis list called &lt;code&gt;job_queue&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lpush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job_queue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But the worker was polling a completely different list called &lt;code&gt;job&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;blpop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every job submitted through the API went into &lt;code&gt;job_queue&lt;/code&gt;. The worker was watching &lt;code&gt;job&lt;/code&gt;. Jobs piled up forever in &lt;code&gt;pending&lt;/code&gt; state and nobody ever processed them. The fix was one word — change &lt;code&gt;job&lt;/code&gt; to &lt;code&gt;job_queue&lt;/code&gt; in the worker.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Python magic variable typo
&lt;/h3&gt;

&lt;p&gt;The worker file ended with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;process_redis_jobs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note &lt;code&gt;name&lt;/code&gt; instead of &lt;code&gt;__name__&lt;/code&gt;. This means the main function never ran. The container started, did nothing, and sat there silently. Changed to &lt;code&gt;if __name__ == "__main__":&lt;/code&gt; and the worker came to life.&lt;/p&gt;

&lt;h3&gt;
  
  
  Missing CORS headers
&lt;/h3&gt;

&lt;p&gt;The frontend was making HTTP requests to the API from a browser. Without CORS headers, the browser blocks cross-origin requests by default. Added &lt;code&gt;CORSMiddleware&lt;/code&gt; to the FastAPI app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi.middleware.cors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CORSMiddleware&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;CORSMiddleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;allow_origins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;allow_methods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;allow_headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Redis byte strings
&lt;/h3&gt;

&lt;p&gt;The Redis client was returning raw bytes instead of strings, so &lt;code&gt;job_id&lt;/code&gt; would come back as &lt;code&gt;b'abc-123'&lt;/code&gt; instead of &lt;code&gt;abc-123&lt;/code&gt;. Added &lt;code&gt;decode_responses=True&lt;/code&gt; to the Redis connection to get UTF-8 strings automatically.&lt;/p&gt;
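
&lt;p&gt;For reference, the change is a single keyword argument on the connection, reusing the environment-variable host from earlier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

import redis

# decode_responses=True makes the client return str instead of bytes
r = redis.Redis(
    host=os.getenv("REDIS_HOST", "redis"),
    port=6379,
    decode_responses=True,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;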




&lt;h2&gt;
  
  
  Writing Production Dockerfiles
&lt;/h2&gt;

&lt;p&gt;Once I understood the application I wrote Dockerfiles for all three services. The two rules I followed strictly: multi-stage builds and non-root users.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-stage builds
&lt;/h3&gt;

&lt;p&gt;A naive Dockerfile copies all your source code and runs &lt;code&gt;pip install&lt;/code&gt;. The resulting image contains your build tools, pip cache, compiler output — everything the build needed but the runtime doesn't. Multi-stage builds fix this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Stage 1: install dependencies&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;python:3.11-slim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements.txt .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--user&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Stage 2: copy only what's needed to run&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;python:3.11-slim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;runtime&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /root/.local /home/edith/.local&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The final image only contains the installed packages and source code. Build tools never make it in. Image size reduced by over 70%.&lt;/p&gt;

&lt;h3&gt;
  
  
  Non-root users
&lt;/h3&gt;

&lt;p&gt;Every service creates and runs as a dedicated user called &lt;code&gt;edith&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;RUN &lt;/span&gt;useradd &lt;span class="nt"&gt;-m&lt;/span&gt; edith
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; edith:edith /home/edith /app
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; edith&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If someone finds a vulnerability in your application and gets code execution, they get a restricted user with no special privileges — not root access to the container.&lt;/p&gt;

&lt;h3&gt;
  
  
  Health checks
&lt;/h3&gt;

&lt;p&gt;Every Dockerfile includes a &lt;code&gt;HEALTHCHECK&lt;/code&gt; instruction so Docker knows whether the service is actually working, not just running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# API&lt;/span&gt;
&lt;span class="k"&gt;HEALTHCHECK&lt;/span&gt;&lt;span class="s"&gt; --interval=30s --timeout=10s --retries=3 \&lt;/span&gt;
  CMD curl -f http://127.0.0.1:8000/health || exit 1

&lt;span class="c"&gt;# Worker — no HTTP port, so use a filesystem heartbeat&lt;/span&gt;
&lt;span class="k"&gt;HEALTHCHECK&lt;/span&gt;&lt;span class="s"&gt; --interval=30s --timeout=10s --retries=3 \&lt;/span&gt;
  CMD test -f /tmp/worker_healthy || exit 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The worker writes a timestamp to &lt;code&gt;/tmp/worker_healthy&lt;/code&gt; on every loop. The health check above only verifies that the file exists, which catches a worker that never started its loop; to also catch a worker that is stuck, the check needs to compare the file's age against the loop interval, so a stale heartbeat marks the container unhealthy.&lt;/p&gt;
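
&lt;p&gt;A minimal sketch of the heartbeat side, assuming a worker loop roughly like this (the names and interval are illustrative, not the exact project code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Heartbeat sketch -- illustrative, not the project's worker
import time

HEARTBEAT_FILE = "/tmp/worker_healthy"


def beat():
    # Touch the heartbeat file so its mtime reflects the last completed loop
    with open(HEARTBEAT_FILE, "w") as f:
        f.write(str(int(time.time())))


def main():
    while True:
        # ... pop a job from Redis and process it ...
        beat()
        # A freshness-aware check would compare
        # time.time() - os.path.getmtime(HEARTBEAT_FILE) against this interval
        time.sleep(1)


if __name__ == "__main__":
    main()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;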




&lt;h2&gt;
  
  
  Docker Compose Orchestration
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;docker-compose.yml&lt;/code&gt; file ties everything together. The key decisions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Startup order with health checks.&lt;/strong&gt; Using &lt;code&gt;depends_on&lt;/code&gt; with just a service name only waits for the container to start, not for the application inside to be ready. Using &lt;code&gt;condition: service_healthy&lt;/code&gt; waits for the health check to pass:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service_healthy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This eliminated the race condition where the API would crash on startup because Redis wasn't ready yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Redis not exposed on the host.&lt;/strong&gt; Redis uses &lt;code&gt;expose&lt;/code&gt; instead of &lt;code&gt;ports&lt;/code&gt;. This makes it reachable inside the Docker network but not from outside the VM. No reason to expose a database to the internet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resource limits on every service.&lt;/strong&gt; Without limits, one misbehaving service can starve the entire host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0.50'&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512M&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Named internal network.&lt;/strong&gt; All services communicate over &lt;code&gt;hng_network&lt;/code&gt; — an isolated bridge network managed by Docker Compose.&lt;/p&gt;




&lt;h2&gt;
  
  
  The CI/CD Pipeline
&lt;/h2&gt;

&lt;p&gt;The task specified 6 stages in strict order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lint → test → build → security scan → integration test → deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A failure in any stage must prevent all subsequent stages from running. GitHub Actions handles this with &lt;code&gt;needs&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;lint&lt;/span&gt;
&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt;
&lt;span class="na"&gt;security&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lint stage
&lt;/h3&gt;

&lt;p&gt;Three linters run in sequence:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;flake8&lt;/code&gt; for Python — catches style violations, unused imports, undefined names&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;eslint&lt;/code&gt; for JavaScript — catches syntax errors and bad patterns&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;hadolint&lt;/code&gt; for Dockerfiles — catches common Dockerfile mistakes like missing &lt;code&gt;--no-install-recommends&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Getting Python files to pass flake8 was the most tedious part. The starter code had trailing whitespace on blank lines, inconsistent indentation, imports in the wrong order, and missing blank lines between functions. Every line had to be cleaned up manually.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test stage
&lt;/h3&gt;

&lt;p&gt;Three unit tests with pytest and coverage reporting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_redis_connection_mocked&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;mock_redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MagicMock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;mock_redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;return_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;mock_redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_health_logic&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_math_logic&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Coverage report uploaded as a pipeline artifact so you can see exactly which lines are tested.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build stage
&lt;/h3&gt;

&lt;p&gt;This stage runs a local Docker registry as a GitHub Actions service container, builds all three images, tags each with the git SHA and &lt;code&gt;latest&lt;/code&gt;, and pushes them to the local registry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;registry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry:2&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;5000:5000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; localhost:5000/hng-api:&lt;span class="nv"&gt;$SHA&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; localhost:5000/hng-api:latest ./api
docker push localhost:5000/hng-api:&lt;span class="nv"&gt;$SHA&lt;/span&gt;
docker push localhost:5000/hng-api:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tagging with the git SHA means every image is traceable back to the exact commit that built it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security scan stage
&lt;/h3&gt;

&lt;p&gt;Trivy scans all three images for known vulnerabilities:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@master&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image-ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hng-api:latest'&lt;/span&gt;
    &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sarif'&lt;/span&gt;
    &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trivy-api.sarif'&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CRITICAL'&lt;/span&gt;
    &lt;span class="na"&gt;exit-code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results uploaded as SARIF artifacts — GitHub can render these in the Security tab. We set &lt;code&gt;exit-code: '0'&lt;/code&gt; so the pipeline continues even if vulnerabilities are found, but they are reported and visible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Integration test stage
&lt;/h3&gt;

&lt;p&gt;This is the most valuable stage. It starts the complete stack inside the GitHub Actions runner, submits a real job, and polls until it completes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Submit a job&lt;/span&gt;
&lt;span class="nv"&gt;JOB&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/jobs &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;JOB_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$JOB&lt;/span&gt; | python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import sys,json; print(json.load(sys.stdin)['job_id'])"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Poll until completed&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;seq &lt;/span&gt;1 20&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;STATUS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:8000/jobs/&lt;span class="nv"&gt;$JOB_ID&lt;/span&gt; | python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;"import sys,json; print(json.load(sys.stdin).get('status',''))"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$STATUS&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"completed"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;exit &lt;/span&gt;0
  &lt;span class="k"&gt;fi
  &lt;/span&gt;&lt;span class="nb"&gt;sleep &lt;/span&gt;5
&lt;span class="k"&gt;done
&lt;/span&gt;&lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the job doesn't complete within 100 seconds, the pipeline fails. The stack tears down cleanly regardless of the outcome.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploy stage
&lt;/h3&gt;

&lt;p&gt;The deploy stage only runs on pushes to &lt;code&gt;main&lt;/code&gt;. It SSHs into the production VM and performs a rolling update:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy the API first&lt;/span&gt;
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--build&lt;/span&gt; &lt;span class="nt"&gt;--no-deps&lt;/span&gt; api

&lt;span class="c"&gt;# Wait up to 60 seconds for the health check to pass&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;seq &lt;/span&gt;1 12&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  if &lt;/span&gt;docker compose &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-T&lt;/span&gt; api python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;"import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    2&amp;gt;/dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
    &lt;span class="c"&gt;# Health check passed — deploy the rest&lt;/span&gt;
    docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--build&lt;/span&gt; &lt;span class="nt"&gt;--no-deps&lt;/span&gt; worker frontend
    &lt;span class="nb"&gt;exit &lt;/span&gt;0
  &lt;span class="k"&gt;fi
  &lt;/span&gt;&lt;span class="nb"&gt;sleep &lt;/span&gt;5
&lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# Health check failed — abort, leave old container running&lt;/span&gt;
&lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The old container keeps serving traffic until the new one passes its health check. If the new version is broken, nothing goes down.&lt;/p&gt;




&lt;h2&gt;
  
  
  Problems I Hit Along the Way
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;YAML duplicate jobs.&lt;/strong&gt; I accidentally appended the &lt;code&gt;integration-test&lt;/code&gt; and &lt;code&gt;deploy&lt;/code&gt; stages to the ci.yml file twice using &lt;code&gt;cat &amp;gt;&amp;gt;&lt;/code&gt;. GitHub rejected the workflow because job names were duplicated. Fixed by rewriting the entire file from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pinned apt package version not found.&lt;/strong&gt; Hadolint flagged &lt;code&gt;apt-get install curl&lt;/code&gt; without a pinned version (DL3008). I tried to pin it as &lt;code&gt;curl=7.88.1-10+deb12u5&lt;/code&gt; but that exact version didn't exist in the GitHub Actions runner's package index, breaking the Docker build. Fixed by ignoring DL3008 with &lt;code&gt;hadolint --ignore DL3008&lt;/code&gt; — a pragmatic tradeoff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Windows CRLF line endings.&lt;/strong&gt; Editing files on Windows and pushing to a Linux CI environment caused flake8 to report phantom whitespace errors. Every blank line showed as &lt;code&gt;W293 blank line contains whitespace&lt;/code&gt; because of the carriage return character. Fixed by configuring git with &lt;code&gt;core.autocrlf false&lt;/code&gt; and converting files to LF.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token scope too narrow.&lt;/strong&gt; Pushing changes to the workflow file required a GitHub token with the &lt;code&gt;workflow&lt;/code&gt; scope, not just &lt;code&gt;repo&lt;/code&gt;. Generated a new token with both scopes to resolve the 403 error.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSH key missing on VM.&lt;/strong&gt; The deploy stage needed to SSH into the production server but no SSH key existed on the VM. Generated one with &lt;code&gt;ssh-keygen -t ed25519&lt;/code&gt;, added the public key to &lt;code&gt;authorized_keys&lt;/code&gt;, and stored the private key as a GitHub Actions secret.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Final Pipeline
&lt;/h2&gt;

&lt;p&gt;After all of that, the pipeline looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ lint          — 16s
✅ test          — 12s
✅ build         — 1m 4s
✅ security      — 46s
✅ integration-test — 1m 33s
✅ deploy        — 8s

Status: Success — Total duration: 2m 37s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All 6 stages green. Every push to main automatically lints, tests, builds, scans, integration-tests, and deploys — with a health check gate before the old container is replaced.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;The most important lesson from Stage 2 is that reading code before writing infrastructure is not optional. Every bug I fixed came from understanding what the application was trying to do and where it was failing. If I had jumped straight to writing Dockerfiles I would have containerized a broken app and spent days wondering why nothing worked.&lt;/p&gt;

&lt;p&gt;The second lesson is that CI/CD is not just automation — it is documentation. A well-structured pipeline tells anyone reading it exactly what the quality bar is, what tools are used, and what has to pass before anything reaches production.&lt;/p&gt;

&lt;p&gt;The third lesson is that container security is not complicated but it is easy to skip. Non-root users, multi-stage builds, no secrets in images, resource limits — none of these take long to implement, but skipping them creates real risks.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Stage 2 complete. Find the repo at &lt;a href="https://github.com/asanteedith/Containerized_MicroService" rel="noopener noreferrer"&gt;github.com/asanteedith/Containerized_MicroService&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>docker</category>
      <category>cicd</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Building a Policy-Gated Deployment System with Observability (SwiftDeploy Stage 4B)</title>
      <dc:creator>Edith Asante</dc:creator>
      <pubDate>Wed, 06 May 2026 19:59:23 +0000</pubDate>
      <link>https://dev.to/edithasante/building-a-policy-gated-deployment-system-with-observability-swiftdeploy-stage-4b-4od2</link>
      <guid>https://dev.to/edithasante/building-a-policy-gated-deployment-system-with-observability-swiftdeploy-stage-4b-4od2</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In Stage 4A, I built a CLI tool (&lt;code&gt;swiftdeploy&lt;/code&gt;) that generates infrastructure from a single file (&lt;code&gt;manifest.yaml&lt;/code&gt;).&lt;br&gt;
In Stage 4B, I extended it to include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Observability (metrics)&lt;/li&gt;
&lt;li&gt;Policy enforcement (OPA)&lt;/li&gt;
&lt;li&gt;Auditing (history + reports)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal was simple but strict:&lt;/p&gt;

&lt;p&gt;The system must refuse to deploy or promote if it is unsafe.&lt;/p&gt;

&lt;p&gt;This meant moving from just “running containers” to building a system that can think and decide before acting.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;Architectural Overview&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;manifest.yaml
      ↓
swiftdeploy CLI
      ↓
docker-compose + nginx
      ↓
Docker Network
      ↓
[ NGINX ] → [ APP (/metrics) ]
              ↓
           metrics
              ↓
            CLI
              ↓
            OPA
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;At a high level:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;manifest.yaml is the single source of truth&lt;/li&gt;
&lt;li&gt;swiftdeploy CLI reads it and generates:

&lt;ul&gt;
&lt;li&gt;docker-compose.yml&lt;/li&gt;
&lt;li&gt;nginx.conf&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Docker runs:

&lt;ul&gt;
&lt;li&gt;API service&lt;/li&gt;
&lt;li&gt;Nginx (reverse proxy)&lt;/li&gt;
&lt;li&gt;OPA (policy engine)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The decision flow:&lt;br&gt;
CLI → collect data → send to OPA → receive decision → deploy or block&lt;/p&gt;

&lt;h2&gt;The Design: A Tool That Writes Its Own Infrastructure&lt;/h2&gt;

&lt;p&gt;The core idea was:&lt;/p&gt;

&lt;p&gt;I don’t manually write configs — I generate them.&lt;/p&gt;

&lt;p&gt;Instead of editing multiple files, I only update &lt;code&gt;manifest.yaml&lt;/code&gt;, then run &lt;code&gt;python swiftdeploy.py init&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This generates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;docker-compose.yml&lt;/li&gt;
&lt;li&gt;nginx.conf&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Why this matters&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Reduces manual errors&lt;/li&gt;
&lt;li&gt;Keeps configuration consistent&lt;/li&gt;
&lt;li&gt;Makes the system reproducible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If I delete my configs, I can regenerate everything from the manifest.&lt;/p&gt;
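
&lt;p&gt;Conceptually, the init step boils down to reading the manifest and rendering config files from templates. Here is a minimal sketch of that idea; the manifest structure and template are assumptions, not the real swiftdeploy code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Conceptual sketch of "generate configs from manifest.yaml"
import yaml

COMPOSE_TEMPLATE = """services:
  app:
    image: {image}
    ports:
      - "{port}:{port}"
"""


def init():
    with open("manifest.yaml") as f:
        manifest = yaml.safe_load(f)

    app = manifest["app"]  # assumed manifest structure, for illustration only
    with open("docker-compose.yml", "w") as f:
        f.write(COMPOSE_TEMPLATE.format(image=app["image"], port=app["port"]))


if __name__ == "__main__":
    init()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;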

&lt;h2&gt;Observability: Adding the “Eyes” (/metrics)&lt;/h2&gt;

&lt;p&gt;I added a /metrics endpoint to the API in Prometheus format.&lt;/p&gt;

&lt;p&gt;It tracks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Throughput &amp;amp; Errors&lt;br&gt;
&lt;code&gt;http_requests_total{method, path, status_code}&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Latency&lt;br&gt;
&lt;code&gt;http_request_duration_seconds_bucket&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Application State&lt;br&gt;
&lt;code&gt;app_uptime_seconds&lt;/code&gt;, &lt;code&gt;app_mode&lt;/code&gt; (0 = stable, 1 = canary), &lt;code&gt;chaos_active&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
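
&lt;p&gt;One way to declare those metrics is with the prometheus_client library. This is a sketch rather than the project's actual instrumentation; the metric names follow the list above, while the labels and helper are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch using prometheus_client -- metric names match the post, the rest is illustrative
from prometheus_client import Counter, Gauge, Histogram, generate_latest

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "path", "status_code"],
)
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds",
)
UPTIME = Gauge("app_uptime_seconds", "Seconds since the app started")
APP_MODE = Gauge("app_mode", "0 = stable, 1 = canary")
CHAOS_ACTIVE = Gauge("chaos_active", "1 while a chaos mode is active")


def metrics_endpoint():
    # Return the metrics in Prometheus text exposition format
    return generate_latest()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;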

&lt;h2&gt;The Guardrails: Policy Enforcement with OPA&lt;/h2&gt;

&lt;p&gt;Instead of writing logic inside the CLI, I used Open Policy Agent.&lt;/p&gt;

&lt;p&gt;Key Rule:&lt;/p&gt;

&lt;p&gt;The CLI must NOT decide anything — OPA decides everything.&lt;/p&gt;

&lt;h3&gt;🔹 Infrastructure Policy (Pre-Deploy)&lt;/h3&gt;

&lt;p&gt;Checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Disk space&lt;/li&gt;
&lt;li&gt;CPU load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deny if &lt;code&gt;disk_free &amp;lt; 10GB&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Deny if &lt;code&gt;cpu_load &amp;gt; 2.0&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If I artificially reduce free disk space, the deploy is refused:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;BLOCKED: Disk below threshold&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;👉 This satisfies the Hard Gate requirement&lt;/p&gt;
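
&lt;p&gt;For illustration, the CLI side of such a gate can be a single HTTP call to OPA's data API. The policy path and input fields below are assumptions; the real package and rule names may differ:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of querying OPA before a deploy -- policy path and input fields are illustrative
import shutil

import requests

OPA_URL = "http://localhost:8181/v1/data/swiftdeploy/deploy/allow"  # assumed policy path


def infra_gate_allows_deploy():
    disk = shutil.disk_usage("/")
    payload = {"input": {
        "disk_free_gb": disk.free / 1e9,
        "cpu_load": 0.0,  # e.g. os.getloadavg()[0] on Linux
    }}
    decision = requests.post(OPA_URL, json=payload, timeout=5).json()
    return decision.get("result", False)


if not infra_gate_allows_deploy():
    raise SystemExit("BLOCKED: infrastructure policy denied the deploy")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;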

&lt;p&gt;⸻&lt;/p&gt;

&lt;h3&gt;🔹 Canary Safety Policy (Pre-Promote)&lt;/h3&gt;

&lt;p&gt;Before promoting, the CLI:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scrapes /metrics&lt;/li&gt;
&lt;li&gt;Calculates:

&lt;ul&gt;
&lt;li&gt;Error rate&lt;/li&gt;
&lt;li&gt;P99 latency&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Sends to OPA&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Policy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deny if &lt;code&gt;error_rate &amp;gt; 1%&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Deny if &lt;code&gt;p99_latency &amp;gt; 500ms&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
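
&lt;p&gt;A simplified sketch of the pre-promote calculation, assuming the raw Prometheus text is scraped from &lt;code&gt;/metrics&lt;/code&gt;. Only the error-rate part is shown (deriving P99 from the histogram buckets is omitted), and the URL is an assumption:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Simplified sketch: compute an error rate from the scraped /metrics text
import requests


def error_rate(metrics_url="http://localhost:8080/metrics"):  # URL is an assumption
    text = requests.get(metrics_url, timeout=5).text
    total = errors = 0.0
    for line in text.splitlines():
        if line.startswith("http_requests_total{"):
            labels, value = line.rsplit(" ", 1)
            total += float(value)
            if 'status_code="5' in labels:  # count 5xx responses as errors
                errors += float(value)
    return errors / total if total else 0.0

# The CLI then sends {"error_rate": ..., "p99_latency": ...} to OPA, which applies the deny rules
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;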

&lt;p&gt;⸻&lt;/p&gt;

&lt;h3&gt;Why Isolation Matters&lt;/h3&gt;

&lt;p&gt;OPA runs as a separate container and:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is reachable by the CLI&lt;/li&gt;
&lt;li&gt;Is NOT exposed through Nginx&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No external access to policy engine&lt;/li&gt;
&lt;li&gt;Clear separation of responsibilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This satisfies the “No Leakage” requirement&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;🧪 The Chaos: Testing Failure Scenarios&lt;/h2&gt;

&lt;p&gt;I implemented a /chaos endpoint:&lt;/p&gt;

&lt;p&gt;Modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;slow → delays responses&lt;/li&gt;
&lt;li&gt;error → randomly returns 500&lt;/li&gt;
&lt;li&gt;recover → resets system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;{ "mode": "slow", "duration": 2 }&lt;/p&gt;

&lt;h3&gt;What Happened&lt;/h3&gt;

&lt;p&gt;When I injected chaos:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency increased&lt;/li&gt;
&lt;li&gt;Error rate increased&lt;/li&gt;
&lt;li&gt;Metrics reflected the change&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When I tried to promote:&lt;br&gt;
&lt;code&gt;BLOCKED: Latency too high&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;👉 This confirmed:&lt;br&gt;
The system reacts to real runtime conditions, not assumptions&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;The Eyes: swiftdeploy status&lt;/h2&gt;

&lt;p&gt;Running &lt;code&gt;python swiftdeploy.py status&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Continuously scrapes &lt;code&gt;/metrics&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Displays live system state&lt;/li&gt;
&lt;li&gt;Logs everything to &lt;code&gt;history.jsonl&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
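
&lt;p&gt;A sketch of the polling loop behind a command like that, assuming one JSON object is appended to &lt;code&gt;history.jsonl&lt;/code&gt; per scrape (the interval, URL, and fields are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of a status loop that appends each scrape to history.jsonl
import json
import time

import requests

METRICS_URL = "http://localhost:8080/metrics"  # assumed URL


def status_loop(interval=10):
    while True:
        entry = {"ts": int(time.time())}
        try:
            entry["metrics"] = requests.get(METRICS_URL, timeout=5).text
            entry["ok"] = True
        except requests.RequestException as exc:
            entry["ok"] = False
            entry["error"] = str(exc)
        with open("history.jsonl", "a") as f:
            f.write(json.dumps(entry) + "\n")
        time.sleep(interval)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;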

&lt;h2&gt;The Memory: Audit System&lt;/h2&gt;

&lt;p&gt;From those logs, &lt;code&gt;python swiftdeploy.py audit&lt;/code&gt; creates &lt;code&gt;audit_report.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Contents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Timeline of events&lt;/li&gt;
&lt;li&gt;Policy violations&lt;/li&gt;
&lt;/ul&gt;
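
&lt;p&gt;A sketch of how the audit step could turn those JSONL entries into Markdown; the field names are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: render history.jsonl into a simple Markdown audit report
import json


def write_audit_report():
    lines = ["# Audit Report", "", "| Timestamp | Event |", "|---|---|"]
    with open("history.jsonl") as f:
        for raw in f:
            entry = json.loads(raw)
            event = entry.get("event", "status scrape")  # assumed field
            lines.append(f"| {entry.get('ts')} | {event} |")
    with open("audit_report.md", "w") as f:
        f.write("\n".join(lines) + "\n")


if __name__ == "__main__":
    write_audit_report()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;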

&lt;p&gt;👉 The report renders cleanly in GitHub Markdown&lt;br&gt;
(Satisfies submission requirement)&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;Lessons Learned&lt;/h2&gt;

&lt;p&gt;This stage changed how I think about DevOps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Deployment is not just execution.&lt;/strong&gt; It’s decision-making.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Policies should be external.&lt;/strong&gt; Keeping the logic in OPA makes it reusable and avoids tightly coupled code.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Metrics are not just for monitoring.&lt;/strong&gt; They actively drive decisions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Debugging is part of the process.&lt;/strong&gt; I faced YAML errors, Docker rebuild issues, Nginx misconfigurations, and OPA connection failures. Fixing them helped me understand the system deeply.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;✅ Final Checklist (Submission Criteria)&lt;/h2&gt;

&lt;p&gt;✔ manifest.yaml is the only edited file&lt;br&gt;
✔ Deployment blocked when disk is low&lt;br&gt;
✔ OPA not exposed via Nginx&lt;br&gt;
✔ Metrics fully implemented&lt;br&gt;
✔ Audit report generated and readable&lt;br&gt;
✔ Blog includes architecture diagram&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;This project helped me move from:&lt;/p&gt;

&lt;p&gt;running commands → building systems that enforce rules&lt;/p&gt;

&lt;p&gt;I now better understand how:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;observability&lt;/li&gt;
&lt;li&gt;policy&lt;/li&gt;
&lt;li&gt;infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;work together in real-world systems.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;If you’re learning DevOps, my biggest takeaway is:&lt;/p&gt;

&lt;p&gt;Don’t just deploy — build systems that decide when deployment is safe.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>automation</category>
      <category>cloud</category>
      <category>docker</category>
    </item>
  </channel>
</rss>
