<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Edith Asante</title>
    <description>The latest articles on DEV Community by Edith Asante (@edithasante).</description>
    <link>https://dev.to/edithasante</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3901913%2Ff989e0fc-e130-4ca5-a86b-35ae5199e0b8.png</url>
      <title>DEV Community: Edith Asante</title>
      <link>https://dev.to/edithasante</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/edithasante"/>
    <language>en</language>
    <item>
      <title>I Built a Tool That Watches Your Server, Learns Your Traffic, and Blocks Attackers Automatically</title>
      <dc:creator>Edith Asante</dc:creator>
      <pubDate>Tue, 12 May 2026 06:27:34 +0000</pubDate>
      <link>https://dev.to/edithasante/-i-built-a-tool-that-watches-your-server-learns-your-traffic-and-blocks-attackers-automatically-11f7</link>
      <guid>https://dev.to/edithasante/-i-built-a-tool-that-watches-your-server-learns-your-traffic-and-blocks-attackers-automatically-11f7</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Most developers deploy servers. Few think about what happens when someone tries to take them down. I did. I built ShieldDaemon — a tool that watches every request hitting your server, learns your normal traffic patterns, and automatically blocks attackers the moment something looks wrong. No manual intervention. No hardcoded rules. Just a daemon that never sleeps. Here is exactly how I built it.&lt;/strong&gt;
&lt;/h2&gt;




&lt;h2&gt;
  
  
  What Is This Project About?
&lt;/h2&gt;

&lt;p&gt;Imagine you run an online shop. Everything is working fine until one day thousands of fake requests flood your website all at once. Your server crashes. Real customers can't access your shop. You lose money and trust.&lt;/p&gt;

&lt;p&gt;That is called a &lt;strong&gt;DDoS attack&lt;/strong&gt; — Distributed Denial of Service. It is one of the most common ways attackers take down websites.&lt;/p&gt;

&lt;p&gt;In this project I built &lt;strong&gt;ShieldDaemon&lt;/strong&gt; — a tool that watches every request coming into a server, learns what normal traffic looks like, and automatically blocks any IP address that starts behaving suspiciously.&lt;/p&gt;

&lt;p&gt;The best part? It does all of this in real time, without any human intervention.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt; — the detection daemon&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nginx&lt;/strong&gt; — reverse proxy that logs all traffic in JSON format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nextcloud&lt;/strong&gt; — the application being protected&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker Compose&lt;/strong&gt; — runs everything together&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;iptables&lt;/strong&gt; — Linux firewall used to block bad IPs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flask&lt;/strong&gt; — powers the live dashboard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slack&lt;/strong&gt; — receives instant alert notifications&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How the System Works — In Plain English
&lt;/h2&gt;

&lt;p&gt;Think of it like a security camera system at a shopping mall:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Camera (Nginx)&lt;/strong&gt;&lt;br&gt;
Every person who walks through the mall entrance gets recorded. Their face, the time they arrived, which shop they visited, and whether they were let in or turned away. Nginx does the same thing — it records every request that hits your server in JSON format and saves it to a shared log file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The Recording (JSON Log File)&lt;/strong&gt;&lt;br&gt;
All that information is saved to a log file in real time. Every single request — who made it, when, what they asked for, and what happened. It looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source_ip"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"45.33.32.156"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-11T22:07:28+00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"response_size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6674&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. The Security Guard (ShieldDaemon)&lt;/strong&gt;&lt;br&gt;
There is a guard watching that recording live. Not checking it hours later — watching it as it happens. The guard has been watching long enough to know what a normal busy day looks like versus something suspicious.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The Pattern Recognition&lt;/strong&gt;&lt;br&gt;
If one person walks past the same shop 300 times in one minute, the guard knows that is not normal. ShieldDaemon does the same — it compares current traffic against what it has learned is normal and raises an alarm when something is off.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. The Bouncer (iptables)&lt;/strong&gt;&lt;br&gt;
When the alarm is raised, the bouncer steps in. The suspicious visitor is blocked at the door — they cannot get back in. This happens automatically within 10 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. The Radio (Slack)&lt;/strong&gt;&lt;br&gt;
Every time someone is blocked or unblocked, a message is sent to the security team instantly via Slack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. The Monitor Screen (Dashboard)&lt;/strong&gt;&lt;br&gt;
A live screen shows everything happening in real time — who is visiting, how fast, who is blocked, and how the system is performing.&lt;/p&gt;


&lt;h2&gt;
  
  
  Part 1 — Watching the Logs
&lt;/h2&gt;

&lt;p&gt;The first thing ShieldDaemon does is read the Nginx access log line by line as new requests come in. This is called &lt;strong&gt;tailing&lt;/strong&gt; a file.&lt;/p&gt;

&lt;p&gt;Nginx is configured to write logs in JSON format like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source_ip"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"45.33.32.156"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-11T22:07:28+00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"response_size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6674&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every line tells us exactly who made a request, when, what they requested, and whether it succeeded.&lt;/p&gt;

&lt;p&gt;My monitor script tails this file and passes each line to the detector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tail_log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# start at end of file
&lt;/span&gt;        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_log_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="nf"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
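
&lt;p&gt;The &lt;code&gt;parse_log_line&lt;/code&gt; helper is just defensive JSON parsing. A minimal sketch of what it could look like (the field handling here is illustrative, not the exact ShieldDaemon code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

def parse_log_line(line):
    """Parse one JSON log line; return None for blank or malformed lines."""
    line = line.strip()
    if not line:
        return None
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return None
    # Keep only the fields the detector cares about
    return {
        "ip": entry.get("source_ip"),
        "timestamp": entry.get("timestamp"),
        "status": entry.get("status"),
        "path": entry.get("path"),
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;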






&lt;h2&gt;
  
  
  Part 2 — The Sliding Window
&lt;/h2&gt;

&lt;p&gt;Now that we can see every request, we need to measure how fast they are coming.&lt;/p&gt;

&lt;p&gt;I use a &lt;strong&gt;sliding window&lt;/strong&gt; — a structure that tracks requests over the last 60 seconds. I use Python's &lt;code&gt;deque&lt;/code&gt; (double-ended queue) for this.&lt;/p&gt;

&lt;p&gt;Here is how it works in simple terms:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Imagine a conveyor belt that is 60 seconds long. Every new request gets placed on the right end. Any request older than 60 seconds falls off the left end automatically. The number of items on the belt at any moment is the current request rate.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;deque&lt;/span&gt;

&lt;span class="n"&gt;ip_window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;deque&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;ip_window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Remove entries older than 60 seconds
&lt;/span&gt;    &lt;span class="n"&gt;cutoff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;ip_window&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;ip_window&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;cutoff&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ip_window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;popleft&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Current rate = items on belt / belt length
&lt;/span&gt;    &lt;span class="n"&gt;rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ip_window&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives us an accurate requests-per-second value for every IP at any moment.&lt;/p&gt;
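
&lt;p&gt;The snippet above keeps a single window for brevity. In practice there is one window per IP; a minimal per-IP variant (the &lt;code&gt;defaultdict&lt;/code&gt; keying is my illustration, not the exact ShieldDaemon code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import defaultdict, deque

WINDOW_SECONDS = 60
windows = defaultdict(deque)  # one conveyor belt per source IP

def record(ip, timestamp):
    window = windows[ip]
    window.append(timestamp)

    # Drop anything older than the window
    cutoff = timestamp - WINDOW_SECONDS
    while window and window[0] &amp;lt; cutoff:
        window.popleft()

    # Requests per second for this IP right now
    return len(window) / WINDOW_SECONDS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;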




&lt;h2&gt;
  
  
  Part 3 — The Rolling Baseline
&lt;/h2&gt;

&lt;p&gt;Knowing the current rate is not enough. We need to know whether that rate is &lt;strong&gt;normal or not&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For example, 10 requests per second might be completely normal for a busy website during the day. But at 3am it might be a sign of an attack.&lt;/p&gt;

&lt;p&gt;This is where the &lt;strong&gt;rolling baseline&lt;/strong&gt; comes in. It learns what normal traffic looks like over the last 30 minutes.&lt;/p&gt;

&lt;p&gt;Every second we record how many requests came in. Every 60 seconds we calculate the &lt;strong&gt;mean&lt;/strong&gt; (average) and &lt;strong&gt;standard deviation&lt;/strong&gt; (how much it varies) of those counts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;variance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;std&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;variance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The baseline also maintains &lt;strong&gt;per-hour slots&lt;/strong&gt; — so it learns that traffic during business hours is higher than traffic at night, and adjusts accordingly.&lt;/p&gt;

&lt;p&gt;Floor values of 0.1 are applied to both mean and standard deviation to prevent false positives when there is zero traffic.&lt;/p&gt;
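
&lt;p&gt;Putting those pieces together, here is a minimal sketch of the baseline idea: hourly slots, a rolling window of per-second counts, and the 0.1 floors (class and method names are illustrative, not the exact ShieldDaemon code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math
from collections import defaultdict, deque

class RollingBaseline:
    def __init__(self, window_minutes=30):
        # One rolling window of per-second request counts per hour of the day
        self.slots = defaultdict(lambda: deque(maxlen=window_minutes * 60))

    def record_second(self, hour, count):
        self.slots[hour].append(count)

    def stats(self, hour):
        counts = self.slots[hour]
        if not counts:
            return 0.1, 0.1  # nothing learned yet for this hour
        mean = sum(counts) / len(counts)
        variance = sum((x - mean) ** 2 for x in counts) / len(counts)
        std = math.sqrt(variance)
        # Floors prevent divide-by-zero and false positives at zero traffic
        return max(mean, 0.1), max(std, 0.1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;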




&lt;h2&gt;
  
  
  Part 4 — Detecting Anomalies
&lt;/h2&gt;

&lt;p&gt;Now we have two things: the current rate and the baseline. We compare them using two methods.&lt;/p&gt;

&lt;h3&gt;
  
  
  Method 1 — Z-Score
&lt;/h3&gt;

&lt;p&gt;The z-score tells us how many standard deviations the current rate is above normal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;z_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_rate&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;baseline_mean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;baseline_std&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the z-score is above 2.0, something is unusual. A z-score of 2.0 means the rate is so high it would only happen naturally about 2% of the time, assuming roughly normally distributed traffic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Method 2 — Rate Multiplier
&lt;/h3&gt;

&lt;p&gt;We also check if the rate is simply more than 2 times the baseline mean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_rate&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;baseline_mean&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# anomaly detected
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Whichever fires first triggers the response.&lt;/strong&gt; This gives us two layers of protection.&lt;/p&gt;

&lt;p&gt;If an IP also has a high rate of error responses (4xx and 5xx), the thresholds tighten automatically to catch it sooner.&lt;/p&gt;
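
&lt;p&gt;A condensed sketch of how the two checks and the error-rate tightening might combine (the 2.0 thresholds are the ones described above; the 50% error cutoff and 0.75 tightening factor are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def is_anomalous(current_rate, baseline_mean, baseline_std, error_ratio):
    z_threshold = 2.0
    rate_multiplier = 2.0

    # Tighten the thresholds for IPs producing mostly 4xx/5xx responses
    if error_ratio &amp;gt; 0.5:
        z_threshold *= 0.75
        rate_multiplier *= 0.75

    z_score = (current_rate - baseline_mean) / baseline_std
    if z_score &amp;gt; z_threshold:
        return True, f"z-score={z_score:.2f} &amp;gt; threshold={z_threshold}"
    if current_rate &amp;gt; rate_multiplier * baseline_mean:
        return True, f"rate {current_rate:.2f} req/s &amp;gt; {rate_multiplier}x baseline"
    return False, None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;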




&lt;h2&gt;
  
  
  Part 5 — Blocking with iptables
&lt;/h2&gt;

&lt;p&gt;When an anomaly is detected the IP gets blocked at the &lt;strong&gt;firewall level&lt;/strong&gt; using iptables. This means the server stops accepting any traffic from that IP before it even reaches Nginx or Nextcloud.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iptables&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-I&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INPUT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-j&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DROP&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This happens within 10 seconds of detection.&lt;/p&gt;

&lt;p&gt;Here is what a blocked IP looks like in iptables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;Chain INPUT (policy ACCEPT)
target     prot opt source               destination
DROP       all  --  45.33.32.156         0.0.0.0/0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Part 6 — Auto-Unban with Backoff Schedule
&lt;/h2&gt;

&lt;p&gt;Blocking an IP forever for a first offence is too harsh — it might be a false positive. But being too lenient encourages repeat attacks.&lt;/p&gt;

&lt;p&gt;I implemented a &lt;strong&gt;progressive backoff schedule&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Offence&lt;/th&gt;
&lt;th&gt;Ban Duration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1st ban&lt;/td&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2nd ban&lt;/td&gt;
&lt;td&gt;30 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3rd ban&lt;/td&gt;
&lt;td&gt;2 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4th+ ban&lt;/td&gt;
&lt;td&gt;Permanent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each ban is scheduled using a Python timer thread that fires after the duration and removes the iptables rule automatically. A Slack notification is sent every time an IP is unbanned.&lt;/p&gt;
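
&lt;p&gt;A minimal sketch of that unban scheduling with &lt;code&gt;threading.Timer&lt;/code&gt; (the schedule mirrors the table above; function names are illustrative, and the iptables call simply reverses the block rule shown earlier):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess
import threading

BAN_SCHEDULE = [600, 1800, 7200]  # 10 min, 30 min, 2 hours; 4th+ offence is permanent

def unblock_ip(ip):
    # Remove the DROP rule that the ban inserted
    subprocess.run(["iptables", "-D", "INPUT", "-s", ip, "-j", "DROP"])
    # A Slack "unbanned" notification is also sent at this point

def schedule_unban(ip, offence_count):
    if offence_count &amp;gt; len(BAN_SCHEDULE):
        return  # permanent ban, no timer
    duration = BAN_SCHEDULE[offence_count - 1]
    timer = threading.Timer(duration, unblock_ip, args=(ip,))
    timer.daemon = True  # timers should not keep the process alive
    timer.start()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;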




&lt;h2&gt;
  
  
  Part 7 — Slack Alerts
&lt;/h2&gt;

&lt;p&gt;Every significant event sends an alert to Slack:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ban alert example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; IP BANNED
• IP: 45.33.32.156
• Condition: z-score=5.43 &amp;gt; threshold=2.0
• Current rate: 3.72 req/s
• Baseline: 0.10 req/s
• Ban duration: 600 seconds
• Timestamp: 2026-05-11T22:07:33Z
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Global anomaly alert:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; GLOBAL TRAFFIC ANOMALY
• Condition: Global request rate spike
• Current rate: 3.10 req/s
• Baseline: 0.10 req/s
• Action: No IP ban — monitoring closely
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
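
&lt;p&gt;Sending these alerts is a single POST to a Slack incoming-webhook URL. A minimal sketch using the &lt;code&gt;requests&lt;/code&gt; library (the webhook URL should come from config, never from a hardcoded string):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

def notify_slack(webhook_url, message):
    """Post a plain-text alert to a Slack incoming webhook."""
    try:
        resp = requests.post(webhook_url, json={"text": message}, timeout=5)
        resp.raise_for_status()
    except requests.RequestException as exc:
        # Alerting must never crash the detection loop
        print(f"Slack notification failed: {exc}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;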






&lt;h2&gt;
  
  
  Part 8 — The Live Dashboard
&lt;/h2&gt;

&lt;p&gt;The dashboard at port 8080 refreshes every 3 seconds and shows everything happening in real time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Global request rate&lt;/li&gt;
&lt;li&gt;Baseline mean and standard deviation&lt;/li&gt;
&lt;li&gt;Blocked IPs with ban count&lt;/li&gt;
&lt;li&gt;CPU and memory usage&lt;/li&gt;
&lt;li&gt;System uptime&lt;/li&gt;
&lt;li&gt;Top 10 source IPs&lt;/li&gt;
&lt;li&gt;Live traffic chart vs baseline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is built with Flask and Chart.js, styled with a dark blue security-themed design.&lt;/p&gt;
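
&lt;p&gt;Under the hood the dashboard is mostly a Flask route serving the current stats as JSON, which the page polls every 3 seconds. A stripped-down sketch (field names are illustrative, and I am assuming &lt;code&gt;psutil&lt;/code&gt; for the CPU and memory numbers):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from flask import Flask, jsonify
import psutil  # CPU / memory stats shown on the dashboard

app = Flask(__name__)

# The detector thread keeps this dict up to date
state = {"current_rate": 0.0, "baseline_mean": 0.1, "baseline_std": 0.1, "blocked_ips": []}

@app.route("/api/stats")
def stats():
    return jsonify({
        **state,
        "cpu_percent": psutil.cpu_percent(),
        "memory_percent": psutil.virtual_memory().percent,
    })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;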




&lt;h2&gt;
  
  
  Challenges I Faced
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The baseline kept adapting to attack traffic.&lt;/strong&gt; When I injected test requests the baseline learned those high rates as normal and stopped flagging them. The fix was to restart the daemon with a clean baseline before testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The latency calculation was wrong.&lt;/strong&gt; My first attempt used &lt;code&gt;date +%s%N&lt;/code&gt;, which is not supported by every &lt;code&gt;date&lt;/code&gt; implementation. I switched to curl's built-in &lt;code&gt;%{time_total}&lt;/code&gt; timing instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Slack webhook was accidentally exposed.&lt;/strong&gt; I committed the webhook URL to GitHub and GitHub's secret scanning blocked the push. I revoked the token immediately and used a placeholder in the config file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker volume mounting.&lt;/strong&gt; The detector container needed to read the Nginx log file through a shared Docker volume called &lt;code&gt;HNG-nginx-logs&lt;/code&gt;. Getting the volume permissions right took some debugging.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;Building ShieldDaemon taught me that &lt;strong&gt;real security tools are statistical, not rule-based&lt;/strong&gt;. A fixed threshold of "block anyone who sends more than 100 requests per minute" would block legitimate users during a product launch. A statistical baseline that learns from actual traffic patterns is far more accurate.&lt;/p&gt;

&lt;p&gt;I also learned that &lt;strong&gt;the order of operations matters in security&lt;/strong&gt;. You must detect before you block. You must verify before you unban. You must log everything so you can audit what happened.&lt;/p&gt;

&lt;p&gt;Most importantly I learned that &lt;strong&gt;security is a continuous process&lt;/strong&gt;. ShieldDaemon runs forever, constantly learning and adapting. There is no finish line — only a daemon that never sleeps.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;A fully working DDoS detection engine that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Watches Nginx logs in real time&lt;/li&gt;
&lt;li&gt;Learns normal traffic patterns automatically&lt;/li&gt;
&lt;li&gt;Detects attacks within seconds using z-scores&lt;/li&gt;
&lt;li&gt;Blocks malicious IPs with iptables&lt;/li&gt;
&lt;li&gt;Unbans automatically on a backoff schedule&lt;/li&gt;
&lt;li&gt;Alerts the team via Slack&lt;/li&gt;
&lt;li&gt;Shows everything on a live dashboard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can see it running at &lt;strong&gt;&lt;a href="http://13.60.224.73:8080" rel="noopener noreferrer"&gt;http://13.60.224.73:8080&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The full source code is at &lt;strong&gt;&lt;a href="https://github.com/asanteedith/Shield-Daemon-Detection-Engine" rel="noopener noreferrer"&gt;https://github.com/asanteedith/Shield-Daemon-Detection-Engine&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;Written by Edith Asante — Cloud &amp;amp; DevOps Engineer. Find me on GitHub | Dev.to&lt;/p&gt;

</description>
      <category>devops</category>
      <category>security</category>
      <category>python</category>
      <category>docker</category>
    </item>
    <item>
      <title>Building a Self-Service Sandbox Platform from Scratch</title>
      <dc:creator>Edith Asante</dc:creator>
      <pubDate>Mon, 11 May 2026 16:31:41 +0000</pubDate>
      <link>https://dev.to/edithasante/building-a-self-service-sandbox-platform-from-scratch-4ff8</link>
      <guid>https://dev.to/edithasante/building-a-self-service-sandbox-platform-from-scratch-4ff8</guid>
      <description>&lt;p&gt;&lt;em&gt;This is part of my HNG DevOps internship series. Follow along as I document every stage.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  A Quick Recap
&lt;/h2&gt;

&lt;p&gt;Stage 0 was about securing a Linux server. Stage 1 was deploying an API behind Nginx. Stage 2 was containerizing a microservices app. Stage 3 was building a DDoS detection engine. Stage 4 was writing a declarative deployment tool. Stage 5 is the most ambitious yet.&lt;/p&gt;

&lt;p&gt;This time there was no starter code. No bugs to fix. No existing app to containerize. I had to build the entire platform from scratch — a self-service system where users can spin up isolated temporary environments, deploy apps into them, simulate outages, monitor health, and have everything auto-destroyed when the lifetime expires. Think of it as a miniature internal Heroku with a chaos engineering toggle.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Task
&lt;/h2&gt;

&lt;p&gt;The platform had to do all of this on a single Linux VM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Environment Lifecycle&lt;/strong&gt; — create and destroy isolated Docker environments on demand with a configurable TTL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto Cleanup Daemon&lt;/strong&gt; — a background process that scans every 60 seconds and destroys expired environments automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Nginx Routing&lt;/strong&gt; — every new environment gets its own Nginx config written and reloaded automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log Shipping&lt;/strong&gt; — container logs captured and queryable by environment ID&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Health Monitoring&lt;/strong&gt; — a poller that hits every environment's &lt;code&gt;/health&lt;/code&gt; endpoint every 30 seconds and marks environments as degraded after 3 consecutive failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outage Simulation&lt;/strong&gt; — a script that can crash, pause, disconnect, or stress-test any environment on demand&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control API&lt;/strong&gt; — a REST API with 6 endpoints wrapping all the scripts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Makefile&lt;/strong&gt; — every action available as a make target&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The stack was Docker, Docker Compose, Nginx, Bash, Python 3, and Flask. Everything had to spin up with one command.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Repo Structure and Scaffold
&lt;/h2&gt;

&lt;p&gt;Before writing a single line of logic I set up the repo structure exactly as specified:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;devops-sandbox/
├── platform/
│   ├── create_env.sh
│   ├── destroy_env.sh
│   ├── cleanup_daemon.sh
│   ├── simulate_outage.sh
│   └── api.py
├── nginx/
│   ├── nginx.conf
│   └── conf.d/
├── monitor/
│   └── health_poller.sh
├── logs/
├── envs/
├── Makefile
├── docker-compose.yml
├── README.md
├── .env.example
└── .gitignore
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Getting this right first saved a lot of headaches later. Every script references paths relative to the project root, and if those paths don't exist at runtime the scripts fail silently. I also set &lt;code&gt;chmod +x&lt;/code&gt; on all shell scripts immediately — forgetting this causes confusing permission errors later.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;.gitignore&lt;/code&gt; was set up to exclude &lt;code&gt;envs/&lt;/code&gt;, &lt;code&gt;logs/&lt;/code&gt;, and &lt;code&gt;.env&lt;/code&gt; from the start. These directories contain runtime state and secrets that should never be committed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: The Demo App
&lt;/h2&gt;

&lt;p&gt;The platform needed something to run inside each environment. The task was clear that the demo app is not the project — the platform is. So I kept it simple: a Flask app with two routes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;jsonify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello from the sandbox!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;env_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ENV_ID&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/health&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;health&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;jsonify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;env_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ENV_ID&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;/health&lt;/code&gt; route is the critical one. The health poller depends on it. Every environment container gets its &lt;code&gt;ENV_ID&lt;/code&gt; injected as an environment variable so you can always tell which container you are talking to.&lt;/p&gt;

&lt;p&gt;The app binds to &lt;code&gt;0.0.0.0&lt;/code&gt;, not &lt;code&gt;127.0.0.1&lt;/code&gt;. This is a mistake I see constantly. If you bind to localhost inside a container, nothing outside the container can reach it — including Nginx.&lt;/p&gt;
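
&lt;p&gt;In practice the bind address is just the last line of the app, something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;if __name__ == "__main__":
    # 0.0.0.0 so Nginx, which lives outside this container, can reach port 5000
    app.run(host="0.0.0.0", port=5000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;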




&lt;h2&gt;
  
  
  Step 3: Nginx Dynamic Routing
&lt;/h2&gt;

&lt;p&gt;Nginx is the front door for every environment. The key insight is that &lt;code&gt;nginx.conf&lt;/code&gt; never needs to change. It just includes everything in &lt;code&gt;conf.d/&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;http&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;include&lt;/span&gt; &lt;span class="n"&gt;/etc/nginx/conf.d/*.conf&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;listen&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt; &lt;span class="s"&gt;default_server&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;404&lt;/span&gt; &lt;span class="s"&gt;"No&lt;/span&gt; &lt;span class="s"&gt;environment&lt;/span&gt; &lt;span class="s"&gt;found&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;n"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;create_env.sh&lt;/code&gt; runs, it writes a new file to &lt;code&gt;nginx/conf.d/$ENV_ID.conf&lt;/code&gt; and reloads Nginx. When &lt;code&gt;destroy_env.sh&lt;/code&gt; runs, it deletes that file and reloads Nginx again. No manual config editing ever.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;conf.d/&lt;/code&gt; directory is mounted as a Docker volume into the Nginx container. This means files written to &lt;code&gt;nginx/conf.d/&lt;/code&gt; on the host appear immediately inside the container. Only a reload is needed, not a rebuild.&lt;/p&gt;

&lt;p&gt;One critical mistake to avoid: never write the Nginx config before the container is running. Nginx validates upstream hostnames on reload. If you write a config pointing to a container that doesn't exist yet, the reload fails and Nginx goes down. The order matters — start the container first, then write the config.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Environment Lifecycle
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;create_env.sh&lt;/code&gt; is the heart of the platform. It has to do six things in the right order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate a unique env ID from the name and a timestamp suffix&lt;/li&gt;
&lt;li&gt;Create a dedicated Docker network for the environment&lt;/li&gt;
&lt;li&gt;Connect the Nginx container to that network&lt;/li&gt;
&lt;li&gt;Start the app container on that network with a &lt;code&gt;sandbox.env=$ENV_ID&lt;/code&gt; label&lt;/li&gt;
&lt;li&gt;Write the Nginx config and reload&lt;/li&gt;
&lt;li&gt;Write the state file to &lt;code&gt;envs/$ENV_ID.json&lt;/code&gt; atomically&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The atomic write is important. The cleanup daemon reads these state files in a loop. If a write crashes halfway, the daemon reads garbage and fails. The fix is to write to a temp file first and then &lt;code&gt;mv&lt;/code&gt; it into place:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;TEMP_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;mktemp&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ENVS_DIR&lt;/span&gt;&lt;span class="s2"&gt;/.tmp.XXXXXX"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TEMP_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;JSON&lt;/span&gt;&lt;span class="sh"&gt;
{
  "id": "&lt;/span&gt;&lt;span class="nv"&gt;$ENV_ID&lt;/span&gt;&lt;span class="sh"&gt;",
  "name": "&lt;/span&gt;&lt;span class="nv"&gt;$ENV_NAME&lt;/span&gt;&lt;span class="sh"&gt;",
  "container": "&lt;/span&gt;&lt;span class="nv"&gt;$CONTAINER_NAME&lt;/span&gt;&lt;span class="sh"&gt;",
  "network": "&lt;/span&gt;&lt;span class="nv"&gt;$NETWORK_NAME&lt;/span&gt;&lt;span class="sh"&gt;",
  "created_at": "&lt;/span&gt;&lt;span class="nv"&gt;$CREATED_AT&lt;/span&gt;&lt;span class="sh"&gt;",
  "ttl": &lt;/span&gt;&lt;span class="nv"&gt;$TTL&lt;/span&gt;&lt;span class="sh"&gt;,
  "status": "running"
}
&lt;/span&gt;&lt;span class="no"&gt;JSON
&lt;/span&gt;&lt;span class="nb"&gt;mv&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TEMP_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ENVS_DIR&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$ENV_ID&lt;/span&gt;&lt;span class="s2"&gt;.json"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;mv&lt;/code&gt; is atomic on Linux when source and destination are on the same filesystem. The daemon either reads the complete file or nothing.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;destroy_env.sh&lt;/code&gt; reverses all of this in the correct order — kill the log shipper first, stop and remove containers, disconnect Nginx from the network, remove the network, delete the Nginx config, reload Nginx, archive logs, delete the state file. Order matters here too. You cannot remove a network while containers are still connected to it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5: The Cleanup Daemon
&lt;/h2&gt;

&lt;p&gt;The daemon runs in an infinite loop with a 60 second sleep. On each iteration it reads every file in &lt;code&gt;envs/&lt;/code&gt;, computes how much time has passed since &lt;code&gt;created_at&lt;/code&gt;, and calls &lt;code&gt;destroy_env.sh&lt;/code&gt; if the TTL has been exceeded.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;CREATED_EPOCH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CREATED_AT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;NOW_EPOCH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;EXPIRES_AT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;CREATED_EPOCH &lt;span class="o"&gt;+&lt;/span&gt; TTL&lt;span class="k"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$NOW_EPOCH&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-ge&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$EXPIRES_AT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;bash &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DESTROY_SCRIPT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ENV_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One thing that breaks this: not using &lt;code&gt;nullglob&lt;/code&gt;. If &lt;code&gt;envs/&lt;/code&gt; is empty, &lt;code&gt;*.json&lt;/code&gt; expands to the literal string &lt;code&gt;*.json&lt;/code&gt; and the loop tries to process a file called &lt;code&gt;*.json&lt;/code&gt; which doesn't exist. Fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;shopt&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; nullglob
&lt;span class="nv"&gt;STATE_FILES&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ENVS_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;/&lt;span class="k"&gt;*&lt;/span&gt;.json&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;shopt&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; nullglob
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every action is timestamped and written to &lt;code&gt;logs/cleanup.log&lt;/code&gt;. The daemon runs in the background with &lt;code&gt;nohup&lt;/code&gt; and its PID is saved so &lt;code&gt;make down&lt;/code&gt; can stop it cleanly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 6: Health Monitoring
&lt;/h2&gt;

&lt;p&gt;The health poller runs every 30 seconds. For each active environment it finds the container's IP address, hits &lt;code&gt;GET /health&lt;/code&gt;, measures the latency, and writes the result to &lt;code&gt;logs/$ENV_ID/health.log&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Getting latency right was harder than expected. My first approach used &lt;code&gt;date +%s%N&lt;/code&gt; for nanosecond timestamps. This failed because the &lt;code&gt;%N&lt;/code&gt; flag is not supported by the &lt;code&gt;date&lt;/code&gt; implementation on the VM. The numbers came out as something like &lt;code&gt;14209454ms&lt;/code&gt; for a request that obviously took under a second.&lt;/p&gt;

&lt;p&gt;The fix was to use curl's own built-in timing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;RESULT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; /dev/null &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-w&lt;/span&gt; &lt;span class="s2"&gt;"%{http_code} %{time_total}"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-time&lt;/span&gt; 5 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"http://&lt;/span&gt;&lt;span class="nv"&gt;$CONTAINER_IP&lt;/span&gt;&lt;span class="s2"&gt;:5000/health"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nv"&gt;HTTP_STATUS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$RESULT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $1}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;TIME_SEC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$RESULT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $2}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;LATENCY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TIME_SEC&lt;/span&gt;&lt;span class="s2"&gt; * 1000"&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{printf "%d", $1 * 1000}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;curl&lt;/code&gt;'s &lt;code&gt;%{time_total}&lt;/code&gt; gives you wall clock time in seconds as a decimal. Multiply by 1000 and you have milliseconds. Accurate and reliable.&lt;/p&gt;

&lt;p&gt;After 3 consecutive failures the poller marks the environment as degraded by updating the state file. It also resets the fail counter and restores the status to running when checks pass again. The status update uses the same atomic write pattern as the lifecycle scripts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 7: Outage Simulation
&lt;/h2&gt;

&lt;p&gt;The simulation script accepts &lt;code&gt;--env&lt;/code&gt; and &lt;code&gt;--mode&lt;/code&gt; flags. The modes map directly to Docker commands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;crash&lt;/code&gt; → &lt;code&gt;docker kill&lt;/code&gt; (SIGKILL, not graceful)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pause&lt;/code&gt; → &lt;code&gt;docker pause&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;network&lt;/code&gt; → &lt;code&gt;docker network disconnect&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;recover&lt;/code&gt; → inspects current state and reverses whichever mode is active&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;stress&lt;/code&gt; → &lt;code&gt;stress-ng&lt;/code&gt; inside the container for 60 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The guard at the top of the script is not optional. It checks whether the target container name matches any protected service names and refuses to run if it does:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;PROTECTED&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="s2"&gt;"sandbox-nginx"&lt;/span&gt; &lt;span class="s2"&gt;"cleanup_daemon"&lt;/span&gt; &lt;span class="s2"&gt;"sandbox-api"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;PROTECTED_NAME &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROTECTED&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CONTAINER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PROTECTED_NAME&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"ERROR: Refusing to simulate outage against protected container"&lt;/span&gt;
        &lt;span class="nb"&gt;exit &lt;/span&gt;1
    &lt;span class="k"&gt;fi
done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without this guard, nothing stops someone from passing the Nginx container ID and taking down the entire platform.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;recover&lt;/code&gt; mode was the most interesting to write. It does not know which mode caused the problem — it just inspects the current state and fixes whatever is wrong. Paused? Unpause. Exited? Restart. Network disconnected? Reconnect. This makes recover genuinely useful rather than just a wrapper around one specific undo.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 8: The Control API
&lt;/h2&gt;

&lt;p&gt;The Flask API wraps all the scripts via &lt;code&gt;subprocess.run&lt;/code&gt;. It has 6 endpoints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST   /envs              → create env
GET    /envs              → list active envs + TTL remaining
DELETE /envs/:id          → destroy env
GET    /envs/:id/logs     → last 100 lines of app.log
GET    /envs/:id/health   → last 10 health check results
POST   /envs/:id/outage   → trigger simulation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The TTL remaining calculation happens in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ttl_remaining&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;created&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromisoformat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;+00:00&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;created&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;total_seconds&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ttl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;elapsed&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API runs inside a Docker container with the project directory mounted as a volume and the Docker socket mounted so it can execute Docker commands. This is the standard pattern for tools that need to manage Docker from inside Docker.&lt;/p&gt;
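
&lt;p&gt;As an illustration of the wrapping pattern, here is roughly what the create endpoint could look like (the request fields, script arguments, and responses are simplified assumptions, not the exact implementation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/envs", methods=["POST"])
def create_env():
    body = request.get_json(silent=True) or {}
    name = body.get("name", "sandbox")
    ttl = str(body.get("ttl", 3600))

    # The API is a thin wrapper: the shell script does the real work
    result = subprocess.run(
        ["bash", "platform/create_env.sh", name, ttl],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return jsonify({"error": result.stderr.strip()}), 500
    return jsonify({"message": "environment created", "output": result.stdout.strip()}), 201
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;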




&lt;h2&gt;
  
  
  Step 9: The Makefile
&lt;/h2&gt;

&lt;p&gt;Every action has a make target. The two most important ones are &lt;code&gt;up&lt;/code&gt; and &lt;code&gt;down&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;make up&lt;/code&gt; starts Nginx and the API via Docker Compose, then starts the cleanup daemon and health poller as background processes with &lt;code&gt;nohup&lt;/code&gt;, saving their PIDs to files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nl"&gt;up&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--build&lt;/span&gt;
    &lt;span class="nb"&gt;nohup &lt;/span&gt;bash platform/cleanup_daemon.sh &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; logs/cleanup.log 2&amp;gt;&amp;amp;1 &amp;amp;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$$&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; logs/cleanup_daemon.pid
    &lt;span class="nb"&gt;nohup &lt;/span&gt;bash monitor/health_poller.sh &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; logs/poller.log 2&amp;gt;&amp;amp;1 &amp;amp;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$$&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; logs/health_poller.pid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;make down&lt;/code&gt; reads those PID files and kills the processes cleanly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nl"&gt;down&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; logs/cleanup_daemon.pid &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nb"&gt;kill&lt;/span&gt; &lt;span class="p"&gt;$$(&lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;logs/cleanup_daemon.pid&lt;span class="p"&gt;)&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; logs/cleanup_daemon.pid&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Makefile syntax has one rule that catches everyone: recipe indentation must use tabs, not spaces. If you use spaces, make throws a cryptic &lt;code&gt;missing separator&lt;/code&gt; error that gives no hint the fix is a tab.&lt;/p&gt;




&lt;h2&gt;
  
  
  Problems I Hit Along the Way
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Docker permission denied on a fresh VM&lt;/strong&gt; — The ubuntu user is not in the docker group by default. Fix: &lt;code&gt;sudo usermod -aG docker $USER&lt;/code&gt; followed by &lt;code&gt;newgrp docker&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nginx crashing on startup&lt;/strong&gt; — I left a sample &lt;code&gt;example.conf&lt;/code&gt; file in &lt;code&gt;nginx/conf.d/&lt;/code&gt; as a reference. Nginx tried to resolve the upstream hostname &lt;code&gt;example:5000&lt;/code&gt; on startup, failed, and crashed. The fix was obvious in hindsight: delete the sample file before starting Nginx.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disk full during Docker build&lt;/strong&gt; — &lt;code&gt;docker system prune -af&lt;/code&gt; recovered the space. The build cache had accumulated several GB from previous builds and test runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;demo-app:latest&lt;/code&gt; image lost after prune&lt;/strong&gt; — Docker prune removes all images not referenced by a running container. After cleaning disk space the demo app image was gone. Always rebuild the demo app image after a prune: &lt;code&gt;docker build -t demo-app:latest ./demo-app&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Health log latency showing 14 million milliseconds&lt;/strong&gt; — Caused by &lt;code&gt;date +%s%N&lt;/code&gt; not being supported in that environment, so the timestamp arithmetic produced garbage values. Fixed by switching to curl's built-in &lt;code&gt;%{time_total}&lt;/code&gt; timing.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Big Picture
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What we built&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dedicated Docker network per environment&lt;/td&gt;
&lt;td&gt;Complete isolation — environments cannot interfere with each other&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Atomic state file writes&lt;/td&gt;
&lt;td&gt;Prevents corruption when daemon and scripts write concurrently&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nginx config as code&lt;/td&gt;
&lt;td&gt;Dynamic routing without touching the main config&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Log shipper PID tracking&lt;/td&gt;
&lt;td&gt;Prevents zombie processes on destroy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guard in simulation script&lt;/td&gt;
&lt;td&gt;Prevents accidental destruction of platform infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Health-based degraded detection&lt;/td&gt;
&lt;td&gt;Automated observability without external tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;REST API over raw scripts&lt;/td&gt;
&lt;td&gt;Makes the platform programmable and integratable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The hardest part of this task was not any single script. It was understanding the correct order of operations. Create the container before writing the Nginx config. Kill the log shipper before removing the container. Disconnect the network before removing it. Write state files atomically. These ordering constraints are not obvious until something breaks, and when they break they break in confusing ways.&lt;/p&gt;

&lt;p&gt;That is the difference between infrastructure that works in a demo and infrastructure that works at 3am when something goes wrong.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Stage 5 complete. Find me on Dev.to | &lt;a href="https://github.com/asanteedith/devops-sandbox" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>docker</category>
      <category>bash</category>
      <category>beginners</category>
    </item>
    <item>
      <title># Containerizing a Broken Microservices App and Shipping It with a Full CI/CD Pipeline</title>
      <dc:creator>Edith Asante</dc:creator>
      <pubDate>Mon, 11 May 2026 01:03:21 +0000</pubDate>
      <link>https://dev.to/edithasante/-containerizing-a-broken-microservices-app-and-shipping-it-with-a-full-cicd-pipeline-407b</link>
      <guid>https://dev.to/edithasante/-containerizing-a-broken-microservices-app-and-shipping-it-with-a-full-cicd-pipeline-407b</guid>
      <description>&lt;p&gt;&lt;em&gt;This is part of my HNG DevOps internship series. In Stage 1 I deployed a personal API behind Nginx on a live server. Stage 2 is where things got serious.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Task
&lt;/h2&gt;

&lt;p&gt;We were handed a broken codebase and told to make it production-ready. No hints about what was wrong. No list of bugs. Just the code and the instruction: &lt;em&gt;"Finding them is part of the task."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The application was a distributed job processing system made up of four services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;frontend&lt;/strong&gt; (Node.js/Express) where users submit and track jobs&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;API&lt;/strong&gt; (Python/FastAPI) that creates jobs and serves status updates&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;worker&lt;/strong&gt; (Python) that picks up and processes jobs from a queue&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;Redis&lt;/strong&gt; instance shared between the API and worker as a message broker&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My job was to find every bug, fix every misconfiguration, containerize all three services with production-quality Dockerfiles, wire everything together with Docker Compose, and build a full CI/CD pipeline that runs lint, tests, security scanning, integration tests, and rolling deployment — all in strict order.&lt;/p&gt;




&lt;h2&gt;
  
  
  Reading the Code Before Touching Anything
&lt;/h2&gt;

&lt;p&gt;The first thing I did was read every file carefully before writing a single line of infrastructure. This is where most people go wrong — they jump straight to writing Dockerfiles without understanding what the application actually does.&lt;/p&gt;

&lt;p&gt;Here is what I found.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Redis hostname problem
&lt;/h3&gt;

&lt;p&gt;Both &lt;code&gt;api/main.py&lt;/code&gt; and &lt;code&gt;frontend/app.js&lt;/code&gt; had hardcoded &lt;code&gt;localhost&lt;/code&gt; as the Redis and API hostname respectively. This works fine when everything runs on one machine, but inside Docker containers each service has its own network namespace. &lt;code&gt;localhost&lt;/code&gt; inside the API container points to the API container itself, not Redis.&lt;/p&gt;

&lt;p&gt;The fix was straightforward — use environment variables and Docker's built-in DNS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before
&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6379&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# After
&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REDIS_HOST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6379&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Docker Compose automatically creates DNS entries for each service using the service name. So &lt;code&gt;redis&lt;/code&gt; resolves to the Redis container's IP address inside the network.&lt;/p&gt;

&lt;h3&gt;
  
  
  The silent queue mismatch
&lt;/h3&gt;

&lt;p&gt;This one was subtle. The API was pushing job IDs to a Redis list called &lt;code&gt;job_queue&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lpush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job_queue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But the worker was polling a completely different list called &lt;code&gt;job&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;blpop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every job submitted through the API went into &lt;code&gt;job_queue&lt;/code&gt;. The worker was watching &lt;code&gt;job&lt;/code&gt;. Jobs piled up forever in &lt;code&gt;pending&lt;/code&gt; state and nobody ever processed them. The fix was one word — change &lt;code&gt;job&lt;/code&gt; to &lt;code&gt;job_queue&lt;/code&gt; in the worker.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Python magic variable typo
&lt;/h3&gt;

&lt;p&gt;The worker file ended with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;process_redis_jobs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note &lt;code&gt;name&lt;/code&gt; instead of &lt;code&gt;__name__&lt;/code&gt;. This means the main function never ran. The container started, did nothing, and sat there silently. Changed to &lt;code&gt;if __name__ == "__main__":&lt;/code&gt; and the worker came to life.&lt;/p&gt;

&lt;h3&gt;
  
  
  Missing CORS headers
&lt;/h3&gt;

&lt;p&gt;The frontend was making HTTP requests to the API from a browser. Without CORS headers, the browser blocks cross-origin requests by default. Added &lt;code&gt;CORSMiddleware&lt;/code&gt; to the FastAPI app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi.middleware.cors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CORSMiddleware&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;CORSMiddleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;allow_origins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;allow_methods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;allow_headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Redis byte strings
&lt;/h3&gt;

&lt;p&gt;The Redis client was returning raw bytes instead of strings, so &lt;code&gt;job_id&lt;/code&gt; would come back as &lt;code&gt;b'abc-123'&lt;/code&gt; instead of &lt;code&gt;abc-123&lt;/code&gt;. Added &lt;code&gt;decode_responses=True&lt;/code&gt; to the Redis connection to get UTF-8 strings automatically.&lt;/p&gt;
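
&lt;p&gt;For reference, the change is a single keyword argument on the connection, reusing the environment-variable host from earlier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

import redis

# decode_responses=True makes the client return str instead of bytes
r = redis.Redis(
    host=os.getenv("REDIS_HOST", "redis"),
    port=6379,
    decode_responses=True,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;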




&lt;h2&gt;
  
  
  Writing Production Dockerfiles
&lt;/h2&gt;

&lt;p&gt;Once I understood the application I wrote Dockerfiles for all three services. The two rules I followed strictly: multi-stage builds and non-root users.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-stage builds
&lt;/h3&gt;

&lt;p&gt;A naive Dockerfile copies all your source code and runs &lt;code&gt;pip install&lt;/code&gt;. The resulting image contains your build tools, pip cache, compiler output — everything the build needed but the runtime doesn't. Multi-stage builds fix this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Stage 1: install dependencies&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;python:3.11-slim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements.txt .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--user&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Stage 2: copy only what's needed to run&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;python:3.11-slim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;runtime&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /root/.local /home/edith/.local&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The final image only contains the installed packages and source code. Build tools never make it in. Image size reduced by over 70%.&lt;/p&gt;

&lt;h3&gt;
  
  
  Non-root users
&lt;/h3&gt;

&lt;p&gt;Every service creates and runs as a dedicated user called &lt;code&gt;edith&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;RUN &lt;/span&gt;useradd &lt;span class="nt"&gt;-m&lt;/span&gt; edith
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; edith:edith /home/edith /app
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; edith&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If someone finds a vulnerability in your application and gets code execution, they get a restricted user with no special privileges — not root access to the container.&lt;/p&gt;

&lt;h3&gt;
  
  
  Health checks
&lt;/h3&gt;

&lt;p&gt;Every Dockerfile includes a &lt;code&gt;HEALTHCHECK&lt;/code&gt; instruction so Docker knows whether the service is actually working, not just running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# API&lt;/span&gt;
&lt;span class="k"&gt;HEALTHCHECK&lt;/span&gt;&lt;span class="s"&gt; --interval=30s --timeout=10s --retries=3 \&lt;/span&gt;
  CMD curl -f http://127.0.0.1:8000/health || exit 1

&lt;span class="c"&gt;# Worker — no HTTP port, so use a filesystem heartbeat&lt;/span&gt;
&lt;span class="k"&gt;HEALTHCHECK&lt;/span&gt;&lt;span class="s"&gt; --interval=30s --timeout=10s --retries=3 \&lt;/span&gt;
  CMD test -f /tmp/worker_healthy || exit 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The worker writes a timestamp to &lt;code&gt;/tmp/worker_healthy&lt;/code&gt; on every loop. The health check above only verifies that the file exists, which catches a worker that never started its loop; to also catch a worker that is stuck, the check needs to compare the file's age against the loop interval, so a stale heartbeat marks the container unhealthy.&lt;/p&gt;
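
&lt;p&gt;A minimal sketch of the heartbeat side, assuming a worker loop roughly like this (the names and interval are illustrative, not the exact project code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Heartbeat sketch -- illustrative, not the project's worker
import time

HEARTBEAT_FILE = "/tmp/worker_healthy"


def beat():
    # Touch the heartbeat file so its mtime reflects the last completed loop
    with open(HEARTBEAT_FILE, "w") as f:
        f.write(str(int(time.time())))


def main():
    while True:
        # ... pop a job from Redis and process it ...
        beat()
        # A freshness-aware check would compare
        # time.time() - os.path.getmtime(HEARTBEAT_FILE) against this interval
        time.sleep(1)


if __name__ == "__main__":
    main()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;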




&lt;h2&gt;
  
  
  Docker Compose Orchestration
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;docker-compose.yml&lt;/code&gt; file ties everything together. The key decisions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Startup order with health checks.&lt;/strong&gt; Using &lt;code&gt;depends_on&lt;/code&gt; with just a service name only waits for the container to start, not for the application inside to be ready. Using &lt;code&gt;condition: service_healthy&lt;/code&gt; waits for the health check to pass:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service_healthy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This eliminated the race condition where the API would crash on startup because Redis wasn't ready yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Redis not exposed on the host.&lt;/strong&gt; Redis uses &lt;code&gt;expose&lt;/code&gt; instead of &lt;code&gt;ports&lt;/code&gt;. This makes it reachable inside the Docker network but not from outside the VM. No reason to expose a database to the internet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resource limits on every service.&lt;/strong&gt; Without limits, one misbehaving service can starve the entire host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0.50'&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512M&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Named internal network.&lt;/strong&gt; All services communicate over &lt;code&gt;hng_network&lt;/code&gt; — an isolated bridge network managed by Docker Compose.&lt;/p&gt;




&lt;h2&gt;
  
  
  The CI/CD Pipeline
&lt;/h2&gt;

&lt;p&gt;The task specified 6 stages in strict order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lint → test → build → security scan → integration test → deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A failure in any stage must prevent all subsequent stages from running. GitHub Actions handles this with &lt;code&gt;needs&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;lint&lt;/span&gt;
&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt;
&lt;span class="na"&gt;security&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lint stage
&lt;/h3&gt;

&lt;p&gt;Three linters run in sequence:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;flake8&lt;/code&gt; for Python — catches style violations, unused imports, undefined names&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;eslint&lt;/code&gt; for JavaScript — catches syntax errors and bad patterns&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;hadolint&lt;/code&gt; for Dockerfiles — catches common Dockerfile mistakes like missing &lt;code&gt;--no-install-recommends&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Getting Python files to pass flake8 was the most tedious part. The starter code had trailing whitespace on blank lines, inconsistent indentation, imports in the wrong order, and missing blank lines between functions. Every line had to be cleaned up manually.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test stage
&lt;/h3&gt;

&lt;p&gt;Three unit tests with pytest and coverage reporting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_redis_connection_mocked&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;mock_redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MagicMock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;mock_redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;return_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;mock_redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_health_logic&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_math_logic&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Coverage report uploaded as a pipeline artifact so you can see exactly which lines are tested.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build stage
&lt;/h3&gt;

&lt;p&gt;This stage runs a local Docker registry as a GitHub Actions service container, builds all three images, tags each with the git SHA and &lt;code&gt;latest&lt;/code&gt;, and pushes them to the local registry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;registry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry:2&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;5000:5000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; localhost:5000/hng-api:&lt;span class="nv"&gt;$SHA&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; localhost:5000/hng-api:latest ./api
docker push localhost:5000/hng-api:&lt;span class="nv"&gt;$SHA&lt;/span&gt;
docker push localhost:5000/hng-api:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tagging with the git SHA means every image is traceable back to the exact commit that built it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security scan stage
&lt;/h3&gt;

&lt;p&gt;Trivy scans all three images for known vulnerabilities:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@master&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image-ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hng-api:latest'&lt;/span&gt;
    &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sarif'&lt;/span&gt;
    &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trivy-api.sarif'&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CRITICAL'&lt;/span&gt;
    &lt;span class="na"&gt;exit-code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results uploaded as SARIF artifacts — GitHub can render these in the Security tab. We set &lt;code&gt;exit-code: '0'&lt;/code&gt; so the pipeline continues even if vulnerabilities are found, but they are reported and visible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Integration test stage
&lt;/h3&gt;

&lt;p&gt;This is the most valuable stage. It starts the complete stack inside the GitHub Actions runner, submits a real job, and polls until it completes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Submit a job&lt;/span&gt;
&lt;span class="nv"&gt;JOB&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/jobs &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;JOB_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$JOB&lt;/span&gt; | python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import sys,json; print(json.load(sys.stdin)['job_id'])"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Poll until completed&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;seq &lt;/span&gt;1 20&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;STATUS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:8000/jobs/&lt;span class="nv"&gt;$JOB_ID&lt;/span&gt; | python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;"import sys,json; print(json.load(sys.stdin).get('status',''))"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$STATUS&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"completed"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;exit &lt;/span&gt;0
  &lt;span class="k"&gt;fi
  &lt;/span&gt;&lt;span class="nb"&gt;sleep &lt;/span&gt;5
&lt;span class="k"&gt;done
&lt;/span&gt;&lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the job doesn't complete within 100 seconds, the pipeline fails. The stack tears down cleanly regardless of the outcome.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploy stage
&lt;/h3&gt;

&lt;p&gt;The deploy stage only runs on pushes to &lt;code&gt;main&lt;/code&gt;. It SSHs into the production VM and performs a rolling update:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy the API first&lt;/span&gt;
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--build&lt;/span&gt; &lt;span class="nt"&gt;--no-deps&lt;/span&gt; api

&lt;span class="c"&gt;# Wait up to 60 seconds for the health check to pass&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;seq &lt;/span&gt;1 12&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  if &lt;/span&gt;docker compose &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-T&lt;/span&gt; api python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;"import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    2&amp;gt;/dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
    &lt;span class="c"&gt;# Health check passed — deploy the rest&lt;/span&gt;
    docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--build&lt;/span&gt; &lt;span class="nt"&gt;--no-deps&lt;/span&gt; worker frontend
    &lt;span class="nb"&gt;exit &lt;/span&gt;0
  &lt;span class="k"&gt;fi
  &lt;/span&gt;&lt;span class="nb"&gt;sleep &lt;/span&gt;5
&lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# Health check failed — abort, leave old container running&lt;/span&gt;
&lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The old container keeps serving traffic until the new one passes its health check. If the new version is broken, nothing goes down.&lt;/p&gt;




&lt;h2&gt;
  
  
  Problems I Hit Along the Way
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;YAML duplicate jobs.&lt;/strong&gt; I accidentally appended the &lt;code&gt;integration-test&lt;/code&gt; and &lt;code&gt;deploy&lt;/code&gt; stages to the ci.yml file twice using &lt;code&gt;cat &amp;gt;&amp;gt;&lt;/code&gt;. GitHub rejected the workflow because job names were duplicated. Fixed by rewriting the entire file from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pinned apt package version not found.&lt;/strong&gt; Hadolint flagged &lt;code&gt;apt-get install curl&lt;/code&gt; without a pinned version (DL3008). I tried to pin it as &lt;code&gt;curl=7.88.1-10+deb12u5&lt;/code&gt; but that exact version didn't exist in the GitHub Actions runner's package index, breaking the Docker build. Fixed by ignoring DL3008 with &lt;code&gt;hadolint --ignore DL3008&lt;/code&gt; — a pragmatic tradeoff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Windows CRLF line endings.&lt;/strong&gt; Editing files on Windows and pushing to a Linux CI environment caused flake8 to report phantom whitespace errors. Every blank line showed as &lt;code&gt;W293 blank line contains whitespace&lt;/code&gt; because of the carriage return character. Fixed by configuring git with &lt;code&gt;core.autocrlf false&lt;/code&gt; and converting files to LF.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token scope too narrow.&lt;/strong&gt; Pushing changes to the workflow file required a GitHub token with the &lt;code&gt;workflow&lt;/code&gt; scope, not just &lt;code&gt;repo&lt;/code&gt;. Generated a new token with both scopes to resolve the 403 error.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSH key missing on VM.&lt;/strong&gt; The deploy stage needed to SSH into the production server but no SSH key existed on the VM. Generated one with &lt;code&gt;ssh-keygen -t ed25519&lt;/code&gt;, added the public key to &lt;code&gt;authorized_keys&lt;/code&gt;, and stored the private key as a GitHub Actions secret.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Final Pipeline
&lt;/h2&gt;

&lt;p&gt;After all of that, the pipeline looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ lint          — 16s
✅ test          — 12s
✅ build         — 1m 4s
✅ security      — 46s
✅ integration-test — 1m 33s
✅ deploy        — 8s

Status: Success — Total duration: 2m 37s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All 6 stages green. Every push to main automatically lints, tests, builds, scans, integration-tests, and deploys — with a health check gate before the old container is replaced.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;The most important lesson from Stage 2 is that reading code before writing infrastructure is not optional. Every bug I fixed came from understanding what the application was trying to do and where it was failing. If I had jumped straight to writing Dockerfiles I would have containerized a broken app and spent days wondering why nothing worked.&lt;/p&gt;

&lt;p&gt;The second lesson is that CI/CD is not just automation — it is documentation. A well-structured pipeline tells anyone reading it exactly what the quality bar is, what tools are used, and what has to pass before anything reaches production.&lt;/p&gt;

&lt;p&gt;The third lesson is that container security is not complicated but it is easy to skip. Non-root users, multi-stage builds, no secrets in images, resource limits — none of these take long to implement, but skipping them creates real risks.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Stage 2 complete. Find the repo at &lt;a href="https://github.com/asanteedith/Containerized_MicroService" rel="noopener noreferrer"&gt;github.com/asanteedith/Containerized_MicroService&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>docker</category>
      <category>cicd</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Building a Policy-Gated Deployment System with Observability (SwiftDeploy Stage 4B)</title>
      <dc:creator>Edith Asante</dc:creator>
      <pubDate>Wed, 06 May 2026 19:59:23 +0000</pubDate>
      <link>https://dev.to/edithasante/building-a-policy-gated-deployment-system-with-observability-swiftdeploy-stage-4b-4od2</link>
      <guid>https://dev.to/edithasante/building-a-policy-gated-deployment-system-with-observability-swiftdeploy-stage-4b-4od2</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In Stage 4A, I built a CLI tool (&lt;code&gt;swiftdeploy&lt;/code&gt;) that generates infrastructure from a single file (&lt;code&gt;manifest.yaml&lt;/code&gt;).&lt;br&gt;
In Stage 4B, I extended it to include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Observability (metrics)&lt;/li&gt;
&lt;li&gt;Policy enforcement (OPA)&lt;/li&gt;
&lt;li&gt;Auditing (history + reports)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal was simple but strict:&lt;/p&gt;

&lt;p&gt;The system must refuse to deploy or promote if it is unsafe.&lt;/p&gt;

&lt;p&gt;This meant moving from just “running containers” to building a system that can think and decide before acting.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;Architectural Overview&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;manifest.yaml
      ↓
swiftdeploy CLI
      ↓
docker-compose + nginx
      ↓
Docker Network
      ↓
[ NGINX ] → [ APP (/metrics) ]
              ↓
           metrics
              ↓
            CLI
              ↓
            OPA
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;At a high level:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;manifest.yaml is the single source of truth&lt;/li&gt;
&lt;li&gt;swiftdeploy CLI reads it and generates:

&lt;ul&gt;
&lt;li&gt;docker-compose.yml&lt;/li&gt;
&lt;li&gt;nginx.conf&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Docker runs:

&lt;ul&gt;
&lt;li&gt;API service&lt;/li&gt;
&lt;li&gt;Nginx (reverse proxy)&lt;/li&gt;
&lt;li&gt;OPA (policy engine)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The decision flow:&lt;br&gt;
CLI → collect data → send to OPA → receive decision → deploy or block&lt;/p&gt;

&lt;h2&gt;The Design: A Tool That Writes Its Own Infrastructure&lt;/h2&gt;

&lt;p&gt;The core idea was:&lt;/p&gt;

&lt;p&gt;I don’t manually write configs — I generate them.&lt;/p&gt;

&lt;p&gt;Instead of editing multiple files, I only update &lt;code&gt;manifest.yaml&lt;/code&gt;, then run &lt;code&gt;python swiftdeploy.py init&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This generates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;docker-compose.yml&lt;/li&gt;
&lt;li&gt;nginx.conf&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Why this matters&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Reduces manual errors&lt;/li&gt;
&lt;li&gt;Keeps configuration consistent&lt;/li&gt;
&lt;li&gt;Makes the system reproducible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If I delete my configs, I can regenerate everything from the manifest.&lt;/p&gt;
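
&lt;p&gt;Conceptually, the init step boils down to reading the manifest and rendering config files from templates. Here is a minimal sketch of that idea; the manifest structure and template are assumptions, not the real swiftdeploy code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Conceptual sketch of "generate configs from manifest.yaml"
import yaml

COMPOSE_TEMPLATE = """services:
  app:
    image: {image}
    ports:
      - "{port}:{port}"
"""


def init():
    with open("manifest.yaml") as f:
        manifest = yaml.safe_load(f)

    app = manifest["app"]  # assumed manifest structure, for illustration only
    with open("docker-compose.yml", "w") as f:
        f.write(COMPOSE_TEMPLATE.format(image=app["image"], port=app["port"]))


if __name__ == "__main__":
    init()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;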

&lt;h2&gt;Observability: Adding the “Eyes” (/metrics)&lt;/h2&gt;

&lt;p&gt;I added a /metrics endpoint to the API in Prometheus format.&lt;/p&gt;

&lt;p&gt;It tracks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Throughput &amp;amp; Errors&lt;br&gt;
&lt;code&gt;http_requests_total{method, path, status_code}&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Latency&lt;br&gt;
&lt;code&gt;http_request_duration_seconds_bucket&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Application State&lt;br&gt;
&lt;code&gt;app_uptime_seconds&lt;/code&gt;, &lt;code&gt;app_mode&lt;/code&gt; (0 = stable, 1 = canary), &lt;code&gt;chaos_active&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
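
&lt;p&gt;One way to declare those metrics is with the prometheus_client library. This is a sketch rather than the project's actual instrumentation; the metric names follow the list above, while the labels and helper are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch using prometheus_client -- metric names match the post, the rest is illustrative
from prometheus_client import Counter, Gauge, Histogram, generate_latest

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "path", "status_code"],
)
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds",
)
UPTIME = Gauge("app_uptime_seconds", "Seconds since the app started")
APP_MODE = Gauge("app_mode", "0 = stable, 1 = canary")
CHAOS_ACTIVE = Gauge("chaos_active", "1 while a chaos mode is active")


def metrics_endpoint():
    # Return the metrics in Prometheus text exposition format
    return generate_latest()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;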

&lt;h2&gt;The Guardrails: Policy Enforcement with OPA&lt;/h2&gt;

&lt;p&gt;Instead of writing logic inside the CLI, I used Open Policy Agent.&lt;/p&gt;

&lt;p&gt;Key Rule:&lt;/p&gt;

&lt;p&gt;The CLI must NOT decide anything — OPA decides everything.&lt;/p&gt;

&lt;h3&gt;🔹 Infrastructure Policy (Pre-Deploy)&lt;/h3&gt;

&lt;p&gt;Checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Disk space&lt;/li&gt;
&lt;li&gt;CPU load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deny if &lt;code&gt;disk_free &amp;lt; 10GB&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Deny if &lt;code&gt;cpu_load &amp;gt; 2.0&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If I artificially reduce free disk space, the deploy is refused:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;BLOCKED: Disk below threshold&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;👉 This satisfies the Hard Gate requirement&lt;/p&gt;
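
&lt;p&gt;For illustration, the CLI side of such a gate can be a single HTTP call to OPA's data API. The policy path and input fields below are assumptions; the real package and rule names may differ:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of querying OPA before a deploy -- policy path and input fields are illustrative
import shutil

import requests

OPA_URL = "http://localhost:8181/v1/data/swiftdeploy/deploy/allow"  # assumed policy path


def infra_gate_allows_deploy():
    disk = shutil.disk_usage("/")
    payload = {"input": {
        "disk_free_gb": disk.free / 1e9,
        "cpu_load": 0.0,  # e.g. os.getloadavg()[0] on Linux
    }}
    decision = requests.post(OPA_URL, json=payload, timeout=5).json()
    return decision.get("result", False)


if not infra_gate_allows_deploy():
    raise SystemExit("BLOCKED: infrastructure policy denied the deploy")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;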

&lt;p&gt;⸻&lt;/p&gt;

&lt;h3&gt;🔹 Canary Safety Policy (Pre-Promote)&lt;/h3&gt;

&lt;p&gt;Before promoting, the CLI:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scrapes /metrics&lt;/li&gt;
&lt;li&gt;Calculates:

&lt;ul&gt;
&lt;li&gt;Error rate&lt;/li&gt;
&lt;li&gt;P99 latency&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Sends to OPA&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Policy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deny if &lt;code&gt;error_rate &amp;gt; 1%&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Deny if &lt;code&gt;p99_latency &amp;gt; 500ms&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
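
&lt;p&gt;A simplified sketch of the pre-promote calculation, assuming the raw Prometheus text is scraped from &lt;code&gt;/metrics&lt;/code&gt;. Only the error-rate part is shown (deriving P99 from the histogram buckets is omitted), and the URL is an assumption:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Simplified sketch: compute an error rate from the scraped /metrics text
import requests


def error_rate(metrics_url="http://localhost:8080/metrics"):  # URL is an assumption
    text = requests.get(metrics_url, timeout=5).text
    total = errors = 0.0
    for line in text.splitlines():
        if line.startswith("http_requests_total{"):
            labels, value = line.rsplit(" ", 1)
            total += float(value)
            if 'status_code="5' in labels:  # count 5xx responses as errors
                errors += float(value)
    return errors / total if total else 0.0

# The CLI then sends {"error_rate": ..., "p99_latency": ...} to OPA, which applies the deny rules
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;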

&lt;p&gt;⸻&lt;/p&gt;

&lt;h3&gt;Why Isolation Matters&lt;/h3&gt;

&lt;p&gt;OPA runs as a separate container and:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is reachable by the CLI&lt;/li&gt;
&lt;li&gt;Is NOT exposed through Nginx&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No external access to policy engine&lt;/li&gt;
&lt;li&gt;Clear separation of responsibilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This satisfies the “No Leakage” requirement&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;🧪 The Chaos: Testing Failure Scenarios&lt;/h2&gt;

&lt;p&gt;I implemented a /chaos endpoint:&lt;/p&gt;

&lt;p&gt;Modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;slow → delays responses&lt;/li&gt;
&lt;li&gt;error → randomly returns 500&lt;/li&gt;
&lt;li&gt;recover → resets system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;{ "mode": "slow", "duration": 2 }&lt;/p&gt;

&lt;h3&gt;What Happened&lt;/h3&gt;

&lt;p&gt;When I injected chaos:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency increased&lt;/li&gt;
&lt;li&gt;Error rate increased&lt;/li&gt;
&lt;li&gt;Metrics reflected the change&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When I tried to promote:&lt;br&gt;
&lt;code&gt;BLOCKED: Latency too high&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;👉 This confirmed:&lt;br&gt;
The system reacts to real runtime conditions, not assumptions&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;The Eyes: swiftdeploy status&lt;/h2&gt;

&lt;p&gt;Running &lt;code&gt;python swiftdeploy.py status&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Continuously scrapes &lt;code&gt;/metrics&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Displays live system state&lt;/li&gt;
&lt;li&gt;Logs everything to &lt;code&gt;history.jsonl&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
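
&lt;p&gt;A sketch of the polling loop behind a command like that, assuming one JSON object is appended to &lt;code&gt;history.jsonl&lt;/code&gt; per scrape (the interval, URL, and fields are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of a status loop that appends each scrape to history.jsonl
import json
import time

import requests

METRICS_URL = "http://localhost:8080/metrics"  # assumed URL


def status_loop(interval=10):
    while True:
        entry = {"ts": int(time.time())}
        try:
            entry["metrics"] = requests.get(METRICS_URL, timeout=5).text
            entry["ok"] = True
        except requests.RequestException as exc:
            entry["ok"] = False
            entry["error"] = str(exc)
        with open("history.jsonl", "a") as f:
            f.write(json.dumps(entry) + "\n")
        time.sleep(interval)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;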

&lt;h2&gt;The Memory: Audit System&lt;/h2&gt;

&lt;p&gt;From those logs, &lt;code&gt;python swiftdeploy.py audit&lt;/code&gt; creates &lt;code&gt;audit_report.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Contents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Timeline of events&lt;/li&gt;
&lt;li&gt;Policy violations&lt;/li&gt;
&lt;/ul&gt;
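
&lt;p&gt;A sketch of how the audit step could turn those JSONL entries into Markdown; the field names are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: render history.jsonl into a simple Markdown audit report
import json


def write_audit_report():
    lines = ["# Audit Report", "", "| Timestamp | Event |", "|---|---|"]
    with open("history.jsonl") as f:
        for raw in f:
            entry = json.loads(raw)
            event = entry.get("event", "status scrape")  # assumed field
            lines.append(f"| {entry.get('ts')} | {event} |")
    with open("audit_report.md", "w") as f:
        f.write("\n".join(lines) + "\n")


if __name__ == "__main__":
    write_audit_report()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;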

&lt;p&gt;👉 The report renders cleanly in GitHub Markdown&lt;br&gt;
(Satisfies submission requirement)&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;Lessons Learned&lt;/h2&gt;

&lt;p&gt;This stage changed how I think about DevOps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Deployment is not just execution.&lt;/strong&gt; It’s decision-making.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Policies should be external.&lt;/strong&gt; Keeping the logic in OPA makes it reusable and avoids tightly coupled code.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Metrics are not just for monitoring.&lt;/strong&gt; They actively drive decisions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Debugging is part of the process.&lt;/strong&gt; I faced YAML errors, Docker rebuild issues, Nginx misconfigurations, and OPA connection failures. Fixing them helped me understand the system deeply.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;✅ Final Checklist (Submission Criteria)&lt;/h2&gt;

&lt;p&gt;✔ manifest.yaml is the only edited file&lt;br&gt;
✔ Deployment blocked when disk is low&lt;br&gt;
✔ OPA not exposed via Nginx&lt;br&gt;
✔ Metrics fully implemented&lt;br&gt;
✔ Audit report generated and readable&lt;br&gt;
✔ Blog includes architecture diagram&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;This project helped me move from:&lt;/p&gt;

&lt;p&gt;running commands → building systems that enforce rules&lt;/p&gt;

&lt;p&gt;I now better understand how:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;observability&lt;/li&gt;
&lt;li&gt;policy&lt;/li&gt;
&lt;li&gt;infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;work together in real-world systems.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;If you’re learning DevOps, my biggest takeaway is:&lt;/p&gt;

&lt;p&gt;Don’t just deploy — build systems that decide when deployment is safe.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>automation</category>
      <category>cloud</category>
      <category>docker</category>
    </item>
  </channel>
</rss>
