DEV Community

Abraham Acha

SwiftDeploy: Building a Declarative Infrastructure Manager from Scratch; A Complete Technical Walkthrough

How I built a CLI tool that generates, manages, and monitors a full containerised stack from a single YAML manifest — and everything that broke along the way.


Introduction

Most DevOps tasks ask you to configure infrastructure manually. You write an Nginx config. You write a Docker Compose file. You run commands. You check if things are healthy. You repeat this every time you spin up a new service.

SwiftDeploy flips that script entirely.

The premise is simple but powerful: one YAML file describes your entire deployment, and a CLI tool derives everything else from it. No handwritten Nginx configs. No manually crafted Docker Compose files. No guessing at container states. You edit the manifest, you run the CLI, and your stack is live.

This article is a complete technical walkthrough of how I built SwiftDeploy for the HNG DevOps Internship Stage 4A task — covering the architecture, every component, every design decision, the bugs I hit (including a particularly nasty WSL2 healthcheck issue), and the debugging process that resolved them.

By the end, you'll understand:

  • How to build a declarative infrastructure tool from scratch
  • How template-based config generation works
  • How to write a multi-subcommand Python CLI
  • How Docker healthchecks work and why they fail in unexpected ways
  • How nginx reverse proxying, canary deployments, and chaos engineering fit together

Let's get into it.


The Architecture

Before writing a single line of code, I mapped out how the pieces would connect.

You (the human)
     │
     │  edit only this
     ▼
manifest.yaml  ─────────────────────────────────────┐
                                                     │
                                                     ▼
                                           swiftdeploy (CLI)
                                                     │
                         ┌───────────────────────────┤
                         │                           │
                         ▼                           ▼
                   nginx.conf              docker-compose.yml
               (generated file)            (generated file)
                         │                           │
                         └──────────────┬────────────┘
                                        │
                                        ▼
                               Docker starts 2 containers
                                        │
                         ┌──────────────┴──────────────┐
                         │                             │
                         ▼                             ▼
                [nginx container]            [app container]
                 port 8080 (public)           port 3000 (private)
                         │                             │
                         └──── nginx proxies ──────────┘

Internet → port 8080 → nginx → port 3000 → Python app

The key architectural decisions:

  1. The manifest is the single source of truth. Everything — nginx timeouts, container ports, network names, deployment mode — lives in manifest.yaml. The CLI reads it and generates everything else.

  2. The app is never exposed directly. All traffic flows through nginx on port 8080. The app container only uses expose (internal Docker network), never ports (host-facing).

  3. Generated files are gitignored. nginx.conf and docker-compose.yml are outputs, not inputs. They're always regeneratable from the manifest. The grader tests this explicitly — deleting generated files and re-running init.

  4. The CLI is self-contained. One executable Python script, no framework, handles five subcommands.


The Project Structure

swiftdeploy/
├── manifest.yaml               ← the ONLY file you edit
├── swiftdeploy                 ← CLI executable
├── Dockerfile                  ← app image definition
├── app/
│   └── main.py                 ← Python HTTP service
├── templates/
│   ├── nginx.conf.tmpl         ← nginx template
│   └── docker-compose.yml.tmpl ← compose template
├── nginx.conf                  ← generated by init (gitignored)
├── docker-compose.yml          ← generated by init (gitignored)
├── .gitignore
└── README.md

Component 1: The Manifest

manifest.yaml is the brain of the entire system. Every other component reads from it either directly or via the generated files.

services:
  image: swift-deploy-1-node:latest
  port: 3000
  mode: stable
  version: "1.0.0"
  restart_policy: unless-stopped
  log_volume: swiftdeploy-logs

nginx:
  image: nginx:latest
  port: 8080
  proxy_timeout: 30

network:
  name: swiftdeploy-net
  driver_type: bridge

contact: "ops@swiftdeploy.local"

The intent is for this file to read like infrastructure-as-documentation: you can look at it and understand the entire deployment without reading a single generated config.

What each field controls

  • services.mode — controls whether the app runs in stable or canary mode. The CLI's promote subcommand updates this in-place.
  • nginx.proxy_timeout — propagates into the nginx config as proxy_connect_timeout, proxy_send_timeout, and proxy_read_timeout.
  • contact — injected into nginx's custom JSON error bodies for 502/503/504 responses.
  • log_volume — a named Docker volume shared between the app container and nginx, so both write logs to the same persistent location.

Component 2: The Python HTTP Service

The app is a from-scratch HTTP server built on Python's http.server stdlib — no Flask, no FastAPI, no external dependencies. This keeps the Docker image small and the container startup fast.

The server setup

import os
import time
import random
import threading
import json
from http.server import HTTPServer, BaseHTTPRequestHandler

MODE = os.environ.get("MODE", "stable")
APP_VERSION = os.environ.get("APP_VERSION", "1.0.0")
APP_PORT = int(os.environ.get("APP_PORT", "3000"))

START_TIME = time.time()

Configuration comes entirely from environment variables injected by Docker Compose at runtime. The defaults exist only as fallbacks for local development.

START_TIME is captured at module load — this is how /healthz calculates uptime without a database or external state store.

Thread-safe chaos state

chaos_lock = threading.Lock()
chaos_state = {"mode": None, "duration": None, "rate": None}

def get_chaos():
    with chaos_lock:
        return dict(chaos_state)

def set_chaos(state):
    with chaos_lock:
        chaos_state.update(state)

Python's http.server handles each request in the same thread by default (it's not async), but I added explicit thread safety here anyway — the Lock ensures that if you ever extend this to a threaded server, chaos state reads and writes remain atomic. dict(chaos_state) returns a copy, preventing the caller from holding a reference to the mutable internal state.
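The copy semantics are easy to verify: mutating the dict returned by get_chaos must not leak back into the shared state. A self-contained check of the same pattern:

```python
import threading

chaos_lock = threading.Lock()
chaos_state = {"mode": None, "duration": None, "rate": None}

def get_chaos():
    with chaos_lock:
        return dict(chaos_state)  # shallow copy: callers can't mutate shared state

def set_chaos(state):
    with chaos_lock:
        chaos_state.update(state)

set_chaos({"mode": "slow", "duration": 2})
snapshot = get_chaos()
snapshot["mode"] = "error"            # mutate the copy only
assert get_chaos()["mode"] == "slow"  # shared state is untouched
```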

The request handler

class Handler(BaseHTTPRequestHandler):
    def log_message(self, format, *args):
        pass  # suppress default logging — nginx handles access logs

    def send_json(self, code, body, extra_headers=None):
        payload = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.send_header("X-Deployed-By", "swiftdeploy")
        if MODE == "canary":
            self.send_header("X-Mode", "canary")
        if extra_headers:
            for key, value in extra_headers.items():
                self.send_header(key, value)
        self.end_headers()
        self.wfile.write(payload)

send_json is a helper that consolidates all the boilerplate of setting response codes, content type, and custom headers in one place. Every route calls it — this is the DRY principle applied to HTTP handlers.

Suppressing log_message is intentional. The default Python HTTP server writes its own access log to stdout, which would duplicate what nginx already logs in the structured format we defined.

The three routes

GET / — welcome endpoint

if self.path == "/":
    self.send_json(200, {
        "message": "Welcome to SwiftDeploy API",
        "mode": MODE,
        "version": APP_VERSION,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    })

GET /healthz — liveness check

elif self.path == "/healthz":
    uptime = round(time.time() - START_TIME, 2)
    self.send_json(200, {
        "status": "ok",
        "mode": MODE,
        "version": APP_VERSION,
        "uptime_seconds": uptime,
    })

The /healthz endpoint does three jobs simultaneously: it proves the server is alive (Docker healthcheck), it reports the current mode (so promote can confirm the switch happened), and it reports uptime (useful for debugging restart loops).

POST /chaos — chaos injection (canary only)

def do_POST(self):
    if self.path == "/chaos":
        if MODE != "canary":
            self.send_json(403, {"error": "chaos endpoint only available in canary mode"})
            return

        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        data = json.loads(body)
        mode = data.get("mode")

        if mode == "slow":
            set_chaos({"mode": "slow", "duration": data.get("duration", 2), "rate": None})
        elif mode == "error":
            set_chaos({"mode": "error", "duration": None, "rate": data.get("rate", 0.5)})
        elif mode == "recover":
            set_chaos({"mode": None, "duration": None, "rate": None})

Reading Content-Length before calling rfile.read() is standard HTTP protocol — you must know exactly how many bytes to read, otherwise the read blocks waiting for more data that never comes.

The chaos modes:

  • slow — injects time.sleep(N) before responding, simulating a slow upstream
  • error — uses random.random() < rate to return 500 on a configurable percentage of requests
  • recover — clears all chaos state, returning to normal behaviour

This is real chaos engineering in miniature — the same concept used by tools like Chaos Monkey, just scoped to a single service.
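The snippets above don't show where the GET path consults this state, so here is a hedged sketch of how the modes might be applied before responding. The helper name `apply_chaos` and its return shape are illustrative inventions, not the project's actual API; `rng` and `sleep` are injectable so the behaviour can be tested deterministically:

```python
import random
import time

def apply_chaos(chaos, rng=random.random, sleep=time.sleep):
    """Return the HTTP status to send after applying any active chaos mode.

    chaos is a snapshot from get_chaos(); rng/sleep are injectable for testing.
    """
    if chaos["mode"] == "slow":
        sleep(chaos.get("duration") or 2)   # simulate a slow upstream
        return 200
    if chaos["mode"] == "error":
        if rng() < (chaos.get("rate") or 0.5):
            return 500                      # fail a configurable fraction of requests
        return 200
    return 200                              # no chaos active

# Deterministic checks with injected stubs:
assert apply_chaos({"mode": None, "duration": None, "rate": None}) == 200
assert apply_chaos({"mode": "error", "duration": None, "rate": 1.0}, rng=lambda: 0.3) == 500
assert apply_chaos({"mode": "slow", "duration": 1, "rate": None}, sleep=lambda s: None) == 200
```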


Component 3: The Dockerfile

FROM python:3.12-alpine

RUN addgroup -S appgroup && adduser -S appuser -G appgroup

WORKDIR /app
COPY app/main.py .
RUN chown -R appuser:appgroup /app

USER appuser

ENV MODE=stable
ENV APP_VERSION=1.0.0
ENV APP_PORT=3000

EXPOSE 3000

HEALTHCHECK --interval=10s --timeout=5s --start-period=15s --retries=5 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:3000/healthz', timeout=4)" || exit 1

CMD ["python", "main.py"]

Why Alpine?

python:3.12-alpine is approximately 60MB. python:3.12 (Debian-based) is approximately 1GB. The task requires images under 300MB. Alpine gets us there with room to spare.

Why non-root?

The addgroup / adduser pattern is a security baseline. If someone exploits a vulnerability in the app, they get a user with zero privileges — no ability to write to system directories, install packages, or escalate. Running as root inside a container means a container escape gives the attacker root on the host.

The healthcheck evolution

The Dockerfile healthcheck went through several iterations during development. The original:

# BROKEN — ${APP_PORT} never expanded in the healthcheck context
HEALTHCHECK CMD wget -qO- http://localhost:${APP_PORT}/healthz || exit 1

${APP_PORT} was never expanded in the healthcheck's execution context — the probe received the string literally and tried to connect to http://localhost:${APP_PORT}. (Exec-form healthchecks run without a shell, so nothing performs the substitution.) Fixed by hardcoding:

# BROKEN on WSL2 — localhost doesn't resolve inside Alpine healthcheck context
HEALTHCHECK CMD wget -qO- http://localhost:3000/healthz || exit 1

This also failed on WSL2 + Docker Desktop. The wget inside Alpine couldn't resolve localhost to 127.0.0.1 in the healthcheck execution context (a known WSL2 networking quirk). Switching to 127.0.0.1 still failed because of how Docker Desktop on WSL2 handles the network namespace for healthcheck processes.

The final working solution:

# WORKS — uses Python's urllib, no external tool, no DNS resolution
HEALTHCHECK --interval=10s --timeout=5s --start-period=15s --retries=5 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:3000/healthz', timeout=4)" || exit 1

Using Python's own urllib sidesteps the wget/DNS issue entirely. Python's socket layer handles 127.0.0.1 directly without going through the system resolver.


Component 4: The Templates

Templates are the bridge between the manifest and the generated configs. They contain placeholders in {{ key }} format that the CLI replaces with real values.

nginx.conf.tmpl

upstream app_backend {
    server app:{{ service_port }};
    keepalive 32;
}

log_format swiftdeploy '$time_iso8601 | $status | ${request_time}s | $upstream_addr | $request';

server {
    listen {{ nginx_port }};
    server_name _;

    access_log /var/log/nginx/access.log swiftdeploy;

    proxy_connect_timeout {{ proxy_timeout }}s;
    proxy_send_timeout {{ proxy_timeout }}s;
    proxy_read_timeout {{ proxy_timeout }}s;

    add_header X-Deployed-By swiftdeploy always;
    proxy_pass_header X-Mode;

    location / {
        proxy_pass http://app_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    error_page 502 = @error502;
    error_page 503 = @error503;
    error_page 504 = @error504;

    location @error502 {
        default_type application/json;
        add_header X-Deployed-By swiftdeploy always;
        return 502 '{"error":"Bad Gateway","code":502,"service":"app","contact":"{{ contact }}"}';
    }
    # ... 503, 504 same pattern
}

Design decisions worth noting:

  • Custom log format: $time_iso8601 | $status | ${request_time}s | $upstream_addr | $request gives you timestamp, HTTP status, response time in seconds, the upstream container IP, and the full request line — everything you need for debugging in one line.
  • JSON error bodies: Rather than nginx's default HTML error pages, we return structured JSON. This is essential for APIs — clients expect JSON and need machine-readable error codes.
  • proxy_pass_header X-Mode: by default nginx suppresses only a small set of upstream response headers (most custom headers pass through on their own). Declaring this directive makes the contract explicit: the X-Mode: canary header from the upstream app must reach the client, so callers can identify which mode they're talking to.
  • keepalive 32: keeps 32 persistent connections to the upstream, reducing connection overhead under load.

docker-compose.yml.tmpl

services:
  app:
    image: {{ service_image }}
    container_name: swiftdeploy-app
    environment:
      MODE: "{{ mode }}"
      APP_VERSION: "{{ version }}"
      APP_PORT: "{{ service_port }}"
    networks:
      - {{ network_name }}
    volumes:
      - {{ log_volume }}:/app/logs
    restart: {{ restart_policy }}
    user: "appuser"
    cap_drop:
      - NET_ADMIN
      - SYS_ADMIN
    healthcheck:
      test: ["CMD-SHELL", "python -c 'import urllib.request; urllib.request.urlopen(\"http://127.0.0.1:3000/healthz\", timeout=4)'"]
      interval: 10s
      timeout: 5s
      start_period: 15s
      retries: 5
    expose:
      - "{{ service_port }}"

  nginx:
    image: {{ nginx_image }}
    container_name: swiftdeploy-nginx
    ports:
      - "{{ nginx_port }}:{{ nginx_port }}"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
      - {{ log_volume }}:/var/log/nginx
    networks:
      - {{ network_name }}
    restart: {{ restart_policy }}
    depends_on:
      app:
        condition: service_healthy

The expose vs ports distinction is critical:

  • expose: ["3000"] — makes port 3000 reachable between containers on the same Docker network. Not published to the host.
  • ports: ["8080:8080"] — publishes port 8080 to the host machine and the outside world.

The app uses expose only. There is no way to reach port 3000 from outside Docker. All traffic must enter through nginx on 8080.

depends_on: condition: service_healthy means nginx won't start until the app container's healthcheck passes. This prevents nginx from starting and immediately returning 502s because the upstream isn't ready yet.


Component 5: The CLI

The swiftdeploy CLI is a single Python script with five subcommands. Here's how each works internally.

The template engine

def render_template(tmpl_path, context):
    with open(tmpl_path) as f:
        content = f.read()
    for key, val in context.items():
        content = content.replace("{{ " + key + " }}", str(val))
    return content

Six lines. No Jinja2. No external library. Simple string replacement. This is intentional — the templates are straightforward enough that a minimal custom engine is cleaner than pulling in a dependency.
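A quick round-trip shows the engine in action, writing a throwaway template file so the function can be called exactly as defined:

```python
import os
import tempfile

def render_template(tmpl_path, context):
    with open(tmpl_path) as f:
        content = f.read()
    for key, val in context.items():
        content = content.replace("{{ " + key + " }}", str(val))
    return content

# Write a tiny template and render it with a manifest-like context.
with tempfile.NamedTemporaryFile("w", suffix=".tmpl", delete=False) as f:
    f.write("listen {{ nginx_port }};\nserver app:{{ service_port }};\n")
    path = f.name

rendered = render_template(path, {"nginx_port": 8080, "service_port": 3000})
os.unlink(path)
assert rendered == "listen 8080;\nserver app:3000;\n"
```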

The context builder

def build_context(manifest):
    svc = manifest["services"]
    ngx = manifest["nginx"]
    net = manifest["network"]
    return {
        "service_image":  svc["image"],
        "service_port":   svc["port"],
        "mode":           svc.get("mode", "stable"),
        "version":        svc.get("version", "1.0.0"),
        "restart_policy": svc.get("restart_policy", "unless-stopped"),
        "log_volume":     svc.get("log_volume", "swiftdeploy-logs"),
        "nginx_image":    ngx["image"],
        "nginx_port":     ngx["port"],
        "proxy_timeout":  ngx.get("proxy_timeout", 30),
        "network_name":   net["name"],
        "network_driver": net["driver_type"],
        "contact":        manifest.get("contact", "ops@swiftdeploy.local"),
    }

This translates the nested YAML structure into the flat dictionary that matches the {{ placeholders }} in the templates. It's the glue layer between manifest and generated configs.

init subcommand

def cmd_init():
    manifest = load_manifest()
    ctx = build_context(manifest)

    nginx_conf = render_template(NGINX_TMPL, ctx)
    with open(NGINX_OUT, "w") as f:
        f.write(nginx_conf)

    compose_conf = render_template(COMPOSE_TMPL, ctx)
    with open(COMPOSE_OUT, "w") as f:
        f.write(compose_conf)

Straightforward: load manifest → build context → render both templates → write files. The grader deletes the generated files and re-runs this to verify regeneration.

validate subcommand — 5 pre-flight checks

# Check 1: manifest.yaml exists and is valid YAML
try:
    manifest = load_manifest()
    ok("manifest.yaml found and parsed successfully")
except yaml.YAMLError as e:
    fail(f"Invalid YAML: {e}")

# Check 2: required fields present and non-empty
required = {
    "services.image": manifest.get("services", {}).get("image"),
    "services.port":  manifest.get("services", {}).get("port"),
    "nginx.image":    manifest.get("nginx", {}).get("image"),
    "nginx.port":     manifest.get("nginx", {}).get("port"),
    "network.name":   manifest.get("network", {}).get("name"),
    "network.driver_type": manifest.get("network", {}).get("driver_type"),
}

# Check 3: Docker image exists locally
result = run(f"docker image inspect {image}", capture=True, check=False)
# exit code 0 = found, non-zero = not found

# Check 4: Nginx port not already bound
result = run(f"ss -tlnp | grep ':{nginx_port} '", capture=True, check=False)
# stdout non-empty = port in use

# Check 5: nginx.conf syntactically valid
result = subprocess.run(
    ["docker", "run", "--rm",
     "-v", f"{test_conf_path}:/etc/nginx/conf.d/default.conf:ro",
     "nginx:latest", "nginx", "-t"],
    capture_output=True, text=True
)
combined = result.stdout + result.stderr
if "successful" in combined:
    ok("nginx.conf syntax is valid")

Check 5 is the most interesting. Running nginx -t in an isolated container is elegant — it validates syntax without needing nginx installed on the host. However, we hit a complication: in an isolated container, app:3000 (the upstream hostname) can't be resolved because there's no Docker network. nginx refuses to start if it can't resolve upstream hostnames, even for a syntax check.

The fix: before handing the config to the test container, swap server app: with server 127.0.0.1: in a temporary copy. 127.0.0.1 always resolves, so nginx validates the rest of the syntax (listen ports, timeouts, location blocks, error pages) correctly. The actual nginx.conf on disk is untouched.
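The swap itself is a one-line string substitution on a temporary copy (a sketch — the real CLI's variable names may differ):

```python
# Prepare a syntax-check copy of nginx.conf for `nginx -t`.
# Only the temp copy is rewritten; the generated nginx.conf stays untouched.
conf = 'upstream app_backend {\n    server app:3000;\n}\n'
test_conf = conf.replace("server app:", "server 127.0.0.1:")
assert "server 127.0.0.1:3000;" in test_conf
assert "server app:" not in test_conf
```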

deploy subcommand

def cmd_deploy():
    cmd_init()
    run(compose_cmd("up -d --build"))

    deadline = time.time() + 60
    healthy = False
    while time.time() < deadline:
        try:
            url = f"http://localhost:{nginx_port}/healthz"
            with urllib.request.urlopen(url, timeout=3) as resp:
                body = json.loads(resp.read())
                if body.get("status") == "ok":
                    healthy = True
                    break
        except Exception:
            pass
        time.sleep(2)

    if not healthy:
        fail("Health checks did not pass within 60 seconds")
        sys.exit(1)

The polling loop is the key part. Containers don't start instantly. docker compose up -d returns as soon as the containers are created, not when they're healthy. The loop hits /healthz through nginx every 2 seconds for up to 60 seconds. Connection refused, timeout, bad JSON — all exceptions are caught and ignored. Only {"status": "ok"} breaks the loop successfully.

promote subcommand

def cmd_promote(target_mode):
    # 1. Update manifest in-place using regex
    with open(MANIFEST_PATH) as f:
        content = f.read()
    content = re.sub(r"(mode:\s*)(\S+)", f"\\g<1>{target_mode}", content, count=1)
    with open(MANIFEST_PATH, "w") as f:
        f.write(content)

    # 2. Regenerate docker-compose.yml only
    manifest = load_manifest()
    ctx = build_context(manifest)
    compose_conf = render_template(COMPOSE_TMPL, ctx)
    with open(COMPOSE_OUT, "w") as f:
        f.write(compose_conf)

    # 3. Restart app container only — nginx stays up
    run(compose_cmd("up -d --no-deps app"))

    # 4. Confirm mode via /healthz
    deadline = time.time() + 30
    url = f"http://localhost:{nginx_port}/healthz"
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=3) as resp:
                body = json.loads(resp.read())
                if body.get("mode") == target_mode and body.get("status") == "ok":
                    confirmed = True
                    break
        except Exception:
            pass
        time.sleep(2)

--no-deps in docker compose up -d --no-deps app is the rolling restart mechanism. It tells Compose to restart only the app service without touching nginx. Since nginx is already running and healthy, there's zero downtime at the proxy level — nginx continues serving requests while the app container restarts with the new mode.

The regex re.sub(r"(mode:\s*)(\S+)", f"\\g<1>{target_mode}", content, count=1) uses a backreference \\g<1> to preserve the mode: prefix and only replace the value. count=1 ensures only the first occurrence is replaced.


The Debugging Saga: WSL2 + Docker Desktop Healthchecks

This section documents the most painful part of the build — a cascade of healthcheck failures that took multiple debugging rounds to resolve.

Failure 1: ${APP_PORT} not expanding

wget: can't connect to remote host: Connection refused

Root cause: environment variables don't expand in Dockerfile HEALTHCHECK CMD at runtime. ${APP_PORT} was being passed literally to wget. Fix: hardcode 3000.

Failure 2: localhost not resolving in Alpine

wget: can't connect to remote host: Connection refused

Same error, different cause. Inside the Alpine container's healthcheck execution context on WSL2 + Docker Desktop, localhost wasn't resolving to 127.0.0.1. Fix: use 127.0.0.1 explicitly.

Failure 3: wget still failing with 127.0.0.1

wget: can't connect to remote host: Connection refused

Confirmed the server was listening:

docker exec swiftdeploy-app ss -tlnp
# tcp LISTEN 0.0.0.0:3000

docker exec swiftdeploy-app python -c "import urllib.request; print(urllib.request.urlopen('http://127.0.0.1:3000/healthz').read())"
# b'{"status": "ok", ...}'

The server was reachable via docker exec but not from the healthcheck process. This is a known WSL2 + Docker Desktop network namespace issue — the healthcheck runs in a slightly different network context than docker exec. Fix: replace wget with Python's urllib:

CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:3000/healthz', timeout=4)" || exit 1

Failure 4: Docker cache serving old image

After fixing the Dockerfile, the healthcheck was still running wget. The container was using a cached image layer. Fix:

docker rmi swift-deploy-1-node:latest
docker build --no-cache -t swift-deploy-1-node:latest .

Failure 5: docker-compose.yml overriding the Dockerfile healthcheck

Even after fixing the image, the container was still running wget. Discovery:

grep -A5 "healthcheck" docker-compose.yml
# test: ["CMD", "wget", "-qO-", "http://localhost:3000/healthz"]

The docker-compose.yml template had its own healthcheck definition that was overriding the Dockerfile's. Docker Compose healthcheck always wins over Dockerfile HEALTHCHECK. Fixed the template to use the Python urllib approach and correct YAML quoting:

healthcheck:
  test: ["CMD-SHELL", "python -c 'import urllib.request; urllib.request.urlopen(\"http://127.0.0.1:3000/healthz\", timeout=4)'"]
  interval: 10s
  timeout: 5s
  start_period: 15s
  retries: 5

Failure 6: YAML indentation error

The healthcheck block had 3 spaces of indentation instead of 4 — a single space difference that broke YAML parsing entirely:

   healthcheck:  # ← 3 spaces — YAML parse error
    healthcheck:  # ← 4 spaces — correct

After all six of these were resolved, the deploy succeeded:

✔ Container swiftdeploy-app    Healthy   6.8s
✔ Container swiftdeploy-nginx  Started   7.2s
✔ Stack is healthy! Listening on port 8080

Deployment Walkthrough

1. Build the image

docker build -t swift-deploy-1-node:latest .

2. Validate

./swiftdeploy validate
[1/5] manifest.yaml exists and is valid YAML
  ✔ manifest.yaml found and parsed successfully

[2/5] Required fields are present and non-empty
  ✔ services.image = swift-deploy-1-node:latest
  ✔ services.port = 3000
  ✔ nginx.image = nginx:latest
  ✔ nginx.port = 8080
  ✔ network.name = swiftdeploy-net
  ✔ network.driver_type = bridge

[3/5] Docker image exists locally
  ✔ Image found: swift-deploy-1-node:latest

[4/5] Nginx port not already bound on host
  ✔ Port 8080 is free

[5/5] Generated nginx.conf is syntactically valid
  ✔ nginx.conf syntax is valid

ALL CHECKS PASSED ✔

3. Deploy

./swiftdeploy deploy
✔ Container swiftdeploy-app    Healthy
✔ Container swiftdeploy-nginx  Started
✔ Stack is healthy! Listening on port 8080

deploy complete.

4. Test endpoints

curl http://localhost:8080/
# {"message": "Welcome to SwiftDeploy API", "mode": "stable", "version": "1.0.0", "timestamp": "..."}

curl http://localhost:8080/healthz
# {"status": "ok", "mode": "stable", "version": "1.0.0", "uptime_seconds": 12.4}

5. Promote to canary

./swiftdeploy promote canary
curl http://localhost:8080/healthz
# {"status": "ok", "mode": "canary", ...}

curl -I http://localhost:8080/
# X-Mode: canary
# X-Deployed-By: swiftdeploy

6. Test chaos (canary mode only)

# Slow mode — 3 second delay
curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode": "slow", "duration": 3}'

time curl http://localhost:8080/
# real    0m3.012s

# Error mode — 50% 500 errors
curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode": "error", "rate": 0.5}'

# Recover
curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode": "recover"}'

7. View nginx access logs

docker logs swiftdeploy-nginx
2026-05-03T14:23:01+00:00 | 200 | 0.002s | 172.18.0.3:3000 | GET / HTTP/1.1
2026-05-03T14:23:05+00:00 | 200 | 0.001s | 172.18.0.3:3000 | GET /healthz HTTP/1.1
2026-05-03T14:23:12+00:00 | 500 | 0.001s | 172.18.0.3:3000 | GET / HTTP/1.1

8. Teardown

./swiftdeploy teardown --clean

Key Learnings

1. Docker Compose healthcheck overrides Dockerfile HEALTHCHECK. This is documented but easy to miss. If both exist, Compose wins. Always check your generated compose file when healthchecks misbehave.

2. WSL2 + Docker Desktop has a quirky network namespace for healthchecks. docker exec and HEALTHCHECK run in slightly different contexts. If something works via exec but not via healthcheck, it's almost always a network namespace or tool availability issue. Python's stdlib is more portable than wget in this environment.

3. --no-cache is essential when debugging Dockerfile changes. Docker's layer caching is aggressive. If you change a HEALTHCHECK line but the layers above it are cached, Docker will use the old healthcheck. Always docker rmi and --no-cache when debugging image-level issues.

4. expose vs ports in Docker Compose is a security boundary, not just documentation. expose is container-to-container only. ports is host-facing. Using expose for the app and ports only for nginx enforces the proxy pattern at the infrastructure level.

5. YAML indentation is unforgiving. A single space difference between 3 and 4 spaces of indentation produces a cryptic parse error. Always use a YAML linter or at minimum python3 -c "import yaml; yaml.safe_load(open('file.yml'))" to validate before deploying.

6. Declarative infrastructure pays off immediately. The grader deletes generated files and re-runs init — because we built it right, this is a non-issue. The manifest is always there, the templates are always there, and regeneration is instantaneous.


Conclusion

SwiftDeploy started as a task requirement and ended up being a genuinely useful mental model for how declarative infrastructure tools work. Tools like Terraform, Helm, and Pulumi are all variations of the same core idea: describe what you want, let the tool figure out how to get there.

Building this from scratch — the template engine, the CLI subcommands, the healthcheck polling loop, the rolling restart — gives you an appreciation for what those tools are doing under the hood at scale.

The debugging journey through six layers of healthcheck failures was frustrating in the moment but valuable in retrospect. Every failure taught something concrete about how Docker, WSL2, Alpine Linux, nginx, and Python's network stack interact.

The full source code is available at: https://github.com/AirFluke/hng-swiftdeploy


Built for HNG DevOps Internship — Stage 4A: SwiftDeploy

Tags: #devops #docker #nginx #python #infrastructure #hng
