A self-service DevOps sandbox platform with auto-destroying environments, dynamic Nginx routing, and a chaos engineering toggle — plus every painful deployment war story.
## The Challenge
Imagine you're on a DevOps team and every developer needs their own isolated environment to test their code. Spinning them up manually is slow. Forgetting to tear them down wastes resources. And nobody ever tests what happens when things actually break in production.
That was the problem I set out to solve.
The result? A fully self-service DevOps Sandbox Platform — a miniature internal Heroku where environments are short-lived by design, chaos is a feature, and everything cleans itself up automatically.
What I didn't plan for was the deployment war that followed.
## What I Built

The platform lets any user:

- Spin up an isolated environment with one command
- Deploy an app into it automatically
- Monitor its health every 30 seconds
- Simulate outages — crashes, network failures, CPU stress
- Auto-destroy everything when the TTL expires

All of this runs on a single Linux VM and starts with one command: `make up`.
## The Architecture

Everything lives inside one Azure VM:

```
Client → Nginx (port 80) → App Containers
                ↑
         Auto-generated
         conf.d/*.conf

API (port 5000) → Bash Scripts → Docker
        ↓
envs/*.json (state)
logs/*/ (logs)

Background: Health Monitor (30s) + Cleanup Daemon (60s)
```
Five core components:

- **Nginx — the front door.** Every environment gets its own config file auto-written to `nginx/conf.d/`. When a new environment is created, the script writes the config and runs `nginx -s reload`. Traffic is routed by hostname.
- **FastAPI control API.** Seven REST endpoints wrapping all the bash scripts: create, list, destroy, fetch logs, check health, trigger outages. Swagger docs at `/docs`.
- **The bash engine.** Four scripts power everything:
  - `create_env.sh` — spins up the container, network, Nginx config, and log shipping
  - `destroy_env.sh` — tears everything down cleanly and archives logs
  - `simulate_outage.sh` — chaos engineering with crash/pause/network/recover/stress modes
  - `cleanup_daemon.sh` — runs every 60 seconds and auto-destroys expired environments
- **Health monitor.** A Python script polls every active environment's `/health` endpoint every 30 seconds; three consecutive failures mark the environment as "degraded."
- **State management.** JSON files in `envs/`, written atomically using temp file + `mv` to prevent corruption.
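That temp-file + `mv` pattern can be sketched as a small helper (the paths and JSON fields here are illustrative, not the platform's actual schema):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write an environment's state file atomically: write to a temp file
# in the same directory, then rename over the target. A rename within
# one filesystem is atomic, so readers never see half-written JSON.
write_state() {
  local env_id="$1" json="$2"
  local target="envs/${env_id}.json" tmp
  tmp="$(mktemp "envs/.${env_id}.XXXXXX")"
  printf '%s\n' "$json" > "$tmp"
  mv "$tmp" "$target"
}

mkdir -p envs
write_state "env-demo-123" '{"status": "running", "ttl": 3600}'
```

The temp file lives in `envs/` itself (not `/tmp`) so the final `mv` never crosses a filesystem boundary, which is what keeps the rename atomic.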
## Building It Was the Easy Part

The platform came together cleanly. One command to start everything, environments spinning up in seconds, chaos simulation working perfectly. `make up` → `make create` → `make simulate` → everything worked.
Then came deployment.
## The Azure Deployment Wars

### Battle 1 — The SSH Key That Didn't Exist

First attempt to SSH into the VM:

```
Warning: Identity file azureuser_key.pem not accessible: No such file or directory
Permission denied (publickey)
```
The `.pem` file path was wrong. Classic. Found the actual file — `hng5-vm_key.pem` in Downloads — and fixed the path. But then:

```
Permission denied (publickey)
```

Still failing. The key didn't match the VM. Had to reset the SSH key directly in the Azure portal → Connect → Reset SSH public key. Twenty minutes lost.
**Lesson:** Always verify that your SSH key matches the VM it was created with. Azure makes resetting easy, but you lose time.
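A quick sanity check before burning time on the VM side: compare key fingerprints. A sketch, with an illustrative key path:

```shell
# Print a key file's fingerprint so you can compare it with the
# public key the Azure portal lists for the VM. A mismatch means
# "Permission denied (publickey)" no matter what else you fix.
key_fingerprint() {
  ssh-keygen -lf "$1"
}

# Usage (path is illustrative):
# key_fingerprint ~/Downloads/hng5-vm_key.pem
```

`ssh-keygen -lf` works on both private and public key files, so you can run it on whatever `.pem` you found in Downloads.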
### Battle 2 — The Azure Firewall That Silently Blocked Everything

The platform was running. The API was live on port 5000 inside the VM. But the browser couldn't reach it:

```
ERR_CONNECTION_REFUSED
```
Added inbound port rules in the Azure NSG for port 5000. Still refused. Added them again. Still refused.

Tried routing through Nginx on port 8080. Tried a proxy container. Tried `172.17.0.1`. Tried `127.0.0.1`. Every attempt returned:

```
502 Bad Gateway
```
The real problem? The NSG rules were saving, but an additional firewall layer was silently dropping traffic. Port 5000 was listening perfectly inside the VM — `ss -tlnp` confirmed it — but nothing from outside could reach it.

The fix that actually worked: run the API container in `--network host` mode instead of the default bridge network:
```bash
docker run -d \
  --name sandbox-api \
  --network host \
  -v "$(pwd)":/app \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -w /app \
  devops-sandbox-api \
  python3 platform/api.py
```
Host network mode binds directly to the VM's network interface, bypassing Docker's bridge entirely. Suddenly port 5000 was reachable from outside.

**Lesson:** On Azure VMs, Docker bridge networking can be blocked even when NSG rules look correct. Host network mode is your escape hatch.
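Before blaming the cloud firewall, it helps to separate "not listening" from "not reachable". A minimal check along those lines (the port number is illustrative):

```shell
# True if something on the host is bound to the given TCP port.
port_is_listening() {
  ss -tln 2>/dev/null | grep -q ":$1 "
}

check_port() {
  if port_is_listening "$1"; then
    # Bound locally: any remaining failure is network or firewall.
    echo "port $1 is bound on the host"
  else
    # Not even bound: the service itself never came up.
    echo "nothing bound on port $1"
  fi
}

check_port 5000
```

If the port is bound but unreachable from outside, you are in firewall/NSG/bridge territory; if it is not bound at all, no amount of Azure rules will help.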
### Battle 3 — `platform` Is a Reserved Python Name

With host networking, the API still wouldn't start:

```
ERROR: Could not import module "platform.api"
```

`platform` is a built-in Python standard library module. Uvicorn was trying to import Python's built-in `platform` module instead of our `platform/api.py` file.
The fix: run the API directly as a Python script instead of through uvicorn's module import:

```bash
python3 platform/api.py
```

instead of:

```bash
uvicorn platform.api:app
```
**Lesson:** Never name your application directory after a Python standard library module. `platform`, `json`, `os`, `sys` — any of them will shadow the stdlib. Rename to `app`, `api`, or `src` instead.
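The clash is easy to reproduce with a throwaway directory; this sketch assumes nothing beyond a stock `python3`:

```shell
# Python puts the current directory first on sys.path, so a local
# platform/ package shadows the stdlib's platform module.
mkdir -p /tmp/shadow-demo/platform
touch /tmp/shadow-demo/platform/__init__.py
cd /tmp/shadow-demo

# This now resolves to ./platform/__init__.py, not the stdlib module,
# which is exactly why uvicorn could not find "platform.api".
python3 -c "import platform; print(platform.__file__)"
```

Run the same command from any other directory and `platform.__file__` points back at the standard library, which is what makes this bug so disorienting.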
### Battle 4 — The EOF That Kept Disappearing

Writing Nginx config files directly in the terminal kept producing malformed output. The heredoc `EOF` delimiter was being swallowed, or the content was getting duplicated:

```bash
# This kept failing silently:
cat > nginx/conf.d/api.conf << 'EOF'
server {
    ...
}
EOF  # ← this was the problem line
```

The terminal was interpreting `EOF` as part of the previous command instead of as a delimiter.
The fix: use `tee` instead of `cat` redirection:

```bash
tee nginx/conf.d/api.conf > /dev/null << 'EOF'
server {
    listen 8080;
    location / {
        proxy_pass http://172.17.0.1:5000/;
    }
}
EOF
```

Then verify immediately:

```bash
cat nginx/conf.d/api.conf
```
**Lesson:** Always verify config files after writing them in the terminal. One malformed line silently breaks everything downstream.
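One way to bake that verification in: check the file immediately after writing it, before any reload. A sketch with illustrative paths:

```shell
# Write an Nginx config with tee, then fail fast if the result is
# empty or obviously truncated, before ever reloading Nginx.
write_conf() {
  local conf="$1"
  tee "$conf" > /dev/null << 'EOF'
server {
    listen 8080;
    location / {
        proxy_pass http://172.17.0.1:5000/;
    }
}
EOF
  # A botched heredoc usually shows up as an empty or brace-less
  # file; catch the obvious cases right away.
  [ -s "$conf" ] || { echo "empty config: $conf" >&2; return 1; }
  grep -q '}' "$conf" || { echo "truncated config: $conf" >&2; return 1; }
}

# Only reload once Nginx's own syntax check also passes:
# write_conf nginx/conf.d/api.conf && nginx -t && nginx -s reload
```

Chaining `nginx -t` before `nginx -s reload` means a bad config never takes down the routing for every other environment.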
## The Moment It Worked

After hours of SSH key resets, firewall rules, proxy containers, network debugging, and Python module conflicts — the browser finally loaded:

```
http://20.121.185.0:5000/docs
```

**DevOps Sandbox API — 1.0.0 — OAS 3.1**

All 7 endpoints. Live. Publicly accessible. 🔥
## The API

| Method | Endpoint | What it does |
| --- | --- | --- |
| POST | `/envs` | Create environment |
| GET | `/envs` | List all + TTL remaining |
| DELETE | `/envs/:id` | Destroy environment |
| GET | `/envs/:id/logs` | Last 100 lines of logs |
| GET | `/envs/:id/health` | Last 10 health checks |
| POST | `/envs/:id/outage` | Trigger simulation |
| GET | `/health` | API health check |
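The endpoints above map naturally onto small curl helpers. A sketch; the JSON body fields are guesses, not the API's actual schema (check `/docs` for the real one):

```shell
BASE="${BASE:-http://20.121.185.0:5000}"

# Thin curl wrappers over the endpoints listed above. Request bodies
# are illustrative; the live Swagger UI documents the real schema.
create_env()  { curl -s -X POST   "$BASE/envs" -H 'Content-Type: application/json' -d "$1"; }
list_envs()   { curl -s           "$BASE/envs"; }
trigger()     { curl -s -X POST   "$BASE/envs/$1/outage" -d "{\"mode\": \"$2\"}"; }
destroy_env() { curl -s -X DELETE "$BASE/envs/$1"; }

# Usage:
# create_env '{"name": "demo", "ttl": 3600}'
# trigger env-demo-123 crash
# destroy_env env-demo-123
```

Setting `BASE` as an overridable variable lets the same helpers target a local `make up` instance or the public VM.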
## The Chaos Engineering Toggle

```bash
make simulate ENV=env-demo-123 MODE=crash    # Kill container
make simulate ENV=env-demo-123 MODE=pause    # Freeze processes
make simulate ENV=env-demo-123 MODE=network  # Cut the network
make simulate ENV=env-demo-123 MODE=stress   # Spike CPU
make simulate ENV=env-demo-123 MODE=recover  # Fix everything
```
The health monitor detects crashes within 90 seconds. Watch it live:
```bash
tail -f logs/env-demo-123/health.log
```
## What I Learned

- **The platform code was the easy part.** Bash scripts, Docker networking, Python APIs — all of that came together in hours. The deployment took longer than the build.
- **Azure's firewall has layers.** NSG rules are not the only thing blocking traffic; Docker bridge networking adds another layer. When in doubt, use `--network host` to isolate the variable.
- **`platform` is taken.** Never name your directory after a Python standard library module. It will bite you at the worst possible moment — right before a deadline.
- **Always verify file writes.** `cat` the file after every `tee` or heredoc. One silent corruption cascades into hours of 502 errors.
- **Chaos engineering is a mindset.** Building the outage simulator forced me to think about every failure mode before it happened in production. The deployment battle was unplanned chaos engineering on the platform itself.
- **Deadlines are the best debugging tool.** Nothing focuses the mind like a submission deadline. Every error gets solved eventually — you just move faster when the clock is running.
## Try It Yourself
GitHub: github.com/ntonous/devops-sandbox
Live API: http://20.121.185.0:5000/docs
Clone it, spin it up, break something, watch it recover. That's the whole point.
Built and deployed as part of the HNG14 DevOps track — Stage 5 task. Special thanks to every Azure error message that taught me something.