instanceofGod

Posted on Jun 10

DevOps Sandbox Platform: Engineering Design Document

Project Overview: I built a self-service mini platform (DevOps Sandbox) for short-lived isolated environments, app deployment, outage simulation, health monitoring, and automatic cleanup.
Source Code: https://github.com/nielvid/devops-sandbox

1. Architecture Critique

Overview of Architecture

The DevOps Sandbox is a single-node, script-driven mini platform that provisions short-lived isolated environments. A FastAPI control plane (api.py) drives the local Docker daemon by executing bash scripts (create_env.sh, destroy_env.sh, simulate_outage.sh). Incoming traffic is routed through a single Nginx container whose configuration file is rewritten on the host, exposed through a volume mount, and applied with a reload. Environment state and lifecycle metadata live in flat JSON files on the host, a rudimentary Python polling loop monitors health, and a bash-based background daemon (cleanup_daemon.sh) checks for TTL expiration.

Exact Breaking Points

Single Point of Failure (SPOF): The entire platform, from the control plane to the sandbox workloads, runs on a single host machine. If the underlying VM crashes or undergoes maintenance, all active sandbox environments and the control plane go down immediately.
State Management & Concurrency: Storing state in flat JSON files (envs/*.json) entirely lacks concurrency control. Simultaneous API requests to create or delete environments can cause race conditions, corrupting the state files and leading to orphaned resources.
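Nginx Reload Bottleneck: Every environment creation or destruction rewrites a monolithic Nginx configuration file and issues an nginx -s reload command. A reload starts a fresh set of worker processes while the old ones drain their in-flight requests, so under high load (e.g., numerous simultaneous requests) back-to-back reloads pile up workers, consume memory, can cut long-lived connections such as WebSockets, and cause severe latency spikes. Concurrent requests can also race while rewriting the same configuration file.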
Zombie Resources & Fragile Daemons: The cleanup daemon relies on a fragile bash loop utilizing sleep. If the daemon process crashes or hangs, expired environments will never be cleaned up, eventually leading to Docker resource exhaustion (OOM errors or disk space depletion).
Orchestration Limits: Managing Docker networks and containers via imperative shell scripts scales poorly. Recovery from container crashes relies solely on basic Docker restart policies, with no intelligent rescheduling, node balancing, or meaningful self-healing.
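
To make the state-file problem concrete, here is a minimal sketch of one partial mitigation for V1: writing each JSON file through a temp file and an atomic rename, so a reader never sees a half-written file. The function names are hypothetical and not taken from the repository. Note that this only prevents torn writes. Two requests that both read, modify, and write the same file can still overwrite each other, which is why the V2 design moves state into a database.

```python
import json
import os
import tempfile


def save_state_atomic(path: str, state: dict) -> None:
    """Write state to path so readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory (same filesystem) so that
    # os.replace is an atomic rename rather than a copy.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_state(path: str) -> dict:
    """Read one environment's state file."""
    with open(path) as handle:
        return json.load(handle)
```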

Security Blind Spots

No Authentication or Authorization (AuthN/AuthZ): The Control API (/envs) is completely unauthenticated. Anyone who can reach the API endpoint can create and list environments, destroy them (DELETE /envs/:id), or trigger destructive outages.
Root Daemon Access: The API process runs with sufficient privileges to execute arbitrary Docker commands against the host daemon. Any command injection vulnerability within the FastAPI layer could lead to full host compromise via Docker socket escalation.
Lack of Resource Quotas: There are no CPU, memory, or network ingress/egress limits enforced on the spawned app containers. A single user could launch an environment with a malicious payload that monopolizes host resources, starving other environments (Noisy Neighbor).
Unencrypted Secrets & Hardcoded Configs: .env variables and internal network configurations are passed around plainly without a dedicated secrets manager.

2. New Features Fully Designed

Feature 1: Multi-Tenant Role-Based Access Control (RBAC) & Authentication

What it does & Why it is needed: Introduces strict AuthN via OAuth2/OIDC and AuthZ via RBAC. It ensures that only authenticated engineers can access the platform. Furthermore, it enforces boundaries so users can only mutate or destroy their own sandboxes unless they hold an admin role. This is an absolute necessity for internal security and auditability.
Architectural Integration: The Control API will integrate with an Identity Provider (IdP) such as Auth0 or Keycloak. An API Gateway will intercept and validate JWTs before requests reach the FastAPI backend.
Data Model Changes:
- Users table: id (PK), email, role, created_at.
- Environments table: Introduce owner_id (FK to Users) and team_id.
Trade-offs: Adds latency (a few milliseconds) to every API call due to token validation and database lookups. It also increases operational complexity compared to an open API, but this is a mandatory trade-off for a secure system.
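
The ownership rule above can be sketched as a small check that would run after the JWT has been validated. This is a sketch under assumptions: the role values "admin" and "engineer" are hypothetical, and the fields simply mirror the Users and Environments columns listed in the data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    id: int
    email: str
    role: str  # assumed values: "admin" or "engineer"


@dataclass(frozen=True)
class Environment:
    id: str
    owner_id: int
    team_id: int


def can_mutate(user: User, env: Environment) -> bool:
    """Admins may change any sandbox; everyone else only their own."""
    return user.role == "admin" or env.owner_id == user.id
```

An endpoint such as DELETE /envs/:id would call can_mutate before doing any work and return 403 when it is False.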

Feature 2: Kubernetes-Native Environment Orchestration

What it does & Why it is needed: Replaces brittle bash scripts and direct local Docker daemon interactions with Kubernetes Custom Resource Definitions (CRDs) and a custom Operator. This provides native self-healing, distributed scheduling across multiple nodes, and robust health checking.
Architectural Integration: The Control API transitions to a translation layer that submits YAML manifests directly to the Kubernetes API server. Sandboxes become isolated K8s Namespaces containing their respective Deployments, Services, and Ingresses.
Data Model Changes: Infrastructure state is offloaded to K8s etcd. The application database's Environments table will store mapping metadata: k8s_namespace, cluster_id, and sync_status.
Trade-offs: Introduces high operational complexity and a steeper learning curve for the engineering team. Infrastructure cost increases significantly due to the requirement of running a multi-node Kubernetes cluster instead of a single VM.
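
As a rough illustration of the translation layer, the sketch below builds the Namespace for one sandbox plus a ResourceQuota that caps it. The manifests are plain dicts that could be serialized to YAML or JSON before submission. The naming scheme, the sandbox.example.com label and annotation keys, and the quota values are all assumptions for illustration, not part of the current design.

```python
from datetime import datetime, timedelta, timezone


def sandbox_manifests(env_id: str, owner: str, ttl_minutes: int, now: datetime) -> list[dict]:
    """Return a Namespace and a ResourceQuota manifest for one sandbox."""
    namespace = f"sandbox-{env_id}"
    expires_at = (now + timedelta(minutes=ttl_minutes)).isoformat()
    return [
        {
            "apiVersion": "v1",
            "kind": "Namespace",
            "metadata": {
                "name": namespace,
                # Hypothetical label and annotation keys.
                "labels": {"sandbox.example.com/owner": owner},
                "annotations": {"sandbox.example.com/expires-at": expires_at},
            },
        },
        {
            "apiVersion": "v1",
            "kind": "ResourceQuota",
            "metadata": {"name": "sandbox-quota", "namespace": namespace},
            # Assumed caps; real values would come from platform policy.
            "spec": {"hard": {"limits.cpu": "2", "limits.memory": "2Gi", "pods": "10"}},
        },
    ]
```

A cleanup controller could read the expires-at annotation and delete the namespace once it has passed, which removes everything inside it.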

Feature 3: Dynamic Ingress and Route Management

What it does & Why it is needed: Replaces the static Nginx container and bash-reload mechanism with a Kubernetes Ingress Controller (e.g., Traefik, or an Envoy-based controller such as Contour). It routes traffic to new environments dynamically without dropping existing connections.
Architectural Integration: When the K8s Operator provisions an environment, it concurrently creates an Ingress resource. The Ingress Controller automatically detects the new route and configures itself dynamically without requiring manual service reloads.
Data Model Changes: No direct database schema changes. The routing state is managed entirely by K8s Ingress resources and the controller.
Trade-offs: Introduces dependency on specific K8s network add-ons. Debugging routing issues transitions from inspecting a simple static Nginx config file to querying K8s Ingress states and analyzing controller logs.
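
For illustration, this is roughly the Ingress resource the operator might create for one sandbox, again as a plain dict. The host suffix, the service name, and the "traefik" ingress class name are assumptions. The field layout follows the networking.k8s.io/v1 Ingress schema.

```python
def sandbox_ingress(
    env_id: str,
    host_suffix: str = "sandbox.example.com",  # hypothetical domain
    service_port: int = 80,
    ingress_class: str = "traefik",  # assumed IngressClass name
) -> dict:
    """Build an Ingress that routes <env_id>.<host_suffix> to the sandbox app."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": f"{env_id}-ingress", "namespace": f"sandbox-{env_id}"},
        "spec": {
            "ingressClassName": ingress_class,
            "rules": [
                {
                    "host": f"{env_id}.{host_suffix}",
                    "http": {
                        "paths": [
                            {
                                "path": "/",
                                "pathType": "Prefix",
                                "backend": {
                                    "service": {
                                        "name": f"{env_id}-app",
                                        "port": {"number": service_port},
                                    }
                                },
                            }
                        ]
                    },
                }
            ],
        },
    }
```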

Architecture Diagram (V2)

Below is a blueprint mapping out the services, caches, queues, and labeled data flows for the V2 Architecture.

3. Production Readiness

Security

AuthN/AuthZ: All API endpoints will enforce a valid JWT issued by our IdP. Role-based access logic will ensure users can only access workloads within their bounded K8s namespaces.
Secrets Management: HashiCorp Vault will be integrated into the architecture. The API and worker nodes will fetch database credentials and API keys dynamically via Vault-injected sidecars, ensuring no secrets ever exist in code, plaintext configurations, or static environment variables.
Input Validation: The FastAPI backend will use strict Pydantic schemas for all payloads and validate environment names against an allow-list pattern to prevent command injection and directory traversal attacks.
Attack Surface Minimization: Sandbox workloads will be strictly isolated using Kubernetes Network Policies, denying egress to internal corporate subnets and restricting ingress exclusively to the API gateway. All sandbox containers will run with securityContext.runAsNonRoot = true.
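
The input validation described above might look like the following sketch. It is framework-agnostic so it runs without Pydantic, and the same pattern could back a Pydantic field constraint. The pattern is the DNS label form that Kubernetes requires for namespace names. The assumption that destroy_env.sh takes the environment name as its only argument is mine.

```python
import re

# Lowercase letters, digits, and hyphens; must start and end alphanumeric;
# at most 63 characters (the DNS label rule used for K8s namespace names).
_NAME_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?")


def validate_env_name(name: str) -> str:
    """Return name unchanged if it is safe, otherwise raise ValueError."""
    if not isinstance(name, str) or not _NAME_RE.fullmatch(name):
        raise ValueError(f"invalid environment name: {name!r}")
    return name


def destroy_command(name: str) -> list[str]:
    """Build an argument list (never a shell string) for subprocess.run."""
    return ["./destroy_env.sh", validate_env_name(name)]
```

Passing the result to subprocess.run as a list, without shell=True, means the name is never interpreted by a shell even if validation were bypassed.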

Scalability

Horizontal Scaling Boundaries:
- Control Plane: The API will run as a stateless service behind an Application Load Balancer, scaling horizontally based on CPU/Memory utilization via the K8s Horizontal Pod Autoscaler (HPA).
- Worker Nodes: Nodes running sandbox environments will scale elastically using the Cluster Autoscaler when pending pods lack sufficient compute resources.
Caching Strategy & Eviction:
- Active environment status and routing metadata will be cached in Redis to offload read-heavy GET /envs requests from the primary database.
- Eviction Strategy: Redis keys will utilize a Time-To-Live (TTL) that matches the environment's projected lifespan, combined with a standard LRU (Least Recently Used) policy to protect memory limits.
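
To show how TTL expiry and LRU eviction interact, here is a small in-process stand-in for the cache. It is not Redis: in Redis the TTL would be set per key and the eviction behaviour chosen through the maxmemory-policy setting, where LRU is approximate. The class name and the injectable clock are illustrative choices.

```python
import time
from collections import OrderedDict


class TTLLRUCache:
    """Tiny cache: entries expire after their TTL, and the least recently
    used entry is evicted once max_items is exceeded."""

    def __init__(self, max_items: int, clock=time.monotonic):
        self._max_items = max_items
        self._clock = clock
        self._items = OrderedDict()  # key -> (expires_at, value)

    def set(self, key, value, ttl_seconds: float) -> None:
        self._items[key] = (self._clock() + ttl_seconds, value)
        self._items.move_to_end(key)  # mark as most recently used
        while len(self._items) > self._max_items:
            self._items.popitem(last=False)  # evict least recently used

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # lazily drop expired entries on read
            return None
        self._items.move_to_end(key)
        return value
```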
Handling Traffic Spikes:
- The API Gateway will implement strict rate limiting (e.g., 10 creations/minute per user).
- Environment creation requests during massive traffic spikes will be pushed to an asynchronous message queue (e.g., RabbitMQ), returning a 202 Accepted response with a tracking Job ID rather than blocking synchronously.
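
The per-user creation limit can be sketched as a fixed-window counter. This in-memory version is only an illustration of the logic. At the gateway the counters would live in a shared store so that every replica sees the same counts. The class name and defaults (10 per 60 seconds, taken from the example above) are assumptions.

```python
import time


class FixedWindowRateLimiter:
    """Allow at most `limit` calls per user in each fixed time window."""

    def __init__(self, limit: int = 10, window_seconds: int = 60, clock=time.time):
        self._limit = limit
        self._window_seconds = window_seconds
        self._clock = clock
        self._state = {}  # user_id -> (window_index, count)

    def allow(self, user_id: str) -> bool:
        window = int(self._clock() // self._window_seconds)
        last_window, count = self._state.get(user_id, (window, 0))
        if last_window != window:
            count = 0  # a new window has started, so the counter resets
        if count >= self._limit:
            return False
        self._state[user_id] = (window, count + 1)
        return True
```

A fixed window is the simplest option but permits short bursts across a window boundary. A sliding window or token bucket smooths that out at the cost of more state.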

Observability

Structured Logging: All platform services will output structured JSON-formatted logs. We will deploy Fluent Bit as a DaemonSet to collect, parse, and forward these logs to a centralized Elasticsearch/OpenSearch cluster. This ensures logs are decoupled from ephemeral pods and searchable out-of-the-box.
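
A minimal sketch of JSON-formatted logging using only the standard library. The field names (timestamp, level, logger, message, trace_id) are assumptions. The real schema would be agreed with whoever maintains the log pipeline.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id is supplied by the caller through `extra`, if at all.
            "trace_id": getattr(record, "trace_id", None),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str = "sandbox-api") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Usage would look like build_logger().info("environment created", extra={"trace_id": "abc123"}).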
Core Metrics Tracking:
- API Metrics: Request latency (P95, P99), error rates (HTTP 4xx/5xx), and throughput (RPS).
- Infrastructure Metrics: Node CPU/Memory utilization, Pod restart loops, and Sandbox provisioning latency.
Alerting Thresholds: Prometheus Alertmanager will page the on-call engineer for critical events. Strict thresholds include: API error rate > 5% over 5 minutes, Node CPU > 85% for 10 consecutive minutes, or RabbitMQ queue depth > 500. Lower-priority warnings will be routed to a dedicated Slack channel.
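
The paging thresholds can be written down as a small pure function, which is handy for unit-testing the policy itself. In the actual deployment these conditions would be expressed as Prometheus alerting rules. The Snapshot type and its field names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    api_error_rate: float           # fraction of failed requests over the last 5 minutes
    node_cpu_minutes_above_85: int  # consecutive minutes with node CPU above 85%
    queue_depth: int                # messages waiting in the provisioning queue


def paging_alerts(snapshot: Snapshot) -> list[str]:
    """Return the names of the critical alerts that should page on-call."""
    alerts = []
    if snapshot.api_error_rate > 0.05:
        alerts.append("api_error_rate")
    if snapshot.node_cpu_minutes_above_85 >= 10:
        alerts.append("node_cpu")
    if snapshot.queue_depth > 500:
        alerts.append("queue_depth")
    return alerts
```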
Distributed Error Tracking: Sentry will be integrated into the FastAPI backend to capture granular stack traces, request context, and unhandled exceptions across the distributed services. Standardized Trace IDs (via OpenTelemetry) will be injected into all API headers and logs for seamless cross-service tracing.

Performance

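Resource Optimization: We will use Kubernetes resource requests and limits (requests.cpu, limits.memory) to guarantee baseline compute for the control plane while enforcing hard caps on individual sandboxes. A sandbox that hits its CPU limit is throttled and one that exceeds its memory limit is killed, so a runaway workload cannot starve the control plane or its neighbors during peak load.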
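
As a sketch, the container entry in a sandbox Deployment could carry requests and limits like this. The default values are placeholders, not tuned numbers.

```python
def sandbox_container(
    name: str,
    image: str,
    cpu_request: str = "100m",      # placeholder values, not tuned
    cpu_limit: str = "500m",
    memory_request: str = "128Mi",
    memory_limit: str = "256Mi",
) -> dict:
    """Build the container spec fragment with requests and hard limits."""
    return {
        "name": name,
        "image": image,
        "resources": {
            # Requests are what the scheduler reserves for the container.
            "requests": {"cpu": cpu_request, "memory": memory_request},
            # Limits are the hard caps enforced at runtime.
            "limits": {"cpu": cpu_limit, "memory": memory_limit},
        },
    }
```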
Database Connection Pooling: The FastAPI application will connect through an external pooler such as PgBouncer, or use an in-process pool such as asyncpg's built-in connection pool, to cap the number of open connections. This keeps traffic spikes from exhausting PostgreSQL's connection limit.
Asynchronous Execution: Environment provisioning is a slow, multi-second process. By moving it into a RabbitMQ background queue, the API can respond to clients in milliseconds with an HTTP 202 Accepted. Clients then poll a status endpoint or receive a webhook, which greatly improves perceived performance.
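
The accept-then-process flow can be sketched with an in-memory queue standing in for RabbitMQ. Everything here (class name, job states, the provision callback) is hypothetical. A real worker would consume from the broker and acknowledge each message only after provisioning finishes.

```python
import queue
import uuid


class ProvisioningService:
    """Accept requests immediately and provision them later from a queue."""

    def __init__(self):
        self._queue = queue.Queue()
        self._jobs = {}  # job_id -> {"env_name": ..., "status": ...}

    def submit(self, env_name: str):
        """Enqueue a job and return (HTTP status, response body)."""
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = {"env_name": env_name, "status": "queued"}
        self._queue.put(job_id)
        return 202, {"job_id": job_id, "status": "queued"}

    def status(self, job_id: str) -> str:
        return self._jobs[job_id]["status"]

    def work_once(self, provision) -> bool:
        """Process one queued job; return False when the queue is empty."""
        try:
            job_id = self._queue.get_nowait()
        except queue.Empty:
            return False
        job = self._jobs[job_id]
        job["status"] = "provisioning"
        try:
            provision(job["env_name"])
            job["status"] = "ready"
        except Exception:
            job["status"] = "failed"
        return True
```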
Ingress Throughput: Traefik natively handles thousands of concurrent requests with minimal overhead. SSL termination will be offloaded to the Ingress controller or cloud Load Balancer to free up compute cycles in the application layer.

4. Tech Stack Decisions

Language/Framework: Python 3.11 + FastAPI
- Justification: Retained from V1 to leverage the existing team's domain knowledge. FastAPI provides excellent asynchronous concurrency out of the box (via ASGI) and native data validation using Pydantic, which is structurally essential for our enhanced V2 security posture.
Orchestration & Compute: Kubernetes (Amazon EKS)
- Justification: Replaces the raw Docker daemon. EKS provides a highly available managed control plane across multiple Availability Zones (AZs), bin-packing for scheduling sandboxes efficiently, and a mature ecosystem. Kubernetes has no built-in TTL for namespaces, so the V1 zombie resource leaks are addressed by enforcing workload TTLs through a K8s CronJob or the custom operator's reconcile loop, rather than brittle bash loops.
Database: PostgreSQL 15
- Justification: Replaces flat JSON files. We require strict ACID compliance, specifically row-level locking (SELECT ... FOR UPDATE), to guarantee that concurrent environment state mutations (creation, deletion, billing quotas) do not result in race conditions. Relational mapping is also ideal for the new Users -> Environments RBAC model and supports complex joins for reporting.
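
To illustrate the invariant that row-level locking is meant to protect, the sketch below uses SQLite from the standard library as a stand-in, since it runs anywhere. SQLite has no SELECT ... FOR UPDATE, so the sketch swaps in a guarded UPDATE inside a transaction: of two competing destroy requests, only one can move the row out of the running state. The status column and its values are assumptions.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a minimal environments table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE environments ("
        "id TEXT PRIMARY KEY, owner_id INTEGER NOT NULL, status TEXT NOT NULL)"
    )
    return conn


def claim_for_destroy(conn: sqlite3.Connection, env_id: str) -> bool:
    """Move running -> destroying; True only for the caller that wins."""
    with conn:  # commits on success, rolls back on exception
        cursor = conn.execute(
            "UPDATE environments SET status = 'destroying' "
            "WHERE id = ? AND status = 'running'",
            (env_id,),
        )
    return cursor.rowcount == 1
```

With PostgreSQL the same guard works as written, and SELECT ... FOR UPDATE becomes useful when several statements must run against the locked row before the state changes.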
Caching: Redis
- Justification: Chosen for sub-millisecond read latency on the heavily polled /envs endpoints. Redis's native TTL support maps directly onto our sandbox environment lifecycles, and it can hold our API Gateway rate-limiting counters using INCR and EXPIRE. Each command is atomic on its own, so the pair should be wrapped in a MULTI/EXEC transaction or a Lua script to ensure a counter is never left without an expiry.
Asynchronous Queuing: RabbitMQ
- Justification: Sandbox provisioning is an inherently slow operation. RabbitMQ guarantees message delivery (at-least-once) with explicit ACKs. It allows us to decouple the fast API requests from the slow infrastructure orchestration, effectively leveling out massive traffic spikes without dropping requests.
Ingress / Routing: Traefik
- Justification: Traefik continuously auto-discovers K8s Ingress routes and updates its internal configuration dynamically without the dropped connections or hard process reloads that severely bottlenecked our V1 Nginx setup.
Secrets Management: HashiCorp Vault
- Justification: The industry standard for dynamic, lease-based secrets. It integrates cleanly with Kubernetes via the Vault Agent sidecar to inject secrets directly into memory (tmpfs), fulfilling our strict V2 security requirements and preventing secret sprawl.
Observability (Metrics & Logs): Prometheus + Grafana + ELK Stack
- Justification: Prometheus uses a pull-based model with service discovery that suits ephemeral K8s workloads such as our short-lived sandboxes. Grafana is the industry standard for visualizing these time-series metrics. The ELK stack, with Fluent Bit as the log shipper in front of Elasticsearch, provides robust full-text indexing for our structured JSON logs, which is a hard requirement for distributed debugging in a microservices environment.
