Chetan

Posted on May 30

I Built a Production-Grade DevSecOps Platform From Scratch — Here's Every Decision I Made

#devops #kubernetes #terraform #github

Most DevOps tutorials show you how to push a Docker image to DockerHub and call it a day. This is not that post.

I spent weeks building a platform that mirrors what actually runs inside companies like Stripe, Notion, or Cloudflare — automated security gates, infrastructure as code, self-healing Kubernetes deployments, and a full observability stack that pages you on Slack at 3am. Every decision was deliberate. Every tool earns its place.

Here's the whole thing, phase by phase.

The Goal

The challenge I set myself: build a platform where:

No code reaches production without passing security checks — automatically
Infrastructure is version-controlled — no manual clicking in AWS consoles
Deployments are zero-touch — git push is the only operator action
The cluster corrects itself — manual changes get reverted, failed deploys roll back
You can see everything — metrics, dashboards, and alerts firing to Slack

The app itself is intentionally boring: a Flask API with three endpoints. The infrastructure is the point.

Phase 1 — DevSecOps CI Pipeline

Security as an afterthought is how you end up on HaveIBeenPwned. I baked it into the pipeline from day one.

Every push to main runs four sequential steps, three security checks plus an image build, before a single byte gets deployed:

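jobs:
  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0                            # full history for the secret scan

      - uses: trufflesecurity/trufflehog@main    # leaked secrets
        with:
          extra_args: --only-verified

      - run: pip install safety && safety check  # CVE audit on deps

      - run: docker build -t devops-app ./backend # build locally for scanning

      - uses: aquasecurity/trivy-action@master    # OS-level vuln scan
        with:
          image-ref: devops-app                     # scan the image built above
          severity: 'CRITICAL,HIGH'
          exit-code: '1'                            # non-zero exit so findings fail the job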

TruffleHog scans every commit diff for leaked API keys, tokens, and passwords, and with --only-verified it reports only credentials it has confirmed against the live service rather than every regex match. Safety audits Python dependencies against a database of known vulnerabilities. Trivy scans the built container image for vulnerable OS packages and language-level packages.

The pipeline only continues to build-and-push if all three pass. Security is a gate, not a suggestion.
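
To make "gate" concrete, here is a minimal Python sketch of what failing on CRITICAL or HIGH means when applied to a Trivy JSON report (the kind produced by trivy image --format json). The function name blocking_findings and the hand-written sample report are illustrative assumptions, not code from the repo; in the pipeline itself the Trivy action does this job.

```python
import json

BLOCKING = {"CRITICAL", "HIGH"}  # mirrors severity: 'CRITICAL,HIGH' in the workflow

def blocking_findings(report: dict) -> list[str]:
    """Return the IDs of vulnerabilities severe enough to fail the build."""
    ids = []
    for result in report.get("Results", []):
        # "Vulnerabilities" can be missing or null for a target with no findings
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in BLOCKING:
                ids.append(vuln["VulnerabilityID"])
    return ids

# Hypothetical report, written by hand in the shape Trivy emits
sample = json.loads("""
{"Results": [
  {"Target": "devops-app (debian 12)",
   "Vulnerabilities": [
     {"VulnerabilityID": "CVE-0000-0001", "Severity": "HIGH"},
     {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"}]},
  {"Target": "Python", "Vulnerabilities": null}]}
""")

findings = blocking_findings(sample)
gate_passed = not findings  # any blocking finding closes the gate
```

With this sample the LOW finding is ignored and the single HIGH finding is enough to block the build.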

The Dockerfile

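Multi-stage builds are non-negotiable in production. The builder stage installs dependencies; the final image copies only the installed packages and their entry-point scripts, leaving behind pip's cache and any compilers or headers a build might need. (The python:3.11-slim base still ships its own pip, so the final image is not pip-free.)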

FROM python:3.11-slim AS builder
WORKDIR /app
RUN pip install --no-cache-dir flask prometheus-client

FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin
COPY app.py .

RUN useradd -u 10001 appuser && chown -R appuser:appuser /app
USER appuser

EXPOSE 5000
CMD ["python", "app.py"]

Running as uid 10001 means that if the container is ever compromised, the attacker lands in an unprivileged account rather than root. Non-root containers are a standard requirement in enterprise container security audits in 2025.

The result: an image that's roughly 60% smaller than a naive single-stage build, with significantly fewer Trivy findings.

Phase 2 — Infrastructure as Code with Terraform

The rule: if it can't be terraform apply'd, it doesn't exist.

I provisioned the full AWS environment — VPC, subnets, security groups, EC2, S3, IAM roles, and EKS — in code. No manual console clicks, ever.

# The entire network fabric
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}

resource "aws_subnet" "public"  { cidr_block = "10.0.1.0/24" ... }
resource "aws_subnet" "private" { cidr_block = "10.0.2.0/24" ... }
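
An address plan like this can be sanity-checked before applying it with Python's standard ipaddress module. This sketch uses only the three CIDR blocks shown above; the variable names are mine.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
public = ipaddress.ip_network("10.0.1.0/24")
private = ipaddress.ip_network("10.0.2.0/24")

# Both subnets must sit inside the VPC range and must not overlap each other
subnets_fit = public.subnet_of(vpc) and private.subnet_of(vpc)
subnets_overlap = public.overlaps(private)

# A /24 contains 256 addresses; a /16 has room for 256 such /24 blocks
addresses_per_subnet = public.num_addresses
possible_24s = 2 ** (24 - vpc.prefixlen)
```

AWS reserves five addresses in every subnet, so each /24 here has 251 usable addresses.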

A few decisions worth explaining:

Why RDS gets a private subnet. The database should never be reachable from the internet, only from within the VPC. A private subnet has no route to an internet gateway, so this is enforced by routing itself and not only by security group rules.

Why I generate the EC2 SSH key via Terraform.

resource "tls_private_key" "rsa_key" {
  algorithm = "RSA"
  rsa_bits  = 4096
}

resource "aws_key_pair" "app_key" {
  key_name   = "${var.project_name}-key"
  public_key = tls_private_key.rsa_key.public_key_openssh
}

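No manual key generation, no keys sitting in someone's Downloads folder. The private key is a Terraform output marked sensitive = true, so it exists in state, not in source control. The trade-off is that sensitive only hides the value from CLI output; the key is still stored in plain text inside the state file, so the state bucket needs encryption and tight access control.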

Why S3 for Terraform state. Local .tfstate files go out of sync between teammates and are catastrophic to lose. S3 with versioning means state is always current and recoverable.

The payoff: terraform apply brings up the entire environment in about 15 minutes. terraform destroy tears it down, and billing stops as soon as the resources are gone. Reproducible, auditable, version-controlled infrastructure.

Phase 3 — Automated Deployment

The pipeline pushes two tags on every successful build: latest and the exact git SHA.

- uses: docker/build-push-action@v5
  with:
    context: ./backend
    push: true
    tags: |
      ${{ env.IMAGE_NAME }}:latest
      ${{ env.IMAGE_NAME }}:${{ github.sha }}

Why both? latest is for convenience. The SHA tag is for precision — you can roll back to any exact commit with a single command. This matters when you're debugging a production incident at midnight and need to know exactly what's running.
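
The difference between the two tags can be sketched in a few lines of Python. The image name, the SHAs, and the deploy history below are hypothetical; the point is only that a SHA tag names one build forever, while latest names whatever was pushed most recently.

```python
def image_tags(image: str, git_sha: str) -> list[str]:
    """Return the mutable and the immutable tag for one build."""
    if not git_sha:
        raise ValueError("git_sha must not be empty")
    return [f"{image}:latest", f"{image}:{git_sha}"]

def rollback_target(image: str, deployed_shas: list[str]) -> str:
    """Pick the build that ran before the current one (oldest first in the list)."""
    if len(deployed_shas) < 2:
        raise ValueError("no earlier build to roll back to")
    return f"{image}:{deployed_shas[-2]}"

# Hypothetical image name and commit SHAs
history = ["1a2b3c4", "5d6e7f8", "9a0b1c2"]
tags = image_tags("example/devops-app", history[-1])
previous = rollback_target("example/devops-app", history)
```

Rolling back to latest is meaningless because it already points at the broken build; only the SHA tag identifies the previous one.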

Phase 4 — Kubernetes + GitOps with ArgoCD

This is where it gets interesting.

The EKS cluster runs the app via a Helm chart. The chart manages replicas, resource limits, health probes, and autoscaling:

# values.yaml
replicaCount: 2

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 200m
    memory: 256Mi

autoscaling:
  minReplicas: 2
  maxReplicas: 5
  targetCPUUtilizationPercentage: 70

The HPA scaling target is 70%, not 90%. At 90% you're already overwhelmed — new pods take time to start and warm up. 70% gives the cluster headroom to scale before traffic saturates the existing pods.
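
The arithmetic behind that headroom follows the formula in the Kubernetes docs: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min and max. Below is a simplified Python sketch using the values from values.yaml. It ignores stabilization windows and pod readiness, and the 0.1 tolerance is the documented default, assumed unchanged here.

```python
import math

def desired_replicas(current: int, current_cpu_pct: float,
                     target_cpu_pct: float = 70,
                     min_replicas: int = 2, max_replicas: int = 5,
                     tolerance: float = 0.1) -> int:
    """Simplified version of the HPA replica calculation."""
    ratio = current_cpu_pct / target_cpu_pct
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to the target: leave the count alone
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))

at_90 = desired_replicas(2, 90)    # 2 pods averaging 90% CPU
at_72 = desired_replicas(2, 72)    # within tolerance of the 70% target
spike = desired_replicas(4, 200)   # would want 12, capped at maxReplicas
quiet = desired_replicas(5, 10)    # would want 1, floored at minReplicas
```

Two pods at 90% CPU scale to three, while two pods at 72% stay put because the ratio is within tolerance of the target.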

The GitOps Loop

Here's the part that makes this different from "deploy via kubectl":

# argocd/application.yaml
syncPolicy:
  automated:
    prune: true      # delete resources removed from Git
    selfHeal: true   # revert any manual cluster changes

When GitHub Actions updates the image tag in values.yaml and pushes the commit:

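ArgoCD detects the change in Git on its next poll (every three minutes by default) or within seconds if a Git webhook is configured
Triggers a rolling update on the cluster with zero downtime
If health checks fail post-deploy, the rolling update stalls with the old pods still serving and ArgoCD marks the app Degraded; an automatic rollback needs something extra, such as Argo Rollouts or a git revert
If someone manually kubectl apply's something directly to the cluster, selfHeal reverts it within minutes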

Git is the single source of truth. The cluster is a reflection of the repo, not an independent entity that drifts over time.
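
The reconcile idea behind prune and selfHeal can be sketched as a pure function that compares three snapshots: what Git wants, what was last applied, and what is live. This is a toy model under my own assumptions (plain dicts keyed by resource name, a live snapshot containing only tracked resources), not how ArgoCD is implemented.

```python
Resources = dict[str, dict]

def plan_sync(desired: Resources, last_applied: Resources, live: Resources,
              prune: bool = True, self_heal: bool = True) -> list[tuple[str, str]]:
    """Return the (action, resource) steps needed to make live match Git."""
    actions = []
    for name, spec in desired.items():
        if live.get(name) == spec:
            continue  # already in sync
        git_changed = last_applied.get(name) != spec
        # A Git change is always synced; pure live drift only if selfHeal is on
        if git_changed or self_heal:
            actions.append(("apply", name))
    if prune:
        for name in live:
            if name not in desired:
                actions.append(("delete", name))  # removed from Git
    return actions

git = {"deployment/app": {"image": "app:abc123", "replicas": 2},
       "service/app": {"port": 5000}}
# Someone scaled the deployment by hand and left an old ConfigMap behind
cluster = {"deployment/app": {"image": "app:abc123", "replicas": 7},
           "service/app": {"port": 5000},
           "configmap/old": {"key": "value"}}

plan = plan_sync(git, last_applied=git, live=cluster)
hands_off = plan_sync(git, last_applied=git, live=cluster,
                      prune=False, self_heal=False)
```

With both flags on, the hand-scaled deployment is re-applied and the stale ConfigMap is deleted; with both off, the drift is left in place.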

Phase 5 — Full Observability Stack

You cannot operate what you cannot observe.

The Flask app exposes custom Prometheus metrics at /metrics:

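from prometheus_client import Counter, Histogram

# Counts every request, labelled so error rates can be split out by status
REQUEST_COUNT = Counter(
    'app_requests_total',
    'Total number of requests',
    ['method', 'endpoint', 'status']
)

# Observes request duration in seconds, per endpoint
REQUEST_LATENCY = Histogram(
    'app_request_latency_seconds',
    'Request duration',
    ['endpoint']
)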

A ServiceMonitor tells Prometheus to scrape the endpoint every 15 seconds. From there, four Grafana panels give full visibility:

Panel	Query
Request Rate	`sum(rate(app_requests_total[5m]))`
Error Rate %	`sum(rate(app_requests_total{status=~"5.."}[5m])) / sum(rate(app_requests_total[5m])) * 100`
P95 Latency	`histogram_quantile(0.95, sum by (le) (rate(app_request_latency_seconds_bucket[5m])))`
Pod Restarts	`kube_pod_container_status_restarts_total{namespace="default"}`
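
The P95 panel is an estimate, not an exact percentile: histogram_quantile finds the bucket the requested rank falls into and interpolates linearly inside it. The Python sketch below is a simplified re-implementation of that idea for classic cumulative buckets, not Prometheus's own code, and the bucket counts are made up.

```python
import math

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """buckets: (upper_bound, cumulative_count), sorted, last bound is +Inf."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank lands in the +Inf bucket: report the highest finite bound
                return buckets[-2][0]
            # Linear interpolation between the bucket's lower and upper bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan

# Hypothetical latency buckets in seconds: 100 requests in total
latency = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, latency)
p50 = histogram_quantile(0.50, latency)
p995 = histogram_quantile(0.995, latency)
```

Because the answer is interpolated, its accuracy depends on where the bucket boundaries sit, which is worth remembering when choosing buckets for the latency histogram.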

AlertManager Rules

Four alerts fire to a Slack #devops-alerts channel. Two of them:

- alert: HighErrorRate
  expr: rate(app_requests_total{status=~"5.."}[5m]) > 0.1
  for: 2m
  labels:
    severity: critical

- alert: PodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
  for: 5m
  labels:
    severity: critical

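The for: 2m duration on error rate prevents false positives from a momentary spike. The alert only fires if the condition holds on every evaluation for two minutes: sustained degradation, not noise. Note that the expression compares an absolute rate, so 0.1 means 0.1 5xx responses per second on a single series (one error every ten seconds), not a 10% error ratio.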
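
The pending-then-firing behaviour of for: can be modelled as a small state machine over rule evaluations. This is an illustrative Python sketch with a hypothetical 30-second evaluation interval, not Prometheus code.

```python
def firing_times(samples: list[tuple[int, bool]], for_seconds: int) -> list[int]:
    """samples: (timestamp_seconds, condition_true) per rule evaluation, in order."""
    firing = []
    pending_since = None
    for ts, active in samples:
        if not active:
            pending_since = None      # condition cleared: the timer resets
            continue
        if pending_since is None:
            pending_since = ts        # alert enters the pending state
        if ts - pending_since >= for_seconds:
            firing.append(ts)         # condition held long enough: firing
    return firing

# A short blip at t=0..30, then a sustained problem from t=90 onward
evaluations = [(0, True), (30, True), (60, False), (90, True),
               (120, True), (150, True), (180, True), (210, True)]
fired_at = firing_times(evaluations, for_seconds=120)
blip_only = firing_times([(0, True), (30, False)], for_seconds=120)
```

The early blip never fires, and the sustained problem that starts at t=90 only fires at t=210, once it has held for the full two minutes.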

What I'd Do Differently

A few things I'd change building this again:

Multi-environment from the start. One Terraform workspace and one ArgoCD app works fine for learning, but the first thing you'd add in a real org is separate staging and prod environments with promotion gates between them.

Spot instances on the node group. The EKS worker nodes run on t3.small on-demand. Mixing in Spot instances with appropriate interruption handling could cut the compute cost by an estimated 60-70%, depending on instance type and region.

OpenTelemetry instead of manual instrumentation. Hand-instrumenting the Flask app with Prometheus counters and histograms works, but OpenTelemetry gives you traces, metrics, and logs through a single SDK — and it's vendor-neutral.

The Full Stack at a Glance

Category	Tool	Why
Secret scanning	TruffleHog	Verified detections, not just regex
Dependency audit	Safety	CVE database for Python packages
Container scanning	Trivy	OS + package layer vulns
IaC	Terraform	Reproducible, version-controlled AWS
Orchestration	Kubernetes (EKS)	Self-healing, scalable containers
Packaging	Helm	Templated K8s manifests
GitOps	ArgoCD	Git as source of truth, auto-revert
Metrics	Prometheus	Custom app + node + cluster metrics
Dashboards	Grafana	Real-time visualisation
Alerting	AlertManager + Slack	Threshold-based incident paging
CI/CD	GitHub Actions	Pipeline on every push

Repo

Everything is open source: github.com/ChetanEpuri/modern-devops-project

The README walks through prerequisites, getting started locally with Docker Compose, provisioning the full cloud stack, and connecting to the ArgoCD and Grafana dashboards.

If any of this is useful or you're building something similar, drop a comment. I'm particularly interested in talking to people who've taken GitOps patterns further — multi-cluster setups, progressive delivery with Flagger, that kind of thing.

Top comments (1)

Harjot Singh • May 31

"No code reaches production without passing security checks, automatically" is the line that separates a platform from a pile of scripts, the automatic part is everything, because any gate that depends on a human remembering to run it is already broken. The "every tool earns its place" framing is the right discipline too: most platform sprawl comes from adding tools defensively rather than because a specific failure demanded them, and unearned tools become unmaintained attack surface. The thread running through your whole stack (security gates, IaC, self-healing, paging observability) is really one idea: make the safe path the default path and the unsafe path impossible, so correctness doesn't rely on discipline under pressure. That's exactly how I think about shipping in Moonshift, the harness enforces the gates so a tired human at 3am can't skip them. The honest question on a from-scratch build like this: what was the decision you'd reverse now, the place where you over-engineered a gate that cost more in friction than it ever caught?