Chetan

I Built a Production-Grade DevSecOps Platform From Scratch — Here's Every Decision I Made

Most DevOps tutorials show you how to push a Docker image to DockerHub and call it a day. This is not that post.

I spent weeks building a platform that mirrors what actually runs inside companies like Stripe, Notion, or Cloudflare — automated security gates, infrastructure as code, self-healing Kubernetes deployments, and a full observability stack that pages you on Slack at 3am. Every decision was deliberate. Every tool earns its place.

Here's the whole thing, phase by phase.


The Goal

The challenge I set myself: build a platform where:

  1. No code reaches production without passing security checks — automatically
  2. Infrastructure is version-controlled — no manual clicking in AWS consoles
  3. Deployments are zero-touch — git push is the only operator action
  4. The cluster corrects itself — manual changes get reverted, failed deploys roll back
  5. You can see everything — metrics, dashboards, and alerts firing to Slack

The app itself is intentionally boring: a Flask API with three endpoints. The infrastructure is the point.


Phase 1 — DevSecOps CI Pipeline

Security as an afterthought is how you end up on HaveIBeenPwned. I baked it into the pipeline from day one.

Every push to main triggers three security checks, plus a local image build for scanning, before a single byte gets deployed:

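# Workflow excerpt; runs-on and the checkout step are assumptions added for completeness
jobs:
  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0                          # full history so commits can be scanned

      - uses: trufflesecurity/trufflehog@main    # leaked secrets; pin to a release tag in real use
        with:
          extra_args: --only-verified

      - run: pip install safety && safety check  # CVE audit; without -r it checks the runner's installed packages

      - run: docker build -t devops-app ./backend # build locally for scanning

      - uses: aquasecurity/trivy-action@master    # image vulnerability scan
        with:
          image-ref: devops-app                   # scan the image built in the previous step
          severity: 'CRITICAL,HIGH'
          exit-code: '1'                          # fail the job on findings; the default of 0 only reports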

TruffleHog scans commits for leaked API keys, tokens, and passwords. With --only-verified it reports only credentials it could confirm against the live service, not every regex match. Safety audits Python dependencies against a database of known vulnerabilities. Trivy scans the built container image for vulnerable OS packages and language dependencies.

The pipeline only continues to build-and-push if all three pass. Security is a gate, not a suggestion.
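To make the gate concrete, here is a minimal Python sketch of the pass/fail decision behind a severity-filtered image scan. It assumes a report produced with Trivy's JSON output format, where a top-level `Results` list holds per-target `Vulnerabilities` entries with a `Severity` field. The function name `blocking_findings` and the sample report are illustrative and not part of the project.

```python
import json

BLOCKING = {"CRITICAL", "HIGH"}  # same severities the pipeline gates on


def blocking_findings(report: dict, blocking=BLOCKING) -> list[str]:
    """Return IDs of vulnerabilities whose severity should fail the build."""
    found = []
    for result in report.get("Results") or []:
        # "Vulnerabilities" may be missing or null when a target is clean.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in blocking:
                found.append(vuln.get("VulnerabilityID", "unknown"))
    return found


# Hypothetical report, trimmed to the fields used above.
sample = json.loads("""
{"Results": [
  {"Target": "devops-app (debian)", "Vulnerabilities": [
    {"VulnerabilityID": "CVE-0000-0001", "Severity": "HIGH"},
    {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"}]},
  {"Target": "Python", "Vulnerabilities": null}
]}
""")

ids = blocking_findings(sample)
exit_code = 1 if ids else 0  # a non-zero exit is what stops the pipeline
print(ids, exit_code)
```

The point of the sketch is the last two lines: a scanner only acts as a gate when findings are turned into a non-zero exit code that the CI job respects.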

The Dockerfile

Multi-stage builds are non-negotiable in production. The builder stage installs dependencies; the final image copies only the installed packages, so build caches and anything else the builder needed stay out of the runtime image and off the attack surface. (The python:3.11-slim base still ships pip itself.)

FROM python:3.11-slim AS builder
WORKDIR /app
RUN pip install --no-cache-dir flask prometheus-client

FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin
COPY app.py .

RUN useradd -u 10001 appuser && chown -R appuser:appuser /app
USER appuser

EXPOSE 5000
CMD ["python", "app.py"]

Running as uid 10001 means that if the container is ever compromised, the attacker lands in an unprivileged account rather than root. Non-root containers are a common requirement in enterprise container security audits.

The result: an image that's roughly 60% smaller than a naive single-stage build, with significantly fewer Trivy findings.


Phase 2 — Infrastructure as Code with Terraform

The rule: if it can't be terraform apply'd, it doesn't exist.

I provisioned the full AWS environment — VPC, subnets, security groups, EC2, S3, IAM roles, and EKS — in code. No manual console clicks, ever.

# The entire network fabric
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}

resource "aws_subnet" "public"  { cidr_block = "10.0.1.0/24" ... }
resource "aws_subnet" "private" { cidr_block = "10.0.2.0/24" ... }

A few decisions worth explaining:

Why RDS gets a private subnet. The database should never be reachable from the internet, only from within the VPC. A private subnet has no route to an internet gateway, so this is enforced at the network layer, not just via security groups.
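The public/private split from the Terraform snippet can be sanity-checked before anything is applied. The following is a small sketch using Python's standard `ipaddress` module with the CIDR blocks shown above; the helper name `validate_layout` is hypothetical and the check is not part of the project.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = {
    "public": ipaddress.ip_network("10.0.1.0/24"),
    "private": ipaddress.ip_network("10.0.2.0/24"),
}


def validate_layout(vpc, subnets):
    """Return a list of problems; an empty list means the layout is consistent."""
    problems = []
    names = sorted(subnets)
    # Every subnet must sit inside the VPC range.
    for name in names:
        if not subnets[name].subnet_of(vpc):
            problems.append(f"{name} is outside the VPC range")
    # No two subnets may share addresses.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if subnets[a].overlaps(subnets[b]):
                problems.append(f"{a} overlaps {b}")
    return problems


problems = validate_layout(vpc, subnets)
print(problems)  # empty for the CIDRs used in this post
```

Terraform and AWS reject invalid layouts on their own, so a check like this is mainly useful in code review or a pre-commit hook, where it fails faster than a plan.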

Why I generate the EC2 SSH key via Terraform.

resource "tls_private_key" "rsa_key" {
  algorithm = "RSA"
  rsa_bits  = 4096
}

resource "aws_key_pair" "app_key" {
  key_name   = "${var.project_name}-key"
  public_key = tls_private_key.rsa_key.public_key_openssh
}

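No manual key generation, no keys sitting in someone's Downloads folder. The private key is a Terraform output marked sensitive = true, so it is hidden from CLI output and never lands in source control. It is still stored in plaintext in the state file, which means the state backend has to be encrypted and access-controlled.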

Why S3 for Terraform state. Local .tfstate files go out of sync between teammates and are catastrophic to lose. S3 with versioning keeps one shared copy of state and makes earlier versions recoverable.

The payoff: terraform apply brings up the entire environment in about 15 minutes. terraform destroy tears it down, and billing for those resources stops once they are gone. Reproducible, auditable, version-controlled infrastructure.


Phase 3 — Automated Deployment

The pipeline pushes two tags on every successful build: latest and the exact git SHA.

- uses: docker/build-push-action@v5
  with:
    context: ./backend
    push: true
    tags: |
      ${{ env.IMAGE_NAME }}:latest
      ${{ env.IMAGE_NAME }}:${{ github.sha }}

Why both? latest is for convenience. The SHA tag is for precision — you can roll back to any exact commit with a single command. This matters when you're debugging a production incident at midnight and need to know exactly what's running.


Phase 4 — Kubernetes + GitOps with ArgoCD

This is where it gets interesting.

The EKS cluster runs the app via a Helm chart. The chart manages replicas, resource limits, health probes, and autoscaling:

# values.yaml
replicaCount: 2

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 200m
    memory: 256Mi

autoscaling:
  minReplicas: 2
  maxReplicas: 5
  targetCPUUtilizationPercentage: 70

The HPA scaling target is 70%, not 90%. At 90% you're already overwhelmed, because new pods take time to start and warm up. 70% gives the cluster headroom to scale before traffic saturates the existing pods. Note that the percentage is measured against the CPU request (100m here), not the limit.
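The effect of the 70% target is easier to see with numbers. The sketch below approximates the scaling rule from the Kubernetes documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), using the min, max, and target from values.yaml. The 10% tolerance is the documented default; the function name and the sample utilization figures are made up for illustration, and the real controller also applies stabilization windows that are ignored here.

```python
import math


def desired_replicas(current, utilization_pct, target_pct=70,
                     min_replicas=2, max_replicas=5, tolerance=0.1):
    """Approximate the HPA rule: ceil(current * currentMetric / targetMetric)."""
    ratio = utilization_pct / target_pct
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: leave the replica count alone
    wanted = math.ceil(current * ratio)
    # Clamp to the bounds configured for the autoscaler.
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(2, 95))   # 2 * 95/70 = 2.71 -> 3
print(desired_replicas(2, 72))   # within tolerance -> 2
print(desired_replicas(4, 200))  # 11.43 -> 12, capped at 5
print(desired_replicas(4, 20))   # 1.14 -> 2, which is also the floor
```

With a 90% target the first case would not scale at all until the pods were nearly saturated, which is the headroom argument in code form.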

The GitOps Loop

Here's the part that makes this different from "deploy via kubectl":

# argocd/application.yaml
syncPolicy:
  automated:
    prune: true      # delete resources removed from Git
    selfHeal: true   # revert any manual cluster changes

When GitHub Actions updates the image tag in values.yaml and pushes the commit:

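  1. ArgoCD detects the change in Git on its next poll (every three minutes by default), or within seconds if a Git webhook is configured
  2. It triggers a rolling update on the cluster, with no downtime as long as readiness probes are in place
  3. If the new pods fail their health checks, the rollout stalls and the old pods typically keep serving; ArgoCD marks the app as degraded but does not roll back on its own, so recovery is a git revert (automated rollback needs something like Argo Rollouts or Flagger)
  4. If someone manually kubectl apply's a change to a managed resource, selfHeal reverts it within minutes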

Git is the single source of truth. The cluster is a reflection of the repo, not an independent entity that drifts over time.
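The idea behind prune and selfHeal can be shown with a toy reconciliation function. This is a conceptual sketch only, not how ArgoCD is implemented: resources are plain dictionaries, the resource names and image tags are invented, and `live` is assumed to contain only resources the application tracks.

```python
def reconcile(desired: dict, live: dict, prune: bool = True) -> dict:
    """Return the actions that make `live` match `desired`.

    Both arguments map a resource name to its spec.
    """
    actions = {"create": [], "update": [], "delete": []}
    for name in sorted(desired):
        if name not in live:
            actions["create"].append(name)
        elif live[name] != desired[name]:
            actions["update"].append(name)  # drift or new commit: Git wins
    if prune:
        # Anything in the cluster that Git no longer declares gets removed.
        actions["delete"] = sorted(name for name in live if name not in desired)
    return actions


git_state = {
    "deployment/devops-app": {"image": "devops-app:abc123"},
    "service/devops-app": {"port": 5000},
}
cluster_state = {
    "deployment/devops-app": {"image": "devops-app:hotfix"},  # manual kubectl edit
    "configmap/legacy": {"removed_from_git": True},
}

plan = reconcile(git_state, cluster_state)
print(plan)
```

Running the same function again after the plan is applied yields no actions, which is what "the cluster is a reflection of the repo" means in practice.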


Phase 5 — Full Observability Stack

You cannot operate what you cannot observe.

The Flask app exposes custom Prometheus metrics at /metrics:

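from prometheus_client import Counter, Histogram

# Counts every request, labelled so error rates can be split out by status code
REQUEST_COUNT = Counter(
    'app_requests_total',
    'Total number of requests',
    ['method', 'endpoint', 'status']
)

# Records request duration in seconds; exported as _bucket, _sum and _count series
REQUEST_LATENCY = Histogram(
    'app_request_latency_seconds',
    'Request duration',
    ['endpoint']
)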

A ServiceMonitor tells Prometheus to scrape the endpoint every 15 seconds. From there, four Grafana panels give full visibility:

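| Panel | Query |
| --- | --- |
| Request Rate | `rate(app_requests_total[5m])` |
| Error Rate % | `sum(rate(app_requests_total{status=~"5.."}[5m])) / sum(rate(app_requests_total[5m])) * 100` |
| P95 Latency | `histogram_quantile(0.95, rate(app_request_latency_seconds_bucket[5m]))` |
| Pod Restarts | `kube_pod_container_status_restarts_total{namespace="default"}` |

The error-rate query aggregates both sides with sum(). Without it, PromQL divides series with identical labels, so each 5xx series would be divided by itself and the panel would always read 100%.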
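The P95 panel relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts and does not see individual request timings. The sketch below reproduces the linear interpolation Prometheus applies to classic histograms, in plain Python. The bucket boundaries and counts are invented for illustration, and in the real query the counts are per-second rates, which does not change the arithmetic.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    The last bucket is expected to have upper_bound == math.inf.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0]
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts for 100 requests (bounds in seconds).
latency_buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 100), (math.inf, 100)]

p95 = histogram_quantile(0.95, latency_buckets)
p90 = histogram_quantile(0.90, latency_buckets)
print(round(p95, 3), round(p90, 3))
```

Because the result is interpolated, its accuracy depends on bucket boundaries: if most requests land in one wide bucket, the reported P95 can be far from the true value.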

AlertManager Rules

Four alerts fire to a Slack #devops-alerts channel. Two of them:

- alert: HighErrorRate
  # more than 0.1 5xx responses per second, per label combination
  expr: rate(app_requests_total{status=~"5.."}[5m]) > 0.1
  for: 2m
  labels:
    severity: critical

- alert: PodCrashLooping
  # any container restart within the last 15 minutes
  expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
  for: 5m
  labels:
    severity: critical

The for: 2m duration on error rate prevents false positives from a momentary spike. The alert only fires if the condition holds for two consecutive minutes — sustained degradation, not noise.
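The pending-then-firing behaviour of `for:` can be modelled in a few lines. This is a simplified sketch, assuming a 30-second evaluation interval (a value not stated in this post) and ignoring details such as missed evaluations; the function name `alert_states` and the sample values are hypothetical.

```python
def alert_states(samples, threshold=0.1, for_seconds=120, interval=30):
    """Return 'inactive', 'pending' or 'firing' for each evaluation.

    `samples` are successive values of the alert expression, one per
    evaluation `interval` (in seconds).
    """
    active_for = None  # seconds the condition has held; None while it is false
    states = []
    for value in samples:
        if value > threshold:
            active_for = 0 if active_for is None else active_for + interval
            states.append("firing" if active_for >= for_seconds else "pending")
        else:
            active_for = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


spike = alert_states([0.5, 0.5, 0.0, 0.0])
sustained = alert_states([0.5] * 6)
print(spike)
print(sustained)
```

The short spike never gets past pending, while the sustained run starts firing once the condition has held for the full two minutes.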


What I'd Do Differently

A few things I'd change building this again:

Multi-environment from the start. One Terraform workspace and one ArgoCD app works fine for learning, but the first thing you'd add in a real org is separate staging and prod environments with promotion gates between them.

Spot instances on the node group. The EKS worker nodes run on t3.small on-demand. Mixing in Spot instances with appropriate interruption handling could cut the compute cost substantially; my estimate is 60-70%.

OpenTelemetry instead of manual instrumentation. Hand-instrumenting the Flask app with Prometheus counters and histograms works, but OpenTelemetry gives you traces, metrics, and logs through a single SDK — and it's vendor-neutral.


The Full Stack at a Glance

| Category | Tool | Why |
| --- | --- | --- |
| Secret scanning | TruffleHog | Verified detections, not just regex |
| Dependency audit | Safety | CVE database for Python packages |
| Container scanning | Trivy | OS + package layer vulns |
| IaC | Terraform | Reproducible, version-controlled AWS |
| Orchestration | Kubernetes (EKS) | Self-healing, scalable containers |
| Packaging | Helm | Templated K8s manifests |
| GitOps | ArgoCD | Git as source of truth, auto-revert |
| Metrics | Prometheus | Custom app + node + cluster metrics |
| Dashboards | Grafana | Real-time visualisation |
| Alerting | AlertManager + Slack | Threshold-based incident paging |
| CI/CD | GitHub Actions | Pipeline on every push |

Repo

Everything is open source: github.com/ChetanEpuri/modern-devops-project

The README walks through prerequisites, getting started locally with Docker Compose, provisioning the full cloud stack, and connecting to the ArgoCD and Grafana dashboards.

If any of this is useful or you're building something similar, drop a comment. I'm particularly interested in talking to people who've taken GitOps patterns further — multi-cluster setups, progressive delivery with Flagger, that kind of thing.
