Most DevOps tutorials show you how to push a Docker image to DockerHub and call it a day. This is not that post.
I spent weeks building a platform that mirrors what actually runs inside companies like Stripe, Notion, or Cloudflare — automated security gates, infrastructure as code, self-healing Kubernetes deployments, and a full observability stack that pages you on Slack at 3am. Every decision was deliberate. Every tool earns its place.
Here's the whole thing, phase by phase.
The Goal
The challenge I set myself: build a platform where:
- No code reaches production without passing security checks — automatically
- Infrastructure is version-controlled — no manual clicking in AWS consoles
- Deployments are zero-touch — git push is the only operator action
- The cluster corrects itself: manual changes get reverted, and a failed rollout leaves the healthy pods serving
- You can see everything — metrics, dashboards, and alerts firing to Slack
The app itself is intentionally boring: a Flask API with three endpoints. The infrastructure is the point.
Phase 1 — DevSecOps CI Pipeline
Security as an afterthought is how you end up on HaveIBeenPwned. I baked it into the pipeline from day one.
Every push to main runs three security checks, plus a local image build for the scanner, before a single byte gets deployed:
jobs:
  security-scan:
    steps:
      - uses: trufflesecurity/trufflehog@main     # leaked secrets
        with:
          extra_args: --only-verified
      - run: pip install safety && safety check   # CVE audit on deps
      - run: docker build -t devops-app ./backend # build locally for scanning
      - uses: aquasecurity/trivy-action@master    # OS-level vuln scan
        with:
          image-ref: devops-app                   # scan the image built in the previous step
          severity: 'CRITICAL,HIGH'
          exit-code: '1'                          # fail the job on findings (the action's default is 0)
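TruffleHog scans every commit diff for leaked API keys, tokens, and passwords. With --only-verified it reports only credentials it could confirm against the live service, not every regex match. Safety audits Python dependencies against a database of known vulnerabilities; note that safety check with no arguments audits whatever is installed in the current environment, so the app's dependencies have to be installed in the runner (or passed with -r) for the audit to cover them. Trivy scans the built container image for OS-level vulnerabilities.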
The pipeline only continues to build-and-push if all three pass. Security is a gate, not a suggestion.
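To make "gate" concrete, here is a minimal Python sketch of the decision a scanner step makes: read a JSON report and exit non-zero when a CRITICAL or HIGH finding is present. The function name `blocking_findings` and the sample report are made up for illustration. The `Results` / `Vulnerabilities` / `Severity` keys follow Trivy's JSON output as I understand it, so treat them as an assumption and check them against your Trivy version.

```python
import json

# Severities that should stop the pipeline (mirrors severity: 'CRITICAL,HIGH')
BLOCKING = frozenset({"CRITICAL", "HIGH"})


def blocking_findings(report: dict, blocking=BLOCKING) -> list:
    """Return the IDs of vulnerabilities whose severity should fail the build."""
    found = []
    for result in report.get("Results") or []:
        # "Vulnerabilities" may be missing or null when a target is clean
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in blocking:
                found.append(vuln.get("VulnerabilityID", "unknown"))
    return found


# Hypothetical report, trimmed to the fields used above
sample = json.loads("""
{"Results": [
  {"Target": "devops-app (debian)", "Vulnerabilities": [
    {"VulnerabilityID": "CVE-0000-0001", "Severity": "HIGH"},
    {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"}
  ]},
  {"Target": "Python", "Vulnerabilities": null}
]}
""")

failures = blocking_findings(sample)
exit_code = 1 if failures else 0  # a non-zero exit is what stops the next job
print(failures, exit_code)
```

The point of the sketch is the last two lines: a scanner that always exits 0 is a report, not a gate.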
The Dockerfile
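Multi-stage builds are non-negotiable in production. The builder stage installs dependencies; the final image copies only the installed packages and their entry-point scripts, leaving behind any compilers, headers, or other build-time tooling that would expand the attack surface. (The python:3.11-slim base image still ships pip itself, so the runtime image is not entirely pip-free.)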
FROM python:3.11-slim AS builder
WORKDIR /app
RUN pip install --no-cache-dir flask prometheus-client
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin
COPY app.py .
RUN useradd -u 10001 appuser && chown -R appuser:appuser /app
USER appuser
EXPOSE 5000
CMD ["python", "app.py"]
Running as uid 10001 means that if the container is ever compromised, the attacker lands in an unprivileged account rather than root. Non-root containers are a standard item on enterprise container security checklists.
The result: an image that's roughly 60% smaller than a naive single-stage build, with significantly fewer Trivy findings.
Phase 2 — Infrastructure as Code with Terraform
The rule: if it can't be terraform apply'd, it doesn't exist.
I provisioned the full AWS environment — VPC, subnets, security groups, EC2, S3, IAM roles, and EKS — in code. No manual console clicks, ever.
# The entire network fabric
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}

resource "aws_subnet" "public"  { cidr_block = "10.0.1.0/24" ... }
resource "aws_subnet" "private" { cidr_block = "10.0.2.0/24" ... }
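As a sanity check on that CIDR layout, here is a small Python sketch using only the standard library. It confirms that both subnets sit inside the VPC range and do not overlap. The dictionary keys are just labels for this example; the CIDR blocks are the ones from the Terraform above.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = {
    "public": ipaddress.ip_network("10.0.1.0/24"),
    "private": ipaddress.ip_network("10.0.2.0/24"),
}

for name, net in subnets.items():
    # Every subnet must be carved out of the VPC range
    inside = net.subnet_of(vpc)
    print(name, net, "inside VPC:", inside, "addresses:", net.num_addresses)

# The two subnets must not share any addresses
overlap = subnets["public"].overlaps(subnets["private"])
print("overlap:", overlap)
```

Each /24 holds 256 addresses; AWS reserves five per subnet, so 251 are usable.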
A few decisions worth explaining:
Why RDS gets a private subnet. The database should never be reachable from the internet, only from within the VPC. This is enforced by routing: a private subnet has no route to an internet gateway, so security groups are a second layer of defence rather than the only one.
Why I generate the EC2 SSH key via Terraform.
resource "tls_private_key" "rsa_key" {
  algorithm = "RSA"
  rsa_bits  = 4096
}

resource "aws_key_pair" "app_key" {
  key_name   = "${var.project_name}-key"
  public_key = tls_private_key.rsa_key.public_key_openssh
}
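No manual key generation, no keys sitting in someone's Downloads folder. The private key is a Terraform output marked sensitive = true, so it exists in state, not in source control. The trade-off is that tls_private_key stores the key unencrypted in the state file, so the state backend has to be encrypted and access-controlled.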
Why S3 for Terraform state. Local .tfstate files go out of sync between teammates and are catastrophic to lose. S3 with versioning means state is always current and recoverable.
The payoff: terraform apply brings up the entire environment in about 15 minutes. terraform destroy tears it down, and billing stops once the resources are gone. Reproducible, auditable, version-controlled infrastructure.
Phase 3 — Automated Deployment
The pipeline pushes two tags on every successful build: latest and the exact git SHA.
- uses: docker/build-push-action@v5
  with:
    context: ./backend
    push: true
    tags: |
      ${{ env.IMAGE_NAME }}:latest
      ${{ env.IMAGE_NAME }}:${{ github.sha }}
Why both? latest is for convenience. The SHA tag is for precision — you can roll back to any exact commit with a single command. This matters when you're debugging a production incident at midnight and need to know exactly what's running.
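A minimal sketch of the tagging rule, in Python for illustration. The function name `image_refs` and the image name `example/devops-app` are hypothetical; the only facts taken from the pipeline are the two tags and that github.sha is a full 40-character commit hash.

```python
import re

# github.sha is the full 40-character hex commit hash
FULL_SHA = re.compile(r"[0-9a-f]{40}")


def image_refs(image_name: str, git_sha: str) -> list:
    """Return the two references pushed per build: one moving, one immutable."""
    if not FULL_SHA.fullmatch(git_sha):
        raise ValueError(f"not a full git SHA: {git_sha!r}")
    return [f"{image_name}:latest", f"{image_name}:{git_sha}"]


# Hypothetical image name and commit hash
sha = "3f2a9c1" + "0" * 33
refs = image_refs("example/devops-app", sha)
print(refs)
```

Only the second reference is safe to roll back to: latest moves with every build, while the SHA tag always points at the same image.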
Phase 4 — Kubernetes + GitOps with ArgoCD
This is where it gets interesting.
The EKS cluster runs the app via a Helm chart. The chart manages replicas, resource limits, health probes, and autoscaling:
# values.yaml
replicaCount: 2
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 200m
    memory: 256Mi
autoscaling:
  minReplicas: 2
  maxReplicas: 5
  targetCPUUtilizationPercentage: 70
The HPA scaling target is 70%, not 90%. At 90% you're already overwhelmed — new pods take time to start and warm up. 70% gives the cluster headroom to scale before traffic saturates the existing pods.
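To see what the 70% target does in practice, here is a sketch of the scaling rule the Kubernetes docs describe: desired = ceil(current × currentUtilization / target), skipped when the ratio is within a tolerance (10% by default) and clamped to the min/max from values.yaml. Utilization is measured against the CPU request, so 70% here means an average of 70m per pod. The function name and the sample loads are mine, and the real controller also handles things like unready pods and stabilization windows that this ignores.

```python
import math


def desired_replicas(current: int, cpu_percent: float, target: float = 70.0,
                     min_replicas: int = 2, max_replicas: int = 5,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA rule: scale proportionally to utilization / target."""
    ratio = cpu_percent / target
    # Within the tolerance band the autoscaler leaves the replica count alone
    if abs(ratio - 1.0) <= tolerance:
        return current
    wanted = math.ceil(current * ratio)
    # Clamp to the bounds configured in the chart
    return max(min_replicas, min(max_replicas, wanted))


# Average CPU utilization (percent of request) across 2 running pods
for load in (50, 75, 90, 180, 400):
    print(load, "->", desired_replicas(2, load))
```

With two pods, 75% stays put (inside the tolerance band), 90% adds a third pod, and anything from 180% up hits the cap of five.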
The GitOps Loop
Here's the part that makes this different from "deploy via kubectl":
# argocd/application.yaml
syncPolicy:
  automated:
    prune: true     # delete resources removed from Git
    selfHeal: true  # revert any manual cluster changes
When GitHub Actions updates the image tag in values.yaml and pushes the commit:
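- ArgoCD detects the change in Git on its next poll (every three minutes by default, or within seconds if a Git webhook is configured)
- Triggers a rolling update on the cluster, with zero downtime
- If the new pods fail their health probes, the rolling update stalls and the old pods keep serving while ArgoCD reports the app as unhealthy. ArgoCD does not roll back on its own; going back means reverting the commit, or adding a progressive delivery tool such as Argo Rollouts or Flagger
- If someone manually kubectl apply's something directly to the cluster, ArgoCD reverts it within minutes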
Git is the single source of truth. The cluster is a reflection of the repo, not an independent entity that drifts over time.
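The loop is easier to reason about as a diff. Below is a toy Python model of one sync pass: compare the desired state from Git with the live cluster state and list the actions needed to close the gap. It is a sketch of the idea, not ArgoCD's implementation, and the resource names, specs, and the function `plan_sync` are invented for the example.

```python
def plan_sync(desired: dict, live: dict, git_changed: bool,
              prune: bool = True, self_heal: bool = True) -> list:
    """Return (action, resource) pairs that would make `live` match `desired`."""
    # Automated sync reacts to new commits; selfHeal also reacts to live drift
    if not git_changed and not self_heal:
        return []
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))
    if prune:
        # Anything running that Git no longer declares gets removed
        actions += [("delete", name) for name in sorted(live) if name not in desired]
    return actions


# Hypothetical state: someone hot-patched the image and added a ConfigMap by hand
desired = {
    "deployment/app": {"image": "app:abc123"},
    "service/app": {"port": 5000},
}
live = {
    "deployment/app": {"image": "app:hotfix"},
    "service/app": {"port": 5000},
    "configmap/debug": {"verbose": "true"},
}

plan = plan_sync(desired, live, git_changed=False)
print(plan)
```

With selfHeal on, the hand-made changes are undone even though nothing changed in Git; with selfHeal off and no new commit, the same call returns an empty plan and the drift stays.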
Phase 5 — Full Observability Stack
You cannot operate what you cannot observe.
The Flask app exposes custom Prometheus metrics at /metrics:
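from prometheus_client import Counter, Histogram

# Counts every request, labelled so dashboards can slice by method, endpoint, and status
REQUEST_COUNT = Counter(
    'app_requests_total',
    'Total number of requests',
    ['method', 'endpoint', 'status']
)
# Records request duration per endpoint; exported as _bucket, _sum, and _count series
REQUEST_LATENCY = Histogram(
    'app_request_latency_seconds',
    'Request duration',
    ['endpoint']
)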
A ServiceMonitor tells Prometheus to scrape the endpoint every 15 seconds. From there, four Grafana panels give full visibility:
| Panel | Query |
|---|---|
| Request Rate | rate(app_requests_total[5m]) |
| Error Rate % | sum(rate(app_requests_total{status=~"5.."}[5m])) / sum(rate(app_requests_total[5m])) * 100 |
| P95 Latency | histogram_quantile(0.95, sum by (le) (rate(app_request_latency_seconds_bucket[5m]))) |
| Pod Restarts | kube_pod_container_status_restarts_total{namespace="default"} |
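The error-rate and P95 panels boil down to two small calculations, sketched here in plain Python. `error_percent` sums requests across all label sets before dividing, and `histogram_quantile` mirrors how Prometheus estimates a quantile from cumulative buckets: find the bucket the target rank falls into and interpolate linearly inside it. The sample numbers are invented, and this is a simplified model of the PromQL functions, not their source.

```python
import math


def error_percent(requests_by_status: dict) -> float:
    """Share of 5xx responses across all label sets, as a percentage."""
    total = sum(requests_by_status.values())
    errors = sum(n for status, n in requests_by_status.items() if status.startswith("5"))
    return 100.0 * errors / total if total else 0.0


def histogram_quantile(q: float, buckets: list) -> float:
    """buckets: (upper_bound, cumulative_count) pairs, sorted, ending with +Inf."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Rank lands in the +Inf bucket: report the highest finite bound
                return lower_bound
            # Linear interpolation between the bucket's edges
            return lower_bound + (upper - lower_bound) * (rank - lower_count) / (count - lower_count)
        lower_bound, lower_count = upper, count
    return lower_bound


# Invented sample: request counts by status, and latency buckets in seconds
pct = error_percent({"200": 970, "404": 20, "500": 10})
p95 = histogram_quantile(0.95, [(0.1, 40), (0.25, 70), (0.5, 90), (1.0, 100), (math.inf, 100)])
print(pct, p95)
```

The interpolation is why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile falls into.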
AlertManager Rules
Four alerts fire to a Slack #devops-alerts channel. Two of them:
- alert: HighErrorRate
  expr: rate(app_requests_total{status=~"5.."}[5m]) > 0.1
  for: 2m
  labels:
    severity: critical
- alert: PodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
  for: 5m
  labels:
    severity: critical
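The for: 2m duration on HighErrorRate prevents false positives from a momentary spike. The alert fires only if the condition holds on every evaluation for two minutes: sustained degradation, not noise. Note that the threshold is absolute (more than 0.1 5xx responses per second for a given label set), not a percentage of traffic.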
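The effect of for: is easiest to see as a small state machine. This Python sketch walks through a series of rule evaluations and reports whether the alert is inactive, pending, or firing at each one. The 30-second evaluation interval and the sample timeline are assumptions for the example, and `alert_states` is my own helper, not a Prometheus API.

```python
def alert_states(samples: list, for_seconds: int) -> list:
    """samples: (timestamp_seconds, condition_true) per rule evaluation."""
    states = []
    active_since = None
    for ts, breached in samples:
        if not breached:
            # Any evaluation where the condition is false resets the clock
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held_for = ts - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# One-off spike at t=30, then a sustained breach starting at t=90
timeline = [(0, False), (30, True), (60, False),
            (90, True), (120, True), (150, True), (180, True), (210, True)]
states = alert_states(timeline, for_seconds=120)
print(states)
```

The spike at t=30 only ever reaches pending; the sustained breach turns into a firing alert at t=210, two minutes after it started.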
What I'd Do Differently
A few things I'd change building this again:
Multi-environment from the start. One Terraform workspace and one ArgoCD app works fine for learning, but the first thing you'd add in a real org is separate staging and prod environments with promotion gates between them.
Spot instances on the node group. The EKS worker nodes run on t3.small on-demand. Mixing in Spot instances with appropriate interruption handling could cut the compute cost substantially, something like 60-70% depending on instance type and region.
OpenTelemetry instead of manual instrumentation. Hand-instrumenting the Flask app with Prometheus counters and histograms works, but OpenTelemetry gives you traces, metrics, and logs through a single SDK — and it's vendor-neutral.
The Full Stack at a Glance
| Category | Tool | Why |
|---|---|---|
| Secret scanning | TruffleHog | Verified detections, not just regex |
| Dependency audit | Safety | CVE database for Python packages |
| Container scanning | Trivy | OS + package layer vulns |
| IaC | Terraform | Reproducible, version-controlled AWS |
| Orchestration | Kubernetes (EKS) | Self-healing, scalable containers |
| Packaging | Helm | Templated K8s manifests |
| GitOps | ArgoCD | Git as source of truth, auto-revert |
| Metrics | Prometheus | Custom app + node + cluster metrics |
| Dashboards | Grafana | Real-time visualisation |
| Alerting | AlertManager + Slack | Threshold-based incident paging |
| CI/CD | GitHub Actions | Pipeline on every push |
Repo
Everything is open source: github.com/ChetanEpuri/modern-devops-project
The README walks through prerequisites, getting started locally with Docker Compose, provisioning the full cloud stack, and connecting to the ArgoCD and Grafana dashboards.
If any of this is useful or you're building something similar, drop a comment. I'm particularly interested in talking to people who've taken GitOps patterns further — multi-cluster setups, progressive delivery with Flagger, that kind of thing.