<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Chetan</title>
    <description>The latest articles on DEV Community by Chetan (@chetanepuri).</description>
    <link>https://dev.to/chetanepuri</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3960175%2F391d7cf4-e817-40cc-8f92-66066a69bd88.jpeg</url>
      <title>DEV Community: Chetan</title>
      <link>https://dev.to/chetanepuri</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/chetanepuri"/>
    <language>en</language>
    <item>
      <title>I Built a Production-Grade DevSecOps Platform From Scratch — Here's Every Decision I Made</title>
      <dc:creator>Chetan</dc:creator>
      <pubDate>Sat, 30 May 2026 16:04:48 +0000</pubDate>
      <link>https://dev.to/chetanepuri/i-built-a-production-grade-devsecops-platform-from-scratch-heres-every-decision-i-made-394l</link>
      <guid>https://dev.to/chetanepuri/i-built-a-production-grade-devsecops-platform-from-scratch-heres-every-decision-i-made-394l</guid>
      <description>&lt;p&gt;Most DevOps tutorials show you how to push a Docker image to DockerHub and call it a day. This is not that post.&lt;/p&gt;

&lt;p&gt;I spent weeks building a platform that mirrors what actually runs inside companies like Stripe, Notion, or Cloudflare — automated security gates, infrastructure as code, self-healing Kubernetes deployments, and a full observability stack that pages you on Slack at 3am. Every decision was deliberate. Every tool earns its place.&lt;/p&gt;

&lt;p&gt;Here's the whole thing, phase by phase.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Goal
&lt;/h2&gt;

&lt;p&gt;The challenge I set myself: build a platform where:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No code reaches production without passing security checks&lt;/strong&gt; — automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure is version-controlled&lt;/strong&gt; — no manual clicking in AWS consoles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployments are zero-touch&lt;/strong&gt; — git push is the only operator action&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The cluster corrects itself&lt;/strong&gt; — manual changes get reverted, failed deploys roll back&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You can see everything&lt;/strong&gt; — metrics, dashboards, and alerts firing to Slack&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The app itself is intentionally boring: a Flask API with three endpoints. The infrastructure is the point.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 1 — DevSecOps CI Pipeline
&lt;/h2&gt;

&lt;p&gt;Security as an afterthought is how you end up on HaveIBeenPwned. I baked it into the pipeline from day one.&lt;/p&gt;

&lt;p&gt;Every push to &lt;code&gt;main&lt;/code&gt; triggers four sequential steps (three security checks, plus the image build that Trivy needs) before a single byte gets deployed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;security-scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trufflesecurity/trufflehog@main&lt;/span&gt;    &lt;span class="c1"&gt;# leaked secrets&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;extra_args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;--only-verified&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pip install safety &amp;amp;&amp;amp; safety check&lt;/span&gt;  &lt;span class="c1"&gt;# CVE audit on deps&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker build -t devops-app ./backend&lt;/span&gt; &lt;span class="c1"&gt;# build locally for scanning&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@master&lt;/span&gt;    &lt;span class="c1"&gt;# OS-level vuln scan&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
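          &lt;span class="na"&gt;image-ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;devops-app&lt;/span&gt;               &lt;span class="c1"&gt;# the image built in the previous step&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CRITICAL,HIGH'&lt;/span&gt;
          &lt;span class="na"&gt;exit-code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;                      &lt;span class="c1"&gt;# fail the job when findings exist&lt;/span&gt;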
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;TruffleHog&lt;/strong&gt; scans every commit diff for leaked API keys, tokens, and passwords. With &lt;code&gt;--only-verified&lt;/code&gt; it reports only credentials it has confirmed against the live service, not every regex match. &lt;strong&gt;Safety&lt;/strong&gt; audits Python dependencies against a database of known vulnerabilities. &lt;strong&gt;Trivy&lt;/strong&gt; scans the built container image for vulnerabilities in OS packages and bundled dependencies.&lt;/p&gt;

&lt;p&gt;The pipeline only continues to build-and-push if all three pass. Security is a gate, not a suggestion.&lt;/p&gt;
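
&lt;p&gt;To make "gate" concrete, here is a minimal Python sketch of the decision such a step makes, assuming Trivy was run with &lt;code&gt;--format json&lt;/code&gt;. The function names and the sample report (including its placeholder CVE IDs) are hypothetical and not taken from the repo; in the actual pipeline the Trivy action's own exit code can do this job.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

BLOCKING = {"CRITICAL", "HIGH"}  # same severities as the Trivy step above


def blocking_findings(report_json, blocking=BLOCKING):
    """Return (target, vulnerability id, severity) for each blocking finding."""
    report = json.loads(report_json)
    findings = []
    # Trivy's JSON report lists one entry per scanned target under "Results".
    for result in report.get("Results") or []:
        # "Vulnerabilities" is absent when a target has no findings.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in blocking:
                findings.append(
                    (result.get("Target"), vuln.get("VulnerabilityID"), vuln.get("Severity"))
                )
    return findings


def gate_exit_code(report_json):
    """Exit code for the CI step: 1 blocks the pipeline, 0 lets it continue."""
    return 1 if blocking_findings(report_json) else 0


# Hypothetical, hand-written report for illustration only.
SAMPLE_REPORT = json.dumps({
    "Results": [
        {
            "Target": "devops-app (debian 12)",
            "Vulnerabilities": [
                {"VulnerabilityID": "CVE-0000-0001", "Severity": "HIGH"},
                {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"},
            ],
        },
        {"Target": "Python"},
    ]
})

if __name__ == "__main__":
    print(blocking_findings(SAMPLE_REPORT))
    print(gate_exit_code(SAMPLE_REPORT))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The point of the sketch is that the LOW finding is ignored while a single HIGH is enough to return a non-zero exit code, which is what stops the build-and-push job from starting.&lt;/p&gt;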

&lt;h3&gt;
  
  
  The Dockerfile
&lt;/h3&gt;

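&lt;p&gt;Multi-stage builds are non-negotiable in production. The builder stage installs dependencies; the final image copies only the installed packages and their entry-point scripts, leaving behind whatever caches or build tools the builder stage accumulated. (The &lt;code&gt;python:3.11-slim&lt;/code&gt; base still ships its own pip, so the saving comes from what the builder stage no longer drags along.)&lt;br&gt;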
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;python:3.11-slim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; flask prometheus-client

&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.11-slim&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /usr/local/bin /usr/local/bin&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; app.py .&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;useradd &lt;span class="nt"&gt;-u&lt;/span&gt; 10001 appuser &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; appuser:appuser /app
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; appuser&lt;/span&gt;

&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 5000&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["python", "app.py"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running as &lt;code&gt;uid 10001&lt;/code&gt; means that if the container is ever compromised, the attacker lands as an unprivileged user rather than root. Non-root containers are a standard requirement in enterprise container security audits.&lt;/p&gt;

&lt;p&gt;The result: an image that's roughly 60% smaller than a naive single-stage build, with significantly fewer Trivy findings.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 2 — Infrastructure as Code with Terraform
&lt;/h2&gt;

&lt;p&gt;The rule: if it can't be &lt;code&gt;terraform apply&lt;/code&gt;'d, it doesn't exist.&lt;/p&gt;

&lt;p&gt;I provisioned the full AWS environment — VPC, subnets, security groups, EC2, S3, IAM roles, and EKS — in code. No manual console clicks, ever.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The entire network fabric&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc"&lt;/span&gt; &lt;span class="s2"&gt;"main"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_block&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.0.0/16"&lt;/span&gt;
  &lt;span class="nx"&gt;enable_dns_hostnames&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;enable_dns_support&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_subnet"&lt;/span&gt; &lt;span class="s2"&gt;"public"&lt;/span&gt;  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;cidr_block&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.1.0/24"&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_subnet"&lt;/span&gt; &lt;span class="s2"&gt;"private"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;cidr_block&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.2.0/24"&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few decisions worth explaining:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why RDS gets a private subnet.&lt;/strong&gt; The database should never be reachable from the internet, only from within the VPC. A private subnet has no route to an internet gateway, so this is enforced by routing itself, not just by security group rules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why I generate the EC2 SSH key via Terraform.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"tls_private_key"&lt;/span&gt; &lt;span class="s2"&gt;"rsa_key"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;algorithm&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"RSA"&lt;/span&gt;
  &lt;span class="nx"&gt;rsa_bits&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4096&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_key_pair"&lt;/span&gt; &lt;span class="s2"&gt;"app_key"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;key_name&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.project_name}-key"&lt;/span&gt;
  &lt;span class="nx"&gt;public_key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;tls_private_key&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rsa_key&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public_key_openssh&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



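&lt;p&gt;No manual key generation, no keys sitting in someone's Downloads folder. The private key is a Terraform output marked &lt;code&gt;sensitive = true&lt;/code&gt;, so it is redacted from CLI output and lives in state, not in source control. State stores it in plain text, though, so the state backend needs encryption and tight access control.&lt;/p&gt;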

&lt;p&gt;&lt;strong&gt;Why S3 for Terraform state.&lt;/strong&gt; Local &lt;code&gt;.tfstate&lt;/code&gt; files go out of sync between teammates and are catastrophic to lose. S3 with versioning means state is always current and recoverable.&lt;/p&gt;

&lt;p&gt;The payoff: &lt;code&gt;terraform apply&lt;/code&gt; brings up the entire environment in about 15 minutes. &lt;code&gt;terraform destroy&lt;/code&gt; tears it down again, so the resources it manages stop accruing charges once the run completes. Reproducible, auditable, version-controlled infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 3 — Automated Deployment
&lt;/h2&gt;

&lt;p&gt;The pipeline pushes two tags on every successful build: &lt;code&gt;latest&lt;/code&gt; and the exact git SHA.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/build-push-action@v5&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./backend&lt;/span&gt;
    &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;${{ env.IMAGE_NAME }}:latest&lt;/span&gt;
      &lt;span class="s"&gt;${{ env.IMAGE_NAME }}:${{ github.sha }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why both? &lt;code&gt;latest&lt;/code&gt; is for convenience. The SHA tag is for precision — you can roll back to any exact commit with a single command. This matters when you're debugging a production incident at midnight and need to know exactly what's running.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 4 — Kubernetes + GitOps with ArgoCD
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting.&lt;/p&gt;

&lt;p&gt;The EKS cluster runs the app via a Helm chart. The chart manages replicas, resource limits, health probes, and autoscaling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# values.yaml&lt;/span&gt;
&lt;span class="na"&gt;replicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;

&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100m&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;128Mi&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;200m&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;256Mi&lt;/span&gt;

&lt;span class="na"&gt;autoscaling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;targetCPUUtilizationPercentage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;70&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The HPA scaling target is 70%, not 90%. At 90% you're already overwhelmed — new pods take time to start and warm up. 70% gives the cluster headroom to scale before traffic saturates the existing pods.&lt;/p&gt;
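
&lt;p&gt;The arithmetic behind that choice is easy to check. The sketch below applies the scaling formula from the Kubernetes HPA documentation, &lt;code&gt;ceil(currentReplicas * currentMetric / targetMetric)&lt;/code&gt;, clamped to the bounds from &lt;code&gt;values.yaml&lt;/code&gt;. The function name is my own, and the sketch deliberately ignores the HPA's tolerance band and stabilization windows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math


def desired_replicas(current_replicas, current_cpu_percent, target_cpu_percent=70,
                     min_replicas=2, max_replicas=5):
    """Replica count per the HPA scaling formula, clamped to the configured bounds."""
    raw = math.ceil(current_replicas * current_cpu_percent / target_cpu_percent)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Two pods averaging 85% CPU: compare a 70% target with a 90% target.
    print(desired_replicas(2, 85, target_cpu_percent=70))
    print(desired_replicas(2, 85, target_cpu_percent=90))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With two pods averaging 85% CPU, a 70% target asks for a third pod, while a 90% target leaves the count at two, so the existing pods keep absorbing load until they are nearly saturated.&lt;/p&gt;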

&lt;h3&gt;
  
  
  The GitOps Loop
&lt;/h3&gt;

&lt;p&gt;Here's the part that makes this different from "deploy via kubectl":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# argocd/application.yaml&lt;/span&gt;
&lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;automated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;prune&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;      &lt;span class="c1"&gt;# delete resources removed from Git&lt;/span&gt;
    &lt;span class="na"&gt;selfHeal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;   &lt;span class="c1"&gt;# revert any manual cluster changes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When GitHub Actions updates the image tag in &lt;code&gt;values.yaml&lt;/code&gt; and pushes the commit:&lt;/p&gt;

&lt;ol&gt;
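&lt;li&gt;ArgoCD detects the change in Git on its next poll (every few minutes by default, or within seconds if a Git webhook is configured)&lt;/li&gt;
&lt;li&gt;Triggers a rolling update on the cluster with zero downtime&lt;/li&gt;
&lt;li&gt;If the new pods fail their health probes, Kubernetes holds the rolling update while the old pods keep serving, and ArgoCD marks the app Degraded. ArgoCD does not roll back on its own, so the rollback is a &lt;code&gt;git revert&lt;/code&gt; of the image-tag commit&lt;/li&gt;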
&lt;li&gt;If someone manually &lt;code&gt;kubectl apply&lt;/code&gt;'s something directly to the cluster, ArgoCD reverts it within minutes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Git is the single source of truth. The cluster is a reflection of the repo, not an independent entity that drifts over time.&lt;/p&gt;
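
&lt;p&gt;Conceptually, the loop is a diff between what Git declares and what the cluster runs. The following is a toy Python model of that idea, not ArgoCD's actual implementation: resources are plain dicts, the resource names are made up, and it only covers the drift case where Git has not changed but the cluster has.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def reconcile(desired, live, prune=True, self_heal=True):
    """Return the actions needed to make `live` match `desired`.

    Both arguments map a resource name to its spec (plain dicts here).
    """
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec and self_heal:
            actions.append(("update", name))  # drift: put back what Git declares
    if prune:
        for name in sorted(live):
            if name not in desired:
                actions.append(("delete", name))  # no longer declared in Git
    return actions


if __name__ == "__main__":
    # Hypothetical state: someone scaled the deployment by hand and left a
    # debug ConfigMap behind, and the Service was deleted from the cluster.
    desired = {
        "deployment/devops-app": {"replicas": 2},
        "service/devops-app": {"port": 5000},
    }
    live = {
        "deployment/devops-app": {"replicas": 4},
        "configmap/debug": {"verbose": True},
    }
    for action in reconcile(desired, live):
        print(action)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With &lt;code&gt;prune&lt;/code&gt; and &lt;code&gt;self_heal&lt;/code&gt; both off, the same call only recreates the missing Service, which is why those two flags are what turn "sync on commit" into "the cluster always matches the repo".&lt;/p&gt;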




&lt;h2&gt;
  
  
  Phase 5 — Full Observability Stack
&lt;/h2&gt;

&lt;p&gt;You cannot operate what you cannot observe.&lt;/p&gt;

&lt;p&gt;The Flask app exposes custom Prometheus metrics at &lt;code&gt;/metrics&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;REQUEST_COUNT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;app_requests_total&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Total number of requests&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;method&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;REQUEST_LATENCY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;app_request_latency_seconds&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Request duration&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A &lt;code&gt;ServiceMonitor&lt;/code&gt; tells Prometheus to scrape the endpoint every 15 seconds. From there, four Grafana panels give full visibility:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Panel&lt;/th&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Request Rate&lt;/td&gt;
&lt;td&gt;&lt;code&gt;rate(app_requests_total[5m])&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error Rate %&lt;/td&gt;
&lt;td&gt;&lt;code&gt;sum(rate(app_requests_total{status=~"5.."}[5m])) / sum(rate(app_requests_total[5m])) * 100&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P95 Latency&lt;/td&gt;
&lt;td&gt;&lt;code&gt;histogram_quantile(0.95, rate(app_request_latency_seconds_bucket[5m]))&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pod Restarts&lt;/td&gt;
&lt;td&gt;&lt;code&gt;kube_pod_container_status_restarts_total{namespace="default"}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
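
&lt;p&gt;The P95 panel is an estimate, not an exact percentile: Prometheus only knows how many requests fell into each latency bucket, so &lt;code&gt;histogram_quantile&lt;/code&gt; interpolates linearly inside the bucket that contains the requested rank. Here is a simplified Python sketch of that interpolation for classic histograms; the bucket boundaries and counts are invented for illustration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import bisect
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (q above 0, at most 1) from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, with at least one finite bucket and a final (math.inf, total).
    """
    bounds = [le for le, _ in buckets]
    counts = [count for _, count in buckets]
    total = counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    i = bisect.bisect_left(counts, rank)
    if bounds[i] == math.inf:
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return bounds[-2]
    lower = bounds[i - 1] if i else 0.0
    below = counts[i - 1] if i else 0
    # Assume observations are spread evenly inside the bucket.
    return lower + (bounds[i] - lower) * (rank - below) / (counts[i] - below)


if __name__ == "__main__":
    # Invented data: 100 requests, cumulative counts per latency bound (seconds).
    sample = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
    print(histogram_quantile(0.95, sample))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The practical consequence is that the panel's accuracy depends on where the bucket boundaries sit, so it pays to place them around the latencies you actually care about.&lt;/p&gt;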

&lt;h3&gt;
  
  
  AlertManager Rules
&lt;/h3&gt;

&lt;p&gt;Four alerts fire to a Slack &lt;code&gt;#devops-alerts&lt;/code&gt; channel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HighErrorRate&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rate(app_requests_total{status=~"5.."}[5m]) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0.1&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodCrashLooping&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rate(kube_pod_container_status_restarts_total[15m]) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;for: 2m&lt;/code&gt; duration on error rate prevents false positives from a momentary spike. The alert only fires if the condition holds for two consecutive minutes — sustained degradation, not noise.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;A few things I'd change building this again:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-environment from the start.&lt;/strong&gt; One Terraform workspace and one ArgoCD app works fine for learning, but the first thing you'd add in a real org is separate &lt;code&gt;staging&lt;/code&gt; and &lt;code&gt;prod&lt;/code&gt; environments with promotion gates between them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spot instances on the node group.&lt;/strong&gt; The EKS worker nodes run on &lt;code&gt;t3.small&lt;/code&gt; on-demand. Mixing in Spot instances with appropriate interruption handling could cut the compute cost by roughly 60-70%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenTelemetry instead of manual instrumentation.&lt;/strong&gt; Hand-instrumenting the Flask app with Prometheus counters and histograms works, but OpenTelemetry gives you traces, metrics, and logs through a single SDK — and it's vendor-neutral.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Full Stack at a Glance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Secret scanning&lt;/td&gt;
&lt;td&gt;TruffleHog&lt;/td&gt;
&lt;td&gt;Verified detections, not just regex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependency audit&lt;/td&gt;
&lt;td&gt;Safety&lt;/td&gt;
&lt;td&gt;Known-vulnerability database for Python packages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Container scanning&lt;/td&gt;
&lt;td&gt;Trivy&lt;/td&gt;
&lt;td&gt;OS + package layer vulns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IaC&lt;/td&gt;
&lt;td&gt;Terraform&lt;/td&gt;
&lt;td&gt;Reproducible, version-controlled AWS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orchestration&lt;/td&gt;
&lt;td&gt;Kubernetes (EKS)&lt;/td&gt;
&lt;td&gt;Self-healing, scalable containers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Packaging&lt;/td&gt;
&lt;td&gt;Helm&lt;/td&gt;
&lt;td&gt;Templated K8s manifests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitOps&lt;/td&gt;
&lt;td&gt;ArgoCD&lt;/td&gt;
&lt;td&gt;Git as source of truth, auto-revert&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metrics&lt;/td&gt;
&lt;td&gt;Prometheus&lt;/td&gt;
&lt;td&gt;Custom app + node + cluster metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dashboards&lt;/td&gt;
&lt;td&gt;Grafana&lt;/td&gt;
&lt;td&gt;Real-time visualisation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alerting&lt;/td&gt;
&lt;td&gt;AlertManager + Slack&lt;/td&gt;
&lt;td&gt;Threshold-based incident paging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI/CD&lt;/td&gt;
&lt;td&gt;GitHub Actions&lt;/td&gt;
&lt;td&gt;Pipeline on every push&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Repo
&lt;/h2&gt;

&lt;p&gt;Everything is open source: &lt;a href="https://github.com/ChetanEpuri/modern-devops-project" rel="noopener noreferrer"&gt;github.com/ChetanEpuri/modern-devops-project&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The README walks through prerequisites, getting started locally with Docker Compose, provisioning the full cloud stack, and connecting to the ArgoCD and Grafana dashboards.&lt;/p&gt;

&lt;p&gt;If any of this is useful or you're building something similar, drop a comment. I'm particularly interested in talking to people who've taken GitOps patterns further — multi-cluster setups, progressive delivery with Flagger, that kind of thing.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>terraform</category>
      <category>github</category>
    </item>
  </channel>
</rss>
