<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mateen Anjum</title>
    <description>The latest articles on DEV Community by Mateen Anjum (@mateenali66).</description>
    <link>https://dev.to/mateenali66</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3644604%2F79a9c96c-74eb-4675-9e33-f32d208b4d1b.jpg</url>
      <title>DEV Community: Mateen Anjum</title>
      <link>https://dev.to/mateenali66</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mateenali66"/>
    <language>en</language>
    <item>
      <title>Stop Building CI Pipelines For Humans. Your AI Agents Need A Harness.</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Mon, 01 Jun 2026 03:02:56 +0000</pubDate>
      <link>https://dev.to/mateenali66/stop-building-ci-pipelines-for-humans-your-ai-agents-need-a-harness-48no</link>
      <guid>https://dev.to/mateenali66/stop-building-ci-pipelines-for-humans-your-ai-agents-need-a-harness-48no</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Your CI pipeline was designed for a human reading red text on GitHub. AI agents need a verification harness: deterministic infra, ephemeral preview environments, OPA blast-radius limits, replay traffic, and a machine-readable verdict. Here is the one I shipped, with code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faynukt0ajl19c0xxhsnw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faynukt0ajl19c0xxhsnw.png" alt=" " width="800" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few weeks ago I let a coding agent loose on a real platform team's repo. Terraform, EKS, around 40 microservices. The agent was good. It opened clean PRs, the diffs looked fine, the tests it added were reasonable. By the end of the week it had merged six PRs and we'd rolled back four of them.&lt;/p&gt;

&lt;p&gt;The model wasn't the problem. The problem was that everything around the model assumed a human would catch the bad ones. The CI gates were written for someone who could squint at a Grafana panel, remember last Tuesday's outage, and feel uneasy. The agent has no scar tissue. Someone on r/devops put it perfectly back in March: "LLMs optimize for resolve the immediate error without understanding blast radius. A human would've paused after the first networking change went sideways. The agent doesn't have that instinct."&lt;/p&gt;

&lt;p&gt;The fix isn't a smarter model. It's what people are starting to call an &lt;strong&gt;agent harness&lt;/strong&gt;: the runtime layer wrapped around the model that gives it deterministic infra to play in, hard limits on what it can break, and a structured signal telling it whether the change worked. The term itself only hit mainstream usage in early 2026, per a &lt;a href="https://www.nxcode.io/resources/news/what-is-harness-engineering-complete-guide-2026" rel="noopener noreferrer"&gt;recent industry write-up&lt;/a&gt;, and most teams I talk to haven't built one yet.&lt;/p&gt;

&lt;p&gt;Here is the harness I ended up shipping. It costs roughly $180/month per agent slot on AWS, takes about a day to wire up if you already have Terraform and a GitOps controller, and it has cut bad-merge rollbacks from four a week to zero in the last 17 days.&lt;/p&gt;

&lt;h2&gt;Five Failure Modes That Hurt Every Time&lt;/h2&gt;

&lt;p&gt;Same five things keep biting teams. They sound obvious individually. Together they make agents look incompetent.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Flaky preview environments.&lt;/strong&gt; Same PR, two runs, different results. The agent's last change "worked" because Redis happened to come up first. Next run it doesn't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No rollback signal.&lt;/strong&gt; Agent merges. Prod p99 quietly drifts from 180ms to 410ms. Nothing alerts because nothing watches the right thing in a way the agent can read.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-deterministic Terraform.&lt;/strong&gt; Plan looked clean. Apply diverged because a data source resolved differently in the second run. Common with &lt;code&gt;aws_ami&lt;/code&gt; lookups, IAM role ARNs, and anything pulling from the registry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No blast-radius limit.&lt;/strong&gt; Agent decides the cleanest fix is to delete the VPC. Technically it has permission, because the CI role is admin. Yes, this happened.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No agent-readable test reports.&lt;/strong&gt; The Cypress run failed. The reason is buried in 4MB of stdout with ANSI color codes. The agent reads 200 lines, gives up, says "tests pass" in the PR comment.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Northflank wrote up the broader category in their &lt;a href="https://northflank.com/blog/ephemeral-execution-environments-ai-agents" rel="noopener noreferrer"&gt;March 2026 piece on ephemeral execution environments for AI agents&lt;/a&gt; and most of it tracks. The interesting bit is the gap between "we run agent code in a sandbox" and "the sandbox actually verifies the change."&lt;/p&gt;

&lt;h2&gt;The Harness&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1lmrr7atdnyooh9stghs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1lmrr7atdnyooh9stghs.png" alt=" " width="800" height="664"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Six components. None of them are new on their own. The trick is wiring them so the agent gets a verdict, not a wall of logs.&lt;/p&gt;

&lt;h3&gt;1. Lock Terraform Until The Plan Is Reproducible&lt;/h3&gt;

&lt;p&gt;Every drift complaint I have ever heard starts with a non-pinned provider or an implicit data source. Fix it once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;terraform&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;required_version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"= 1.9.8"&lt;/span&gt;

  &lt;span class="nx"&gt;required_providers&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;aws&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"hashicorp/aws"&lt;/span&gt;
      &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"= 5.74.0"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;kubernetes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"hashicorp/kubernetes"&lt;/span&gt;
      &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"= 2.33.0"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="s2"&gt;"s3"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;# Backend blocks cannot interpolate ${terraform.workspace}; the S3 backend&lt;/span&gt;
    &lt;span class="c1"&gt;# stores each workspace under workspace_key_prefix/WORKSPACE/key on its own.&lt;/span&gt;
    &lt;span class="nx"&gt;bucket&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"agent-harness-tfstate"&lt;/span&gt;
    &lt;span class="nx"&gt;key&lt;/span&gt;                  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform.tfstate"&lt;/span&gt;
    &lt;span class="nx"&gt;workspace_key_prefix&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"preview"&lt;/span&gt;
    &lt;span class="nx"&gt;region&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;
    &lt;span class="nx"&gt;dynamodb_table&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"agent-harness-locks"&lt;/span&gt;
    &lt;span class="nx"&gt;encrypt&lt;/span&gt;              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Pin the AMI by ID so the lookup cannot drift to a newer image between runs.&lt;/span&gt;
&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_ami"&lt;/span&gt; &lt;span class="s2"&gt;"eks_node"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;most_recent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="nx"&gt;owners&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"602401143452"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="nx"&gt;filter&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"image-id"&lt;/span&gt;
    &lt;span class="nx"&gt;values&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"ami-0c2f3d8a17b7d4f91"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



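&lt;p&gt;The &lt;code&gt;= 1.9.8&lt;/code&gt; style is exact-pin, not &lt;code&gt;~&amp;gt; 1.9&lt;/code&gt;. Agents try to "fix" version constraints; they shouldn't. Run &lt;code&gt;terraform plan -refresh=false -lock-timeout=120s&lt;/code&gt; in the harness so the plan is computed from recorded state instead of whatever the live resources look like mid-run. Data sources are still read at plan time, which is why the AMI filter above names one exact image ID.&lt;/p&gt;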

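&lt;p&gt;I also wrap every preview run in a workspace named after the PR number, so state is isolated and teardown is a &lt;code&gt;terraform destroy&lt;/code&gt; followed by &lt;code&gt;terraform workspace delete pr-1247&lt;/code&gt;. Run the delete from another workspace: Terraform refuses to delete the one currently selected, and without &lt;code&gt;-force&lt;/code&gt; it refuses to delete one that still tracks resources.&lt;/p&gt;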
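
&lt;p&gt;One way to check that a plan really is reproducible is to plan the same commit twice and compare what each run proposes. Here is a minimal Python sketch, assuming both plans were exported with &lt;code&gt;terraform show -json&lt;/code&gt;. The function names &lt;code&gt;plan_fingerprint&lt;/code&gt; and &lt;code&gt;is_reproducible&lt;/code&gt; are hypothetical and are not part of the harness described here.&lt;/p&gt;

```python
import hashlib
import json


def plan_fingerprint(plan: dict) -> str:
    """Hash the proposed changes of a `terraform show -json` plan."""
    changes = [
        {
            "address": rc["address"],
            "actions": rc["change"]["actions"],
            "after": rc["change"].get("after"),
        }
        for rc in plan.get("resource_changes", [])
    ]
    # Sort by address so ordering differences between runs do not matter.
    changes.sort(key=lambda c: c["address"])
    canonical = json.dumps(changes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_reproducible(first_plan: dict, second_plan: dict) -> bool:
    """True when two plans of the same commit propose identical changes."""
    return plan_fingerprint(first_plan) == plan_fingerprint(second_plan)
```

&lt;p&gt;If the two fingerprints differ, something resolved differently between runs, and the harness can fail the check before the agent ever sees an apply.&lt;/p&gt;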

&lt;h3&gt;2. Give The Agent Its Own Ephemeral EKS Namespace, Not Its Own Cluster&lt;/h3&gt;

&lt;p&gt;Spinning up a fresh EKS cluster per PR is what some Northflank docs suggest. In practice it takes 12 to 15 minutes and burns $0.10/hour just for the control plane. For agent workflows where you want a verdict in under 4 minutes, namespace-per-PR on a warm cluster wins.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
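&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# kustomize/preview/kustomization.yaml&lt;/span&gt;
&lt;span class="c1"&gt;# Kustomize does not expand ${PR_NUMBER}; render this file with envsubst (or similar) first.&lt;/span&gt;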
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kustomize.config.k8s.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Kustomization&lt;/span&gt;

&lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pr-${PR_NUMBER}&lt;/span&gt;

&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;../base&lt;/span&gt;

&lt;span class="na"&gt;patches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
    &lt;span class="na"&gt;patch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
      &lt;span class="s"&gt;- op: replace&lt;/span&gt;
        &lt;span class="s"&gt;path: /spec/replicas&lt;/span&gt;
        &lt;span class="s"&gt;value: 1&lt;/span&gt;
      &lt;span class="s"&gt;- op: add&lt;/span&gt;
        &lt;span class="s"&gt;path: /metadata/labels/preview&lt;/span&gt;
        &lt;span class="s"&gt;value: "true"&lt;/span&gt;
      &lt;span class="s"&gt;- op: add&lt;/span&gt;
        &lt;span class="s"&gt;path: /spec/template/spec/priorityClassName&lt;/span&gt;
        &lt;span class="s"&gt;value: preview-low&lt;/span&gt;

&lt;span class="na"&gt;commonAnnotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;agent-harness/ttl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3600"&lt;/span&gt;
  &lt;span class="na"&gt;agent-harness/pr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${PR_NUMBER}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A small janitor controller deletes namespaces older than the TTL annotation. Costs me about $14/month for a 5-node m6i.large pool that holds 30 concurrent preview namespaces.&lt;/p&gt;
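
&lt;p&gt;The janitor's core decision fits in a few lines. This is a Python sketch, not the controller itself. It assumes the input is the &lt;code&gt;items&lt;/code&gt; list from &lt;code&gt;kubectl get namespaces -o json&lt;/code&gt; and that the &lt;code&gt;agent-harness/ttl&lt;/code&gt; annotation sits on the Namespace object, which requires the Namespace manifest to be part of the kustomize base. The name &lt;code&gt;expired_namespaces&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
from datetime import datetime, timezone

TTL_ANNOTATION = "agent-harness/ttl"


def expired_namespaces(namespaces: list, now: datetime) -> list:
    """Return names of namespaces whose age has reached their TTL annotation."""
    expired = []
    for ns in namespaces:
        meta = ns["metadata"]
        ttl = meta.get("annotations", {}).get(TTL_ANNOTATION)
        if ttl is None:
            continue  # Not managed by the harness; never delete it.
        # Kubernetes reports creationTimestamp as RFC 3339 in UTC.
        created = datetime.strptime(
            meta["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        age_seconds = (now - created).total_seconds()
        if age_seconds >= int(ttl):
            expired.append(meta["name"])
    return expired
```

&lt;p&gt;Skipping namespaces without the annotation is deliberate: the janitor should only ever delete what the harness created.&lt;/p&gt;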

&lt;h3&gt;3. Put Hard Limits On What The Agent Can Change Via OPA&lt;/h3&gt;

&lt;p&gt;This is the one nobody wants to write and everybody needs. The agent is going to try to widen its own permissions. Block it at the policy layer, not the IAM layer, because the IAM layer is too coarse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rego"&gt;&lt;code&gt;&lt;span class="ow"&gt;package&lt;/span&gt; &lt;span class="n"&gt;terraform&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;blastradius&lt;/span&gt;

&lt;span class="ow"&gt;import&lt;/span&gt; &lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keywords&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;in&lt;/span&gt;

&lt;span class="n"&gt;deny&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="ow"&gt;some&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource_changes&lt;/span&gt;
  &lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role"&lt;/span&gt;
  &lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;!&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"no-op"&lt;/span&gt;
  &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"agent cannot modify IAM roles: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;deny&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="ow"&gt;some&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource_changes&lt;/span&gt;
  &lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"aws_vpc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"aws_subnet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"aws_route_table"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="s2"&gt;"delete"&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;actions&lt;/span&gt;
  &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"agent cannot delete network primitives: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;deny&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="ow"&gt;some&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource_changes&lt;/span&gt;
  &lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"aws_security_group_rule"&lt;/span&gt;
  &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;after&lt;/span&gt;
  &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cidr_blocks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;
  &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_port&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="m"&gt;22&lt;/span&gt;
  &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_port&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="m"&gt;22&lt;/span&gt;
  &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s2"&gt;"agent cannot open SSH to the world"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Cap the total number of resources touched in a single plan&lt;/span&gt;
&lt;span class="n"&gt;deny&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource_changes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
  &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"plan touches %d resources, max 50"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource_changes&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



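&lt;p&gt;Run it against the JSON form of the plan (from &lt;code&gt;terraform show -json&lt;/code&gt;) with &lt;code&gt;conftest test plan.json -p policies/ --namespace terraform.blastradius&lt;/code&gt;. The namespace flag matters: conftest only evaluates the &lt;code&gt;main&lt;/code&gt; package by default, so without it these rules are never evaluated and the check passes. The conftest exit code becomes the PR check. Total cost: about 80 lines of Rego I wrote on a Sunday morning.&lt;/p&gt;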
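
&lt;p&gt;The exit code is enough for the PR check, but the agent also needs to know which rule fired. conftest can emit JSON with &lt;code&gt;--output json&lt;/code&gt;, and the Python sketch below folds that into a single check entry. The &lt;code&gt;opa_check&lt;/code&gt; helper and the &lt;code&gt;violations&lt;/code&gt; field are hypothetical, and the result shape (a list of objects with an optional &lt;code&gt;failures&lt;/code&gt; array of &lt;code&gt;msg&lt;/code&gt; entries) is an assumption worth checking against your conftest version.&lt;/p&gt;

```python
import json


def opa_check(conftest_output: str, resources_changed: int) -> dict:
    """Fold `conftest test --output json` results into one verdict check."""
    results = json.loads(conftest_output)
    messages = [
        failure["msg"]
        for result in results
        for failure in (result.get("failures") or [])
    ]
    check = {
        "name": "opa.blast_radius",
        "status": "fail" if messages else "pass",
        "resources_changed": resources_changed,
    }
    if messages:
        # Hypothetical extra field so the agent can see which rule fired.
        check["violations"] = messages
    return check
```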

&lt;h3&gt;4. Argo Rollouts For An Automatic Rollback Signal The Agent Can Read&lt;/h3&gt;

&lt;p&gt;Argo Rollouts has analysis templates that gate a rollout on metric queries. The one below checks the preview's error rate and p99 latency against fixed thresholds. The output is structured. That is the whole point.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AnalysisTemplate&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;preview-slo-gate&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;error-rate&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
      &lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6&lt;/span&gt;
      &lt;span class="na"&gt;successCondition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;result[0] &amp;lt; &lt;/span&gt;&lt;span class="m"&gt;0.001&lt;/span&gt;
      &lt;span class="na"&gt;failureLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://prometheus.monitoring:9090&lt;/span&gt;
          &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;sum(rate(http_requests_total{&lt;/span&gt;
              &lt;span class="s"&gt;service="{{args.service}}",&lt;/span&gt;
              &lt;span class="s"&gt;status=~"5..",&lt;/span&gt;
              &lt;span class="s"&gt;preview="true"&lt;/span&gt;
            &lt;span class="s"&gt;}[1m]))&lt;/span&gt;
            &lt;span class="s"&gt;/&lt;/span&gt;
            &lt;span class="s"&gt;sum(rate(http_requests_total{&lt;/span&gt;
              &lt;span class="s"&gt;service="{{args.service}}",&lt;/span&gt;
              &lt;span class="s"&gt;preview="true"&lt;/span&gt;
            &lt;span class="s"&gt;}[1m]))&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;p99-latency&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
      &lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6&lt;/span&gt;
      &lt;span class="na"&gt;successCondition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;result[0] &amp;lt; &lt;/span&gt;&lt;span class="m"&gt;0.300&lt;/span&gt;
      &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://prometheus.monitoring:9090&lt;/span&gt;
          &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;histogram_quantile(0.99,&lt;/span&gt;
              &lt;span class="s"&gt;sum(rate(http_request_duration_seconds_bucket{&lt;/span&gt;
                &lt;span class="s"&gt;service="{{args.service}}",&lt;/span&gt;
                &lt;span class="s"&gt;preview="true"&lt;/span&gt;
              &lt;span class="s"&gt;}[1m])) by (le)&lt;/span&gt;
            &lt;span class="s"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When this template fails, Argo records each metric's phase and measurements on the AnalysisRun status and aborts the Rollout. My harness scrapes that status and turns it into the structured verdict the agent reads next.&lt;/p&gt;
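
&lt;p&gt;One place to read those results is the AnalysisRun that the template spawns. The Python sketch below assumes the object has been fetched as JSON (for example with &lt;code&gt;kubectl get analysisrun -o json&lt;/code&gt;) and that its &lt;code&gt;status.metricResults&lt;/code&gt; entries carry &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;phase&lt;/code&gt;, and &lt;code&gt;measurements&lt;/code&gt;. The &lt;code&gt;argo_checks&lt;/code&gt; helper and its naming scheme are hypothetical, not the harness's actual scraper.&lt;/p&gt;

```python
def argo_checks(analysis_run: dict) -> list:
    """Turn an AnalysisRun's metricResults into verdict check entries."""
    checks = []
    for metric in analysis_run.get("status", {}).get("metricResults", []):
        measurements = metric.get("measurements", [])
        last_value = measurements[-1].get("value") if measurements else None
        checks.append(
            {
                # error-rate becomes argo.error_rate, and so on.
                "name": "argo." + metric["name"].replace("-", "_"),
                "status": "pass" if metric.get("phase") == "Successful" else "fail",
                # Kept as the raw string Argo reported; parse it downstream.
                "value": last_value,
            }
        )
    return checks
```

&lt;p&gt;Anything other than a &lt;code&gt;Successful&lt;/code&gt; phase is treated as a failure here, so an errored or inconclusive metric never reads as a pass.&lt;/p&gt;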

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fneqpwe4cxelmnek13vrp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fneqpwe4cxelmnek13vrp.png" alt=" " width="800" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;5. Replay Real Traffic, Not Synthetic Probes&lt;/h3&gt;

&lt;p&gt;The lie every preview environment tells is that a &lt;code&gt;curl /health&lt;/code&gt; loop is "verification." It isn't. Mirror 1 to 5 percent of real prod traffic to the preview namespace. GoReplay is the path of least resistance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On a prod ingress node, sample 2% and shadow to preview.&lt;/span&gt;
&lt;span class="c"&gt;# GoReplay cannot decrypt TLS; if :443 is encrypted here, capture on the&lt;/span&gt;
&lt;span class="c"&gt;# plaintext port behind TLS termination instead.&lt;/span&gt;
gor &lt;span class="nt"&gt;--input-raw&lt;/span&gt; :443 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--input-raw-track-response&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--output-http&lt;/span&gt; &lt;span class="s2"&gt;"https://preview-pr-1247.harness.internal|2%"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--http-allow-method&lt;/span&gt; GET &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--http-allow-method&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--http-disallow-url&lt;/span&gt; &lt;span class="s2"&gt;"/admin|/internal"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--output-http-stats&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PII rule: never replay request bodies for auth endpoints, never replay anything carrying card data. The &lt;code&gt;--http-disallow-url&lt;/code&gt; flag is the line you do not skip. I add a second filter in a small Go pre-processor that strips &lt;code&gt;Authorization&lt;/code&gt;, &lt;code&gt;Cookie&lt;/code&gt;, and any header matching &lt;code&gt;*-token&lt;/code&gt;.&lt;/p&gt;
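
&lt;p&gt;The header filter itself is small. My pre-processor is Go; the version below is a Python sketch of the same rule, with the hypothetical name &lt;code&gt;scrub_headers&lt;/code&gt;. It assumes headers arrive as a plain name-to-value mapping.&lt;/p&gt;

```python
from fnmatch import fnmatch

BLOCKED_EXACT = {"authorization", "cookie"}
BLOCKED_PATTERNS = ["*-token"]


def scrub_headers(headers: dict) -> dict:
    """Drop credential-bearing headers before a request is replayed."""
    clean = {}
    for name, value in headers.items():
        lowered = name.lower()  # HTTP header names are case-insensitive.
        if lowered in BLOCKED_EXACT:
            continue
        if any(fnmatch(lowered, pattern) for pattern in BLOCKED_PATTERNS):
            continue
        clean[name] = value
    return clean
```

&lt;p&gt;Matching on the lowercased name means &lt;code&gt;X-CSRF-Token&lt;/code&gt; and &lt;code&gt;x-csrf-token&lt;/code&gt; are both dropped.&lt;/p&gt;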

&lt;p&gt;Five minutes of shadowed prod traffic against the preview surfaces the kind of bug that synthetic tests will never find: a corner case where the agent's "optimization" doubled DB calls for users with more than 50 saved items. We caught that on PR 1183.&lt;/p&gt;

&lt;h3&gt;6. Write An Agent-Readable Verdict, Not A Log Tail&lt;/h3&gt;

&lt;p&gt;The whole loop is wasted if the agent can't parse the result. Generate a JSON file and stash it in S3 with a stable key the agent can fetch from a tool call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pr"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1247&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"verdict"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fail"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"duration_seconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;218&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"checks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"opa.blast_radius"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pass"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"resources_changed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"terraform.plan"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pass"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"drift"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"argo.error_rate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fail"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0043&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"threshold"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"trace_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://tempo.harness.internal/trace/abc123"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"argo.p99_latency_seconds"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pass"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.187&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"threshold"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.300&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"replay.divergence"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fail"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"diffs_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3://harness-verdicts/pr-1247/replay-diff.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"notes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"23% of /api/items requests returned 500 in preview, 0% in prod baseline"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent reads this and knows exactly which step to fix. No log scraping. No ANSI codes. No vibes-based debugging.&lt;/p&gt;
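
&lt;p&gt;To make that concrete, here is a minimal sketch of the consumer side in Python. It assumes the check entries have already been parsed into a list of dicts shaped like the fragment above; the function name &lt;code&gt;failing_checks&lt;/code&gt; and the choice of fields to carry over are illustrative, not part of the harness itself.&lt;/p&gt;

```python
import json

# Sample entries copied from the verdict fragment above. The surrounding
# top-level keys are left out because this sketch only needs the list.
CHECKS_JSON = """
[
  {"name": "terraform.plan", "status": "pass", "drift": false},
  {"name": "argo.error_rate", "status": "fail", "value": 0.0043,
   "threshold": 0.001,
   "trace_url": "https://tempo.harness.internal/trace/abc123"},
  {"name": "argo.p99_latency_seconds", "status": "pass", "value": 0.187,
   "threshold": 0.300},
  {"name": "replay.divergence", "status": "fail",
   "diffs_url": "s3://harness-verdicts/pr-1247/replay-diff.json",
   "notes": "23% of /api/items requests returned 500 in preview, 0% in prod baseline"}
]
"""

# Fields an agent can act on directly (an assumption for this sketch).
ACTIONABLE_FIELDS = ("value", "threshold", "trace_url", "diffs_url", "notes")


def failing_checks(checks):
    """Return one short summary dict per check whose status is not 'pass'."""
    summaries = []
    for check in checks:
        if check.get("status") == "pass":
            continue
        summary = {"name": check["name"]}
        for key in ACTIONABLE_FIELDS:
            if key in check:
                summary[key] = check[key]
        summaries.append(summary)
    return summaries


if __name__ == "__main__":
    for item in failing_checks(json.loads(CHECKS_JSON)):
        print(item)
```

&lt;p&gt;For the sample above, this yields two entries, &lt;code&gt;argo.error_rate&lt;/code&gt; and &lt;code&gt;replay.divergence&lt;/code&gt;, each with the threshold, trace link, or note needed to start a fix.&lt;/p&gt;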

&lt;h2&gt;
  
  
  Results From The First 17 Days
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before harness&lt;/th&gt;
&lt;th&gt;After harness&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bad merges/week&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent PRs reviewed by humans&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;23%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean time from PR open to verdict&lt;/td&gt;
&lt;td&gt;38 min&lt;/td&gt;
&lt;td&gt;3 min 41 sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per agent PR run&lt;/td&gt;
&lt;td&gt;$0.42&lt;/td&gt;
&lt;td&gt;$0.11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engineer time on agent oversight&lt;/td&gt;
&lt;td&gt;6 hours/week&lt;/td&gt;
&lt;td&gt;45 minutes/week&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The cost drop is mostly from killing per-PR EKS clusters and using a warm shared pool with namespace isolation. The time drop is the OPA gate failing fast on the 30 percent of plans that were going to be rejected anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;A few mistakes that took longer than they should have:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't let the agent edit OPA policies.&lt;/strong&gt; I learned this on day three. The agent will helpfully "fix the failing policy" by deleting the rule. Put policies in a separate repo with branch protection or, simpler, add a &lt;code&gt;CODEOWNERS&lt;/code&gt; entry for &lt;code&gt;policies/&lt;/code&gt; and require code-owner review on the protected branch, so every policy change needs a human approval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trace IDs in the verdict, not in the logs.&lt;/strong&gt; I had it returning a &lt;code&gt;logs_url&lt;/code&gt; for two weeks. The agent never opened it. Switched to embedding the top three trace IDs with a one-line summary each, and suddenly fix quality went up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Replay only GETs for the first month.&lt;/strong&gt; I tried POST replay early and corrupted preview DBs three times. Get the read path verifiably working, then add writes with a request rewriter that targets a synthetic-tenant ID.&lt;/p&gt;
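
&lt;p&gt;A rough sketch of that filter in Python, under stated assumptions: captured requests are plain dicts with &lt;code&gt;method&lt;/code&gt;, &lt;code&gt;path&lt;/code&gt;, and optional &lt;code&gt;headers&lt;/code&gt;, and the tenant is carried in an &lt;code&gt;X-Tenant-Id&lt;/code&gt; header. Both the record shape and the header name are hypothetical, so adapt them to whatever your capture tool emits.&lt;/p&gt;

```python
def select_for_replay(requests, allow_writes=False,
                      synthetic_tenant="synthetic-tenant"):
    """Pick captured requests that are safe to replay against a preview env.

    GET requests pass through unchanged. Anything else is dropped unless
    allow_writes is set, in which case it is copied and pointed at a
    synthetic tenant so it cannot touch real tenant data.
    """
    selected = []
    for req in requests:
        if req["method"].upper() == "GET":
            selected.append(dict(req))
        elif allow_writes:
            rewritten = dict(req)
            # Copy the headers so the captured record is never mutated.
            headers = dict(req.get("headers", {}))
            headers["X-Tenant-Id"] = synthetic_tenant  # hypothetical header
            rewritten["headers"] = headers
            selected.append(rewritten)
    return selected


if __name__ == "__main__":
    captured = [
        {"method": "GET", "path": "/api/items"},
        {"method": "POST", "path": "/api/items",
         "headers": {"X-Tenant-Id": "acme"}},
    ]
    print(select_for_replay(captured))
    print(select_for_replay(captured, allow_writes=True))
```

&lt;p&gt;Keeping &lt;code&gt;allow_writes&lt;/code&gt; off by default means the read-only path is what you get unless someone opts in on purpose.&lt;/p&gt;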

&lt;p&gt;&lt;strong&gt;The harness is the product, the agent is interchangeable.&lt;/strong&gt; I started with one model and swapped to another after two weeks; the results were almost identical. The harness does the work. Pick whichever model is cheapest this quarter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The full Terraform module, OPA policies, kustomize overlays, and the verdict-builder Lambda are at &lt;a href="https://github.com/mateenali66/agent-harness" rel="noopener noreferrer"&gt;github.com/mateenali66/agent-harness&lt;/a&gt; (going public next week, ping me if you want early access).&lt;/p&gt;

&lt;p&gt;Closing thought. Harness, the CI/CD vendor, named their product before the term "agent harness" existed. That collision is going to confuse people for the next year. The concept is bigger than any vendor. If you are letting AI agents touch production infra without a verification layer that returns structured verdicts, you are running an open-loop control system and hoping the model is calibrated. It isn't.&lt;/p&gt;

&lt;p&gt;Build the harness. Then let the agents work.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://northflank.com/blog/ephemeral-execution-environments-ai-agents" rel="noopener noreferrer"&gt;Ephemeral execution environments for AI agents (Northflank, March 2026)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.harness.io/blog/q1-2026-product-update-harness-continuous-delivery-gitops" rel="noopener noreferrer"&gt;Q1 2026 Harness CD &amp;amp; GitOps product update&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://futurumgroup.com/insights/harness-incident-agent-is-devops-now-the-ai-engineers-of-software-delivery/" rel="noopener noreferrer"&gt;Harness AI agent for incident investigation (Futurum, Jan 2026)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.reddit.com/r/devops/comments/1s28xen/devops_ai_where_are_we_headed_need_honest/" rel="noopener noreferrer"&gt;r/devops thread on AI agents and blast radius&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://argoproj.github.io/argo-rollouts/features/analysis/" rel="noopener noreferrer"&gt;Argo Rollouts analysis templates&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.conftest.dev/" rel="noopener noreferrer"&gt;Conftest for Terraform plan policy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>kubernetes</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Stop Running LLM Workloads on Vanilla Kubernetes</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Wed, 20 May 2026 19:44:34 +0000</pubDate>
      <link>https://dev.to/mateenali66/stop-running-llm-workloads-on-vanilla-kubernetes-4bke</link>
      <guid>https://dev.to/mateenali66/stop-running-llm-workloads-on-vanilla-kubernetes-4bke</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Kubernetes schedules LLM workloads well, but it does not give them the isolation boundary they need once they start calling tools, executing code, or handling tenant data.&lt;/p&gt;

&lt;p&gt;Open Source Summit North America made one thing obvious: the cloud native crowd has moved from "can Kubernetes run LLM workloads?" to "what breaks when we trust Kubernetes too much?"&lt;/p&gt;

&lt;p&gt;That is the right question.&lt;/p&gt;

&lt;p&gt;The default Kubernetes security model assumes a pod is mostly an application packaging unit. It gives you namespaces, cgroups, seccomp, AppArmor, service accounts, admission control, and network policy. All of that matters. None of it changes the central fact that normal containers share the host kernel.&lt;/p&gt;

&lt;p&gt;For a stateless API, that tradeoff is usually fine. For an LLM tool runner that can read files, call APIs, invoke Python, shell out to package managers, and chain actions across systems, that boundary starts looking thin.&lt;/p&gt;

&lt;p&gt;The uncomfortable version is this: vanilla Kubernetes is orchestration, not containment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frt3a1eli01vmie9s67e8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frt3a1eli01vmie9s67e8.png" alt=" " width="800" height="629"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;LLM inference by itself is not the scary part. A model server that receives a prompt and returns tokens is mostly a specialized API service with GPU scheduling problems.&lt;/p&gt;

&lt;p&gt;The risk changes when the workload gains agency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Prompt input
  -&amp;gt; retrieval
  -&amp;gt; tool selection
  -&amp;gt; code execution
  -&amp;gt; network call
  -&amp;gt; file write
  -&amp;gt; another tool call
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At that point, the workload is no longer just serving traffic. It is interpreting untrusted text and turning it into actions.&lt;/p&gt;

&lt;p&gt;That is why the recent CNCF security conversation around AI sandboxing matters. Kubernetes can restart a failed pod, route around a bad node, and roll a deployment. It cannot understand whether a prompt is trying to turn a tool into an escape path. It also cannot turn a shared kernel into a hard tenant boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Tried First
&lt;/h2&gt;

&lt;p&gt;My first instinct was the usual Kubernetes hardening stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-worker&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;securityContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runAsNonRoot&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;seccompProfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RuntimeDefault&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;worker&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/example/llm-worker:latest&lt;/span&gt;
      &lt;span class="na"&gt;securityContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;allowPrivilegeEscalation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
        &lt;span class="na"&gt;readOnlyRootFilesystem&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;capabilities&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;drop&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ALL"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That should still be the baseline. The mistake is treating it as the finish line.&lt;/p&gt;

&lt;p&gt;Pod Security Standards reduce obvious footguns. NetworkPolicy controls blast radius. RBAC prevents a compromised workload from casually listing secrets or mutating the cluster. Admission policies keep the platform honest.&lt;/p&gt;

&lt;p&gt;But an LLM agent running untrusted code is not just a badly configured web pod. It is closer to a multi-tenant execution service. That needs a runtime boundary, not only a YAML checklist.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Runtime Choice
&lt;/h2&gt;

&lt;p&gt;The Kubernetes primitive that makes this manageable is &lt;code&gt;RuntimeClass&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Instead of creating one magical "secure cluster," you route workloads by risk:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhaec0x227wum1eqxwopd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhaec0x227wum1eqxwopd.png" alt=" " width="800" height="1290"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RuntimeClass&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gvisor&lt;/span&gt;
&lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;runsc&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RuntimeClass&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kata&lt;/span&gt;
&lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kata&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then each workload declares the boundary it needs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tool-using-agent&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tool-using-agent&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tool-using-agent&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;runtimeClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kata&lt;/span&gt;
      &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-agent&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;agent&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/example/tool-agent:2026.05&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My rule of thumb:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;Runtime&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Plain inference API&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;runc&lt;/code&gt; or &lt;code&gt;gvisor&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Low tool risk, latency sensitive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval worker with narrow egress&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gvisor&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Better syscall boundary with less operational change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent that calls tools&lt;/td&gt;
&lt;td&gt;&lt;code&gt;kata&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;VM boundary per pod, Kubernetes friendly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Arbitrary code execution&lt;/td&gt;
&lt;td&gt;Firecracker-style microVM&lt;/td&gt;
&lt;td&gt;Treat it like untrusted tenant compute&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;gVisor is the easiest first step because it integrates as an OCI runtime through &lt;code&gt;runsc&lt;/code&gt;. Kata is the better fit when the isolation requirement is stronger and a VM per pod is acceptable. Firecracker is the most interesting boundary for code execution, but it is also the one I would least casually bolt onto an existing cluster without a real operations plan.&lt;/p&gt;
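
&lt;p&gt;The rule of thumb can be written down as a small routing function. This is only a sketch: the three risk flags are my own simplification of the table, and &lt;code&gt;"firecracker"&lt;/code&gt; is a placeholder label, since no RuntimeClass for it is defined above.&lt;/p&gt;

```python
def pick_runtime(runs_arbitrary_code, calls_tools, needs_syscall_filtering):
    """Map coarse workload risk flags to a runtime name, strictest first."""
    if runs_arbitrary_code:
        # Placeholder label: treat as untrusted tenant compute in a microVM.
        return "firecracker"
    if calls_tools:
        # VM boundary per pod.
        return "kata"
    if needs_syscall_filtering:
        # User-space kernel that intercepts syscalls.
        return "gvisor"
    # Plain inference: the default container runtime.
    return "runc"


if __name__ == "__main__":
    print(pick_runtime(False, False, False))
    print(pick_runtime(False, False, True))
    print(pick_runtime(False, True, True))
    print(pick_runtime(True, True, True))
```

&lt;p&gt;The &lt;code&gt;"gvisor"&lt;/code&gt; and &lt;code&gt;"kata"&lt;/code&gt; results match the RuntimeClass names defined earlier, so they can be used directly as &lt;code&gt;runtimeClassName&lt;/code&gt;. A &lt;code&gt;"runc"&lt;/code&gt; result means leaving &lt;code&gt;runtimeClassName&lt;/code&gt; unset, assuming runc is the cluster default.&lt;/p&gt;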

&lt;h2&gt;
  
  
  The Minimum Policy Set
&lt;/h2&gt;

&lt;p&gt;The runtime is only one layer. I would not run LLM workloads without this set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NetworkPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-worker-egress&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tool-using-agent&lt;/span&gt;
  &lt;span class="na"&gt;policyTypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Egress"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;egress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;namespaceSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;model-gateway&lt;/span&gt;
      &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
          &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;443&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;namespaceSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;telemetry&lt;/span&gt;
      &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
          &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4317&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also make the service account boring:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-agent&lt;/span&gt;
&lt;span class="na"&gt;automountServiceAccountToken&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the workload does not need Kubernetes API access, do not mount a token. If it does, bind only the exact verbs it needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Plan
&lt;/h2&gt;

&lt;p&gt;I am not going to fake GPU numbers from a laptop. This comparison needs a run on a real GPU node before I publish final performance claims.&lt;/p&gt;

&lt;p&gt;This is the harness I would run:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpl7nojrnyw6zsioaeqh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpl7nojrnyw6zsioaeqh.png" alt=" " width="800" height="1008"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Runtime&lt;/th&gt;
&lt;th&gt;Cold start p50&lt;/th&gt;
&lt;th&gt;Cold start p95&lt;/th&gt;
&lt;th&gt;Tokens per second&lt;/th&gt;
&lt;th&gt;RSS overhead&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;runc&lt;/td&gt;
&lt;td&gt;TODO&lt;/td&gt;
&lt;td&gt;TODO&lt;/td&gt;
&lt;td&gt;TODO&lt;/td&gt;
&lt;td&gt;TODO&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gVisor&lt;/td&gt;
&lt;td&gt;TODO&lt;/td&gt;
&lt;td&gt;TODO&lt;/td&gt;
&lt;td&gt;TODO&lt;/td&gt;
&lt;td&gt;TODO&lt;/td&gt;
&lt;td&gt;syscall boundary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kata&lt;/td&gt;
&lt;td&gt;TODO&lt;/td&gt;
&lt;td&gt;TODO&lt;/td&gt;
&lt;td&gt;TODO&lt;/td&gt;
&lt;td&gt;TODO&lt;/td&gt;
&lt;td&gt;VM per pod&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Firecracker&lt;/td&gt;
&lt;td&gt;TODO&lt;/td&gt;
&lt;td&gt;TODO&lt;/td&gt;
&lt;td&gt;TODO&lt;/td&gt;
&lt;td&gt;TODO&lt;/td&gt;
&lt;td&gt;strongest code runner candidate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The important part is measuring the right things. Startup time matters for bursty agents. Throughput matters for inference. RSS overhead matters because GPU nodes are already expensive. Operational failure modes matter more than all three.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;If you are running a normal model server, Kubernetes plus standard hardening may be enough.&lt;/p&gt;

&lt;p&gt;If you are running tool-using agents, code execution, tenant prompts, or workloads with broad egress, plain pods are the wrong abstraction. Use Kubernetes for scheduling. Use sandboxed runtimes for containment. Keep policy enforcement outside the model path where possible.&lt;/p&gt;

&lt;p&gt;Kubernetes is still the control plane. It just should not be the only security boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;CNCF: &lt;a href="https://www.cncf.io/blog/2026/04/30/ai-sandboxing-is-having-its-kubernetes-moment/" rel="noopener noreferrer"&gt;https://www.cncf.io/blog/2026/04/30/ai-sandboxing-is-having-its-kubernetes-moment/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Kubernetes Agent Sandbox: &lt;a href="https://kubernetes.io/blog/2026/03/20/running-agents-on-kubernetes-with-agent-sandbox/" rel="noopener noreferrer"&gt;https://kubernetes.io/blog/2026/03/20/running-agents-on-kubernetes-with-agent-sandbox/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;llm-d joins CNCF: &lt;a href="https://www.cncf.io/blog/2026/03/24/welcome-llm-d-to-the-cncf-evolving-kubernetes-into-sota-ai-infrastructure/" rel="noopener noreferrer"&gt;https://www.cncf.io/blog/2026/03/24/welcome-llm-d-to-the-cncf-evolving-kubernetes-into-sota-ai-infrastructure/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;gVisor: &lt;a href="https://github.com/google/gvisor" rel="noopener noreferrer"&gt;https://github.com/google/gvisor&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Kata Containers: &lt;a href="https://katacontainers.io/" rel="noopener noreferrer"&gt;https://katacontainers.io/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Firecracker containerd: &lt;a href="https://github.com/firecracker-microvm/firecracker-containerd" rel="noopener noreferrer"&gt;https://github.com/firecracker-microvm/firecracker-containerd&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>security</category>
    </item>
    <item>
      <title>We Didn't Migrate Systems. We Migrated Assumptions: Heroku to EKS at Scale</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Sun, 17 May 2026 05:15:45 +0000</pubDate>
      <link>https://dev.to/mateenali66/we-didnt-migrate-systems-we-migrated-assumptions-heroku-to-eks-at-scale-160p</link>
      <guid>https://dev.to/mateenali66/we-didnt-migrate-systems-we-migrated-assumptions-heroku-to-eks-at-scale-160p</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; I'm speaking at Open Source Summit North America 2026 in Minneapolis on Monday, May 18, about moving a fast-growing invoicing SaaS off Heroku onto EKS. This post is the long version of that talk: the three failures that nearly rolled the whole thing back, the open source decisions that saved it, and the honest numbers on what it cost. The one line I keep coming back to: we didn't migrate systems, we migrated assumptions.&lt;/p&gt;

&lt;p&gt;If you're at OSS NA, the session is Monday at 5:25pm CDT in Room 200F. If you're not, this is everything I'd tell you over coffee afterward.&lt;/p&gt;

&lt;h2&gt;
  
  
  The platform
&lt;/h2&gt;

&lt;p&gt;The product was a fast-growing invoicing SaaS. About 2 million active small business merchants, roughly 33 million invoices a year, enterprise clients with contractual SLAs we couldn't afford to miss.&lt;/p&gt;

&lt;p&gt;The architecture was already 47 Node.js microservices on Heroku, with SQS for events and Redis for sessions. The engineering team was 10 people, 2 of us on platform.&lt;/p&gt;

&lt;p&gt;I want to be precise about the title. The services were already micro. The platform was the monolith. Everything routed through one PaaS that made a lot of decisions for us, quietly, and those decisions were exactly the ones that broke when we left.&lt;/p&gt;

&lt;h2&gt;
  
  
  What broke at scale
&lt;/h2&gt;

&lt;p&gt;Four things, all at once, all getting worse:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API latency sitting at 700ms p99, with no obvious lever to pull because we'd hit the Heroku dyno scaling ceiling.&lt;/li&gt;
&lt;li&gt;A deploy pipeline that took 45 minutes or more, against enterprise SLAs we kept missing.&lt;/li&gt;
&lt;li&gt;No container-level observability, so we were guessing.&lt;/li&gt;
&lt;li&gt;A monthly bill that had quietly crossed the line where it cost more than the value it returned.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We ran the decision honestly. Stay on Heroku and accept the ceiling. Rewrite for serverless and eat the rewrite cost. Move to raw AWS VMs and get cost relief but no velocity. Or move to EKS, the highest-risk, highest-ceiling option. We picked EKS, and we picked it knowing it was the riskiest path on the board.&lt;/p&gt;

&lt;h2&gt;
  
  
  We failed three times before it worked
&lt;/h2&gt;

&lt;p&gt;This is the part most migration writeups skip. Here's what actually happened.&lt;/p&gt;

&lt;h3&gt;
  
  
  Attempt 1: the invisible throttle
&lt;/h3&gt;

&lt;p&gt;The PDF generation service went from 800ms p99 to 9 seconds. Dashboards showed 35% CPU. Everything looked fine and nothing was fine.&lt;/p&gt;

&lt;p&gt;The CFS scheduler enforces CPU limits per 100ms period. At a 500m limit, you get 50ms of CPU time per 100ms period, shared across every thread in the container. Node.js libuv spawns 4 worker threads and V8 garbage collection runs on its own threads, so you've got around 6 threads fighting over that 50ms budget. Once the budget is spent, every thread is paused until the next period starts. A crypto operation that takes 15ms unthrottled stretches to 200ms under contention. Average CPU looked low because the process spent most of its time throttled, not running.&lt;/p&gt;

&lt;p&gt;The metric that told the truth was &lt;code&gt;container_cpu_cfs_throttled_periods_total&lt;/code&gt;, not CPU utilization.&lt;/p&gt;

&lt;p&gt;Lesson: a 500m CPU limit isn't a number. It's a 50ms-per-100ms scheduling rule, and Heroku had been hiding that from us by letting dynos burst.&lt;/p&gt;
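
&lt;p&gt;The arithmetic behind that lesson fits in a few lines. This is a rough sketch, not a kernel API: the function names are made up for illustration, it assumes the default 100ms CFS period, and it assumes every busy thread gets its own core.&lt;/p&gt;

```python
# Sketch of CFS bandwidth arithmetic; all names here are illustrative.
CFS_PERIOD_MS = 100  # assumes the default period of 100 ms


def quota_ms(limit_millicores, period_ms=CFS_PERIOD_MS):
    """CPU time the whole container may use per period, summed over all threads."""
    return limit_millicores / 1000 * period_ms


def ms_until_throttled(limit_millicores, busy_threads, period_ms=CFS_PERIOD_MS):
    """Wall-clock ms into a period before the quota is spent, if threads run in parallel."""
    return min(period_ms, quota_ms(limit_millicores, period_ms) / busy_threads)


def throttled_ratio(throttled_periods, total_periods):
    """Share of periods that hit the limit, from the throttled and total period counters."""
    return throttled_periods / total_periods if total_periods else 0.0


print(quota_ms(500))                         # budget per 100 ms period
print(round(ms_until_throttled(500, 6), 2))  # six busy threads share that budget
print(throttled_ratio(65, 100))              # 65 of 100 periods throttled
```

&lt;p&gt;With six threads runnable at once, the 50ms budget can be gone in under 9ms of wall-clock time, and the container then sits idle for the rest of the period. That idle time is why average CPU looks low while latency climbs.&lt;/p&gt;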

&lt;h3&gt;
  
  
  Attempt 2: the DNS amplification tax
&lt;/h3&gt;

&lt;p&gt;Heroku's &lt;code&gt;resolv.conf&lt;/code&gt; had &lt;code&gt;options ndots:1&lt;/code&gt;. The EKS default is &lt;code&gt;ndots:5&lt;/code&gt;. That one number difference turned &lt;code&gt;api.stripe.com&lt;/code&gt;, which has 2 dots, into roughly 10 DNS packets per lookup because the resolver walks the search domains before trying the name as-is.&lt;/p&gt;

&lt;p&gt;We made about 150,000 Stripe calls a day. That became 1.5 million DNS queries. Across every external integration, around 12 million unnecessary DNS queries a day, and CoreDNS was the thing falling over.&lt;/p&gt;

&lt;p&gt;There was a second trap layered on top. An &lt;code&gt;npm ci&lt;/code&gt; during the Docker build installed a valid dependency tree from the lockfile, just not the same one Heroku's slug cache had been running. A drifted &lt;code&gt;agentkeepalive&lt;/code&gt; version recycled connections every 15 seconds instead of 30, which doubled the lookup rate before we'd even noticed the first problem.&lt;/p&gt;

&lt;p&gt;Lesson: &lt;code&gt;ndots:5&lt;/code&gt; turns every external hostname with fewer than five dots into roughly 10x DNS amplification, and your dependency tree can quietly make it worse.&lt;/p&gt;
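
&lt;p&gt;To see where the 10 comes from, here is a small model of the search-list walk. It is a sketch of glibc-style resolver behavior, not our actual tooling: the function names are hypothetical, the search list is an assumed example for a pod in the &lt;code&gt;default&lt;/code&gt; namespace, and it assumes one A and one AAAA query per candidate name.&lt;/p&gt;

```python
def candidate_names(hostname, search_domains, ndots):
    """Names a glibc-style stub resolver would try, in order, for one lookup."""
    if hostname.endswith("."):
        return [hostname]  # a trailing dot marks the name as fully qualified
    searched = [f"{hostname}.{domain}" for domain in search_domains]
    if hostname.count(".") >= ndots:
        return [hostname] + searched  # enough dots: the name is tried as-is first
    return searched + [hostname]  # too few dots: the search list is walked first


def queries_for(hostname, search_domains, ndots, record_types=2):
    """DNS queries sent when only the bare external name resolves (A plus AAAA by default)."""
    names = candidate_names(hostname, search_domains, ndots)
    return (names.index(hostname) + 1) * record_types


# Assumed search list for a pod in the "default" namespace; the last entry varies by region.
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local", "ec2.internal"]

print(queries_for("api.stripe.com", SEARCH, ndots=5))   # search list walked first
print(queries_for("api.stripe.com", SEARCH, ndots=1))   # bare name tried first
print(queries_for("api.stripe.com.", SEARCH, ndots=5))  # trailing dot skips the search list
```

&lt;p&gt;This prints 10, 2, and 2. At 150,000 Stripe calls a day, the first case is the 1.5 million queries above. The third line hints at one way out: a fully qualified name with a trailing dot never touches the search list.&lt;/p&gt;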

&lt;h3&gt;
  
  
  Attempt 3: the connection pool death spiral
&lt;/h3&gt;

&lt;p&gt;A Tuesday deploy. Thirty seconds later the connection pool was exhausted at 450 against a 400 limit. Sixty seconds in, SIGTERM was being ignored and connections were leaking. Two minutes in, it had exhausted the shared Postgres connections on the Heroku side too, so now both environments were down.&lt;/p&gt;

&lt;p&gt;Root cause was one line in a Dockerfile. &lt;code&gt;CMD npm start&lt;/code&gt; is shell form, which makes PID 1 &lt;code&gt;/bin/sh&lt;/code&gt;, and &lt;code&gt;/bin/sh&lt;/code&gt; swallows SIGTERM instead of forwarding it to its child. The Node process never got the signal, never drained, never shut down cleanly. &lt;code&gt;CMD ["node", "server.js"]&lt;/code&gt; is exec form, PID 1 is &lt;code&gt;node&lt;/code&gt;, and the signal arrives.&lt;/p&gt;

&lt;p&gt;The fix was three things stacked: PgBouncer in transaction mode to cap real connections around 80, exec-form CMD so SIGTERM lands, and an actual SIGTERM handler that drains gracefully.&lt;/p&gt;

&lt;p&gt;Lesson: PID 1 is a contract. Shell form breaks the contract.&lt;/p&gt;
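
&lt;p&gt;The shape of the handler matters as much as having one. Our service was Node, so treat this Python version as a stand-in that shows the pattern only: the handler flips a flag, and the drain runs in normal control flow. The names &lt;code&gt;drain&lt;/code&gt;, &lt;code&gt;in_flight&lt;/code&gt;, and &lt;code&gt;close_pool&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import signal
import threading

shutting_down = threading.Event()


def handle_sigterm(signum, frame):
    # Keep the handler tiny: record the request and let the main loop do the work.
    shutting_down.set()


def install_handler():
    # Must be called from the main thread. It only helps if this process actually
    # receives SIGTERM, which is what exec-form CMD arranges.
    signal.signal(signal.SIGTERM, handle_sigterm)


def drain(in_flight, close_pool):
    """Let in-flight requests finish, then release database connections."""
    finished = [finish() for finish in in_flight]
    close_pool()
    return len(finished)
```

&lt;p&gt;A server would call &lt;code&gt;install_handler()&lt;/code&gt; at startup, stop accepting new requests once &lt;code&gt;shutting_down&lt;/code&gt; is set, and call &lt;code&gt;drain&lt;/code&gt; before exiting so connections go back to the pool instead of leaking.&lt;/p&gt;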

&lt;h3&gt;
  
  
  The pattern
&lt;/h3&gt;

&lt;p&gt;The question I had to sit with: why didn't any of our dashboards catch this? CFS, because defaults are invisible. DNS, because amplification is multiplicative, not additive. The connection pool, because PID 1 betrayed us in a way no metric was watching.&lt;/p&gt;

&lt;p&gt;That's where the talk's spine comes from. We didn't migrate systems. We migrated assumptions. Every platform hides a different class of failure, and the only safe way through is incremental, observable, reversible.&lt;/p&gt;

&lt;h2&gt;
  
  
  The four decisions that mattered most
&lt;/h2&gt;

&lt;p&gt;People ask why we didn't just use AWS directly. The answer is that four decisions cost less to make once at the platform layer than to carry per-team forever. All four are open source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traffic shifting with Istio.&lt;/strong&gt; We rejected DNS-based routing and ALB weighted target groups and landed on Istio. Canary in steps: 5%, 25%, 50%, 100%, with rollback as a single config change that takes seconds, no redeploy, no DNS propagation. Istio is heavy. Our adoption was deliberately light, and mTLS came free with the mesh.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observe before you migrate.&lt;/strong&gt; Prometheus with Thanos for long-term cross-cluster metrics, Grafana showing Heroku and EKS side by side on the same panels, Elastic Stack for centralized structured logging. We collected 2 weeks of baseline before moving a single byte. You cannot migrate what you cannot measure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PR-driven infrastructure with Atlantis.&lt;/strong&gt; Open a PR, Atlantis runs &lt;code&gt;terraform plan&lt;/code&gt;, the diff lands in the PR comment, you approve and comment &lt;code&gt;atlantis apply&lt;/code&gt;, and it executes and audits itself. The on-call engineer at 2am no longer has to wonder who ran apply from their laptop, because nobody does. It also took me out of the critical path as a bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deploys are git commits with Flux.&lt;/strong&gt; HelmRelease resources for declarative deploys, drift detection that auto-corrects the inevitable manual &lt;code&gt;kubectl apply&lt;/code&gt;, and within a month everyone was working through git because it was simply easier than not.&lt;/p&gt;

&lt;p&gt;The database cutover used dual-write to RDS with checksum-validated continuous replication. When we flipped it, the cutover was anticlimactic. That's exactly what we wanted.&lt;/p&gt;
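
&lt;p&gt;The canary progression from the Istio decision above (5%, 25%, 50%, 100%, with rollback as one change) is simple enough to model. This is only a sketch of the stepping logic, with hypothetical names and a hypothetical &lt;code&gt;healthy&lt;/code&gt; signal; in the real setup the weights live in Istio routing config, not in Python.&lt;/p&gt;

```python
CANARY_STEPS = [5, 25, 50, 100]  # percent of traffic sent to the new version


def route_weights(canary_percent):
    """Traffic split between the current version and the canary; always sums to 100."""
    return {"stable": 100 - canary_percent, "canary": canary_percent}


def next_percent(current, healthy):
    """Advance one step while healthy; drop straight to 0 on any failed check."""
    if not healthy:
        return 0  # rollback is a single weight change, no redeploy
    later = [step for step in CANARY_STEPS if step > current]
    return later[0] if later else current


percent = 0
for healthy in (True, True, False):
    percent = next_percent(percent, healthy)
    print(route_weights(percent))
```

&lt;p&gt;The loop advances to 5%, then 25%, then rolls back to 0% on the failed check. The point of the design is that rollback is the same kind of operation as a step forward.&lt;/p&gt;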

&lt;h2&gt;
  
  
  The results
&lt;/h2&gt;

&lt;p&gt;The headline numbers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;API latency p99&lt;/td&gt;
&lt;td&gt;700ms&lt;/td&gt;
&lt;td&gt;70ms&lt;/td&gt;
&lt;td&gt;down 90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deploy time&lt;/td&gt;
&lt;td&gt;45 min&lt;/td&gt;
&lt;td&gt;4 min&lt;/td&gt;
&lt;td&gt;down 91%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly incidents&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;down 83%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deploy frequency&lt;/td&gt;
&lt;td&gt;2/week&lt;/td&gt;
&lt;td&gt;15/day&lt;/td&gt;
&lt;td&gt;up ~50x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly cost&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;td&gt;60%+ lower&lt;/td&gt;
&lt;td&gt;right-sizing + spot + Karpenter&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I don't like a 90% number with no explanation, so here's where the 630ms went. Routing variance accounted for about 250ms: Istio least-connection routing versus Heroku's effectively random routing. Network topology was around 160ms: pod-to-pod inside the VPC instead of a public path with TLS renegotiation. Resource isolation was about 125ms, with CFS throttling going from 65% of periods to under 2%. Connection pooling was the remaining 95ms, from PgBouncer transaction mode.&lt;/p&gt;
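
&lt;p&gt;If you want to check the arithmetic yourself, this short script uses only the numbers quoted above. The variable names are mine; nothing here is measured data beyond what the table and breakdown already state.&lt;/p&gt;

```python
BEFORE_P99_MS = 700
AFTER_P99_MS = 70

# Per-cause savings as quoted in the breakdown, in milliseconds.
SAVINGS_MS = {
    "routing variance": 250,
    "network topology": 160,
    "resource isolation": 125,
    "connection pooling": 95,
}


def percent_drop(before, after):
    """Whole-number percentage reduction from before to after."""
    return round((before - after) * 100 / before)


total_saved = sum(SAVINGS_MS.values())
assert total_saved == BEFORE_P99_MS - AFTER_P99_MS  # the four causes cover all 630 ms

for cause, ms in SAVINGS_MS.items():
    print(f"{cause}: {ms} ms ({ms * 100 / total_saved:.0f}% of the improvement)")
print("p99 latency drop:", percent_drop(BEFORE_P99_MS, AFTER_P99_MS), "%")
```

&lt;p&gt;The same &lt;code&gt;percent_drop&lt;/code&gt; helper reproduces the deploy-time and incident rows of the table (91% and 83%).&lt;/p&gt;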

&lt;p&gt;And the part that belongs in every honest migration post: this absorbed 2 platform engineers full-time for 5 months, plus roughly 30% of 8 application engineers' time. Nothing here was free.&lt;/p&gt;

&lt;h2&gt;
  
  
  Developer experience after the move
&lt;/h2&gt;

&lt;p&gt;Simple for developers means complicated for the platform team, and that's the trade we chose to own. Heroku's superpower was &lt;code&gt;git push heroku main&lt;/code&gt;. We weren't going to beat that, so we got close with an internal developer portal built on Backstage. A scaffolder template stands up a new service in about 5 minutes. Kubernetes complexity stayed our problem, not the developers' problem. That's how a 10-person team scaled to 100 on a platform 2 of us maintained.&lt;/p&gt;

&lt;h2&gt;
  
  
  What almost stopped us
&lt;/h2&gt;

&lt;p&gt;Istio sidecar injection added about 8 seconds to pod startup until we tuned readiness probe timeouts across every service. Flux reconciliation during peak hours triggered rolling restarts until we scheduled reconciliation windows. cert-manager TLS rotation broke active connections until we added graceful connection draining, which we should have had from day one.&lt;/p&gt;

&lt;p&gt;Migration is not over. It's a beginning. We're still working on cost-attribution dashboards in Backstage and evaluating Istio Ambient mode.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we gave back
&lt;/h2&gt;

&lt;p&gt;None of this runs without code other people wrote. So we contributed back: 49 CNCF DevStats contributions in 2026, 22 merged upstream PRs across 14 projects in the last three months, spanning observability, Kubernetes, security, and developer tooling. A cert-manager maintainer's review on one of them, "this is a super cool contribution," is the kind of feedback that makes the loop worth closing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open source is the equalizer
&lt;/h2&gt;

&lt;p&gt;Here's the thing I'll close the talk on. A 2-person platform team in Ontario, Canada ran the same infrastructure stack as companies 100 times our size. The team grew from 10 engineers to 100. The service count went from 47 to 47, still, because the platform absorbed the growth instead of the codebase. The platform team went from 2 people to 2 people.&lt;/p&gt;

&lt;p&gt;That's only possible because thousands of contributors built the tools we stand on. Open source is what let a small team in a mid-market company run infrastructure that used to require a department.&lt;/p&gt;

&lt;h2&gt;
  
  
  Should you migrate?
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;git push heroku main&lt;/code&gt; is still the best deploy UX I've ever used, and half the Fortune 500 still runs on Heroku for good reason. Migrate if you have 2 or more platform engineers, steady scaling pressure, some Kubernetes exposure on the team, and a PaaS limit you've actually hit. Don't migrate yet if you're a solo platform owner, your workload is steady-state, nobody has Kubernetes time, or Heroku still meets your needs.&lt;/p&gt;

&lt;p&gt;If your team isn't ready for the highest-risk, highest-ceiling option, that's not a failure. That's a correct read of your situation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Come say hi
&lt;/h2&gt;

&lt;p&gt;If you're at Open Source Summit North America 2026, the talk is Monday, May 18, 5:25pm CDT, Room 200F at the Minneapolis Convention Center. I'll hang around after for the parts that don't fit in 25 minutes, and there are plenty.&lt;/p&gt;

&lt;p&gt;Slides and the full list of the 22 merged PRs are at phonotech.ca/ossna26.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>aws</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Kubernetes v1.36 Drops April 22: What Platform Engineers Actually Need to Know</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Sat, 18 Apr 2026 04:54:58 +0000</pubDate>
      <link>https://dev.to/mateenali66/kubernetes-v136-drops-april-22-what-platform-engineers-actually-need-to-know-3l81</link>
      <guid>https://dev.to/mateenali66/kubernetes-v136-drops-april-22-what-platform-engineers-actually-need-to-know-3l81</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Kubernetes v1.36 releases April 22, 2026. The headline features are DRA GPU partitioning, workload-aware preemption for AI/ML jobs, and the permanent removal of the gitRepo volume plugin. Ingress-nginx is also officially retired. If you run AI inference workloads or care about cluster security, this release is not optional reading.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Release Matters More Than Most
&lt;/h2&gt;

&lt;p&gt;The CNCF's 2025 annual survey dropped a number that stopped a lot of people mid-scroll: 66% of organizations hosting generative AI models now use Kubernetes for some or all of their inference workloads. That's not a trend, that's a fait accompli. Kubernetes is the AI compute substrate whether you planned for it or not.&lt;/p&gt;

&lt;p&gt;v1.36 is the release that leans into that reality. The bulk of the new work is in Dynamic Resource Allocation (DRA), gang scheduling, and topology-aware placement, all of which exist because running distributed AI/ML jobs on Kubernetes has historically been painful. This release makes it less painful.&lt;/p&gt;

&lt;p&gt;But there are also breaking changes and security fixes that affect everyone, not just the ML crowd. Let me walk through what actually matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Breaking Changes First
&lt;/h2&gt;

&lt;h3&gt;
  
  
  gitRepo Volume Plugin: Gone for Good
&lt;/h3&gt;

&lt;p&gt;If you're still using &lt;code&gt;gitRepo&lt;/code&gt; volumes, stop reading and go fix that right now. The plugin has been deprecated since v1.11 and is now permanently disabled in v1.36. No feature flag, no workaround.&lt;/p&gt;

&lt;p&gt;The reason it's gone is serious: gitRepo allowed attackers to run code as root on the node. It was a known attack vector for years. The right replacement is an init container running &lt;code&gt;git clone&lt;/code&gt;, or a git-sync sidecar. Both are well-documented and production-proven.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before (broken in v1.36)&lt;/span&gt;
&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;code&lt;/span&gt;
    &lt;span class="na"&gt;gitRepo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;repository&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://github.com/example/repo"&lt;/span&gt;
      &lt;span class="na"&gt;revision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;main"&lt;/span&gt;

&lt;span class="c1"&gt;# After: use an init container&lt;/span&gt;
&lt;span class="na"&gt;initContainers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;git-sync&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.k8s.io/git-sync/git-sync:v4.2.1&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--repo=https://github.com/example/repo&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--ref=main&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--root=/git&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--one-time&lt;/span&gt;
    &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;code&lt;/span&gt;
        &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/git&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Ingress-NGINX Is Retired
&lt;/h3&gt;

&lt;p&gt;SIG Network and the Security Response Committee retired ingress-nginx on March 24, 2026. No more releases, no more security patches. Existing deployments keep running, but you're on your own for CVEs from here.&lt;/p&gt;

&lt;p&gt;The community's recommended alternatives are Envoy Gateway (part of the CNCF-graduated Envoy project), Cilium Gateway API, and Traefik. If you're on ingress-nginx in production, this is your migration window. Don't wait for the next CVE to force your hand.&lt;/p&gt;

&lt;h3&gt;
  
  
  service.spec.externalIPs Deprecated
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;externalIPs&lt;/code&gt; field in Service specs is being deprecated (full removal planned for v1.43). It's been a known vector for man-in-the-middle attacks since CVE-2020-8554. You'll see deprecation warnings starting in v1.36. Migrate to LoadBalancer services, NodePort, or Gateway API.&lt;/p&gt;

&lt;h2&gt;
  
  
  The AI/ML Features That Actually Change How You Work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  DRA: Partitionable Devices (Beta)
&lt;/h3&gt;

&lt;p&gt;This is the one I'm most excited about. v1.36 promotes DRA support for partitionable devices to beta, meaning it's enabled by default. A single GPU can now be split into multiple logical units and allocated to different workloads.&lt;/p&gt;

&lt;p&gt;Before this, if you had an H100 and a workload that only needed 20% of it, you either wasted 80% or ran a separate MIG configuration outside Kubernetes. Now the scheduler handles it natively.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;resource.k8s.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ResourceClaim&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;partial-gpu&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;devices&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu-slice&lt;/span&gt;
      &lt;span class="na"&gt;deviceClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia.com/gpu&lt;/span&gt;
      &lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="c1"&gt;# Request a partition, not the whole device&lt;/span&gt;
      &lt;span class="na"&gt;selectors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;expression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;device.attributes["nvidia.com/gpu"].partitionable == &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For platform teams running shared GPU clusters, this is a significant cost lever. You can pack more inference workloads onto the same hardware without sacrificing isolation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Workload-Aware Preemption (Alpha)
&lt;/h3&gt;

&lt;p&gt;Standard Kubernetes preemption works pod-by-pod. For distributed AI/ML jobs, that's a disaster: preempt one pod from a training job and the whole job stalls, wasting all the resources it's still holding.&lt;/p&gt;

&lt;p&gt;v1.36 introduces workload-aware preemption via &lt;code&gt;PodGroups&lt;/code&gt;. The scheduler now treats a group of related pods as a single entity. When it needs to make room for a high-priority job, it preempts entire groups rather than individual pods.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scheduling.k8s.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodGroup&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;training-job-a&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;minMember&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;
  &lt;span class="na"&gt;priorityClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;high-priority&lt;/span&gt;
  &lt;span class="na"&gt;gangSchedulingPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;disruptionMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodGroup&lt;/span&gt;  &lt;span class="c1"&gt;# preempt the whole group, not individual pods&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is alpha, so it's off by default. But if you're running Kueue or JobSet for batch AI workloads, this is worth enabling in a test cluster now.&lt;/p&gt;
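
&lt;p&gt;A toy model makes the difference concrete. This is not the scheduler's algorithm, just a sketch under simple assumptions: two hypothetical 4-pod jobs at equal priority, and a need to free four pods. All names are invented for illustration.&lt;/p&gt;

```python
from collections import Counter, defaultdict


def preempt_by_pod(pods, needed):
    """Evict the lowest-priority pods one at a time, ignoring group membership."""
    return sorted(pods, key=lambda pod: pod["priority"])[:needed]


def preempt_by_group(pods, needed):
    """Evict whole groups, lowest priority first, until enough pods are freed."""
    groups = defaultdict(list)
    for pod in pods:
        groups[pod["group"]].append(pod)
    victims = []
    for members in sorted(groups.values(), key=lambda m: m[0]["priority"]):
        if len(victims) >= needed:
            break
        victims.extend(members)
    return victims


def stalled_groups(pods, victims):
    """Groups that lost some but not all members: stalled, yet still holding resources."""
    size = Counter(pod["group"] for pod in pods)
    lost = Counter(pod["group"] for pod in victims)
    return sorted(group for group in lost if lost[group] != size[group])


# Two hypothetical 4-pod training jobs at equal priority, interleaved on the cluster.
PODS = [{"name": f"{job}-{i}", "group": job, "priority": 10}
        for i in range(4) for job in ("job-a", "job-b")]

print(stalled_groups(PODS, preempt_by_pod(PODS, 4)))    # both jobs half-evicted
print(stalled_groups(PODS, preempt_by_group(PODS, 4)))  # one job fully evicted, none stalled
```

&lt;p&gt;Pod-by-pod eviction frees four pods but leaves both jobs stalled with half their members still holding GPUs. Group eviction frees the same four pods and leaves the other job fully intact.&lt;/p&gt;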

&lt;h3&gt;
  
  
  Pod-Level Resource Managers (Alpha)
&lt;/h3&gt;

&lt;p&gt;For HPC and AI/ML workloads, NUMA alignment matters. Previously, the Topology Manager only worked at the container level. If you had a training container plus logging and monitoring sidecars in the same pod, you couldn't guarantee they all landed on the same NUMA node.&lt;/p&gt;

&lt;p&gt;v1.36 adds pod-scope resource management: you can now set &lt;code&gt;pod.spec.resources&lt;/code&gt; and have the Topology Manager treat the entire pod as a single scheduling unit. All containers get resources from the same NUMA node.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16"&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;64Gi"&lt;/span&gt;
  &lt;span class="na"&gt;topologySpreadConstraints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;maxSkew&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;topologyKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;topology.kubernetes.io/numa-node&lt;/span&gt;
      &lt;span class="na"&gt;whenUnsatisfiable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DoNotSchedule&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  DRA Resource Availability Visibility (Alpha)
&lt;/h3&gt;

&lt;p&gt;Finally, a native way to answer "how many GPUs are free in this cluster?" without writing custom tooling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create &lt;span class="nt"&gt;-f&lt;/span&gt; - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
apiVersion: resource.k8s.io/v1alpha1
kind: ResourcePoolStatusRequest
metadata:
  name: check-gpus
spec:
  driver: nvidia.com/gpu
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;kubectl get rpsr/check-gpus &lt;span class="nt"&gt;-o&lt;/span&gt; yaml
&lt;span class="c"&gt;# Returns: totalDevices, allocatedDevices, availableDevices per node&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is alpha, but it's the kind of operational visibility that platform teams have been hacking around for years.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Stability Improvements
&lt;/h2&gt;

&lt;h3&gt;
  
  
  SELinux Volume Labeling: Now GA
&lt;/h3&gt;

&lt;p&gt;Faster pod startup on SELinux-enforcing systems. This replaces recursive file relabeling with a single mount-time label, which can cut pod startup time significantly on large volumes. It's been in beta since v1.28 and is now stable and on by default.&lt;/p&gt;

&lt;p&gt;If you're running RHEL or any SELinux-enforcing OS, you'll notice this immediately.&lt;/p&gt;

&lt;h3&gt;
  
  
  External ServiceAccount Token Signing: GA
&lt;/h3&gt;

&lt;p&gt;The kube-apiserver can now delegate token signing to external KMS or HSM systems. For clusters with strict key management requirements (financial services, healthcare, government), this removes a significant compliance gap.&lt;/p&gt;

&lt;h3&gt;
  
  
  Graceful Leader Transition (Alpha)
&lt;/h3&gt;

&lt;p&gt;Control plane components (kube-controller-manager, kube-scheduler) used to call &lt;code&gt;os.Exit()&lt;/code&gt; when losing leader election, forcing a full restart. v1.36 introduces graceful transitions: the component moves to follower state and re-enters the election without restarting. Faster failover, less noise in your control plane logs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stale Controller Mitigation (Alpha)
&lt;/h3&gt;

&lt;p&gt;Large clusters with high churn have always had a subtle bug: a controller creates a resource, its cache hasn't updated yet, and it tries to create the same resource again. v1.36 adds cache freshness tracking so controllers check whether their local state is current before reconciling. Fewer duplicate creates, fewer spurious errors in busy clusters.&lt;/p&gt;
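
&lt;p&gt;Here is the failure mode and the fix in miniature. This is a toy sketch, not the controller-runtime or client-go implementation: the classes, the integer versions, and the &lt;code&gt;check_freshness&lt;/code&gt; switch are all assumptions made for illustration.&lt;/p&gt;

```python
class ToyApi:
    """Stand-in API server that hands out an increasing version on every write."""

    def __init__(self):
        self.objects = set()
        self.version = 0

    def create(self, name):
        self.objects.add(name)
        self.version += 1
        return self.version


class ToyController:
    """Ensures one child object exists, reading only from its local cache."""

    def __init__(self):
        self.cached_children = set()  # what the local cache currently shows
        self.cache_version = 0        # newest version the cache has observed
        self.last_write_version = 0   # version returned by our own last write
        self.create_calls = 0

    def sync_cache(self, api):
        self.cached_children = set(api.objects)
        self.cache_version = api.version

    def reconcile(self, api, check_freshness):
        if check_freshness and self.last_write_version > self.cache_version:
            return "requeue"  # the cache has not caught up with our own write yet
        if "child" not in self.cached_children:
            self.create_calls += 1
            self.last_write_version = api.create("child")
        return "done"


for check in (False, True):
    api, ctrl = ToyApi(), ToyController()
    ctrl.reconcile(api, check)  # creates the child
    ctrl.reconcile(api, check)  # runs again before the cache has synced
    print("freshness check:", check, "create calls:", ctrl.create_calls)
```

&lt;p&gt;Without the check, the second reconcile issues a duplicate create against a stale cache. With it, the controller requeues until &lt;code&gt;sync_cache&lt;/code&gt; has seen its own write.&lt;/p&gt;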

&lt;h3&gt;
  
  
  HPA Scale-to-Zero (Alpha)
&lt;/h3&gt;

&lt;p&gt;The Horizontal Pod Autoscaler can now scale deployments to zero replicas based on external metrics (queue depth, custom metrics). When the queue is empty, the deployment goes to zero. When work arrives, it scales back up. This is the missing piece for event-driven workloads that don't need to run 24/7.&lt;/p&gt;
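
&lt;p&gt;The scaling rule for a queue-style external metric is roughly a ceiling division. The sketch below is a simplification under stated assumptions: an average-value target per replica, no tolerance band, and no stabilization window. The function and parameter names are hypothetical.&lt;/p&gt;

```python
import math


def desired_replicas(queue_depth, target_per_replica, max_replicas, min_replicas=0):
    """Replica count for an external metric with an average-value target per replica."""
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


for depth in (0, 1, 95, 1000):
    print(depth, desired_replicas(depth, target_per_replica=30, max_replicas=10))
```

&lt;p&gt;An empty queue yields zero replicas, a single message brings one replica back, and the maximum caps the burst. Setting &lt;code&gt;min_replicas=1&lt;/code&gt; in the sketch gives the behavior you get without scale-to-zero.&lt;/p&gt;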

&lt;h2&gt;
  
  
  What to Do Before April 22
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Audit gitRepo volumes.&lt;/strong&gt; Run &lt;code&gt;kubectl get pods -A -o json | jq '.items[].spec.volumes[]? | select(.gitRepo != null)'&lt;/code&gt;. If you get output, you have work to do.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plan your ingress-nginx migration.&lt;/strong&gt; Check &lt;code&gt;kubectl get ingressclass&lt;/code&gt; and &lt;code&gt;kubectl get pods -A | grep ingress-nginx&lt;/code&gt;. If you're running it, pick a replacement and start testing.&lt;/p&gt;&lt;/li&gt;
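&lt;li&gt;&lt;p&gt;&lt;strong&gt;Check for externalIPs usage.&lt;/strong&gt; &lt;code&gt;kubectl get svc -A -o json | jq -r '.items[] | select(.spec.externalIPs != null) | "\(.metadata.namespace)/\(.metadata.name)"'&lt;/code&gt;. Printing the namespace matters because &lt;code&gt;-A&lt;/code&gt; spans all namespaces and Service names can repeat.&lt;/p&gt;&lt;/li&gt;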
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test DRA partitionable devices in staging.&lt;/strong&gt; If you run GPU workloads, exercise this before you upgrade production, because beta means it is enabled by default in v1.36.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Read the full changelog.&lt;/strong&gt; The &lt;a href="https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.36.md" rel="noopener noreferrer"&gt;CHANGELOG-1.36.md&lt;/a&gt; is dense but worth scanning for anything specific to your stack.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
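
&lt;p&gt;If you'd rather script the first and third audits than eyeball jq output, the same checks are easy to express in Python over the JSON that &lt;code&gt;kubectl get ... -A -o json&lt;/code&gt; prints. This is a sketch: the function names and the sample objects are made up, and unlike the jq filter it ignores an empty &lt;code&gt;externalIPs&lt;/code&gt; list.&lt;/p&gt;

```python
def pods_with_gitrepo(pod_list):
    """(namespace, pod, volume) for every pod volume that still uses gitRepo."""
    hits = []
    for pod in pod_list.get("items", []):
        for volume in pod.get("spec", {}).get("volumes") or []:
            if volume.get("gitRepo") is not None:
                meta = pod.get("metadata", {})
                hits.append((meta.get("namespace"), meta.get("name"), volume.get("name")))
    return hits


def services_with_external_ips(service_list):
    """(namespace, service) for every Service with a non-empty externalIPs list."""
    hits = []
    for svc in service_list.get("items", []):
        if svc.get("spec", {}).get("externalIPs"):
            meta = svc.get("metadata", {})
            hits.append((meta.get("namespace"), meta.get("name")))
    return hits


# Hypothetical sample data in the shape kubectl emits with -o json.
SAMPLE_PODS = {"items": [
    {"metadata": {"namespace": "web", "name": "legacy"},
     "spec": {"volumes": [{"name": "code", "gitRepo": {"repository": "https://github.com/example/repo"}}]}},
    {"metadata": {"namespace": "web", "name": "modern"},
     "spec": {"volumes": [{"name": "code", "emptyDir": {}}]}},
]}
SAMPLE_SERVICES = {"items": [
    {"metadata": {"namespace": "edge", "name": "old-lb"}, "spec": {"externalIPs": ["203.0.113.10"]}},
    {"metadata": {"namespace": "edge", "name": "api"}, "spec": {"type": "ClusterIP"}},
]}

print(pods_with_gitrepo(SAMPLE_PODS))
print(services_with_external_ips(SAMPLE_SERVICES))
```

&lt;p&gt;To run it against a live cluster, parse the kubectl output with &lt;code&gt;json.load&lt;/code&gt; and pass the result to either function.&lt;/p&gt;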

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;v1.36 isn't a flashy release. There's no single feature that rewrites how Kubernetes works. It is a release that takes the AI/ML workload story seriously at the scheduler and resource allocation level, while cleaning up years of accumulated security debt.&lt;/p&gt;

&lt;p&gt;The gitRepo removal and ingress-nginx retirement are overdue. The DRA work is genuinely new capability. And the gang scheduling improvements are the kind of thing that makes distributed training jobs actually reliable on Kubernetes instead of just theoretically possible.&lt;/p&gt;

&lt;p&gt;If you're running AI inference at scale, v1.36 is the release you've been waiting for. If you're running anything else, it's a solid maintenance release with a few security items you can't ignore.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/blog/2026/03/30/kubernetes-v1-36-sneak-peek/" rel="noopener noreferrer"&gt;Kubernetes v1.36 Sneak Peek&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://palark.com/blog/kubernetes-1-36-release-features/" rel="noopener noreferrer"&gt;Palark: Deep Dive into v1.36 Alpha Features&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cncf.io/announcements/2026/01/20/kubernetes-established-as-the-de-facto-operating-system-for-ai-as-production-use-hits-82-in-2025-cncf-annual-cloud-native-survey/" rel="noopener noreferrer"&gt;CNCF 2025 Annual Survey&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/blog/2025/11/11/ingress-nginx-retirement/" rel="noopener noreferrer"&gt;Ingress-NGINX Retirement Announcement&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/" rel="noopener noreferrer"&gt;DRA Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloudcomputing</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>ingress-nginx Is Dead: How I Migrated to Gateway API Before It Became a Liability</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Tue, 07 Apr 2026 18:15:05 +0000</pubDate>
      <link>https://dev.to/mateenali66/ingress-nginx-is-dead-how-i-migrated-to-gateway-api-before-it-became-a-liability-2815</link>
      <guid>https://dev.to/mateenali66/ingress-nginx-is-dead-how-i-migrated-to-gateway-api-before-it-became-a-liability-2815</guid>
      <description>&lt;p&gt;ingress-nginx was archived on March 24, 2026 after a string of critical CVEs including a 9.8 CVSS unauthenticated RCE. Gateway API v1.4 is the Kubernetes project's successor to the Ingress resource. I used ingress2gateway 1.0 to convert 40+ Ingress resources to HTTPRoutes, validated the output, and cut over with zero downtime. Here's exactly how I did it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Happened
&lt;/h2&gt;

&lt;p&gt;In March 2025, CVE-2025-1974 (dubbed "IngressNightmare") dropped: a CVSS 9.8 unauthenticated remote code execution vulnerability in ingress-nginx's admission webhook. Any attacker with network access to the webhook could execute arbitrary code inside the controller pod, which typically has broad cluster permissions. That was bad enough on its own.&lt;/p&gt;

&lt;p&gt;Then came 2026. Four more HIGH-severity CVEs landed in quick succession:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CVE&lt;/th&gt;
&lt;th&gt;Severity&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CVE-2025-1974&lt;/td&gt;
&lt;td&gt;CRITICAL 9.8&lt;/td&gt;
&lt;td&gt;Unauthenticated RCE via admission webhook&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CVE-2026-1580&lt;/td&gt;
&lt;td&gt;HIGH&lt;/td&gt;
&lt;td&gt;Config injection leading to privilege escalation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CVE-2026-24512&lt;/td&gt;
&lt;td&gt;HIGH&lt;/td&gt;
&lt;td&gt;Path injection through nginx config manipulation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CVE-2026-24513&lt;/td&gt;
&lt;td&gt;HIGH&lt;/td&gt;
&lt;td&gt;Authentication bypass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CVE-2026-24514&lt;/td&gt;
&lt;td&gt;HIGH&lt;/td&gt;
&lt;td&gt;Annotation abuse for unauthorized access&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On March 24, 2026, the ingress-nginx repository was officially archived. Read-only. No more patches. No more CVE fixes. If you're still running it, you're running unpatched software with known critical vulnerabilities.&lt;/p&gt;

&lt;p&gt;This wasn't a surprise deprecation. The Kubernetes community had been building Gateway API for years as the successor to the Ingress resource. But the CVE storm turned "migrate when convenient" into "migrate now."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75nc6whgvdkzbi6s9rja.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75nc6whgvdkzbi6s9rja.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Gateway API: What Actually Changed
&lt;/h2&gt;

&lt;p&gt;Gateway API isn't just "Ingress v2." It fundamentally changes how traffic routing is modeled in Kubernetes by splitting responsibilities across three layers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff104yqs3037jx07ljf8j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff104yqs3037jx07ljf8j.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: GatewayClass (Infrastructure Admin)
&lt;/h3&gt;

&lt;p&gt;The infrastructure team defines which gateway implementations are available. Think of it as the "which load balancer technology" decision.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway.networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GatewayClass&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production-gateway&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;controllerName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway.envoyproxy.io/gatewayclass-controller&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Layer 2: Gateway (Cluster Operator)
&lt;/h3&gt;

&lt;p&gt;The platform team creates Gateway resources that bind to a GatewayClass. This is where you define listeners, ports, TLS certificates, and which namespaces can attach routes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway.networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Gateway&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main-gateway&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway-infra&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;gatewayClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production-gateway&lt;/span&gt;
  &lt;span class="na"&gt;listeners&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https&lt;/span&gt;
      &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTPS&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;443&lt;/span&gt;
      &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Terminate&lt;/span&gt;
        &lt;span class="na"&gt;certificateRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;wildcard-tls&lt;/span&gt;
      &lt;span class="na"&gt;allowedRoutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Selector&lt;/span&gt;
          &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;gateway-access&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
      &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTP&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
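
&lt;p&gt;For the &lt;code&gt;Selector&lt;/code&gt; rule above to admit a route, the application namespace has to carry the matching label. A minimal sketch, assuming an application namespace called &lt;code&gt;my-api&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Namespace
metadata:
  name: my-api
  labels:
    # Must match the matchLabels selector on the Gateway's https listener
    gateway-access: "true"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Routes from namespaces without this label should not attach to the HTTPS listener. The HTTP listener sets no &lt;code&gt;allowedRoutes&lt;/code&gt;, so it falls back to the default of accepting routes only from the Gateway's own namespace.&lt;/p&gt;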



&lt;h3&gt;
  
  
  Layer 3: HTTPRoute (Application Developer)
&lt;/h3&gt;

&lt;p&gt;Application teams define their own routing rules without touching the gateway configuration. They just reference the Gateway they want to attach to.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway.networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTPRoute&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-api&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-api&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;parentRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main-gateway&lt;/span&gt;
      &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway-infra&lt;/span&gt;
  &lt;span class="na"&gt;hostnames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api.example.com"&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;matches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PathPrefix&lt;/span&gt;
            &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/v1&lt;/span&gt;
      &lt;span class="na"&gt;backendRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-service&lt;/span&gt;
          &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This separation matters because it maps to how teams actually operate. Infrastructure admins pick the implementation. Platform engineers configure the gateway. App developers define their routes. Nobody steps on each other's toes, and RBAC enforces the boundaries.&lt;/p&gt;
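
&lt;p&gt;The RBAC boundary for the application layer can be sketched as a namespaced Role that grants access to HTTPRoutes only. The Role name and the group name are hypothetical placeholders, not values from my cluster:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: httproute-editor        # hypothetical name
  namespace: my-api
rules:
  # App developers manage routes, but not Gateways or GatewayClasses
  - apiGroups: ["gateway.networking.k8s.io"]
    resources: ["httproutes"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: httproute-editor
  namespace: my-api
subjects:
  - kind: Group
    name: my-api-developers     # hypothetical group from your identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: httproute-editor
  apiGroup: rbac.authorization.k8s.io
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the Role lists only &lt;code&gt;httproutes&lt;/code&gt;, members of the group cannot edit the Gateway or GatewayClass, which stay with the platform and infrastructure teams.&lt;/p&gt;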

&lt;h3&gt;
  
  
  Why This Is Better Than Annotations
&lt;/h3&gt;

&lt;p&gt;With ingress-nginx, everything was shoved into annotations. Rate limiting, CORS, timeouts, rewrites, all of it crammed into &lt;code&gt;nginx.ingress.kubernetes.io/*&lt;/code&gt; strings that were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Non-standard&lt;/strong&gt;: Every controller had its own annotation format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unvalidated&lt;/strong&gt;: Typo an annotation name? Silent failure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unstructured&lt;/strong&gt;: Complex configs as string values&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-portable&lt;/strong&gt;: Locked to one implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Gateway API uses typed CRD fields. Your IDE autocompletes them. The API server validates them. They work across implementations.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Migration: Using ingress2gateway 1.0
&lt;/h2&gt;

&lt;p&gt;On March 20, 2026, ingress2gateway 1.0 shipped with support for 30+ ingress-nginx annotations. This was the tool that made bulk migration practical.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Install
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;ingress2gateway
&lt;span class="c"&gt;# or&lt;/span&gt;
go &lt;span class="nb"&gt;install &lt;/span&gt;github.com/kubernetes-sigs/ingress2gateway@v1.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Scan and Convert
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Convert everything cluster-wide&lt;/span&gt;
ingress2gateway print &lt;span class="nt"&gt;--providers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ingress-nginx &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; gwapi.yaml

&lt;span class="c"&gt;# Or target a specific namespace&lt;/span&gt;
ingress2gateway print &lt;span class="nt"&gt;--namespace&lt;/span&gt; my-api &lt;span class="nt"&gt;--providers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ingress-nginx &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; gwapi.yaml

&lt;span class="c"&gt;# If you've chosen your implementation, use emitter flags&lt;/span&gt;
ingress2gateway print &lt;span class="nt"&gt;--emitter&lt;/span&gt; envoy-gateway &lt;span class="nt"&gt;--providers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ingress-nginx &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; gwapi.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Review the Output
&lt;/h3&gt;

&lt;p&gt;Here's what a typical translation looks like.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before (Ingress with ingress-nginx annotations):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-api&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/cors-allow-origin&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://app.example.com"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/cors-allow-methods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GET,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;POST,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;OPTIONS"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/enable-cors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/proxy-read-timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;60"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/use-regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ingressClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
  &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;api.example.com&lt;/span&gt;
      &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-tls&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api.example.com&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/api/v[0-9]+/users&lt;/span&gt;
            &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ImplementationSpecific&lt;/span&gt;
            &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users-service&lt;/span&gt;
                &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After (Gateway API HTTPRoute):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway.networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTPRoute&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-api&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;parentRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main-gateway&lt;/span&gt;
      &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway-infra&lt;/span&gt;
  &lt;span class="na"&gt;hostnames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api.example.com"&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;matches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RegularExpression&lt;/span&gt;
            &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/v[0-9]+/users"&lt;/span&gt;
      &lt;span class="na"&gt;filters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ResponseHeaderModifier&lt;/span&gt;
          &lt;span class="na"&gt;responseHeaderModifier&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;set&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Access-Control-Allow-Origin&lt;/span&gt;
                &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://app.example.com"&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Access-Control-Allow-Methods&lt;/span&gt;
                &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GET,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;POST,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;OPTIONS"&lt;/span&gt;
      &lt;span class="na"&gt;timeouts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;backendRequest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;60s&lt;/span&gt;
      &lt;span class="na"&gt;backendRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users-service&lt;/span&gt;
          &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



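&lt;p&gt;The structure is cleaner. CORS headers are explicit. The regex path type is a first-class field instead of being toggled by an annotation. Timeouts are typed durations, not string-encoded integers. The Ingress &lt;code&gt;tls&lt;/code&gt; block has no counterpart in the route because TLS termination now lives on the Gateway listener.&lt;/p&gt;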

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcsmgeji89sk2o5v22s0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcsmgeji89sk2o5v22s0g.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What ingress2gateway Cannot Translate
&lt;/h2&gt;

&lt;p&gt;The tool is good, but it's not magic. Watch for these:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Custom Lua snippets.&lt;/strong&gt; If you used &lt;code&gt;nginx.ingress.kubernetes.io/server-snippet&lt;/code&gt; or &lt;code&gt;configuration-snippet&lt;/code&gt; with custom Lua or raw nginx config, those have no Gateway API equivalent. You'll need to reimplement that logic in your application or use implementation-specific policies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate limiting.&lt;/strong&gt; ingress-nginx rate limiting annotations don't map to standard Gateway API fields. Most implementations offer their own rate limiting CRDs (like Envoy Gateway's &lt;code&gt;BackendTrafficPolicy&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ModSecurity / WAF rules.&lt;/strong&gt; If you had ModSecurity enabled via annotations, you'll need a separate WAF solution or an implementation that supports it natively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session affinity.&lt;/strong&gt; Cookie-based session affinity annotations need implementation-specific configuration in Gateway API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Custom error pages.&lt;/strong&gt; These were nginx-specific and need to be handled at the application level or through implementation extensions.&lt;/p&gt;

&lt;p&gt;ingress2gateway prints a warning for each annotation it can't convert. Read every warning. I found three services that would have lost their rate limiting configs, which would have caused issues in production.&lt;/p&gt;
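
&lt;p&gt;For the rate limiting gap, an implementation-specific policy might look like the sketch below. It assumes Envoy Gateway and its &lt;code&gt;BackendTrafficPolicy&lt;/code&gt; CRD. The policy name, the target route, and the 100 requests per minute limit are hypothetical, and the field names reflect my understanding of the &lt;code&gt;v1alpha1&lt;/code&gt; API, so check them against the docs for the version you run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: my-api-ratelimit        # hypothetical name
  namespace: my-api
spec:
  # Attach the policy to an existing HTTPRoute in the same namespace
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: my-api
  rateLimit:
    type: Local                 # enforced per Envoy instance, not cluster-wide
    local:
      rules:
        - limit:
            requests: 100       # hypothetical limit
            unit: Minute
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A local limit is counted separately by each proxy replica, so the effective ceiling scales with the number of replicas. That is not necessarily the same behavior the old nginx annotation gave you, so verify it under load.&lt;/p&gt;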

&lt;h2&gt;
  
  
  Choosing a Gateway API Implementation
&lt;/h2&gt;

&lt;p&gt;Gateway API is a spec. You need an implementation. Here's how I evaluated the main options:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Implementation&lt;/th&gt;
&lt;th&gt;Backed By&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Envoy Gateway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Envoy Proxy / CNCF&lt;/td&gt;
&lt;td&gt;General purpose, feature-rich&lt;/td&gt;
&lt;td&gt;Strong community, good docs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;kgateway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Solo.io&lt;/td&gt;
&lt;td&gt;Advanced traffic management&lt;/td&gt;
&lt;td&gt;Commercial support available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cilium Gateway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Isovalent/Cisco&lt;/td&gt;
&lt;td&gt;eBPF-native networking&lt;/td&gt;
&lt;td&gt;Great if you already run Cilium CNI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NGINX Gateway Fabric&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;F5/NGINX&lt;/td&gt;
&lt;td&gt;Teams familiar with nginx&lt;/td&gt;
&lt;td&gt;Uses nginx under the hood&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Istio Waypoint&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google/Solo.io&lt;/td&gt;
&lt;td&gt;Service mesh integration&lt;/td&gt;
&lt;td&gt;If you're already on Istio&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I went with Envoy Gateway. It's CNCF-backed, has broad feature coverage, and doesn't require buying into a service mesh. The &lt;code&gt;--emitter envoy-gateway&lt;/code&gt; flag in ingress2gateway generates implementation-specific extensions where needed, which saved manual work.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Migration Checklist
&lt;/h2&gt;

&lt;p&gt;Here's the checklist I followed. Steal it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pre-migration:
[ ] Inventory all Ingress resources: kubectl get ingress --all-namespaces
[ ] Document custom annotations per Ingress
[ ] Identify any custom nginx configs (ConfigMap, snippets)
[ ] Install Gateway API CRDs: kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/standard-install.yaml
[ ] Deploy chosen Gateway API implementation

Conversion:
[ ] Run ingress2gateway print and capture output
[ ] Review ALL warnings from ingress2gateway
[ ] Manually handle untranslatable annotations
[ ] Create GatewayClass and Gateway resources
[ ] Create ReferenceGrant resources for cross-namespace refs

Validation:
[ ] Apply HTTPRoutes to staging cluster
[ ] Test every endpoint (automated: curl + expected status codes)
[ ] Verify TLS termination works
[ ] Check CORS headers in browser dev tools
[ ] Validate regex paths match correctly
[ ] Load test to confirm no performance regression

Cutover:
[ ] Update DNS or switch load balancer target
[ ] Monitor error rates for 30 minutes
[ ] Keep old Ingress resources (don't delete yet)
[ ] After 48 hours stable: remove old Ingress resources
[ ] Uninstall ingress-nginx controller
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
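
&lt;p&gt;For the ReferenceGrant item in the checklist, here is a minimal sketch. It assumes an HTTPRoute in &lt;code&gt;my-api&lt;/code&gt; that points at a Service in another namespace. The &lt;code&gt;shared-services&lt;/code&gt; namespace and the grant name are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-my-api-routes     # hypothetical name
  # The grant lives in the namespace that owns the referenced object
  namespace: shared-services    # hypothetical namespace
spec:
  from:
    # Who is allowed to reference into this namespace
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: my-api
  to:
    # What they may reference (core API group, hence the empty string)
    - group: ""
      kind: Service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An HTTPRoute attaching to a Gateway in another namespace does not need a ReferenceGrant, since that direction is controlled by the listener's &lt;code&gt;allowedRoutes&lt;/code&gt;. The grant is for backend references, and for Gateways that reference TLS Secrets in other namespaces.&lt;/p&gt;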



&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;After migrating 40+ Ingress resources across 12 namespaces:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Known CVEs&lt;/td&gt;
&lt;td&gt;5 (1 critical)&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Annotation sprawl&lt;/td&gt;
&lt;td&gt;180+ annotations&lt;/td&gt;
&lt;td&gt;0 (typed fields)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-namespace routing&lt;/td&gt;
&lt;td&gt;Manual workarounds&lt;/td&gt;
&lt;td&gt;Native ReferenceGrant&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Downtime during migration&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to complete&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;3 days (including validation)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Don't wait for the archive notice.&lt;/strong&gt; Gateway API has been stable since v1.0 (October 2023). I should have started earlier. The CVE pressure made this more stressful than it needed to be.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ingress2gateway is a starting point, not a finish line.&lt;/strong&gt; It handled about 85% of our config automatically. The remaining 15% required understanding both the old nginx annotations and the new Gateway API model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The three-layer model pays off immediately.&lt;/strong&gt; Within a week of the migration, our app teams were creating their own HTTPRoutes without filing tickets to the platform team. That alone justified the effort.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test regex paths carefully.&lt;/strong&gt; Regex syntax and matching behavior can differ subtly between nginx and Gateway API implementations. I caught two path patterns that matched differently under Envoy than they did under nginx.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keep the old Ingress resources around.&lt;/strong&gt; Don't delete them the moment Gateway API routes are working. Give yourself a rollback window. I kept ours for 48 hours before cleanup.&lt;/p&gt;
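
&lt;p&gt;On the regex lesson: Gateway API leaves &lt;code&gt;RegularExpression&lt;/code&gt; semantics up to the implementation. My understanding is that Envoy expects the expression to match the whole path, while ingress-nginx treated the same pattern more like a prefix. Under that assumption, a pattern that should also cover sub-paths needs an explicit tail. A sketch based on the users route from the conversion example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-api
spec:
  parentRefs:
    - name: main-gateway
      namespace: gateway-infra
  hostnames:
    - "api.example.com"
  rules:
    - matches:
        - path:
            type: RegularExpression
            # Optional tail so /api/v2/users/42 still matches under whole-path matching
            value: "/api/v[0-9]+/users(/.*)?"
      backendRefs:
        - name: users-service
          port: 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Whatever your implementation does, test each regex route against both paths that must match and paths that must not.&lt;/p&gt;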




&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://gateway-api.sigs.k8s.io/" rel="noopener noreferrer"&gt;Gateway API Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kubernetes-sigs/ingress2gateway" rel="noopener noreferrer"&gt;ingress2gateway GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nvd.nist.gov/vuln/detail/CVE-2025-1974" rel="noopener noreferrer"&gt;CVE-2025-1974 Advisory&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gateway-api.sigs.k8s.io/blog/" rel="noopener noreferrer"&gt;Gateway API v1.4 Release Notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gateway.envoyproxy.io/docs/" rel="noopener noreferrer"&gt;Envoy Gateway Quickstart&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kubernetes/ingress-nginx" rel="noopener noreferrer"&gt;ingress-nginx Archive Notice&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloudnative</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Your Security Scanner Was the Weapon: Inside the Trivy Supply Chain Attack</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Sat, 28 Mar 2026 17:40:45 +0000</pubDate>
      <link>https://dev.to/mateenali66/your-security-scanner-was-the-weapon-inside-the-trivy-supply-chain-attack-2gc</link>
      <guid>https://dev.to/mateenali66/your-security-scanner-was-the-weapon-inside-the-trivy-supply-chain-attack-2gc</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Trivy, the most widely used container scanning action in GitHub Actions, was compromised on March 19, 2026. A threat actor poisoned 76 of its 77 version tags. Every pipeline that ran a scan silently handed over SSH keys, cloud credentials, Kubernetes tokens, and more. The scan appeared to succeed. You'd never know.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I've had Trivy in my pipelines for years. Container scanning on every PR, every merge, every deploy. It's one of those things you set up once and stop thinking about, which is exactly what makes this attack so effective.&lt;/p&gt;

&lt;p&gt;On March 19, 2026, a threat actor group called TeamPCP force-pushed malicious commits to 76 of the 77 version tags in the &lt;code&gt;aquasecurity/trivy-action&lt;/code&gt; GitHub repository. All 7 tags in &lt;code&gt;aquasecurity/setup-trivy&lt;/code&gt; were also compromised. If your workflow referenced Trivy by a tag (which is how basically everyone references GitHub Actions), you were running their code.&lt;/p&gt;

&lt;p&gt;The scanner still ran. Your pipeline still went green. You had no idea.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Actually Happened
&lt;/h2&gt;

&lt;p&gt;This attack didn't start on March 19. It started weeks earlier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gg7vrokeqnj1aspiraw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gg7vrokeqnj1aspiraw.png" alt=" " width="800" height="207"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Late February 2026:&lt;/strong&gt; An automated bot called "hackerbot-claw" exploited a misconfigured GitHub Actions workflow and stole a privileged Personal Access Token from Aqua Security's CI environment. The attacker used this to push malware to the Trivy VS Code extension on Open VSX.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 1:&lt;/strong&gt; Aqua Security disclosed the incident publicly via a GitHub discussion and rotated credentials. Except the rotation was incomplete. One service account, one PAT, one residual access path, still live.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 19, 17:43 UTC:&lt;/strong&gt; Using the still-valid credentials, TeamPCP force-pushed malicious commits to 76 of 77 tags in &lt;code&gt;trivy-action&lt;/code&gt; and all 7 tags in &lt;code&gt;setup-trivy&lt;/code&gt;. The compromised commits spoofed legitimate maintainer identities. GitHub itself flagged them with "This commit does not belong to any branch on this repository", but that warning is easy to miss in a workflow log.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 19, 18:22 UTC:&lt;/strong&gt; A rogue commit published a malicious Trivy binary as &lt;code&gt;v0.69.4&lt;/code&gt; across every distribution channel simultaneously: GitHub Releases, GHCR, Docker Hub, ECR Public, deb/rpm repositories, and get.trivy.dev.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 20, 05:40 UTC:&lt;/strong&gt; Aqua remediated the trivy-action tags. The window was roughly 12 hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 22:&lt;/strong&gt; The attacker pushed additional malicious Docker Hub images (&lt;code&gt;v0.69.5&lt;/code&gt;, &lt;code&gt;v0.69.6&lt;/code&gt;, &lt;code&gt;latest&lt;/code&gt;) using separately compromised Docker Hub credentials, bypassing all GitHub controls. Same day, 44 repositories in Aqua's &lt;code&gt;aquasec-com&lt;/code&gt; GitHub org were defaced using a stolen service account token that bridged both orgs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 24:&lt;/strong&gt; The campaign expanded to Checkmarx KICS and LiteLLM PyPI packages (&lt;code&gt;1.82.7&lt;/code&gt;, &lt;code&gt;1.82.8&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The takeaway here is not just that a tool got compromised. It's that incomplete remediation turned a single breach into a three-week campaign.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Payload Did
&lt;/h2&gt;

&lt;p&gt;This is the part that should make you uncomfortable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfedlzk2zryp9rtt623m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfedlzk2zryp9rtt623m.png" alt=" " width="800" height="2042"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The malicious &lt;code&gt;entrypoint.sh&lt;/code&gt; prepended about 105 lines of attack code before the legitimate Trivy scanner logic. The scan completed normally. Your logs looked fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Process enumeration.&lt;/strong&gt; The script scanned &lt;code&gt;/proc/*/environ&lt;/code&gt; across all runner processes, extracting environment-level secrets, filtering for anything with &lt;code&gt;env&lt;/code&gt; or &lt;code&gt;ssh&lt;/code&gt; in the name.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2: Memory scraping.&lt;/strong&gt; Here's where it gets clever. On GitHub-hosted runners, a base64-encoded Python script located the &lt;code&gt;Runner.Worker&lt;/code&gt; process, read its memory maps via &lt;code&gt;/proc/{PID}/maps&lt;/code&gt;, and scraped raw process memory via &lt;code&gt;/proc/{PID}/mem&lt;/code&gt;. It was targeting GitHub Actions secrets specifically, looking for JSON structures matching &lt;code&gt;"name":{"value":"...","isSecret":true}&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This matters because it bypassed GitHub Actions' secret masking. Values that would show as &lt;code&gt;***&lt;/code&gt; in your logs were extracted in plaintext from process memory. There's no log artifact. Nothing looked wrong.&lt;/p&gt;

&lt;p&gt;On self-hosted runners, a different payload did broad filesystem harvesting across 50+ target paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SSH private keys and configs&lt;/li&gt;
&lt;li&gt;Cloud credentials (AWS, GCP, Azure)&lt;/li&gt;
&lt;li&gt;Kubernetes configs and service account tokens&lt;/li&gt;
&lt;li&gt;Docker registry credentials&lt;/li&gt;
&lt;li&gt;Database credentials (MySQL, PostgreSQL, MongoDB, Redis)&lt;/li&gt;
&lt;li&gt;Terraform state files and variables&lt;/li&gt;
&lt;li&gt;Ansible vaults and Helm values&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.env&lt;/code&gt; files and API keys&lt;/li&gt;
&lt;li&gt;Git credentials and shell history&lt;/li&gt;
&lt;li&gt;Cryptocurrency wallet keys&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Stage 3: Encryption.&lt;/strong&gt; Collected data was encrypted with AES-256-CBC using a randomly generated session key, then the session key was wrapped with a hardcoded 4096-bit RSA public key. Only the holder of the matching private key, the attacker, could decrypt it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 4: Exfiltration.&lt;/strong&gt; The primary channel was an HTTPS POST to &lt;code&gt;scan.aquasecurtiy[.]org&lt;/code&gt;, a typosquatted domain designed to look like Aqua Security's legitimate domain (note the transposed 't' and 'i' in 'security'). It resolved to &lt;code&gt;45.148.10.212&lt;/code&gt; in Amsterdam.&lt;/p&gt;

&lt;p&gt;Fallback: if the C2 was unreachable and a GitHub PAT was in the environment, the malware created a public repository named &lt;code&gt;tpcp-docs&lt;/code&gt; and uploaded stolen data as release assets, turning GitHub itself into the exfiltration channel.&lt;/p&gt;
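
&lt;p&gt;If you keep DNS, proxy, or egress logs for your runners, a sweep for the indicators named above is a cheap first check. The sketch below is one way to do that in Python. The function names, and the assumption that the logs are plain-text &lt;code&gt;*.log&lt;/code&gt; files under one directory, are illustrative and not part of any published tooling.&lt;/p&gt;

```python
from pathlib import Path

# Indicators of compromise named in this article.
INDICATORS = (
    "scan.aquasecurtiy.org",  # typosquatted exfiltration domain
    "45.148.10.212",          # address the domain resolved to
    "tpcp-docs",              # fallback exfiltration repository name
)


def find_indicators(text: str) -> list[str]:
    """Return every indicator that appears in the given log text."""
    lowered = text.lower()
    return [ioc for ioc in INDICATORS if ioc in lowered]


def scan_log_dir(log_dir: str) -> dict[str, list[str]]:
    """Map each *.log file under log_dir (hypothetical layout) to its hits."""
    hits = {}
    for path in sorted(Path(log_dir).rglob("*.log")):
        found = find_indicators(path.read_text(errors="replace"))
        if found:
            hits[str(path)] = found
    return hits
```

&lt;p&gt;Treat a hit as confirmation and a miss as inconclusive: the payload left nothing in the workflow logs themselves, so a clean sweep does not clear a runner.&lt;/p&gt;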

&lt;h2&gt;
  
  
  Are You Affected?
&lt;/h2&gt;

&lt;p&gt;Check these specific exposure windows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Affected Versions&lt;/th&gt;
&lt;th&gt;Exposure Window&lt;/th&gt;
&lt;th&gt;Safe&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;trivy binary&lt;/td&gt;
&lt;td&gt;v0.69.4&lt;/td&gt;
&lt;td&gt;~3h (Mar 19)&lt;/td&gt;
&lt;td&gt;v0.69.3 or earlier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;trivy Docker Hub&lt;/td&gt;
&lt;td&gt;v0.69.5, v0.69.6, latest&lt;/td&gt;
&lt;td&gt;~10h (Mar 22–24)&lt;/td&gt;
&lt;td&gt;v0.69.3 or earlier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;trivy-action&lt;/td&gt;
&lt;td&gt;Tags 0.0.1–0.34.2&lt;/td&gt;
&lt;td&gt;~12h (Mar 19–20)&lt;/td&gt;
&lt;td&gt;v0.35.0+ or SHA-pinned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;setup-trivy&lt;/td&gt;
&lt;td&gt;All 7 tags&lt;/td&gt;
&lt;td&gt;~12h (Mar 19–20)&lt;/td&gt;
&lt;td&gt;SHA-pinned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiteLLM PyPI&lt;/td&gt;
&lt;td&gt;1.82.7, 1.82.8&lt;/td&gt;
&lt;td&gt;Mar 24+&lt;/td&gt;
&lt;td&gt;1.82.6 or earlier&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you ran Trivy in any pipeline during those windows and weren't pinning to a commit SHA, you have to assume secrets were stolen. All of them. Every secret accessible from that runner environment.&lt;/p&gt;
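
&lt;p&gt;To triage a long list of workflow runs, it can help to test each run's start time against a window. This is a minimal sketch that encodes only the &lt;code&gt;trivy-action&lt;/code&gt; window from the timeline (March 19, 17:43 UTC to March 20, 05:40 UTC). It assumes timestamps arrive in the ISO 8601 form GitHub's REST API returns and carry a UTC offset; &lt;code&gt;ran_in_window&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```python
from datetime import datetime, timezone

# trivy-action exposure window taken from the timeline above (UTC).
TRIVY_ACTION_WINDOW = (
    datetime(2026, 3, 19, 17, 43, tzinfo=timezone.utc),
    datetime(2026, 3, 20, 5, 40, tzinfo=timezone.utc),
)


def ran_in_window(started_at: str, window=TRIVY_ACTION_WINDOW) -> bool:
    """True if an ISO 8601 timestamp with a UTC offset falls inside the window."""
    # Older Python versions do not accept a trailing "Z", so normalise it.
    run_time = datetime.fromisoformat(started_at.replace("Z", "+00:00"))
    start, end = window
    return end >= run_time >= start
```

&lt;p&gt;Pass a different &lt;code&gt;(start, end)&lt;/code&gt; tuple to check the other rows of the table. A run inside the window still needs a second check that it actually referenced an affected tag rather than a pinned SHA.&lt;/p&gt;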

&lt;h2&gt;
  
  
  What You Need to Change
&lt;/h2&gt;

&lt;p&gt;This is the remediation checklist, ordered by priority.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Rotate first, investigate second
&lt;/h3&gt;

&lt;p&gt;If you were in the exposure window, rotate everything the runner could have touched. Don't wait for confirmation. Treat every secret as compromised:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS access keys and IAM roles&lt;/li&gt;
&lt;li&gt;GCP service account keys&lt;/li&gt;
&lt;li&gt;Azure service principals&lt;/li&gt;
&lt;li&gt;Kubernetes service account tokens&lt;/li&gt;
&lt;li&gt;Docker registry credentials&lt;/li&gt;
&lt;li&gt;SSH keys&lt;/li&gt;
&lt;li&gt;Database credentials&lt;/li&gt;
&lt;li&gt;GitHub PATs and tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Pin actions to commit SHAs
&lt;/h3&gt;

&lt;p&gt;This is the single most effective structural change. Tags are mutable. Commit SHAs are not.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Bad — this is what everyone does, and what got compromised&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@0.24.0&lt;/span&gt;

&lt;span class="c1"&gt;# Good: pinned to a commit SHA, immutable (value shown is illustrative; copy the full 40-character SHA from the repository)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@57a97c7843d7da7a7b4f8ce2a0c4e3b7f0c2e1d&lt;/span&gt;  &lt;span class="c1"&gt;# 0.35.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes, it's more work to update. That's the point. Renovate or Dependabot can automate SHA updates if you configure them for GitHub Actions.&lt;/p&gt;
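
&lt;p&gt;Finding the references that still use tags is easy to script. The sketch below scans workflow text for &lt;code&gt;uses:&lt;/code&gt; lines and reports any ref that is not a full 40-character hexadecimal SHA. It is a rough regex pass, not a YAML parser, and &lt;code&gt;unpinned_actions&lt;/code&gt; is a hypothetical helper name. Quoted values and &lt;code&gt;docker://&lt;/code&gt; references are not handled specially.&lt;/p&gt;

```python
import re

# Matches "uses: owner/repo@ref" and "uses: owner/repo/path@ref".
USES_RE = re.compile(r"^\s*-?\s*uses:\s*([^\s@#]+)@([^\s#]+)", re.MULTILINE)
# A pinned ref is a full-length, lowercase hex commit SHA.
FULL_SHA_RE = re.compile(r"[0-9a-f]{40}")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return action references that are not pinned to a full commit SHA."""
    found = []
    for action, ref in USES_RE.findall(workflow_text):
        if not FULL_SHA_RE.fullmatch(ref):
            found.append(f"{action}@{ref}")
    return found
```

&lt;p&gt;Run it over every file in &lt;code&gt;.github/workflows/&lt;/code&gt; and fail the build when the returned list is non-empty. Local actions referenced by path (&lt;code&gt;./...&lt;/code&gt;) carry no ref and are skipped.&lt;/p&gt;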

&lt;h3&gt;
  
  
  3. Switch to OIDC for cloud authentication
&lt;/h3&gt;

&lt;p&gt;Long-lived cloud credentials in CI are a liability. OIDC lets your runner authenticate to AWS, GCP, or Azure without storing static keys:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# AWS example&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/configure-aws-credentials@v4&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;role-to-assume&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::ACCOUNT:role/github-actions-role&lt;/span&gt;
    &lt;span class="na"&gt;aws-region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



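&lt;p&gt;Nothing to steal if there's nothing stored. The credentials are ephemeral and scoped to the job. The job does need &lt;code&gt;id-token: write&lt;/code&gt; in its &lt;code&gt;permissions&lt;/code&gt; block so it can request the OIDC token.&lt;/p&gt;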

&lt;h3&gt;
  
  
  4. Restrict runner permissions
&lt;/h3&gt;

&lt;p&gt;GitHub Actions runners get &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; by default. Scope it down:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
  &lt;span class="na"&gt;security-events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;
  &lt;span class="c1"&gt;# Nothing else&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most workflows need far less than the default. Less permission means smaller blast radius.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Audit non-human identities
&lt;/h3&gt;

&lt;p&gt;The Trivy attack persisted because one service account credential wasn't rotated. Audit all machine identities in your org:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub PATs: Who issued them? When do they expire? Are they scoped minimally?&lt;/li&gt;
&lt;li&gt;Service accounts: Which ones have write access to release infrastructure?&lt;/li&gt;
&lt;li&gt;Bot accounts: Are any shared across orgs or repositories?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Long-lived, over-privileged service accounts are how a one-time breach becomes a three-week campaign.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Use secret scanning
&lt;/h3&gt;

&lt;p&gt;GitGuardian, GitHub's native secret scanning, or both. The Trivy attacker used GitHub as a fallback exfiltration channel. If your credentials ever end up in a public repo, you want to know in minutes, not days.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Verify binaries before running them
&lt;/h3&gt;

&lt;p&gt;For direct binary downloads (not GitHub Actions), verify checksums:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Download the official checksums&lt;/span&gt;
curl &lt;span class="nt"&gt;-sSL&lt;/span&gt; https://github.com/aquasecurity/trivy/releases/download/v0.69.3/trivy_0.69.3_checksums.txt &lt;span class="nt"&gt;-o&lt;/span&gt; checksums.txt

&lt;span class="c"&gt;# Verify your binary&lt;/span&gt;
&lt;span class="nb"&gt;sha256sum&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; checksums.txt &lt;span class="nt"&gt;--ignore-missing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your pipeline downloads and runs binaries from the internet, add checksum verification as a step.&lt;/p&gt;
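
&lt;p&gt;The same check can run inside a Python-based pipeline step. The sketch below assumes the checksums file uses the &lt;code&gt;sha256sum&lt;/code&gt; layout (hex digest, whitespace, filename); the function names are hypothetical. Unlike &lt;code&gt;--ignore-missing&lt;/code&gt;, it fails when the file has no entry at all.&lt;/p&gt;

```python
import hashlib
from typing import Optional


def sha256_of(path: str) -> str:
    """Hash a file in chunks so large binaries are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_digest(checksums_text: str, filename: str) -> Optional[str]:
    """Look up filename in sha256sum-style text; None if it is not listed."""
    for line in checksums_text.splitlines():
        parts = line.split()
        # sha256sum marks binary-mode entries with a leading "*".
        if len(parts) == 2 and parts[1].lstrip("*") == filename:
            return parts[0].lower()
    return None


def verify(path: str, filename: str, checksums_text: str) -> bool:
    """True only when the file is listed and its digest matches."""
    expected = expected_digest(checksums_text, filename)
    return expected is not None and sha256_of(path) == expected
```

&lt;p&gt;Fetch the checksums for a release you already trust and store them alongside your pipeline config, so the binary and its expected digest do not arrive from the same download at the same moment.&lt;/p&gt;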

&lt;h2&gt;
  
  
  The Real Lesson
&lt;/h2&gt;

&lt;p&gt;The Trivy attack was technically sophisticated, but the root cause is unglamorous: incomplete credential rotation.&lt;/p&gt;

&lt;p&gt;Aqua disclosed the initial breach on March 1 and rotated credentials. One PAT, one service account, one residual access path was left active. That's what TeamPCP used on March 19. The March 22 Docker Hub compromise used yet another separate credential that wasn't in scope of the original remediation.&lt;/p&gt;

&lt;p&gt;When you rotate secrets after a breach, you need to be exhaustive. Enumerate every credential that could have been exposed, every service account that had access, every integration that used a compromised token. Rotation is not a task you do until it feels complete. It's a task you do until you've verified every access path is severed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz4n6nsa5ydv92kuv24i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz4n6nsa5ydv92kuv24i.png" alt=" " width="671" height="2678"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The other lesson: the attack surface for CI/CD is enormous. Your pipeline runs with access to secrets, cloud credentials, and internal infrastructure. When you add a third-party action, you're trusting that maintainer's entire security posture, including their CI, their service accounts, and their credential management practices. SHA pinning doesn't eliminate that trust, but it gives you a stable, auditable point you can reason about.&lt;/p&gt;

&lt;h2&gt;
  
  
  Immediate Checklist
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ ] Check pipeline logs for trivy-action usage between March 19–20
[ ] Check pipeline logs for trivy binary v0.69.4 usage on March 19
[ ] Check for Docker image usage of v0.69.5, v0.69.6, or latest between Mar 22–24
[ ] Rotate all secrets accessible from affected runners
[ ] Update trivy-action to v0.35.0 or pin to SHA
[ ] Check for LiteLLM usage of 1.82.7 or 1.82.8
[ ] Switch cloud auth to OIDC
[ ] Pin all third-party actions to commit SHAs
[ ] Restrict workflow permissions to minimum required
[ ] Audit service accounts and PATs for expiry and scope
[ ] Enable secret scanning on your org
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.crowdstrike.com/en-us/blog/from-scanner-to-stealer-inside-the-trivy-action-supply-chain-compromise/" rel="noopener noreferrer"&gt;CrowdStrike: From Scanner to Stealer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.gitguardian.com/trivys-march-supply-chain-attack-shows-where-secret-exposure-hurts-most/" rel="noopener noreferrer"&gt;GitGuardian: Trivy's March Supply Chain Attack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.legitsecurity.com/blog/the-trivy-supply-chain-compromise-what-happened-and-playbooks-to-respond" rel="noopener noreferrer"&gt;Legit Security: Playbooks to Respond&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.microsoft.com/en-us/security/blog/2026/03/24/detecting-investigating-defending-against-trivy-supply-chain-compromise/" rel="noopener noreferrer"&gt;Microsoft Security Blog: Detecting and Defending&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arcticwolf.com/resources/blog/teampcp-supply-chain-attack-campaign-targets-trivy-checkmarx-kics-and-litellm-potential-downstream-impact-to-additional-projects/" rel="noopener noreferrer"&gt;Arctic Wolf: TeamPCP Campaign Analysis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/aquasecurity/trivy/discussions/10425" rel="noopener noreferrer"&gt;Aqua Security: Official Disclosure&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>security</category>
      <category>devops</category>
      <category>kubernetes</category>
      <category>cicd</category>
    </item>
    <item>
      <title>GitHub Actions costs are leaking, and most teams don't notice until it's too late</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Mon, 16 Mar 2026 05:24:12 +0000</pubDate>
      <link>https://dev.to/mateenali66/github-actions-costs-are-leaking-and-most-teams-dont-notice-until-its-too-late-27d1</link>
      <guid>https://dev.to/mateenali66/github-actions-costs-are-leaking-and-most-teams-dont-notice-until-its-too-late-27d1</guid>
      <description>&lt;p&gt;Two years ago I was working on a connected vehicles platform running 40+ microservices on Kubernetes. CI was healthy, tests were passing, and nobody was paying attention to the GitHub Actions bill until it hit $4,200 in a single month.&lt;/p&gt;

&lt;p&gt;The culprit was a matrix build that someone had extended to cover six Node versions. Nobody noticed because the cost didn't show up anywhere obvious. It wasn't flagged in any alert. The engineers who added the matrix jobs weren't thinking about cost. By the time finance asked the question, the pattern had been running for three months.&lt;/p&gt;

&lt;p&gt;I started looking for a tool that could give us per-workflow cost visibility. Something that would let us answer "which workflows cost the most" and "did this PR make CI more expensive." I didn't find anything that fit, so I built CICosts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;

&lt;p&gt;CICosts installs as a GitHub App and receives a webhook event every time a workflow run completes. It multiplies the runner minutes by GitHub's published pricing for that runner type (Linux, Windows, macOS, self-hosted) and stores the result.&lt;/p&gt;

&lt;p&gt;From there you get a dashboard showing cost by workflow, by repository, by branch, and over time. You can set alerts when a workflow exceeds a threshold. You can see trends, spot regressions after PRs merge, and compare costs across environments.&lt;/p&gt;

&lt;p&gt;The math is straightforward. GitHub charges $0.008/minute for Linux runners, $0.016 for Windows, $0.08 for macOS. If a workflow runs for 12 minutes on Linux, that's $0.096. Not much in isolation. Run it 500 times a day in total across 30 repositories and that's $48 a day, roughly $1,440 over a 30-day month.&lt;/p&gt;
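
&lt;p&gt;That arithmetic fits in a few lines. The sketch below uses only the per-minute rates quoted above. It is an illustration of the calculation, not CICosts' actual code, and the function names are hypothetical.&lt;/p&gt;

```python
# Per-minute rates quoted above, in US dollars.
RATE_PER_MINUTE = {"linux": 0.008, "windows": 0.016, "macos": 0.08}


def run_cost(minutes: float, runner_os: str) -> float:
    """Cost of a single workflow run on the given runner type."""
    return round(minutes * RATE_PER_MINUTE[runner_os], 4)


def monthly_cost(minutes: float, runner_os: str, runs_per_day: int, days: int = 30) -> float:
    """Cost of repeating the same run runs_per_day times for `days` days."""
    return round(run_cost(minutes, runner_os) * runs_per_day * days, 2)


if __name__ == "__main__":
    print(run_cost(12, "linux"))           # one 12-minute Linux run
    print(monthly_cost(12, "linux", 500))  # 500 such runs a day for 30 days
```

&lt;p&gt;Swapping &lt;code&gt;"linux"&lt;/code&gt; for &lt;code&gt;"macos"&lt;/code&gt; in the same call multiplies the result by ten, which is why runner choice tends to matter more than run length.&lt;/p&gt;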

&lt;h2&gt;
  
  
  The common patterns I see
&lt;/h2&gt;

&lt;p&gt;After watching enough CI pipelines, a few patterns account for most of the waste:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Matrix explosions.&lt;/strong&gt; A workflow that tests across 3 OS versions and 4 runtime versions runs 12 jobs per push. If the matrix was added incrementally over time, nobody may have thought through the cumulative cost.&lt;/p&gt;
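
&lt;p&gt;The growth is multiplicative, which is why it is easy to underestimate. A small sketch, with hypothetical dimension names and values, counts the jobs a matrix expands to:&lt;/p&gt;

```python
from itertools import product


def matrix_jobs(matrix: dict[str, list[str]]) -> int:
    """Number of jobs a build matrix expands to: one per combination."""
    return len(list(product(*matrix.values())))


# Hypothetical matrix: 3 operating systems and 4 runtime versions.
example = {
    "os": ["ubuntu", "windows", "macos"],
    "node": ["18", "20", "22", "24"],
}

if __name__ == "__main__":
    print(matrix_jobs(example))
    # Adding one more two-valued dimension doubles the job count.
    print(matrix_jobs({**example, "arch": ["x64", "arm64"]}))
```

&lt;p&gt;This counts plain combinations only and ignores any &lt;code&gt;include&lt;/code&gt; or &lt;code&gt;exclude&lt;/code&gt; entries a real workflow might define.&lt;/p&gt;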

&lt;p&gt;&lt;strong&gt;macOS runners for non-macOS work.&lt;/strong&gt; macOS runners cost 10x more than Linux. They're necessary for iOS builds and sometimes for Homebrew. They're not necessary for most backend services, but they show up there sometimes because someone copied a workflow template.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test parallelism without caching.&lt;/strong&gt; Running tests in parallel is good. Running them in parallel while re-downloading 200MB of dependencies on every run because the cache key is wrong is expensive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nightly builds that nobody needs.&lt;/strong&gt; Workflows scheduled to run nightly that were set up to catch a specific class of bug that was fixed 18 months ago. The schedule never got cleaned up.&lt;/p&gt;

&lt;p&gt;None of these are difficult to fix once you can see them. The problem is visibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it's now open source and free
&lt;/h2&gt;

&lt;p&gt;I built this as a paid SaaS originally. The pricing was too restrictive for a product without an established reputation. If you're asking engineers to add a GitHub App to their organization and trust it with their CI data, "trust us, it's $29/month" is a hard sell when nobody's heard of you.&lt;/p&gt;

&lt;p&gt;The honest version: the product was good and nobody knew about it. That's a distribution problem, not a product problem.&lt;/p&gt;

&lt;p&gt;So the model is now simple. CICosts is MIT licensed, the code is on GitHub, and the hosted version at app.cicosts.dev is free with no usage limits. If your organization needs an SLA or wants a private deployment, that's the enterprise tier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;Install it from GitHub:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://github.com/phonotechnologies/cicosts-app
https://github.com/phonotechnologies/cicosts-api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use the hosted version directly at &lt;a href="https://app.cicosts.dev" rel="noopener noreferrer"&gt;app.cicosts.dev&lt;/a&gt;. Add the GitHub App to your organization, and cost data starts flowing within a few minutes of your next workflow run.&lt;/p&gt;

&lt;p&gt;The setup takes about five minutes. There's no code change required in your repos. The GitHub App receives webhook events automatically once installed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;If I were starting from zero, I'd make it open source from day one and focus entirely on getting the GitHub App installation experience right. The hardest part of a tool like this isn't the cost calculation. It's getting someone to trust it enough to install it.&lt;/p&gt;

&lt;p&gt;Open source makes that easier. You can read the code. You can see exactly what data is being stored and what isn't. That matters when you're asking someone to add an app to their GitHub organization.&lt;/p&gt;




&lt;p&gt;The code is on GitHub under the phonotechnologies organization. PRs welcome, especially around runner pricing updates and new alert types. If you run into something, open an issue.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>github</category>
      <category>cicd</category>
      <category>opensource</category>
    </item>
    <item>
      <title>GitOps for ML in 2026: Treat Your AI Models Like Microservices (Or Watch Them Drift Into Production Chaos)</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Sat, 14 Mar 2026 21:46:50 +0000</pubDate>
      <link>https://dev.to/mateenali66/gitops-for-ml-in-2026-treat-your-ai-models-like-microservices-or-watch-them-drift-into-production-40m2</link>
      <guid>https://dev.to/mateenali66/gitops-for-ml-in-2026-treat-your-ai-models-like-microservices-or-watch-them-drift-into-production-40m2</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Apply the same GitOps discipline you use for application code to ML model deployments, and you get version history, rollback, and promotion gates that actually work, instead of the SSH-and-pray workflow most teams are still running.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;There's a model running in production right now that nobody on your team can explain. It was trained six weeks ago, deployed by someone who's since moved to a different team, and the only record of what version it is lives in a Slack message that's been buried under 4,000 other messages.&lt;/p&gt;

&lt;p&gt;When it starts making bad predictions, what's your rollback plan? If your answer involves SSHing into a server, editing a config file by hand, and hoping the right weights get loaded, you're in the majority. That doesn't make it less of a disaster.&lt;/p&gt;

&lt;p&gt;I spent the better part of last year helping platform teams get their ML deployment story straight. The pattern I kept seeing: teams had decent model training pipelines, reasonable experiment tracking in MLflow, and then a complete gap between "model registered" and "model serving traffic." The gap got filled with shell scripts, manual steps, and a whole lot of tribal knowledge.&lt;/p&gt;

&lt;p&gt;The fix isn't a new tool. It's applying discipline you already have from application deployments to the model deployment layer.&lt;/p&gt;

&lt;p&gt;Before we moved to GitOps for model deployments, a typical promotion cycle looked like this. A data scientist trains a new version, registers it in MLflow, then files a ticket. A platform engineer picks up the ticket, SSH-es into the model server, updates the model path, restarts the serving process, and manually validates that predictions look reasonable. Start to finish: 4 to 6 hours on a good day, longer when the engineer is in meetings or the server is being weird.&lt;/p&gt;

&lt;p&gt;Rollback? There was no rollback. The best-case scenario was that someone remembered what the previous model path was.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Most Teams Try First (And Why It Fails)
&lt;/h2&gt;

&lt;p&gt;The first instinct is usually scripts. Someone writes a deploy.sh that takes a model version as an argument, connects to the serving infrastructure, and handles the update. This is better than pure manual steps, but it fails in a few predictable ways.&lt;/p&gt;

&lt;p&gt;First, scripts don't have memory. You can run deploy.sh with model version 47, then run it again with version 51, and there's no audit trail of who ran what or why. When something goes wrong, you're back to grep-ing through logs and asking around.&lt;/p&gt;

&lt;p&gt;Second, scripts don't handle promotion gates. You can't encode "this model can only go to production if it passed staging validation for 24 hours" in a shell script without it becoming a sprawling mess that nobody wants to maintain.&lt;/p&gt;

&lt;p&gt;Third, and this one bites hardest: scripts assume the current state. If someone manually changes something on the serving infrastructure, your script has no way of detecting that drift. The next run might succeed or fail unpredictably depending on what changed and when.&lt;/p&gt;

&lt;p&gt;MLflow solves the experiment tracking and model registry side well. You get version numbers, artifact storage in S3, stage transitions (Staging, Production), and a clean API. What MLflow doesn't give you is a Kubernetes-native way to declare "this cluster should be running model version 47 right now" and enforce that continuously.&lt;/p&gt;

&lt;p&gt;That's where KServe and ArgoCD come in.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;The full stack has six layers working together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsla8yxyreadpi4g0h7zr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsla8yxyreadpi4g0h7zr.png" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MLflow + S3&lt;/strong&gt; handle model artifacts. Every trained model version gets registered with MLflow, which stores the artifact URI pointing to a path in S3. The URI looks something like &lt;code&gt;s3://ml-models-prod/fraud-detector/v47/model.pkl&lt;/code&gt;. MLflow's registry gives you a version number and stage metadata. The actual weights live in S3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KServe InferenceService&lt;/strong&gt; is the Kubernetes abstraction for serving. Instead of managing a Pod or Deployment by hand, you define an InferenceService custom resource that describes what model to load, from where, and how to scale. KServe handles the rest: downloading the artifact from S3, loading it into the serving framework (Triton, TorchServe, SKLearn Server), and exposing an HTTP endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Git&lt;/strong&gt; holds the desired state. A &lt;code&gt;values.yaml&lt;/code&gt; file in your repository specifies which model version each environment should run. Promoting from staging to production is a PR that bumps a version number. The PR is the change review, the approval gate, and the audit trail all at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ArgoCD&lt;/strong&gt; reconciles the cluster to match what's in Git. When the PR merges, ArgoCD detects the change and applies the updated KServe InferenceService. If someone manually changes the InferenceService on the cluster, ArgoCD detects the drift and reverts it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Istio&lt;/strong&gt; manages traffic splitting. During canary promotion, a VirtualService routes 10% of traffic to the new model version while 90% continues to the stable version. If metrics look good after a soak period, you update the weights and do a full cutover.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prometheus&lt;/strong&gt; collects serving metrics. Latency (p99 in particular), throughput, and prediction distribution histograms give you the signals needed to decide whether a canary is healthy or needs to be rolled back.&lt;/p&gt;
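
&lt;p&gt;The canary decision itself can be reduced to a comparison between the stable and canary versions. The sketch below is one possible shape for that gate, assuming the metrics have already been queried from Prometheus. The thresholds, the &lt;code&gt;ServingMetrics&lt;/code&gt; type, and the use of a single positive-prediction rate as a stand-in for the prediction distribution are all assumptions made for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class ServingMetrics:
    p99_latency_ms: float
    positive_rate: float  # share of requests predicted as the positive class


def canary_is_healthy(
    stable: ServingMetrics,
    canary: ServingMetrics,
    max_latency_ratio: float = 1.2,  # hypothetical: allow 20% slower p99
    max_rate_shift: float = 0.05,    # hypothetical: allow a 5-point shift
) -> bool:
    """Compare the canary against the stable version on latency and output shift."""
    latency_ok = stable.p99_latency_ms * max_latency_ratio >= canary.p99_latency_ms
    shift_ok = max_rate_shift >= abs(canary.positive_rate - stable.positive_rate)
    return latency_ok and shift_ok
```

&lt;p&gt;Comparing against the stable version rather than a fixed number keeps the gate meaningful when baseline latency drifts for reasons unrelated to the model.&lt;/p&gt;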

&lt;h2&gt;
  
  
  The Workflow
&lt;/h2&gt;

&lt;p&gt;Here's how a model promotion actually works end to end.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftwb59lufwlvzqmev2ya1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftwb59lufwlvzqmev2ya1.png" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A data scientist trains a new model, evaluates it against the validation set, and if it passes threshold, registers it in MLflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;

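&lt;span class="c1"&gt;# `model` is the fitted scikit-learn estimator from the training step (not shown)&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_run&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;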
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sklearn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_metrics&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;f1_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.94&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.97&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;run_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;active_run&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tracking&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MlflowClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;model_uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;runs:/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;mv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_model_version&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fraud-detector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# mv.version == "47"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That registration triggers a CI pipeline (GitHub Actions or Tekton, depending on your setup) that opens a pull request bumping the version in the dev environment's values file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;values.yaml structure:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;environments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;dev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
      &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;47"&lt;/span&gt;
      &lt;span class="na"&gt;storageUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://ml-models-prod/fraud-detector/v47"&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4Gi"&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
      &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

  &lt;span class="na"&gt;staging&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
      &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;45"&lt;/span&gt;
      &lt;span class="na"&gt;storageUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://ml-models-prod/fraud-detector/v45"&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4Gi"&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
      &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
      &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;

  &lt;span class="na"&gt;prod&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
      &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;43"&lt;/span&gt;
      &lt;span class="na"&gt;storageUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://ml-models-prod/fraud-detector/v43"&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16Gi"&lt;/span&gt;
      &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
      &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
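
&lt;p&gt;The version bump that the CI pipeline commits against this file can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the function name &lt;code&gt;bump_model_version&lt;/code&gt; is hypothetical, the values file is assumed to be already parsed into a dict, and the &lt;code&gt;v{version}&lt;/code&gt; suffix is inferred from the &lt;code&gt;storageUri&lt;/code&gt; values shown above.&lt;/p&gt;

```python
import copy


def bump_model_version(values: dict, env: str, version: str) -> dict:
    """Return a copy of the parsed values file with one environment bumped.

    Hypothetical helper: assumes storageUri always ends in "/v" + version,
    as it does in the values.yaml shown above.
    """
    updated = copy.deepcopy(values)  # leave the caller's dict untouched
    model = updated["environments"][env]["model"]
    # Swap only the trailing version segment of the storage URI.
    prefix = model["storageUri"].rsplit("/", 1)[0]
    model["version"] = version
    model["storageUri"] = f"{prefix}/v{version}"
    return updated


values = {
    "environments": {
        "dev": {"model": {"name": "fraud-detector", "version": "45",
                          "storageUri": "s3://ml-models-prod/fraud-detector/v45"}},
        "prod": {"model": {"name": "fraud-detector", "version": "43",
                           "storageUri": "s3://ml-models-prod/fraud-detector/v43"}},
    }
}
bumped = bump_model_version(values, "dev", "47")
print(bumped["environments"]["dev"]["model"]["storageUri"])
```

&lt;p&gt;Only the targeted environment changes, so a dev bump can never touch the staging or prod entries by accident.&lt;/p&gt;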



&lt;p&gt;&lt;strong&gt;KServe InferenceService (stable):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.kserve.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving-prod&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;argocd.argoproj.io/sync-wave&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;predictor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kserve-s3-sa&lt;/span&gt;
    &lt;span class="na"&gt;sklearn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;storageUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://ml-models-prod/fraud-detector/v43"&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16Gi"&lt;/span&gt;
    &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
    &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
    &lt;span class="na"&gt;scaleTarget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
    &lt;span class="na"&gt;scaleMetric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;concurrency&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;KServe InferenceService (canary variant):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.kserve.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving-prod&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;argocd.argoproj.io/sync-wave&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;predictor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kserve-s3-sa&lt;/span&gt;
    &lt;span class="na"&gt;sklearn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;storageUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://ml-models-prod/fraud-detector/v47"&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16Gi"&lt;/span&gt;
    &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
    &lt;span class="na"&gt;canaryTrafficPercent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;ArgoCD ApplicationSet for multi-environment management:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ApplicationSet&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector-serving&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;generators&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;elements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dev&lt;/span&gt;
            &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dev-cluster&lt;/span&gt;
            &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving-dev&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging&lt;/span&gt;
            &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging-cluster&lt;/span&gt;
            &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving-staging&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod&lt;/span&gt;
            &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod-cluster&lt;/span&gt;
            &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving-prod&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fraud-detector-{{env}}"&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving&lt;/span&gt;
      &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/org/ml-gitops&lt;/span&gt;
        &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HEAD&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;environments/{{env}}"&lt;/span&gt;
        &lt;span class="na"&gt;helm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;valueFiles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;values.yaml&lt;/span&gt;
      &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{cluster}}"&lt;/span&gt;
        &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{namespace}}"&lt;/span&gt;
      &lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;automated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;prune&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="na"&gt;selfHeal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;syncOptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;CreateNamespace=true&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;RespectIgnoreDifferences=true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Istio VirtualService for canary traffic split:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.istio.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VirtualService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector-vs&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving-prod&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;fraud-detector.ml-serving-prod.svc.cluster.local&lt;/span&gt;
  &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;x-canary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;exact&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
      &lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector-predictor-canary&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
          &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector-predictor-default&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
          &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;90&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector-predictor-canary&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
          &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
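
&lt;p&gt;The routing decision this VirtualService encodes can be modeled in plain Python, which is handy for reasoning about how much traffic the canary actually sees. A minimal sketch, assuming the two rules are evaluated in order; the &lt;code&gt;pick_backend&lt;/code&gt; function is hypothetical and only mimics the weights, it does not talk to Istio.&lt;/p&gt;

```python
import random

STABLE = "fraud-detector-predictor-default"
CANARY = "fraud-detector-predictor-canary"


def pick_backend(headers: dict, rng: random.Random) -> str:
    """Mimic the two HTTP rules: header match first, then the 90/10 split."""
    # Rule 1: requests carrying x-canary: "true" always go to the canary.
    if headers.get("x-canary") == "true":
        return CANARY
    # Rule 2: everything else is split 90/10 by weight.
    return rng.choices([STABLE, CANARY], weights=[90, 10], k=1)[0]


rng = random.Random(0)  # seeded so the simulation is repeatable
forced = pick_backend({"x-canary": "true"}, rng)
picks = [pick_backend({}, rng) for _ in range(10_000)]
canary_share = picks.count(CANARY) / len(picks)
print(forced, round(canary_share, 3))
```

&lt;p&gt;The header rule gives you a deterministic way to hit the canary for smoke tests, while unlabeled traffic lands on it roughly one request in ten.&lt;/p&gt;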



&lt;p&gt;After the PR merges to dev, ArgoCD picks up the change within 3 minutes (the default sync interval) and applies the updated InferenceService. The model downloads from S3, the serving pod comes up, and the endpoint starts responding. At this point you can run your automated evaluation suite against the dev endpoint.&lt;/p&gt;

&lt;p&gt;Promoting to staging is another PR. A human reviews it, checks the dev evaluation results, and approves. Merge, ArgoCD syncs, done. Production promotion follows the same pattern but includes an additional step: the canary InferenceService gets deployed first with 10% traffic, and a GitHub Actions workflow monitors Prometheus metrics for a configured soak period (we use 2 hours for most models) before opening the full-cutover PR automatically.&lt;/p&gt;
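
&lt;p&gt;The soak check boils down to a gate over the metrics sampled during the window. The sketch below is an assumption about how such a gate could look, not the workflow we run: the &lt;code&gt;SoakSample&lt;/code&gt; type, the &lt;code&gt;soak_passed&lt;/code&gt; function, and the thresholds (borrowed from the 1% error rate and 500ms p99 limits in the alerting rules later in this post) are all illustrative, and fetching the samples from Prometheus is left out.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoakSample:
    """One scrape of canary metrics during the soak period (hypothetical shape)."""
    error_rate: float      # fraction of 5xx responses, 0.0 to 1.0
    p99_latency_s: float   # p99 request latency in seconds


def soak_passed(samples: list[SoakSample],
                max_error_rate: float = 0.01,
                max_p99_latency_s: float = 0.5) -> bool:
    """Return True only if every sample stayed within both thresholds."""
    if not samples:
        # No data is treated as a failure: never promote blind.
        return False
    return not any(
        s.error_rate > max_error_rate or s.p99_latency_s > max_p99_latency_s
        for s in samples
    )


healthy = [SoakSample(0.001, 0.12), SoakSample(0.002, 0.18)]
spiky = healthy + [SoakSample(0.03, 0.20)]
print(soak_passed(healthy), soak_passed(spiky))
```

&lt;p&gt;A workflow built on a gate like this would open the full-cutover PR only when it returns true, and otherwise leave the decision to a human.&lt;/p&gt;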

&lt;h2&gt;
  
  
  Drift Detection
&lt;/h2&gt;

&lt;p&gt;Prediction drift is the sneaky failure mode. The model is technically serving, latency looks fine, but the distribution of predictions has shifted because the input data changed. You won't catch this with a liveness probe.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fya0whws7chpxtyg3q0me.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fya0whws7chpxtyg3q0me.png" alt=" " width="800" height="622"&gt;&lt;/a&gt;&lt;/p&gt;

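&lt;p&gt;KServe's sklearn server exposes prediction histograms as Prometheus metrics out of the box. You define alerting rules that fire when the distribution deviates beyond a threshold from a baseline. The rule below uses a rolling baseline, the same metric averaged over a 60-minute window one day earlier, rather than a snapshot captured at deployment time.&lt;/p&gt;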

&lt;p&gt;&lt;strong&gt;Prometheus PrometheusRule for drift alerting:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monitoring.coreos.com/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PrometheusRule&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector-drift&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving-prod&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-prometheus&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;alert-rules&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector.drift&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2m&lt;/span&gt;
      &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PredictionDriftDetected&lt;/span&gt;
          &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;abs(&lt;/span&gt;
              &lt;span class="s"&gt;avg_over_time(fraud_detector_prediction_mean[10m])&lt;/span&gt;
              &lt;span class="s"&gt;- avg_over_time(fraud_detector_prediction_mean[60m] offset 1d)&lt;/span&gt;
            &lt;span class="s"&gt;) &amp;gt; 0.15&lt;/span&gt;
          &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10m&lt;/span&gt;
          &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
            &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
            &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod&lt;/span&gt;
          &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prediction&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;distribution&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;shift&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;detected&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fraud-detector"&lt;/span&gt;
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mean&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;prediction&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;shifted&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;by&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$value&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;humanizePercentage&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;yesterday's&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;baseline.&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Check&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;schema&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;changes."&lt;/span&gt;

        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ModelLatencyHigh&lt;/span&gt;
          &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;histogram_quantile(0.99,&lt;/span&gt;
              &lt;span class="s"&gt;sum(rate(fraud_detector_request_duration_seconds_bucket[5m])) by (le)&lt;/span&gt;
            &lt;span class="s"&gt;) &amp;gt; 0.5&lt;/span&gt;
          &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
          &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
            &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
            &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod&lt;/span&gt;
          &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p99&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;above&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;500ms&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fraud-detector"&lt;/span&gt;
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p99&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$value&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}s.&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;SLA&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;threshold&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;500ms."&lt;/span&gt;

        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ModelErrorRateHigh&lt;/span&gt;
          &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;rate(fraud_detector_request_total{status_code=~"5.."}[5m])&lt;/span&gt;
            &lt;span class="s"&gt;/&lt;/span&gt;
            &lt;span class="s"&gt;rate(fraud_detector_request_total[5m]) &amp;gt; 0.01&lt;/span&gt;
          &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
          &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
            &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
            &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod&lt;/span&gt;
          &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;above&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1%&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fraud-detector"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
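&lt;p&gt;The error-rate rule is meant to compare aggregate 5xx traffic against all traffic. Here is a rough Python sketch of that arithmetic. The sample rates are made-up numbers and &lt;code&gt;error_ratio&lt;/code&gt; is a hypothetical helper, not part of Prometheus:&lt;/p&gt;

```python
import re

# Hypothetical per-series request rates (requests/second), keyed by status_code.
SAMPLE_RATES = {"200": 97.0, "404": 1.0, "500": 1.5, "503": 0.5}


def error_ratio(rates_by_status):
    """Return the aggregate 5xx rate divided by the aggregate request rate."""
    total = sum(rates_by_status.values())
    if total == 0:
        return 0.0  # no traffic, so there is nothing to alert on
    errors = sum(
        rate
        for status, rate in rates_by_status.items()
        if re.fullmatch(r"5..", status)  # same pattern as status_code=~"5.."
    )
    return errors / total


ratio = error_ratio(SAMPLE_RATES)
print(f"error ratio {ratio:.3f}, above 1% threshold: {ratio > 0.01}")
```

&lt;p&gt;Both sides are summed before dividing. In PromQL, dividing two unaggregated vectors matches series label by label, so each 5xx series would only be divided by itself.&lt;/p&gt;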



&lt;p&gt;When this alert fires, Alertmanager routes it to PagerDuty (or whichever receiver you have configured). The on-call engineer's first action is to check whether a canary is active. If it is, rolling back means reverting the canary commit and pushing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
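&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Revert the most recent commit (assumed to be the canary change)&lt;/span&gt;
git revert --no-edit HEAD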
git push origin main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ArgoCD picks up the revert within its default 3-minute polling interval and redeploys the previous InferenceService version. In practice, our rollbacks averaged 4 minutes from decision to stable serving.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Time to deploy new model version&lt;/td&gt;
&lt;td&gt;4 to 6 hours&lt;/td&gt;
&lt;td&gt;8 minutes to production canary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rollback capability&lt;/td&gt;
&lt;td&gt;None (manual rebuild)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;git revert&lt;/code&gt;, avg 4 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Drift detection time&lt;/td&gt;
&lt;td&gt;6 hours (user reports)&lt;/td&gt;
&lt;td&gt;15 minutes (automated alert)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment audit trail&lt;/td&gt;
&lt;td&gt;Slack messages&lt;/td&gt;
&lt;td&gt;Full Git history with PR reviews&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Environment parity&lt;/td&gt;
&lt;td&gt;Best effort&lt;/td&gt;
&lt;td&gt;Enforced via ApplicationSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config drift prevention&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;ArgoCD selfHeal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The number that surprised me most was the drift detection improvement. We caught a data schema change within 15 minutes on the new system. The same type of change previously went undetected for 6 hours before a user complaint surfaced it. That's not a monitoring win; it's a business outcome.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with the values.yaml contract.&lt;/strong&gt; The shape of that file is the most important design decision you'll make. Get the team to agree on it before writing any ArgoCD config. Everything else follows from it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;S3 artifact URIs in the InferenceService spec, not MLflow stage names.&lt;/strong&gt; MLflow stage names ("Production", "Staging") are mutable. If you reference a stage name in your InferenceService spec, different model versions will map to the same stage name over time, and your Git history loses meaning. Reference the explicit S3 URI with the version number baked in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;selfHeal is non-negotiable.&lt;/strong&gt; Turn it on in your ArgoCD sync policy. Without selfHeal, a manual &lt;code&gt;kubectl edit&lt;/code&gt; on the InferenceService leaves the cluster silently out of sync with Git, and nobody will notice until it matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Canary soak time depends on your traffic volume.&lt;/strong&gt; For a high-volume fraud model processing 50k requests per minute, 30 minutes of canary is enough to get statistically significant signal. For a low-volume model processing 100 requests per hour, 2 hours of canary at 10% gives you only 20 requests through the new version. Adjust accordingly, or route specific customers to the canary instead of splitting traffic by random percentage.&lt;/p&gt;
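&lt;p&gt;To make the soak-time arithmetic concrete, here is a minimal Python sketch. The helper names are hypothetical. The 10% split for the high-volume case and the 1,000-request target are assumptions for illustration. The low-volume case uses 100 requests per hour, the rate at which 2 hours at 10% works out to 20 canary requests:&lt;/p&gt;

```python
def canary_requests(requests_per_hour, soak_hours, canary_fraction):
    """Expected number of requests the canary serves during the soak."""
    return requests_per_hour * soak_hours * canary_fraction


def soak_hours_needed(target_requests, requests_per_hour, canary_fraction):
    """Soak time needed for the canary to serve target_requests."""
    return target_requests / (requests_per_hour * canary_fraction)


# High-volume model: 50k requests/minute, 30-minute soak, assumed 10% canary.
high = canary_requests(50_000 * 60, 0.5, 0.10)

# Low-volume model: assumed 100 requests/hour, 2-hour soak, 10% canary.
low = canary_requests(100, 2, 0.10)

print(f"high-volume canary sample: {high:,.0f} requests")
print(f"low-volume canary sample: {low:,.0f} requests")
print(
    "hours to reach 1,000 canary requests at low volume: "
    f"{soak_hours_needed(1_000, 100, 0.10):.0f}"
)
```

&lt;p&gt;Running the numbers before picking a soak window shows quickly whether a percentage split can produce enough signal, or whether routing specific customers to the canary is the better option.&lt;/p&gt;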

&lt;p&gt;&lt;strong&gt;Model cold start affects canary rollouts.&lt;/strong&gt; Large models take time to download from S3 and load into memory. A 2GB model on a cold node might take 3 to 4 minutes before it's ready to serve. Account for this in your readiness probe timeouts and don't let your monitoring system flag the canary as failing during the startup window.&lt;/p&gt;
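&lt;p&gt;One way to budget for that startup window is a Kubernetes startup probe, which tolerates roughly &lt;code&gt;failureThreshold * periodSeconds&lt;/code&gt; seconds before the container is restarted. This is a sketch of the calculation, not the configuration used in this pipeline. The 10-second period and 1.5x safety factor are assumptions:&lt;/p&gt;

```python
import math


def startup_failure_threshold(expected_startup_seconds, period_seconds, safety_factor=1.5):
    """Smallest failureThreshold whose probe window
    (failureThreshold * periodSeconds) covers the padded startup time."""
    budget = expected_startup_seconds * safety_factor
    return math.ceil(budget / period_seconds)


# 4 minutes is the upper end of the cold-start range quoted above;
# the 10-second probe period and 1.5x safety factor are assumptions.
threshold = startup_failure_threshold(4 * 60, period_seconds=10)
print(f"failureThreshold: {threshold} (allows {threshold * 10} seconds)")
```

&lt;p&gt;The same budget is a reasonable lower bound for how long alerting should ignore a freshly started canary.&lt;/p&gt;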

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The repository structure I've described looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ml-gitops/
├── environments/
│   ├── dev/
│   │   ├── values.yaml
│   │   └── templates/
│   │       ├── inference-service.yaml
│   │       └── virtual-service.yaml
│   ├── staging/
│   │   ├── values.yaml
│   │   └── templates/
│   └── prod/
│       ├── values.yaml
│       └── templates/
├── base/
│   ├── inference-service-template.yaml
│   └── prometheus-rules.yaml
└── applicationset.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prerequisites before you start:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes cluster (1.28 or newer)&lt;/li&gt;
&lt;li&gt;KServe 0.12 or newer installed&lt;/li&gt;
&lt;li&gt;ArgoCD 2.9 or newer installed&lt;/li&gt;
&lt;li&gt;Istio 1.20 or newer installed&lt;/li&gt;
&lt;li&gt;MLflow tracking server accessible from the cluster&lt;/li&gt;
&lt;li&gt;S3 bucket with appropriate IRSA or Workload Identity configured for KServe pods&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ArgoCD ApplicationSet in this post assumes a Helm-based templating approach where each environment folder contains a values.yaml and a templates directory with the InferenceService and VirtualService manifests. You could also use Kustomize overlays. The concepts are identical.&lt;/p&gt;

&lt;p&gt;Start with dev only. Get one model version deploying cleanly through ArgoCD before adding staging and prod. Add the canary workflow only after the basic promotion gate is working reliably.&lt;/p&gt;

&lt;p&gt;The jump from "it works in dev" to "it's reliable in prod" is mostly about the Prometheus alerting and the canary soak automation. Those two pieces are what make the system trustworthy enough for the team to stop second-guessing every deployment.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kserve.github.io/website/" rel="noopener noreferrer"&gt;KServe Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://argo-cd.readthedocs.io/en/stable/user-guide/application-set/" rel="noopener noreferrer"&gt;ArgoCD ApplicationSets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mlflow.org/docs/latest/model-registry.html" rel="noopener noreferrer"&gt;MLflow Model Registry&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://istio.io/latest/docs/concepts/traffic-management/" rel="noopener noreferrer"&gt;Istio Traffic Management&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://prometheus-operator.dev/docs/operator/api/" rel="noopener noreferrer"&gt;Prometheus Operator API&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>GitOps for ML Model Deployment: A Real Pipeline, Not a Toy Demo</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Sun, 08 Mar 2026 06:27:15 +0000</pubDate>
      <link>https://dev.to/mateenali66/gitops-for-ml-model-deployment-a-real-pipeline-not-a-toy-demo-1lk8</link>
      <guid>https://dev.to/mateenali66/gitops-for-ml-model-deployment-a-real-pipeline-not-a-toy-demo-1lk8</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; I replaced ad-hoc model deployments with a fully declarative GitOps pipeline using KServe and ArgoCD. Every model version lives in Git, every change goes through a PR, and rollbacks take one &lt;code&gt;git revert&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Every ML team I've worked with has the same dirty secret: their model deployments are snowflakes.&lt;/p&gt;

&lt;p&gt;The Python script that "works on the data scientist's machine." The Slack message that says "hey can you deploy the new model." The SSH session into the GPU node that nobody documented. Meanwhile, the same team's microservices are humming along with ArgoCD, automated rollbacks, PR-gated deploys, full audit trails.&lt;/p&gt;

&lt;p&gt;That gap is embarrassing, and it's completely unnecessary.&lt;/p&gt;

&lt;p&gt;KServe got accepted into CNCF as an Incubating project in September 2025. The tooling to close this gap is mature enough for production. Here's what the actual problem looks like in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Someone manually SSHes into a node and runs a deployment script. No record of what version went live.&lt;/li&gt;
&lt;li&gt;A model update silently replaces the previous one. There's no rollback path.&lt;/li&gt;
&lt;li&gt;Two data scientists think different model versions are running in staging. Both are right, sort of.&lt;/li&gt;
&lt;li&gt;An incident happens. Nobody can tell what changed or when.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I've lived through all of these. The fix isn't a better runbook or more Slack discipline. It's treating model deployments the same way we treat application deployments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6a0tytpn5lm76ukafy5i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6a0tytpn5lm76ukafy5i.png" alt=" " width="800" height="979"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Tried First (And Why It Failed)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Attempt 1: Wrapping deployments in shell scripts
&lt;/h3&gt;

&lt;p&gt;The first instinct was to write a &lt;code&gt;deploy_model.sh&lt;/code&gt; that calls &lt;code&gt;kubectl apply&lt;/code&gt; with the right image tag. This is better than nothing, but it's not GitOps. The script lives somewhere, gets edited ad-hoc, and there's still no PR-gated workflow. The script is the new snowflake.&lt;/p&gt;

&lt;h3&gt;
  
  
  Attempt 2: Baking models into Docker images
&lt;/h3&gt;

&lt;p&gt;The idea: train the model, package the weights into a Docker image, deploy the image via a normal &lt;code&gt;Deployment&lt;/code&gt;. This works surprisingly well for small models under a few hundred MB. It breaks down fast when the model is 2GB or 14GB. Your Docker build times blow up, your registry costs climb, and now your CI pipeline is bottlenecked on model artifact size.&lt;/p&gt;

&lt;p&gt;More importantly, you lose the semantic layer. Your Git history shows &lt;code&gt;model:sha256-abc123&lt;/code&gt; instead of &lt;code&gt;fraud-detector/v2.5.0 sklearn 2 replicas 50 RPS target&lt;/code&gt;. The config and the artifact are fused. That's hard to review and harder to reason about.&lt;/p&gt;

&lt;h3&gt;
  
  
  Attempt 3: What actually worked
&lt;/h3&gt;

&lt;p&gt;Separate the artifact from the config. The model weights live in S3 under versioned paths that are written once and never overwritten. Git holds the pointer and all the serving configuration. A Kubernetes controller keeps the cluster in sync with what Git says. That's it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;The stack I use and recommend:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model serving&lt;/td&gt;
&lt;td&gt;KServe v0.14+&lt;/td&gt;
&lt;td&gt;Kubernetes-native CRD, multi-framework, built-in canary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitOps controller&lt;/td&gt;
&lt;td&gt;ArgoCD&lt;/td&gt;
&lt;td&gt;Declarative sync, health checks, rollback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model storage&lt;/td&gt;
&lt;td&gt;S3&lt;/td&gt;
&lt;td&gt;Versioned paths, written once and never overwritten&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model versioning&lt;/td&gt;
&lt;td&gt;MLflow&lt;/td&gt;
&lt;td&gt;Tracks lineage from training to deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ingress&lt;/td&gt;
&lt;td&gt;Istio&lt;/td&gt;
&lt;td&gt;Traffic splitting for canary rollouts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secrets&lt;/td&gt;
&lt;td&gt;AWS IRSA&lt;/td&gt;
&lt;td&gt;No credentials in Git, ever&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;KServe is the linchpin. It exposes a single &lt;code&gt;InferenceService&lt;/code&gt; CRD that ArgoCD manages like any other Kubernetes resource.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Install KServe
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# cert-manager is a prerequisite&lt;/span&gt;
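kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://github.com/cert-manager/cert-manager/releases/download/v1.17.0/cert-manager.yaml
&lt;span class="c"&gt;# wait until cert-manager (including its webhook) is available before installing KServe&lt;/span&gt;
kubectl wait --for=condition=Available deployment --all --namespace cert-manager --timeout=180s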

kubectl create ns kserve

helm &lt;span class="nb"&gt;install &lt;/span&gt;kserve-crd oci://ghcr.io/kserve/charts/kserve-crd &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--version&lt;/span&gt; v0.14.1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; kserve

helm &lt;span class="nb"&gt;install &lt;/span&gt;kserve oci://ghcr.io/kserve/charts/kserve &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--version&lt;/span&gt; v0.14.1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; kserve &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; kserve.controller.deploymentMode&lt;span class="o"&gt;=&lt;/span&gt;RawDeployment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I use &lt;code&gt;RawDeployment&lt;/code&gt; mode. It uses standard Kubernetes Deployments and Services instead of Knative, which means fewer moving parts, better compatibility with existing Prometheus and HPA setups, and no cold-start complexity on the critical path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Structure your Git repo
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;models/
├── base/
│   └── kustomization.yaml
├── fraud-detector/
│   ├── kustomization.yaml
│   ├── inference-service.yaml
│   └── service-account.yaml
├── image-classifier/
│   ├── kustomization.yaml
│   └── inference-service.yaml
└── overlays/
    ├── staging/
    │   └── kustomization.yaml
    └── production/
        └── kustomization.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kustomize overlays let you parameterize resource limits, replica counts, and model URIs per environment without duplicating YAML.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Define the InferenceService
&lt;/h3&gt;

&lt;p&gt;This is the core resource. Here's a real example for a scikit-learn fraud detection model stored in S3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# models/fraud-detector/inference-service.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.kserve.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
    &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-platform&lt;/span&gt;
    &lt;span class="na"&gt;model-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2.4.1"&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;serving.kserve.io/deploymentMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RawDeployment&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;predictor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
    &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;scaleTarget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
    &lt;span class="na"&gt;scaleMetric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rps&lt;/span&gt;
    &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kserve-s3-sa&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;modelFormat&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sklearn&lt;/span&gt;
      &lt;span class="na"&gt;storageUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://prod-ml-models/fraud-detector/v2.4.1"&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;500m"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1Gi"&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4Gi"&lt;/span&gt;
      &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SKLEARN_SERVER_WORKERS&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;storageUri&lt;/code&gt; is the version pointer. Bumping &lt;code&gt;v2.4.1&lt;/code&gt; to &lt;code&gt;v2.5.0&lt;/code&gt; and raising a PR is your deploy-new-model workflow.&lt;/p&gt;
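&lt;p&gt;If you want to script the bump instead of editing by hand, a small helper can rewrite both the &lt;code&gt;model-version&lt;/code&gt; label and the &lt;code&gt;storageUri&lt;/code&gt; suffix together. This is a minimal sketch: &lt;code&gt;bump_model_version&lt;/code&gt; is a hypothetical helper, and it assumes the manifest layout shown above, with a single label and a URI ending in &lt;code&gt;/v&lt;/code&gt; plus the version:&lt;/p&gt;

```python
import re

# Trimmed-down copy of the manifest fields that carry the version.
MANIFEST = '''\
metadata:
  labels:
    model-version: "2.4.1"
spec:
  predictor:
    model:
      storageUri: "s3://prod-ml-models/fraud-detector/v2.4.1"
'''


def bump_model_version(manifest_text, new_version):
    """Rewrite the model-version label and the storageUri version suffix."""
    label = re.compile(r'(model-version:\s*")[^"]+(")')
    uri = re.compile(r'(storageUri:\s*"s3://[^"]+/v)[^"/]+(")')

    def swap(match):
        # Keep the text around the version, replace only the version itself.
        return match.group(1) + new_version + match.group(2)

    text, label_hits = label.subn(swap, manifest_text)
    text, uri_hits = uri.subn(swap, text)
    if label_hits != 1 or uri_hits != 1:
        raise ValueError("expected exactly one label and one storageUri")
    return text


print(bump_model_version(MANIFEST, "2.5.0"))
```

&lt;p&gt;Failing loudly when either field is missing keeps the label and the URI from drifting apart in a PR.&lt;/p&gt;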

&lt;p&gt;For GPU workloads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# models/image-classifier/inference-service.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.kserve.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;image-classifier&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;model-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.3.0"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;predictor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;
    &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kserve-s3-sa&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;modelFormat&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pytorch&lt;/span&gt;
      &lt;span class="na"&gt;storageUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://prod-ml-models/image-classifier/v1.3.0"&lt;/span&gt;
      &lt;span class="na"&gt;runtimeVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;23.08-py3"&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
          &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16Gi"&lt;/span&gt;
          &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
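    &lt;span class="c1"&gt;# nodeSelector is a pod-level field, so it sits under predictor, not model&lt;/span&gt;
    &lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;accelerator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia-a10g&lt;/span&gt;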
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Wire up the S3 service account
&lt;/h3&gt;

&lt;p&gt;Don't put AWS credentials in manifests. Use IRSA on EKS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# models/fraud-detector/service-account.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kserve-s3-sa&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;eks.amazonaws.com/role-arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::123456789012:role/kserve-model-reader&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The IAM role needs &lt;code&gt;s3:GetObject&lt;/code&gt; and &lt;code&gt;s3:ListBucket&lt;/code&gt; on your model bucket. KServe's storage initializer picks up the IRSA token automatically.&lt;/p&gt;
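&lt;p&gt;As a sketch of what that read-only policy could look like, the snippet below builds the policy document in Python and prints it as JSON. The function name is hypothetical, and the bucket name is the one used in the &lt;code&gt;storageUri&lt;/code&gt; examples. Note that &lt;code&gt;s3:ListBucket&lt;/code&gt; is granted on the bucket ARN while &lt;code&gt;s3:GetObject&lt;/code&gt; is granted on the objects inside it:&lt;/p&gt;

```python
import json


def model_reader_policy(bucket):
    """Build a read-only IAM policy document for one model bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # ListBucket applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # GetObject applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


print(json.dumps(model_reader_policy("prod-ml-models"), indent=2))
```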

&lt;h3&gt;
  
  
  Step 5: Create the ArgoCD Application
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# argocd/apps/ml-models.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Application&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-models&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
  &lt;span class="na"&gt;finalizers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;resources-finalizer.argocd.argoproj.io&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-platform&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/phonotech/ml-manifests&lt;/span&gt;
    &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;models/overlays/production&lt;/span&gt;
  &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://kubernetes.default.svc&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving&lt;/span&gt;
  &lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;automated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;prune&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;selfHeal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;syncOptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;CreateNamespace=true&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;RespectIgnoreDifferences=true&lt;/span&gt;
    &lt;span class="na"&gt;retry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
      &lt;span class="na"&gt;backoff&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
        &lt;span class="na"&gt;factor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
        &lt;span class="na"&gt;maxDuration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3m&lt;/span&gt;
  &lt;span class="na"&gt;ignoreDifferences&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.kserve.io&lt;/span&gt;
      &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
      &lt;span class="na"&gt;jsonPointers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/status&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/metadata/annotations/serving.kserve.io~1deploymentMode&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ignoreDifferences&lt;/code&gt; block is critical. KServe's controller writes back to the &lt;code&gt;InferenceService&lt;/code&gt; status and some annotations. Without it, ArgoCD will perpetually detect drift and attempt to re-sync, creating a noisy feedback loop.&lt;/p&gt;
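&lt;p&gt;The &lt;code&gt;~1&lt;/code&gt; in the annotation pointer is JSON Pointer (RFC 6901) escaping for the &lt;code&gt;/&lt;/code&gt; inside the annotation key. A small Python sketch of how such a pointer is built, with &lt;code&gt;json_pointer&lt;/code&gt; as a hypothetical helper:&lt;/p&gt;

```python
def json_pointer(*keys):
    """Join keys into an RFC 6901 JSON Pointer, escaping "~" and "/"."""
    # "~" must be escaped first so the "~1" produced for "/" is left alone.
    escaped = (key.replace("~", "~0").replace("/", "~1") for key in keys)
    return "".join("/" + segment for segment in escaped)


pointer = json_pointer("metadata", "annotations", "serving.kserve.io/deploymentMode")
print(pointer)
```

&lt;p&gt;The same escaping applies to any other annotation or label key containing a slash that you add to &lt;code&gt;jsonPointers&lt;/code&gt;.&lt;/p&gt;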

&lt;h3&gt;
  
  
  Step 6: The deployment workflow
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfpkxi8khe6ktal86qg1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfpkxi8khe6ktal86qg1.png" alt=" " width="800" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's what a model update looks like end to end:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data scientist trains a new model, registers the artifact in MLflow, uploads weights to &lt;code&gt;s3://prod-ml-models/fraud-detector/v2.5.0/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;They open a PR updating &lt;code&gt;storageUri&lt;/code&gt; and the &lt;code&gt;model-version&lt;/code&gt; label in &lt;code&gt;inference-service.yaml&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;PR gets reviewed and merged to &lt;code&gt;main&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;ArgoCD detects the diff within 3 minutes (or immediately with webhooks), syncs the new &lt;code&gt;InferenceService&lt;/code&gt; spec&lt;/li&gt;
&lt;li&gt;KServe's storage initializer pulls the new weights into the pod&lt;/li&gt;
&lt;li&gt;New revision comes up healthy, traffic cuts over&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The model version is in Git history. You can &lt;code&gt;git revert&lt;/code&gt; it. You can see exactly what changed between &lt;code&gt;v2.4.1&lt;/code&gt; and &lt;code&gt;v2.5.0&lt;/code&gt; in the PR diff.&lt;/p&gt;
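&lt;p&gt;The manifest edit in step 2 is small enough to script. The sketch below is one hypothetical way to do it in Python, assuming the manifest stores the version as a trailing &lt;code&gt;vX.Y.Z&lt;/code&gt; path segment in &lt;code&gt;storageUri&lt;/code&gt; and as the value of a &lt;code&gt;model-version&lt;/code&gt; label. The function name and the sample manifest are made up for illustration.&lt;/p&gt;

```python
import re


def bump_model_version(manifest_text, new_version):
    """Point storageUri and the model-version label at new_version."""
    # Swap the trailing version segment of the S3 path.
    text, uri_hits = re.subn(
        r'(storageUri:\s*"?s3://[^"\s]*/)v\d+\.\d+\.\d+',
        lambda m: m.group(1) + new_version,
        manifest_text,
    )
    # Swap the label value so the two fields never disagree.
    text, label_hits = re.subn(
        r'(model-version:\s*"?)v\d+\.\d+\.\d+',
        lambda m: m.group(1) + new_version,
        text,
    )
    if uri_hits != 1 or label_hits != 1:
        raise ValueError("expected exactly one storageUri and one model-version label")
    return text


# Illustrative fragment only, not the full inference-service.yaml.
sample = '''metadata:
  labels:
    model-version: "v2.4.1"
spec:
  predictor:
    model:
      storageUri: "s3://prod-ml-models/fraud-detector/v2.4.1"
'''
updated = bump_model_version(sample, "v2.5.0")
print(updated)
```

&lt;p&gt;Refusing to write when either field is missing or duplicated keeps a half-updated manifest out of the PR.&lt;/p&gt;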

&lt;p&gt;To trigger a sync immediately from GitHub Actions instead of waiting for the next poll, call the ArgoCD sync API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/sync-models.yaml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Notify ArgoCD on model manifest change&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;models/**'&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;sync&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Trigger ArgoCD sync&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;curl -sS --fail -X POST \&lt;/span&gt;
            &lt;span class="s"&gt;-H "Authorization: Bearer ${{ secrets.ARGOCD_TOKEN }}" \&lt;/span&gt;
            &lt;span class="s"&gt;https://argocd.internal.ca/api/v1/applications/ml-models/sync&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Canary rollouts
&lt;/h3&gt;

&lt;p&gt;KServe's built-in canary support is where this pattern earns its keep.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Step 1: Deploy canary at 10% traffic&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.kserve.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;predictor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;canaryTrafficPercent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;modelFormat&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sklearn&lt;/span&gt;
      &lt;span class="na"&gt;storageUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://prod-ml-models/fraud-detector/v2.5.0"&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;500m"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1Gi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;KServe automatically routes 90% to the last stable revision and 10% to v2.5.0. If the new model performs well, merge another PR bumping &lt;code&gt;canaryTrafficPercent&lt;/code&gt; to 50, then promote to 100 by removing the field. If the canary is bad, set &lt;code&gt;canaryTrafficPercent: 0&lt;/code&gt; to pin back to stable immediately.&lt;/p&gt;
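&lt;p&gt;The rollout arithmetic is simple, but writing it down makes the three states (canary, rollback, promoted) explicit. This is a rough Python model of the behaviour described above, not KServe code. &lt;code&gt;traffic_split&lt;/code&gt; is a hypothetical helper, and &lt;code&gt;None&lt;/code&gt; stands in for a manifest where the field has been removed.&lt;/p&gt;

```python
def traffic_split(canary_percent):
    """Return (previous_revision_pct, latest_revision_pct).

    None models a manifest without canaryTrafficPercent, which the
    workflow above treats as full promotion of the latest revision.
    """
    if canary_percent is None:
        return (0, 100)
    if canary_percent not in range(0, 101):
        raise ValueError("canaryTrafficPercent must be an integer from 0 to 100")
    return (100 - canary_percent, canary_percent)


# One PR per step: start the canary, widen it, then promote.
ROLLOUT_STEPS = [10, 50, None]

for step in ROLLOUT_STEPS:
    print(step, traffic_split(step))

# Rollback: pin everything to the previous revision.
print(0, traffic_split(0))
```

&lt;p&gt;Each entry in &lt;code&gt;ROLLOUT_STEPS&lt;/code&gt; corresponds to one reviewed commit, which is what keeps the rollout auditable.&lt;/p&gt;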

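&lt;p&gt;&lt;code&gt;canaryTrafficPercent&lt;/code&gt; relies on Knative revisions, so it only applies in Serverless mode. In &lt;code&gt;RawDeployment&lt;/code&gt; mode, you handle canary at the Istio level:&lt;br&gt;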
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# istio/virtualservice-fraud-detector.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.istio.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VirtualService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;fraud-detector.ml-serving.svc.cluster.local&lt;/span&gt;
  &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector-v2-4-1-predictor&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
          &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;90&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector-v2-5-0-predictor&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
          &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both the &lt;code&gt;InferenceService&lt;/code&gt; and the &lt;code&gt;VirtualService&lt;/code&gt; are in Git. The traffic split is in Git. Everything is auditable and revertible.&lt;/p&gt;
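&lt;p&gt;Because the split now lives in a hand-edited file, a small lint in CI can catch a PR that changes one weight and forgets the other. The sketch below assumes a team convention that weights sum to 100 so they read as percentages. The route list is a hand-written mirror of the manifest above rather than parsed YAML, and &lt;code&gt;check_weights&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
def check_weights(routes):
    """Return the total weight, raising if it is not 100."""
    total = sum(route["weight"] for route in routes)
    if total != 100:
        raise ValueError(f"route weights sum to {total}, expected 100")
    return total


# Mirrors the two destinations in the VirtualService above.
routes = [
    {"host": "fraud-detector-v2-4-1-predictor", "weight": 90},
    {"host": "fraud-detector-v2-5-0-predictor", "weight": 10},
]
print(check_weights(routes))  # prints 100
```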

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgt50j7pwdqb5nzjsnm57.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgt50j7pwdqb5nzjsnm57.png" alt=" " width="800" height="1459"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;I won't pretend I have clean before/after numbers from a single project because this pattern spans multiple engagements. Here's what consistently holds:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model deployment method&lt;/td&gt;
&lt;td&gt;Manual SSH or ad-hoc scripts&lt;/td&gt;
&lt;td&gt;PR-gated, Git-backed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit trail&lt;/td&gt;
&lt;td&gt;None or Slack history&lt;/td&gt;
&lt;td&gt;Full Git history&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rollback time&lt;/td&gt;
&lt;td&gt;30 minutes to hours&lt;/td&gt;
&lt;td&gt;One &lt;code&gt;git revert&lt;/code&gt;, seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Canary traffic split&lt;/td&gt;
&lt;td&gt;Not possible without Istio knowledge&lt;/td&gt;
&lt;td&gt;Config field in YAML&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to detect config drift&lt;/td&gt;
&lt;td&gt;Never (no baseline)&lt;/td&gt;
&lt;td&gt;Continuous, ArgoCD UI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secret management&lt;/td&gt;
&lt;td&gt;Often hard-coded or in &lt;code&gt;.env&lt;/code&gt; files&lt;/td&gt;
&lt;td&gt;IRSA, no credentials in Git&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The operational improvement that surprises people most: the on-call burden drops significantly when you can answer "what version is running, what changed, who approved it" in under 30 seconds by looking at Git.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. The &lt;code&gt;ignoreDifferences&lt;/code&gt; config is not optional.&lt;/strong&gt; Skip it and you'll spend a weekend wondering why ArgoCD is perpetually out of sync when nothing real has changed. KServe mutates its own resources. Tell ArgoCD which fields to ignore.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Model size determines your storage strategy.&lt;/strong&gt; Under 500MB, the default S3 init container approach is fine. Over a few GB, you need a shared model cache PVC or a pre-baked image. Planning this up front saves a painful migration later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Always set &lt;code&gt;nodeSelector&lt;/code&gt; for GPU workloads.&lt;/strong&gt; Without it, your &lt;code&gt;InferenceService&lt;/code&gt; might land on a CPU node and silently fall back to CPU inference. Set the affinity, set the tolerations, pin it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Start with &lt;code&gt;RawDeployment&lt;/code&gt; mode.&lt;/strong&gt; Knative is powerful but it adds complexity. Get the core pattern working first, then add Knative if you genuinely need scale-to-zero economics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. GitOps creates friction on purpose.&lt;/strong&gt; The PR workflow adds a step that direct &lt;code&gt;kubectl apply&lt;/code&gt; doesn't. That step is the point. If your team resents the friction, they haven't lived through the 2am incident where nobody knows what changed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The five things you actually need to get started:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;KServe installed (Helm, RawDeployment mode, cert-manager prerequisite)&lt;/li&gt;
&lt;li&gt;A models-manifests repo with &lt;code&gt;InferenceService&lt;/code&gt; YAML per model, Kustomize overlays for environments&lt;/li&gt;
&lt;li&gt;ArgoCD Application pointing at &lt;code&gt;overlays/production&lt;/code&gt;, &lt;code&gt;selfHeal: true&lt;/code&gt;, with &lt;code&gt;ignoreDifferences&lt;/code&gt; on KServe status fields&lt;/li&gt;
&lt;li&gt;IRSA or Workload Identity for S3 access&lt;/li&gt;
&lt;li&gt;Branch protection on &lt;code&gt;main&lt;/code&gt; so model version bumps require PR review&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The canary rollout and GitHub Actions webhook are enhancements. Get the core working first.&lt;/p&gt;




</description>
      <category>kubernetes</category>
      <category>mlops</category>
      <category>gitops</category>
      <category>argocd</category>
    </item>
    <item>
      <title>I Migrated a Real Production Codebase from Terraform to OpenTofu (Here's What Broke)</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Sun, 08 Mar 2026 06:25:03 +0000</pubDate>
      <link>https://dev.to/mateenali66/i-migrated-a-real-production-codebase-from-terraform-to-opentofu-heres-what-broke-4j1b</link>
      <guid>https://dev.to/mateenali66/i-migrated-a-real-production-codebase-from-terraform-to-opentofu-heres-what-broke-4j1b</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Migrating a standard AWS Terraform codebase to OpenTofu took half a day, most of which was CI pipeline updates. The S3 native locking alone made it worth it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;I've been writing Terraform since version 0.8. Watched it grow from a scrappy infrastructure tool into the de-facto standard for cloud automation. I've migrated teams from CloudFormation to Terraform, written custom providers, debugged state corruption at 2 AM. Terraform is baked into how I think about infrastructure.&lt;/p&gt;

&lt;p&gt;So when HashiCorp switched to the Business Source License in August 2023, I did what most practitioners did: I shrugged, bookmarked the OpenTofu repo, and went back to building.&lt;/p&gt;

&lt;p&gt;That bookmark sat there for two years.&lt;/p&gt;

&lt;p&gt;The BSL doesn't prevent you from using Terraform. It prevents you from building a product or service that's "substantially similar" to Terraform Cloud or Terraform Enterprise. For most teams running internal infrastructure, the risk is low. But once you're building a platform team that exposes self-service infrastructure to internal customers, or packaging IaC automation as part of a managed service, your legal team might want a conversation. And once "get legal sign-off on our IaC toolchain" is on the agenda, you've already lost an afternoon you'll never get back.&lt;/p&gt;

&lt;p&gt;For a Phono Technologies project, we were building a lightweight CI/CD orchestration layer for client infrastructure. The moment I tried to describe it, I realized I was describing exactly what the BSL restricts. The ambiguity was real enough that I wanted it gone.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Tried First (And Why It Failed)
&lt;/h2&gt;

&lt;p&gt;My first instinct was to just drop in the &lt;code&gt;tofu&lt;/code&gt; binary and run &lt;code&gt;tofu init&lt;/code&gt;. Simple enough.&lt;/p&gt;

&lt;p&gt;It almost worked. Until I checked where providers were being pulled from.&lt;/p&gt;

&lt;p&gt;OpenTofu fetches providers from &lt;code&gt;registry.opentofu.org&lt;/code&gt;, not &lt;code&gt;registry.terraform.io&lt;/code&gt;. The registries mirror each other for HashiCorp providers, but your existing &lt;code&gt;.terraform.lock.hcl&lt;/code&gt; was generated against Terraform's registry. The provider hashes don't match.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: Failed to install provider

To install this provider, OpenTofu needs to verify that the checksums in
.terraform.lock.hcl match the provider packages downloaded from the registry.
The following packages are required but the checksums don't match:
  registry.opentofu.org/hashicorp/aws v5.82.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I also ran into teammates who still had the old Terraform-generated lock files. Some ran &lt;code&gt;tofu plan&lt;/code&gt; on their local branches and got hash mismatches in the other direction. The lesson: this has to be a coordinated team migration, not a quiet swap on your own laptop.&lt;/p&gt;
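&lt;p&gt;Before the cutover it helps to know which registry each checkout's lock file was written against. The following Python sketch pulls the provider addresses out of a &lt;code&gt;.terraform.lock.hcl&lt;/code&gt; with a regular expression. It is not a full HCL parser, the helper names are hypothetical, and the sample text is a trimmed, made-up lock file with the hash entries omitted.&lt;/p&gt;

```python
import re

# Matches the opening line of each provider block in a lock file.
PROVIDER_RE = re.compile(r'^provider\s+"([^"]+)"\s*\{', re.MULTILINE)


def provider_addresses(lock_text):
    """Return every provider source address found in the lock file text."""
    return PROVIDER_RE.findall(lock_text)


def registries(lock_text):
    """Return the sorted set of registry hostnames the lock file refers to."""
    return sorted({addr.split("/", 1)[0] for addr in provider_addresses(lock_text)})


# Illustrative sample; real lock files also carry constraints and hashes.
sample_lock = '''
provider "registry.terraform.io/hashicorp/aws" {
  version = "5.82.0"
}

provider "registry.terraform.io/hashicorp/random" {
  version = "3.6.3"
}
'''
print(registries(sample_lock))  # prints ['registry.terraform.io']
```

&lt;p&gt;Running this against each teammate's branch gives a quick list of who still needs to regenerate their lock file.&lt;/p&gt;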




&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;The codebase: a mid-sized AWS platform for a SaaS client. Around 8,000 lines of Terraform across 12 modules. Standard providers: &lt;code&gt;aws&lt;/code&gt;, &lt;code&gt;kubernetes&lt;/code&gt;, &lt;code&gt;helm&lt;/code&gt;, &lt;code&gt;random&lt;/code&gt;, &lt;code&gt;tls&lt;/code&gt;. S3 backend for state, one workspace per environment. CI via GitHub Actions. No Terraform Cloud, no HCP.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fon677jiral0tagg9875q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fon677jiral0tagg9875q.png" alt=" " width="800" height="1043"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Back up everything
&lt;/h3&gt;

&lt;p&gt;Before touching anything, tag the current state in git and pull a snapshot of your state file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git tag pre-opentofu-migration

terraform state pull &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; terraform.tfstate.backup-&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're on S3, enable versioning before you start. You want a timestamped rollback point. Non-negotiable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Install tofu alongside terraform
&lt;/h3&gt;

&lt;p&gt;The two binaries coexist without conflict:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;opentofu
tofu &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="c"&gt;# OpenTofu v1.11.4&lt;/span&gt;
&lt;span class="c"&gt;# on darwin_arm64&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keep &lt;code&gt;terraform&lt;/code&gt; installed until you're confident the migration is complete.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Delete the lock file and re-init
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;rm&lt;/span&gt; .terraform.lock.hcl
tofu init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;tofu init&lt;/code&gt; regenerates the lock file with hashes for both &lt;code&gt;registry.opentofu.org&lt;/code&gt; and &lt;code&gt;registry.terraform.io&lt;/code&gt; providers, signed by OpenTofu's key infrastructure. Commit the new lock file and announce to your team to re-run &lt;code&gt;tofu init&lt;/code&gt; on their local copies.&lt;/p&gt;

&lt;p&gt;Once you commit the new lock file, treat the repo as an OpenTofu project. Don't run &lt;code&gt;terraform init&lt;/code&gt; on the same directory afterward. The two binaries will fight over hashes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Check your &lt;code&gt;terraform {}&lt;/code&gt; block
&lt;/h3&gt;

&lt;p&gt;There is nothing to rename. OpenTofu reads the same &lt;code&gt;terraform {}&lt;/code&gt; block, so your existing HCL works without modification.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This works fine in OpenTofu, no changes needed&lt;/span&gt;
&lt;span class="nx"&gt;terraform&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;required_version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&amp;gt;= 1.5.0"&lt;/span&gt;

  &lt;span class="nx"&gt;required_providers&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;aws&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"hashicorp/aws"&lt;/span&gt;
      &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"~&amp;gt; 5.0"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="s2"&gt;"s3"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;bucket&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"my-terraform-state"&lt;/span&gt;
    &lt;span class="nx"&gt;key&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production/terraform.tfstate"&lt;/span&gt;
    &lt;span class="nx"&gt;region&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;
    &lt;span class="nx"&gt;encrypt&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="nx"&gt;dynamodb_table&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform-state-locks"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Leave it as &lt;code&gt;terraform {}&lt;/code&gt;. OpenTofu does not define a &lt;code&gt;tofu {}&lt;/code&gt; block; the block name is unchanged.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Verify with &lt;code&gt;tofu plan&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;tofu plan &lt;span class="nt"&gt;-out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;migration-test.tfplan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected result: no changes. If you see changes, do not apply. Investigate first. It usually means a provider version difference or a schema update.&lt;/p&gt;

&lt;p&gt;I got zero changes across all three environments.&lt;/p&gt;
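&lt;p&gt;Reading plan output by eye does not scale past a few environments. &lt;code&gt;tofu plan&lt;/code&gt; accepts &lt;code&gt;-detailed-exitcode&lt;/code&gt;, which returns 0 for no changes, 1 for an error, and 2 when changes are present. Here is a Python sketch of a wrapper that turns that into a verdict. The function names and verdict strings are illustrative, and &lt;code&gt;run_plan&lt;/code&gt; assumes &lt;code&gt;tofu&lt;/code&gt; is on the PATH and the directory is already initialised.&lt;/p&gt;

```python
import subprocess

# Exit codes documented for plan with -detailed-exitcode.
VERDICTS = {0: "no-changes", 1: "error", 2: "changes-present"}


def classify_plan_exit(code):
    """Map a plan exit code to a short verdict string."""
    return VERDICTS.get(code, "unknown")


def run_plan(workdir):
    """Run tofu plan in workdir and return the verdict for its exit code."""
    result = subprocess.run(
        ["tofu", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
    )
    return classify_plan_exit(result.returncode)


print(classify_plan_exit(0))  # prints no-changes
```

&lt;p&gt;For a migration check, anything other than &lt;code&gt;no-changes&lt;/code&gt; should stop the pipeline before an apply.&lt;/p&gt;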

&lt;h3&gt;
  
  
  Step 6: Drop DynamoDB for S3 native locking
&lt;/h3&gt;

&lt;p&gt;This is where OpenTofu pulls ahead. OpenTofu 1.10.0 added native S3 state locking built on S3 conditional writes. No DynamoDB table required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="s2"&gt;"s3"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;bucket&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"my-state-bucket"&lt;/span&gt;
  &lt;span class="nx"&gt;key&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"prod/terraform.tfstate"&lt;/span&gt;
  &lt;span class="nx"&gt;region&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;
  &lt;span class="nx"&gt;encrypt&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;dynamodb_table&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform-locks"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="s2"&gt;"s3"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;bucket&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"my-state-bucket"&lt;/span&gt;
  &lt;span class="nx"&gt;key&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"prod/terraform.tfstate"&lt;/span&gt;
  &lt;span class="nx"&gt;region&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;
  &lt;span class="nx"&gt;encrypt&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;use_lockfile&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fewer moving parts. One less AWS service to manage. Simpler IAM permissions.&lt;/p&gt;
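&lt;p&gt;The idea behind lock-by-conditional-write is easy to model. The Python sketch below is a conceptual simulation only, not OpenTofu's implementation. &lt;code&gt;FakeBucket&lt;/code&gt; is an in-memory stand-in for a store that can refuse a write when the key already exists, and the lock key name is hypothetical.&lt;/p&gt;

```python
class FakeBucket:
    """In-memory stand-in for a bucket with create-only-if-absent writes."""

    def __init__(self):
        self.objects = {}

    def put_if_absent(self, key, body):
        # The conditional write: refuse the put when the key already exists.
        if key in self.objects:
            return False
        self.objects[key] = body
        return True

    def delete(self, key):
        self.objects.pop(key, None)


def acquire_lock(bucket, lock_key, owner):
    """Try to take the lock; True means this caller now holds it."""
    return bucket.put_if_absent(lock_key, owner)


def release_lock(bucket, lock_key):
    """Drop the lock object so the next caller can acquire it."""
    bucket.delete(lock_key)


bucket = FakeBucket()
LOCK_KEY = "prod/terraform.tfstate.lock-example"  # hypothetical key name

first = acquire_lock(bucket, LOCK_KEY, "alice")   # lock is free
second = acquire_lock(bucket, LOCK_KEY, "bob")    # alice still holds it
release_lock(bucket, LOCK_KEY)
third = acquire_lock(bucket, LOCK_KEY, "bob")     # free again
print(first, second, third)  # prints True False True
```

&lt;p&gt;Because the store itself arbitrates who wins the write, no second service is needed to hold the lock, which is why the DynamoDB table can go.&lt;/p&gt;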

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujqnfnrt73mfc7mow595.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujqnfnrt73mfc7mow595.png" alt=" " width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 7: Update your CI pipeline
&lt;/h3&gt;

&lt;p&gt;Every place your pipeline runs &lt;code&gt;terraform&lt;/code&gt;, you need &lt;code&gt;tofu&lt;/code&gt;. In GitHub Actions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hashicorp/setup-terraform@v3&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;terraform_version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.9.5"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;opentofu/setup-opentofu@v1&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;tofu_version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.11.4"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;opentofu/setup-opentofu&lt;/code&gt; action is the official GitHub Action. Clean swap.&lt;/p&gt;
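&lt;p&gt;A small guard can confirm that the runner actually has the version the workflow asked for. This Python sketch parses the first line of &lt;code&gt;tofu --version&lt;/code&gt; output, in the format shown in Step 2, and compares it to a pin. The helper names and the &lt;code&gt;PINNED&lt;/code&gt; constant are illustrative.&lt;/p&gt;

```python
import re

PINNED = "1.11.4"  # keep in sync with tofu_version in the workflow


def parse_tofu_version(version_output):
    """Extract X.Y.Z from output such as 'OpenTofu v1.11.4'."""
    match = re.search(r"OpenTofu v(\d+\.\d+\.\d+)", version_output)
    if match is None:
        raise ValueError("could not find an OpenTofu version in the output")
    return match.group(1)


def is_pinned(version_output, pinned=PINNED):
    """Return True when the installed version equals the pin exactly."""
    return parse_tofu_version(version_output) == pinned


sample_output = "OpenTofu v1.11.4\non darwin_arm64\n"
print(is_pinned(sample_output))  # prints True
```

&lt;p&gt;Failing the job on a mismatch catches cached runners or local overrides before they produce a plan with a different binary.&lt;/p&gt;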




&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;State locking dependencies&lt;/td&gt;
&lt;td&gt;S3 + DynamoDB&lt;/td&gt;
&lt;td&gt;S3 only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DynamoDB tables&lt;/td&gt;
&lt;td&gt;3 (one per environment)&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Migration time&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;4 hours (including CI updates)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plan output differences&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sensitive values in state&lt;/td&gt;
&lt;td&gt;Persisted&lt;/td&gt;
&lt;td&gt;Ephemeral (with 1.11 features)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The operational simplicity of dropping DynamoDB is hard to quantify in a table. It's one less service in IAM policies, one less resource to manage in the state backend module, one less thing that can drift or get misconfigured.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Coordinate the lock file migration as a team.&lt;/strong&gt; If half your team is still running &lt;code&gt;terraform init&lt;/code&gt;, you'll get hash conflicts. Announce the cutover date, have everyone delete and regenerate their lock files on the same day.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pin your OpenTofu version in CI.&lt;/strong&gt; The 1.11.x patch cycle had a notable regression in 1.11.0 that was fixed in 1.11.2. The team moves fast. Pin an exact version (for example &lt;code&gt;1.11.4&lt;/code&gt;) in CI and upgrade deliberately.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The &lt;code&gt;terraform {}&lt;/code&gt; block is fine.&lt;/strong&gt; Don't waste time renaming it. The binary changed; the HCL didn't.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The point of no return is &lt;code&gt;tofu apply&lt;/code&gt;.&lt;/strong&gt; After you run apply, the state metadata reflects OpenTofu's version. You can still read the state with Terraform, but you'll get warnings. Decide before you apply whether you're committed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ephemeral values are worth understanding.&lt;/strong&gt; OpenTofu 1.11.0 introduced ephemeral resources and write-only attributes. Sensitive credentials can be used without ever landing in state. If you've been papering over this with Vault workarounds, it's worth reading the docs before you finish the migration.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;ephemeral&lt;/span&gt; &lt;span class="s2"&gt;"aws_secretsmanager_secret_version"&lt;/span&gt; &lt;span class="s2"&gt;"db_password"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;secret_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_secretsmanager_secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"kubernetes_secret_v1"&lt;/span&gt; &lt;span class="s2"&gt;"db_credentials"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;metadata&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"db-credentials"&lt;/span&gt;
    &lt;span class="nx"&gt;namespace&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"app"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;data_wo&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;password&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ephemeral&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_secretsmanager_secret_version&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;db_password&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;secret_string&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;data_wo_revision&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OpenTofu Migration Guide:&lt;/strong&gt; &lt;a href="https://opentofu.org/docs/intro/migration/migration-guide/" rel="noopener noreferrer"&gt;opentofu.org/docs/intro/migration&lt;/a&gt;&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>opensource</category>
      <category>devops</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Drift Detection in Air-Gapped Workloads: What Nobody Tells You</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Sat, 21 Feb 2026 06:32:18 +0000</pubDate>
      <link>https://dev.to/mateenali66/drift-detection-in-air-gapped-workloads-what-nobody-tells-you-3eb9</link>
      <guid>https://dev.to/mateenali66/drift-detection-in-air-gapped-workloads-what-nobody-tells-you-3eb9</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Standard drift detection breaks in air-gapped environments because every major tool assumes cloud API access. The fix is decentralized reconciliation with local state management, not trying to force connected tools into disconnected networks.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Assumption That Breaks Everything
&lt;/h2&gt;

&lt;p&gt;Every popular drift detection tool makes the same assumption: your infrastructure can reach the internet.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;terraform plan&lt;/code&gt; calls AWS APIs. Argo CD pulls from remote Git repos. Spacelift runs scans from a SaaS control plane. These tools work brilliantly in connected environments. The moment you drop them into an air-gapped network, they go silent.&lt;/p&gt;

&lt;p&gt;I've spent the better part of a decade building infrastructure for organizations where isolation isn't optional: external connectivity is forbidden. Government agencies, defense contractors, healthcare systems, financial trading floors. These environments are disconnected by design, not by accident. And drift detection in these networks is a fundamentally different problem from the one most DevOps engineers encounter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Air-Gapped Workloads Drift Differently
&lt;/h2&gt;

&lt;p&gt;In a connected environment, drift happens and gets caught relatively fast. Someone clicks through the console, Terraform Cloud flags it on the next scan, you fix it. The feedback loop is tight.&lt;/p&gt;

&lt;p&gt;In air-gapped environments, drift accumulates silently.&lt;/p&gt;

&lt;p&gt;A sysadmin patches a node manually because the automated pipeline can't reach the package mirror. A developer tweaks a ConfigMap directly because the GitOps controller lost sync with the local Git server. An operator scales a deployment by hand during an incident and forgets to commit the change.&lt;/p&gt;

&lt;p&gt;These changes compound. By the time anyone runs a manual audit, the gap between declared state and actual state can be enormous.&lt;/p&gt;

&lt;p&gt;The core problem: &lt;strong&gt;connected drift detection is continuous and automated. Disconnected drift detection is episodic and manual.&lt;/strong&gt; That gap is where compliance violations, security incidents, and late-night pages live.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Doesn't Work (And Why Teams Keep Trying)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Terraform Plan Over VPN
&lt;/h3&gt;

&lt;p&gt;The most common first attempt: tunnel &lt;code&gt;terraform plan&lt;/code&gt; through a VPN into the air-gapped network.&lt;/p&gt;

&lt;p&gt;Problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency kills the feedback loop.&lt;/strong&gt; Provider API calls that take milliseconds on the internet take seconds over a restricted VPN. A plan that runs in 30 seconds now takes 15 minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partial connectivity isn't air-gapped.&lt;/strong&gt; If your "air-gapped" network has a VPN tunnel to SaaS tooling, your security team has questions. Valid ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State file synchronization becomes a bottleneck.&lt;/strong&gt; Remote state backends (S3, Consul) need connectivity. Local state files create merge conflicts when multiple operators work simultaneously.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  GitOps Controllers Pointed at External Repos
&lt;/h3&gt;

&lt;p&gt;Flux CD and Argo CD are excellent GitOps tools. But pointing them at a GitHub repo from an air-gapped cluster means... you don't have an air-gapped cluster anymore.&lt;/p&gt;

&lt;p&gt;Running a local Git server (Gitea, GitLab) inside the perimeter fixes the connectivity problem but creates a new one: keeping the local repo in sync with the source of truth requires a deliberate, auditable transfer process. USB drives, data diodes, or scheduled one-way syncs all introduce delay. That delay is where drift happens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Periodic Manual Audits
&lt;/h3&gt;

&lt;p&gt;The fallback everyone hates: someone SSHes in, runs a bunch of comparison scripts, and writes a report.&lt;/p&gt;

&lt;p&gt;This catches drift after the fact. In regulated environments, "we check quarterly" doesn't satisfy auditors who want continuous compliance evidence. And manual audits miss things. Every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works
&lt;/h2&gt;

&lt;p&gt;After iterating through the failures above across multiple engagements, three patterns consistently work in production air-gapped environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 1: Decentralized Policy Agents
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5e6so0bjwpy9ojikbd5a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5e6so0bjwpy9ojikbd5a.png" alt=" " width="800" height="1225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instead of a central control plane that reaches into clusters, deploy autonomous policy agents inside each air-gapped cluster.&lt;/p&gt;

&lt;p&gt;Each agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stores the desired state locally (pulled in during the last approved sync window)&lt;/li&gt;
&lt;li&gt;Runs a continuous reconciliation loop comparing desired vs. actual state&lt;/li&gt;
&lt;li&gt;Logs every deviation to a local audit store&lt;/li&gt;
&lt;li&gt;Remediates automatically when configured to do so, or raises alerts for manual review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the pattern that Spectro Cloud Palette uses, and it's the right mental model. The cluster enforces its own policy. It doesn't need to phone home.&lt;br&gt;
&lt;/p&gt;
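&lt;p&gt;A minimal sketch of one pass of that reconciliation loop, assuming desired and actual state are both available as plain dictionaries keyed by resource name. The function names, the &lt;code&gt;UNMANAGED&lt;/code&gt; status, and the audit record shape are hypothetical; a real agent would read actual state from the cluster API and write to a durable local store.&lt;/p&gt;

```python
# Hypothetical sketch: one pass of a local reconciliation loop.
# Data shapes and names are illustrative, not any real agent's API.
import json
import time


def diff_state(desired, actual):
    """Return a list of deviations between desired and actual resource maps."""
    deviations = []
    for name, want in desired.items():
        have = actual.get(name)
        if have is None:
            deviations.append({"resource": name, "status": "MISSING"})
        elif have != want:
            deviations.append({"resource": name, "status": "MODIFIED",
                               "desired": want, "actual": have})
    # Resources that exist but were never declared are drift too.
    for name in sorted(actual.keys() - desired.keys()):
        deviations.append({"resource": name, "status": "UNMANAGED"})
    return deviations


def reconcile_once(desired, read_actual, audit_log, remediate=None):
    """Compare states, log every deviation locally, optionally remediate."""
    deviations = diff_state(desired, read_actual())
    for deviation in deviations:
        record = dict(deviation, detected_at=time.time())
        audit_log.append(json.dumps(record, sort_keys=True))
        if remediate is not None:
            remediate(deviation)
    return deviations
```

&lt;p&gt;Leaving &lt;code&gt;remediate&lt;/code&gt; unset gives the alert-only mode: deviations are still logged, but nothing is changed until someone reviews them.&lt;/p&gt;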

&lt;div class="highlight js-code-highlight"&gt;
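&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example: OPA Gatekeeper constraint running locally&lt;/span&gt;
&lt;span class="c1"&gt;# Requires the K8sRequiredLabels ConstraintTemplate to be installed in the cluster first&lt;/span&gt;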
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;constraints.gatekeeper.sh/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;K8sRequiredLabels&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;require-team-label&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Namespace"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;team"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost-center"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;environment"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gatekeeper runs entirely inside the cluster. No external connectivity needed. Violations are logged locally and can be exported during sync windows.&lt;/p&gt;
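&lt;p&gt;For the export step, here is a rough sketch that flattens audit results into records you can carry out during a sync window. It assumes you already have a parsed Kubernetes list of constraints (for example, JSON output from &lt;code&gt;kubectl&lt;/code&gt;) with audit results under &lt;code&gt;status.violations&lt;/code&gt;. The function names and the output record shape are my own, not part of Gatekeeper.&lt;/p&gt;

```python
# Sketch: flatten Gatekeeper audit results into export-ready records.
# Assumes a parsed Kubernetes List whose items are constraints carrying
# audit results under status.violations; the record shape is hypothetical.
import json


def collect_violations(constraint_list):
    """Return one flat record per violation found in the constraint list."""
    records = []
    for constraint in constraint_list.get("items", []):
        meta = constraint.get("metadata", {})
        status = constraint.get("status", {})
        for violation in status.get("violations", []):
            records.append({
                "constraint": meta.get("name"),
                "constraint_kind": constraint.get("kind"),
                "resource_kind": violation.get("kind"),
                "resource_name": violation.get("name"),
                "namespace": violation.get("namespace"),
                "message": violation.get("message"),
            })
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines, one violation per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)
```

&lt;p&gt;Gatekeeper limits how many violations it stores on each constraint, so when totals are high treat the export as a sample, not a complete inventory.&lt;/p&gt;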

&lt;h3&gt;
  
  
  Pattern 2: Local State Snapshots with Diff-on-Sync
&lt;/h3&gt;

&lt;p&gt;For Terraform-managed infrastructure, maintain state snapshots inside the air-gapped environment.&lt;/p&gt;

&lt;p&gt;The workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Declare state&lt;/strong&gt; in your IaC repo outside the air gap&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transfer the repo&lt;/strong&gt; into the environment through your approved media (data diode, approved USB, one-way sync)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run &lt;code&gt;terraform plan&lt;/code&gt;&lt;/strong&gt; inside the air gap against local provider endpoints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot the actual state&lt;/strong&gt; after each apply&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diff the snapshot&lt;/strong&gt; against the expected state on a cron schedule&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export the diff report&lt;/strong&gt; during the next sync window&lt;/li&gt;
&lt;/ol&gt;

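&lt;p&gt;The key insight: the state file and the provider APIs both live inside the perimeter. &lt;code&gt;terraform plan&lt;/code&gt; works fine when everything it needs to reach is local. That includes the provider plugins, which &lt;code&gt;terraform init&lt;/code&gt; would otherwise try to download from the public registry; a filesystem mirror populated with &lt;code&gt;terraform providers mirror&lt;/code&gt; and transferred along with the repo covers that.&lt;/p&gt;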

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# drift_check.sh - runs inside the air-gapped environment&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d_%H%M%S&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;DRIFT_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/var/log/drift-reports"&lt;/span&gt;

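&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DRIFT_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# -detailed-exitcode returns 2 when changes are pending. With errexit and&lt;/span&gt;
&lt;span class="c"&gt;# pipefail that would abort the script here, so relax errexit around the pipeline.&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; +e
terraform plan &lt;span class="nt"&gt;-detailed-exitcode&lt;/span&gt; &lt;span class="nt"&gt;-out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DRIFT_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/plan_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.tfplan"&lt;/span&gt; 2&amp;gt;&amp;amp;1 | &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nb"&gt;tee&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DRIFT_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/drift_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.log"&lt;/span&gt;
&lt;span class="nv"&gt;EXIT_CODE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PIPESTATUS&lt;/span&gt;&lt;span class="p"&gt;[0]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$EXIT_CODE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-eq&lt;/span&gt; 2 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"DRIFT_DETECTED"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DRIFT_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/status_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="c"&gt;# Alert local monitoring (v2 API; the v1 alerts API is gone from recent Alertmanager releases)&lt;/span&gt;
  curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://alertmanager.local:9093/api/v2/alerts &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'[{"labels":{"alertname":"InfrastructureDrift","severity":"warning"}}]'&lt;/span&gt;
&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$EXIT_CODE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-ne&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="c"&gt;# Exit code 1 means the plan itself failed; record it instead of implying "no drift"&lt;/span&gt;
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"PLAN_FAILED"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DRIFT_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/status_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nb"&gt;exit&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$EXIT_CODE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;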
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
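&lt;p&gt;The saved plan file is binary, so the diff report is easier to build from its JSON rendering. A small sketch, assuming the plan was converted with &lt;code&gt;terraform show -json&lt;/code&gt; and passed in as a string. The &lt;code&gt;summarize_plan&lt;/code&gt; name and the returned mapping are illustrative; only &lt;code&gt;resource_changes&lt;/code&gt;, &lt;code&gt;address&lt;/code&gt;, and &lt;code&gt;change.actions&lt;/code&gt; come from Terraform's JSON plan format.&lt;/p&gt;

```python
# Sketch: summarize a plan that was rendered with `terraform show -json`.
# Only resource_changes[].address and .change.actions are read from the plan;
# the function name and return shape are hypothetical.
import json


def summarize_plan(plan_json):
    """Map each drifting resource address to its planned actions."""
    plan = json.loads(plan_json)
    changes = {}
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # Skip entries that would not modify anything.
        if actions and actions != ["no-op"] and actions != ["read"]:
            changes[rc["address"]] = "+".join(actions)
    return changes
```

&lt;p&gt;An empty result means the plan found nothing to change. Anything else is the list of resources to put in the report for the next sync window.&lt;/p&gt;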



&lt;h3&gt;
  
  
  Pattern 3: Immutable Baselines with Checksum Verification
&lt;/h3&gt;

&lt;p&gt;For the most sensitive environments (defense, critical infrastructure), treat infrastructure state like a software artifact.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Build a golden baseline&lt;/strong&gt; of every resource's expected configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate checksums&lt;/strong&gt; (SHA-256) for each configuration artifact&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy a lightweight agent&lt;/strong&gt; that periodically recalculates checksums on live resources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Any mismatch triggers an immediate alert&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is coarser than Terraform drift detection: a hash mismatch tells you that a resource changed, not which attribute changed. But it works without any provider APIs. It's closer to file integrity monitoring (think AIDE or OSSEC) applied to infrastructure configuration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# baseline_check.py - infrastructure checksum verification
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_resource_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resource_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Capture current state of a Kubernetes resource.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kubectl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resource_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resource_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;returncode&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Strip server-managed metadata that changes without any configuration change
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resourceVersion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;creationTimestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;managedFields&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# status is written by controllers, not declared config; hashing it reports false drift
&lt;/span&gt;    &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;checksum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate deterministic checksum of resource state.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;canonical&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;canonical&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;verify_baseline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseline_file&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Compare live state against stored baseline checksums.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseline_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;baseline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;drift_detected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resources&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_resource_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;drift_detected&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resource&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MISSING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="n"&gt;current_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;checksum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_hash&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_hash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;drift_detected&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resource&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MODIFIED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_hash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;actual&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;current_hash&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;drift_detected&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
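&lt;p&gt;The verifier needs a baseline file to read. Here is one way to generate it, sketched under the assumption that the file is a JSON object with a &lt;code&gt;resources&lt;/code&gt; list of &lt;code&gt;type&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, and &lt;code&gt;expected_hash&lt;/code&gt; entries, which is the shape &lt;code&gt;verify_baseline&lt;/code&gt; iterates over. &lt;code&gt;build_baseline&lt;/code&gt; and &lt;code&gt;write_baseline&lt;/code&gt; are hypothetical helpers, and the checksum helper is repeated so the block runs on its own.&lt;/p&gt;

```python
# Sketch: build a baseline file in the shape verify_baseline() reads.
# Assumes each state dict was captured and normalized exactly the way the
# verifier normalizes live state; otherwise every hash will mismatch.
import hashlib
import json


def checksum(state):
    """Deterministic SHA-256 over a key-sorted JSON rendering of the state."""
    canonical = json.dumps(state, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


def build_baseline(approved_states):
    """approved_states: iterable of (resource_type, resource_name, state_dict)."""
    resources = [
        {"type": rtype, "name": name, "expected_hash": checksum(state)}
        for rtype, name, state in approved_states
    ]
    return {"resources": resources}


def write_baseline(approved_states, path):
    """Write the baseline as JSON so it can be reviewed and versioned."""
    with open(path, "w") as f:
        json.dump(build_baseline(approved_states), f, indent=2, sort_keys=True)
```

&lt;p&gt;Generate the baseline from the approved configuration during a sync window, not from whatever happens to be running, or you will bless existing drift as the golden state.&lt;/p&gt;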



&lt;h2&gt;
  
  
  Choosing the Right Pattern
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Detection Latency&lt;/th&gt;
&lt;th&gt;Complexity&lt;/th&gt;
&lt;th&gt;Audit Trail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Decentralized Agents&lt;/td&gt;
&lt;td&gt;Kubernetes clusters&lt;/td&gt;
&lt;td&gt;Real-time&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Strong&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local State Snapshots&lt;/td&gt;
&lt;td&gt;Terraform/IaC resources&lt;/td&gt;
&lt;td&gt;Minutes (cron)&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Checksum Baselines&lt;/td&gt;
&lt;td&gt;High-security environments&lt;/td&gt;
&lt;td&gt;Minutes (cron)&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Strong&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In practice, most air-gapped environments use a combination. Gatekeeper handles Kubernetes policy enforcement in real time. Terraform drift checks run on a cron inside the perimeter. Checksum baselines provide an additional layer for the security team.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Compliance Angle
&lt;/h2&gt;

&lt;p&gt;Auditors care about three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Can you prove your infrastructure matches the declared state?&lt;/strong&gt; Drift reports with timestamps answer this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How quickly do you detect deviations?&lt;/strong&gt; "Within minutes" beats "at the next quarterly audit."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What happens when drift is detected?&lt;/strong&gt; Automated remediation or a documented manual review process answers this.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Air-gapped environments often have stricter compliance requirements than connected ones. The irony is that their tooling for meeting those requirements is worse. Building local drift detection infrastructure closes that gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons From the Field
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Treat sync windows as deployment events.&lt;/strong&gt; When new policy or desired state enters the air-gapped environment, that transfer should go through the same review process as a production deployment. Because it is one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Log everything locally, export periodically.&lt;/strong&gt; Build a local ELK or Loki stack inside the perimeter. Drift events, remediation actions, audit logs. Export summaries during sync windows for central visibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Test your drift detection in staging first.&lt;/strong&gt; Introduce intentional drift in a staging cluster and verify your agents catch it. I've seen teams deploy Gatekeeper and assume it works, only to discover six months later that their constraints had a typo that prevented enforcement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Don't fight the air gap.&lt;/strong&gt; The biggest mistake is trying to poke holes in the network boundary to make connected tools work. Every hole is an attack surface. Build for disconnection. It's simpler in the long run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Version your baselines.&lt;/strong&gt; When the approved state changes (through a sync window), update the baseline checksums and keep the old ones. This gives you a historical record of what the environment should have looked like at any point in time.&lt;/p&gt;
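&lt;p&gt;A small sketch of what versioned baselines can look like in code, assuming each approved baseline is recorded with the timestamp of the sync window that introduced it. The &lt;code&gt;BaselineHistory&lt;/code&gt; class is hypothetical and keeps everything in memory; in practice the versions would live in your local Git server or an append-only store.&lt;/p&gt;

```python
# Sketch: keep every approved baseline and look up the one in effect at a time.
# BaselineHistory is a hypothetical in-memory helper, not a real library class.
import bisect


class BaselineHistory:
    def __init__(self):
        self._times = []      # approval timestamps, kept sorted
        self._baselines = []  # baseline dicts, parallel to _times

    def approve(self, approved_at, baseline):
        """Record a new baseline; older versions are retained, never overwritten."""
        index = bisect.bisect_right(self._times, approved_at)
        self._times.insert(index, approved_at)
        self._baselines.insert(index, baseline)

    def in_effect_at(self, when):
        """Return the latest baseline approved at or before `when`, or None."""
        index = bisect.bisect_right(self._times, when)
        if index == 0:
            return None
        return self._baselines[index - 1]
```

&lt;p&gt;When an auditor asks what the environment should have looked like on a given date, &lt;code&gt;in_effect_at&lt;/code&gt; answers with the baseline that was approved at that point, not the current one.&lt;/p&gt;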

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>terraform</category>
      <category>security</category>
    </item>
    <item>
      <title>OpenClaw for SRE: Self-Hosted AI Agents That Actually Respond to Incidents</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Sat, 21 Feb 2026 06:27:44 +0000</pubDate>
      <link>https://dev.to/mateenali66/openclaw-for-sre-self-hosted-ai-agents-that-actually-respond-to-incidents-5279</link>
      <guid>https://dev.to/mateenali66/openclaw-for-sre-self-hosted-ai-agents-that-actually-respond-to-incidents-5279</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; OpenClaw is a self-hosted AI agent framework that connects to Slack, Teams, and other channels. For SRE teams, it's a way to build incident response automation that runs entirely on your infrastructure, with custom skills for runbook execution, alert triage, and operational context.&lt;/p&gt;




&lt;h2&gt;
  
  
  The SRE Automation Gap
&lt;/h2&gt;

&lt;p&gt;Every SRE team I've worked with has the same problem: too many alerts, not enough context, and runbooks that exist but don't get followed at 3 AM.&lt;/p&gt;

&lt;p&gt;The typical incident response flow looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;PagerDuty fires an alert&lt;/li&gt;
&lt;li&gt;On-call engineer wakes up, opens laptop&lt;/li&gt;
&lt;li&gt;Checks Slack for context (is anyone else awake?)&lt;/li&gt;
&lt;li&gt;Opens Grafana, tries to find the relevant dashboard&lt;/li&gt;
&lt;li&gt;Searches Confluence for the runbook&lt;/li&gt;
&lt;li&gt;Realizes the runbook is outdated&lt;/li&gt;
&lt;li&gt;Starts troubleshooting from scratch&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Steps 2 through 6 consume 15 to 30 minutes before any real diagnosis begins. For a P1 incident at scale, that's the difference between a blip and an outage that hits the status page.&lt;/p&gt;

&lt;p&gt;SaaS tools like PagerDuty's AIOps and Rootly have started addressing this with AI-powered incident assistants. They work well, but they require sending your operational data to third-party services. For organizations with strict data residency requirements, that's a non-starter.&lt;/p&gt;

&lt;p&gt;OpenClaw fills that gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  What OpenClaw Actually Is
&lt;/h2&gt;

&lt;p&gt;OpenClaw is an open-source, self-hosted framework for running AI agents across messaging platforms. It launched in late 2025 as a personal AI assistant project and has rapidly grown into something more interesting: a platform for building operational automation.&lt;/p&gt;

&lt;p&gt;The core architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-channel gateway&lt;/strong&gt;: Connects to Slack, Microsoft Teams, Discord, WhatsApp, and Telegram. Messages from any channel are normalized into a unified format.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM provider abstraction&lt;/strong&gt;: Works with multiple model providers. You bring your own API keys. Switch providers without changing your skills or workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent memory&lt;/strong&gt;: Maintains conversational context across interactions. The agent remembers what happened in the last incident, what commands were run, what the outcome was.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skills framework&lt;/strong&gt;: A plugin system that lets you extend the agent with custom capabilities. This is where the SRE value lives.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything runs on your infrastructure. Docker Compose for simple setups, Kubernetes for production. Your data stays on your servers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why SRE Teams Should Care
&lt;/h2&gt;

&lt;p&gt;The skills framework is what makes OpenClaw interesting for operations work. A "skill" in OpenClaw is essentially a structured capability with defined inputs, outputs, and permissions.&lt;/p&gt;

&lt;p&gt;For SRE, that means you can build skills like:&lt;/p&gt;

&lt;h3&gt;
  
  
  Incident Triage
&lt;/h3&gt;

&lt;p&gt;An agent that automatically pulls context when an alert fires:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SKILL.md: incident-triage

Inputs: alert_name, service, severity
Actions:
  1. Query Prometheus for related metrics (last 30 min)
  2. Check recent deployments from deploy tracker
  3. Pull relevant runbook from internal wiki
  4. Summarize findings in incident channel

Permissions: read-only access to Prometheus API, deploy API, wiki API
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When PagerDuty fires an alert and posts to Slack, the OpenClaw agent picks it up, runs the triage skill, and drops a summary into the incident channel before the on-call engineer has finished logging in.&lt;/p&gt;
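&lt;p&gt;To make the first action of the triage skill concrete, here is a rough sketch of how a helper script might build the "last 30 minutes" query. It targets Prometheus's standard &lt;code&gt;/api/v1/query_range&lt;/code&gt; HTTP endpoint. The function name, the base URL, and the example PromQL are assumptions, and this is not OpenClaw's own API.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def build_range_query_url(base_url: str, promql: str, now: datetime,
                          window_minutes: int = 30, step: str = "60s") -> str:
    """Build a query_range URL covering the last `window_minutes`."""
    start = now - timedelta(minutes=window_minutes)
    params = {
        "query": promql,
        "start": f"{start.timestamp():.0f}",  # Unix seconds
        "end": f"{now.timestamp():.0f}",
        "step": step,
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical Prometheus address and metric name, for illustration only.
    url = build_range_query_url(
        "http://prometheus:9090",
        'rate(http_requests_total{service="checkout"}[5m])',
        datetime.now(timezone.utc),
    )
    print(url)
```

&lt;p&gt;Keeping the URL construction separate from the HTTP call makes the skill easy to unit test, and the read-only permission stays obvious: the script only ever issues GET requests.&lt;/p&gt;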

&lt;h3&gt;
  
  
  Runbook Execution
&lt;/h3&gt;

&lt;p&gt;Instead of linking to a Confluence page that may or may not be current, encode runbooks as executable skills:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SKILL.md: restart-service

Inputs: service_name, environment
Actions:
  1. Verify service exists in target environment
  2. Check current health status
  3. Execute rolling restart via Kubernetes API
  4. Monitor health checks for 5 minutes
  5. Report success/failure to incident channel

Permissions: kubernetes API (limited to restart operations)
Guardrails: requires confirmation for production, auto-approve for staging
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The on-call engineer says "restart the payment service in staging" in Slack, and the agent executes the runbook step by step, reporting progress as it goes. No SSH-ing into bastion hosts. No copy-pasting commands from a wiki.&lt;/p&gt;
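&lt;p&gt;A sketch of the core of such a skill, under stated assumptions. A common way to trigger a rolling restart is to change an annotation on the Deployment's pod template, which is what &lt;code&gt;kubectl rollout restart&lt;/code&gt; does. The function names and the return shape below are hypothetical, and the code only builds the patch; it does not talk to a cluster.&lt;/p&gt;

```python
from datetime import datetime, timezone


def rolling_restart_patch(now: datetime) -> dict:
    """Patch body that changes the pod template, which triggers a rollout."""
    return {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        "kubectl.kubernetes.io/restartedAt": now.isoformat()
                    }
                }
            }
        }
    }


def requires_confirmation(environment: str) -> bool:
    """Guardrail from the skill definition: production needs a human yes."""
    return environment == "production"


def plan_restart(service_name: str, environment: str, confirmed: bool,
                 now: datetime) -> dict:
    """Return a pending-confirmation notice or the patch to apply."""
    if requires_confirmation(environment) and not confirmed:
        return {"status": "awaiting_confirmation", "service": service_name}
    return {
        "status": "ready",
        "service": service_name,
        "patch": rolling_restart_patch(now),
    }
```

&lt;p&gt;With the official Kubernetes Python client, the resulting patch could be applied through &lt;code&gt;AppsV1Api.patch_namespaced_deployment&lt;/code&gt;, using a service account whose RBAC role allows only &lt;code&gt;get&lt;/code&gt; and &lt;code&gt;patch&lt;/code&gt; on Deployments in the target namespace.&lt;/p&gt;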

&lt;h3&gt;
  
  
  Alert Correlation
&lt;/h3&gt;

&lt;p&gt;Connect the agent to your monitoring stack and let it correlate across signals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SKILL.md: correlate-alerts

Inputs: primary_alert
Actions:
  1. Query AlertManager for alerts fired within +/- 5 minutes
  2. Query deployment tracker for recent changes
  3. Check dependent service health
  4. Identify common root cause patterns
  5. Suggest investigation path

Permissions: read-only AlertManager API, deploy tracker, service catalog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of an engineer manually checking five dashboards to figure out why the checkout service is slow, the agent correlates: "Two alerts fired in the last 10 minutes: high latency on checkout and connection pool exhaustion on payments DB. The payments service was also deployed 12 minutes ago. Likely cause: the payments deploy."&lt;/p&gt;
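&lt;p&gt;The time-window part of that correlation is simple enough to sketch. The dictionary shapes (&lt;code&gt;name&lt;/code&gt;, &lt;code&gt;fired_at&lt;/code&gt;, &lt;code&gt;service&lt;/code&gt;, &lt;code&gt;deployed_at&lt;/code&gt;) and the 30-minute deploy lookback are assumptions for illustration. A real skill would fill them from the AlertManager and deploy tracker responses.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone


def alerts_in_window(primary: dict, alerts: list, minutes: int = 5) -> list:
    """Return other alerts that fired within +/- `minutes` of the primary."""
    window = timedelta(minutes=minutes)
    return [
        a for a in alerts
        if a["name"] != primary["name"]
        and window >= abs(a["fired_at"] - primary["fired_at"])
    ]


def recent_deploys(primary: dict, deploys: list,
                   lookback_minutes: int = 30) -> list:
    """Return deploys that landed shortly before the primary alert, newest first."""
    lookback = timedelta(minutes=lookback_minutes)
    hits = [
        d for d in deploys
        if lookback >= primary["fired_at"] - d["deployed_at"] >= timedelta(0)
    ]
    return sorted(hits, key=lambda d: d["deployed_at"], reverse=True)


if __name__ == "__main__":
    t0 = datetime(2026, 2, 21, 6, 0, tzinfo=timezone.utc)
    primary = {"name": "checkout-latency-high", "fired_at": t0}
    deploys = [{"service": "payments-service",
                "deployed_at": t0 - timedelta(minutes=12)}]
    print(recent_deploys(primary, deploys))
```

&lt;p&gt;The filtering is deterministic and cheap. The LLM is only needed for the last two actions, turning the filtered alerts and deploys into a likely cause and a suggested investigation path.&lt;/p&gt;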

&lt;h2&gt;
  
  
  Setting It Up for SRE
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8911qxmaysp2hcwsyv6i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8911qxmaysp2hcwsyv6i.png" alt=" " width="800" height="695"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Deploy the Agent
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml (simplified)&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.8"&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;openclaw&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openclaw/openclaw:latest&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./config:/home/openclaw/.openclaw&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./skills:/home/openclaw/skills&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3000:3000"&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Configure Messaging Channels
&lt;/h3&gt;

&lt;p&gt;Point it at your Slack workspace. The agent appears as a bot user in your incident channels. Teams that use Microsoft Teams or Discord can connect those instead: same agent, different channel.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Build SRE Skills
&lt;/h3&gt;

&lt;p&gt;Each skill is a directory with a &lt;code&gt;SKILL.md&lt;/code&gt; that defines its behavior and a set of supporting scripts or API integrations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;skills/
├── incident-triage/
│   ├── SKILL.md
│   ├── prometheus_query.py
│   └── deploy_check.py
├── restart-service/
│   ├── SKILL.md
│   └── k8s_restart.py
├── correlate-alerts/
│   ├── SKILL.md
│   └── alertmanager_client.py
└── status-page-update/
    ├── SKILL.md
    └── statuspage_api.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Connect to Your Monitoring Stack
&lt;/h3&gt;

&lt;p&gt;The agent needs scoped access to your observability and communication tools, read-only wherever possible:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Integration&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Access Level&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prometheus/VictoriaMetrics&lt;/td&gt;
&lt;td&gt;Metrics queries&lt;/td&gt;
&lt;td&gt;Read-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AlertManager&lt;/td&gt;
&lt;td&gt;Alert correlation&lt;/td&gt;
&lt;td&gt;Read-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kubernetes API&lt;/td&gt;
&lt;td&gt;Service health, restarts&lt;/td&gt;
&lt;td&gt;Scoped RBAC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deploy tracker&lt;/td&gt;
&lt;td&gt;Recent changes&lt;/td&gt;
&lt;td&gt;Read-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internal wiki&lt;/td&gt;
&lt;td&gt;Runbooks&lt;/td&gt;
&lt;td&gt;Read-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;StatusPage&lt;/td&gt;
&lt;td&gt;Incident communication&lt;/td&gt;
&lt;td&gt;Write&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Principle of least privilege applies. The agent should have the minimum permissions needed for each skill.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Looks Like in Practice
&lt;/h2&gt;

&lt;p&gt;Here's a realistic incident timeline with OpenClaw:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;00:00&lt;/strong&gt; - AlertManager fires: "Checkout latency &amp;gt; 2s for 5 minutes"&lt;br&gt;
&lt;strong&gt;00:01&lt;/strong&gt; - PagerDuty pages on-call, posts to #incident-checkout in Slack&lt;br&gt;
&lt;strong&gt;00:01&lt;/strong&gt; - OpenClaw agent detects the alert, runs incident-triage skill&lt;br&gt;
&lt;strong&gt;00:02&lt;/strong&gt; - Agent posts triage summary:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Incident Triage: checkout-latency-high&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Related Alerts (last 10 min):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;payments-db-connection-pool-exhaustion (fired 00:00)&lt;/li&gt;
&lt;li&gt;payments-service-error-rate-high (fired 00:01)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Recent Deployments:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;payments-service v2.14.3 deployed 12 min ago by &lt;a class="mentioned-user" href="https://dev.to/sarah"&gt;@sarah&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Relevant Runbook:&lt;/strong&gt; Payments DB Connection Pool&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Suggested Action:&lt;/strong&gt; The payments deploy correlates with connection pool exhaustion. Consider rolling back payments-service to v2.14.2.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;00:03&lt;/strong&gt; - On-call engineer logs in, sees the full context already assembled&lt;br&gt;
&lt;strong&gt;00:04&lt;/strong&gt; - Engineer: "rollback payments-service to v2.14.2 in production"&lt;br&gt;
&lt;strong&gt;00:04&lt;/strong&gt; - Agent: "Rolling back payments-service to v2.14.2 in production. This will trigger a rolling update. Confirm? (yes/no)"&lt;br&gt;
&lt;strong&gt;00:04&lt;/strong&gt; - Engineer: "yes"&lt;br&gt;
&lt;strong&gt;00:05&lt;/strong&gt; - Agent executes rollback, monitors health checks&lt;br&gt;
&lt;strong&gt;00:08&lt;/strong&gt; - Agent: "Rollback complete. Checkout latency back to normal (avg 180ms). Payments DB connection pool utilization dropped from 98% to 45%."&lt;/p&gt;

&lt;p&gt;Total time from alert to resolution: 8 minutes. Without the agent, I'd expect that same incident to take 25 to 40 minutes, given the 15 to 30 minutes usually lost before diagnosis even starts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Guardrails Matter
&lt;/h2&gt;

&lt;p&gt;Letting an AI agent interact with production infrastructure requires guardrails. OpenClaw's skill framework supports this through permission scoping and confirmation gates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production safeguards:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Skills that modify production require explicit confirmation&lt;/li&gt;
&lt;li&gt;Read-only skills execute automatically (triage, correlation)&lt;/li&gt;
&lt;li&gt;Write operations go through a confirmation flow in the messaging channel&lt;/li&gt;
&lt;li&gt;All actions are logged with who triggered them and what the agent did&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scope limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each skill declares its required permissions&lt;/li&gt;
&lt;li&gt;Kubernetes RBAC limits what the agent can actually do&lt;/li&gt;
&lt;li&gt;API keys are scoped to specific operations&lt;/li&gt;
&lt;li&gt;No "do anything" root access&lt;/li&gt;
&lt;/ul&gt;
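&lt;p&gt;The two lists above can be expressed as a small decision function. This is an illustrative sketch, not OpenClaw's actual API: the &lt;code&gt;Skill&lt;/code&gt; and &lt;code&gt;Gate&lt;/code&gt; classes, the permission strings, and the decision labels are all assumptions.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    writes: bool                       # True if the skill changes state
    permissions: frozenset = frozenset()


@dataclass
class Gate:
    granted: frozenset                 # permissions the agent actually holds
    audit_log: list = field(default_factory=list)

    def decide(self, skill: Skill, user: str, environment: str,
               confirmed: bool = False) -> str:
        """Return 'denied', 'needs_confirmation', or 'run', and log it."""
        if not skill.permissions.issubset(self.granted):
            decision = "denied"
        elif skill.writes and environment == "production" and not confirmed:
            decision = "needs_confirmation"
        else:
            decision = "run"
        # Every decision is recorded, including denials.
        self.audit_log.append(
            {"user": user, "skill": skill.name,
             "environment": environment, "decision": decision}
        )
        return decision
```

&lt;p&gt;A gate like this is only the first layer. Kubernetes RBAC and scoped API keys should still enforce the same limits on the infrastructure side, so a bug in the agent cannot widen its own access.&lt;/p&gt;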

&lt;p&gt;This isn't a replacement for your incident commander or your on-call engineers. It's a tool that handles the first 5 minutes of context gathering so humans can focus on the hard parts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where It Falls Short
&lt;/h2&gt;

&lt;p&gt;OpenClaw is still young. A few things to be aware of:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skill development is manual.&lt;/strong&gt; There's no marketplace or library of pre-built SRE skills. You're building integrations from scratch. If you've built Slack bots or PagerDuty integrations before, the effort is similar.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM costs add up.&lt;/strong&gt; Every incident interaction consumes API tokens. For high-alert-volume environments, the cost of LLM calls during incidents needs to be factored into the budget.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt engineering is real work.&lt;/strong&gt; The quality of the agent's triage and correlation depends heavily on how well the skills are designed. Poorly defined skills produce noisy, unhelpful outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not a replacement for observability.&lt;/strong&gt; The agent is only as good as the data it can access. If your monitoring has gaps, the agent inherits those gaps.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use It
&lt;/h2&gt;

&lt;p&gt;OpenClaw for SRE makes sense when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your organization has data residency or security requirements that rule out SaaS incident tools&lt;/li&gt;
&lt;li&gt;You already have a solid observability stack (Prometheus, Grafana, AlertManager) and want to add an intelligence layer on top&lt;/li&gt;
&lt;li&gt;Your team has the engineering capacity to build and maintain custom skills&lt;/li&gt;
&lt;li&gt;Incident response time is a critical metric you're trying to improve&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It doesn't make sense when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're a small team that can handle alerts manually&lt;/li&gt;
&lt;li&gt;You don't have a mature observability foundation yet (fix that first)&lt;/li&gt;
&lt;li&gt;You want a turnkey solution with no custom development&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>opensource</category>
      <category>monitoring</category>
    </item>
  </channel>
</rss>
